All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Introduce a special token to handle whitespaces as they are.
- Set default value of num_beams to 1.
- Refactor seq2seq module to make it slightly faster.
- Remove normalization of whitespaces to "␣".
- Support Python 3.12.
- Preserve document IDs when they are given as input.
- Support `jumanpp` and `knp` input formats. This functionality allows you to partly use tokenization results of `jumanpp` as input.
kwja --tasks word --text "$(echo "外国人参政権" | jumanpp)" --input-format jumanpp
kwja --tasks word --filename <(echo "外国人参政権" | jumanpp) --input-format jumanpp
kwja --tasks word --text "$(echo "外国人参政権" | jumanpp | knp -tab)" --input-format knp
kwja --tasks word --filename <(echo "外国人参政権" | jumanpp | knp -tab) --input-format knp
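The same pipeline can be driven from Python. The following is a minimal sketch, not part of kwja itself: the helper names `build_kwja_args` and `analyze_with_jumanpp` are hypothetical, the list of formats contains only the two values named in this entry, and both `jumanpp` and `kwja` are assumed to be on `PATH`.

```python
from __future__ import annotations

import subprocess

# Input formats named in this changelog entry; kwja may accept other values.
SUPPORTED_FORMATS = ("jumanpp", "knp")


def build_kwja_args(analysis: str, input_format: str) -> list[str]:
    """Build the argument list for `kwja --tasks word` (hypothetical helper)."""
    if input_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported input format: {input_format!r}")
    return [
        "kwja", "--tasks", "word",
        "--text", analysis,
        "--input-format", input_format,
    ]


def analyze_with_jumanpp(text: str) -> str:
    """Tokenize with Juman++, then pass the result to kwja (hypothetical helper)."""
    tokenized = subprocess.run(
        ["jumanpp"], input=text + "\n",
        capture_output=True, text=True, check=True,
    ).stdout.rstrip("\n")  # mirror the shell's $(...) newline stripping
    result = subprocess.run(
        build_kwja_args(tokenized, "jumanpp"),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `analyze_with_jumanpp("外国人参政権")` would correspond to the first command above, provided both tools are installed.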
- Analyze `デ`, `ト`, and `時間` cases in addition to `ガ`, `ヲ`, `ニ`, and `ガ2` cases in predicate-argument structure analysis.
- Merge senter module into char module.
- Version specification of `rhoknp` in `pyproject.toml`.
- Support executing kwja with `python -m kwja`.
- Fix a bug in senter prediction.
- Fix a bug in the interactive mode.
- Fix a bug in the seq2seq model's output.
- Support Python 3.11.
- Support NN-based sentence segmentation.
kwja --tasks senter --text "モーニング娘。は日本のアイドルグループです。"
- Support multiple files as input.
kwja --filename file1.txt --filename file2.txt
- Introduce a config file. You can specify some options in `XDG_CONFIG_HOME/kwja/config.yaml`.
model_size: base
device: cpu
num_workers: 0
torch_compile: false
typo_batch_size: 1
senter_batch_size: 1
seq2seq_batch_size: 1
char_batch_size: 1
word_batch_size: 1
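To locate the config file named in the entry above, the path can be resolved from the environment. This is a minimal sketch under stated assumptions: the function name `kwja_config_path` is hypothetical, and the fallback to `~/.config` when `XDG_CONFIG_HOME` is unset or empty follows the XDG Base Directory specification rather than anything this changelog confirms about kwja.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def kwja_config_path(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return <XDG_CONFIG_HOME>/kwja/config.yaml (hypothetical helper).

    Assumption: an unset or empty XDG_CONFIG_HOME falls back to ~/.config,
    as in the XDG Base Directory specification.
    """
    env = os.environ if environ is None else environ
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "kwja" / "config.yaml"


if __name__ == "__main__":
    # Show where the config file would be looked up on this machine.
    print(kwja_config_path())
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the function easy to check without touching the real environment.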
- Implement padding truncation of word module to accelerate inference.
- Support Windows.
- Support CUDA 11.7 by default instead of CUDA 10.x.
- Skip typo correction by default.
- Optimize package requirements for faster loading.
- Optimize model initialization for faster loading.
- Replace mt5 models with t5 models pre-trained on Japanese corpora in seq2seq module.
- Use partially annotated data for word normalization to train seq2seq module.
- Remove the discourse module.
- Fix a bug where warning messages are shown when Juman++ and/or KNP are not installed.
- Fix a bug where document IDs are not assigned properly when a text file is given as input.
2.0.0 - 2023-03-14
- Introduce the seq2seq module for more accurate reading prediction and canonicalization.
- Replace RoBERTa-based models with DeBERTaV2-based models.
- Support CUDA 11.7 by default instead of CUDA 10.2.
- Fix many minor bugs.
1.4.2 - 2023-02-22
- Fix a bug with analysis results not being output.
1.4.1 - 2023-01-25
- Fix a bug where the checkpoint is not found.
1.4.0 - 2023-01-25
- Add an option to specify which tasks to perform.
- Fix a corner case where a long sentence can be deleted during document splitting.
1.3.0 - 2023-01-23
- Enable progress bar while executing kwja command.
- Add benchmark script.
- Implement text normalization in char module.
- Add tiny model.
- Fix scripts for building datasets to support latest rhoknp and remove dependency on kyoto-reader.
- Support versioning of local cache directory.
- Stash unsuitable documents so as not to discard them while applying typo module.
- Fix bugs in document_split_stride, reading aligner, and writers.
- Fix phrase masking in cohesion analysis.
- Remove unused main dependencies, `python-Levenshtein`, `ipadic`, `tinydb`, `BetterJSONStorage`, and `dartsclone`.
1.2.2 - 2022-11-07
- Fix a bug where cohesion analysis results are sometimes weird.
1.2.1 - 2022-11-04
- Fix bugs in the word module writer and interactive mode.
1.2.0 - 2022-10-27
- Add CLI options to change the batch size: `--typo-batch-size`, `--char-batch-size`, and `--word-batch-size`.
- Add large model.
- Output predictions per batch to avoid out-of-memory errors.
- Allow input files containing multiple documents in one file from command line.
- Use pure-cdb for storing JumanDIC instead of TinyDB.
1.1.2 - 2022-10-13
- Fix a bug where the CLI does not work due to a missing dependency.
- Relax the version constraint on `torch`.
1.1.1 - 2022-10-12
- Output the instruction message of the CLI tool to stderr instead of stdout.
1.1.0 - 2022-10-11
- Interactive mode.
- Support for Python 3.8.
- Use GPU for analyses if available.
- Remove unused dependencies, `wandb` and `kyoto-reader`.
1.0.3 - 2022-10-03
- Add `--version` option to CLI.
- Analyze unnormalized texts in word module after word normalization.
1.0.2 - 2022-10-01
- Fix dependency parsing.
1.0.1 - 2022-09-28
- Remove an unnecessary dependency, `fugashi`.