Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[v2.4.0] - 2024-07-18

Added

  • Introduce a special token that preserves whitespace as is.

Changed

  • Set default value of num_beams to 1.
  • Refactor seq2seq module to make it slightly faster.

Removed

  • Remove normalization of whitespace to "␣".

[v2.3.0] - 2024-02-01

Added

  • Support Python 3.12.

Fixed

  • Preserve document IDs when they are given as input.

[v2.2.0] - 2023-10-27

Added

  • Support jumanpp and knp input formats. This lets you reuse part of the tokenization results of jumanpp as input.
    kwja --tasks word --text "$(echo "外国人参政権" | jumanpp)" --input-format jumanpp
    kwja --tasks word --filename <(echo "外国人参政権" | jumanpp) --input-format jumanpp
    
    kwja --tasks word --text "$(echo "外国人参政権" | jumanpp | knp -tab)" --input-format knp
    kwja --tasks word --filename <(echo "外国人参政権" | jumanpp | knp -tab) --input-format knp
  • Analyze three additional cases (including 時間) on top of the four already supported (including ガ2) in predicate-argument structure analysis.
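
The jumanpp shell pipeline above can also be driven from Python. The following is a minimal sketch, not part of kwja itself: the helper names `kwja_argv_for_jumanpp` and `analyze_with_jumanpp` are hypothetical, and it assumes that the `jumanpp` and `kwja` executables are on `PATH`. Only the `--tasks`, `--text`, and `--input-format` flags shown in the commands above are used.

```python
import subprocess
from typing import List


def kwja_argv_for_jumanpp(jumanpp_output: str, tasks: str = "word") -> List[str]:
    """Build the kwja command line for text already segmented by jumanpp.

    Hypothetical helper: it mirrors the shell example
    kwja --tasks word --text "$(... | jumanpp)" --input-format jumanpp
    """
    return [
        "kwja",
        "--tasks", tasks,
        "--text", jumanpp_output,
        "--input-format", "jumanpp",
    ]


def analyze_with_jumanpp(text: str) -> str:
    """Run jumanpp on `text`, then pass its output to kwja (assumes both are installed)."""
    segmented = subprocess.run(
        ["jumanpp"],
        input=text + "\n",  # echo appends a newline in the shell version
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    # Shell command substitution strips trailing newlines, so do the same here.
    argv = kwja_argv_for_jumanpp(segmented.rstrip("\n"))
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout
```

Calling `analyze_with_jumanpp("外国人参政権")` would then correspond to the first shell command above; for the knp format, the same pattern would apply with the `jumanpp | knp -tab` output and `--input-format knp`.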

Changed

  • Merge senter module into char module.

[v2.1.3] - 2023-08-28

Fixed

  • Fix the version specification of rhoknp in pyproject.toml.

[v2.1.2] - 2023-06-07

Added

  • Support executing kwja with python -m kwja.

Fixed

  • Fix a bug in senter prediction.

[v2.1.1] - 2023-06-03

Fixed

  • Fix a bug in the interactive mode.
  • Fix a bug in the seq2seq model's output.

[v2.1.0] - 2023-06-02

Added

  • Support Python 3.11.
  • Support NN-based sentence segmentation.
    kwja --tasks senter --text "モーニング娘。は日本のアイドルグループです。"
  • Support multiple files as input.
    kwja --filename file1.txt --filename file2.txt
  • Introduce a config file. You can specify some options in XDG_CONFIG_HOME/kwja/config.yaml.
    model_size: base
    device: cpu
    num_workers: 0
    torch_compile: false
    typo_batch_size: 1
    senter_batch_size: 1
    seq2seq_batch_size: 1
    char_batch_size: 1
    word_batch_size: 1
  • Implement padding truncation of word module to accelerate inference.
  • Support Windows.
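
To make the config file location mentioned above concrete, here is a minimal sketch of how the path XDG_CONFIG_HOME/kwja/config.yaml could be resolved. The function name `resolve_config_path` is hypothetical, and the fallback to `~/.config` when XDG_CONFIG_HOME is unset or empty follows the XDG Base Directory Specification; whether kwja applies the same fallback is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the expected location of kwja's config.yaml (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: an unset or empty XDG_CONFIG_HOME falls back to ~/.config,
    # as the XDG Base Directory Specification recommends.
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "kwja" / "config.yaml"


if __name__ == "__main__":
    # Print where the config file would be looked up on this machine.
    print(resolve_config_path())
```

Passing a mapping instead of reading `os.environ` directly keeps the helper easy to check without touching the real environment.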

Changed

  • Support CUDA 11.7 by default instead of CUDA 10.x.
  • Skip typo correction by default.
  • Optimize package requirements for faster loading.
  • Optimize model initialization for faster loading.
  • Replace mt5 models with t5 models pre-trained on Japanese corpora in seq2seq module.
  • Use partially annotated data for word normalization to train seq2seq module.

Removed

  • Remove the discourse module.

Fixed

  • Fix a bug where warning messages are shown when Juman++ and/or KNP are not installed.
  • Fix a bug where document IDs are not assigned properly when a text file is given as input.

2.0.0 - 2023-03-14

Added

  • Introduce the seq2seq module for more accurate reading prediction and canonicalization.

Changed

  • Replace RoBERTa-based models with DeBERTaV2-based models.
  • Support CUDA 11.7 by default instead of CUDA 10.2.

Fixed

  • Fix many minor bugs.

1.4.2 - 2023-02-22

Fixed

  • Fix a bug with analysis results not being output.

1.4.1 - 2023-01-25

Fixed

  • Fix a bug where the checkpoint is not found.

1.4.0 - 2023-01-25

Added

  • Add an option to specify which tasks to perform.

Fixed

  • Fix a corner case where a long sentence can be deleted during document splitting.

1.3.0 - 2023-01-23

Added

  • Enable progress bar while executing kwja command.
  • Add benchmark script.
  • Implement text normalization in char module.
  • Add tiny model.

Fixed

  • Fix scripts for building datasets to support latest rhoknp and remove dependency on kyoto-reader.
  • Support versioning of local cache directory.
  • Stash unsuitable documents so as not to discard them while applying typo module.
  • Fix bugs in document_split_stride, the reading aligner, and writers.
  • Fix phrase masking in cohesion analysis.

Removed

  • Remove unused main dependencies, python-Levenshtein, ipadic, tinydb, BetterJSONStorage, and dartsclone.

1.2.2 - 2022-11-07

Fixed

  • Fix a bug where cohesion analysis results are sometimes weird.

1.2.1 - 2022-11-04

Fixed

  • Fix bugs in the word module writer and interactive mode.

1.2.0 - 2022-10-27

Added

  • Add CLI options to change the batch size.
    • --typo-batch-size, --char-batch-size, and --word-batch-size.
  • Add large model.

Fixed

  • Output predictions per batch to avoid out of memory error.
  • Allow input files containing multiple documents in one file from command line.
  • Use pure-cdb for storing JumanDIC instead of TinyDB.

1.1.2 - 2022-10-13

Fixed

  • Fix a bug where the CLI does not work due to a missing dependency.
  • Relax the version constraint on torch.

1.1.1 - 2022-10-12

Fixed

  • Output the instruction message of the CLI tool to stderr instead of stdout.

1.1.0 - 2022-10-11

Added

  • Interactive mode.
  • Support for Python 3.8.

Fixed

  • Use GPU for analyses if available.

Removed

  • Remove unused dependencies, wandb and kyoto-reader.

1.0.3 - 2022-10-03

Added

  • Add --version option to CLI.

Fixed

  • Analyze unnormalized texts in word module after word normalization.

1.0.2 - 2022-10-01

Fixed

  • Fix dependency parsing.

1.0.1 - 2022-09-28

Removed

  • Remove an unnecessary dependency, fugashi.