To install the environment, run:
pip install -r requirements.txt
MuCGEC and NLPCC18: download links can be found in the MuCGEC repository.
FCGEC: download link can be found in the FCGEC repository.
NaCGEC: download link can be found in the NaCGEC repository.
Process the data into the same format as data/MuCGEC/train_examples.json.
Use data/MuCGEC/utils.py to split the data into two parts for two-stage training.
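If you prefer to do the split in your own code, a minimal sketch is shown below; it assumes train_examples.json holds a JSON list of examples and simply shuffles and halves it. The output file names are placeholders, not necessarily the names the training scripts expect.

```python
# Minimal sketch of a two-way split for two-stage training.
# Assumption: train_examples.json is a JSON list of example objects;
# the field names inside each example do not matter for the split itself.
import json
import random

with open("data/MuCGEC/train_examples.json", encoding="utf-8") as f:
    examples = json.load(f)

random.seed(42)
random.shuffle(examples)

half = len(examples) // 2
splits = {
    "train_part1.json": examples[:half],   # placeholder file name
    "train_part2.json": examples[half:],   # placeholder file name
}

for name, subset in splits.items():
    with open(f"data/MuCGEC/{name}", "w", encoding="utf-8") as f:
        json.dump(subset, f, ensure_ascii=False, indent=2)
```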
Chinese BART large: Hugging Face Link
Baichuan2-7B-Base: Hugging Face Link
# BART: train the stage-1 correction model
bash seq2seq/scripts/train_stage1.sh
# Baichuan2: train the stage-1 correction model
bash llm/scripts/train_stage1.sh
# BART: generate predictions for the stage-2 training data
bash seq2seq/scripts/generate_stage2_pred.sh
# Baichuan2: generate predictions for the stage-2 training data
bash llm/scripts/generate_stage2_pred.sh
# BART: train the alignment model
bash seq2seq/scripts/train_align.sh
# Baichuan2: train the alignment model
bash llm/scripts/train_align.sh
# BART: alignment distillation training
bash seq2seq/scripts/train_alignment_distill.sh
# Baichuan2: alignment distillation training
bash llm/scripts/train_alignment_distall.sh
For prediction, please use llm/src/predict.py or seq2seq/src/predict.py.
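The prediction scripts drive the full pipeline; if you only want to sanity-check a fine-tuned seq2seq checkpoint interactively, a rough sketch with Hugging Face transformers could look like the following. The checkpoint path, example sentence, and generation settings are placeholders, not the repository's actual configuration.

```python
# Rough sketch: beam-search generation with a fine-tuned Chinese BART checkpoint.
# The checkpoint path and generation hyperparameters below are placeholders.
from transformers import BertTokenizer, BartForConditionalGeneration

model_path = "path/to/finetuned-bart"  # placeholder
tokenizer = BertTokenizer.from_pretrained(model_path)  # Chinese BART uses a BERT-style tokenizer
model = BartForConditionalGeneration.from_pretrained(model_path)

source = "这样不仅会使成绩提高，还可以养成好的学习习惯。"
inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```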
For evaluation, we adopt the ChERRANT scorer to calculate character-level P/R/F0.5 for FCGEC and NaCGEC, and the M2Scorer to calculate word-level P/R/F0.5 for NLPCC18-Test. For usage details, please refer to this script.
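Both scorers report precision, recall, and F0.5 over edits; for reference, F0.5 is just the weighted harmonic mean below, which favors precision over recall (the counts in the usage line are made up):

```python
# F0.5 (beta = 0.5) weights precision more heavily than recall.
def prf(tp: int, fp: int, fn: int, beta: float = 0.5):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

print(prf(tp=30, fp=10, fn=20))  # -> (0.75, 0.6, 0.7142...)
```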
If you find our work helpful, please cite us as:
@inproceedings{yang-quan-2024-alirector,
title = "Alirector: Alignment-Enhanced {C}hinese Grammatical Error Corrector",
author = "Yang, Haihui and Quan, Xiaojun",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
year = "2024",
}
Link on ACL Anthology: https://aclanthology.org/2024.findings-acl.148/