To extract keywords from e-commerce documents, we need high macro-F1 and macro-F2 scores. Existing loss functions could not reach good enough metrics, so we propose the PBP and CECLA losses for this purpose. In addition, we propose the BAT model, which attains better performance than other SOTA models.
The data we used cannot be released, so we run our experiments on the CoNLL-2003 English dataset and obtain a strong F1 score on it.
- Python 3.7
- We recommend using pipenv to set up the development environment.
```
pip install -r requirements.txt
pip install -e .
```
CoNLL-2003 English task
After checking the requirements and finishing the installation, follow these steps:
(1) Get the CoNLL-2003 dataset from here, with details from here. Then move train.txt, dev.txt, and test.txt to /BAT/data.
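For a quick sanity check that the files landed in /BAT/data and are in the expected four-column CoNLL format (token, POS, chunk, NER tag), a minimal Python sketch:

```python
from pathlib import Path

# Verify the three CoNLL-2003 splits are in place and peek at the format.
data_dir = Path("/BAT/data")
for split in ("train.txt", "dev.txt", "test.txt"):
    path = data_dir / split
    assert path.exists(), f"missing {path}"
    with path.open(encoding="utf-8") as f:
        # Each non-empty line is "token POS chunk NER-tag";
        # blank lines separate sentences.
        for line in f:
            line = line.strip()
            if line and not line.startswith("-DOCSTART-"):
                print(split, "->", line)
                break
```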
(2) Tune the config file (or use the default).
```
cd /BAT/config
vim conll2003.yaml
```
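As an illustration of what tuning looks like programmatically, here is a minimal sketch of loading the config in Python; the key names "lr" and "batch_size" are hypothetical placeholders, so consult the shipped conll2003.yaml for the actual schema:

```python
import yaml  # PyYAML

# Load the experiment config; the keys printed below are illustrative
# placeholders -- check config/conll2003.yaml for the real field names.
with open("/BAT/config/conll2003.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg.get("lr"), cfg.get("batch_size"))
```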
(3) Train the model.
We connect the frozen xlm-roberta-large encoder to the BAT model and train only the BAT part.
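As a rough sketch of this frozen-encoder setup (the linear head below is a hypothetical stand-in for the BAT model, whose real definition lives in this repository), freezing xlm-roberta-large in PyTorch looks like:

```python
import torch
from transformers import AutoModel

# Load the pretrained encoder and freeze it; only the BAT head gets gradients.
encoder = AutoModel.from_pretrained("xlm-roberta-large")
for param in encoder.parameters():
    param.requires_grad = False  # frozen: no updates to xlm-roberta-large
encoder.eval()

# Hypothetical stand-in for the BAT model, for illustration only.
bat_head = torch.nn.Linear(encoder.config.hidden_size, 9)  # 9 CoNLL-2003 BIO tags

# Only the BAT parameters go to the optimizer.
optimizer = torch.optim.AdamW(bat_head.parameters(), lr=1e-4)
```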
Before training, enter the Python environment and download the NLTK punkt tokenizer:
```
$ python
>>> import nltk
>>> nltk.download('punkt')
```
Then run the training script:
```
cd /BAT
CUDA_VISIBLE_DEVICES=0 python sample_conll.py --config-name conll2003.yaml
```
or, using tee to save the training log:
```
CUDA_VISIBLE_DEVICES=0 python -u sample_conll.py --config-name conll2003.yaml 2>&1 | tee -a conll2003.log
```
Our sample code achieves an F1 of 93 on the CoNLL-2003 English NER task.
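For reference, CoNLL-2003 NER F1 is an entity-level score over BIO tag sequences; here is a minimal sketch of computing it with the seqeval library (an assumption on our side; the repo's own evaluation code may differ):

```python
from seqeval.metrics import f1_score

# Toy gold/predicted BIO sequences; seqeval scores at the entity level,
# as is standard for CoNLL-2003 NER.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"]]
print(f1_score(y_true, y_pred))  # 1.0 on this toy example
```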
We follow the preprocessing of https://github.com/wzhouad/NLL-IE for the CoNLL-2003 English data.
If this repository is helpful to you, please cite our paper:
```
@article{Liu-bat-2022,
  author  = {Chiung-Ju Liu and Huang-Ting Shieh},
  title   = {BAT: Born for Auto-Tagging: Faster and Better with New Objective Functions},
  journal = {arXiv preprint arXiv:2206.07264},
  year    = {2022}
}
```