In this repo, we further extend our work in "Can Cross-domain Term Extraction Benefit from Cross-lingual Transfer?" by introducing a novel nested term labeling mechanism and evaluating the model's performance in cross-lingual and multi-lingual settings, comparing it with the traditional BIO annotation regime.
Please install all the necessary libraries listed in requirements.txt using this command:
pip install -r requirements.txt
The experiments were conducted on 2 datasets:
| | ACTER dataset | RSDO5 dataset |
|---|---|---|
| Languages | English, French, and Dutch | Slovenian |
| Domains | Corruption, Wind energy, Equitation, Heart failure | Biomechanics, Chemistry, Veterinary, Linguistics |
| Original version | AylaRT/ACTER | Corpus of term-annotated texts RSDO5 1.0 |
The novel nested term labeling mechanism (NOBI) and the labeled data can be accessed at @honghanhh/nobi_annotation_regime.
The workflow of the model is described in our upcoming 2023 paper. To reproduce the results, please run the following commands:
chmod +x run.sh
./run.sh
which runs the model over all of the following scenarios (an illustrative sketch of the language splits follows the list):

- ACTER dataset with XLM-RoBERTa in mono-lingual, cross-lingual, and multi-lingual settings, with both the ANN and NES versions, where the multi-lingual settings cover only the three ACTER languages and additional Slovenian add-ons (10 scenarios).
- RSDO5 dataset with XLM-RoBERTa in mono-lingual, cross-lingual, and multi-lingual settings, with the cross-lingual and multi-lingual settings taking the ANN and NES versions into account (48 scenarios).
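For orientation, the snippet below is an illustrative sketch (not the repo's exact configuration) of how mono-, cross-, and multi-lingual train/test splits can be composed from the four languages; the specific language combinations shown are assumptions for illustration, and the full scenario grid is driven by run.sh.

```python
# Illustrative sketch of mono-, cross-, and multi-lingual splits composed from
# the available languages (en, fr, nl from ACTER; sl from RSDO5).
# The concrete combinations below are examples only; run.sh drives the full grid.
SETTINGS = {
    "mono-lingual":  {"train": ["en"],                   "test": "en"},  # same language for training and testing
    "cross-lingual": {"train": ["en", "fr"],             "test": "nl"},  # test language unseen during training
    "multi-lingual": {"train": ["en", "fr", "nl", "sl"], "test": "nl"},  # test language included in training
}

for name, split in SETTINGS.items():
    print(f"{name:13s} train={split['train']} test={split['test']}")
```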
Note that the model produces results for the NOBI-annotated set. To reproduce the results for the BIO-annotated set, please refer to @honghanhh/ate-2022.
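For readers unfamiliar with the baseline regime, the sketch below illustrates plain BIO tagging of token-level term spans; it is a simplified, hypothetical helper, not code from either repo. The exact NOBI tag set, which additionally encodes nested terms, is documented at @honghanhh/nobi_annotation_regime.

```python
# Minimal illustration of standard BIO tagging for term extraction.
# Plain BIO cannot represent nested terms, which is what the NOBI regime addresses.
def bio_tags(tokens, term_spans):
    """term_spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in term_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = ["chronic", "heart", "failure", "is", "common"]
# "chronic heart failure" is annotated as a term; the nested term
# "heart failure" is lost under plain BIO.
print(list(zip(tokens, bio_tags(tokens, [(0, 3)]))))
# [('chronic', 'B'), ('heart', 'I'), ('failure', 'I'), ('is', 'O'), ('common', 'O')]
```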
Feel free to tune the model's hyper-parameters. The current settings are:
num_train_epochs=20,                # total number of training epochs
per_device_train_batch_size=32,     # batch size per device during training
per_device_eval_batch_size=32,      # batch size per device during evaluation
learning_rate=2e-5,                 # initial learning rate
eval_steps=500,                     # evaluate every 500 training steps
load_best_model_at_end=True,        # load the best checkpoint at the end of training
metric_for_best_model="f1",         # select the best checkpoint by F1 score
greater_is_better=True              # a higher F1 score is better
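These values are keyword arguments for Hugging Face TrainingArguments. The sketch below shows one way (not the repo's exact training script) to wire them into a Trainer for token classification with XLM-RoBERTa; the model name, output directory, label count, and dataset placeholders are assumptions, a compute_metrics callback returning the F1 score (omitted here) is needed for metric_for_best_model to take effect, and argument names reflect the transformers versions current at the time of the experiments.

```python
# Sketch only: wiring the settings above into a Hugging Face Trainer for
# token classification with XLM-RoBERTa. Placeholders must be replaced with
# the repo's actual datasets and label set before training.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    Trainer,
    TrainingArguments,
)

num_labels = 3  # placeholder: 3 for plain BIO; the NOBI tag set is larger

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=num_labels
)

training_args = TrainingArguments(
    output_dir="./results",          # placeholder output directory
    num_train_epochs=20,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="steps",     # evaluate every eval_steps steps
    eval_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # requires a compute_metrics callback returning "f1"
    greater_is_better=True,
)

train_dataset = None  # placeholder: tokenized, label-aligned training set
eval_dataset = None   # placeholder: tokenized, label-aligned validation set

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
# trainer.train()  # run once the dataset placeholders are replaced
```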
Please refer to our upcoming 2023 paper for the results and error analysis.
Tran, Hanh Thi Hong, et al. "Can Cross-Domain Term Extraction Benefit from Cross-lingual Transfer?" Discovery Science: 25th International Conference, DS 2022, Montpellier, France, October 10–12, 2022, Proceedings. Cham: Springer Nature Switzerland, 2022.
- 🐮 TRAN Thi Hong Hanh 🐮
- Prof. Senja POLLAK
- Prof. Antoine DOUCET
- Prof. Matej MARTINC