This repository includes a script for translating SQuAD v2 into a chosen language:
- The structure of SQuAD is kept as it is
- Translated answers are matched back to the translated context with an n-gram-based similarity matcher (see the sketch below)
- Text is translated with the `mtranslate` package
Run
```bash
python prepare_squad.py --dest_lng pl --train_or_dev dev
```
to translate the dev set to Polish, and
```bash
python prepare_squad.py --dest_lng pl --train_or_dev train
```
to translate the train set to Polish.
As a result, a new directory named after the translated SQuAD set will be created.
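To illustrate the answer-matching step, here is a minimal sketch of an n-gram similarity matcher. All function names are hypothetical and this is not the actual `prepare_squad.py` implementation; the real script must also recover character offsets so the SQuAD `answer_start` field stays valid.

```python
# Hypothetical sketch of an n-gram similarity matcher -- illustrative only,
# not the actual prepare_squad.py code.

def ngrams(tokens, n):
    """Return the set of n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_similarity(a, b, n=1):
    """Jaccard similarity between the n-gram sets of two strings.
    Unigrams by default; a higher n gives stricter matching."""
    ga, gb = ngrams(a.lower().split(), n), ngrams(b.lower().split(), n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def match_answer(translated_context, translated_answer, n=1):
    """Slide a window over the translated context and return the span
    most similar to the independently translated answer."""
    tokens = translated_context.split()
    width = max(len(translated_answer.split()), n)
    best_span, best_score = "", 0.0
    for i in range(len(tokens) - width + 1):
        span = " ".join(tokens[i:i + width])
        score = ngram_similarity(span, translated_answer, n)
        if score > best_score:
            best_span, best_score = span, score
    return best_span, best_score
```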
The output files can then be used to train a QA model:
- Use the existing multilingual BERT as the language model, or any other language-specific model
- Run the following script from Hugging Face Transformers:
```bash
export SQUAD_DIR=path/to/translated_squad

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --do_train \
  --do_eval \
  --version_2_with_negative \
  --train_file $SQUAD_DIR/translated_squad_train.json \
  --predict_file $SQUAD_DIR/translated_squad_dev.json \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --save_steps 8000 \
  --output_dir ../../output \
  --overwrite_cache \
  --overwrite_output_dir
```
Now the model can be shared on the Hugging Face Model Hub.
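One way to do this, assuming a recent Transformers version with `push_to_hub` support (the repository name below is illustrative; run `huggingface-cli login` beforehand):

```python
# Sketch of uploading the fine-tuned model to the Hugging Face Hub.
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model = AutoModelForQuestionAnswering.from_pretrained("../../output")
tokenizer = AutoTokenizer.from_pretrained("../../output")

model.push_to_hub("bert-base-multilingual-cased-finetuned-polish-squad2")
tokenizer.push_to_hub("bert-base-multilingual-cased-finetuned-polish-squad2")
```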
The model can then be used for QA tasks with the following code:
```python
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad2"
)

qa_pipeline({
    'context': "Warszawa jest największym miastem w Polsce pod względem liczby ludności i powierzchni",
    'question': "Jakie jest największe miasto w Polsce?"
})
```
Output:
```json
{
  "score": 0.9986,
  "start": 0,
  "end": 8,
  "answer": "Warszawa"
}
```
Evaluation results on the translated Polish SQuAD 2.0 dev set:

| Model | EM / F1 | HasAns (EM / F1) | NoAns (EM = F1) |
|---|---|---|---|
| SlavicBERT | 52.90 / 59.61 | 37.04 / 50.48 | 68.71 |
| polBERT | 50.63 / 57.24 | 35.98 / 49.21 | 65.24 |
| multiBERT | 55.67 / 61.94 | 35.76 / 48.31 | 75.52 |