Name: Boris Zarubin Email: [email protected]
This is a SpERT model trained on the RuNNE dataset for this CodaLab competition.
All notebooks are available on Kaggle and are designed to run there.
Before applying the SpERT model, several other approaches were researched and tested:
- PIQN - too expensive in terms of computational resources
- MRC - outdated and abandoned due to compatibility problems
- Biaffine NER - outdated and abandoned due to compatibility problems
Eventually it was decided to use the SpERT model. The main challenge here was to transform the RuNNE dataset to SpERT format and vice versa; the details are described in the next section.
Our input is a list of strings, where each string contains several sentences, plus char offsets with NER labels. Our task is to convert this data into a list of sentences, where each sentence is a list of tokens and the NER labels link to these tokens by index.
Further, `nltk.tokenize.sent_tokenize(text, language='russian')` and `nltk.tokenize.word_tokenize(sentence, language='russian')` will be used for tokenization.
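To make the transformation concrete, here is a minimal self-contained sketch; the sample text, char offsets, and labels are made up for illustration:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # newer NLTK versions may also need 'punkt_tab'

# Input: a text with char-offset NER labels (hypothetical sample)
text = 'Москва — столица России. Здесь живёт президент.'
ners = [(0, 6, 'CITY'), (17, 23, 'COUNTRY')]  # (start, end, label) char offsets

sentences = sent_tokenize(text, language='russian')
# ['Москва — столица России.', 'Здесь живёт президент.']
tokens = word_tokenize(sentences[0], language='russian')
# e.g. ['Москва', '—', 'столица', 'России', '.']

# Target: the same labels expressed as token indices within a sentence,
# e.g. (0, 6, 'CITY') -> token 0, (17, 23, 'COUNTRY') -> token 3
```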
The main parts of the preprocessing notebook are:
- Preprocess functions
The first task to solve is finding char offsets, given a list of tokens and the original text. It is done by iterating through the original text and yielding a char offset each time the next token is met. If not all tokens are found, a `ValueError` is raised. Implemented in `tokens_to_indices(text, tokens)`.
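A minimal sketch of what `tokens_to_indices` could look like (the notebook's actual implementation may differ):

```python
from typing import List, Tuple

def tokens_to_indices(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Find (start, end) char offsets for each token in `text`, left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:  # a token could not be matched against the text
            raise ValueError(f'token {token!r} not found after offset {cursor}')
        cursor = start + len(token)
        offsets.append((start, cursor))
    return offsets
```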
The second task to solve is finding token indices, given char offsets for each token and some input offset. It is done by iterating through the original char offsets and finding the points of overlap. Implemented in `index_to_tokens(token_indices, index)`.
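Likewise, a minimal sketch of `index_to_tokens`, assuming both the token offsets and the input offset are (start, end) char pairs and the result is the range of overlapping tokens:

```python
from typing import List, Tuple

def index_to_tokens(token_indices: List[Tuple[int, int]],
                    index: Tuple[int, int]) -> Tuple[int, int]:
    """Return (first, last) indices of tokens whose char spans overlap `index`."""
    start, end = index
    hits = [i for i, (s, e) in enumerate(token_indices) if s < end and e > start]
    if not hits:
        raise ValueError(f'no tokens overlap span {index}')
    return hits[0], hits[-1]
```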
- Transformation functions
Now that we have these backend functions, we can apply them to transform the data.
Our first step is to parse the NEREL data. Here the sentinfo (sentence information) format is introduced, which serves as a mediator between NEREL and SpERT. It looks as follows:

```python
List[{
    'sentence': sentence as a string,
    'sent_offset': sentence offset relative to the original NEREL text (NEREL batch),
    'tokens': list of tokens,
    'tokens_offsets': token offsets relative to the sentence,
    'batch_ix': index of the NEREL batch,
    'sent_ix': sentence index in the sentinfo data,
    'ners': NER labels in SpERT format
}]
```

Now, to convert NEREL to sentinfo, we do the following:

```python
sentinfo = []
for nerel_batch in nerel_data:  # assuming nerel_batch = {'text': ..., 'ners': ...}
    for sentence in sent_tokenize(nerel_batch['text'], language='russian'):
        # char offset of the sentence within the batch text
        sent_offset = tokens_to_indices(nerel_batch['text'], [sentence])[0]
        tokens = word_tokenize(sentence, language='russian')
        tokens_offsets = tokens_to_indices(sentence, tokens)
        ners = []
        for nerel_ner in nerel_batch['ners']:
            if nerel_ner in sentence:  # pseudocode: entity span falls inside this sentence
                ners.append(index_to_tokens(tokens_offsets, nerel_ner))
        # append all this data as a sentinfo item
        sentinfo.append(...)
```

Implemented in `nerel_to_sentinfo(nerel_data, train=True)`.
- Convert NEREL dataset to SPERT format
To convert NEREL to SpERT, we first convert it to sentinfo and then return the following:

```python
return [{
    'tokens': sent['tokens'],
    'entities': sent['ners'],
    'relations': []
} for sent in sentinfo]
```

Implemented in `sentinfo_to_spert(sentinfo)`.
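For reference, a single resulting entry might look like this (a made-up sample; the token-index spans with an exclusive `end` follow SpERT's example datasets and are an assumption here):

```python
spert_item = {
    'tokens': ['Москва', '—', 'столица', 'России', '.'],
    'entities': [
        {'type': 'CITY', 'start': 0, 'end': 1},
        {'type': 'COUNTRY', 'start': 3, 'end': 4},
    ],
    'relations': [],  # the RuNNE task has no relations
}
```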
- Convert SPERT predictions to NEREL
Here, `sentinfo['batch_ix']` and `sentinfo['sent_offset']` are used to convert SpERT predictions back to NEREL format. Implemented in `pred_to_nerel(batches_count, sentinfo, pred_data)`.
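A minimal sketch of how this back-conversion could work, assuming the sentinfo fields described above and SpERT predictions shaped as `{'type', 'start', 'end'}` with an exclusive `end` (not necessarily the notebook's exact implementation):

```python
from typing import Dict, List, Tuple

def pred_to_nerel(batches_count: int,
                  sentinfo: List[Dict],
                  pred_data: List[Dict]) -> List[List[Tuple[int, int, str]]]:
    """Map per-sentence token-span predictions back to char offsets
    in the original NEREL batches."""
    batches = [[] for _ in range(batches_count)]
    for sent, pred in zip(sentinfo, pred_data):
        sent_start = sent['sent_offset'][0]  # assuming sent_offset = (start, end)
        offsets = sent['tokens_offsets']     # (start, end) per token, sentence-relative
        for ent in pred['entities']:
            start = sent_start + offsets[ent['start']][0]
            end = sent_start + offsets[ent['end'] - 1][1]
            batches[sent['batch_ix']].append((start, end, ent['type']))
    return batches
```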
Training and predicting are done as shown here. The only differences are:
- there are no relations in the NEREL dataset
- a different BERT is used (`cointegrated/rubert-tiny` and `ai-forever/ruBert-large`)
In this notebook, `model.safetensors` is converted to `pytorch_model.bin` to avoid compatibility problems.
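A minimal sketch of this conversion using the standard safetensors and torch APIs (file names as in the notebook):

```python
import torch
from safetensors.torch import load_file

# Load the tensors from the safetensors checkpoint and re-save them
# in the legacy PyTorch format that older SpERT code expects.
state_dict = load_file('model.safetensors')
torch.save(state_dict, 'pytorch_model.bin')
```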
The overall train-to-predict pipeline looks as follows:
- Convert the NEREL dataset to SpERT format
- Train the SpERT model
- Convert `model.safetensors` to `pytorch_model.bin`
- Predict
- Convert predictions to NEREL format
Results on the dev set:

| Model | Mention F1 | Mention recall | Mention precision | Macro F1 | Macro F1 few-shot |
|---|---|---|---|---|---|
| rubert-large-15epochs | 83.00% | 83.28% | 82.73% | 74.54% | 0.00% |
| rubert-large-20epochs | 82.54% | 83.21% | 81.88% | 74.08% | 0.00% |
| rubert-large-30epochs | 83.00% | 83.53% | 82.47% | 74.02% | 0.00% |
| rubert-tiny-20epochs | 69.72% | 71.47% | 68.05% | 55.92% | 0.00% |
`rubert-large` has shown to be the best in terms of F1 score, with an insignificant difference between models trained for different numbers of epochs. On the other hand, `rubert-tiny` was too small to capture the complex nature of the NEREL dataset.