The BioNER code is adapted from WeLT: Improving Biomedical Fine-tuned Pre-trained Language Models with Cost-sensitive Learning.
## Dependencies
- Python (>=3.6)
- PyTorch (>=1.2.0)
- Clone this GitHub repository
- Navigate to the BioNER folder and install all necessary dependencies:
`python3 -m pip install -r requirements.txt`
Note: To install the appropriate PyTorch build, follow the installation instructions for your development environment.
## NER Datasets
| Dataset | Source |
|---|---|
| BioBERT NER datasets | Directly retrieved from BioBERT via this link |
| BioRED | We have extended the aforementioned NER datasets to include BioRED. To convert from BioC XML / JSON to CoNLL, we used bconv and filtered the chemical and disease entities. |
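As a rough illustration of the BioRED conversion step, the sketch below loads a BioC XML collection with bconv and writes it back out in CoNLL format. The file names and the format identifiers are assumptions, and the chemical/disease filtering step is left abstract; the exact conversion script used for this repository may differ.

```python
# Minimal sketch (not the exact conversion script used here): read BioC XML
# with bconv and dump CoNLL. The format names "bioc_xml" and "conll" are
# assumptions; filtering to chemical/disease annotations would be applied to
# the collection's annotations before dumping.
import bconv

coll = bconv.load("BioRED.xml", fmt="bioc_xml")   # parse the BioC XML collection
bconv.dump(coll, "BioRED.conll", fmt="conll")     # write CoNLL-style tokens and tags
```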
## Data & Evaluation Code Download
To directly download the NER datasets for fine-tuning models from scratch, use `download.sh`, or manually download them via this link into the main directory, then `unzip datasets.zip` and `rm -r datasets.zip`.
The same instructions apply to the evaluation code.
## Data Pre-processing
We adapted the `preprocessing.sh` script from BioBERT to include BioRED.
We conducted the experiments on two different BERT models using the WeLT weighting scheme and compared WeLT against the corresponding traditional fine-tuning approaches (i.e., standard BioBERT fine-tuning). Below we explain the WeLT fine-tuning approach and provide all the fine-tuned models on Hugging Face, an example of fine-tuning from scratch using WeLT, and an example of predicting and evaluating disease entities.
Our experimental work focused on BioBERT (a mixed-domain, continually pre-trained language model) and PubMedBERT (a domain-specific language model pre-trained from scratch); however, WeLT can be adapted to other transformers such as ELECTRA.
| Model | Used version in HF 🤗 |
|---|---|
| BioBERT | model_name_or_path |
| PubMedBERT | model_name_or_path |
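For reference, the snippet below shows one way to load a pre-trained checkpoint for token classification with 🤗 Transformers. The model identifier is only an example of a public BioBERT checkpoint; substitute the `model_name_or_path` linked in the table above.

```python
# Hedged sketch: load a pre-trained biomedical checkpoint for token
# classification. "dmis-lab/biobert-base-cased-v1.1" is used purely as an
# example of a public BioBERT checkpoint, not necessarily the exact version
# referenced in the table.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name_or_path = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForTokenClassification.from_pretrained(
    model_name_or_path,
    num_labels=3,  # e.g. B, I, O tags for a single entity type
)
```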
We adapted BioBERT's `run_ner.py` to develop a cost-sensitive trainer in `run_weight_scheme.py`, which extends the `Trainer` class to a `WeightedLossTrainer` and overrides the `compute_loss` function to apply the WeLT weights in a weighted cross-entropy loss function.
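The actual `run_weight_scheme.py` is not reproduced here, but the following minimal sketch illustrates the general pattern it describes: subclassing `Trainer` and overriding `compute_loss` with a class-weighted cross-entropy. The inverse-frequency weight computation is only a placeholder, not the exact WeLT weighting scheme.

```python
# Minimal sketch of a cost-sensitive trainer in the spirit of
# run_weight_scheme.py. The weight formula below (inverse class frequency)
# is a placeholder illustration, not the actual WeLT scheme.
import torch
from torch import nn
from transformers import Trainer


def inverse_frequency_weights(label_counts):
    """Placeholder weighting: rarer tags receive larger weights."""
    counts = torch.tensor(label_counts, dtype=torch.float)
    return counts.sum() / (len(counts) * counts)


class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # one weight per NER label

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device),
            ignore_index=-100,  # padding / special tokens
        )
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```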
After fine-tuning the BERT models, we recognize chemical and disease entities via `ner.py`. The output files are written to the predicted path directory.
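As a rough illustration of this prediction step (the actual logic lives in `ner.py`), the snippet below runs a fine-tuned token-classification model over raw text with the 🤗 `pipeline` API and writes the recognized mentions to a file. The model path, input text, and output location are placeholders.

```python
# Hedged sketch of entity prediction with a fine-tuned checkpoint.
# "path/to/fine-tuned-model" and the output path are placeholders; the
# repository's ner.py handles this step (and its output format) itself.
import json
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/fine-tuned-model",
    aggregation_strategy="simple",  # merge word pieces into full mentions
)

text = "Cisplatin-induced nephrotoxicity was observed in the patients."
entities = ner(text)

with open("predicted/entities.json", "w") as f:
    json.dump(entities, f, indent=2, default=str)
```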
## Evaluation
We used the strict and approximate evaluation of BioCreative VII Track 2 (the NLM-CHEM track: Full-text Chemical Identification and Indexing in PubMed articles).
- Fine-tuned models available on HF
- Fine-tuning from scratch example
- Predicting disease entities using WeLT example
- Example of evaluating predicted WeLT disease entities
The manuscript is in preparation (TBD)
Authors: Ghadeer Mobasher*, Pedro Ruas, Francisco M. Couto, Olga Krebs, Michael Gertz and Wolfgang Müller
Ghadeer Mobasher is part of the PoLiMeR-ITN (http://polimer-itn.eu/) and is supported by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement PoLiMeR, No 81261