This repo is part of the development of the following Master's thesis: "Named entity recognition and normalization in biomedical literature: a practical case in SARS-CoV-2 literature".
Biomedical Named Entity Recognition and Normalization of the Disease, Chemical and Genetic entity classes using state-of-the-art models. The core component for entity recognition is BioBERT. Normalization is achieved through inverted-index search in a Solr database.
The bionlp package is mainly intended to be used as part of the webpage or the CORD-19 annotation. In their dockerized versions these requirements are already satisfied. To use the package on its own, the following dependencies must be installed:
- transformers>=4.5.0
- spacy>=3
- pysolr~=3.9.0
- torch
The bionlp package can be found in bio-nlp/bionlp.
Solr Database is available online at:
- Diseases: http://librairy.linkeddata.es/solr/#/bioner-diseases/core-overview
- Chemicals: http://librairy.linkeddata.es/solr/#/bioner-diseases/core-overview
- Genetics: http://librairy.linkeddata.es/solr/#/bioner-diseases/core-overview
- COVID: http://librairy.linkeddata.es/solr/#/bioner-diseases/core-overview
The normalization endpoint 'http://librairy.linkeddata.es/solr/' is passed later as an environment variable if you want to use this database. If no endpoint is passed, the system looks for the database on localhost by default. Custom endpoints can also be passed as environment variables to use your own Solr database, but its schemas must match the proposed ones.
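The endpoint fallback described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the variable name `SOLR_URL` is taken from the docker run commands in this README, while the local default URL (Solr's stock port 8983) and the helper names `resolve_solr_url` and `core_url` are assumptions.

```python
import os

# Assumption: Solr's stock port and path. The project's real local default may differ.
DEFAULT_SOLR_URL = "http://localhost:8983/solr/"


def resolve_solr_url(env=None):
    """Return the Solr base URL, preferring SOLR_URL over the local default."""
    env = os.environ if env is None else env
    url = env.get("SOLR_URL", "").strip() or DEFAULT_SOLR_URL
    # Normalise to exactly one trailing slash so a core name can be appended safely.
    return url.rstrip("/") + "/"


def core_url(core, env=None):
    """Build the URL of a single Solr core, e.g. 'bioner-diseases'."""
    return resolve_solr_url(env) + core


if __name__ == "__main__":
    print(core_url("bioner-diseases"))
```

Passing a plain dictionary as `env` makes the fallback easy to check without touching the real environment.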
To use a Solr database on localhost, set it up and populate it before running the deployments below, since the normalization step depends on it. Details about this configuration can be found in bio-nlp/Solr.
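As a rough illustration of what a normalization lookup against such a database might look like, the sketch below queries a Solr core with pysolr for an exact entity mention. The field name `term` and the helper names are hypothetical: the real field names are defined by the schemas in bio-nlp/Solr and may differ.

```python
def build_term_query(mention, field="term"):
    """Build an exact-phrase Solr query for an entity mention.

    Assumption: `field` is a hypothetical name for the indexed term field.
    """
    # Escape backslashes and double quotes so the mention stays inside one phrase.
    escaped = mention.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def normalize(mention, core_url, field="term", rows=1):
    """Return up to `rows` matching Solr documents for a mention."""
    # Imported lazily so the query builder works without pysolr installed.
    import pysolr

    solr = pysolr.Solr(core_url, timeout=10)
    results = solr.search(build_term_query(mention, field), rows=rows)
    return [dict(doc) for doc in results]


if __name__ == "__main__":
    # Requires a reachable, populated Solr core.
    print(normalize("pneumonia", "http://localhost:8983/solr/bioner-diseases"))
```

Keeping the query construction separate from the network call makes it possible to check the escaping logic without a running Solr instance.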
The webpage is available at: https://librairy.github.io/bio-ner/.
The webpage makes the system easy to use: paste the text to process and click the analyze button. The text is sent through an AJAX call to the system, which returns the annotated and normalized data in the following views:
Annotated results will be represented in coloured boxes where each box represents one entity class.
Normalized results appear in a table for each entity class. Each found term is shown along with the ids stored in the Solr database. An extra table with drug-target evidence or related proteins appears if COVID-related terms are found in the processed text.
To ease later use of the retrieved information, a JSON text box is also provided.
This web platform can be easily deployed thanks to its dockerization. The Docker image can be found on Docker Hub: https://hub.docker.com/r/alvaroalon2/webapp_bionlp. The image includes the models. If you do not want to use the provided online Solr endpoint 'http://librairy.linkeddata.es/solr/', omit the environment variable from docker run.
The NVIDIA Container Toolkit is needed for GPU support inside Docker containers with NVIDIA GPUs. The deployment can be performed as follows:
docker pull alvaroalon2/webapp_bionlp:gpu
docker run --name webapp -it --gpus all --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/webapp_bionlp:gpu
If a GPU is not available, the deployment can also be done on CPU. In that case, use the CPU dockerized version instead of the GPU one:
docker pull alvaroalon2/webapp_bionlp:cpu
docker run --name webapp -it --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/webapp_bionlp:cpu
The proposed system is applied to a practical case: annotation of the CORD-19 corpus, which contains thousands of COVID-19 related articles. For this purpose, the corpus is first pre-processed to split it into paragraphs and then loaded into a Solr database using https://github.com/librairy/cord-19. The dockerized version is recommended for running this annotation. The Docker image can be found on Docker Hub: https://hub.docker.com/r/alvaroalon2/bionlp_cord19_annotation. The image includes the models. If you do not want to use the provided online Solr endpoint 'http://librairy.linkeddata.es/solr/', omit the environment variable from docker run.
The NVIDIA Container Toolkit is needed for GPU support inside Docker containers with NVIDIA GPUs. The steps for running the container and starting its annotation and normalization are as follows:
docker pull alvaroalon2/bionlp_cord19_annotation:gpu
docker run --name annotation -it --gpus all --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/bionlp_cord19_annotation:gpu
If a GPU is not available, the deployment can also be done on CPU, although annotation will be substantially slower. In that case, use the CPU dockerized version instead of the GPU one:
docker pull alvaroalon2/bionlp_cord19_annotation:cpu
docker run --name annotation -it --network 'host' -e SOLR_URL="http://librairy.linkeddata.es/solr/" alvaroalon2/bionlp_cord19_annotation:cpu
One model is proposed for each entity class: Diseases, Chemicals and Genetic. The final system is therefore composed of three models, each of which annotates its own entity class. The system automatically checks whether the models have already been stored in their folder. If a model is missing, a cached version is downloaded automatically from the Hugging Face repository where the proposed models were uploaded. These are the repositories for the proposed models:
- Diseases: https://huggingface.co/alvaroalon2/biobert_diseases_ner
- Chemicals: https://huggingface.co/alvaroalon2/biobert_chemical_ner
- Genetic: https://huggingface.co/alvaroalon2/biobert_genetic_ner
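The check-then-download behaviour described above could look roughly like the sketch below. It is not the package's actual implementation: the `models` folder layout, the use of `config.json` as the marker of a saved model, and the function names are assumptions. Only the repository ids come from the list above.

```python
from pathlib import Path

# Hugging Face repositories listed above, keyed by entity class.
HUB_REPOS = {
    "diseases": "alvaroalon2/biobert_diseases_ner",
    "chemicals": "alvaroalon2/biobert_chemical_ner",
    "genetic": "alvaroalon2/biobert_genetic_ner",
}


def resolve_model_source(entity_class, models_dir="models"):
    """Return a local folder if it holds a saved model, else the Hub repo id.

    Assumption: each model lives in <models_dir>/<repo name> and a saved
    model is recognised by its config.json file.
    """
    repo_id = HUB_REPOS[entity_class]
    local_dir = Path(models_dir) / repo_id.split("/")[-1]
    if (local_dir / "config.json").is_file():
        return str(local_dir)
    return repo_id


def load_ner_pipeline(entity_class, models_dir="models"):
    """Load a token-classification pipeline from disk or from the Hub."""
    # Imported lazily so the path logic can be used without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

    source = resolve_model_source(entity_class, models_dir)
    tokenizer = AutoTokenizer.from_pretrained(source)
    model = AutoModelForTokenClassification.from_pretrained(source)
    return pipeline("ner", model=model, tokenizer=tokenizer)


if __name__ == "__main__":
    # Downloads the model on first use if it is not stored locally.
    ner = load_ner_pipeline("diseases")
    print(ner("Patients with pneumonia were treated with remdesivir."))
```

When a repository id is passed to `from_pretrained`, transformers downloads the files and keeps them in its own cache, so later runs do not download them again.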
Further details are described in bio-nlp/models. The models can also be reused in other systems if desired.
Fine-tuning was done in Google Colab using a TPU. For that purpose, the Fine_tuning.ipynb Jupyter notebook is provided. It uses the scripts found in bio-nlp/fine-tuning, which have been partially adapted from those originally proposed in the BioBERT repository to allow TPU execution and the use of a newer version of huggingface-transformers.
Details about visualization can be found in bio-nlp/Embeddings, along with an example.