Use BERT to solve a multi-label text classification problem.
The model is a BERT encoder followed by a linear classifier, using the BP-MLL function as the loss function.
## Set up the environment

```bash
# This script will automatically download all the packages and models needed.
python set_env.py
```
## Usage

Once you have downloaded the pretrained BERT model, simply run:

```bash
make
tail -f logger.log   # follow the log file
make evaluate
cd result            # the HTML-formatted results are put in result/
make predict         # just for testing
```
To use it from a Python script, use the code below:

```python
>>> from .predict import Prediction
>>> text = "Some text here"
>>> Prediction.predict(text)
{'Treatment': Label('has_label'=True, 'prob'=0.98), ...}
>>> texts = ["many", "texts", "here"]
>>> Prediction.predict(texts)
[{'Treatment': Label('has_label'=True, 'prob'=0.98), ...}, ...]
```
```bash
python insert_to_db.py   # insert the results into the database
make clean
```
## Configuration

`IO`: save and load the trained model; used by the `load` function in `model.py`. A save/load sketch follows the list.

- `model_dir`: points to the directory where the trained model (not the pretrained model) is saved.
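For reference, saving to and reloading from `model_dir` with PyTorch's `state_dict` API might look like the sketch below; the filename `model.pt` is hypothetical, and the actual `load` function in `model.py` may store checkpoints differently:

```python
import os
import torch

# Hypothetical checkpoint path inside model_dir; the real filename may differ.
checkpoint = os.path.join(model_dir, "model.pt")

torch.save(model.state_dict(), checkpoint)                          # save
model.load_state_dict(torch.load(checkpoint, map_location="cpu"))   # reload
```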
Hyperparameters that control the whole training process; used in `train.py`.

- `batch_size`: the number of samples fed into the model at a time (should not be too large, or the model will need too much memory).
- `lr`: the learning rate (5e-5, 3e-5, 1e-5, or similar).
- `epoch`: the number of iterations over the whole training set (depends on the size of the training set).
- `accumulation_step`: gradient accumulation is used to approximate the behavior of a larger batch size; the effective batch size is approximately `batch_size * accumulation_step`. See the sketch after this list.
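For reference, gradient accumulation in a standard PyTorch training loop looks roughly like the sketch below; `model`, `loader`, `optimizer`, and `criterion` are placeholders, and the actual loop in `train.py` may differ:

```python
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = criterion(model(inputs), labels)
    # Scale the loss so the summed gradients match one larger batch.
    (loss / accumulation_step).backward()
    if (step + 1) % accumulation_step == 0:
        optimizer.step()        # one update per accumulation_step mini-batches
        optimizer.zero_grad()   # effective batch ~ batch_size * accumulation_step
```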
Parameters for the loss function (here the BP-MLL loss function); used to initialize the loss in `bp_mll.py`. A sketch of the base loss form follows the list.

- `bias`: the weights of the positive and negative labels (refer to this paper).
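For intuition, the base form of the BP-MLL loss (a pairwise exponential ranking loss from Zhang and Zhou, 2006) can be sketched as below. Exactly how `bias` reweights the positive and negative terms is implementation-specific and not shown here, so the real `bp_mll.py` may differ:

```python
import torch

def bp_mll_loss(outputs, targets):
    # outputs: (batch, label_num) scores; targets: (batch, label_num) in {0, 1}.
    y = targets.float()
    y_bar = 1.0 - y
    # diff[b, p, q] = score of label p minus score of label q.
    diff = outputs.unsqueeze(2) - outputs.unsqueeze(1)
    # Only count pairs where p is a positive label and q a negative one.
    pair_mask = y.unsqueeze(2) * y_bar.unsqueeze(1)
    pair_loss = (torch.exp(-diff) * pair_mask).sum(dim=(1, 2))
    # Normalize by |Y| * |Y-complement| per sample, avoiding division by zero.
    norm = (y.sum(1) * y_bar.sum(1)).clamp(min=1.0)
    return (pair_loss / norm).mean()
```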
Parameters used to initialize the neural network; used in `model.py`. A sketch of the resulting architecture follows the list.

- `pretrained_model`: points to the directory containing the pretrained BERT model (here we use SciBERT by Allen AI).
- `hidden_size`: the hidden size of the pretrained model (768 here).
- `dropout_prob`: the probability of the dropout layer dropping an element of the input tensor.
- `label_num`: the total number of labels.
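A minimal sketch of the architecture these parameters configure (a BERT encoder, dropout, and a linear classifier); the class name is hypothetical and the actual code in `model.py` may differ:

```python
import torch.nn as nn
from transformers import AutoModel

class BertMultiLabelClassifier(nn.Module):
    # Hypothetical class mirroring the description above.
    def __init__(self, pretrained_model, hidden_size=768,
                 dropout_prob=0.1, label_num=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(pretrained_model)  # e.g. SciBERT
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, label_num)

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(self.dropout(pooled))
```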
Parameters used in `dataset.py` for loading the dataset from file. See the loading sketch after the list.

- `tokenizer_path`: for the pretrained model, keep it the same as `IO.model_dir`.
- `dataset_path`: points to the `json` file where the annotated data is stored.
- `text_key`: the key of the text for each entry.
- `label_key`: the key of the labels for each entry.
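For illustration, assuming `text_key = "text"` and `label_key = "labels"` (the actual keys depend on your annotations), the JSON file and its loading might look like:

```python
import json

# Hypothetical entry layout in the annotated JSON file:
# [{"text": "Some annotated sentence ...", "labels": ["Treatment", ...]}, ...]
with open(dataset_path) as f:
    data = json.load(f)

texts = [entry[text_key] for entry in data]
labels = [entry[label_key] for entry in data]
```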
Parameters used for evaluation and prediction.

- `position_threshold`: the minimum output value for an output to be considered positive (`predicted_label = output_probability > position_threshold ? 1 : 0`). See the snippet below.
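In PyTorch terms, that thresholding amounts to something like the following, with `probs` standing in for the model's output probability tensor:

```python
# 1 where the probability exceeds the threshold, 0 otherwise.
predicted_labels = (probs > position_threshold).int()
```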
## Notes

- The model can only be reloaded on a single-GPU server.
- The optimizer state cannot be loaded, so training cannot be resumed from a checkpoint.
- Training on multiple GPUs gives better performance.