```
.
├───bert_types_models # folder containing the code used for BERT-type fine-tuning
│ │ bert_multilingual.ipynb # [DON'T USE IT FOR TRAINING] Notebook used to iterate on and create the model training script <NotUpdated>
│ │ requirements.txt # requirements file to install dependencies
│ │ train_bert.py # Main script used to train BERT models (bert-multilingual/camembert/flaubert)
│ │ utils.py # Utility script containing all other classes and functions used to build and train the model
│ │
│ │
│ ├───models_trained # directory containing all trained models
│ │ ├───bertm_5epochs_dropout
│ │ ├───bertm_5epochs_dropout_concat
│ │ ├───bertm_5epochs_dropout_freezing_concat
│ │ ├───bertm_5epochs_nodropout
│ │ └───camembert_5epochs_nodropout
│
├── data # data folder, also the output folder for preprocessed data
├───inference
│ │ docker-compose.yml # Docker compose to build containers : service+app
│ │
│ ├───app # Streamlit application for running inference
│ │ app.py
│ │ Dockerfile
│ │ requirements.txt
│ │
│ └───service # FastAPI service for doing inference.
│ │ Dockerfile
│ │ main.py
│ │ requirements.txt
│ │ utils.py
│ │
│ └───models # Directory containing the fine-tuned BERT-type model used for inference
│
├───preprocessing # Folder containing code used for preprocessing
│ preprocess.py # script to run preprocessing
│ preprocess_urls.ipynb # notebook used to iterate on the preprocessing
│
│
├───statistical_embedding_models # folder containing the code used for training statistical models
│ tf_idf.ipynb # Notebook used to do statistical modeling: TF-IDF with different models
│
├── LICENSE
└── README.md
```
- Install requirements:
  ```bash
  cd preprocessing
  pip install -r requirements.txt
  ```
- Preprocess data (a hedged tokenization sketch follows below):
  ```bash
  python preprocess.py
  ```
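Purely as an illustration of what this step works with, here is a minimal sketch of URL tokenization; the actual logic lives in preprocess.py / preprocess_urls.ipynb, and the function name and regex below are assumptions, not the project's code.

```python
# Hypothetical illustration of URL tokenization; the real logic is in preprocess.py.
import re

def tokenize_url(url):
    """Split a URL into lowercase word tokens (illustrative assumption)."""
    # Split on anything that is not a letter or digit (/, ., -, _, ?, =, ...)
    tokens = re.split(r"[^a-zA-Z0-9]+", url.lower())
    # Drop empty strings and purely numeric fragments
    return [t for t in tokens if t and not t.isdigit()]

print(tokenize_url("https://www.example.com/fr/sport/football-results-2021"))
# -> ['https', 'www', 'example', 'com', 'fr', 'sport', 'football', 'results']
```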
- Install requirements:
  ```bash
  cd bert_types_models
  pip install -r requirements.txt
  ```
- Training: make sure to preprocess the data first. To change the model, change the PRE_TRAINED_MODEL_NAME variable inside the script; models already used: ['bert-base-multilingual-uncased', 'camembert-base', 'flaubert/flaubert_base_cased']. The script exports the trained model to the models_trained/ folder together with the MultiLabelBinarizer object, which is needed for production and for scoring during training (see the sketch below).
  ```bash
  python train_bert.py
  ```
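A hedged sketch of the flow described above, i.e. how the PRE_TRAINED_MODEL_NAME variable and the export of the model plus the MultiLabelBinarizer could look; the real implementation is in train_bert.py and utils.py, and the output directory and file names here are hypothetical.

```python
# Hedged sketch of the training-script flow (the real code is in train_bert.py / utils.py).
import os
import pickle
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.preprocessing import MultiLabelBinarizer

# Change this variable to switch the backbone, as described above.
PRE_TRAINED_MODEL_NAME = "bert-base-multilingual-uncased"  # or "camembert-base", "flaubert/flaubert_base_cased"

tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
backbone = AutoModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

# Multi-label targets are binarized once and the binarizer is kept for inference.
mlb = MultiLabelBinarizer()
# y = mlb.fit_transform(list_of_label_lists)  # labels come from the preprocessed data

# ... training loop omitted ...

# Export: trained weights + the MultiLabelBinarizer, so the inference service can decode predictions.
OUT_DIR = "models_trained/bertm_5epochs_dropout"  # hypothetical output name
os.makedirs(OUT_DIR, exist_ok=True)
torch.save(backbone.state_dict(), f"{OUT_DIR}/model.bin")
with open(f"{OUT_DIR}/mlb.pkl", "wb") as f:
    pickle.dump(mlb, f)
```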
- Models results:
Models | Accuracy | Hamming loss | AUC | F1 score macro | F1 score micro | F1 score weighted |
---|---|---|---|---|---|---|
bertm_5epochs_dropout | 0.073055 | 0.00949833 | 0.611556 | 0.244681 | 0.51026 | 0.406705 |
bertm_5epochs_dropout_concat | 0.187065 | 0.00833813 | 0.741633 | 0.528397 | 0.638434 | 0.608079 |
bertm_5epochs_dropout_freezing_concat | 0.0251423 | 0.0113899 | 0.537301 | 0.093941 | 0.268745 | 0.197125 |
bertm_5epochs_nodropout | 0.0770082 | 0.00946042 | 0.61398 | 0.25053 | 0.515028 | 0.411921 |
camembert_5epochs_nodropout | 0.0477546 | 0.0102866 | 0.566295 | 0.134595 | 0.438133 | 0.308513 |
flaubert_5epochs_nodropout | 0.000158128 | 0.012761 | 0.501113 | 0.00339176 | 0.0116508 | 0.00889494 |
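For reference, a minimal sketch of how these multi-label metrics can be computed with scikit-learn, assuming binarized labels and sigmoid scores; the project's own scoring code lives in the training code, so treat this only as an illustration.

```python
# Hedged sketch of the multi-label metrics reported above, using scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, roc_auc_score, f1_score

# Toy data: 4 samples, 3 labels. y_score holds sigmoid probabilities, y_true the binarized labels.
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.7], [0.6, 0.1, 0.4], [0.2, 0.3, 0.9]])
y_pred = (y_score >= 0.5).astype(int)  # threshold the probabilities

print("Accuracy (exact match):", accuracy_score(y_true, y_pred))
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("AUC (macro):", roc_auc_score(y_true, y_score, average="macro"))
for avg in ("macro", "micro", "weighted"):
    print(f"F1 {avg}:", f1_score(y_true, y_pred, average=avg))
```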
- The training procedure is explained in the notebook (statistical_embedding_models/tf_idf.ipynb); a minimal pipeline sketch is shown after the results table below.
- Models results:
Models | Accuracy | Hamming loss | AUC | F1 score macro | F1 score micro | F1 score weighted |
---|---|---|---|---|---|---|
LR | 0.121936 | 0.00891236 | 0.653643 | 0.396298 | 0.538858 | 0.49323 |
LSVC | 0.171484 | 0.00831222 | 0.70922 | 0.503603 | 0.607955 | 0.577387 |
XGB | 0.151981 | 0.00873008 | 0.702646 | 0.488 | 0.584884 | 0.555727 |
MLP | 0.101116 | 0.00913117 | 0.635244 | 0.300006 | 0.543177 | 0.457907 |
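A minimal sketch of the TF-IDF approach, using a one-vs-rest LinearSVC as in the LSVC row above; the exact pipeline, preprocessing, and hyperparameters are in tf_idf.ipynb, and max_features below is an assumption.

```python
# Hedged sketch of the TF-IDF + LinearSVC baseline (the reference implementation is in tf_idf.ipynb).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # Each URL is treated as a bag of tokens; no context is needed.
    ("tfidf", TfidfVectorizer(lowercase=True, max_features=50_000)),  # max_features is an assumption
    ("clf", OneVsRestClassifier(LinearSVC())),                        # one binary SVM per label
])

# X_train: list of preprocessed URL strings, y_train: binarized label matrix (MultiLabelBinarizer output)
# pipeline.fit(X_train, y_train)
# y_pred = pipeline.predict(X_test)
```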
- Run:
  ```bash
  cd inference
  docker-compose up --build
  ```
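The Streamlit app is the intended interface once the containers are up; as an illustration only, here is a hedged example of calling the FastAPI service directly. The port, endpoint path, and payload shape are assumptions, so check inference/service/main.py and docker-compose.yml for the real ones.

```python
# Hypothetical direct call to the FastAPI inference service; endpoint and payload are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/predict",  # port/path assumed, see docker-compose.yml and main.py
    json={"url": "https://www.example.com/fr/sport/football-results"},
)
resp.raise_for_status()
print(resp.json())  # expected: predicted labels for the URL
```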
- Note that you may want to use my trained model: get it from here and put both files in inference/service/models/.
- BERT-type models don't perform well here; the main reason is that URLs carry little contextual meaning. The statistical approach (TF-IDF) works better because it doesn't depend on context: it simply converts each word (token) into a statistical embedding.
- Use neural networks with GloVe/FastText embeddings, or train proper embeddings.
- Handle class imbalance by using a weighted loss with BCEWithLogitsLoss() (see the sketch at the end of this section).
- Use data augmentation for minority labels.
- Convert Model to ONNX format to optimize inference latency
- The task is similar to the one in this paper by Microsoft, but that model is not yet open-sourced; I would recommend just using a scraper and then BERT-type models.
- To go deeper into production and deployment with AWS, check my article here (https://towardsdatascience.com/deploy-fastai-transformers-based-nlp-models-using-amazon-sagemaker-and-creating-api-using-aws-7ea39bbcc021).
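As mentioned in the improvements list above, here is a minimal sketch of a weighted BCEWithLogitsLoss for handling label imbalance; the weighting scheme is one common choice (negative/positive ratio per label), not something already in the training code.

```python
# Hedged sketch: weighted BCEWithLogitsLoss to counter label imbalance (not in the current training code).
import torch
import torch.nn as nn

# y_train: (n_samples, n_labels) binarized label matrix from the MultiLabelBinarizer
y_train = torch.tensor([[1, 0, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]], dtype=torch.float)

# One common weighting: negatives/positives per label, so rare labels get a larger weight.
pos_counts = y_train.sum(dim=0).clamp(min=1)
neg_counts = y_train.shape[0] - y_train.sum(dim=0)
pos_weight = neg_counts / pos_counts

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(4, 3)          # raw model outputs (no sigmoid), shape (batch, n_labels)
loss = criterion(logits, y_train)   # weighted multi-label loss
print(loss.item())
```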