Temporomandibular Joint Osteoarthritis analysis tools
Contributors : Celia Le, Tengfei Li
Scripts for TMJOAI project
Python 3.7.9 with the libraries: numpy (v1.19.5), pandas (v1.2.0), scikit-learn (v0.24.0), colour, seaborn, matplotlib, statsmodels, xgboost (v1.3.1), lightgbm (v3.1.1)
TMJOAI is a tool that predicts a patient's health status with respect to Temporomandibular Joint Osteoarthritis (TMJ OA).
It relies on decision-tree-based machine learning algorithms (XGBoost and LightGBM).
It has two features:
- Prediction: if the input file does not contain the health status, the tool predicts it; the output is a csv file containing the predicted health status.
- Training: if the input file contains the health status, the tool adds the data to the dataset and trains the machine learning models; the output contains the evaluation metrics of the models and several statistical plots (circplot, manhattanplot, boxplot, ROC curve).
inputfile: csv file containing patient's data to predict health status -> Prediction | csv file containing data to add to the dataset -> Training
File containing training dataset: Data.csv
bash src/main_TMJOAI.sh -i inputfile
The tool to use (prediction or training) is determined by whether the inputfile contains the health status of the patient.
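The dispatch rule above can be illustrated with a minimal sketch. It assumes the health status is stored in a dedicated column; the column name `Diagnosis` is hypothetical, since the actual header used in Data.csv is not given here, and the real check in main_TMJOAI.sh may differ.

```python
import csv
import io

# Hypothetical column name: the real header used in Data.csv is not stated here.
LABEL_COLUMN = "Diagnosis"


def choose_mode(csv_text, label_column=LABEL_COLUMN):
    """Return "training" if the CSV header contains the label column, else "prediction"."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file has no header, hence no label
    return "training" if label_column in header else "prediction"


labelled = "PatientID,Age,Diagnosis\n1,34,0\n"
unlabelled = "PatientID,Age\n2,51\n"
print(choose_mode(labelled))    # training
print(choose_mode(unlabelled))  # prediction
```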
python3 src/main_prediction.py file/to/predict -o output/file -f models/folder
input: csv file not containing the health status of a patient
output: csv file containing the prediction (healthy or diseased)
The prediction is based on the average prediction of the trained models (50 XGBoost models and 50 LightGBM models).
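As a rough illustration of this averaging step, the sketch below combines per-model outputs into one label. It assumes each model returns a probability of disease and that 0.5 is the decision threshold; neither detail is stated here, and `ensemble_predict` is a hypothetical helper, not a function from the scripts.

```python
from statistics import mean


def ensemble_predict(probabilities, threshold=0.5):
    """Average per-model probabilities and map the result to a label.

    The 0.5 threshold is an assumption made for this sketch.
    """
    avg = mean(probabilities)
    label = "diseased" if avg >= threshold else "healthy"
    return avg, label


# Made-up outputs standing in for the XGBoost and LightGBM models.
xgb_probs = [0.75, 0.5, 1.0]
lgbm_probs = [0.5, 0.75, 1.0]

avg, label = ensemble_predict(xgb_probs + lgbm_probs)
print(avg, label)  # 0.75 diseased
```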
usage: main_prediction.py [-h] [--folder FOLDER] [--output OUTPUT] input
positional arguments:
input input csv file with data to predict
optional arguments:
-h, --help show this help message and exit
--folder FOLDER, -f FOLDER folder containing the models
--output OUTPUT, -o OUTPUT output file
bash src/main_training.sh -i file/to/add/to/dataset -d file/containing/dataset -o output/file
Input: csv file containing data to add to the training dataset
Output: trained models, evaluation metrics, statistical plots
What it does:
- Add data to Data.csv (containing the full dataset)
- Preprocess the data: Interaction file, AUC file
- Create statistical plots: circplot, manhattanplot (with and without interaction terms)
- Train the machine learning models: 5 models have been tested (XGBoost, LightGBM, RandomForest, RidgeRegression, LogisticRegression); 2 models (XGBoost and LightGBM) are used to make the final prediction
- Calculate evaluation metrics: metrics of the different models trained, average of these metrics and metrics of the final model
- Create plots based on the trained models: ROC, Boxplot_contribution, Boxplot_values
The models are trained using 10 repetitions of 5-fold cross-validation by default.
For each repetition, the random seed used to split the folds takes a different value between seed1 and seed_end.
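The splitting scheme can be sketched in plain Python as follows. This is only an illustration: it assumes seed_end is inclusive and uses a simple shuffled, non-stratified split, whereas the actual scripts may rely on scikit-learn splitters. `repeated_kfold` is a hypothetical helper.

```python
import random


def repeated_kfold(n_samples, n_folds=5, seed1=0, seed_end=9):
    """Yield (seed, fold, train_idx, test_idx) for every seed in [seed1, seed_end].

    Assumption: seed_end is inclusive, so the defaults give 10 repetitions.
    """
    for seed in range(seed1, seed_end + 1):
        indices = list(range(n_samples))
        random.Random(seed).shuffle(indices)  # a different shuffle per seed
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield seed, k, sorted(train_idx), sorted(test_idx)


splits = list(repeated_kfold(n_samples=20))
print(len(splits))  # 50 train/test splits: 10 seeds x 5 folds
```

With the defaults, one model per split gives 50 models per algorithm, which matches the 50 XGBoost and 50 LightGBM models mentioned for prediction.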
Program to train the OA prediction tool
Syntax: main_training.sh [--OPTIONS]
options:
-i|--inputfile Name of the file containing the values to add to the training dataset.
-d|--datafile Name of the file containing all the training data.
--interaction_file Name of the file containing the interaction features calculated from the training data.
-a|--auc Name of the file containing the AUC value of each interaction feature.
-o|--output_folder Name of the output folder to save the outputs.
-s|--src_folder Name of the source folder containing the python scripts.
-m|--model_folder Name of the source folder to save the trained models.
--seed1 First random seed to split the folds for the cross validation.
--seed_end Last random seed to split the folds for the cross validation.
--nbr_folds Number of folds for the cross validation.
-h|--help Print this Help.
You can get the tmjoai docker image by running the following command line:
docker pull dcbia/oai:latest
Training:
To run the training inside the docker container, run the following command line:
docker run --rm -v */my/input/file*:/app/$(basename */my/input/file*) -v */my/dataset/file*:/app/$(basename */my/dataset/file*) -v */my/output/folder*:/app/out -v */my/models/folder*:/app/models dcbia/oai:latest bash src/main_training.sh -i /app/$(basename */my/input/file*) -d /app/$(basename */my/dataset/file*) -o /app/out -m /app/models
Prediction:
To run the prediction inside the docker container, run the following command line:
docker run --rm -v */my/input/file*:/app/$(basename */my/input/file*) -v */my/output/folder*:/app/out dcbia/oai:latest python3 /app/OAI/python/src/main_prediction.py /app/$(basename */my/input/file*) -o /app/out/prediction.csv --folder /app/OAI/python/Models_AF
python3 src/Step0_AddTrainingData.py
Checks whether the data in the inputfile is already in the file containing the dataset; if not, appends the data to the end of that file.
usage: Step0_AddTrainingData.py [-h] [--file FILE] input
positional arguments:
input input csv data file to add to the training dataset
optional arguments:
-h, --help show this help message and exit
--file FILE, -f FILE csv file containing all the training data
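The check-then-append behavior of this step might look like the sketch below. It works on CSV text rather than files to stay self-contained, and it assumes that "already in the dataset" means an identical full row; the real script may compare patient identifiers instead. `add_new_rows` is a hypothetical helper.

```python
import csv
import io


def add_new_rows(dataset_text, input_text):
    """Append rows of input_text that are not yet in dataset_text.

    Assumption: two rows are duplicates when all their fields match.
    Returns the updated CSV text and the number of rows added.
    """
    dataset_rows = list(csv.reader(io.StringIO(dataset_text)))
    input_rows = list(csv.reader(io.StringIO(input_text)))
    header, existing = dataset_rows[0], dataset_rows[1:]
    if input_rows[0] != header:
        raise ValueError("input header does not match dataset header")

    seen = {tuple(row) for row in existing}
    added = 0
    for row in input_rows[1:]:
        if tuple(row) not in seen:
            existing.append(row)  # new data goes at the end of the dataset
            seen.add(tuple(row))
            added += 1

    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + existing)
    return out.getvalue(), added


dataset = "id,x\n1,0.5\n2,0.7\n"
new_data = "id,x\n2,0.7\n3,0.9\n"
updated, added = add_new_rows(dataset, new_data)
print(added)  # 1
```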
python3 src/Step0_InterractionFile.py
Calculates the interaction terms between the features by multiplying each pair of features together.
usage: Step0_InterractionFile.py [-h] [--output OUTPUT] input
positional arguments:
input input csv file
optional arguments:
-h, --help show this help message and exit
--output OUTPUT, -o OUTPUT output file
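A minimal sketch of pairwise interaction terms is shown below. The feature names and the `a*b` naming of the new columns are made up for illustration, and whether the real script also includes squared terms is not stated here.

```python
from itertools import combinations


def add_interactions(row):
    """Return a copy of row extended with the product of every pair of features.

    The "a*b" naming convention is an assumption made for this sketch.
    """
    out = dict(row)
    for a, b in combinations(row, 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out


patient = {"f1": 2.0, "f2": 3.0, "f3": 0.5}
with_interactions = add_interactions(patient)
print(with_interactions)
# {'f1': 2.0, 'f2': 3.0, 'f3': 0.5, 'f1*f2': 6.0, 'f1*f3': 1.0, 'f2*f3': 1.5}
```

With n original features this adds n(n-1)/2 columns, which is why later steps select features by AUC.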
python3 src/Step0_AUC.py
Calculates the AUC of each feature.
usage: Step0_AUC.py [-h] [--input INPUT] [--output OUTPUT]
[--first_seed FIRST_SEED] [--last_seed LAST_SEED]
[--folds FOLDS]
optional arguments:
-h, --help show this help message and exit
--input INPUT, -i INPUT input csv interaction file
--output OUTPUT, -o OUTPUT output filename
--first_seed FIRST_SEED number of the first seed
--last_seed LAST_SEED number of the last seed
--folds FOLDS number of the folds for cross-validation
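For readers unfamiliar with per-feature AUC, the sketch below scores a single feature by how well its raw values separate the two classes. It is a simplified illustration that ignores the seeds and folds accepted by the script, and `feature_auc` is a hypothetical helper.

```python
def feature_auc(values, labels):
    """AUC of one feature used directly as a score.

    Computed as the fraction of (diseased, healthy) pairs in which the
    diseased patient has the larger value, counting ties as one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
feature = [0.1, 0.4, 0.35, 0.8]
print(feature_auc(feature, labels))  # 0.75
```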
python3 src/STAT_circ.py
Draws a circular plot showing the AUC, p-values and q-values of the features.
usage: STAT_circ.py [-h] [--output OUTPUT] [--sort SORT]
[--original_features ORIGINAL_FEATURES]
[--min_auc MIN_AUC]
input
positional arguments:
input input csv file (original or interaction features)
optional arguments:
-h, --help show this help message and exit
--output OUTPUT, -o OUTPUT output filename
--sort SORT method for sorting values (AUC,pval,qval)
--original_features ORIGINAL_FEATURES number of original features without interactions
--min_auc MIN_AUC minimum AUC to select features
python3 src/STAT_manhattan.py
Draws a Manhattan plot of the AUC, p-values and q-values of the features.
usage: STAT_manhattan.py [-h] [--output OUTPUT]
[--original_features ORIGINAL_FEATURES]
input
positional arguments:
input input csv file (original or interaction features)
optional arguments:
-h, --help show this help message and exit
--output OUTPUT, -o OUTPUT output filename
--original_features ORIGINAL_FEATURES number of original features to remove from interactions
python3 src/Step1_RandomForest.py
python3 src/Step1_RidgeRegression.py
python3 src/Step1_LogisticRegression.py
python3 src/Step1_XGBoost.py
usage: Step1_XGBoost.py [-h] [--interactions INTERACTIONS] [--auc AUC]
[--output OUTPUT]
optional arguments:
-h, --help show this help message and exit
--interactions INTERACTIONS, -i INTERACTIONS input csv interaction file
--auc AUC input csv AUC file
--output OUTPUT, -o OUTPUT output folder
python3 src/Step1_LightGBM.py
usage: Step1_LightGBM.py [-h] [--interactions INTERACTIONS] [--auc AUC]
[--output OUTPUT]
optional arguments:
-h, --help show this help message and exit
--interactions INTERACTIONS, -i INTERACTIONS input csv interaction file
--auc AUC input csv AUC file
--output OUTPUT, -o OUTPUT output folder
python3 src/Step1_FinalModel.py
Averages the predictions made by the previously trained models.
The health status of each patient is predicted by averaging the predictions of only those models that did not use the patient for training (10 out of 100 models).
usage: Step1_FinalModel.py [-h] [--interactions INTERACTIONS] [--auc AUC]
[--output OUTPUT] [--folder FOLDER]
optional arguments:
-h, --help show this help message and exit
--interactions INTERACTIONS, -i INTERACTIONS input csv interaction file to test
--auc AUC input csv AUC file
--output OUTPUT, -o OUTPUT output folder
--folder FOLDER, -f FOLDER models folder
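The out-of-fold averaging described for this step can be sketched as follows. It assumes each trained model contributes predictions only for the patients held out of its training fold; the data layout and the helper name `out_of_fold_average` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def out_of_fold_average(held_out_predictions):
    """Average, per patient, the predictions of models that never trained on them.

    held_out_predictions: one dict per model, mapping each held-out
    patient id to that model's predicted probability.
    """
    collected = defaultdict(list)
    for predictions in held_out_predictions:
        for patient_id, probability in predictions.items():
            collected[patient_id].append(probability)
    return {pid: mean(probs) for pid, probs in collected.items()}


# Made-up example: four models, each evaluated on its own held-out patients.
runs = [
    {"p1": 0.25, "p2": 0.75},
    {"p3": 0.5},
    {"p1": 0.75, "p3": 1.0},
    {"p2": 0.25},
]
averaged = out_of_fold_average(runs)
print(averaged)  # {'p1': 0.5, 'p2': 0.5, 'p3': 0.75}
```

Restricting the average to held-out models keeps the final prediction for a training patient from being influenced by models that have already seen that patient's label.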
python3 src/Step2_FinalStat.py
Returns the evaluation metrics of the trained models, their average and the metrics of the final model.
usage: Step2_FinalStat.py [-h] [--output OUTPUT] [--folder FOLDER]
optional arguments:
-h, --help show this help message and exit
--output OUTPUT, -o OUTPUT output filename
--folder FOLDER folder to evaluate
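Which evaluation metrics the script reports is not listed here. As an assumption, the sketch below computes a few metrics that are common for binary classifiers from true and predicted labels (1 = diseased, 0 = healthy); `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["sensitivity"], metrics["specificity"])  # 0.75 0.75 0.75
```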
python3 src/Step2_ROC_Plot.py
Draws the ROC curve of the trained models and the top features.
usage: Step2_ROC_Plot.py [-h] [--input INPUT] [--output OUTPUT]
[--folder FOLDER]
optional arguments:
-h, --help show this help message and exit
--input INPUT, -i INPUT input interaction features csv file
--output OUTPUT, -o OUTPUT output filename
--folder FOLDER folder to evaluate
python3 src/Step2_Boxplot.py
Draws boxplots of the values and contributions of the top features.
usage: Step2_Boxplot.py [-h] [--input INPUT] [--output OUTPUT]
[--folder FOLDER]
optional arguments:
-h, --help show this help message and exit
--input INPUT, -i INPUT input interaction features csv file
--output OUTPUT, -o OUTPUT output folder
--folder FOLDER, -f FOLDER folder to evaluate