This project investigates the difference in expressivity between physico-chemical properties (i.e. Atchley factors) and language models for cancer classification using TCR CDR3 sequences. With the use of a language model, we obtained high AUCs in classifying whether a patient has cancer.
For more details regarding this research, please view my dissertation here.
Warning
This code has been tested on Windows 11 and Linux CentOS (UCL CS lab 105 computers). Although it should work on other operating systems, it is not guaranteed to work perfectly.
Important
We developed the code under Python 3.11, and the requirements.txt was generated in the same environment. Therefore, installing the requirements may fail on Python versions below 3.11.
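As a quick sanity check before installing the requirements, you can verify the interpreter version. This is a minimal sketch; the helper name below is ours, not part of the repository:

```python
import sys

def meets_min_version(version_info, minimum=(3, 11)):
    """Return True when the interpreter version satisfies the minimum."""
    return tuple(version_info[:2]) >= minimum

if __name__ == "__main__":
    if not meets_min_version(sys.version_info):
        print("Python 3.11 or newer is required to install requirements.txt")
```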
- Download this repository
- Create a Python virtual environment (venv) via
python3 -m venv $YOUR-VENV-NAME-HERE$
- Activate your virtual environment, then run
python -m pip install -r scripts/requirements.txt
if your computer runs Windows, or
python -m pip install -r scripts/requirements-linux.txt
if it runs Linux (e.g. Ubuntu).
- Install SCEPTR via the following command:
python -m pip install sceptr
Note
You should install your own version of PyTorch, matching your CUDA version, before installing the requirements.txt. You may find instructions for installing PyTorch here.
Note
SCEPTR has been published officially here.
Important
To download the data, you need access to the Chain Lab RDS and must be connected to UCL's network.
To pull the data from the Chain Lab RDS, you may run the following command. Please set rds_mountpoint in loaders/config.json to the RDS mountpoint on your computer; you should not amend the other configurations in the file.
python loaders/load_cdr.py -config_path loaders/config.json
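As a sketch, the only field you should edit in loaders/config.json is rds_mountpoint; the example path below is hypothetical, and any other fields in the file (omitted here) should stay at their defaults:

```json
{
  "rds_mountpoint": "/mnt/chain-lab-rds"
}
```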
To compress the data (i.e. removing all data other than V call, J call and CDR3 sequences), you may run
python utils/file-compressor.py
The fetched data will contain files from multiple sampling time points. To obtain the data for healthy patients and remove duplicate patient files, please run:
python utils/remove-control-dups.py
python rds_file_locations/remove-cancer-dups.py
The IDs for patients in the evaluation set are listed here, with these file names.
To download the two variants of TCR-BERT, you may run the following command:
python loaders/load_ptm.py -o model
Please refer to this link for installation instructions for SCEPTR.
There are 3 training scripts: trainer-sceptr.py trains a classifier that uses SCEPTR to encode TCRs, trainer-symbolic.py trains a classifier that takes in TCRs encoded by physico-chemical properties, and trainer-tcrbert.py trains a classifier that uses TCR-BERT to encode TCRs.
All 3 scripts require a configuration file, which can be generated by
python trainer.py --make --end
after replacing trainer.py
with the appropriate training script. To run training with the default settings, you can run the following command instead.
python trainer.py --make
All scripts generate a log file for their training process. You may change the log file's name with the following command:
python trainer.py --log-file custom-filename.log
To modify the training configurations, edit the config.json generated by the command above. The available configurations differ between the 3 training scripts. You may find the description of each field for each training script below:
- trainer-sceptr.py: Descriptions Here
- trainer-tcrbert.py: Descriptions Here
- trainer-symbolic.py: Descriptions Here
To specify which configuration file to run, you may use the following command:
python trainer.py -c custom-configs.json
Tip
When you run multiple training instances and would like to check the progress of each one, you can run the following command:
python for-ssh/checkdone.py
It also reports how long each training instance has been stale. We recommend checking on a training instance if it has been stale for over 2 hours.
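As a rough sketch of what such a staleness check involves (our own illustration, not the repository's code), the idea is to compare the log file's last-modified time against a threshold:

```python
import time

def stale_hours(last_modified_epoch, now_epoch=None):
    """Hours elapsed since the log file was last updated."""
    if now_epoch is None:
        now_epoch = time.time()
    return (now_epoch - last_modified_epoch) / 3600.0

def needs_attention(last_modified_epoch, now_epoch=None, threshold_hours=2.0):
    """Flag a training instance whose log has been stale past the threshold."""
    return stale_hours(last_modified_epoch, now_epoch) > threshold_hours
```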
We have placed the results of our training runs, along with their configs, in results.
To test a model's performance on the evaluation set, use the following command after amending the model's directory and the best-performing epoch.
python src/calculate_evals.py
Throughout training, checkpoints are saved alongside the current epoch's training statistics, such as the loss, accuracies and sufficient data to compute the AUC. This repository provides Jupyter notebooks to analyse the statistics of the whole training loop:
- training-stats-analysis.ipynb: Generates the loss, accuracy and AUC graphs for one training instance.
- training-stats-combined.ipynb: Generates the loss, accuracy and AUC graphs for a series of training instances.
- sceptr-similarity.ipynb: Computes the cosine similarity and Euclidean distance between the scoring layer's and classifying layer's weights after an l2-norm, for SCEPTR's downstream model.
- sceptr-interpretability.ipynb: Computes Appendix A.3 of the manuscript. It finds the occurrences of V/J calls to which the model assigned a non-zero weighting in the evaluation dataset, as well as CDR3 sequences that are shared between multiple patients and assigned a non-zero weighting.
- eval-stats-combined.ipynb: Generates the confusion matrices and AUC curves for the models trained under a 3-way split.
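Each checkpoint stores enough data to compute the AUC. As an illustration of what that computation involves (a hypothetical rank-based implementation, not the repository's code), the AUC equals the probability that a positive sample is scored above a negative one:

```python
def auc_from_scores(labels, scores):
    """Rank-based AUC: fraction of positive/negative pairs ranked correctly.

    labels: iterable of 0/1 ground-truth labels (1 = cancer).
    scores: iterable of model scores, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```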
We report known errors here. Please contact me here to report any problems.
- Path Length: If your path is too long on Windows, you are prone to the following error:
DLL load failed while importing $SOMETHING$: The filename or extension is too long.
A mitigation strategy is to use the global Python installation, or to place your files in a shorter directory.
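A quick way to check whether a path is near the classic Windows MAX_PATH limit of 260 characters (the helper below is our own illustration, not part of the repository):

```python
import os

MAX_PATH = 260  # classic Windows path-length limit

def path_too_long(path, limit=MAX_PATH):
    """Return True if the absolute form of `path` exceeds the limit."""
    return len(os.path.abspath(path)) > limit
```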