RcwYuen/TCR-Cancer-Prediction

TCR LLMs for Cancer Prediction

This project investigates the difference in expressivity between physico-chemical properties (i.e. Atchley factors) and language models for cancer classification using TCR CDR3 sequences. Using a language model, we obtained high AUCs when classifying whether a patient has cancer.

For more details regarding this research, please view my dissertation here.

Warning

This code has been tested on Windows 11 and Linux CentOS (UCL CS lab 105 computers). It should work on other operating systems, but this is not guaranteed.


Installation

Important

We developed the code under Python 3.11, and requirements.txt was generated in the same environment. Installing the requirements may therefore fail on Python versions below 3.11.

  1. Download this repository
  2. Create a Python virtual environment with python3 -m venv $YOUR-VENV-NAME-HERE$
  3. Activate your virtual environment, then install the requirements: run python -m pip install -r scripts/requirements.txt on Windows, or python -m pip install -r scripts/requirements-linux.txt on Linux Ubuntu.
  4. Install SCEPTR via the following command: python -m pip install sceptr
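
The choice in step 3 can be sketched as a small helper. This is only an illustration of the rule stated above, not a script shipped with this repository; the function names are made up for the example, and any system other than Windows is simply mapped to the Linux file, which is an assumption.

```python
import platform
import sys

# Python version the requirements files were generated under (from the note above).
MIN_PYTHON = (3, 11)


def requirements_file(system: str) -> str:
    """Return the requirements file this README names for the given OS."""
    if system == "Windows":
        return "scripts/requirements.txt"
    # Assumption: every non-Windows system uses the Linux file (untested elsewhere).
    return "scripts/requirements-linux.txt"


def python_is_supported(version) -> bool:
    """True if (major, minor) is at least the version the requirements target."""
    return tuple(version[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    if not python_is_supported(sys.version_info):
        print("Warning: Python is older than 3.11; installation may fail.")
    print("python -m pip install -r", requirements_file(platform.system()))
```

Running the snippet prints the install command matching the current machine, which you can then run inside the activated virtual environment.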

Note

Before installing requirements.txt, install the PyTorch build that matches your CUDA version. Instructions for installing PyTorch are available here.

Note

SCEPTR has been published officially here.


Downloading Required Files

Important

To download the data, you need access to the Chain Lab RDS and must be connected to UCL's network.

Data Fetching

To pull the data from the Chain Lab RDS, first set rds_mountpoint in loaders/config.json to the mountpoint on your computer, leaving the other settings in the file unchanged. Then run the following command.

python loaders/load_cdr.py -config_path loaders/config.json
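
If you prefer to set the mountpoint programmatically, a minimal sketch follows. It assumes rds_mountpoint is a top-level key of loaders/config.json (the README does not show the file's layout), and set_rds_mountpoint is a name made up for this example. All other keys are written back unchanged.

```python
import json
from pathlib import Path


def set_rds_mountpoint(config_path: str, mountpoint: str) -> dict:
    """Rewrite only the rds_mountpoint entry of a JSON config file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the key lives at the top level of the JSON object.
    if "rds_mountpoint" not in config:
        raise KeyError("rds_mountpoint not found at the top level of the config")
    config["rds_mountpoint"] = mountpoint
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Example call (the mountpoint shown is a placeholder, not a real path):
# set_rds_mountpoint("loaders/config.json", "/path/to/your/rds/mount")
```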

To compress the data (i.e. remove all fields other than the V call, J call and CDR3 sequence), run

python utils/file-compressor.py
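
To illustrate what this compression amounts to, here is a minimal sketch that keeps three columns of a tab-separated file and drops the rest. It is not the code of utils/file-compressor.py: the tab-separated format and the column names v_call, j_call and junction_aa are assumptions made for the example, so adjust them to the actual headers of your files.

```python
import csv

# Hypothetical column names; the real files may use different headers.
KEEP = ("v_call", "j_call", "junction_aa")


def compress_tsv(src: str, dst: str, keep=KEEP) -> int:
    """Copy src to dst keeping only the columns in `keep`; return the row count."""
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(
            fout,
            fieldnames=list(keep),
            delimiter="\t",
            extrasaction="ignore",  # silently drop every other column
            lineterminator="\n",
        )
        writer.writeheader()
        rows = 0
        for row in reader:
            writer.writerow(row)
            rows += 1
    return rows
```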

The fetched data contains files from multiple sampling timestamps. To obtain the data for healthy patients and remove duplicated patient files, run:

python utils/remove-control-dups.py
python rds_file_locations/remove-cancer-dups.py

Evaluation Set

The IDs of the patients in the evaluation set are here with these file names.

Downloading TCR-BERT

To download the two variants of TCR-BERT, you may run the following command:

python loaders/load_ptm.py -o model

Downloading SCEPTR

Please refer to this link for installation instructions for SCEPTR.


Training Classifiers

There are 3 training scripts:

  • trainer-sceptr.py: trains a classifier that uses SCEPTR to encode TCRs.
  • trainer-symbolic.py: trains a classifier that takes in TCRs encoded by physico-chemical properties.
  • trainer-tcrbert.py: trains a classifier that uses TCR-BERT to encode TCRs.

All 3 scripts require a configuration file, which can be generated by

python trainer.py --make --end

after replacing trainer.py with the appropriate training script. If you want to run training with the default settings, run the following command instead.

python trainer.py --make

Each script generates a log file for its training process. You may change the log file's name with the following command.

python trainer.py --log-file custom-filename.log

Training Configurations

To modify the training configurations, edit the config.json generated by the command above. The available configurations differ between the 3 training scripts. The description of each field for each training script is given below:

To specify which configuration file to run, you may use the following command:

python trainer.py -c custom-configs.json

Tip

When you run multiple training instances and would like to check the progress of each training instance, you can run the following command to check.

python for-ssh/checkdone.py

It also reports how long each training instance has been stale. We recommend checking any training instance that has been stale for over 2 hours.
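
As an illustration of the idea, the sketch below treats a training instance as stale when its log file has not been modified for longer than a threshold. This is an assumption about how staleness could be measured, not a description of what for-ssh/checkdone.py does; stale_logs and the *.log pattern are made up for the example.

```python
import time
from pathlib import Path

# Two-hour threshold, matching the recommendation above.
STALE_AFTER_SECONDS = 2 * 60 * 60


def stale_logs(log_dir: str, now=None, threshold: float = STALE_AFTER_SECONDS) -> dict:
    """Map each *.log file name to (seconds since last write, is_stale)."""
    now = time.time() if now is None else now
    report = {}
    for log in sorted(Path(log_dir).glob("*.log")):
        # Assumption: the last-modified time reflects the last training update.
        idle = now - log.stat().st_mtime
        report[log.name] = (idle, idle > threshold)
    return report


if __name__ == "__main__":
    for name, (idle, is_stale) in stale_logs(".").items():
        flag = "STALE" if is_stale else "ok"
        print(f"{name}: idle for {idle / 60:.0f} min [{flag}]")
```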


Analysing Training Instances

The results of our training runs, together with the configurations used, are in results.

Usage of the Evaluation Set

To test a model's performance on the evaluation set, set the model's directory and its best-performing epoch, then run the following command.

python src/calculate_evals.py
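
Since AUC is the headline metric of this project, the following sketch shows what that number measures: the probability that a randomly chosen cancer patient receives a higher score than a randomly chosen control, with ties counted as half. It is a plain pairwise implementation for illustration and is not necessarily how src/calculate_evals.py computes it; roc_auc and the label convention (1 = cancer, 0 = control) are assumptions of the example.

```python
def roc_auc(labels, scores) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly here.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```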

Jupyter Notebooks

Throughout training, checkpoints are saved alongside the current epoch's training statistics, such as loss, accuracy and sufficient data to compute the AUC. This repository provides Jupyter notebooks to analyse the statistics of the whole training loop. The notebooks are as follows:

  • training-stats-analysis.ipynb: Generates the loss, accuracy and AUC graphs for one training instance.
  • training-stats-combined.ipynb: Generates the loss, accuracy and AUC graphs for a series of training instances.
  • sceptr-similarity.ipynb: Computes the cosine similarity and Euclidean distance between the weights of the scoring layer and the classifying layer, after L2 normalisation, for SCEPTR's downstream model.
  • sceptr-interpretability.ipynb: Computes the results in Appendix A.3 of the manuscript. It finds the occurrence of V/J calls to which the model assigned a non-zero weighting in the evaluation dataset, as well as CDR3 sequences that are shared between multiple patients and are assigned a non-zero weighting.
  • eval-stats-combined.ipynb: Generates the confusion matrix and AUC curves for the models that are trained under a 3-way split.
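
The comparison performed in sceptr-similarity.ipynb can be sketched in plain Python as follows. This is an illustration of the two measures, not the notebook's code: the weight vectors are passed in as flat lists, and l2_normalise and compare_weights are names made up for the example.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def compare_weights(a, b):
    """Return (cosine similarity, Euclidean distance) of the L2-normalised vectors."""
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    ua, ub = l2_normalise(a), l2_normalise(b)
    cosine = sum(x * y for x, y in zip(ua, ub))
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(ua, ub)))
    return cosine, euclidean


cos, dist = compare_weights([3.0, 4.0], [4.0, 3.0])
print(round(cos, 4), round(dist, 4))
```

After normalisation the two measures carry the same information, since for unit vectors the squared distance equals 2 - 2 × cosine similarity; reporting both is a convenience rather than two independent checks.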

Known Errors

We report known errors here. Please contact me here to report any problems.

  • Path Length: If your path is too long on Windows, you may encounter the following error:

    DLL load failed while importing $SOMETHING$: The filename or extension is too long.
    

    To mitigate this, use the global Python installation, or move your files to a directory with a shorter path.

About

This is the repository for the code used in my dissertation. For tidied code, please view https://github.com/RcwYuen/TCR-Embeddings
