Drug sensitivity modeling

This repo contains scripts which I wrote to build predictive models on cancer cell line drug sensitivities. This was part of a homework I had. The tasks with the required data were provided a priori.

Aim

The tasks are described in the input/comp-bio-homework-description.pdf file.

Given the data perform a dimensionality reduction with a preferred method and viusalize it.
Using gene expression data identify biomarkers for Lapatinib sensitivity.
Build a predictive model for Lapatinib sensitivity and evaluate it's performance.

Note

The created predictive model is by no means exhaustive and was not optimized.

Summary

Task 1

I decided to use principal component analysis (PCA).
I retrieved only the first 10 principal components as from PC5 the explained variability does not change significantly (see: output/pca.png)
The samples are colored according to the cancer type, some degree of separation of the samples is noticeable on the PC1.

Task 2

As the gene expression provided was TPM transformed and only cancer cell line samples were provided, there was no possibility of differential gene expression analysis. Instead, I used a feature selection approach.
I used two regression approaches: random forest and lasso regression. The output of these approaches was overlapped and only commonly selected features were kept.
A total of 60 genes were selected as influential for Lapatinib sensitivity of cancer cells (see: output/selected_genes.tsv and output/selected_gene_ids.tsv).
A systematic approach was also performed to get an idea of what functions are affected by the drug treatment. The analysis resulted in only a single enriched function, i.e. regulation of blood pressure (see: output/singificant_terms.xlsx). This could make sense as the RTK is involved in metabolism, growth and differentiation in cancers and blood supply of tumors could be essential.

Task 3

To evaluate the model I used a simple train/test split approach. I retained 20% of the samples as testing data.
Firstly I used a random forest regression approach. I also implemented a short optimization step for the model fitting to try to improve the evaluation metrics. The model performance was lackluster with a R2 = 0.36 and MSE = 1.64 (see: output/rf_pred-vs-actual.png).
Secondly, I used the lasso regression in order to get better predictions. I used similar approaches as above, without the optimization step. The overall model predictions showed improvement (see: output/lasso_pred-vs-actual.png) with an R2 = 0.56 and a MSE = 1.13.
Lastly I tried to use a classifier. As a classifier I chose the random forest approach. Similar steps were used as above to optimize the model fit and evaluate its performance. The classified performed somewhat poorly (see: output/roc_curve.png) with a ROC AUC = 0.77. Accuracy of model is 0.68 and precision is 0.68 with a recall of 0.59.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
input		input
output		output
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Drug sensitivity modeling

Aim

Summary

About

Releases

Packages

Languages

fka21/drug_sensitivity_modeling

Folders and files

Latest commit

History

Repository files navigation

Drug sensitivity modeling

Aim

Summary

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages