A Novel Approach to Identifying and Predicting Cancer Vulnerabilities

Winter quarter 2020. Author: Andrew E. Davidson, [email protected]. Mentor: Alana Weinstein. Special thanks to Professor Stuart for his guidance and direction.

My final lab rotation was in the Stuart lab at the University of California, Santa Cruz.

To get a better understanding of this project, see presentations/lab3RotationLightningTalk.pptx and presentations/status-2020-03-13.pptx

This project is implemented as a collection of python3 packages and Jupyter notebooks. You can view the notebooks without starting your own Jupyter server by opening the notebook source files on GitHub.

Installation

set up conda env with required packages

$ conda create --name labRotation3TensorFlow --file requirements.txt

Starting notebooks

cd ~/workSpace/UCSC/labRotation3DEMETER2
conda activate labRotation3TensorFlow
export PYTHONPATH="${PYTHONPATH}:`pwd`/src"
jupyter notebook

Running Unit test

cd ~/workSpace/UCSC/labRotation3DEMETER2
conda activate labRotation3TensorFlow
export PYTHONPATH="${PYTHONPATH}:`pwd`/src"
cd src/test
python -m unittest discover .

Creating the prerequisite tidy data sets.

In ./data you will find the raw gene dependency data from Project Achilles and a corresponding RNA Seq data file. Many of the notebooks assume that you have created a clean tidy data set and trained a low rank matrix factorization model locally. In the future the raw data files should be removed and replaced with code that downloads the files as needed.

Start by uncompressing the zip files in data/
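
If you prefer to do this step from Python rather than by hand, a minimal sketch is shown below. It assumes the archives sit directly in data/ and should be extracted in place; the helper name unzip_all is hypothetical and is not part of this repository.

```python
import zipfile
from pathlib import Path


def unzip_all(data_dir="data"):
    """Extract every .zip archive found directly inside data_dir, in place.

    Hypothetical helper (not part of this repo). Returns the names of the
    archive members that were extracted.
    """
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(path=data_dir)
            extracted.extend(zf.namelist())
    return extracted
```

Calling unzip_all("data") from the repository root should leave the uncompressed files next to the archives.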

To create the expected low rank matrix factorization (LRMF) model, run the trainRandomHoldOut.ipynb Jupyter notebook. The notebook trains a model with 19 learned features, which was found to be the best model. The results are stored at data/n_19_geneFilterPercent_0.25_holdOutPercent_0.4 and include the trained model and a tidy version of the raw data.

To create the auto encoder data set, run the createAutoEncoderDataSet.ipynb Jupyter notebook. This notebook creates a clean tidy data set. The result will be found in data/autoEncoder.

Table of Contents: Notebook overviews

requirements.txt
- list of required python packages

Low Rank Matrix Factorization Model notebooks

explore.ipynb
- it turns out that 14% of the values in D2_Achilles_gene_dep_scores_5by5.tsv are missing
- we should remove DEMETER2.dataFactory._clean(filter=0.01) and impute any remaining missing values
createUnitTestData.ipynb
- used to create the 5x5 TSV file src/test/data/D2_Achilles_gene_dep_scores_5by5.tsv
trainRandomHoldOut.ipynb
- used to create trained data sets of various sizes
- has some wall clock stats for various hold out sizes and n learned features
- the results can be found in data/ and have names like holdOut_0.1_numFeature_100
  - the notebook saves data to data/; we have to manually move the data to a subdirectory to make sure we do not accidentally overwrite it.
  - example files:
    - Y, RTest, and RTrain have same shape
    - RTrain and RTest are knockout,filter,logical arrays
    - use RTest to select values in Y that are part of the hold out set
```
$ ls data/holdOut_0.1_numFeature_100/
D2_Achilles_gene_dep_scores_RTest_numFeatures_100.csv 
D2_Achilles_gene_dep_scores_Theta_numFeatures_100.csv
D2_Achilles_gene_dep_scores_RTrain_numFeatures_100.csv 
D2_Achilles_gene_dep_scores_X_numFeatures_100.csv
```
- 19 learned features gave the best performance.
evaluateRandomHoldOut.ipynb
- used for low rank matrix hyperparameter tuning
findSimilarGenes.ipynb
findSimilarCellLines.ipynb
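
The note above about using RTest to select the hold-out values in Y can be sketched as follows. This is an illustration, not the code from the notebooks: it assumes the reconstruction is X @ Theta.T (the orientation of the saved X and Theta files should be checked first), that RTest is a boolean or 0/1 mask with the same shape as Y, and that the held-out entries of Y are not missing. The function name hold_out_rmse is hypothetical.

```python
import numpy as np


def hold_out_rmse(Y, X, Theta, RTest):
    """RMSE of the low rank reconstruction, measured on hold-out cells only.

    Assumption: Y is approximated by X @ Theta.T. RTest marks the cells of Y
    that belong to the hold-out set.
    """
    mask = np.asarray(RTest, dtype=bool)
    prediction = X @ Theta.T
    errors = (prediction - Y)[mask]  # keep only the held-out cells
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy data: 4 rows, 3 columns, 2 learned features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
Theta = rng.normal(size=(3, 2))
Y = X @ Theta.T

# Hold out two cells and perturb one of them by 1.0.
RTest = np.zeros(Y.shape, dtype=bool)
RTest[0, 1] = True
RTest[2, 2] = True
Y_noisy = Y.copy()
Y_noisy[0, 1] += 1.0

rmse = hold_out_rmse(Y_noisy, X, Theta, RTest)
print(rmse)
```

The same function applied with RTrain instead of RTest gives the training error, which makes it easy to compare the two when tuning the number of learned features.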

RNA Seq to Gene Dep mapping model

createAutoEncoderDataSet.ipynb
- creates data set for rnaSeq2GeneDependency.ipynb
- data saved to data/autoencoder/
geneDependencyAutoEncoder.ipynb
- use this notebook to debug deep models
rnaSeq2GeneDependency.ipynb
- a deep model that maps RNA seq data to gene dependency data

TODO

see notes on improving accuracy on p. 82, 3/11, BME notebook #1
finish geneDependencyAutoEncoder.ipynb and rnaSeq2GeneDependency.ipynb
see presentations/status-*.pptx
summary
- which RNA seq data should we run
  - CCLEv1_hugo_log2tpm_58581genes_2019-04-15.tsv is probably better than RNAseq_lRPKM_data.csv
  - treehouseID_to_CCLEID.tsv is probably better than CCLEv1_hugo_log2tpm_58581genes_2019-04-15.tsv
    - uses different gene IDs; see sample_info.csv
need to add documentation to the data factory explaining that there is no need to shuffle Y: we are learning independent regression models for each row and column

low priority TODO

data factory _split() is slow but good enough; see the comments in the code for a faster implementation
- double check that there is no overlap between splits; add a test
  - train, hold = split
  - val, test = split(holdOut)
go through the code and remove the calls that set the random seed; let the user of the class set it, do not default
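
The overlap check and the seed handling described above could look roughly like the sketch below. It is not the data factory's _split() implementation: the function names are hypothetical, the split is done with boolean masks over the cells of Y, the second split is assumed to be 50/50, and the caller passes in the random generator so that no seed is set by default.

```python
import numpy as np


def split_mask(shape, hold_out_fraction, rng):
    """Return disjoint boolean masks (train, hold_out) covering every cell."""
    hold_out = rng.random(shape) < hold_out_fraction
    return ~hold_out, hold_out


def three_way_split(shape, hold_out_fraction, rng):
    """train, hold = split; val, test = split(hold), mirroring the TODO above."""
    train, hold = split_mask(shape, hold_out_fraction, rng)
    second = rng.random(shape) < 0.5  # assumed 50/50 split of the hold-out set
    val = hold & second
    test = hold & ~second
    return train, val, test


def check_no_overlap(train, val, test):
    """Raise AssertionError if any cell is in two splits or in none."""
    assert not (train & val).any()
    assert not (train & test).any()
    assert not (val & test).any()
    assert (train | val | test).all()


# The caller owns the seed; nothing inside the functions sets one.
rng = np.random.default_rng(42)
train, val, test = three_way_split((50, 40), 0.4, rng)
check_no_overlap(train, val, test)
```

Because the generator is an argument, a unit test can pass a seeded generator for reproducibility while production code can pass an unseeded one.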
