-
Notifications
You must be signed in to change notification settings - Fork 0
/
ISC_project.Rmd
54 lines (38 loc) · 4.7 KB
/
ISC_project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
---
title: "Jessica Fessler, ISC Project"
output: html_document
---
## ISC project:
### BACKGROUND
For my rotation this winter I applyed PrediXcan, a computational model that predicts tissue-specific gene expression using germline genetics, to data from The Cancer Genome Atlas (TCGA) to help understand the conservation or loss of genetically controlled gene expression in tumors.
The PrediXcan model was developed by University of Chicago faculty, Dr. Hae Kyung Im <https://github.com/hakyimlab/PrediXcan>. PrediXcan estimates genetically regulated gene expression using whole-genome tissue-dependent prediction models trained using a reference transcriptome data set. Developed using data from The Genotype-Tissue Expression (GTEx) Project, a database with over 20,000 tissue samples, PrediXcan uses germline genetics in the form of SNPs to guide prediction of gene expression in different tissues. My project has entailed applying this predictive model to the data in TCGA. The intent was to use the patient data and the predicted gene expression to evaluate how gene expression that is dictated by germline genetics is either conserved, or lost in the tumor. In order to address this question, we applied PrediXcan to imputed SNP data from over 6,000 TCGA participants representing 18 different cancer subtypes (SNP imputation data provided by Zhenyu Zhang in Dr. Grossman's lab). Analysis has thus far been conducted in breast cancer participants.
Initial analysis included comparing the predicted gene expression in the tumor and matched normal to the measured expression by RNAseq. Analysis methods for comparison include using Pearson correlation, Spearman correlation and linear modeling. Principal component analysis between TCGA RNA-Seq data for the tumor and normal tissue gene expression revealed large scale differences in expression. In order to account for this variability, as well as differences in samples size, we identified the number of prinicipal components (PCs) that provide the greatest number of statistically significant associations between measured and predicted gene expression. We then incorporated these principal components into linear models comparing predicted and actual gene expression. The purpose of this analysis is to remove large-scale tumor specific patterns of expression and thus make comparisons between tumor and normal more fair and meaningful. Now that we have identified the optimal cutoff, we can evaluate how estimated beta values (for association between predicted and measured expression) compare generally between tumor and matched normal samples.
### INSTRUCTIONS FOR TESTING FILES:
All files found in R_scripts
Working directory is "PrediXcan-TCGA_Results"
**File: pca_all_RNAseq.R**
To test: source file
- Generates plots of principal component analysis comparing tumor and normal RNAseq expression data
**File: plot_RNAseq_expression.R**
To test: source file
- Generates boxplot comparing all RNAseq expression levels between 4 participants
**File: gene_expression_analysis.R**
To test: source file, `plot_expression("normal", "POMZP3")`
- Generates dot plot with linear model comparing the predicted and observed gene expression for a given tissue type and gene
**File: lm_PC-analysis_bySteps.R**
To test: source file, `lm_analysis_byStep(sample_type = "normal", step_pc = 2, stop_pc = 10)`
(Uses plot_lm-PC-analysis_results.R)
- Fits linear model between predicted and observed gene expression for normal or tumor tissue samples. Includes PCs up to "stop_pc", and goes by "step_pc".
- In example, tests with 0 PC, 2 PCs, 4 PCs, 6 PCs, 8 PCs, and 10 PCs and then plots the number of significant associations (FDR < 0.05).
**File: lm_analysis_toFile.R**
To test: source file, `lm_analysis_toFile("normal", 0)`, save as "normalFile_0PC.txt"
`lm_analysis_toFile("tumor", 0)`, save as "tumorFile_0PC.txt"
- Saves a text file with the model parameters for linear model for each gene with number of defined PCs
**File: plot_onePC_result.R**
To test: `plot_lm_results("normalFile_0PC.txt", "tumorFile_0PC.txt", "Rsquared")` (Or whatever file name was used for lm_analysis_toFile)
- Creates a box plot comparing the linear model parameters for tumor and normal tissue
`plot_lm_results("normalFile_0PC.txt", "tumorFile_0PC.txt", "estimate")`
- Additional options for adjusting y-axes
**File: plot_correlation_results.R**
To test: `plot_correlations("correlation_data/BRCA_TCGA-matched-normal-RNAseq_PrediXcan-Breast-CORRECTED_correlations_20160226.txt", "correlation_data/BRCA_TCGA-tumor-RNAseq_PrediXcan-Breast-CORRECTED_correlations_20160226.txt", "Pearson")`
- Generates a boxplot comparing the correlation coefficient from predicted vs. observed gene expression in tumor and normal samples