\subsection{Generation of signatures}
%\subsection{Consensus signature corresponds more closely to perturbations than pathway methods}
%TODO: consensus not perfect, GEX sig confounded, will not be 100\% with simple model (only when overfitted)
While the previous section focussed on pathway activation in relative terms
(whether, given a pathway perturbation, the score assigned to any one pathway
is higher or lower compared to other pathways), I am now interested in how well
pathway methods are able to assign activity scores across different
experiments. As input data, I use the fold changes (difference in log space, or
negative difference for inhibitions) of the curated perturbation experiments,
and assess how well the pathway scores derived from those are ordered for each
pathway across experiments. This is meant as a comparison of how well the
pathway scores obtained by different methods correspond to perturbations, and
not as an assessment of the significance of the perturbation-response signature
model, for which I would need an independent set of experiments. Such a set is
not required here, as I assessed the significance of this model analytically,
with results shown in table tab:Significance-of-signature. The comparison here
serves to show that my model corresponds better to perturbations than the other
pathway methods.
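The signed fold change used as input can be illustrated with a minimal sketch. The function name \texttt{signed\_fold\_change} and its arguments are hypothetical and only illustrate the definition given above (difference in log space, negated for inhibitions); this is not the code used for the analysis.

```python
def signed_fold_change(log_perturbed, log_control, inhibition=False):
    """Per-gene fold change in log space, oriented towards pathway activation.

    log_perturbed, log_control: log-scale expression values, one per gene.
    inhibition: True if the experiment inhibits the pathway, in which case
    the difference is negated so that all experiments share one orientation.
    """
    diff = [p - c for p, c in zip(log_perturbed, log_control)]
    return [-d for d in diff] if inhibition else diff


# Toy values: two genes, the same measurements read once as an activation
# and once as an inhibition experiment.
activation = signed_fold_change([3.0, 1.0], [1.0, 2.0])
inhibition = signed_fold_change([3.0, 1.0], [1.0, 2.0], inhibition=True)
```

Negating inhibition experiments means that a gene induced by pathway activity has a positive value in both kinds of experiment.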
The standard non-parametric way (i.e., based only on order rather than value)
of quantifying this is the Receiver Operating Characteristic (ROC) curve, the
results of which are shown in figure *** (AUCs in table ***). For all pathways
except NFkB and JAK-STAT (where multiple methods are tied), the consensus gene
signature I derived is also better able to detect pathway activations across
experiments. VEGF is not well recovered by any method, potentially due to its
relatedness with EGFR/MAPK and the low number of experiments. Interestingly,
SPIA in 4 out of 11 cases systematically assigns pathway scores in the wrong
direction of the experimental perturbations (this may be due to increased
expression of negative feedback regulators that SPIA picks up due to the
underlying directed KEGG graph).
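Because the ROC analysis only depends on the ordering of scores, the area under the curve (AUC) can be computed directly from ranks. The following is a minimal sketch, assuming one score per experiment and a flag marking whether the pathway in question was perturbed in that experiment; the function name \texttt{roc\_auc} and the toy values are illustrative, not taken from the analysis.

```python
def roc_auc(scores, perturbed):
    """Rank-based AUC (equivalent to the Mann-Whitney U statistic).

    Returns the fraction of (perturbed, non-perturbed) experiment pairs in
    which the perturbed experiment has the higher score; ties count as 0.5.
    """
    pos = [s for s, p in zip(scores, perturbed) if p]
    neg = [s for s, p in zip(scores, perturbed) if not p]
    if not pos or not neg:
        raise ValueError("need at least one experiment of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: the three perturbed experiments all outrank the others.
scores = [2.1, 1.4, 0.3, -0.2, 0.9, -1.0]
perturbed = [True, True, False, False, True, False]
auc = roc_auc(scores, perturbed)                         # 1.0
auc_flipped = roc_auc([-s for s in scores], perturbed)   # 0.0
```

An AUC of 0.5 corresponds to random ordering. A value below 0.5, as in the flipped example, means that scores are systematically assigned in the wrong direction, which is the behaviour described for SPIA above.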
\subsection{Stability of Scores in Basal Expression}
What I have not shown yet is how stable the whole process is, i.e. how much the
selection of input experiments influences the scores in basal expression. In
order to investigate this, I bootstrapped the experiment selection 1,000 times
and built models as described in section ***sub:Building-the-Model, using the
GDSC basal expression and yielding a 3-dimensional scores matrix with a score
for each cell line, pathway, and bootstrap. I regressed out either the effect
of the individual cell lines or the effect of the bootstraps, and quantified
the relative remaining variance in the scores:
$$\mathrm{ratio}=\frac{\mathrm{var}(\mathrm{cell\,lines}\mid\mathrm{bootstraps})}{\mathrm{var}(\mathrm{bootstraps}\mid\mathrm{cell\,lines})}$$
The overview of those relative variances is shown in figure
fig:bootstraps-basal. Here, a value smaller than 1 indicates that the
experiment selection has a bigger effect than the cell line I apply the model
to (i.e., we cannot observe a biological effect because it is hidden by the
noise of input experiments), and a value bigger than 1 indicates how many times
more influential the biological context is for the score than the input
selection (i.e., the biological context outweighs the exact selection of
perturbation input). For most pathways (TNFa, NFkB, MAPK, JAK-STAT, Hypoxia,
and EGFR) the variance of the input experiments has only a negligible influence
(smaller than 10\%) on the final scores. For p53, TGFb, and PI3K, selection of
the input experiments still accounts for under 20\% of the observed variance of
basal pathway scores. For Trail and VEGF, the variance between cell lines is
only about twice as large as the variance of the input experiments. This can
either point to the pathway not being particularly well defined (if the
experiments are also not well recovered by the signature, like for VEGF in
figure fig:ROC), or to the overall number of input experiments being too small
to guarantee a stable signature (like for Trail, where the signature performs
well but the overall number of experiments is low, cf. figure fig:ROC).
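The variance ratio above can be sketched for a single pathway as follows. This is one possible reading under explicit assumptions: "regressing out" an effect is taken to mean subtracting the corresponding group mean (per bootstrap or per cell line), and the function name \texttt{variance\_ratio} as well as the toy scores are hypothetical.

```python
from statistics import mean, pvariance


def variance_ratio(scores):
    """Ratio var(cell lines | bootstraps) / var(bootstraps | cell lines).

    scores[i][j] is the score of cell line i in bootstrap j for one pathway.
    Assumption: an effect is regressed out by subtracting its group mean.
    """
    n_lines, n_boot = len(scores), len(scores[0])
    line_means = [mean(row) for row in scores]
    boot_means = [mean(scores[i][j] for i in range(n_lines))
                  for j in range(n_boot)]
    # Remove the bootstrap effect: the residual varies across cell lines.
    resid_lines = [scores[i][j] - boot_means[j]
                   for i in range(n_lines) for j in range(n_boot)]
    # Remove the cell line effect: the residual varies across bootstraps.
    resid_boot = [scores[i][j] - line_means[i]
                  for i in range(n_lines) for j in range(n_boot)]
    return pvariance(resid_lines) / pvariance(resid_boot)


# Toy scores: three cell lines that differ strongly, two bootstraps that
# differ only slightly, so the ratio is far above 1.
toy = [[1.0, 1.2], [3.0, 3.2], [5.0, 5.2]]
ratio = variance_ratio(toy)
```

Swapping the roles of cell lines and bootstraps (transposing the matrix) yields the reciprocal ratio, which matches the interpretation of values below and above 1 given in the text.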
\subsection{Correlation of Pathway Scores in Basal Expression}
As I am most interested in how basal signalling activity in cancer cell lines
and tumour samples correlates with outcomes such as drug response and survival,
one crucial requirement for pathway response signatures is that they also
reflect pathway activation in basal gene expression. At this point, I
know that my perturbation-response signature genes correspond to the pathway
activation between a basal and a perturbed condition, but not how well the
pattern of pathway activation from those perturbation experiments corresponds
to the footprint that a constitutively active signalling pathway leaves in
basal gene expression.
Providing definitive proof that the signatures obtained correspond to
constitutive activation in basal expression is only possible using a range of
functional readouts (e.g. mutations activating pathways, drug response on
oncogene addiction), so I will defer this to chapter 4, where I investigate the
functional relevance of perturbation-response signatures compared to other
pathway methods.
What I can show, however, is how well known pathway cross-talk and structure
(as known from the literature and as I have shown in figures
fig:Modelled-pathway-crosstalk and fig:sp2-exp-heatmap) translate to
pathway correlation in basal expression, and how well they agree between
tumours and cell lines. To this end, I computed pathway scores for all primary
tumours in the TCGA and cell lines in the GDSC (details in section
sub:Correlation-in-basal), and checked how well they correlate.
I find that not only are the correlations in basal gene expression (figure
fig:Correlation-speed) similar to the ones in perturbed experiments (figures
fig:sp2-exp-heatmap and fig:circle-perturb), but they also correlate very
well with known cross-talk and do not change more than could be biologically
expected between primary tumours and cell lines. I find a high correlation
between EGFR and MAPK and a lesser one with PI3K, as well as a high correlation
between TNFa and NFkB. Another high correlation is between NFkB and JAK-STAT,
which I also observed in perturbation experiments (figure fig:circle-perturb,
but below the cutoff of $10^{-5}$ of figure fig:sp2-exp-heatmap).
At this point, I have found strong indications that the gene expression changes
in perturbation experiments can be described across many conditions using a
linear consensus signature whose score reflects pathway activation, and that
this signature is translatable to basal pathway activity in non-perturbed gene
expression.
Using basal gene expression, we calculated the Pearson correlation of pathway
scores across all samples and tissues, and plotted the correlation matrix
between each pathway combination using the R package \emph{corrplot} for
GDSC and TCGA data separately.
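The pairwise correlation step can be sketched as follows, assuming that pathway scores are available as one list of per-sample values per pathway. The function names and the toy scores are hypothetical; the actual analysis was plotted in R, and this sketch only illustrates the computation of the matrix.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(scores):
    """scores maps a pathway name to its scores across the same samples."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names}
            for a in names}


# Toy scores for four samples (made-up values, not results).
toy_scores = {
    "pathway_A": [0.1, 0.5, 0.9, 1.3],
    "pathway_B": [0.2, 0.6, 1.0, 1.4],   # shifted copy of pathway_A
    "pathway_C": [1.0, 0.2, 0.7, 0.1],
}
corr = correlation_matrix(toy_scores)
```

Computing the matrix separately for each data set, as described above for GDSC and TCGA, keeps the two correlation structures directly comparable.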
Below are the correlations for TCGA primary tumour data (left) and GDSC cell
lines (right) for perturbation-response genes in basal expression. The sign is
indicated by the colour of the points (positive correlation blue, negative
correlation red), the strength by colour shade and size of the point.