\providecommand{\E}{\operatorname{E}}
\providecommand{\V}{\operatorname{Var}}
\providecommand{\Cov}{\operatorname{Cov}}
\providecommand{\se}{\operatorname{se}}
\providecommand{\logit}{\operatorname{logit}}
\providecommand{\iid}{\; \stackrel{\text{iid}}{\sim}\;}
\providecommand{\asim}{\; \stackrel{.}{\sim}\;}
\providecommand{\xs}{x_1, x_2, \ldots, x_n}
\providecommand{\Xs}{X_1, X_2, \ldots, X_n}
\providecommand{\bB}{\boldsymbol{B}}
\providecommand{\bb}{\boldsymbol{\beta}}
\providecommand{\bx}{\boldsymbol{x}}
\providecommand{\bX}{\boldsymbol{X}}
\providecommand{\by}{\boldsymbol{y}}
\providecommand{\bY}{\boldsymbol{Y}}
\providecommand{\bz}{\boldsymbol{z}}
\providecommand{\bZ}{\boldsymbol{Z}}
\providecommand{\be}{\boldsymbol{e}}
\providecommand{\bE}{\boldsymbol{E}}
\providecommand{\bs}{\boldsymbol{s}}
\providecommand{\bS}{\boldsymbol{S}}
\providecommand{\bP}{\boldsymbol{P}}
\providecommand{\bI}{\boldsymbol{I}}
\providecommand{\bD}{\boldsymbol{D}}
\providecommand{\bd}{\boldsymbol{d}}
\providecommand{\bW}{\boldsymbol{W}}
\providecommand{\bw}{\boldsymbol{w}}
\providecommand{\bM}{\boldsymbol{M}}
\providecommand{\bPhi}{\boldsymbol{\Phi}}
\providecommand{\bphi}{\boldsymbol{\phi}}
\providecommand{\bN}{\boldsymbol{N}}
\providecommand{\bR}{\boldsymbol{R}}
\providecommand{\bu}{\boldsymbol{u}}
\providecommand{\bU}{\boldsymbol{U}}
\providecommand{\bv}{\boldsymbol{v}}
\providecommand{\bV}{\boldsymbol{V}}
\providecommand{\bO}{\boldsymbol{0}}
\providecommand{\bOmega}{\boldsymbol{\Omega}}
\providecommand{\bLambda}{\boldsymbol{\Lambda}}
\providecommand{\bSig}{\boldsymbol{\Sigma}}
\providecommand{\bSigma}{\boldsymbol{\Sigma}}
\providecommand{\bt}{\boldsymbol{\theta}}
\providecommand{\bT}{\boldsymbol{\Theta}}
\providecommand{\bpi}{\boldsymbol{\pi}}
\providecommand{\argmax}{\text{argmax}}
\providecommand{\KL}{\text{KL}}
\providecommand{\fdr}{{\rm FDR}}
\providecommand{\pfdr}{{\rm pFDR}}
\providecommand{\mfdr}{{\rm mFDR}}
\providecommand{\bh}{\hat}
\providecommand{\dd}{\lambda}
\providecommand{\q}{\operatorname{q}}
```{r, message=FALSE, echo=FALSE, cache=FALSE}
source("./customization/knitr_options.R")
```
```{r, message=FALSE, echo=FALSE, cache=FALSE}
library(broom)
library(MASS)
library(qvalue)
library(jackstraw)
pca <- function(x, space=c("rows", "columns"),
                center=TRUE, scale=FALSE) {
  space <- match.arg(space)
  if(space=="columns") {x <- t(x)}
  x <- t(scale(t(x), center=center, scale=scale))
  x <- x/sqrt(nrow(x)-1)
  s <- svd(x)
  loading <- s$u
  colnames(loading) <- paste0("Loading", 1:ncol(loading))
  rownames(loading) <- rownames(x)
  pc <- diag(s$d) %*% t(s$v)
  rownames(pc) <- paste0("PC", 1:nrow(pc))
  colnames(pc) <- colnames(x)
  pve <- s$d^2 / sum(s$d^2)
  if(space=="columns") {pc <- t(pc); loading <- t(loading)}
  return(list(pc=pc, loading=loading, pve=pve))
}
```
# (PART) Latent Variable Models {-}
# HD Latent Variable Models
## Definition
Latent variables (or hidden variables) are random variables that are present in the underlying probabilistic model of the data but are unobserved.
In high-dimensional data, there may be latent variables that affect many observed variables simultaneously.
These are latent variables that induce **systematic variation**. A topic of much interest is how to estimate such latent variables and incorporate them into further HD inference procedures.
## Model
Suppose we have observed data $\bY_{m \times n}$ of $m$ variables with $n$ observations each. Suppose there are $r$ latent variables contained in the $r$ rows of $\bZ_{r \times n}$ where
$$
\E\left[\bY_{m \times n} \left. \right| \bZ_{r \times n} \right] = \bPhi_{m \times r} \bZ_{r \times n}.
$$
Let's also assume that $m \gg n > r$. The latent variables $\bZ$ induce systematic variation in variable $\by_i$ parameterized by $\bphi_i$ for $i = 1, 2, \ldots, m$.
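To make the model concrete, here is a minimal simulation sketch of data drawn from it; the dimensions and all parameter values are illustrative assumptions, not part of the model.
```{r, eval=FALSE}
set.seed(123)
m <- 1000; n <- 20; r <- 2
Z <- matrix(rnorm(r*n), nrow=r)    # latent variables in the rows (r x n)
Phi <- matrix(rnorm(m*r), nrow=m)  # loadings phi_i in the rows (m x r)
E <- matrix(rnorm(m*n), nrow=m)    # independent noise
Y <- Phi %*% Z + E                 # so E[Y | Z] = Phi Z, with m >> n > r
```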
## Estimation
There exist methods for estimating the row space of $\bZ$ with probability 1 as $m \rightarrow \infty$ for a fixed $n$ in two scenarios.
[Leek (2011)](http://onlinelibrary.wiley.com/doi/10.1111/j.1541-0420.2010.01455.x/abstract) shows how to do this when $\by_i | \bZ \sim \text{MVN}(\bphi_i \bZ, \sigma^2_i \bI)$, and the $\by_i | \bZ$ are jointly independent.
[Chen and Storey (2015)](https://arxiv.org/abs/1510.03497) show how to do this when the $\by_i | \bZ$ are distributed according to a single-parameter exponential family distribution with mean $\bphi_i \bZ$, and the $\by_i | \bZ$ are jointly independent.
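A minimal sketch of the core idea, using the simulated `Y`, `Z`, and `r` from above: because $m \gg n > r$, the top $r$ right singular vectors of $\bY$ approximately span the row space of $\bZ$.
```{r, eval=FALSE}
s <- svd(Y)              # no intercept in this model, so no centering needed
Z_hat <- s$v[, 1:r]      # estimated basis for the row space of Z (n x r)
cancor(Z_hat, t(Z))$cor  # canonical correlations should be near 1
```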
# Jackstraw
Suppose we have a reasonable method for estimating $\bZ$ in the model
$$
\E\left[\bY \left. \right| \bZ \right] = \bPhi \bZ.
$$
The **jackstraw** method allows us to perform hypothesis tests of the form
$$
H_0: \bphi_i = \bO \mbox{ vs } H_1: \bphi_i \not= \bO.
$$
We can also perform this hypothesis test on any subset of the columns of $\bPhi$.
This is a challenging problem because we have to "double dip" in the data $\bY$, first to estimate $\bZ$, and second to perform significance tests on $\bPhi$.
## Procedure
The first step is to form an estimate $\hat{\bZ}$ and then a test statistic $t_i$ for the hypothesis test on each $\bphi_i$, computed from $\by_i$ and $\hat{\bZ}$ ($i=1, \ldots, m$). Assume that the larger $t_i$ is, the more evidence there is against the null hypothesis in favor of the alternative.
Next we randomly select $s$ rows of $\bY$ and permute them to create data set $\bY^{0}$. Let this set of $s$ variables be indexed by $\mathcal{S}$. This breaks the relationship between $\by_i$ and $\bZ$, thereby inducing a true $H_0$, for each $i \in \mathcal{S}$.
We estimate $\hat{\bZ}^{0}$ from $\bY^{0}$ and again obtain test statistics $t_i^{0}$. Specifically, the test statistics $t_i^{0}$ for $i \in \mathcal{S}$ are saved as draws from the null distribution.
We repeat this permutation procedure $B$ times and then utilize all $sB$ saved permutation null statistics to calculate empirical p-values:
$$
p_i = \frac{1}{sB} \sum_{b=1}^B \sum_{k \in \mathcal{S}_b} 1\left(t_k^{0b} \geq t_i \right).
$$
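Below is a bare-bones sketch of the whole procedure, using PCA (via `svd`) to estimate $\bZ$ and per-variable F-statistics as the $t_i$; it is meant to convey the logic, not to replace the `jackstraw` package used in the next section, and the function name and defaults are illustrative.
```{r, eval=FALSE}
jackstraw_sketch <- function(Y, r=1, s=50, B=100) {
  m <- nrow(Y)
  Y <- t(scale(t(Y), center=TRUE, scale=FALSE))  # row-center, as in pca() above
  get_stats <- function(Y) {
    Z_hat <- svd(Y)$v[, 1:r, drop=FALSE]         # step 1: estimate Z
    sapply(1:nrow(Y), function(i) {
      summary(lm(Y[i,] ~ Z_hat))$fstatistic[1]   # F-statistic for phi_i = 0
    })
  }
  t_obs <- get_stats(Y)
  t_null <- replicate(B, {
    idx <- sample(m, s)                          # the set S_b of rows to permute
    Y0 <- Y
    Y0[idx,] <- t(apply(Y[idx, , drop=FALSE], 1, sample))  # induce a true H0
    get_stats(0 + Y0)[idx]                       # save null statistics for S_b
  })
  sapply(t_obs, function(t) mean(t_null >= t))   # empirical p-values
}
```
Testing a subset of the columns of $\bPhi$ (e.g., association with PC1 only while conditioning on two PCs, as in the `r1=1, r=2` calls below) replaces the F-statistic above with one comparing nested models with and without the columns of $\hat{\bZ}$ being tested.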
## Example: Yeast Cell Cycle
Recall the yeast cell cycle data from earlier. We will test which genes have expression significantly associated with PC1 and PC2 since these both capture cell cycle regulation.
```{r, message=FALSE}
load("./data/spellman.RData")
time
dim(gene_expression)
dat <- t(scale(t(gene_expression), center=TRUE, scale=FALSE))
```
Test for associations between PC1 and each gene, conditioning on PC1 and PC2 being relevant sources of systematic variation.
```{r, message=FALSE}
jsobj <- jackstraw_pca(dat, r1=1, r=2, B=500, s=50, verbose=FALSE)
jsobj$p.value %>% qvalue() %>% hist()
```
This is the most significant gene plotted with PC1.
```{r, echo=FALSE}
y <- gene_expression[which.min(jsobj$p.value), ]
p <- pca(gene_expression, space="rows")
plot(rep(time, 2), c(p$pc[1,], y), pch=" ", xlab="time", ylab="expression")
points(time, p$pc[1,], pch=19, col="blue")
points(time, y, pch=19, col="red")
lines(smooth.spline(x=time, y=p$pc[1,], df=5), lwd=2, col="blue")
lines(smooth.spline(x=time, y=y, df=5), lwd=2, col="red")
legend(x="topright", legend=c("PC1", "top gene"), pch=c(19, 19), col=c("blue", "red"))
```
Test for associations between PC2 and each gene, conditioning on PC1 and PC2 being relevant sources of systematic variation.
```{r, message=FALSE}
jsobj <- jackstraw_pca(dat, r1=2, r=2, B=500, s=50, verbose=FALSE)
jsobj$p.value %>% qvalue() %>% hist()
```
This is the most significant gene plotted with PC2.
```{r, echo=FALSE}
y <- gene_expression[which.min(jsobj$p.value), ]
p <- pca(gene_expression, space="rows")
plot(rep(time, 2), c(p$pc[2,], y), pch=" ", xlab="time", ylab="expression")
points(time, p$pc[2,], pch=19, col="blue")
points(time, y, pch=19, col="red")
lines(smooth.spline(x=time, y=p$pc[2,], df=5), lwd=2, col="blue")
lines(smooth.spline(x=time, y=y, df=5), lwd=2, col="red")
legend(x="topright", legend=c("PC2", "top gene"), pch=c(19, 19), col=c("blue", "red"))
```
# Surrogate Variable Analysis
The **surrogate variable analysis** (SVA) model combines the many responses model with the latent variable model introduced above:
$$
\bY_{m \times n} = \bB_{m \times d} \bX_{d \times n} + \bPhi_{m \times r} \bZ_{r \times n} + \bE_{m \times n}
$$
where $m \gg n > d + r$.
Here, only $\bY$ and $\bX$ are observed, so we must combine many responses model fitting techniques with latent variable estimation.
The variables $\bZ$ are called **surrogate variables** for what would be a complete model of all systematic variation.
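Here is a small simulation sketch of data from this model; the dimensions and parameters are illustrative. Note that even though $\bZ$ is generated independently of $\bX$, their row spaces will typically overlap somewhat, which is the central difficulty discussed next.
```{r, eval=FALSE}
set.seed(456)
m <- 1000; n <- 20; d <- 2; r <- 1
X <- rbind(1, rep(0:1, each=n/2))  # intercept plus a two-group design (d x n)
B <- matrix(rnorm(m*d), nrow=m)    # regression coefficients (m x d)
Z <- matrix(rnorm(r*n), nrow=r)    # latent variable (r x n)
Phi <- matrix(rnorm(m*r), nrow=m)  # latent variable loadings (m x r)
E <- matrix(rnorm(m*n), nrow=m)
Y <- B %*% X + Phi %*% Z + E       # only Y and X are observed
```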
## Procedure
The main challenge is that the row spaces of $\bX$ and $\bZ$ may overlap. Even when $\bX$ is the result of a randomized experiment, it is highly likely that the row spaces of $\bX$ and $\bZ$ have some overlap.
Therefore, one cannot simply estimate $\bZ$ by applying a latent variable estimation method to the residuals $\bY - \hat{\bB} \bX$ or to the observed response data $\bY$. In the former case, we will only estimate $\bZ$ in the space orthogonal to $\hat{\bB} \bX$. In the latter case, the estimate of $\bZ$ may capture some of the signal that belongs in $\bB \bX$.
A [recent method](http://dx.doi.org/10.1080/01621459.2011.645777) takes an EM approach to estimating $\bZ$ in the model
$$
\bY_{m \times n} = \bB_{m \times d} \bX_{d \times n} + \bPhi_{m \times r} \bZ_{r \times n} + \bE_{m \times n}.
$$
It is shown to be necessary to penalize the likelihood in the estimation of $\bB$ --- i.e., form shrinkage estimates of $\bB$ --- in order to properly balance the row spaces of $\bX$ and $\bZ$.
The regularized EM algorithm, called **cross-dimensional inference** (CDI), iterates between:
1. Estimate $\bZ$ from $\bY - \hat{\bB}^{\text{Reg}} \bX$
2. Estimate $\bB$ from $\bY - \hat{\bPhi} \hat{\bZ}$
where $\hat{\bB}^{\text{Reg}}$ is a regularized or shrunken estimate of $\bB$.
It can be shown that when the regularization corresponds to a prior distribution on $\bB$, this algorithm converges to the maximum a posteriori (MAP) estimate.
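A schematic of this iteration on the simulated `Y` and `X` above, using ridge regression as an illustrative stand-in for the regularized estimate $\hat{\bB}^{\text{Reg}}$ and a truncated SVD for the latent variable step; the choice of penalty, its tuning, and the fixed iteration count are assumptions of this sketch rather than details of the published algorithm.
```{r, eval=FALSE}
cdi_sketch <- function(Y, X, r, lambda=1, n_iter=20) {
  d <- nrow(X)
  B_hat <- matrix(0, nrow(Y), d)
  for(iter in 1:n_iter) {
    # Step 1: estimate Phi and Z from the residuals Y - B_hat X
    s <- svd(Y - B_hat %*% X)
    Phi_hat <- s$u[, 1:r, drop=FALSE] %*% diag(s$d[1:r], r)  # m x r
    Z_hat <- t(s$v[, 1:r, drop=FALSE])                       # r x n
    # Step 2: shrunken (ridge) estimate of B from Y - Phi_hat Z_hat
    B_hat <- (Y - Phi_hat %*% Z_hat) %*% t(X) %*%
      solve(X %*% t(X) + lambda*diag(d))
  }
  list(B=B_hat, Phi=Phi_hat, Z=Z_hat)
}
fit <- cdi_sketch(Y, X, r=1)
```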
## Example: Kidney Expr by Age
In [Storey et al. (2005)](http://www.pnas.org/content/102/36/12837.full), we considered a study in which kidney samples were obtained from individuals across a range of ages. The goal was to identify genes whose expression is associated with age.
```{r kid_load, message=FALSE, cache=FALSE}
library(edge)
library(splines)
load("./data/kidney.RData")
age <- kidcov$age
sex <- kidcov$sex
dim(kidexpr)
cov <- data.frame(sex = sex, age = age)
null_model <- ~sex
full_model <- ~sex + ns(age, df = 3)
```
```{r kid_lrt, dependson="kid_load", cache=FALSE}
de_obj <- build_models(data = kidexpr, cov = cov,
                       null.model = null_model,
                       full.model = full_model)
de_lrt <- lrt(de_obj, nullDistn = "bootstrap", bs.its = 100, verbose=FALSE)
qobj1 <- qvalueObj(de_lrt)
hist(qobj1)
```
Now that we have completed a standard generalized LRT, let's estimate $\bZ$ (the surrogate variables) using the `sva` package as accessed via the `edge` package.
```{r kid_sva, dependson="kid_lrt", cache=FALSE}
dim(nullMatrix(de_obj))
de_sva <- apply_sva(de_obj, n.sv=4, method="irw", B=10)
dim(nullMatrix(de_sva))
de_svalrt <- lrt(de_sva, nullDistn = "bootstrap", bs.its = 100, verbose=FALSE)
```
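For reference, the direct call to the `sva` package corresponding to these models looks approximately like the following; this is a sketch of the interface and is not guaranteed to reproduce exactly what `apply_sva` does internally.
```{r, eval=FALSE}
library(sva)
mod <- model.matrix(~sex + ns(age, df=3))  # full model matrix
mod0 <- model.matrix(~sex)                 # null model matrix
sv_obj <- sva(as.matrix(kidexpr), mod, mod0, n.sv=4, method="irw")
dim(sv_obj$sv)                             # n observations by 4 surrogate variables
```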
```{r kid_svaq, dependson="kid_sva", cache=FALSE}
qobj2 <- qvalueObj(de_svalrt)
hist(qobj2)
```
```{r qvalues1, dependson="kid_svaq"}
summary(qobj1)
```
```{r qvalues2, dependson="kid_svaq"}
summary(qobj2)
```
The p-values from the two analyses are fairly different.
```{r pval_comp, dependson="kid_svaq"}
data.frame(lrt=-log10(qobj1$pval), sva=-log10(qobj2$pval)) %>%
  ggplot() + geom_point(aes(x=lrt, y=sva), alpha=0.3) + geom_abline()
```