\chapter{Mixed models}
It is often the case that parameters of interest in linear
models are naturally thought of as being random rather
than fixed. The rationale for this can arise for many reasons.
The first occurs when the natural asymptotics have the number
of parameters tending to infinity with the sample size. As
an example, consider the \texttt{Rail} dataset in
\texttt{nlme}. The measurements are travel times for ultrasonic
waves traveling along railroad rails (a measure of the health of the
rail). Multiple (3) measurements are collected for each rail.
Consider a model of the form
$$
Y_{ij} = \mu + u_i + \epsilon_{ij},
$$
where $i$ is rail and $j$ is measurement within rail.
Treating the $u_i$ as fixed effects results in a
circumstance where the number of parameters grows
with the number of rails. This can lead to
inconsistent parameter estimates \citep{neyman1948consistent}
(for a simple example, see
\href{http://www.stat.berkeley.edu/~census/neyscpar.pdf}{this note}).
A solution to this problem is to put a distribution on
the $u_i$, say $u_i \sim_{iid} N(0, \sigma^2_u)$. This
is closely related to ridge regression (from the penalization
chapter). However, unlike penalization, this formulation allows
for thinking about the random effect distribution as a population
distribution (the population of rails in our example).
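For concreteness, here is a minimal sketch of fitting this random intercept model
to the \texttt{Rail} data with \texttt{nlme} (the response variable in that dataset
is \texttt{travel}):
\begin{verbatim}
## random intercept model: Y_ij = mu + u_i + e_ij
library(nlme)
data(Rail)
fit <- lme(travel ~ 1, data = Rail, random = ~ 1 | Rail)
summary(fit)  # reports the estimated mu and the two standard deviations
\end{verbatim}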
Perhaps the easiest way to think about random effects is to
consider a fixed effect treatment of the $u_i$ terms. Since
we included an intercept, we would need to add one linear
constraint on the $u_i$ for identifiability. Consider the
constraint, $\sum_{i=1}^n u_i = 0$. Then, $\mu$ would be
interpreted as an overall mean and the $u_i$ terms would
be interpreted as the rail-specific deviations around
that mean. The random effect model simply specifies that
the $U_i$ are iid $N(0, \sigma^2_u)$ and
independent of the $\epsilon_{ij}$. The mean of the distribution
on the $U_i$ has to be 0 (or fixed at some number), since it would
not be identified separately from $\mu$ otherwise.
A perhaps preferable
way to specify the model is hierarchically,
$Y_{ij} ~|~ U_i \sim N(\mu + U_i, \sigma^2)$ and $U_i \sim N(0, \sigma^2_U)$.
Consider the implications of this model. First, note that
\begin{align}
\Cov(Y_{ij}, Y_{i'j'}) & = \Cov(U_i + \epsilon_{ij}, U_{i'} + \epsilon_{i'j'})\\
& = \Cov(U_i, U_{i'}) + \Cov(\epsilon_{ij}, \epsilon_{i'j'}) \\
& = \left\{
\begin{array}{cc}
\sigma^2 + \sigma^2_U & \mbox{if}~ i=i' ~\mbox{and}~ j = j' \\
\sigma^2_U & \mbox{if}~ i=i' ~\mbox{and}~ j\neq j' \\
0 & \mbox{otherwise}
\end{array}
\right.
\end{align}
Thus the correlation between two distinct observations in the same cluster is
$\sigma^2_u / (\sigma^2_u + \sigma^2)$. This is the ratio of
the between-cluster variability, $\sigma^2_u$, to the total variability,
$\sigma^2_u + \sigma^2$.
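For instance, with hypothetical values $\sigma^2_u = 9$ and $\sigma^2 = 1$
(illustrative numbers, not estimates from the \texttt{Rail} data), two measurements
on the same rail would have correlation $9 / (9 + 1) = 0.9$, while measurements on
different rails would be uncorrelated.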
Notice that the marginal model for $\bY_i = (Y_{i1}, \ldots, Y_{in_i})^t$ is
normally distributed with mean $\mu \bJ_{n_i}$ and variance
$
\sigma^2 \bI_{n_i} + \bJ_{n_i} \bJ_{n_i}^t \sigma^2_{u}.
$
It is by maximizing this (marginal) likelihood that we obtain the ML estimates
for $\mu, \sigma^2, \sigma^2_U$.
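Explicitly, writing $\bV_i = \sigma^2 \bI_{n_i} + \sigma^2_u \bJ_{n_i} \bJ_{n_i}^t$,
minus twice the marginal log-likelihood is, up to an additive constant,
$$
\sum_{i} \left\{ \log |\bV_i| + (\bY_i - \mu \bJ_{n_i})^t \bV_i^{-1} (\bY_i - \mu \bJ_{n_i}) \right\},
$$
and it is this criterion that is minimized over $(\mu, \sigma^2, \sigma^2_u)$.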
We can predict the $U_i$ by considering the estimate $E[U_i ~|~ \bY]$. To derive
this, note that the density for $U_i ~|~ \bY$ is equal to the density
of $U_i ~|~ \bY_i$, since $U_i$ is independent of every $Y_{i'j}$ for $i \neq i'$.
Then further note that the density for $U_i ~|~ \bY_i$ is proportional to the
joint density of $Y_i, U_i$, which is equal to the density of $Y_i ~|~ U_i$
times the density for $U_i$. Omitting factors that do not depend
on $U_i$, and taking minus twice the natural logarithm of the densities, we obtain:
$$
\|\bY_i - \mu \bJ_{n_i} - U_i \bJ_{n_i}\|^2 / \sigma^2 + U_i^2 / \sigma^2_U.
$$
Expanding the square and discarding terms that are constant in $U_i$,
we find that $U_i$ given $\bY_i$ is normally distributed with mean
$$
\frac{\sigma^2_u}{\sigma^2_u + \sigma^2 / n_i} (\bar Y_i - \mu).
$$
Thus, if $\hat \mu = \bar Y$, our estimate of $U_i$ is the fixed effect estimate,
$\bar Y_i - \bar Y$, shrunken toward zero. The idea of shrinking estimates
when simultaneously estimating several quantities is generally a good one.
This has similarities with James--Stein estimation \citep[see the review by][]{efron1977stein}.
Shrinkage estimation works by trading bias for lower variance. In our example,
the shrinkage factor is $\sigma^2_u / (\sigma^2_u + \sigma^2 /n_i)$. Thus, the
better estimated the group mean is ($\sigma^2 / n_i$ is small), or the more
variable the groups are relative to one another ($\sigma^2_u$ is large),
the less shrinkage we have. On the
other hand, the fewer observations we have, the larger the residual variation,
or the smaller the between-group variation, the more shrinkage we have. In this
way the estimate is calibrated to weigh the contribution of the individual
rail against the contribution of the group when estimating that specific rail's effect.
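As a sanity check on this formula, the sketch below compares hand-computed shrunken
deviations with the predicted random effects reported by \texttt{nlme}, continuing
the \texttt{fit} object from the earlier sketch (the indexing of the \texttt{VarCorr}
output assumed here may differ slightly across \texttt{nlme} versions):
\begin{verbatim}
## shrinkage factor applied to the raw rail-specific deviations
vc     <- VarCorr(fit)
sig2u  <- as.numeric(vc["(Intercept)", "Variance"])
sig2   <- as.numeric(vc["Residual", "Variance"])
mu     <- fixef(fit)
ybar   <- tapply(Rail$travel, Rail$Rail, mean)  # per-rail means
n_i    <- table(Rail$Rail)                      # per-rail sample sizes
shrunk <- sig2u / (sig2u + sig2 / n_i) * (ybar - mu)
## compare with the BLUPs (rows matched by rail label)
cbind(shrunk, blup = ranef(fit)[names(ybar), 1])
\end{verbatim}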
\section{General case}
In general, we might write $\bY ~|~ \bU \sim N(\bX \bbeta + \bZ \bU, \sigma^2 \bI)$ and
$\bU \sim N(\bzero, \bSigma_U)$. This is equivalent to specifying
$$
\bY = \bX \bbeta + \bZ \bU + \beps,
$$
where $\beps \sim N(\bzero, \sigma^2 \bI)$ is independent of $\bU$.
Here, the marginal distribution of $\bY$ is normal with mean $\bX \bbeta$ and
variance $\bZ \bSigma_u \bZ^t + \sigma^2 \bI$. Maximum likelihood estimates
maximize the marginal likelihood via direct numerical maximization or
the EM algorithm \citep{dempster1977maximum}. Notice that, for fixed variance
components, the estimate of $\bbeta$ is a weighted (generalized) least squares estimate.
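That is, writing $\bV = \bZ \bSigma_u \bZ^t + \sigma^2 \bI$ for fixed variance
components, the maximizing value of $\bbeta$ is
$$
\hat \bbeta = (\bX^t \bV^{-1} \bX)^{-1} \bX^t \bV^{-1} \bY.
$$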
Note the distinction between a mixed effects model and simply
specifying a marginal variance structure. The same marginal likelihood
can be obtained via the model:
$$
\bY = \bX \bbeta + \beps
$$
where $\beps \sim N(\bzero, \bZ \bSigma_u \bZ^t + \sigma^2 \bI)$. However,
some differences tend to arise. Often, the natural specification of a marginal
variance structure doesn't impose positivity constraints that random effects
do. For example, in the previous section, we saw that the correlation between
measurements in the same cluster was $\sigma^2_u / (\sigma^2_u + \sigma^2)$,
which is guaranteed to be positive. However, when fitting a general marginal
covariance structure, one would typically allow the within-cluster correlation
to be either positive or negative.
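To make the contrast concrete, a sketch of the corresponding marginal fit for the
\texttt{Rail} data uses \texttt{gls} with a compound symmetric correlation structure;
unlike the random intercept model, the within-rail correlation parameter here is not
constrained to be positive:
\begin{verbatim}
## marginal model: same mean structure, compound symmetric correlation
library(nlme)
fit.marginal <- gls(travel ~ 1, data = Rail,
                    correlation = corCompSymm(form = ~ 1 | Rail))
summary(fit.marginal)  # reports the estimated within-rail correlation
\end{verbatim}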
Another difference lies in the hierarchical model itself. We can actually
predict the random effects if we specify them, something marginal models do not allow.
This is a key (perhaps ``the'' key) defining attribute of mixed models.
Again, the best linear unbiased predictors (BLUPs) are given by
$$
E[\bU ~|~ \bY].
$$
As a homework exercise, derive the general form of the BLUPs.
\section{REML}
Let $\bH_X$ be the hat matrix for $\bX$. Then note that
$$
\be=(\bI - \bH_X) \bY = (\bI - \bH_X) \bZ \bU + (\bI - \bH_X) \beps.
$$
Then, the marginal distribution of $\be$ is singular normal with mean $\bzero$ and
variance $(\bI - \bH_X) \bZ \bSigma_U \bZ^t (\bI - \bH_X) + \sigma^2 (\bI - \bH_X)$.
Taking any full rank sub-vector of $\be$ and maximizing its marginal likelihood
for $\bSigma_U$ and $\sigma^2$ is called restricted maximum likelihood (REML).
REML estimates tend to be less biased than the ML estimates. For example, if
$y_i \sim_{iid} N(\mu, \sigma^2)$, maximizing the likelihood for
any $n-1$ of the $e_i = y_i - \bar y$ yields the unbiased variance estimate (with denominator $n-1$)
rather than the biased estimate (with denominator $n$) obtained via maximum likelihood.
REML estimates are often the default for linear mixed effect model programs.
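For example, \texttt{lme} switches between the two criteria through its
\texttt{method} argument; a sketch continuing the \texttt{Rail} example:
\begin{verbatim}
## REML (the default) versus ML estimates of the variance components
fit.reml <- lme(travel ~ 1, data = Rail, random = ~ 1 | Rail, method = "REML")
fit.ml   <- lme(travel ~ 1, data = Rail, random = ~ 1 | Rail, method = "ML")
VarCorr(fit.reml)
VarCorr(fit.ml)  # ML variance estimates tend to be biased downward
\end{verbatim}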
An alternative way to derive the REML estimates is via Bayesian thinking. Consider a
model where $\bY ~|~\bbeta \sim N(\bX \bbeta, \bZ \bSigma_u \bZ^t +\sigma^2 \bI)$
and $\bbeta \sim N(0, \theta \bI)$. Calculating the mode for $\bSigma_u$ and $\sigma^2$
after having integrated out $\bbeta$ as $\theta \rightarrow \infty$ results in the REML estimates.
While this is not terribly useful for general linear mixed effect modeling, it helps us think
about REML as it relates to Bayesian analysis, and it allows us to extend REML to settings
where residuals are less well defined, such as generalized linear mixed models.
\section{Prediction}
Consider generally trying to predict $U$ from observed data $Y$.
Let $f_{uy}$, $f_u$, $f_y$, $f_{u|y}$ and $f_{y|u}$ be the
joint, marginal and conditional densities respectively. Let
$\theta(Y)$ be our estimator of $U$. Consider
evaluating the prediction error via the expected squared loss
$$
E[(U -\theta(Y))^2].
$$
We now show that this is minimized at $\theta(Y) = E[U ~|~ Y]$.
Note that
\begin{align*}
& E[(U -\theta(Y))^2] \\
& = E[(U - E[U ~|~ Y] + E[U ~|~ Y] - \theta(Y))^2] \\
& = E[(U - E[U ~|~ Y])^2] + 2 E[(U - E[U~|~Y])(E[U~|~Y] - \theta(Y))] + E[(E[U~|~Y] - \theta(Y))^2] \\
& = E[(U - E[U ~|~ Y])^2] + E[(E[U~|~Y] - \theta(Y))^2] \\
& \geq E[(U - E[U ~|~ Y])^2].
\end{align*}
The cross term vanishes because, conditional on $Y$, $E[U~|~Y] - \theta(Y)$ is fixed
while $E[U - E[U~|~Y] ~|~ Y] = 0$.
This argument should seem familiar. (In fact, Hilbert space results generalize these
kinds of arguments into one theorem.) Therefore, $E[U~|~Y]$ is the best
predictor. Note that it is always the best predictor (in mean squared error), regardless of the setting.
Furthermore, in the context of linear models, this predictor is both linear (in $\bY$)
and unbiased. We mean unbiased in the sense of:
$$
E[ U - E[U ~|~ Y]] = 0.
$$
Therefore, in the case of mixed models, $E[U ~|~ Y]$ remains best even within the
more restricted class of linear unbiased predictors.
A complication arises in that we do not know the variance components, so we
must plug in their estimates (either REML or ML).
The BLUPs then lose their exact optimality properties and are thus often called
EBLUPs (empirical BLUPs).
Prediction of this sort relates to so-called empirical Bayesian prediction
and shrinkage estimation. In your more advanced classes on decision theory,
you'll learn about loss functions and the uniform improvement of shrinkage
estimators over straightforward estimators. (In our case the straightforward
estimator is the one that treats the random
effects as if fixed.) This line of thinking yields yet another use for random effect models:
we might apply them merely for the benefits of shrinkage,
even when we don't actually think of our random effects as random. Consider
settings like genomics. The genes being studied are exactly the quantities
of interest, not a random sample from a population of genes. However, it
remains useful to treat effects associated with genes as if random to
obtain the benefits of shrinkage.
\section{P-splines}
\subsection{Regression splines}
The application to splines has been a very successful, relatively new, use of
mixed models. To discuss the methodology, we need to introduce splines briefly.
We give only a brief overview of this area and focus on regression splines, while acknowledging
that other spline bases may be preferable.
Let $(a)_+ = a$ if $a > 0$ and $0$ otherwise. Let $\xi$ be a known
knot location. Now consider the model:
$$
E[Y_i] = \beta_0 + \beta_1 x_i + \gamma_1 (x_i - \xi)_+.
$$
For $x_i$ values less than or equal to $\xi$, we have
$$
E[Y_i] = \beta_0 + \beta_1 x_i
$$
and for $x_i$ values above $\xi$ we have
$$
E[Y_i] = (\beta_0 - \gamma_1 \xi) + (\beta_1 + \gamma_1) x_i.
$$
Thus, the response function $f(x) = \beta_0 + \beta_1 x + \gamma_1 (x - \xi)_+$
is continuous at $\xi$ and is a line before and after. This allows us to create
``hockey stick'' models, with a line below $\xi$ and a line with a different slope
afterwards. Furthermore, we could expand the model as
$$
E[Y_i] = \beta_0 + \beta_1 x_i + \sum_{k=1}^K \gamma_k (x_i - \xi_k)_+.
$$
where the $\xi_k$ are knot points. Now the model is a spiky, but flexible,
function that is linear between the knots and continuous at the knots.
To make the fit less spiky, we want continuous differentiability at the knot
points. First note that the function $(x)_+^p$ has $p-1$ continuous derivatives at 0.
To see this, compare the limits of the derivatives from the right and the left as
$x$ tends to zero. Thus, the function
$$
f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \sum_{k=1}^K \gamma_k (x - \xi_k)_+^2
$$
will consist of parabolas between the knot points and
will have one continuous derivative at the knot points. This will fit
a smooth function that can accommodate a wide variety of data shapes.
\subsection{Coding example}
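Below is a sketch of fitting the penalized quadratic spline of the previous subsection
as a mixed model, treating the $\gamma_k$ as iid mean zero normal random effects, in the
spirit of the semiparametric regression literature. The simulated data, the number of
knots, and the \texttt{pdIdent} construction with a single grouping level are
illustrative choices; the exact syntax may need small adjustments across
\texttt{nlme} versions.
\begin{verbatim}
## penalized quadratic spline via a mixed model (illustrative sketch)
library(nlme)
set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)        # simulated example data

K <- 15                                          # number of interior knots
knots <- quantile(x, seq(0, 1, length.out = K + 2))[-c(1, K + 2)]
Z <- outer(x, knots, function(x, k) pmax(x - k, 0)^2)  # truncated quadratics

## unpenalized regression spline: can be wiggly when K is large
fit.ols <- lm(y ~ x + I(x^2) + Z)

## penalized fit: gamma_k treated as iid random effects within one group
g <- factor(rep(1, n))
fit.pen <- lme(y ~ x + I(x^2), random = list(g = pdIdent(~ Z - 1)))

plot(x, y, col = "grey")
lines(x, fitted(fit.ols), lty = 2)               # unpenalized fit
lines(x, fitted(fit.pen))                        # penalized, smoother fit
\end{verbatim}
The amount of smoothing is governed by the ratio of the residual variance to the
random effect variance, which is estimated (for example by REML) rather than chosen by hand.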
\section{Further reading}
A great book on mixed models in R is \cite{pinheiro2006mixed}.
In addition, the books by Searle, McCulloch,
Verbeke and Molenberghs are wonderful treatments of the topic
\citep{mcculloch2001linear,verbeke2009linear}. Finally, the
newer package \texttt{lme4} has a series of
\href{https://cran.r-project.org/web/packages/lme4/index.html}{vignettes}.