\chapter{Parameter estimability and penalties}
In this chapter we consider parameter estimability and penalties.
\section{Estimability}
This section draws heavily from the wonderful book by \cite{searle2012linear}.
We define a linear combination of the slope parameters, $\bq^t \bbeta$, as being estimable
if it is equal to a linear combination of the expected value of $\bY$. In other words,
$\bq^t \bbeta$ is estimable if it is equal to $\bt^t E[\bY]$ for some value of $\bt$.
I find estimability most useful when $\bX$ is over-specified (not full rank). For example,
consider an ANOVA model
$$
Y_{ij} = \mu + \beta_i + \epsilon_{ij}.
$$
Verify for yourself that the $\bX$ matrix from this model is not full rank.
Because $\bt^t E[\bY] = \bt^t \bX \bbeta$ must equal $\bq^t \bbeta$ for all possible $\bbeta$, we have $\bq^t = \bt^t \bX$, and
we obtain that estimable contrasts are necessarily linear combinations of the rows
of the design matrix.
The most useful result in estimability is the invariance property of estimable contrasts. Consider a design matrix that is not of full rank.
Then any solution to the normal equations:
$$
\bX^t \bX \bbeta = \bX^t \bY
$$
will minimize the least squares criteria (or equivalently maximize the likelihood
under spherical Gaussian assumptions). (If you don't see this, verify it yourself using
the tools from the first few chapters.) Since $\bX$ is not full rank, this system
has infinitely many solutions. Let $\hat \bbeta$ and $\tilde \bbeta$ be any two such
solutions. For estimable quantities, $\bq^t \hat \bbeta = \bq^t \tilde \bbeta$.
That is, the particular solution to the normal equations doesn't matter for estimable quantities. This should be
clear given the definition of estimability. Recall that least squares projects
onto the plane defined by linear combinations of the columns of $\bX$. The projection,
$\hat \bY$, is unique, while the particular linear combination is not in this case.
To discuss this further, suppose $\bq^t \hat \bbeta \neq \bq^t \tilde \bbeta$
for two solutions to the normal equations, $\hat \bbeta$ and $\tilde \bbeta$
and estimable $\bq^t \bbeta$. Then $\bt^t \bX \hat \bbeta \neq \bt^t \bX \tilde \bbeta$. Let
$\hat \bY$ be the projection of $\bY$ on the space of linear combinations of the columns of $\bX$.
However, since both are projections, $\hat \bY = \bX \hat \bbeta = \bX \tilde \bbeta$.
Multiplying by $\bt^t$ then yields a contradiction.
\subsection{Why it's useful}
Consider the one way ANOVA setting again.
$$
Y_{ij} = \mu + \beta_i + \epsilon_{ij}.
$$
For $i = 1, 2$, $j = 1, \ldots, J$.
One can obtain parameter identifiability by setting $\beta_2 = 0$, $\beta_1 = 0$, $\mu = 0$ or $\beta_1 + \beta_2 = 0$ (or one of infinitely many other linear contrasts).
These constraints don't change the column space of the $\bX$ matrix. (Thus the projection stays the same.)
Recall that $\hat y_{ij} = \bar y_i$. Estimable functions are linear combinations of $E[Y_{ij}] = \mu + \beta_i$. So, note that
$$
E[Y_{21}] - E[Y_{11}] = \beta_2 - \beta_1
$$
is estimable and it will always be estimated by $\bar y_2 - \bar y_1$. Thus, regardless of which linear constraints one places on the
model to achieve identifiability, the difference in the means will have the same estimate.
This also gives us a way to go between estimates obtained under different constraints without refitting the models. For two sets
of constraints we have
$$
\bar y_i = \hat \mu + \hat \beta_i = \tilde \mu + \tilde \beta_i,
$$
which yields a simple system of equations for converting between the two sets of estimates.
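As a quick numerical illustration (a minimal sketch, not from the text, using a simulated two-group dataset), the code below fits the one way ANOVA under two different identifiability constraints, treatment coding ($\beta_1 = 0$) and sum coding ($\beta_1 + \beta_2 = 0$), and checks that the estimable contrast $\beta_2 - \beta_1$ is estimated by $\bar y_2 - \bar y_1$ either way.
\begin{verbatim}
## simulated two-group data (illustrative only)
set.seed(1)
y = c(rnorm(5, mean = 1), rnorm(5, mean = 3))
g = factor(rep(1 : 2, each = 5))
## constraint beta_1 = 0 (R's default treatment coding)
fitTreatment = lm(y ~ g)
## constraint beta_1 + beta_2 = 0 (sum coding)
fitSum = lm(y ~ g, contrasts = list(g = "contr.sum"))
groupMeans = tapply(y, g, mean)
groupMeans[2] - groupMeans[1]  ## difference in group means
coef(fitTreatment)[2]          ## beta_2 - beta_1 under treatment coding
-2 * coef(fitSum)[2]           ## beta_2 - beta_1 under sum coding
\end{verbatim}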
\section{Linear constraints}
Consider performing least squares under the full row rank linear constraints
$$
\bK^t \bbeta = \bz.
$$
One could obtain these estimates using Lagrange multipliers, minimizing
$$
||\by - \bX \bbeta||^2 + 2 \blambda^t (\bK^t \bbeta - \bz)
= \by^t \by - 2\bbeta^t \bX^t \by + \bbeta^t \bX^t \bX \bbeta + 2 \blambda^t (\bK^t \bbeta - \bz).
$$
Taking a derivative with respect to $\blambda$ yields
\begin{equation}
\label{eq:cls1}
2 (\bK^t \bbeta - \bz) = 0
\end{equation}
Taking a derivative with respect to $\bbeta$ we have:
$$
-2 \bX^t \by + 2 \bX^t \bX \bbeta + 2 \bK \blambda = 0
$$
which has a solution in $\bbeta$ as
\begin{equation}
\label{eq:cls2}
\bbeta = \xtxinv (\bX^t \by - \bK \blambda) = \hat \bbeta - \xtxinv \bK \blambda,
\end{equation}
where $\hat \bbeta$ is the OLS (unconstrained) estimate.
Multiplying by $\bK^t$ and using \eqref{eq:cls1} we have that
$$
\bz = \bK^t \hat \bbeta - \bK^t \xtxinv \bK \blambda
$$
yielding a solution for $\blambda$ as
$$
\blambda = \{\bK^t \xtxinv \bK\}^{-1} (\bK^t \hat \bbeta - \bz).
$$
Plugging this back into \eqref{eq:cls2} yields the solution:
$$
\bbeta = \hat \bbeta - \xtxinv \bK \{\bK^t \xtxinv \bK\}^{-1} (\bK^t \hat \bbeta - \bz).
$$
Thus, one can fit constrained least squares estimates without actually refitting the model.
Notice, in particular, that if one were to multiply this estimate by $\bK^t$, the result would be
$\bz$.
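To make the formula concrete, here is a minimal sketch (not from the text) applying it to the \texttt{swiss} data with an arbitrary, purely illustrative constraint, namely that the Agriculture and Examination coefficients sum to zero. The last line checks that the constrained estimate satisfies $\bK^t \bbeta = \bz$.
\begin{verbatim}
data(swiss)
y = swiss$Fertility
x = cbind(1, as.matrix(swiss[, -1]))
xtxinv = solve(t(x) %*% x)
betahat = xtxinv %*% t(x) %*% y           ## unconstrained OLS estimate
K = matrix(c(0, 1, 1, 0, 0, 0), ncol = 1) ## K^t beta = Agriculture + Examination coefs
z = 0
lambda = solve(t(K) %*% xtxinv %*% K, t(K) %*% betahat - z)
betaConstrained = betahat - xtxinv %*% K %*% lambda
t(K) %*% betaConstrained                  ## recovers z = 0 (up to numerical error)
\end{verbatim}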
\subsection{Likelihood ratio tests}
One can use this result to derive likelihood ratio tests of $H_0: \bK^t \bbeta = \bz$ versus
the general alternative. From the previous section, the
estimate under the null hypothesis is
$$\hat \bbeta_{H_0} = \hat \bbeta - \xtxinv \bK \{\bK^t \xtxinv \bK\}^{-1} (\bK^t \hat \bbeta - \bz).$$
Of course, under the alternative, the estimate is $\hat \bbeta = \xtxinv \bX^t \bY$. In both
cases, the maximum likelihood variance estimate is $\frac{1}{n}||\bY - \bX \bbeta||^2$ with
$\bbeta$ as the estimate under either the null or alternative hypothesis. Let $\hat \sigma^2_{H_0}$
and $\hat \sigma^2$ be the two estimates.
The likelihood ratio statistic is
$$
\frac{{\cal L}(\hat \bbeta_{H_0}, \hat \sigma^2_{H_0})}{{\cal L}(\hat \bbeta,\hat \sigma^2)}
= \left(\frac{\hat \sigma^2_{H_0}}{\hat \sigma^2} \right)^{-n/2}.
$$
This is monotonically equivalent to $n \hat \sigma^2 / n \hat \sigma^2_{H_0}$. However, we reject
if the null is less supported than the alternative, i.e. this statistic is small,
so we could equivalently reject if $n \hat \sigma^2_{H_0} / n \hat \sigma^2$ is large.
Further note
that
\begin{eqnarray*}
n\hat \sigma^2_{H_0}
& = &||\bY - \bX \hat \bbeta_{H_0}||^2 \\
& = & ||\bY - \bX \hat \bbeta + \bX \hat \bbeta - \bX \hat \bbeta_{H_0}||^2 \\
& = & ||\bY - \bX \hat\bbeta||^2 + ||\bX \hat \bbeta - \bX \hat \bbeta_{H_0}||^2 \\
& = & n \hat \sigma^2 + ||\bX \hat \bbeta - \bX \hat \bbeta_{H_0}||^2
\end{eqnarray*}
Notationally, let $$SS_{reg} = ||\bX \hat \bbeta - \bX \hat \bbeta_{H_0}||^2
= ||\hat \bY - \hat \bY_{H_0}||^2$$ and
$SS_{res} = n \hat \sigma^2$.
Then note that the inverse of our likelihood ratio is monotonically equivalent to
$\frac{SS_{reg}}{SS_{res}}$.
However, $SS_{reg} / \sigma^2$ and $SS_{res} / \sigma^2$ are both independent
Chi-squared random variables with degrees of freedom $Rank(\bK)$ and $n - p$
under the null. (Prove this for homework.) Thus, our likelihood ratio statistic can
exactly be converted into the $F$ statistic of section \ref{sec:ftest}. We leave
the demonstration that the two are identical as a homework exercise.
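As a numerical check (a minimal sketch, not from the text), the code below builds the $F$ statistic from $SS_{reg}$ and $SS_{res}$ for the illustrative hypothesis that the Examination and Education coefficients in the \texttt{swiss} regression are both zero, and compares it with the \texttt{anova} output for the corresponding nested models.
\begin{verbatim}
data(swiss)
y = swiss$Fertility
x = cbind(1, as.matrix(swiss[, -1]))
n = nrow(x); p = ncol(x)
xtxinv = solve(t(x) %*% x)
betahat = xtxinv %*% t(x) %*% y
K = cbind(c(0, 0, 1, 0, 0, 0),        ## H_0: Examination and Education
          c(0, 0, 0, 1, 0, 0))        ##      coefficients are both zero
z = c(0, 0)
betah0 = betahat - xtxinv %*% K %*%
  solve(t(K) %*% xtxinv %*% K, t(K) %*% betahat - z)
ssreg = sum((x %*% betahat - x %*% betah0)^2)
ssres = sum((y - x %*% betahat)^2)
(ssreg / ncol(K)) / (ssres / (n - p)) ## F statistic built from the sums of squares
anova(lm(Fertility ~ . - Examination - Education, data = swiss),
      lm(Fertility ~ ., data = swiss))  ## same F statistic via anova
\end{verbatim}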
This line of thinking can be extended. Consider the sequence of hypotheses:
\begin{align*}
& H_1 :\bK_1 \bbeta = \bz \\
& H_2 :\bK_2 \bK_1 \bbeta = \bK_2 \bz \\
& H_3 :\bK_3 \bK_2 \bK_1 \bbeta = \bK_3 \bK_2 \bz \\
& \vdots
\end{align*}
Each $\bK_i$ is assumed full row rank and of fewer rows than $\bK_{i-1}$.
These hypotheses are nested with $H_1$ being the most restrictive, $H_2$ being the
second most, and so on. (Note, if $H_1$ holds then $H_2$ holds but not vice versa.)
Consider
testing $H_{1}$ (null) versus $H_{2}$ (alternative). Note that, under our
general specification, the same discussion applies to testing any $H_i$ (null) versus $H_j$ (alternative) with $i < j$.
Under the arguments
above, our likelihood ratio statistic will work out to be inversely equivalent to
the statistic: $n\hat \sigma^2_{H_1} / n\hat \sigma^2_{H_2}$.
Further note that
\begin{eqnarray*}
n\hat \sigma^2_{H_1}
& = &||\bY - \hat \bY_{H_1}||^2 \\
& = & ||\bY - \hat \bY_{H_2}||^2 + ||\hat \bY_{H_2} - \hat \bY_{H_1}||^2
+ 2 (\bY - \hat \bY_{H_2})^t (\hat \bY_{H_2} - \hat \bY_{H_1}) \\
& = & ||\bY - \hat \bY_{H_2}||^2 + ||\hat \bY_{H_2} - \hat \bY_{H_1}||^2 \\
& = & SS_{RES}(H_2) + SS_{REG}(H_1 ~|~ H_2)
\end{eqnarray*}
Here the cross product term in the second line is zero by (tedious yet straightforward) algebra
and the facts that:
$\bK_2 \bK_1 \hat \bbeta_{H_1} = \bK_2 \bK_1 \hat \bbeta_{H_2} = \bK_2 \bz$ and $\be^t \bX = \bzero$.
Thus, our likelihood ratio statistic is monotonically equivalent to
$$SS_{REG}(H_1 ~|~ H_2) / SS_{RES}(H_2).$$
Furthermore,
using the methods developed in this class, $SS_{REG}(H_1 ~|~ H_2)/\sigma^2$ is Chi-squared with
$Rank(\bK_1) - Rank(\bK_2)$ degrees of freedom, while $SS_{RES}(H_2)/\sigma^2$ is Chi-squared with $n - p + Rank(\bK_2)$
degrees of freedom, and they are independent. Thus we can construct an F test for nested
linear hypotheses.
This process can be iterated, decomposing $SS_{RES}(H_2)$, so that:
$$
n\hat \sigma^2_{H_1}
= SS_{REG}(H_1 ~|~ H_2) + SS_{REG}(H_2 ~|~ H_3) + SS_{RES}(H_3)
$$
And it could be iterated again so that:
$$
n\hat \sigma^2_{H_1}
= SS_{REG}(H_1 ~|~ H_2) + SS_{REG}(H_2 ~|~ H_3) + \ldots + SS_{RES}(H_p)
$$
where $SS_{RES}(H_p)$ is the residual sums of squares under the most elaborate model considered.
The sums of squares add so that, for example,
$$
SS_{REG}(H_1 ~|~ H_3) = SS_{REG}(H_1 ~|~ H_2) + SS_{REG}(H_2 ~|~ H_3)
$$
and
$$
SS_{RES}(H_3) = SS_{REG}(H_3 ~|~ H_4) + \ldots + SS_{RES}(H_p).
$$
Thus, one could test any subset of the nested hypotheses by appropriately adding the sums of
squares.
\subsection{Example use}
The most popular use of the general linear hypothesis is to consider nested hypotheses.
That is, consider a linear model where $\bbeta^t = [\beta_0 ~ \beta_1 ~ \ldots ~\beta_p]$
so that the $\beta_i$ are ordered in decreasing scientific importance.
\begin{align*}
& H_1 : \beta_1 = \beta_2 = \beta_3 = \ldots = \beta_p = 0 \\
& H_2 : \beta_2 = \beta_3 = \ldots = \beta_p = 0 \\
& H_3 : \beta_3 = \ldots = \beta_p = 0 \\
& \vdots \\
& H_p : \beta_p = 0
\end{align*}
Then testing $H_1$ versus $H_2$ tests whether $\beta_1$ is zero under the assumption
that all of the remaining coefficients (excepting the intercept) are zero.
Testing $H_2$ versus $H_5$ tests whether $\beta_2 = \beta_3 = \beta_4 = 0$
under the assumption that $\beta_5$ through $\beta_p$ are 0.
\subsection{Coding examples}
Let's go through an example of fitting multiple models. We'll look
at the \texttt{swiss} dataset. The following code fits three models
for the dataset. First, we model the outcome, regional fertility,
as a function of various aspects of the region. Imagine that we
are particularly interested in agriculture as a variable. We
fit three models: first a linear regression with just agriculture,
then one adding the education-level variables (examination and education), and
then one including all of the previous variables plus information on religion (percent Catholic)
and infant mortality rates.
\begin{verbatim}
data(swiss)
fit1 = lm(Fertility ~ Agriculture, data = swiss)
fit2 = update(fit1, Fertility ~ Agriculture + Examination + Education)
fit3 = update(fit1, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
anova(fit1, fit2, fit3)
\end{verbatim}
The \texttt{anova} command computes the relevant sums of squares for each of the models, resulting in the
output
\begin{verbatim}
Analysis of Variance Table
Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 6283.1
2 43 3180.9 2 3102.2 30.211 8.638e-09 ***
3 41 2105.0 2 1075.9 10.477 0.0002111 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
\end{verbatim}
Here, it would appear that inclusion of the other variables is necessary.
However, let's see if we can create these sums of squares manually using our
approach.
\begin{verbatim}
> xtilde = as.matrix(swiss);
> y = xtilde[,1]
> x1 = cbind(1, xtilde[,2])
> x2 = cbind(1, xtilde[,2:4])
> x3 = cbind(1, xtilde[,-1])
> makeH = function(x) x %*% solve(t(x) %*% x) %*% t(x)
> n = length(y); I = diag(n)
> h1 = makeH(x1)
> h2 = makeH(x2)
> h3 = makeH(x3)
> ssres1 = t(y) %*% (I - h1) %*% y
> ssres2 = t(y) %*% (I - h2) %*% y
> ssres3 = t(y) %*% (I - h3) %*% y
> ssreg2g1 = t(y) %*% (h2 - h1) %*% y
> ssreg3g2 = t(y) %*% (h3 - h2) %*% y
> out = rbind( c(n - ncol(x1), ssres1, NA, NA),
c(n - ncol(x2), ssres2, ncol(x2) - ncol(x1), ssreg2g1),
c(n - ncol(x3), ssres3, ncol(x3) - ncol(x2), ssreg3g2)
)
> out
[,1] [,2] [,3] [,4]
[1,] 45 6283.116 NA NA
[2,] 43 3180.925 2 3102.191
[3,] 41 2105.043 2 1075.882
\end{verbatim}
It is interesting to note that the F test comparing Model 1 to Model 2 from the \texttt{anova} command
is obtained by dividing \texttt{3102.191 / 2} (a chi-squared divided by its 2 degrees of freedom)
by \texttt{2105.043 / 41} (an independent chi-squared divided by its 41 degrees of freedom). The
denominator of the F statistic is then based on the residual sum of squares from Model 3, not from Model 2.
This is why the following give two different answers for the F statistic:
\begin{verbatim}
> anova(fit1, fit2)
Analysis of Variance Table
Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 6283.1
2 43 3180.9 2 3102.2 20.968 4.407e-07 ***
---
> anova(fit1, fit2, fit3)
Analysis of Variance Table
Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 6283.1
2 43 3180.9 2 3102.2 30.211 8.638e-09 ***
3 41 2105.0 2 1075.9 10.477 0.0002111 ***
\end{verbatim}
In the first case, the denominator of the F statistic is
\texttt{3180.9 / 43}, the residual mean squared error for Model 2,
as opposed to the latter case where it is dividing by the residual
mean squared error for Model 3. Of course, under the null hypothesis,
either approach yields an independent chi squared statistic in the denominator.
However, using the Model 3 residual mean squared error reduces the
denominator degrees of freedom, though also necessarily reduces the
residual sum of squared errors (since extra terms in the regression
model always do that).
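To make the arithmetic explicit, here is a small sketch reproducing both versions of the Model 1 versus Model 2 F statistic directly from the printed sums of squares.
\begin{verbatim}
## F using Model 2's residual mean square, as in anova(fit1, fit2)
(3102.2 / 2) / (3180.9 / 43)   ## approximately 20.97
## F using Model 3's residual mean square, as in anova(fit1, fit2, fit3)
(3102.2 / 2) / (2105.0 / 41)   ## approximately 30.21
\end{verbatim}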
\section{Ridge regression}
Consider adding a quadratic penalty to the least squares criterion:
$$
||\bY - \bX \bbeta||^2 + \bbeta^t \bGamma \bbeta.
$$
In this case we consider instances where $\bX$ is not necessarily full rank. The
addition of the penalty is called ``Tikhonov regularization'', after the mathematician of
that name. The specific instance of this regularization in regression is called ridge
regression. The matrix $\bGamma$ is typically assumed known or set to $\gamma \bI$.
Another way to envision ridge regression is to think in terms of a posterior mode in
a regression model. Specifically, let $\bSigma^{-1} = \bGamma / \sigma^2$ and consider the model
where $\by ~|~ \bbeta \sim N(\bX \bbeta, \sigma^2 \bI)$ and $\bbeta \sim N(\bzero, \bSigma)$.
Then one obtains the posterior for $\bbeta$ and $\sigma$ by multiplying the two densities. The
posterior mode would be obtained by minimizing minus twice the log of this product
$$
||\bY - \bX \bbeta||^2 / \sigma^2 + \bbeta^t \bGamma \bbeta / \sigma^2.
$$
which is equivalent to the criterion above as far as estimation of $\bbeta$ is concerned.
We'll leave it as an exercise to obtain that the estimate actually obtained is
$$
\hat \bbeta_{ridge} = (\xtx + \bGamma)^{-1} \bX^t \bY.
$$
To see how this regularization helps with invertibility of $\xtx$, consider the case where
$\bGamma = \gamma \bI$. If $\gamma$ is very large, then $\xtx + \gamma \bI$ is dominated by the
$\gamma \bI$ term: it is a large multiple of the identity plus a comparatively small perturbation, and is clearly invertible.
Consider the case where $\bX$ is column centered and is of full column rank.
Let $\bU \bD \bV^t$ be the SVD of $\bX$ where $\bU^t \bU = \bV ^t \bV = \bI$.
Note then $\xtx = \bV \bD^2 \bV^t$ and $\xtxinv = \bV \bD^{-2} \bV^t$ so that
the ordinary least squares estimate satisfies
$$
\hat \bY = \bX \xtxinv \bX^t \bY = \bU \bD \bV^t \bV \bD^{-2} \bV^t \bV \bD \bU^t \bY = \bU \bU^t \bY.
$$
Consider now the fitted values under ridge regression with $\bGamma = \gamma \bI$:
\begin{align*}
\hat \bY_{ridge}
& = \bX (\xtx + \gamma \bI)^{-1} \bX^t \bY \\
& = \bU \bD \bV^t (\bV \bD^2 \bV^t + \gamma \bI)^{-1} \bV \bD \bU^t \bY \\
& = \bU \bD \bV^t (\bV \bD^2 \bV^t + \gamma \bV \bV^t)^{-1} \bV \bD \bU^t \bY \\
& = \bU \bD \bV^t \{\bV (\bD^2 + \gamma \bI) \bV^t\}^{-1} \bV \bD \bU^t \bY \\
& = \bU \bD \bV^t \bV (\bD^2 + \gamma \bI)^{-1} \bV^t \bV \bD \bU^t \bY \\
& = \bU \bD (\bD^2 + \gamma \bI)^{-1} \bD \bU^t \bY\\
& = \bU \bW \bU^t \bY
\end{align*}
where the third line follows since $\bX$ is full column rank so that $\bV$ is $p\times p$ of full rank
and $\bV^{-1} = \bV^t$ so that $\bV^t \bV = \bV \bV^t = \bI$. Here $\bW$ is a diagonal matrix with elements
$$
\frac{D_i^2}{D_i^2 + \gamma}
$$
where $D_i^2$ are the eigenvalues of $\xtx$ (i.e., the $D_i$ are the singular values of $\bX$).
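As a numerical check (a minimal sketch, not from the text), the code below verifies on the column-centered \texttt{swiss} predictors that the ridge fitted values equal $\bU \bW \bU^t \bY$ with shrinkage factors $D_i^2 / (D_i^2 + \gamma)$; the value $\gamma = 10$ is arbitrary.
\begin{verbatim}
data(swiss)
y = swiss$Fertility - mean(swiss$Fertility)
x = scale(as.matrix(swiss[, -1]), center = TRUE, scale = FALSE)
gamma = 10
s = svd(x)                       ## x = U D V^t
w = s$d^2 / (s$d^2 + gamma)      ## diagonal elements of W
yhatSvd = s$u %*% (w * (t(s$u) %*% y))
yhatRidge = x %*% solve(t(x) %*% x + gamma * diag(ncol(x)), t(x) %*% y)
max(abs(yhatSvd - yhatRidge))    ## numerically zero
\end{verbatim}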
In the case where $\bX$ is not of full column rank, the same identity can be found, though it takes a bit more work. Now assume
that $\bX$ is of full row rank (i.e., that $n < p$ and there are no redundant subjects). Note that
$\bV$ does not have an inverse, while $\bU$ does (and $\bU^{-1} = \bU^t$). Further note, via
the Woodbury theorem (where $\theta = 1/\gamma$):
\begin{align*}
(\xtx + \gamma \bI)^{-1} & =
\theta \bI - \theta^2 \bX^t (\bI + \theta \bX \bX^t)^{-1} \bX \\
& = \theta \bI - \theta^2 \bV \bD \bU^t (\bU \bU^t + \theta \bU \bD^2 \bU^t)^{-1} \bU \bD \bV^t \\
& = \theta \bI - \theta^2 \bV \bD \bU^t \{\bU(\bI + \theta \bD^2) \bU^t\}^{-1} \bU \bD \bV^t \\
& = \theta \bI - \theta^2 \bV \bD \bU^t \{\bU(\bI + \theta \bD^2)^{-1} \bU^t\} \bU \bD \bV^t \\
& = \theta \bI - \theta^2 \bV \bD (\bI + \theta \bD^2)^{-1} \bD \bV^t \\
& = \theta \bI - \theta^2 \bV \tilde \bD \bV^t
\end{align*}
where $\tilde \bD$ is diagonal with entries $D_i^2 / (1 + \theta D_i^2)$ where $D_i$ are the
diagonal entries of $\bD$. Then:
\begin{align*}
\hat \bY_{Ridge}
& = \bX (\xtx + \gamma \bI)^{-1} \bX^t \bY \\
& = \bU \bD \bV^t (\theta \bI - \theta^2 \bV \tilde \bD \bV^t) \bV \bD \bU^t \bY \\
& = \bU \bD (\theta \bI - \theta^2 \tilde \bD)\bD \bU^t \bY \\
& = \bU \bW \bU^t \bY
\end{align*}
Thus we've covered the full row and column rank cases. (Omitting the instance where $\bX$ is neither
full row nor column rank.)
%Note that the inverse necessary for ridge regression
%can be obtained easily in the event that there are few rows but lots of columns. By the Woodbury %theorem (mentioned
%several times in this text), we have that
%$$
%(\xtx + \bGamma)^{-1} = \bGamma^{-1} - \bGamma^{-1} \bX^t (\bI + \bX \bGamma^{-1} \bX^t)^{-1} \bX %\bGamma^{-1}
%$$
%This identity is especially convenient when $\bGamma = \gamma \bI$. Note that the inverse necessary has the dimension of
%$\bX \bX^t$, which is the easier dimension to work with when $p$ is larger than $n$.
\subsection{Coding example}
In the example below, we use the \texttt{swiss} data set to illustrate
fitting ridge regression. In this example, penalization isn't really
necessary, so the code serves mainly to illustrate the fitting.
Notice that \texttt{lm.ridge} and our code give slightly different
answers. This is due to different scaling options for the design matrix.
\begin{verbatim}
data(swiss)
y = swiss[,1]
x = swiss[,-1]
y = y - mean(y)
x = apply(x, 2, function(z) (z - mean(z)) / sd(z))
n = length(y); p = ncol(x)
##get ridge regression estimates for varying lambda
lambdaSeq = seq(0, 100, by = .1)
betaSeq = sapply(lambdaSeq, function(l) solve(t(x) %*% x + l * diag(rep(1, p)), t(x) %*% y))
plot(range(lambdaSeq), range(betaSeq), type = "n", xlab = "lambda", ylab = "Beta")
for (i in 1 : p) lines(lambdaSeq, betaSeq[i,])
##Use R's function for Ridge regression
library(MASS)
fit = lm.ridge(y ~ x, lambda = lambdaSeq)
plot(fit)
\end{verbatim}
\section{Lasso regression}
The Lasso has been something of a revolution in statistics and biostatistics
of late. The central idea of the Lasso is to use a penalty that forces
some coefficients to be exactly zero. For centered $\bY$ and centered and scaled $\bX$,
consider minimizing
$$
|| \bY - \bX \bbeta||^2
$$
subject to $\sum_{i=1}^p | \beta_i| < t$. The Lagrangian form of this minimization
can be written as minimizing
$$
|| \bY - \bX \bbeta||^2 + \lambda \sum_{i=1}^p | \beta_i|.
$$
Here $\lambda$ is a penalty parameter. Because the Lasso constraint region,
$\{\bbeta : \sum_{i=1}^p | \beta_i| \leq t\}$, has sharp corners on the axes,
the Lasso has a tendency to set parameters exactly to zero. Thus, it is
thought of as doing model selection along with penalization.
Moreover, the Lasso handles the $p > n$ problem. Finally, it's
a convex optimization problem, so that numerically solving
for the Lasso is stable. We can more generally specify the penalized criterion
as
$$
|| \bY - \bX \bbeta||^2 + \lambda \sum_{i=1}^p | \beta_i |^q
$$
for $q > 0$. We obtain a case of ridge regression when $q=2$ and
the Lasso when $q=1$. Since $(\sum_{i=1}^p | \beta_i |^q)^{1/q}$ is a norm (for $q \geq 1$),
usually called the $l_q$ norm, the various forms of
regression are often called $l_q$ regression. For example,
ridge regression could be called $l_2$ regression, the Lasso
$l_1$ regression and so on. We could write the penalized regression estimate as
$$
|| \bY - \bX \bbeta||^2 + \lambda ||\bbeta||_q^q
$$
where $||\cdot||_q$ is the $l_q$ norm.
You can visualize the constraint regions easily using Wolfram Alpha:
\href{http://www.wolframalpha.com/input/?i=abs\%28x1\%29+\%2B+abs\%28x2\%29+\%3D+1}{$|x_1| + |x_2|$},
\href{http://www.wolframalpha.com/input/?i=abs\%28x1\%29\%5E2+\%2B+abs\%28x2\%29\%5E2+\%3D+1}{$|x_1|^2 + |x_2|^2$},
\href{http://www.wolframalpha.com/input/?i=abs\%28x1\%29\%5E0.5+\%2B+abs\%28x2\%29\%5E0.5+\%3D+1}{$|x_1|^{0.5} + |x_2|^{0.5}$},
and \href{http://www.wolframalpha.com/input/?i=plot+abs\%28x1\%29\%5E4+\%2B+abs\%28x2\%29\%5E4+\%3D+1}{$|x_1|^4 + |x_2|^4$}.
Notice that as $q$ tends to zero, the constraint region concentrates its mass on the axes, whereas as $q$ tends to infinity, it tends
to a square. The limit as $q$ tends to 0 is called the $l_0$ norm, which just penalizes the number
of non-zero coefficients.
Just like with ridge regression, the Lasso has a Bayesian representation. Let
the priors on the $\beta_i$ be iid Laplace distributions with mean 0, which have
density $\frac{\theta}{2}\exp(- \theta |\beta_i|)$ and are denoted Laplace$(0, \theta)$.
Then the Lasso estimate is
the posterior mode assuming $\by \sim N(\bX \bbeta, \sigma^2 I)$ and
$\beta_i \sim_{iid} \mbox{Laplace}(0, \lambda / 2\sigma^2)$. Then minus twice
the log of the posterior for $\bbeta$, conditioning on $\sigma$, is proportional to
$$
||\bY - \bX \bbeta||^2 + \lambda ||\bbeta||_{1}.
$$
The connection with Bayesian statistics is somewhat loose for lasso regression.
While the Lasso is the posterior mode under a specific prior, whether or not that
prior makes sense from a Bayesian perspective is not clear. Furthermore, the full
posterior for a parameter in the model
is averaged over several sparse models, so is actually not sparse.
Also, the posterior mode is conditioned on $\sigma$ under these assumptions, whereas a Bayesian analysis would usually
take into account the full posterior.
\subsection{Coding example}
Let's give an example of coding the Lasso. Here, because the optimization
problem doesn't have a closed-form solution, we'll rely on the \texttt{lars} package from
Tibshirani and Efron. Also assume the \texttt{x} and \texttt{y} from the ridge regression example.
\begin{verbatim}
library(lars)
fit2 = lars(x, y, type = c("lasso"))
plot(fit2)
\end{verbatim}
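As a follow-up sketch (not from the text, and assuming the \texttt{fit2} object above), one can pull coefficient estimates along the Lasso path with \texttt{lars}'s \texttt{predict} method, indexing the path by the fraction of the maximal $l_1$ norm.
\begin{verbatim}
## coefficients at half of the maximal l1 norm
predict(fit2, s = 0.5, type = "coefficients", mode = "fraction")$coefficients
\end{verbatim}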