\chapter{Least squares}
In this chapter we develop least squares.
\section{Basics}
\href{https://www.youtube.com/watch?v=tcmqwFDm-2c&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=21}{Watch this video before beginning.}
Let $\bX$ be a design matrix; notationally, its
elements and column vectors are:
$$
\bX =
\left[
\begin{array}{ccc}
x_{11} & \ldots & x_{1p} \\
\vdots & \ldots & \vdots \\
x_{n1} & \ldots & x_{np}
\end{array}
\right]
= [\bx_1 \ldots \bx_p].
$$
We are assuming that $n \geq p$ and $\bX$ is of full (column) rank.
Consider ordinary least squares
\begin{equation}
\label{eq:ols}
|| \by - \bX \bbeta ||^2 = (\by - \bX \bbeta)^t (\by - \bX \bbeta)
= \by^t \by - 2 \by^t \bX \bbeta + \bbeta^t \xtx \bbeta.
\end{equation}
To minimize \eqref{eq:ols} with respect to $\bbeta$,
we use our matrix derivative results from Chapter \ref{chap:background}:
$$
\frac{d}{d\bbeta}~\eqref{eq:ols}
= -2 \bX^t \by + 2 \xtx \bbeta.
$$
Setting this derivative to zero leads to the so-called normal equations:
$$\xtx \bbeta = \bX^t \by.$$
Recall that $\bX^t \bX$ retains the same rank as $\bX$. Therefore,
it is a full rank $p\times p$ matrix and hence is invertible. We
can then solve the normal equations as:
\begin{equation}
\label{eq:betahat}
\hat \bbeta = \betahat.
\end{equation}
The Hessian of \eqref{eq:ols} is simply $2\xtx$, which is positive
definite. (This is clear since for any non-zero vector $\ba$, we have that
$\bX \ba$ is non-zero because $\bX$ is of full column rank, and then
$\ba^t \bX^t \bX \ba = ||\bX \ba||^2 > 0$.) Thus, the root of
our derivative is indeed a minimum.
\subsection{Coding example}
\href{https://www.youtube.com/watch?v=walb3qHidJQ&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=22}{Watch this video before beginning.}
\begin{verbatim}
> y = swiss$Fertility
> x = as.matrix(cbind(1, swiss[,-1]))
> solve(t(x) %*% x, t(x) %*% y)
[,1]
1 66.9151817
Agriculture -0.1721140
Examination -0.2580082
Education -0.8709401
Catholic 0.1041153
Infant.Mortality 1.0770481
> summary(lm(y ~ x - 1))$coef
Estimate Std. Error t value Pr(>|t|)
x1 66.9151817 10.70603759 6.250229 1.906051e-07
xAgriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
xExamination -0.2580082 0.25387820 -1.016268 3.154617e-01
xEducation -0.8709401 0.18302860 -4.758492 2.430605e-05
xCatholic 0.1041153 0.03525785 2.952969 5.190079e-03
xInfant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
\end{verbatim}
\section{A second derivation}
\href{https://www.youtube.com/watch?v=4aCJQiacet8&index=23&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
If you know the answer first, it's possible to derive the
minimum of the least squares equation without taking any derivatives.
\begin{eqnarray*}
|| \by - \bX \bbeta ||^2 &=& || \by - \bX \hat \bbeta + \bX\hat \bbeta - \bX \bbeta ||^2 \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 (\by - \bX \hat \bbeta)^t (\bX \hat \bbeta - \bX \bbeta)
+ ||\bX \hat \bbeta - \bX \bbeta||^2 \\
& \geq & ||\by - \bX \hat \bbeta||^2
+ 2 (\by - \bX \hat \bbeta)^t (\bX \hat \bbeta - \bX \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 (\by - \yhat)^t \bX (\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bI - \hatmat )^t \bX (\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bI - \hatmat ) \bX (\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bX - \hatmat \bX )(\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bX - \bX )(\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
\end{eqnarray*}
Thus, any value of $\bbeta$ other than $\hat \bbeta$ that we plug into the least squares equation
is going to give us a larger norm than if we plug in $\hat \bbeta$, so that
$\hat \bbeta$ is the unique minimizer.
Notice that going from line 5 to line 6, we used the fact that $\bI - \hatmat$ is symmetric. (It is also idempotent.) We also used a fact that is very
useful in general: $\bI - \hatmat$ is orthogonal to any linear combination
of the columns of $\bX$. This is true since if $\bX \ba$ is such a combination,
then
$$
(\bI - \hatmat)\bX \ba = (\bX - \hatmat \bX)\ba = (\bX - \bX) \ba = 0.
$$
This fact is extremely handy in working with linear models.
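As a quick numerical illustration of this fact (a minimal sketch of our own, reusing the swiss design matrix from the coding example above), every entry of $(\bI - \hatmat)\bX$ should be zero up to numerical error:
\begin{verbatim}
y = swiss$Fertility
x = as.matrix(cbind(1, swiss[,-1]))
h = x %*% solve(t(x) %*% x) %*% t(x)   # the "hat" matrix
## (I - H) X should be a matrix of zeros, so this maximum absolute
## entry should be on the order of machine precision
max(abs((diag(nrow(x)) - h) %*% x))
\end{verbatim}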
\section{Connection with linear regression}
\href{https://www.youtube.com/watch?v=FcWYZRAHFfM&index=24&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
\label{sec:lslin}
Recall that the slope from linear regression worked out to be
$$
\ip{\bx - \bar x \bone_n}{\bx - \bar x \bone_n}^{-1}\ip{\bx - \bar x\bone_n}{\by - \bar y \bone_n}
= \hat \sigma^{-2}_{X} \hat \sigma^2_{XY}
$$
where $\hat \sigma^2_{XY}$ is the empirical covariance between $X$ and $Y$ and
$\hat \sigma^{2}_{X}$ is the empirical variance of $X$. (One could equivalently rewrite this in terms of the correlation.) In this form, the slope is the
covariance between $x$ and $y$ divided by the variance of the $x$'s. Let's consider
extending this to our matrix results.
Let $\bX = [\bone_n ~ \bX_1]$, so that $\bX$ contains an intercept and then $p-1$ other
regressors. Similarly, let $\bbeta = (\beta_0, \ldots, \beta_{p-1})^t = (\beta_0 ~ \bbeta_1^t)^t$. Consider now least squares
$$
||\by - \bX \bbeta||^2
=||\by - \bone_n \beta_0 - \bX_1 \bbeta_1||^2.
$$
If we were to hold $\bbeta_1$ fixed, we would be faced with a mean-only regression
problem, and the minimizing value of $\beta_0$, say $\beta_0(\bbeta_1)$, is
$$
\frac{1}{n}(\by - \bX_1 \bbeta_1)^t \bone_n
= \bar y - \bar \bx^t \bbeta_1
$$
where $\bar \bx$ is the vector of column means of $\bX_1$. Plugging this back in
for $\beta_0$ in our least squares equation, we get
$$
||\by - \bone_n \bar y - (\bX_1 - \bone_n \bar \bx^t )\bbeta_1||^2
= || \tilde \by - \tilde \bX\bbeta_1||^2
$$
where $\tilde \by$ and $\tilde \bX$ are the centered versions of
$\by$ and $\bX_1$. This is again just the least squares equation with
the centered variables and thus we get that
$$
\hat \bbeta_1 = (\tilde \bX^t \tilde \bX)^{-1} \tilde \bX^t \tilde \by
= \left(\frac{1}{n-1}\tilde \bX^t \tilde \bX\right)^{-1}
\frac{1}{n-1} \tilde \bX^t \tilde \by.
$$
The matrix $\frac{1}{n-1}\tilde \bX^t \tilde \bX$ is the empirical
variance/covariance matrix of the columns of $\bX_1$, while
$\frac{1}{n-1} \tilde \bX^t \tilde \by$ is the vector of empirical covariances
between $\by$ and the columns of $\bX_1$. Therefore, if we include an intercept,
our slope estimate is
$$
\hat \bSigma_{XX}^{-1} \hat \bSigma_{XY},
$$
the inverse of the variance/covariance matrix associated with $\bX_1$ times the
vector of covariances between the columns of $\bX_1$ and $\by$. This draws an exact
parallel with the result from simple linear regression.
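As a numerical check of this parallel (a sketch of our own; the variable names below are not part of the original examples), the covariance-based slopes should agree with the slopes from a fit that includes an intercept:
\begin{verbatim}
y = swiss$Fertility
x1 = as.matrix(swiss[,-1])   # the regressors, without an intercept
sxx = cov(x1)                # empirical variance/covariance matrix of the columns
sxy = cov(x1, y)             # empirical covariances of the columns with the outcome
## the two sets of slopes below should agree
solve(sxx, sxy)
coef(lm(y ~ x1))[-1]
\end{verbatim}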
\section{Projections}
\href{https://www.youtube.com/watch?v=UM7OX4qihGs&index=25&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
The vector of fitted values is
$$
\hat \by = \yhat
$$
and the vector of residuals is given by
$$
\be = \by - \hat \by = (\bI - \hatmat) \by.
$$
Thus, multiplication by the matrix $\hatmat$ takes any vector in
$\mathbb{R}^n$ and produces the fitted values. Multiplication
by $(\bI - \hatmat)$ produces the residuals. Notice that
since the vector $\hat \by$ is a linear combination of the columns of $\bX$,
it is orthogonal to the residuals:
$$
\hat \by^t \be = \by ^ t \hatmat (\bI - \hatmat)\by = 0.
$$
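A quick check of this orthogonality (a sketch using lm on the swiss data rather than the explicit matrix formulas):
\begin{verbatim}
fit = lm(Fertility ~ ., data = swiss)
## the inner product of the fitted values and the residuals should be
## zero up to numerical error
sum(fitted(fit) * resid(fit))
\end{verbatim}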
It is useful to think of least squares in terms of projections.
Consider the column space of the design matrix,
$\Gamma = \{ \bX \bbeta ~|~ \bbeta \in \mathbb{R}^p\}$. This $p$
dimensional space lives in $\mathbb{R}^n$, so think of a plane in
$\mathbb{R}^3$. Consider the vector $\by$ which lives in $\mathbb{R}^n$. Multiplication by the matrix $\hatmat$ projects $\by$ into $\Gamma$. That
is,
$$
\by \rightarrow \hatmat\by
$$
is the linear projection map between $\mathbb{R}^n$ and $\Gamma$.
The point $\hat \by$ is the point in $\Gamma$ that is closest to $\by$ and
$\hat \bbeta$ is the specific linear combination of the columns
of $\bX$ that yields $\hat \by$.
$\be$ is the vector connecting $\by$ and $\hat \by$, and it is
orthogonal to all elements in $\Gamma$.
Thinking in these terms helps us interpret statistical aspects of least squares.
First, if $\bW$ is any $p\times p$ invertible matrix, then the
fitted values, $\hat \by$, will be the same whether we use the design matrix
$\bX$ or $\bX \bW$. This is because the spaces
$$
\{\bX \bbeta ~|~ \bbeta \in \mathbb{R}^p\}
$$
and
$$
\{\bX \bW \bgamma ~|~ \bgamma \in \mathbb{R}^p\}
$$
are the same, since if $\ba = \bX \bbeta$ then $\ba = \bX \bW \bgamma$ via the relationship $\bgamma = \bW^{-1} \bbeta$, and thus any element of the first
space is in the second. The same argument applies in the other direction,
and thus the two spaces are the same.
Therefore, any linear reorganization of the columns of $\bX$ results in the
same column space and thus the same fitted values. Furthermore, any
addition of redundant columns to $\bX$ adds nothing to the column space,
and thus it's clear what the fit should be in the event that $\bX$ is
not full rank: any maximal linearly independent subset of the columns of $\bX$ spans
the same column space and thus yields the same fitted values.
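To illustrate the first point (a sketch of our own, where $\bW$ is a random and hence almost surely invertible matrix), the fitted values from $\bX$ and from $\bX \bW$ should coincide:
\begin{verbatim}
y = swiss$Fertility
x = as.matrix(cbind(1, swiss[,-1]))
set.seed(1)
w = matrix(rnorm(ncol(x)^2), ncol(x))   # a p x p matrix, invertible with probability 1
xw = x %*% w
yhat1 = x %*% solve(t(x) %*% x, t(x) %*% y)
yhat2 = xw %*% solve(t(xw) %*% xw, t(xw) %*% y)
## the two sets of fitted values should agree up to numerical error
max(abs(yhat1 - yhat2))
\end{verbatim}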
\section{Full row rank case}
In the case where $\bX$ is $n\times n$ and of full rank,
the columns of $\bX$ form a basis for $\mathbb{R}^n$.
In this case, $\hat \by = \by$, since $\by$ lives
in the space spanned by the columns of $\bX$. All
the linear model accomplishes is a lossless
linear reorganization of $\by$. This is perhaps surprisingly
useful, especially when the columns of $\bX$ are
orthonormal ($\bX^t\bX = \bI$). In this case, the
function that takes the outcome vector and converts
it to the coefficients is called a "transform". The
best known examples are the Fourier and
wavelet transforms.
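As a small illustration (a sketch of our own, taking the Q factor of the QR decomposition of a random square matrix as an orthonormal design), when $\bX^t\bX = \bI$ the coefficients are simply $\bX^t \by$ and the fit reconstructs $\by$ exactly:
\begin{verbatim}
set.seed(1)
n = 16
x = qr.Q(qr(matrix(rnorm(n^2), n)))   # an n x n matrix with orthonormal columns
y = rnorm(n)
beta = t(x) %*% y                     # the "transform" of y, since t(x) %*% x = I
## the reconstruction error should be zero up to numerical error
max(abs(y - x %*% beta))
\end{verbatim}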
\section{A third derivation}
\href{https://www.youtube.com/watch?v=7_-ztDF4cwk&index=26&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
In this section we give a third derivation
of least squares.
For vectors $\ba$ (outcome) and $\bb$ (predictor), define the coefficient function as:
$$
c(\ba, \bb)
= \frac{\ip{\ba}{\bb}}{|| \bb ||^2}.
$$
and the residual function as
$$
e(\ba, \bb) = \ba - c(\ba, \bb) \bb
$$
We argue that the least squares estimate of outcome
$\by$ for predictor matrix $\bX = [\bx_1 \ldots \bx_p]$
is obtained by taking successive residuals, in the following sense. Consider the least squares
equation holding $\beta_2, \ldots, \beta_p$ fixed:
\begin{equation}
\label{eq:itresid}
|| \by - \bx_1 \beta_1 - \ldots - \bx_p \beta_p||^2.
\end{equation}
This is greater than or equal to the value obtained
if we replace $\beta_1$ by its
estimate with the remaining coefficients held fixed. That estimate
is:
$$
c(\by - \bx_2 \beta_2 - \ldots - \bx_p \beta_p, ~ \bx_1).
$$
Plugging that back into the least squares equation we
get
\begin{equation}
\label{eq:itresid2}
\eqref{eq:itresid}\geq ||e(\by, \bx_1) -
e(\bx_2, \bx_1) \beta_2 - \ldots - e(\bx_p, \bx_1) \beta_p||^2.
\end{equation}
Thus we have a new least squares equation with
all residuals having "removed" $\bx_1$ from all other
regressors and the outcome. Then we can repeat this
process again holding $\beta_3, \ldots, \beta_p$
fixed and obtain
$$
\eqref{eq:itresid2}\geq ||e\{e(\by, \bx_1), e(\bx_2, \bx_1)\} -
e\{e(\bx_3, \bx_1), e(\bx_2, \bx_1)\} \beta_3 - \ldots - e\{e(\bx_p, \bx_1), e(\bx_{2}, \bx_1)\} \beta_p||^2.
$$
This could then be iterated to the $p^{th}$ regressor.
Moreover, because we know the same inequalities will be
obtained no matter what order we get to the $p^{th}$
regressor, we can conclude that the order of taking
residuals doesn't matter. Furthermore, picking the
$p^{th}$ coefficient was arbitrary as well, so the
same conclusion applies to all regressors: the
least squares estimate for each coefficient can be
obtained by iteratively taking residuals with respect to all of the
other regressors (in any order).
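The following sketch (the function and variable names are ours, not part of the text) checks this claim for the last coefficient using the swiss data: successively residualizing the outcome and the last regressor against the remaining columns, and then doing regression through the origin, reproduces the corresponding entry of the full least squares fit.
\begin{verbatim}
y = swiss$Fertility
x = as.matrix(cbind(1, swiss[,-1]))
## the coefficient and residual functions c(a, b) and e(a, b) from above
cf = function(a, b) sum(a * b) / sum(b^2)
ef = function(a, b) a - cf(a, b) * b
ey = y                             # the outcome, residualized as we go
ep = x[, ncol(x)]                  # the last regressor, residualized as we go
ex = x[, -ncol(x)]                 # the other regressors
for (j in 1 : ncol(ex)) {
  b = ex[, j]                      # the current (already residualized) regressor
  ey = ef(ey, b)
  ep = ef(ep, b)
  ex = apply(ex, 2, ef, b = b)     # remove it from the remaining regressors too
}
## regression through the origin on the residuals; this should equal the
## last entry of the full least squares estimate
cf(ey, ep)
solve(t(x) %*% x, t(x) %*% y)[ncol(x)]
\end{verbatim}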
This is interesting for many reasons. First, it
shows that one needs only
regression through the origin to develop full
multivariable regression. Secondly, it helps
us interpret our regression coefficients and
how they are "adjusted" for the other variables.
There was nothing in particular about using vectors.
If $\bX = [\bX_1 ~ \bX_2]$ consists of two submatrices with
$p_1$ and $p_2$ columns, and $\bbeta = (\bbeta_1^t ~ \bbeta_2^t)^t$, consider minimizing
$$
|| \by - \bX_1 \bbeta_1 - \bX_2 \bbeta_2||^2.
$$
If $\bbeta_2$ were held fixed, this would be minimized
at
$$
\bbeta_1(\bbeta_2) =
(\bX_1^t \bX_1)^{-1} \bX_1^t (\by - \bX_2 \bbeta_2).
$$
Plugging that back in, we obtain the smaller (or equal) quantity
$$
||\{\bI - \bX_1 (\bX_1^t \bX_1)^{-1} \bX_1^t\} \by
- \{\bI - \bX_1 (\bX_1^t \bX_1)^{-1} \bX_1^t \}\bX_2 \bbeta_2
||^2.
$$
This is the least squares criterion with the residual of $\by$, having
regressed out $\bX_1$, as the outcome, and the residual matrix of
$\bX_2$, having regressed $\bX_1$ out of every column, as the design matrix.
Thus our $\bbeta_2$ estimate will be obtained by regressing
the former on the latter. Again, this explains
why $\bbeta_2$'s estimate has been adjusted for
$\bX_1$: both the outcome and the $\bX_2$ predictors
have been orthogonalized to the space spanned
by the columns of $\bX_1$!
\subsection{Coding example}
\href{https://www.youtube.com/watch?v=qRfydAddY7M&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=27}{Watch this video before beginning.}
\begin{verbatim}
> y = swiss$Fertility
> x = as.matrix(cbind(1, swiss[,-1]))
> x1 = x[,1 : 3]
> x2 = x[,4 : 6]
> solve(t(x) %*% x, t(x) %*% y)
[,1]
1 66.9151817
Agriculture -0.1721140
Examination -0.2580082
Education -0.8709401
Catholic 0.1041153
Infant.Mortality 1.0770481
> ey = y - x1 %*% solve(t(x1) %*% x1, t(x1) %*% y)
> ex2 = x2 - x1 %*% solve(t(x1) %*% x1) %*% t(x1) %*% x2
> solve(t(ex2) %*% ex2, t(ex2) %*% ey)
[,1]
Education -0.8709401
Catholic 0.1041153
Infant.Mortality 1.0770481
\end{verbatim}