\chapter{Least squares}
In this chapter we develop least squares.
\section{Basics}
\href{https://www.youtube.com/watch?v=tcmqwFDm-2c&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=21}{Watch this video before beginning.}
Let $\bX$ be a design matrix; notationally, its
elements and column vectors are:
$$
\bX =
\left[
\begin{array}{ccc}
x_{11} & \ldots & x_{1p} \\
\vdots & \ldots & \vdots \\
x_{n1} & \ldots & x_{np}
\end{array}
\right]
= [\bx_1 \ldots \bx_p].
$$
We are assuming that $n \geq p$ and $\bX$ is of full (column) rank.
Consider ordinary least squares
\begin{equation}
\label{eq:ols}
|| \by - \bX \bbeta ||^2 = (\by - \bX \bbeta)^t (\by - \bX \bbeta)
= \by^t \by - 2 \by^t \bX \bbeta + \bbeta^t \xtx \bbeta.
\end{equation}
To minimize \eqref{eq:ols} with respect to $\bbeta$,
we use our matrix derivative results from Chapter \ref{chap:background}:
$$
\frac{d}{d\bbeta}~\eqref{eq:ols}
= -2 \bX^t \by + 2 \xtx \bbeta.
$$
Setting this derivative to zero leads to the so-called normal equations:
$$\xtx \bbeta = \bX^t \by.$$
Recall that $\bX^t \bX$ retains the same rank as $\bX$. Therefore,
it is a full rank $p\times p$ matrix and hence is invertible. We
can then solve the normal equations as:
\begin{equation}
\label{eq:betahat}
\hat \bbeta = \betahat.
\end{equation}
The Hessian of \eqref{eq:ols} is simply $2\xtx$, which is positive
definite. (This is clear since for any non-zero vector $\ba$, we have that
$\bX \ba$ is non-zero because $\bX$ is of full column rank, and then
$\ba^t \bX^t \bX \ba = ||\bX \ba||^2 > 0$.) Thus, the root of
our derivative is indeed a minimum.
\subsection{Coding example}
\href{https://www.youtube.com/watch?v=walb3qHidJQ&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=22}{Watch this video before beginning.}
\begin{verbatim}
> y = swiss$Fertility
> x = as.matrix(cbind(1, swiss[,-1]))
> solve(t(x) %*% x, t(x) %*% y)
[,1]
1 66.9151817
Agriculture -0.1721140
Examination -0.2580082
Education -0.8709401
Catholic 0.1041153
Infant.Mortality 1.0770481
> summary(lm(y ~ x - 1))$coef
Estimate Std. Error t value Pr(>|t|)
x1 66.9151817 10.70603759 6.250229 1.906051e-07
xAgriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
xExamination -0.2580082 0.25387820 -1.016268 3.154617e-01
xEducation -0.8709401 0.18302860 -4.758492 2.430605e-05
xCatholic 0.1041153 0.03525785 2.952969 5.190079e-03
xInfant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
\end{verbatim}
\section{A second derivation}
\href{https://www.youtube.com/watch?v=4aCJQiacet8&index=23&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
If you know the answer first, it's possible to derive the
minimum of the least squares equation without taking any derivatives.
\begin{eqnarray*}
|| \by - \bX \bbeta ||^2 &=& || \by - \bX \hat \bbeta + \bX\hat \bbeta - \bX \bbeta ||^2 \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 (\by - \bX \hat \bbeta)^t (\bX \hat \bbeta - \bX \bbeta)
+ ||\bX \hat \bbeta - \bX \bbeta||^2 \\
& \geq & ||\by - \bX \hat \bbeta||^2
+ 2 (\by - \bX \hat \bbeta)^t (\bX \hat \bbeta - \bX \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 (\by - \yhat)^t \bX (\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bI - \hatmat )^t \bX (\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bI - \hatmat ) \bX (\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bX - \hatmat \bX )(\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
+ 2 \by^t(\bX - \bX )(\hat \bbeta - \bbeta) \\
& = & ||\by - \bX \hat \bbeta||^2
\end{eqnarray*}
Thus, any value of $\bbeta$ other than $\hat \bbeta$ that we plug into the least squares equation
is going to give us a larger norm than if we plug in $\hat \bbeta$, so that
$\hat \bbeta$ is the unique minimizer.
Notice that going from line 5 to line 6, we used the fact that $\bI - \hatmat$ is symmetric. (It is also idempotent.) We also used a fact that is very
useful in general: $\bI - \hatmat$ is orthogonal to any linear combination
of the columns of $\bX$. This is true since if $\bX \ba$ is such a combination,
then
$$
(\bI - \hatmat)\bX \ba = (\bX - \hatmat \bX)\ba = (\bX - \bX) \ba = 0.
$$
This fact is extremely handy in working with linear models.
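As a quick numerical illustration of this fact (a minimal sketch of our own, reusing the swiss design matrix from the coding example above), every entry of $(\bI - \hatmat)\bX$ should be zero up to numerical error:
\begin{verbatim}
y = swiss$Fertility
x = as.matrix(cbind(1, swiss[,-1]))
h = x %*% solve(t(x) %*% x) %*% t(x)   # the "hat" matrix
## (I - H) X should be a matrix of zeros, so this maximum absolute
## entry should be on the order of machine precision
max(abs((diag(nrow(x)) - h) %*% x))
\end{verbatim}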
\section{Connection with linear regression}
\href{https://www.youtube.com/watch?v=FcWYZRAHFfM&index=24&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
\label{sec:lslin}
Recall that the slope from linear regression worked out to be
$$
\ip{\bx - \bar x \bone_n}{\bx - \bar x \bone_n}^{-1}\ip{\bx - \bar x\bone_n}{\by - \bar y \bone_n}
= \hat \sigma^{-2}_{X} \hat \sigma^2_{XY}
$$
where $\hat \sigma^2_{XY}$ is the empirical covariance between $X$ and $Y$ and
$\hat \sigma^{2}_{X}$ is the empirical variance of $X$. (One could equivalently rewrite this in terms of the correlation.) In this form, the slope is the
covariance between $x$ and $y$ divided by the variance of the $x$'s. Let's consider
extending this to our matrix results.
Let $\bX = [\bone_n ~ \bX_1]$, so that $\bX$ contains an intercept and then $p-1$ other
regressors. Similarly, let $\bbeta = (\beta_0, \ldots, \beta_{p-1})^t = (\beta_0 ~ \bbeta_1^t)^t$. Consider now least squares
$$
||\by - \bX \bbeta||^2
=||\by - \bone_n \beta_0 - \bX_1 \bbeta_1||^2.
$$
If we were to hold $\bbeta_1$ fixed, we would be faced with a mean-only regression
problem, and the minimizing value of $\beta_0$, say $\beta_0(\bbeta_1)$, is
$$
\frac{1}{n}(\by - \bX_1 \bbeta_1)^t \bone_n
= \bar y - \bar \bx^t \bbeta_1
$$
where $\bar \bx$ is the vector of column means of $\bX_1$. Plugging this back in
for $\beta_0$ in our least squares equation, we get
$$
||\by - \bone_n \bar y - (\bX_1 - \bone_n \bar \bx^t )\bbeta_1||^2
= || \tilde \by - \tilde \bX\bbeta_1||^2
$$
where $\tilde \by$ and $\tilde \bX$ are the centered versions of
$\by$ and $\bX_1$. This is again just the least squares equation with
the centered variables and thus we get that
$$
\hat \bbeta_1 = (\tilde \bX^t \tilde \bX)^{-1} \tilde \bX^t \tilde \by
= \left(\frac{1}{n-1}\tilde \bX^t \tilde \bX\right)^{-1}
\frac{1}{n-1} \tilde \bX^t \tilde \by.
$$
The matrix $\frac{1}{n-1}\tilde \bX^t \tilde \bX$ is the empirical
variance/covariance matrix of the columns of $\bX_1$, while
$\frac{1}{n-1} \tilde \bX^t \tilde \by$ is the vector of empirical covariances
between $\by$ and the columns of $\bX_1$. Therefore, if we include an intercept,
our slope estimate is
$$
\hat \bSigma_{XX}^{-1} \hat \bSigma_{XY},
$$
the inverse of the variance/covariance matrix associated with $\bX_1$ times the
vector of covariances between the columns of $\bX_1$ and $\by$. This draws an exact
parallel with the result from simple linear regression.
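As a numerical check of this parallel (a sketch of our own; the variable names below are not part of the original examples), the covariance-based slopes should agree with the slopes from a fit that includes an intercept:
\begin{verbatim}
y = swiss$Fertility
x1 = as.matrix(swiss[,-1])   # the regressors, without an intercept
sxx = cov(x1)                # empirical variance/covariance matrix of the columns
sxy = cov(x1, y)             # empirical covariances of the columns with the outcome
## the two sets of slopes below should agree
solve(sxx, sxy)
coef(lm(y ~ x1))[-1]
\end{verbatim}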
\section{Projections}
\href{https://www.youtube.com/watch?v=UM7OX4qihGs&index=25&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
The vector of fitted values is
$$
\hat \by = \yhat
$$
and the vector of residuals is given by
$$
\be = \by - \hat \by = (\bI - \hatmat) \by.
$$
Thus, multiplication by the matrix $\hatmat$ takes any vector in
$\mathbb{R}^n$ and produces the fitted values. Multiplication
by $(\bI - \hatmat)$ produces the residuals. Notice that
since the vector $\hat \by$ is a linear combination of the columns of $\bX$,
it is orthogonal to the residuals:
$$
\hat \by^t \be = \by ^ t \hatmat (\bI - \hatmat)\by = 0.
$$
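A quick check of this orthogonality (a sketch using lm on the swiss data rather than the explicit matrix formulas):
\begin{verbatim}
fit = lm(Fertility ~ ., data = swiss)
## the inner product of the fitted values and the residuals should be
## zero up to numerical error
sum(fitted(fit) * resid(fit))
\end{verbatim}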
It is useful to think of least squares in terms of projections.
Consider the column space of the design matrix,
$\Gamma = \{ \bX \bbeta ~|~ \bbeta \in \mathbb{R}^p\}$. This $p$
dimensional space lives in $\mathbb{R}^n$, so think of a plane in
$\mathbb{R}^3$. Consider the vector $\by$ which lives in $\mathbb{R}^n$. Multiplication by the matrix $\hatmat$ projects $\by$ into $\Gamma$. That
is,
$$
\by \rightarrow \hatmat\by
$$
is the linear projection map between $\mathbb{R}^n$ and $\Gamma$.
The point $\hat \by$ is the point in $\Gamma$ that is closest to $\by$ and
$\hat \bbeta$ is the specific linear combination of the columns
of $\bX$ that yields $\hat \by$.
$\be$ is the vector connecting $\by$ and $\hat \by$, and it is
orthogonal to all elements in $\Gamma$.
Thinking in these terms helps us interpret statistical aspects of least squares.
First, if $\bW$ is any $p\times p$ invertible matrix, then the
fitted values, $\hat \by$, will be the same whether we use the design matrix
$\bX$ or $\bX \bW$. This is because the spaces
$$
\{\bX \bbeta ~|~ \bbeta \in \mathbb{R}^p\}
$$
and
$$
\{\bX \bW \bgamma ~|~ \bgamma \in \mathbb{R}^p\}
$$
are the same, since if $\ba = \bX \bbeta$ then $\ba = \bX \bW \bgamma$ via the relationship $\bgamma = \bW^{-1} \bbeta$, and thus any element of the first
space is in the second. The same argument applies in the other direction,
and thus the two spaces are the same.
Therefore, any linear reorganization of the columns of $\bX$ results in the
same column space and thus the same fitted values. Furthermore, any
addition of redundant columns to $\bX$ adds nothing to the column space,
and thus it's clear what the fit should be in the event that $\bX$ is
not full rank: any maximal linearly independent subset of the columns of $\bX$ spans
the same column space and thus yields the same fitted values.
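To illustrate the first point (a sketch of our own, where $\bW$ is a random and hence almost surely invertible matrix), the fitted values from $\bX$ and from $\bX \bW$ should coincide:
\begin{verbatim}
y = swiss$Fertility
x = as.matrix(cbind(1, swiss[,-1]))
set.seed(1)
w = matrix(rnorm(ncol(x)^2), ncol(x))   # a p x p matrix, invertible with probability 1
xw = x %*% w
yhat1 = x %*% solve(t(x) %*% x, t(x) %*% y)
yhat2 = xw %*% solve(t(xw) %*% xw, t(xw) %*% y)
## the two sets of fitted values should agree up to numerical error
max(abs(yhat1 - yhat2))
\end{verbatim}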
\section{Full row rank case}
In the case where $\bX$ is $n\times n$ and of full rank,
the columns of $\bX$ form a basis for $\mathbb{R}^n$.
In this case, $\hat \by = \by$, since $\by$ lives
in the space spanned by the columns of $\bX$. All
the linear model accomplishes is a lossless
linear reorganization of $\by$. This is perhaps surprisingly
useful, especially when the columns of $\bX$ are
orthonormal ($\bX^t\bX = \bI$). In this case, the
function that takes the outcome vector and converts
it to the coefficients is called a "transform". The
best known examples are the Fourier and
wavelet transforms.
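As a small illustration (a sketch of our own, taking the Q factor of the QR decomposition of a random square matrix as an orthonormal design), when $\bX^t\bX = \bI$ the coefficients are simply $\bX^t \by$ and the fit reconstructs $\by$ exactly:
\begin{verbatim}
set.seed(1)
n = 16
x = qr.Q(qr(matrix(rnorm(n^2), n)))   # an n x n matrix with orthonormal columns
y = rnorm(n)
beta = t(x) %*% y                     # the "transform" of y, since t(x) %*% x = I
## the reconstruction error should be zero up to numerical error
max(abs(y - x %*% beta))
\end{verbatim}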
\section{A third derivation}
\href{https://www.youtube.com/watch?v=7_-ztDF4cwk&index=26&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y}{Watch this video before beginning.}
In this section we give a third derivation
of least squares.
For vectors $\ba$ (outcome) and $\bb$ (predictor), define the coefficient function as:
$$
c(\ba, \bb)
= \frac{\ip{\ba}{\bb}}{|| \bb ||^2}.
$$
and the residual function as
$$
e(\ba, \bb) = \ba - c(\ba, \bb) \bb
$$
We argue that the least squares estimate of outcome
$\by$ for predictor matrix $\bX = [\bx_1 \ldots \bx_p]$
is obtained by taking successive residuals, in the following sense. Consider the least squares
equation holding $\beta_2, \ldots, \beta_p$ fixed:
\begin{equation}
\label{eq:itresid}
|| \by - \bx_1 \beta_1 - \ldots - \bx_p \beta_p||^2.
\end{equation}
This is greater than or equal to the value obtained
if we replace $\beta_1$ by its
estimate with the remaining coefficients held fixed. That estimate
is:
$$
c(\by - \bx_2 \beta_2 - \ldots - \bx_p \beta_p, ~ \bx_1).
$$
Plugging that back into the least squares equation we
get
\begin{equation}
\label{eq:itresid2}
\eqref{eq:itresid}\geq ||e(\by, \bx_1) -
e(\bx_2, \bx_1) \beta_2 - \ldots - e(\bx_p, \bx_1) \beta_p||^2.
\end{equation}
Thus we have a new least squares equation with
all residuals having "removed" $\bx_1$ from all other
regressors and the outcome. Then we can repeat this
process again holding $\beta_3, \ldots, \beta_p$
fixed and obtain
$$
\eqref{eq:itresid2}\geq ||e\{e(\by, \bx_1), e(\bx_2, \bx_1)\} -
e\{e(\bx_3, \bx_1), e(\bx_2, \bx_1)\} \beta_3 - \ldots - e\{e(\bx_p, \bx_1), e(\bx_{2}, \bx_1)\} \beta_p||^2.
$$
This could then be iterated to the $p^{th}$ regressor.
Moreover, because we know the same inequalities will be
obtained no matter what order we get to the $p^{th}$
regressor, we can conclude that the order of taking
residuals doesn't matter. Furthermore, picking the
$p^{th}$ coefficient was arbitrary as well, so the
same conclusion applies to all regressors: the
least squares estimate for each coefficient can be
obtained by iteratively taking residuals with respect to all of the
other regressors (in any order).
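The following sketch (the function and variable names are ours, not part of the text) checks this claim for the last coefficient using the swiss data: successively residualizing the outcome and the last regressor against the remaining columns, and then doing regression through the origin, reproduces the corresponding entry of the full least squares fit.
\begin{verbatim}
y = swiss$Fertility
x = as.matrix(cbind(1, swiss[,-1]))
## the coefficient and residual functions c(a, b) and e(a, b) from above
cf = function(a, b) sum(a * b) / sum(b^2)
ef = function(a, b) a - cf(a, b) * b
ey = y                             # the outcome, residualized as we go
ep = x[, ncol(x)]                  # the last regressor, residualized as we go
ex = x[, -ncol(x)]                 # the other regressors
for (j in 1 : ncol(ex)) {
  b = ex[, j]                      # the current (already residualized) regressor
  ey = ef(ey, b)
  ep = ef(ep, b)
  ex = apply(ex, 2, ef, b = b)     # remove it from the remaining regressors too
}
## regression through the origin on the residuals; this should equal the
## last entry of the full least squares estimate
cf(ey, ep)
solve(t(x) %*% x, t(x) %*% y)[ncol(x)]
\end{verbatim}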
This is interesting for many reasons. First, it
shows that one needs only
regression through the origin to develop full
multivariable regression. Secondly, it helps
us interpret our regression coefficients and
how they are "adjusted" for the other variables.
There was nothing in particular about using vectors.
If $\bX = [\bX_1 ~ \bX_2]$ consists of two submatrices with
$p_1$ and $p_2$ columns, and $\bbeta = (\bbeta_1^t ~ \bbeta_2^t)^t$, consider minimizing
$$
|| \by - \bX_1 \bbeta_1 - \bX_2 \bbeta_2||^2.
$$
If $\bbeta_2$ were held fixed, this would be minimized
at
$$
\bbeta_1(\bbeta_2) =
(\bX_1^t \bX_1)^{-1} \bX_1^t (\by - \bX_2 \bbeta_2).
$$
Plugging that back in, we obtain the smaller (or equal) quantity
$$
||\{\bI - \bX_1 (\bX_1^t \bX_1)^{-1} \bX_1^t\} \by
- \{\bI - \bX_1 (\bX_1^t \bX_1)^{-1} \bX_1^t \}\bX_2 \bbeta_2
||^2.
$$
This is the least squares criterion with the residual of $\by$, having
regressed out $\bX_1$, as the outcome, and the residual matrix of
$\bX_2$, having regressed $\bX_1$ out of every column, as the design matrix.
Thus our $\bbeta_2$ estimate will be obtained by regressing
the former on the latter. Again, this explains
why $\bbeta_2$'s estimate has been adjusted for
$\bX_1$: both the outcome and the $\bX_2$ predictors
have been orthogonalized to the space spanned
by the columns of $\bX_1$!
\subsection{Coding example}
\href{https://www.youtube.com/watch?v=qRfydAddY7M&list=PLpl-gQkQivXhdgUCdaUQcdb31CRe8Mm2y&index=27}{Watch this video before beginning.}
\begin{verbatim}
> y = swiss$Fertility
> x = as.matrix(cbind(1, swiss[,-1]))
> x1 = x[,1 : 3]
> x2 = x[,4 : 6]
> solve(t(x) %*% x, t(x) %*% y)
[,1]
1 66.9151817
Agriculture -0.1721140
Examination -0.2580082
Education -0.8709401
Catholic 0.1041153
Infant.Mortality 1.0770481
> ey = y - x1 %*% solve(t(x1) %*% x1, t(x1) %*% y)
> ex2 = x2 - x1 %*% solve(t(x1) %*% x1) %*% t(x1) %*% x2
> solve(t(ex2) %*% ex2, t(ex2) %*% ey)
[,1]
Education -0.8709401
Catholic 0.1041153
Infant.Mortality 1.0770481
\end{verbatim}