% 11_residuals_diagnostics.tex
\chapter{Residuals revisited}
For a good treatment of residuals and the other topics in this chapter,
see the book by Myers \citep{myers1990classical}.
Now, with some distributional results under our belt, we can discuss the distributional
properties of residuals. Note that, as a non-full-rank linear transformation of
normals, the residuals are singular normal. When $\by \sim N(\bX \bbeta, \sigma^2 \bI)$, the mean of the residuals is $\bzero$ and their variance is
$$
\Var(\be) = \Var\{(\bI - \bH_\bX)\by\} = \sigma^2 (\bI - \bH_\bX).
$$
As a consequence, the diagonal elements of $\bI - \bH_\bX$ are nonnegative,
and thus the diagonal elements of $\bH_\bX$ must be at most one. (A fact that
we'll use later.)
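The variance result above can be checked by simulation. The following sketch (illustrative code, not part of the original text) compares the empirical covariance of many simulated residual vectors to $\sigma^2 (\bI - \bH_\bX)$:

```r
## Simulation sketch: the empirical covariance of the residuals
## should approach sigma^2 (I - H). All names here are illustrative.
set.seed(1)
n = 5; sigma = 2
x = cbind(1, rnorm(n))                     # small design matrix
hatmat = x %*% solve(t(x) %*% x) %*% t(x)
nsim = 1e5
## since (I - H) X beta = 0, we can simulate y with mean zero
e = (diag(n) - hatmat) %*% matrix(rnorm(n * nsim, sd = sigma), n, nsim)
empirical = e %*% t(e) / nsim              # Monte Carlo estimate of Var(e)
theoretical = sigma^2 * (diag(n) - hatmat)
max(abs(empirical - theoretical))          # small; shrinks as nsim grows
```

The maximum entrywise discrepancy shrinks at the usual Monte Carlo rate as \texttt{nsim} grows.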
A problem with the residuals is that they carry the units of $\by$ and thus are
not comparable across experiments. Taking
$$\mathrm{Diag}\{S^2(\bI - \bH_\bX)\}^{-1/2}\be,$$
i.e., standardizing the residuals by their estimated standard deviations, does
get rid of the units. However, the resulting quantities are not exactly
$t$ statistics, since the numerator elements (the residuals) are not independent of $S^2$.
The residuals standardized in this way are called ``studentized'' residuals.
Studentized residuals are a standard part of most statistical software.
\subsection{Coding example}
\begin{verbatim}
> data(mtcars)
> y = mtcars$mpg
> x = cbind(1, mtcars$hp, mtcars$wt)
> n = nrow(x); p = ncol(x)
> hatmat = x %*% solve(t(x) %*% x) %*% t(x)
> residmat = diag(rep(1, n)) - hatmat
> e = residmat %*% y
> s = sqrt(sum(e^2) / (n - p))
> rstd = e / s / sqrt(diag(residmat))
> # compare with rstandard, R's function
> # for calculating standardized residuals
> cbind(rstd, rstandard(lm(y ~ x - 1)))
[,1] [,2]
1 -1.01458647 -1.01458647
2 -0.62332752 -0.62332752
3 -0.98475880 -0.98475880
4 0.05332850 0.05332850
5 0.14644776 0.14644776
6 -0.94769800 -0.94769800
...
\end{verbatim}
\section{Press residuals}
\label{sec:press}
Consider the model $\by \sim N(\bW \bgamma, \sigma^2 \bI)$
where $\bgamma = [\bbeta^t ~ \Delta_i]^t$ and
$\bW = [\bX ~ \bdelta_i]$, where $\bdelta_i$ is a vector of all zeros
except a 1 in row $i$. This model allows a mean shift at position $i$, for
example if there is an outlier at that position.
The least squares criterion can be written as
\begin{equation}
\label{rstud}
\sum_{k\neq i} \left(y_k - \sum_{j = 1}^p x_{kj} \beta_j \right)^2
+ \left(y_i - \sum_{j=1}^p x_{ij} \beta_j - \Delta_i\right)^2.
\end{equation}
Holding $\bbeta$ fixed, the estimate of $\Delta_i$ must satisfy
$$
\Delta_i = y_i - \sum_{j=1}^p x_{ij} \beta_j
$$
and thus the right-hand term of \eqref{rstud} is 0. We then obtain $\bbeta$ by minimizing
$$
\sum_{k\neq i} \left(y_k - \sum_{j = 1}^p x_{kj} \beta_j \right)^2.
$$
Therefore $\hat \bbeta$ is exactly the least squares estimate having
deleted the $i^{th}$ data point; notationally, $\hat\bbeta^{(-i)}$. Thus, $\hat \Delta_i$
is a form of residual, obtained by deleting
the $i^{th}$ point from the fit and comparing the observation to the resulting fitted value,
$$
\hat \Delta_i = y_i - \sum_{j=1}^p x_{ij} \hat \beta^{(-i)}_{j}.
$$
Notice that the fitted value at the $i^{th}$ data point is then
$\sum_{j=1}^p x_{ij} \hat \beta^{(-i)}_{j} + \hat \Delta_i = y_i$ and thus
the residual there is zero. The term $\hat \Delta_i$ is called the PRESS residual: the
difference between the observed value and the fitted value with that point deleted.
Since the residual at the $i^{th}$ data point is zero, the estimated variance from
this model is exactly equal to the variance estimate having removed the
$i^{th}$ data point. The $t$ test for $\Delta_i$ is then a form of
standardized residual that exactly follows a $t$ distribution under the null
hypothesis that $\Delta_i = 0$.
\subsection{Computing PRESS residuals}
It is interesting to note that PRESS residuals don't actually require refitting the
model with the $i^{th}$ data point deleted. Let $\bX^t = [\bz_1 ~ \ldots ~ \bz_n]$,
so that $\bz_i$ is the $i^{th}$ row of $\bX$ written as a column vector (hence column $i$ of $\bX^t$).
We use $\bz$ for the rows, since we've already reserved $\bx$ for the columns of $\bX$.
Notice, then that
$$
\bX^t \bX = \sum_{i=1}^n \bz_i \bz_i^t.
$$
Thus, $\bX^{(-i), t} \bX^{(-i)}$, the $\bX^t \bX$ matrix with the $i^{th}$ data
point deleted, is simply
$$
\bX^{(-i), t} \bX^{(-i)} = \bX^t \bX - \bz_i \bz_i^t.
$$
We can appeal to the Sherman--Morrison--Woodbury theorem for the inverse
(\href{https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula}{Wikipedia}):
$$
(\bX^{(-i), t} \bX^{(-i)})^{-1}
= \xtxinv + \frac{\xtxinv \bz_i \bz_i^t \xtxinv}{1 - \bz_i^t \xtxinv \bz_i}.
$$
Define $h_{ii}$ as diagonal element $i$ of $\hatmat$, which is equal to
$\bz_i^t \xtxinv \bz_i$. (To see this, pre- and post-multiply $\hatmat$ by a
vector of zeros with a one in position $i$, an operation that grabs the $i^{th}$ diagonal entry.)
Furthermore, note that $\bX^t \by = \sum_{i=1}^n \bz_i y_i$ so that
$$
\bX^{(-i),t} \by^{(-i)} = \bX^t \by - \bz_i y_i.
$$
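These rank-one deletion identities are easy to verify numerically. Here is a sketch (illustrative code, not part of the original text) using R's built-in \texttt{swiss} data:

```r
## Verify the rank-one deletion identities numerically (illustrative).
data(swiss)
y = swiss$Fertility
x = cbind(1, as.matrix(swiss[,-1]))
i = 6
zi = x[i, ]                                  # row i of X as a vector
xtxinv = solve(t(x) %*% x)
hii = drop(t(zi) %*% xtxinv %*% zi)          # i-th diagonal of the hat matrix
## X^t X and X^t y with row i deleted, via the subtraction identities
max(abs((t(x) %*% x - zi %*% t(zi)) - t(x[-i, ]) %*% x[-i, ]))
max(abs((t(x) %*% y - zi * y[i]) - t(x[-i, ]) %*% y[-i]))
## Sherman-Morrison form of the deleted inverse
smw = xtxinv + xtxinv %*% zi %*% t(zi) %*% xtxinv / (1 - hii)
max(abs(smw - solve(t(x[-i, ]) %*% x[-i, ])))
```

All three maximum absolute differences are zero up to floating point rounding.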
Then the predicted value at the $i^{th}$ data point, when that point was not used in the fitting, is:
\begin{eqnarray*}
\hat y^{(-i)}_i & = & \bz_i^t (\bX^{(-i), t} \bX^{(-i)})^{-1} \bX^{(-i),t} \by^{(-i)}\\
& = & \bz_i^t \left(\xtxinv + \frac{\xtxinv \bz_i \bz_i^t \xtxinv}{1 - h_{ii}} \right)(\bX^t \by - \bz_i y_i) \\
& = & \hat y_i + \frac{h_{ii}}{1 - h_{ii}} \hat y_i - h_{ii} y_i - \frac{h_{ii}^2 y_i}{1 - h_{ii}} \\
& = & \frac{\hat y_i}{1 - h_{ii}} + y_i - \frac{y_i}{1 - h_{ii}}
\end{eqnarray*}
So we wind up with the equality
$$
y_i - \hat y^{(-i)}_i = \frac{y_i - \hat y_i}{1 - h_{ii}} = \frac{e_i}{1 - h_{ii}}.
$$
In other words, the PRESS residuals are exactly the ordinary residuals divided by $1 - h_{ii}$.
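This shortcut can be checked against brute-force leave-one-out refitting. A sketch (illustrative code, not part of the original text) using the \texttt{swiss} data:

```r
## Check the PRESS shortcut against brute-force leave-one-out refits.
data(swiss)
y = swiss$Fertility
x = cbind(1, as.matrix(swiss[,-1]))
n = nrow(x)
hatmat = x %*% solve(t(x) %*% x) %*% t(x)
e = (diag(n) - hatmat) %*% y                 # ordinary residuals
shortcut = e / (1 - diag(hatmat))            # e_i / (1 - h_ii)
brute = sapply(1:n, function(i) {
  b = solve(t(x[-i, ]) %*% x[-i, ]) %*% t(x[-i, ]) %*% y[-i]
  y[i] - sum(x[i, ] * b)                     # y_i - z_i^t betahat^(-i)
})
max(abs(shortcut - brute))                   # zero up to rounding
```

The agreement holds for every observation, not just a single deleted point.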
\section{Externally studentized residuals}
It's often useful to have standardized residuals where a data point in question didn't
influence the residual variance. The normalized
PRESS residuals are, as seen in \ref{sec:press}. However, the PRESS residuals are
leave one out residuals, and thus the $i^{th}$ point was deleted for the fitted value. An alternative
strategy is to normalize the ordinary residuals by dividing by a standard deviation estimate
calculated with the $i^{th}$ data point deleted. That is,
$$
\frac{e_i}{s^{(-i)}\sqrt{1 - h_{ii}}}.
$$
In this statistic, observation $i$ hasn't had the opportunity to impact the variance estimate.
Given that the PRESS residuals are $e_i / (1 - h_{ii})$, their variance is
$\sigma^2 / (1 - h_{ii})$. The PRESS residuals normalized (divided
by their standard deviations) are therefore
$$
\frac{e_i}{\sigma \sqrt{1 - h_{ii}}}.
$$
If we use the natural variance estimate for the PRESS residuals, the estimated variance calculated with
the $i^{th}$ data point deleted, then the estimated normalized PRESS residuals are the same as the
externally studentized residuals. These also arise out of the $t$ test for the
mean shift outlier model of Section \ref{sec:press}.
\section{Coding example}
First let's use the \texttt{swiss} dataset to show how to calculate
the ordinary residuals and show that they are the same as those
output by \texttt{resid}.
\begin{verbatim}
> y = swiss$Fertility
> x = cbind(1, as.matrix(swiss[,-1]))
> n = nrow(x); p = ncol(x)
> hatmat = x %*% solve(t(x) %*% x) %*% t(x)
> ## ordinary residuals
> e = (diag(rep(1, n)) - hatmat) %*% y
> fit = lm(y ~ x)
> ## show that they're equal by taking the max absolute difference
> max(abs(e - resid(fit)))
[1] 4.058975e-12
\end{verbatim}
Next, we calculate the standardized residuals and show how to get
them automatically with \texttt{rstandard}.
\begin{verbatim}
> ## standardized residuals
> s = sqrt(sum(e ^ 2) / (n - p))
> rstd = e / s / sqrt(1 - diag(hatmat))
> ## show that they're equal by taking the max absolute difference
> max(abs(rstd - rstandard(fit)))
[1] 6.638023e-13
\end{verbatim}
Next, let's calculate the PRESS residuals both by
leaving out the $i^{th}$ observation (in this case observation 6)
and by the shortcut formula.
\begin{verbatim}
> i = 6
> yi = y[i]
> yihat = predict(fit)[i]
> hii = diag(hatmat)[i]
> ## fitted model without the ith data point
> y.minus.i = y[-i]
> x.minus.i = x[-i,]
> beta.minus.i = solve(t(x.minus.i) %*% (x.minus.i)) %*% t(x.minus.i) %*% y.minus.i
> yhat.i.minus.i = sum(x[i,] * beta.minus.i)
> pressi = yi - yhat.i.minus.i
> c(pressi, e[i] / (1 - hii))
Porrentruy
-17.96269 -17.96269
\end{verbatim}
Now show that the \texttt{rstudent} (externally studentized) residuals and normalized
PRESS residuals are the same.
\begin{verbatim}
> ## variance estimate with i deleted
> e.minus.i = y.minus.i - x.minus.i %*% beta.minus.i
> s.minus.i = sqrt(sum(e.minus.i ^ 2) / (n - p - 1))
> ## show that the studentized residual is the PRESS residual standardized
> e[i] / s.minus.i / sqrt(1 - hii)
Porrentruy
-2.367218
> rstudent(fit)[i]
6
-2.367218
\end{verbatim}
Finally, show that the mean shift outlier model recovers both the PRESS and
the \texttt{rstudent} residuals.
\begin{verbatim}
> delta = rep(0, n); delta[i] = 1
> w = cbind(x, delta)
> round(summary(lm(y ~ w - 1))$coef, 3)
Estimate Std. Error t value Pr(>|t|)
w 65.456 10.170 6.436 0.000
wAgriculture -0.210 0.069 -3.067 0.004
wExamination -0.323 0.242 -1.332 0.190
wEducation -0.895 0.174 -5.149 0.000
wCatholic 0.113 0.034 3.351 0.002
wInfant.Mortality 1.316 0.376 3.502 0.001
wdelta -17.963 7.588 -2.367 0.023
\end{verbatim}
So notice that the estimate for \texttt{wdelta} is the PRESS residual,
while the \texttt{t value} is the externally studentized residual.