@@ -53,8 +53,8 @@ In particular, the ridge model is the same as the OLS model:

    \mathbf{y} = \mathbf{Xb} + \epsilon

-where :math:`\epsilon \sim \mathcal{N}(0, \sigma^2 I)`, except now the error
-for the model is calculated as
+where :math:`\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})`,
+except now the error for the model is calculated as

.. math::
@@ -66,9 +66,9 @@ the adjusted normal equation:

.. math::

    \hat{\mathbf{b}}_{Ridge} =
-        (\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1} \mathbf{X}^\top \mathbf{y}
+        (\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}

-where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1}
+where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1}
\mathbf{X}^\top` is the pseudoinverse / Moore-Penrose inverse adjusted for
the `L2` penalty on the model coefficients.
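As a rough illustration of this closed form, here is a minimal NumPy sketch (the helper name ``ridge_fit`` and its arguments are ours for illustration, not part of the library's API):

.. code-block:: python

    import numpy as np

    def ridge_fit(X, y, alpha=1.0):
        """Solve the regularized normal equations (X^T X + alpha I) b = X^T y."""
        N, M = X.shape
        # Solving the linear system directly is numerically safer than
        # explicitly forming the matrix inverse.
        return np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)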
@@ -81,7 +81,7 @@ the `L2` penalty on the model coefficients.

<h2>Bayesian Linear Regression</h2>

In its general form, Bayesian linear regression extends the simple linear
-regression model by introducing priors on model parameters b and/or the
+regression model by introducing priors on model parameters *b* and/or the
error variance :math:`\sigma^2`.

The introduction of a prior allows us to quantify the uncertainty in our
@@ -98,7 +98,7 @@ data :math:`X^*` with the posterior predictive distribution:

.. math::

-    p(y^* \mid X^*, X, Y) = \int_{b} p(y^* \mid X^*, b) p(b \mid X, y) db
+    p(y^* \mid X^*, X, Y) = \int_{b} p(y^* \mid X^*, b) p(b \mid X, y) \ \text{d}b
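In practice this integral can be approximated by Monte Carlo whenever we can sample from the posterior. A minimal sketch, assuming we already have draws ``b_samples`` from :math:`p(b \mid X, y)` and a known noise scale ``sigma`` (both names are illustrative):

.. code-block:: python

    import numpy as np

    def sample_predictive(X_star, b_samples, sigma=1.0, rng=None):
        """Draw y* by pushing posterior samples of b through the likelihood."""
        rng = np.random.default_rng() if rng is None else rng
        # One predictive draw per posterior sample: y* ~ N(X* b_s, sigma^2)
        means = X_star @ b_samples.T  # shape: (n_new, n_samples)
        return means + sigma * rng.standard_normal(means.shape)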
Depending on the choice of prior it may be impossible to compute an
analytic form for the posterior / posterior predictive distribution. In
@@ -116,11 +116,11 @@ prior on `b` is Gaussian. A common parameterization is:

.. math::

-    b \mid \sigma, b_V \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)
+    b \mid \sigma, V \sim \mathcal{N}(\mu, \sigma^2 V)

-where :math:`b_{mean}`, :math:`\sigma` and :math:`b_V` are hyperparameters. Ridge
-regression is a special case of this model where :math:`b_{mean} = 0`,
-:math:`\sigma = 1` and :math:`b_V = I` (ie., the prior on `b` is a zero-mean,
+where :math:`\mu`, :math:`\sigma` and :math:`V` are hyperparameters. Ridge
+regression is a special case of this model where :math:`\mu = 0`,
+:math:`\sigma = 1` and :math:`V = I` (i.e., the prior on *b* is a zero-mean,
unit covariance Gaussian).
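As a quick check of this special case: with :math:`\mu = 0`, :math:`\sigma = 1`, and :math:`V = I`, the log posterior is, up to an additive constant,

.. math::

    \log p(b \mid X, y) = -\frac{1}{2} \|y - Xb\|_2^2 - \frac{1}{2} \|b\|_2^2

so the MAP estimate of *b* coincides with the ridge solution at :math:`\alpha = 1`.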
Due to the conjugacy of the above prior with the Gaussian likelihood, there
@@ -129,22 +129,22 @@ parameters:

.. math::

-    A &= (b_V^{-1} + X^\top X)^{-1} \\
-    \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\
-    \text{cov}_b &= \sigma^2 A \\
+    A &= (V^{-1} + X^\top X)^{-1} \\
+    \mu_b &= A V^{-1} \mu + A X^\top y \\
+    \Sigma_b &= \sigma^2 A \\

The model posterior is then

.. math::

-    b \mid X, y \sim \mathcal{N}(\mu_b, \text{cov}_b)
+    b \mid X, y \sim \mathcal{N}(\mu_b, \Sigma_b)

We can also compute a closed-form solution for the posterior predictive
distribution:

.. math::

-    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \text{cov}_b X^{*\top} + I)
+    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{*\top} + I)

where :math:`X^*` is the matrix of new data we wish to predict, and :math:`y^*`
are the predicted targets for those data.
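These updates translate directly into NumPy. A minimal sketch under the unit-noise convention used in the predictive covariance above (``conjugate_update`` and its signature are illustrative):

.. code-block:: python

    import numpy as np

    def conjugate_update(X, y, X_star, mu, V):
        """Posterior and posterior predictive for b with known variance (sigma = 1)."""
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y  # posterior mean
        Sigma_b = A                          # sigma^2 * A with sigma = 1
        # y* | X*, X, y ~ N(X* mu_b, X* Sigma_b X*^T + I)
        pred_mean = X_star @ mu_b
        pred_cov = X_star @ Sigma_b @ X_star.T + np.eye(X_star.shape[0])
        return mu_b, Sigma_b, pred_mean, pred_cov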
@@ -160,7 +160,7 @@ are the predicted targets for those data.

--------------------------------

-If *both* b and the error variance :math:`\sigma^2` are unknown, the
+If *both* *b* and the error variance :math:`\sigma^2` are unknown, the
conjugate prior for the Gaussian likelihood is the Normal-Gamma
distribution (univariate likelihood) or the Normal-Inverse-Wishart
distribution (multivariate likelihood).
@@ -169,22 +169,22 @@ distribution (multivariate likelihood).

.. math::

-    b, \sigma^2 &\sim \text{NG}(b_{mean}, b_V, \alpha, \beta) \\
+    b, \sigma^2 &\sim \text{NG}(\mu, V, \alpha, \beta) \\
    \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\
-    b \mid \sigma^2 &\sim \mathcal{N}(b_{mean}, \sigma^2 b_V)
+    b \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 V)

-where :math:`\alpha, \beta, b_V`, and :math:`b_{mean}` are
-parameters of the prior.
+where :math:`\alpha, \beta, V`, and :math:`\mu` are parameters of the
+prior.
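A minimal sketch of ancestral sampling from this prior, using ``scipy.stats.invgamma`` for the :math:`\sigma^2` draw (the helper name and arguments are illustrative):

.. code-block:: python

    import numpy as np
    from scipy.stats import invgamma

    def sample_ng_prior(mu, V, alpha, beta, rng=None):
        """Draw (b, sigma^2): first sigma^2, then b conditioned on it."""
        rng = np.random.default_rng() if rng is None else rng
        sigma2 = invgamma.rvs(a=alpha, scale=beta, random_state=rng)
        b = rng.multivariate_normal(mu, sigma2 * V)
        return b, sigma2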
**Multivariate**

.. math::

-    b, \Sigma &\sim \mathcal{NIW}(b_{mean}, \lambda, \Psi, \rho) \\
+    b, \Sigma &\sim \mathcal{NIW}(\mu, \lambda, \Psi, \rho) \\
    \Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\
-    b \mid \Sigma &\sim \mathcal{N}(b_{mean}, \frac{1}{\lambda} \Sigma)
+    b \mid \Sigma &\sim \mathcal{N}(\mu, \frac{1}{\lambda} \Sigma)

-where :math:`b_{mean}, \lambda, \Psi`, and :math:`\rho` are
+where :math:`\mu, \lambda, \Psi`, and :math:`\rho` are
parameters of the prior.
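The multivariate prior can be sampled the same way, with ``scipy.stats.invwishart`` supplying the :math:`\Sigma` draw (again, the helper is illustrative; :math:`\Psi` must be positive definite and :math:`\rho` at least its dimension):

.. code-block:: python

    import numpy as np
    from scipy.stats import invwishart

    def sample_niw_prior(mu, lam, Psi, rho, rng=None):
        """Draw (b, Sigma): first Sigma, then b conditioned on it."""
        rng = np.random.default_rng() if rng is None else rng
        Sigma = np.atleast_2d(invwishart.rvs(df=rho, scale=Psi, random_state=rng))
        b = rng.multivariate_normal(mu, Sigma / lam)
        return b, Sigma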
@@ -194,30 +194,30 @@ parameters:

.. math::

-    B &= y - X b_{mean} \\
+    B &= y - X \mu \\
    \text{shape} &= N + \alpha \\
-    \text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X b_V X^\top + I)^{-1} B) \\
+    \text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X V X^\top + I)^{-1} B) \\

where

.. math::

    \sigma^2 \mid X, y &\sim \text{InverseGamma}(\text{shape}, \text{scale}) \\
-    A &= (b_V^{-1} + X^\top X)^{-1} \\
-    \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\
-    \text{cov}_b &= \sigma^2 A
+    A &= (V^{-1} + X^\top X)^{-1} \\
+    \mu_b &= A V^{-1} \mu + A X^\top y \\
+    \Sigma_b &= \sigma^2 A

The model posterior is then

.. math::

-    b \mid X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \text{cov}_b)
+    b \mid X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \Sigma_b)

We can also compute a closed-form solution for the posterior predictive distribution:

.. math::

-    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \text{cov}_b X^{*\top} + I)
+    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{*\top} + I)
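Putting the unknown-variance updates together in NumPy, a sketch under the same notation (plugging in the posterior mean of :math:`\sigma^2` is our choice here; one could equally sample it from the InverseGamma posterior):

.. code-block:: python

    import numpy as np

    def ng_posterior(X, y, mu, V, alpha, beta):
        """Closed-form Normal-Gamma posterior updates from the equations above."""
        N = X.shape[0]
        B = y - X @ mu
        shape = N + alpha
        scale = (alpha * beta + B @ np.linalg.inv(X @ V @ X.T + np.eye(N)) @ B) / shape
        sigma2 = scale / (shape - 1)  # posterior mean of the InverseGamma
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y
        Sigma_b = sigma2 * A
        return shape, scale, mu_b, Sigma_b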
**Models**