update linear model documentation
ddbourgin committed Dec 29, 2021
1 parent bc50efa commit c7fcad6
Showing 6 changed files with 91 additions and 68 deletions.
60 changes: 30 additions & 30 deletions docs/numpy_ml.linear_models.rst
@@ -53,8 +53,8 @@ In particular, the ridge model is the same as the OLS model:
\mathbf{y} = \mathbf{bX} + \mathbf{\epsilon}
where :math:`\epsilon \sim \mathcal{N}(0, \sigma^2 I)`, except now the error
for the model is calculated as
where :math:`\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})`,
except now the error for the model is calculated as

.. math::
@@ -66,9 +66,9 @@ the adjusted normal equation:
.. math::
\hat{\mathbf{b}}_{Ridge} =
(\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1} \mathbf{X}^\top \mathbf{y}
(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}
where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1}
where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1}
\mathbf{X}^\top` is the pseudoinverse / Moore-Penrose inverse adjusted for
the `L2` penalty on the model coefficients.
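
A minimal NumPy sketch of this adjusted normal equation (an illustrative standalone helper, not necessarily how the library implements it)::

    import numpy as np

    def ridge_coefficients(X, y, alpha=1.0):
        # b = (X^T X + alpha * I)^{-1} X^T y, using solve() rather than an explicit inverse
        M = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)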

@@ -81,7 +81,7 @@ the `L2` penalty on the model coefficients.
<h2>Bayesian Linear Regression</h2>

In its general form, Bayesian linear regression extends the simple linear
regression model by introducing priors on model parameters b and/or the
regression model by introducing priors on model parameters *b* and/or the
error variance :math:`\sigma^2`.

The introduction of a prior allows us to quantify the uncertainty in our
@@ -98,7 +98,7 @@ data :math:`X^*` with the posterior predictive distribution:

.. math::
p(y^* \mid X^*, X, Y) = \int_{b} p(y^* \mid X^*, b) p(b \mid X, y) db
p(y^* \mid X^*, X, Y) = \int_{b} p(y^* \mid X^*, b) p(b \mid X, y) \ \text{d}b
Depending on the choice of prior it may be impossible to compute an
analytic form for the posterior / posterior predictive distribution. In
@@ -116,11 +116,11 @@ prior on `b` is Gaussian. A common parameterization is:

.. math::
b | \sigma, b_V \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)
b | \sigma, V \sim \mathcal{N}(\mu, \sigma^2 V)
where :math:`b_{mean}`, :math:`\sigma` and :math:`b_V` are hyperparameters. Ridge
regression is a special case of this model where :math:`b_{mean}` = 0,
:math:`\sigma` = 1 and :math:`b_V = I` (ie., the prior on `b` is a zero-mean,
where :math:`\mu`, :math:`\sigma` and :math:`V` are hyperparameters. Ridge
regression is a special case of this model where :math:`\mu = 0`,
:math:`\sigma = 1` and :math:`V = I` (i.e., the prior on *b* is a zero-mean,
unit covariance Gaussian).
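
A quick numerical check of this correspondence (a sketch only: with :math:`\mu = 0`, :math:`\sigma = 1`, and :math:`V = I`, the posterior mean below matches the ridge solution with :math:`\alpha = 1`)::

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(50, 3)), rng.normal(size=50)

    mu, V = np.zeros(3), np.eye(3)
    A = np.linalg.inv(np.linalg.inv(V) + X.T @ X)
    mu_b = A @ np.linalg.inv(V) @ mu + A @ X.T @ y                 # Bayesian posterior mean
    ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(3), X.T @ y)    # ridge solution, alpha = 1
    assert np.allclose(mu_b, ridge)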

Due to the conjugacy of the above prior with the Gaussian likelihood, there
@@ -129,22 +129,22 @@ parameters:

.. math::
A &= (b_V^{-1} + X^\top X)^{-1} \\
\mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\
\text{cov}_b &= \sigma^2 A \\
A &= (V^{-1} + X^\top X)^{-1} \\
\mu_b &= A V^{-1} \mu + A X^\top y \\
\Sigma_b &= \sigma^2 A \\
The model posterior is then

.. math::
b \mid X, y \sim \mathcal{N}(\mu_b, \text{cov}_b)
b \mid X, y \sim \mathcal{N}(\mu_b, \Sigma_b)
We can also compute a closed-form solution for the posterior predictive distribution as
well:

.. math::
y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ \ X^* \text{cov}_b X^{* \top} + I)
y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ \ X^* \Sigma_b X^{* \top} + I)
where :math:`X^*` is the matrix of new data we wish to predict, and :math:`y^*`
are the predicted targets for those data.
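
The closed-form updates above translate almost directly into NumPy. A sketch for the known-variance case (parameter names are assumptions for illustration, not the library's API)::

    import numpy as np

    def posterior_known_variance(X, y, X_star, mu=None, sigma=1.0, V=None):
        # A = (V^{-1} + X^T X)^{-1};  mu_b = A V^{-1} mu + A X^T y;  Sigma_b = sigma^2 A
        N, M = X.shape
        mu = np.zeros(M) if mu is None else mu
        V = np.eye(M) if V is None else V
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y
        Sigma_b = sigma ** 2 * A
        # posterior predictive: y* ~ N(X* mu_b, X* Sigma_b X*^T + I)
        pred_mean = X_star @ mu_b
        pred_cov = X_star @ Sigma_b @ X_star.T + np.eye(X_star.shape[0])
        return mu_b, Sigma_b, pred_mean, pred_cov
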
@@ -160,7 +160,7 @@ are the predicted targets for those data.

--------------------------------

If *both* b and the error variance :math:`\sigma^2` are unknown, the
If *both* *b* and the error variance :math:`\sigma^2` are unknown, the
conjugate prior for the Gaussian likelihood is the Normal-Gamma
distribution (univariate likelihood) or the Normal-Inverse-Wishart
distribution (multivariate likelihood).
@@ -169,22 +169,22 @@ distribution (multivariate likelihood).

.. math::
b, \sigma^2 &\sim \text{NG}(b_{mean}, b_{V}, \alpha, \beta) \\
b, \sigma^2 &\sim \text{NG}(\mu, V, \alpha, \beta) \\
\sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\
b \mid \sigma^2 &\sim \mathcal{N}(b_{mean}, \sigma^2 b_{V})
b \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 V)
where :math:`\alpha, \beta, b_{V}`, and :math:`b_{mean}` are
parameters of the prior.
where :math:`\alpha, \beta, V`, and :math:`\mu` are parameters of the
prior.
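
For intuition, drawing a single :math:`(b, \sigma^2)` pair from this univariate prior with SciPy might look like the following (illustrative only; the hyperparameter values are arbitrary)::

    import numpy as np
    from scipy import stats

    alpha, beta, mu, V = 1.0, 2.0, np.zeros(3), np.eye(3)
    sigma2 = stats.invgamma(a=alpha, scale=beta).rvs(random_state=0)             # sigma^2 ~ InverseGamma(alpha, beta)
    b = stats.multivariate_normal(mean=mu, cov=sigma2 * V).rvs(random_state=0)   # b | sigma^2 ~ N(mu, sigma^2 V)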

**Multivariate**

.. math::
b, \Sigma &\sim \mathcal{NIW}(b_{mean}, \lambda, \Psi, \rho) \\
b, \Sigma &\sim \mathcal{NIW}(\mu, \lambda, \Psi, \rho) \\
\Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\
b \mid \Sigma &\sim \mathcal{N}(b_{mean}, \frac{1}{\lambda} \Sigma)
b \mid \Sigma &\sim \mathcal{N}(\mu, \frac{1}{\lambda} \Sigma)
where :math:`b_{mean}, \lambda, \Psi`, and :math:`\rho` are
where :math:`\mu, \lambda, \Psi`, and :math:`\rho` are
parameters of the prior.


@@ -194,30 +194,30 @@ parameters:

.. math::
B &= y - X b_{mean} \\
B &= y - X \mu \\
\text{shape} &= N + \alpha \\
\text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X b_V X^\top + I)^{-1} B) \\
\text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X V X^\top + I)^{-1} B) \\
where

.. math::
\sigma^2 \mid X, y &\sim \text{InverseGamma}(\text{shape}, \text{scale}) \\
A &= (b_V^{-1} + X^\top X)^{-1} \\
\mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\
\text{cov}_b &= \sigma^2 A
A &= (V^{-1} + X^\top X)^{-1} \\
\mu_b &= A V^{-1} \mu + A X^\top y \\
\Sigma_b &= \sigma^2 A
The model posterior is then

.. math::
b | X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \text{cov}_b)
b | X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \Sigma_b)
We can also compute a closed-form solution for the posterior predictive distribution:

.. math::
y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \text{cov}_b X^{* \top} + I)
y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{* \top} + I)
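
As a rough sketch, the unknown-variance posterior parameters above can be computed as follows (illustrative; the library's implementation lives in ``bayesian_regression.py``)::

    import numpy as np
    from scipy import stats

    def posterior_unknown_variance(X, y, alpha=1.0, beta=2.0, mu=None, V=None):
        N, M = X.shape
        mu = np.zeros(M) if mu is None else mu
        V = np.eye(M) if V is None else V
        B = y - X @ mu
        shape = N + alpha
        scale = (alpha * beta + B @ np.linalg.inv(X @ V @ X.T + np.eye(N)) @ B) / shape
        sigma2 = stats.invgamma(a=shape, scale=scale).mean()   # point estimate of sigma^2
        A = np.linalg.inv(np.linalg.inv(V) + X.T @ X)
        mu_b = A @ np.linalg.inv(V) @ mu + A @ X.T @ y
        Sigma_b = sigma2 * A
        return mu_b, Sigma_b, shape, scale
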
**Models**

11 changes: 8 additions & 3 deletions numpy_ml/linear_models/bayesian_regression.py
@@ -54,7 +54,7 @@ def __init__(self, alpha=1, beta=2, mu=0, V=None, fit_intercept=True):
posterior_predictive : dict or None
Frozen random variable for the posterior predictive distribution,
:math:`P(y \mid X)`. This value is only set following a call to
:meth:`numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance.predict`.
:meth:`predict <numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance.predict>`.
""" # noqa: E501
# this is a placeholder until we know the dimensions of X
V = 1.0 if V is None else V
@@ -90,7 +90,11 @@ def fit(self, X, y):
y : :py:class:`ndarray <numpy.ndarray>` of shape `(N, K)`
The targets for each of the `N` examples in `X`, where each target
has dimension `K`.
"""
Returns
-------
self : :class:`BayesianLinearRegressionUnknownVariance <numpy_ml.linear_models.BayesianLinearRegressionUnknownVariance>` instance
""" # noqa: E501
# convert X to a design matrix if we're fitting an intercept
if self.fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
@@ -130,6 +134,7 @@ def fit(self, X, y):
"sigma**2": stats.distributions.invgamma(a=shape, scale=scale),
"b | sigma**2": stats.multivariate_normal(mean=mu, cov=cov),
}
return self

def predict(self, X):
"""
@@ -206,7 +211,7 @@ def __init__(self, mu=0, sigma=1, V=None, fit_intercept=True):
posterior_predictive : dict or None
Frozen random variable for the posterior predictive distribution,
:math:`P(y \mid X)`. This value is only set following a call to
:meth:`numpy_ml.linear_models.BayesianLinearRegressionKnownVariance.predict`.
:meth:`predict <numpy_ml.linear_models.BayesianLinearRegressionKnownVariance.predict>`.
""" # noqa: E501
# this is a placeholder until we know the dimensions of X
V = 1.0 if V is None else V
24 changes: 14 additions & 10 deletions numpy_ml/linear_models/glm.py
@@ -53,10 +53,10 @@ def __init__(self, link, fit_intercept=True, tol=1e-5, max_iter=100):
Notes
-----
The generalized linear model (GLM) [a]_ [b]_ assumes that each target/dependent
The generalized linear model (GLM) [7]_ [8]_ assumes that each target/dependent
variable :math:`y_i` in target vector :math:`\mathbf{y} = (y_1, \ldots,
y_n)`, has been drawn independently from a pre-specified distribution
in the exponential family [e]_ with unknown mean :math:`\mu_i`. The GLM
in the exponential family [11]_ with unknown mean :math:`\mu_i`. The GLM
models a (one-to-one, continuous, differentiable) function, *g*, of
this mean value as a linear combination of the model parameters
:math:`\mathbf{b}` and observed covariates, :math:`\mathbf{x}_i`:
@@ -79,22 +79,22 @@ def __init__(self, link, fit_intercept=True, tol=1e-5, max_iter=100):
"Binomial", "Logit", ":math:`g(x) = \log(x) - \log(n - x)`"
"Poisson", "Log", ":math:`g(x) = \log(x)`"
An iteratively re-weighted least squares (IRLS) algorithm [c]_ can be
An iteratively re-weighted least squares (IRLS) algorithm [9]_ can be
employed to find the maximum likelihood estimate for the model
parameters :math:`\beta` in any instance of the generalized linear
model. IRLS is equivalent to Fisher scoring [d]_, which itself is
model. IRLS is equivalent to Fisher scoring [10]_, which itself is
a slight modification of classic Newton-Raphson for finding the zeros
of the first derivative of the model log-likelihood.
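
A bare-bones version of one such IRLS loop, specialized to a Poisson GLM with the canonical log link (a sketch under those assumptions, not this module's implementation)::

    import numpy as np

    def irls_poisson(X, y, tol=1e-5, max_iter=100):
        beta = np.zeros(X.shape[1])
        for _ in range(max_iter):
            eta = X @ beta                 # linear predictor
            mu = np.exp(eta)               # inverse of the log link
            W = mu                         # working weights: (dmu/deta)^2 / Var(y) = mu
            z = eta + (y - mu) / mu        # working response
            XtW = X.T * W                  # equivalent to X.T @ diag(W)
            beta_new = np.linalg.solve(XtW @ X, XtW @ z)
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta
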
References
----------
.. [a] Nelder, J., & Wedderburn, R. (1972). Generalized linear
.. [7] Nelder, J., & Wedderburn, R. (1972). Generalized linear
models. *Journal of the Royal Statistical Society, Series A
(General), 135(3)*: 370–384.
.. [b] https://en.wikipedia.org/wiki/Generalized_linear_model
.. [c] https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
.. [d] https://en.wikipedia.org/wiki/Scoring_algorithm
.. [e] https://en.wikipedia.org/wiki/Exponential_family
.. [8] https://en.wikipedia.org/wiki/Generalized_linear_model
.. [9] https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
.. [10] https://en.wikipedia.org/wiki/Scoring_algorithm
.. [11] https://en.wikipedia.org/wiki/Exponential_family
Parameters
----------
@@ -136,7 +136,11 @@ def fit(self, X, y):
A dataset consisting of `N` examples, each of dimension `M`.
y : :py:class:`ndarray <numpy.ndarray>` of shape `(N,)`
The targets for each of the `N` examples in `X`.
"""
Returns
-------
self : :class:`GeneralizedLinearModel <numpy_ml.linear_models.GeneralizedLinearModel>` instance
""" # noqa: E501
y = np.squeeze(y)
assert y.ndim == 1

21 changes: 14 additions & 7 deletions numpy_ml/linear_models/linear_regression.py
@@ -18,7 +18,7 @@ def __init__(self, fit_intercept=True):
y_i = \beta^\top \mathbf{x}_i + \epsilon_i
In this equation :math:`epsilon_i \sim \mathcal{N}(0, \sigma^2_i)` is
In this equation :math:`\epsilon_i \sim \mathcal{N}(0, \sigma^2_i)` is
the error term associated with example :math:`i`, and
:math:`\sigma^2_i` is the variance of the corresponding example.
@@ -111,11 +111,15 @@ def update(self, X, y, weights=None):
with larger weights exert greater influence on model fit. When
`y` is a vector (i.e., `K = 1`), weights should be set to the
reciprocal of the variance for each measurement (i.e., :math:`w_i
= 1/sigma^2_i`). When `K > 1`, it is assumed that all columns of
= 1/\sigma^2_i`). When `K > 1`, it is assumed that all columns of
`y` share the same weight :math:`w_i`. If None, examples are
weighted equally, resulting in the standard linear least squares
update. Default is None.
"""
Returns
-------
self : :class:`LinearRegression <numpy_ml.linear_models.LinearRegression>` instance
""" # noqa: E501
if not self._is_fit:
raise RuntimeError("You must call the `fit` method before calling `update`")

@@ -166,7 +170,7 @@ def _update2D(self, X, y, W):
beta += S_inv @ X.T @ (y - X @ beta)

def fit(self, X, y, weights=None):
"""
r"""
Fit regression coefficients via maximum likelihood.
Parameters
@@ -181,11 +185,15 @@ def fit(self, X, y, weights=None):
with larger weights exert greater influence on model fit. When
`y` is a vector (i.e., `K = 1`), weights should be set to the
reciprocal of the variance for each measurement (i.e., :math:`w_i
= 1/sigma^2_i`). When `K > 1`, it is assumed that all columns of
= 1/\sigma^2_i`). When `K > 1`, it is assumed that all columns of
`y` share the same weight :math:`w_i`. If None, examples are
weighted equally, resulting in the standard linear least squares
update. Default is None.
"""
Returns
-------
self : :class:`LinearRegression <numpy_ml.linear_models.LinearRegression>` instance
""" # noqa: E501
N = X.shape[0]

weights = np.ones(N) if weights is None else np.atleast_1d(weights)
@@ -226,4 +234,3 @@ def predict(self, X):
if self.fit_intercept:
X = np.c_[np.ones(X.shape[0]), X]
return X @ self.beta
# return np.dot(X, self.beta)
28 changes: 15 additions & 13 deletions numpy_ml/linear_models/naive_bayes.py
@@ -82,16 +82,17 @@ def fit(self, X, y):
Notes
-----
The model parameters are stored in the :py:attr:`parameters` attribute.
The model parameters are stored in the :py:attr:`parameters
<numpy_ml.linear_models.GaussianNBClassifier.parameters>` attribute.
The following keys are present:
mean: :py:class:`ndarray <numpy.ndarray>` of shape `(K, M)`
Feature means for each of the `K` label classes
sigma: :py:class:`ndarray <numpy.ndarray>` of shape `(K, M)`
Feature variances for each of the `K` label classes
prior : :py:class:`ndarray <numpy.ndarray>` of shape `(K,)`
Prior probability of each of the `K` label classes, estimated
empirically from the training data
"mean": :py:class:`ndarray <numpy.ndarray>` of shape `(K, M)`
Feature means for each of the `K` label classes
"sigma": :py:class:`ndarray <numpy.ndarray>` of shape `(K, M)`
Feature variances for each of the `K` label classes
"prior": :py:class:`ndarray <numpy.ndarray>` of shape `(K,)`
Prior probability of each of the `K` label classes, estimated
empirically from the training data
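
A minimal usage sketch based on these documented attributes (synthetic data; assumes the import path shown in the class references above and that the constructor defaults suffice)::

    import numpy as np
    from numpy_ml.linear_models import GaussianNBClassifier

    X = np.random.RandomState(0).randn(90, 4)
    y = np.repeat([0, 1, 2], 30)                  # K = 3 classes
    nb = GaussianNBClassifier().fit(X, y)
    print(nb.parameters["mean"].shape)            # (K, M) == (3, 4)
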
Parameters
----------
Expand All @@ -102,8 +103,8 @@ def fit(self, X, y):
Returns
-------
self: object
"""
self : :class:`GaussianNBClassifier <numpy_ml.linear_models.GaussianNBClassifier>` instance
""" # noqa: E501
P = self.parameters
H = self.hyperparameters

@@ -165,7 +166,7 @@ def _log_posterior(self, X):
def _log_class_posterior(self, X, class_idx):
r"""
Compute the (unnormalized) log posterior for the label at index
`class_idx` in :py:attr:`labels`.
`class_idx` in :py:attr:`labels <numpy_ml.linear_models.GaussianNBClassifier.labels>`.
Notes
-----
@@ -199,8 +200,9 @@ def _log_class_posterior(self, X, class_idx):
-------
log_class_posterior : :py:class:`ndarray <numpy.ndarray>` of shape `(N,)`
Unnormalized log probability of the label at index `class_idx`
in :py:attr:`labels` for each example in `X`
"""
in :py:attr:`labels <numpy_ml.linear_models.GaussianNBClassifier.labels>`
for each example in `X`
""" # noqa: E501
P = self.parameters
mu = P["mean"][class_idx]
prior = P["prior"][class_idx]