@@ -53,8 +53,8 @@ In particular, the ridge model is the same as the OLS model:

    \mathbf{y} = \mathbf{Xb} + \epsilon

-where :math:`\epsilon \sim \mathcal{N}(0, \sigma^2 I)`, except now the error
-for the model is calculated as
+where :math:`\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})`,
+except now the error for the model is calculated as

.. math::
@@ -66,9 +66,9 @@ the adjusted normal equation:

.. math::

    \hat{\mathbf{b}}_{Ridge} =
-        (\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1} \mathbf{X}^\top \mathbf{y}
+        (\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}

-where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha I)^{-1}
+where :math:`(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1}
\mathbf{X}^\top` is the pseudoinverse / Moore-Penrose inverse adjusted for
the `L2` penalty on the model coefficients.
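As a rough illustration of this closed form, here is a minimal NumPy sketch (the helper name ``ridge_fit`` and its arguments are ours for illustration, not part of the library's API):

.. code-block:: python

    import numpy as np

    def ridge_fit(X, y, alpha=1.0):
        """Solve the regularized normal equations (X^T X + alpha I) b = X^T y."""
        N, M = X.shape
        # Solving the linear system directly is numerically safer than
        # explicitly forming the matrix inverse.
        return np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)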
@@ -81,7 +81,7 @@ the `L2` penalty on the model coefficients.

<h2>Bayesian Linear Regression</h2>

In its general form, Bayesian linear regression extends the simple linear
-regression model by introducing priors on model parameters b and/or the
+regression model by introducing priors on model parameters *b* and/or the
error variance :math:`\sigma^2`.

The introduction of a prior allows us to quantify the uncertainty in our
@@ -98,7 +98,7 @@ data :math:`X^*` with the posterior predictive distribution:

.. math::

-    p(y^* \mid X^*, X, Y) = \int_{b} p(y^* \mid X^*, b) p(b \mid X, y) db
+    p(y^* \mid X^*, X, Y) = \int_{b} p(y^* \mid X^*, b) p(b \mid X, y) \ \text{d}b
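In practice this integral can be approximated by Monte Carlo whenever we can sample from the posterior. A minimal sketch, assuming we already have draws ``b_samples`` from :math:`p(b \mid X, y)` and a known noise scale ``sigma`` (both names are illustrative):

.. code-block:: python

    import numpy as np

    def sample_predictive(X_star, b_samples, sigma=1.0, rng=None):
        """Draw y* by pushing posterior samples of b through the likelihood."""
        rng = np.random.default_rng() if rng is None else rng
        # One predictive draw per posterior sample: y* ~ N(X* b_s, sigma^2)
        means = X_star @ b_samples.T  # shape: (n_new, n_samples)
        return means + sigma * rng.standard_normal(means.shape)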
Depending on the choice of prior it may be impossible to compute an
analytic form for the posterior / posterior predictive distribution. In
@@ -116,11 +116,11 @@ prior on `b` is Gaussian. A common parameterization is:

.. math::

-    b \mid \sigma, b_V \sim \mathcal{N}(b_{mean}, \sigma^2 b_V)
+    b \mid \sigma, V \sim \mathcal{N}(\mu, \sigma^2 V)

-where :math:`b_{mean}`, :math:`\sigma` and :math:`b_V` are hyperparameters. Ridge
-regression is a special case of this model where :math:`b_{mean} = 0`,
-:math:`\sigma = 1` and :math:`b_V = I` (ie., the prior on `b` is a zero-mean,
+where :math:`\mu`, :math:`\sigma` and :math:`V` are hyperparameters. Ridge
+regression is a special case of this model where :math:`\mu = 0`,
+:math:`\sigma = 1` and :math:`V = I` (i.e., the prior on *b* is a zero-mean,
unit covariance Gaussian).
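As a quick check of this special case: with :math:`\mu = 0`, :math:`\sigma = 1`, and :math:`V = I`, the log posterior is, up to an additive constant,

.. math::

    \log p(b \mid X, y) = -\frac{1}{2} \|y - Xb\|_2^2 - \frac{1}{2} \|b\|_2^2

so the MAP estimate of *b* coincides with the ridge solution at :math:`\alpha = 1`.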
Due to the conjugacy of the above prior with the Gaussian likelihood, there
@@ -129,22 +129,22 @@ parameters:

.. math::

-    A &= (b_V^{-1} + X^\top X)^{-1} \\
-    \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\
-    \text{cov}_b &= \sigma^2 A \\
+    A &= (V^{-1} + X^\top X)^{-1} \\
+    \mu_b &= A V^{-1} \mu + A X^\top y \\
+    \Sigma_b &= \sigma^2 A \\

The model posterior is then

.. math::

-    b \mid X, y \sim \mathcal{N}(\mu_b, \text{cov}_b)
+    b \mid X, y \sim \mathcal{N}(\mu_b, \Sigma_b)

We can also compute a closed-form solution for the posterior predictive
distribution:

.. math::

-    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \text{cov}_b X^{*\top} + I)
+    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{*\top} + I)

where :math:`X^*` is the matrix of new data we wish to predict, and :math:`y^*`
are the predicted targets for those data.
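These updates translate directly into NumPy. A minimal sketch under the unit-noise convention used in the predictive covariance above (``conjugate_update`` and its signature are illustrative):

.. code-block:: python

    import numpy as np

    def conjugate_update(X, y, X_star, mu, V):
        """Posterior and posterior predictive for b with known variance (sigma = 1)."""
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y  # posterior mean
        Sigma_b = A                          # sigma^2 * A with sigma = 1
        # y* | X*, X, y ~ N(X* mu_b, X* Sigma_b X*^T + I)
        pred_mean = X_star @ mu_b
        pred_cov = X_star @ Sigma_b @ X_star.T + np.eye(X_star.shape[0])
        return mu_b, Sigma_b, pred_mean, pred_cov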
@@ -160,7 +160,7 @@ are the predicted targets for those data.

--------------------------------

-If *both* b and the error variance :math:`\sigma^2` are unknown, the
+If *both* *b* and the error variance :math:`\sigma^2` are unknown, the
conjugate prior for the Gaussian likelihood is the Normal-Gamma
distribution (univariate likelihood) or the Normal-Inverse-Wishart
distribution (multivariate likelihood).
@@ -169,22 +169,22 @@ distribution (multivariate likelihood).

.. math::

-    b, \sigma^2 &\sim \text{NG}(b_{mean}, b_V, \alpha, \beta) \\
+    b, \sigma^2 &\sim \text{NG}(\mu, V, \alpha, \beta) \\
    \sigma^2 &\sim \text{InverseGamma}(\alpha, \beta) \\
-    b \mid \sigma^2 &\sim \mathcal{N}(b_{mean}, \sigma^2 b_V)
+    b \mid \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2 V)

-where :math:`\alpha, \beta, b_V`, and :math:`b_{mean}` are
-parameters of the prior.
+where :math:`\alpha, \beta, V`, and :math:`\mu` are parameters of the
+prior.
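A minimal sketch of ancestral sampling from this prior, using ``scipy.stats.invgamma`` for the :math:`\sigma^2` draw (the helper name and arguments are illustrative):

.. code-block:: python

    import numpy as np
    from scipy.stats import invgamma

    def sample_ng_prior(mu, V, alpha, beta, rng=None):
        """Draw (b, sigma^2): first sigma^2, then b conditioned on it."""
        rng = np.random.default_rng() if rng is None else rng
        sigma2 = invgamma.rvs(a=alpha, scale=beta, random_state=rng)
        b = rng.multivariate_normal(mu, sigma2 * V)
        return b, sigma2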
**Multivariate**

.. math::

-    b, \Sigma &\sim \mathcal{NIW}(b_{mean}, \lambda, \Psi, \rho) \\
+    b, \Sigma &\sim \mathcal{NIW}(\mu, \lambda, \Psi, \rho) \\
    \Sigma &\sim \mathcal{W}^{-1}(\Psi, \rho) \\
-    b \mid \Sigma &\sim \mathcal{N}(b_{mean}, \frac{1}{\lambda} \Sigma)
+    b \mid \Sigma &\sim \mathcal{N}(\mu, \frac{1}{\lambda} \Sigma)

-where :math:`b_{mean}, \lambda, \Psi`, and :math:`\rho` are
+where :math:`\mu, \lambda, \Psi`, and :math:`\rho` are
parameters of the prior.
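The multivariate prior can be sampled the same way, with ``scipy.stats.invwishart`` supplying the :math:`\Sigma` draw (again, the helper is illustrative; :math:`\Psi` must be positive definite and :math:`\rho` at least its dimension):

.. code-block:: python

    import numpy as np
    from scipy.stats import invwishart

    def sample_niw_prior(mu, lam, Psi, rho, rng=None):
        """Draw (b, Sigma): first Sigma, then b conditioned on it."""
        rng = np.random.default_rng() if rng is None else rng
        Sigma = np.atleast_2d(invwishart.rvs(df=rho, scale=Psi, random_state=rng))
        b = rng.multivariate_normal(mu, Sigma / lam)
        return b, Sigma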
@@ -194,30 +194,30 @@ parameters:

.. math::

-    B &= y - X b_{mean} \\
+    B &= y - X \mu \\
    \text{shape} &= N + \alpha \\
-    \text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X b_V X^\top + I)^{-1} B) \\
+    \text{scale} &= \frac{1}{\text{shape}} (\alpha \beta + B^\top (X V X^\top + I)^{-1} B) \\

where

.. math::

    \sigma^2 \mid X, y &\sim \text{InverseGamma}(\text{shape}, \text{scale}) \\
-    A &= (b_V^{-1} + X^\top X)^{-1} \\
-    \mu_b &= A b_V^{-1} b_{mean} + A X^\top y \\
-    \text{cov}_b &= \sigma^2 A
+    A &= (V^{-1} + X^\top X)^{-1} \\
+    \mu_b &= A V^{-1} \mu + A X^\top y \\
+    \Sigma_b &= \sigma^2 A

The model posterior is then

.. math::

-    b \mid X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \text{cov}_b)
+    b \mid X, y, \sigma^2 \sim \mathcal{N}(\mu_b, \Sigma_b)

We can also compute a closed-form solution for the posterior predictive distribution:

.. math::

-    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \text{cov}_b X^{*\top} + I)
+    y^* \mid X^*, X, Y \sim \mathcal{N}(X^* \mu_b, \ X^* \Sigma_b X^{*\top} + I)
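Putting the unknown-variance updates together in NumPy, a sketch under the same notation (plugging in the posterior mean of :math:`\sigma^2` is our choice here; one could equally sample it from the InverseGamma posterior):

.. code-block:: python

    import numpy as np

    def ng_posterior(X, y, mu, V, alpha, beta):
        """Closed-form Normal-Gamma posterior updates from the equations above."""
        N = X.shape[0]
        B = y - X @ mu
        shape = N + alpha
        scale = (alpha * beta + B @ np.linalg.inv(X @ V @ X.T + np.eye(N)) @ B) / shape
        sigma2 = scale / (shape - 1)  # posterior mean of the InverseGamma
        V_inv = np.linalg.inv(V)
        A = np.linalg.inv(V_inv + X.T @ X)
        mu_b = A @ V_inv @ mu + A @ X.T @ y
        Sigma_b = sigma2 * A
        return shape, scale, mu_b, Sigma_b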
**Models**