embed comments from review
agoscinski committed Aug 8, 2023
1 parent 28980c4 commit 2691cae
Showing 7 changed files with 133 additions and 152 deletions.
53 changes: 6 additions & 47 deletions docs/src/getting-started.rst
@@ -7,25 +7,8 @@ For a detailed explanation, please look at the :ref:`selection-api`
Features and Samples Selection
------------------------------

.. include:: selection.rst
:start-after: marker-selection-introduction-begin
:end-before: marker-selection-introduction-end


These selectors are available:

* :ref:`CUR-api`: an iterative feature selection method based upon the
singular value decomposition.
* :ref:`PCov-CUR-api` decomposition extends upon CUR by using augmented right or left
singular vectors inspired by Principal Covariates Regression.
* :ref:`FPS-api`: a common selection technique intended to exploit the diversity of
the input space. The selection of the first point is made at random or by a
separate metric.
* :ref:`PCov-FPS-api` extends upon FPS much like PCov-CUR does to CUR.
* :ref:`Voronoi-FPS-api`: conducts FPS selection, taking advantage of Voronoi
tessellations to accelerate selection.
* :ref:`DCH-api`: selects samples by constructing a directional convex hull and
determining which samples lie on the bounding surface.
.. automodule:: skmatter._selection
:noindex:

Examples
^^^^^^^^
@@ -37,19 +20,8 @@ Examples
Reconstruction Measures
-----------------------

.. include:: gfrm.rst
:start-after: marker-reconstruction-introduction-begin
:end-before: marker-reconstruction-introduction-end


These reconstruction measures are available:

* :ref:`GRE-api` (GRE) computes the amount of linearly-decodable information
recovered through a global linear reconstruction.
* :ref:`GRD-api` (GRD) computes the amount of distortion contained in a global linear
reconstruction.
* :ref:`LRE-api` (LRE) computes the amount of decodable information recovered through
a local linear reconstruction for the k-nearest neighborhood of each sample.
.. automodule:: skmatter.metrics
:noindex:

Examples
^^^^^^^^
@@ -60,21 +32,8 @@ Examples
Principal Covariates Regression
-------------------------------

.. include:: pcovr.rst
:start-after: marker-pcovr-introduction-begin
:end-before: marker-pcovr-introduction-end

It includes

* :ref:`PCovR-api` the standard Principal Covariates Regression. Utilises a
combination of a PCA-like and an LR-like loss, and therefore attempts to find
a low-dimensional projection of the feature vectors that simultaneously minimises
information loss and error in predicting the target properties using only the
latent space vectors :math:`\mathbf{T}`.
* :ref:`KPCovR-api` the Kernel Principal Covariates Regression, a kernel-based
variation on the original PCovR method, proposed in [Helfrecht2020]_.

.. automodule:: skmatter.decomposition
:noindex:

Examples
^^^^^^^^
23 changes: 10 additions & 13 deletions docs/src/gfrm.rst
@@ -5,19 +5,16 @@

.. marker-reconstruction-introduction-begin
A set of easily-interpretable error measures of the relative information capacity of
feature space `F` with respect to feature space `F'`. The methods return a value
between 0 and 1, where 0 means that `F` and `F'` are completely distinct in terms of
linearly-decodable information, and where 1 means that `F'` is contained in `F`. All
methods are implemented as the root mean-square error for the regression of the
feature matrix `X_F'` (or sometimes called `Y` in the doc) from `X_F` (or sometimes
called `X` in the doc) for transformations with different constraints (linear,
orthogonal, locally-linear). By default, a custom 2-fold cross-validation
:py:class:`skmatter.linear_model.RidgeRegression2FoldCV` is used to ensure the
generalization of the transformation and efficiency of the computation, since we deal
with a multi-target regression problem. Methods were applied to compare different
forms of featurizations through different hyperparameters and induced metrics and
kernels [Goscinski2021]_.
.. automodule:: skmatter.metrics

These reconstruction measures are available:

* :ref:`GRE-api` (GRE) computes the amount of linearly-decodable information
recovered through a global linear reconstruction.
* :ref:`GRD-api` (GRD) computes the amount of distortion contained in a global linear
reconstruction.
* :ref:`LRE-api` (LRE) computes the amount of decodable information recovered through
a local linear reconstruction for the k-nearest neighborhood of each sample.

.. marker-reconstruction-introduction-end
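A minimal sketch of calling the two global measures (the function names,
`global_reconstruction_error` and `global_reconstruction_distortion`, and the
two-matrix call signature are assumptions of this sketch based on the GRE/GRD
descriptions above; `X_F` and `X_Fprime` stand for the feature matrices of `F`
and `F'`):

.. doctest::

>>> import numpy as np
>>> from skmatter.metrics import global_reconstruction_distortion, global_reconstruction_error
>>> rng = np.random.default_rng(0)
>>> X_F = rng.normal(size=(20, 4))  # feature space F
>>> X_Fprime = X_F[:, :2]  # feature space F', here a subset of F
>>> gre = global_reconstruction_error(X_F, X_Fprime)  # GRE of reconstructing F' from F
>>> grd = global_reconstruction_distortion(X_F, X_Fprime)  # GRD of the same reconstruction
>>> bool(gre >= 0.0) and bool(grd >= 0.0)  # both are root-mean-square errors
True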
23 changes: 0 additions & 23 deletions docs/src/pcovr.rst
@@ -1,29 +1,6 @@
Principal Covariates Regression (PCovR)
=======================================


.. marker-pcovr-introduction-begin
Often, one wants to construct new ML features from their
current representation in order to compress data or visualise
trends in the dataset. In the archetypal method for this
dimensionality reduction, principal components analysis (PCA),
features are transformed into the latent space which best
preserves the variance of the original data. Principal Covariates
Regression (PCovR), as introduced by [deJong1992]_,
is a modification to PCA that incorporates target information,
such that the resulting embedding could be tuned using a
mixing parameter α to improve performance in regression
tasks (:math:`\alpha = 0` corresponding to linear regression
and :math:`\alpha = 1` corresponding to PCA).
[Helfrecht2020]_ introduced the non-linear
version, Kernel Principal Covariates Regression (KPCovR),
where the mixing parameter α now interpolates between kernel ridge
regression (:math:`\alpha = 0`) and kernel principal components
analysis (KPCA, :math:`\alpha = 1`).

.. marker-pcovr-introduction-end
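A minimal usage sketch of PCovR (keyword names and shapes follow the description
above; the exact defaults are assumptions):

.. doctest::

>>> import numpy as np
>>> from skmatter.decomposition import PCovR
>>> X = np.array([[-1.0, 1.0, -3.0], [1.0, -2.0, 1.0], [-2.0, 0.0, -2.0], [2.0, 1.0, 4.0]])
>>> Y = np.array([[0.0, -5.0], [-1.0, 1.0], [1.0, -5.0], [-3.0, 2.0]])
>>> # the mixing parameter alpha = 0.5 balances the PCA-like and the LR-like loss
>>> pcovr = PCovR(mixing=0.5, n_components=2)
>>> _ = pcovr.fit(X, Y)
>>> T = pcovr.transform(X)  # projection onto the latent space
>>> print(T.shape)
(4, 2)
>>> Yp = pcovr.predict(X)  # targets predicted from the latent space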
.. _PCovR-api:

PCovR
65 changes: 1 addition & 64 deletions docs/src/selection.rst
@@ -3,70 +3,7 @@
Feature and Sample Selection
============================

.. marker-selection-introduction-begin
Data sub-selection modules primarily corresponding to methods derived from
CUR matrix decomposition and Farthest Point Sampling. In their classical form,
CUR and FPS determine a data subset that maximizes the variance (CUR) or
distribution (FPS) of the features or samples.
These methods can be modified to incorporate supervised target information, denoted by
the methods `PCov-CUR` and `PCov-FPS`.
For further reading, refer to [Imbalzano2018]_ and [Cersonsky2021]_.

These selectors can be used for both feature and sample selection, with similar
instantiations. All sub-selection methods score each feature or sample
(without an estimator)
and choose the one with the maximum score. As a simple example:

.. doctest::

>>> # feature selection
>>> import numpy as np
>>> from skmatter.feature_selection import CUR, FPS, PCovCUR, PCovFPS
>>> selector = CUR(
... # the number of selections to make
... # if None, set to half the samples or features
... # if float, fraction of the total dataset to select
... # if int, absolute number of selections to make
... n_to_select=2,
... # option to use `tqdm <https://tqdm.github.io/>`_ progress bar
... progress_bar=True,
... # float, cutoff score to stop selecting
... score_threshold=1e-12,
... # boolean, whether to select randomly after non-redundant selections
... # are exhausted
... full=False,
... )
>>> X = np.array(
... [
... [0.12, 0.21, 0.02], # 3 samples, 3 features
... [-0.09, 0.32, -0.10],
... [-0.03, -0.53, 0.08],
... ]
... )
>>> y = np.array([0.0, 0.0, 1.0]) # classes of each sample
>>> selector.fit(X)
CUR(n_to_select=2, progress_bar=True, score_threshold=1e-12)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>> selector = PCovCUR(n_to_select=2)
>>> selector.fit(X, y)
PCovCUR(n_to_select=2)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>>
>>> # Now sample selection
>>> from skmatter.sample_selection import CUR, FPS, PCovCUR, PCovFPS
>>> selector = CUR(n_to_select=2)
>>> selector.fit(X)
CUR(n_to_select=2)
>>> Xr = X[selector.selected_idx_]
>>> print(Xr.shape)
(2, 3)

.. marker-selection-introduction-end
.. automodule:: skmatter._selection

.. _CUR-api:

76 changes: 74 additions & 2 deletions src/skmatter/_selection.py
@@ -1,5 +1,77 @@
"""
Sequential selection
r"""
This module contains data sub-selection modules primarily corresponding to
methods derived from CUR matrix decomposition and Farthest Point Sampling. In
their classical form, CUR and FPS determine a data subset that maximizes the
variance (CUR) or distribution (FPS) of the features or samples. These methods
can be modified to incorporate supervised target information, denoted by the methods
`PCov-CUR` and `PCov-FPS`. For further reading, refer to [Imbalzano2018]_ and
[Cersonsky2021]_. These selectors can be used for both feature and sample
selection, with similar instantiations. All sub-selection methods score each
feature or sample (without an estimator) and choose the one with the maximum
score. A simple example of usage:
.. doctest::
>>> # feature selection
>>> import numpy as np
>>> from skmatter.feature_selection import CUR, FPS, PCovCUR, PCovFPS
>>> selector = CUR(
... # the number of selections to make
... # if None, set to half the samples or features
... # if float, fraction of the total dataset to select
... # if int, absolute number of selections to make
... n_to_select=2,
... # option to use `tqdm <https://tqdm.github.io/>`_ progress bar
... progress_bar=True,
... # float, cutoff score to stop selecting
... score_threshold=1e-12,
... # boolean, whether to select randomly after non-redundant selections
... # are exhausted
... full=False,
... )
>>> X = np.array(
... [
... [0.12, 0.21, 0.02], # 3 samples, 3 features
... [-0.09, 0.32, -0.10],
... [-0.03, -0.53, 0.08],
... ]
... )
>>> y = np.array([0.0, 0.0, 1.0]) # classes of each sample
>>> selector.fit(X)
CUR(n_to_select=2, progress_bar=True, score_threshold=1e-12)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>> selector = PCovCUR(n_to_select=2)
>>> selector.fit(X, y)
PCovCUR(n_to_select=2)
>>> Xr = selector.transform(X)
>>> print(Xr.shape)
(3, 2)
>>>
>>> # Now sample selection
>>> from skmatter.sample_selection import CUR, FPS, PCovCUR, PCovFPS
>>> selector = CUR(n_to_select=2)
>>> selector.fit(X)
CUR(n_to_select=2)
>>> Xr = X[selector.selected_idx_]
>>> print(Xr.shape)
(2, 3)
These selectors are available:
* :ref:`CUR-api`: an iterative feature selection method based upon the
singular value decomposition.
* :ref:`PCov-CUR-api` decomposition extends upon CUR by using augmented right or left
singular vectors inspired by Principal Covariates Regression.
* :ref:`FPS-api`: a common selection technique intended to exploit the diversity of
the input space. The selection of the first point is made at random or by a
separate metric (see the short FPS sketch after this list).
* :ref:`PCov-FPS-api` extends upon FPS much like PCov-CUR does to CUR.
* :ref:`Voronoi-FPS-api`: conducts FPS selection, taking advantage of Voronoi
tessellations to accelerate selection.
* :ref:`DCH-api`: selects samples by constructing a directional convex hull and
determining which samples lie on the bounding surface.
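For example, a farthest point sampling run on the same `X` (a minimal sketch; the
`initialize` keyword used here to pick the index of the first selection is an
assumption of this sketch):

.. doctest::

>>> from skmatter.feature_selection import FPS
>>> selector = FPS(n_to_select=2, initialize=0)  # start from the first feature
>>> _ = selector.fit(X)  # unsupervised, only X is needed
>>> print(selector.transform(X).shape)
(3, 2)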
"""

import numbers
28 changes: 25 additions & 3 deletions src/skmatter/decomposition/__init__.py
@@ -1,6 +1,28 @@
"""
The :mod:`skmatter.decomposition` module includes the two distance
measures, as defined by Principal Covariates Regression (PCovR)
r"""
Often, one wants to construct new ML features from their current representation
in order to compress data or visualise trends in the dataset. In the archetypal
method for this dimensionality reduction, principal components analysis (PCA),
features are transformed into the latent space which best preserves the
variance of the original data. This module provides Principal Covariates
Regression (PCovR), as introduced by [deJong1992]_, a modification to PCA
that incorporates target information, such that the resulting embedding can
be tuned using a mixing parameter α to improve performance in regression tasks
(:math:`\alpha = 0` corresponding to linear regression and :math:`\alpha = 1`
corresponding to PCA). [Helfrecht2020]_ introduced the non-linear version,
Kernel Principal Covariates Regression (KPCovR), where the mixing parameter α
now interpolates between kernel ridge regression (:math:`\alpha = 0`) and
kernel principal components analysis (KPCA, :math:`\alpha = 1`).
The module includes:
* :ref:`PCovR-api` the standard Principal Covariates Regression. Utilises a
combination of a PCA-like and an LR-like loss, and therefore attempts to find
a low-dimensional projection of the feature vectors that simultaneously minimises
information loss and error in predicting the target properties using only the
latent space vectors :math:`\mathbf{T}`.
* :ref:`KPCovR-api` the Kernel Principal Covariates Regression, a kernel-based
variation on the original PCovR method, proposed in [Helfrecht2020]_ (a short
usage sketch follows this list).
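A minimal usage sketch of the kernel variant (passing a matching
:class:`sklearn.kernel_ridge.KernelRidge` regressor and the `kernel`/`gamma` keywords
is an assumption of this sketch, not a definitive API reference):

.. doctest::

>>> import numpy as np
>>> from sklearn.kernel_ridge import KernelRidge
>>> from skmatter.decomposition import KernelPCovR
>>> X = np.array([[-1.0, 1.0], [1.0, -2.0], [-2.0, 0.0], [2.0, 1.0]])
>>> Y = np.array([[0.0, -1.0], [1.0, 0.5], [-1.0, 2.0], [2.0, -0.5]])
>>> kpcovr = KernelPCovR(
...     mixing=0.5,
...     n_components=2,
...     regressor=KernelRidge(kernel="rbf", gamma=1.0),
...     kernel="rbf",
...     gamma=1.0,
... )
>>> _ = kpcovr.fit(X, Y)  # alpha (mixing) interpolates between KRR and KPCA
>>> print(kpcovr.transform(X).shape)  # latent space T
(4, 2)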
"""

from ._kernel_pcovr import KernelPCovR
17 changes: 17 additions & 0 deletions src/skmatter/metrics/__init__.py
@@ -1,3 +1,20 @@
r"""
This module contains a set of easily-interpretable error measures of the
relative information capacity of feature space `F` with respect to feature
space `F'`. The methods return a value between 0 and 1, where 0 means that
`F` and `F'` are completely distinct in terms of linearly-decodable
information, and where 1 means that `F'` is contained in `F`. All methods
are implemented as the root mean-square error for the regression of the
feature matrix `X_F'` (or sometimes called `Y` in the doc) from `X_F` (or
sometimes called `X` in the doc) for transformations with different
constraints (linear, orthogonal, locally-linear). By default, a custom 2-fold
cross-validation :py:class:`skmatter.linear_model.RidgeRegression2FoldCV` is
used to ensure the generalization of the transformation and efficiency of
the computation, since we deal with a multi-target regression problem.
Methods were applied to compare different forms of featurizations through
different hyperparameters and induced metrics and kernels [Goscinski2021]_.
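A minimal sketch of the local measure (the `local_reconstruction_error` name and its
third argument, the number of local points, are assumptions of this sketch):

.. doctest::

>>> import numpy as np
>>> from skmatter.metrics import local_reconstruction_error
>>> rng = np.random.default_rng(0)
>>> X_F = rng.normal(size=(30, 4))  # feature space F
>>> X_Fprime = X_F @ rng.normal(size=(4, 3))  # feature space F', a linear map of F
>>> lre = local_reconstruction_error(X_F, X_Fprime, 5)  # local fits over 5 nearest neighbours
>>> bool(lre >= 0.0)  # a root-mean-square error, hence non-negative
True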
"""

from ._reconstruction_measures import (
check_global_reconstruction_measures_input,
check_local_reconstruction_measures_input,
