Merge pull request #11 from KulikDM/sklearn-compat
Shift to V1
KulikDM authored Jan 27, 2025
2 parents 5d9b340 + 082bbd7 commit 7fa8fdc
Showing 117 changed files with 4,690 additions and 1,917 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -41,6 +41,6 @@ jobs:
run: |
pytest -vs --doctest-modules --cov-fail-under=90 --cov-branch --cov=pythresh --cov-report term-missing --pyargs pythresh --continue-on-collection-errors
- name: Codecov
uses: codecov/codecov-action@v4
uses: codecov/codecov-action@v5
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/python-publish.yml
@@ -111,8 +111,8 @@ jobs:
run: |
printf "Preparing to install dependancies\n"
python -m pip install --upgrade pip setuptools wheel twine
python setup.py sdist
python -m pip install --upgrade pip setuptools wheel twine build
python -m build
printf "\nPackage created locally\n"
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -21,14 +21,14 @@ repos:
name: Format docstrings

- repo: https://github.com/asottile/pyupgrade
rev: v3.19.0
rev: v3.19.1
hooks:
- id: pyupgrade
args: [--py38-plus]
name: Upgrade code

- repo: https://github.com/hhatto/autopep8
rev: v2.3.1
rev: v2.3.2
hooks:
- id: autopep8
args: [--in-place]
@@ -42,7 +42,7 @@ repos:
name: Sort imports

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.8.3
rev: v0.9.3
hooks:
- id: ruff
args: [--exit-non-zero-on-fix, --fix, --line-length=180]
11 changes: 10 additions & 1 deletion CHANGES.txt
@@ -86,4 +86,13 @@ v<0.3.7>, <08/18/2024> -- Fixed multi-peak error in FWFM
v<0.3.7>, <08/18/2024> -- Updated time complexity benchmarks
v<0.3.8>, <11/21/2024> -- Added factor arg to MAD and ZSCORE, contribution by @MalikAly
v<0.3.8>, <12/15/2024> -- Removed matplotlib as core dependency
v<0.3.9>, <12/19/2024> -- Fixed low contamination issue in RANK
v<1.0.0>, <12/19/2024> -- Fixed low contamination issue in RANK
v<1.0.0>, <01/27/2025> -- Added numpy random seed to all thresholders
v<1.0.0>, <01/27/2025> -- Added `fit` and `predict` methods to all thresholders
v<1.0.0>, <01/27/2025> -- Aligned MTT alpha arg with standard value
v<1.0.0>, <01/27/2025> -- Aligned all thresholders to be sklearn compatible
v<1.0.0>, <01/27/2025> -- Added new example notebooks
v<1.0.0>, <01/27/2025> -- Updated all thresholder tests
v<1.0.0>, <01/27/2025> -- Updated all thresholder examples
v<1.0.0>, <01/27/2025> -- Updated docs with shift to V1
v<1.0.0>, <01/27/2025> -- Updated docs with datatables
101 changes: 82 additions & 19 deletions README.rst
@@ -1,6 +1,6 @@
##################################################
Python Outlier Detection Thresholding (PyThresh)
##################################################
#####################################################
Python Outlier Detection Thresholding (PyThresh) V1
#####################################################

**Deployment, Stats, & License**

@@ -74,6 +74,43 @@ PyThresh includes more than 30 thresholding algorithms. These algorithms
range from using simple statistical analysis like the Z-score to more
complex mathematical methods that involve graph theory and topology.

***************************
What's New in PyThresh V1
***************************

The transition of PyThresh to V1 sees many new features!

**Sklearn Compatibility**:

- The `fit` and `predict` methods have been introduced, bringing the
  API in line with Sklearn conventions.
- These methods allow a thresholder to be fitted on training data and
evaluated on unseen data using the `predict` method.
- Previously, this functionality was cumbersome to implement using
  the `eval` method.
- Full backward compatibility with the `eval` method has been
maintained.
- Checks ensure that results remain consistent between `<V1` and `V1`.
- The `BaseEstimator` has been integrated into the `BaseThresholder`.
- This addition provides enhanced Sklearn compatibility to all
thresholders and better integration with existing Sklearn pipelines.
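As a rough illustration of the pattern described above, here is a minimal sketch of a Sklearn-style thresholder with ``fit``, ``predict``, ``thresh_``, and ``labels_``. The class name and the quantile cut rule are purely illustrative stand-ins, not PyThresh's actual algorithms:

```python
import numpy as np

class QuantileThresholder:
    """Illustrative stand-in for a Sklearn-compatible thresholder.

    The 0.9-quantile cut is a placeholder rule, not what any real
    PyThresh thresholder does.
    """

    def __init__(self, q=0.9):
        self.q = q

    def fit(self, scores):
        scores = np.asarray(scores, dtype=float)
        # Remember the training range so predict() normalizes consistently
        self._min, self._max = scores.min(), scores.max()
        norm = (scores - self._min) / (self._max - self._min)
        self.thresh_ = np.quantile(norm, self.q)
        self.labels_ = (norm > self.thresh_).astype(int)
        return self

    def predict(self, scores):
        norm = (np.asarray(scores, dtype=float) - self._min) / (self._max - self._min)
        return (norm > self.thresh_).astype(int)

rng = np.random.default_rng(1234)
train_scores = rng.normal(0, 1, 200)          # stand-in likelihood scores
thres = QuantileThresholder().fit(train_scores)
labels = thres.labels_                        # labels on the training scores
new_labels = thres.predict(rng.normal(0, 1, 50))  # labels on unseen scores
```

Because ``fit`` returns ``self`` and the learned state lives in trailing-underscore attributes, an object shaped like this slots naturally into Sklearn-style pipelines and checks.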

**Reproducibility Enhancements**:

- All thresholders now include a random seed to ensure better
reproducibility.
- Previously, some components in the thresholders differed due to
randomness.

**Improved Testing and Examples**:

- Much more robust tests have been added to ensure the functionality
and reliability of the code.
- These tests enhance confidence in the correctness of the
implementation and prevent regressions.
- All examples have been updated, and new Jupyter notebooks have been
  added to introduce all the capabilities of PyThresh.

************************
Documentation & Citing
************************
@@ -88,23 +88,25 @@ To cite this work you can visit `PyThresh Citation

----

**Outlier Detection Thresholding with 7 Lines of Code**:
**Outlier Detection Thresholding with 8 Lines of Code**:

.. code:: python
# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER
from pythresh.thresholds.karch import KARCH
clf = KNN()
clf.fit(X_train)
# get outlier scores
decision_scores = clf.decision_scores_ # raw outlier scores on the train data
# get outlier likelihood scores
decision_scores = clf.decision_scores_
# get outlier labels
thres = FILTER()
labels = thres.eval(decision_scores)
thres = KARCH()
thres.fit(decision_scores)
labels = thres.labels_ # or thres.predict(decision_scores)
or using multiple outlier detection score sets

@@ -114,18 +114,20 @@ or using multiple outlier detection score sets
import numpy as np
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pythresh.thresholds.filter import FILTER
from pythresh.thresholds.karch import KARCH
clfs = [KNN(), IForest(), PCA()]
# get outlier scores for each detector
# get outlier likelihood scores for each detector
scores = [clf.fit(X_train).decision_scores_ for clf in clfs]
scores = np.vstack(scores).T
# get outlier labels
thres = FILTER()
labels = thres.eval(scores)
thres = KARCH()
thres.fit(scores)
labels = thres.labels_ # or thres.predict(scores)
**************
Installation
@@ -180,7 +221,14 @@ Or with **pip**:
****************

- **eval(score)**: evaluate a single outlier or multiple outlier
detection likelihood score sets.
detection likelihood score set (Legacy method).

- **fit(score)**: fit a thresholder for a single outlier or multiple
outlier detection likelihood score set.

- **predict(score)**: predict the binary labels using the fitted
  thresholder on a single outlier or multiple outlier detection
  likelihood score set.
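To make the relationship between the legacy and new methods concrete, here is a hedged sketch (the class and its cut rule are invented for illustration; in PyThresh itself ``eval`` is the legacy one-shot call whose results are kept consistent with ``fit``):

```python
import numpy as np

class DemoThresholder:
    """Toy thresholder showing how eval() relates to fit()/labels_."""

    def fit(self, scores):
        s = np.asarray(scores, dtype=float)
        norm = (s - s.min()) / (s.max() - s.min())
        self.thresh_ = norm.mean() + 2 * norm.std()   # placeholder rule
        self.labels_ = (norm > self.thresh_).astype(int)
        return self

    def eval(self, scores):
        # Legacy path: fit and hand back the labels in one call
        return self.fit(scores).labels_

scores = np.linspace(0, 1, 100) ** 3                # skewed toy likelihood scores
old_labels = DemoThresholder().eval(scores)         # legacy style
new_labels = DemoThresholder().fit(scores).labels_  # V1 style
```

Both paths produce identical labels, mirroring the backward-compatibility guarantee between ``eval`` and ``fit``.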

Key Attributes of threshold:

@@ -189,12 +237,15 @@ Key Attributes of threshold:
value. Note the threshold value has been derived from likelihood
scores normalized between 0 and 1.

- **labels_**: A binary array of labels for the fitted thresholder on
the fitted dataset.

- **confidence_interval_**: Return the lower and upper confidence
interval of the contamination level. Only applies to the COMB
thresholder
thresholder.

- **dscores_**: 1D array of the TruncatedSVD decomposed decision scores
if multiple outlier detector score sets are passed
if multiple outlier detector score sets are passed.

- **mixture_**: fitted mixture model class of the selected model used
for thresholding. Only applies to MIXMOD. Attributes include:
@@ -312,11 +363,23 @@ method `thresholding confidence

----

For Jupyter Notebooks, please navigate to `notebooks
<https://github.com/KulikDM/pythresh/tree/main/notebooks>`_.
**Tutorial Notebooks**

+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| Notebook | Description |
+===================================================================================================================+=====================================================================================================+
| `Introduction <https://github.com/KulikDM/pythresh/tree/main/notebooks/00_Introduction.ipynb>`_ | Basic intro into outlier thresholding |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| `Advanced Thresholding <https://github.com/KulikDM/pythresh/tree/main/notebooks/01_Advanced.ipynb>`_ | Additional thresholding options for more advanced use |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| `Threshold Confidence <https://github.com/KulikDM/pythresh/tree/main/notebooks/02_Confidence.ipynb>`_ | Calculating the confidence levels around the threshold point |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| `Outlier Ranking <https://github.com/KulikDM/pythresh/tree/main/notebooks/03_Ranking.ipynb>`_ | Assisting in selecting the best performing outlier and thresholding method combo using ranking |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+

A quick look at all the thresholders' performance can be found at
**"/notebooks/Compare All Models.ipynb"**
`Compare Thresholders
<https://github.com/KulikDM/pythresh/blob/main/notebooks/Compare%20All%20Thresholders.ipynb>`_

.. image:: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
:target: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
37 changes: 2 additions & 35 deletions docs/FAQ.rst
@@ -66,41 +66,8 @@ can be done with many of the outlier methods (e.g. using the
``decision_function`` function of a fitted PyOD model). It is important
to note that not all outlier detection methods genuinely implement this
functionality correctly so best to check. The threshold method can be
independently called for both datasets with reasonable confidence that
the new data is getting thresholded with respected to the training
dataset simply based on the likelihood scores.

However, if this is not sufficient and you would like more control over
the thresholding you can try the above mentioned method with a few extra
steps.

- Fit an outlier detection model to a training dataset.

- MinMax normalize the likelihood scores.

- Evaluate the normalized likelihood scores with a thresholding method.

- Get the threshold point from the normalized scores using the fitted
thresholder from the ``.thresh_`` attribute as done in `Examples
<https://pythresh.readthedocs.io/en/latest/example.html>`_

- Apply the decision function of the fitted outlier detection method to
the new incoming data and get the likelihood scores.

- Normalize the new likelihood scores with the fitted MinMax from the
training dataset.

- Threshold these new scores using the ``thresh_`` value that you
obtained earlier like this: ``new_labels = cut(normalized_new_scores,
thresh_value)`` where the function ``cut`` can be imported from
``pythresh.thresholds.thresh_utility``

**Note** that if the training dataset was not meant to have outliers but
rather serve as a reference or baseline for the test data, the first
mentioned method is probably the better option. If the datasets,
training and test, both are suspected of having outliers and the data
drift between the two datasets is small, the second option should work
well.
fitted on the training data set and applied to any new data's computed
outlier likelihood scores using the ``predict`` function.
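Sketched with a stand-in quantile rule (the class name and the 0.95 cut are illustrative, not a real PyThresh thresholder), the fit-on-training / predict-on-new flow looks like this:

```python
import numpy as np

rng = np.random.default_rng(42)

class StandInThresholder:
    """Keeps the training normalization so new scores are thresholded
    relative to the training data, as a fitted thresholder's predict does."""

    def fit(self, scores):
        s = np.asarray(scores, dtype=float)
        self._min, self._max = s.min(), s.max()
        norm = (s - self._min) / (self._max - self._min)
        self.thresh_ = np.quantile(norm, 0.95)  # placeholder rule
        self.labels_ = (norm > self.thresh_).astype(int)
        return self

    def predict(self, scores):
        # Normalize NEW scores with the TRAINING range, then reuse thresh_
        norm = (np.asarray(scores, dtype=float) - self._min) / (self._max - self._min)
        return (norm > self.thresh_).astype(int)

train_scores = rng.normal(0, 1, 500)   # likelihood scores from training data
new_scores = np.concatenate([rng.normal(0, 1, 95),   # typical new points
                             rng.normal(8, 1, 5)])   # obvious outliers

thres = StandInThresholder().fit(train_scores)
new_labels = thres.predict(new_scores)  # flagged relative to the training fit
```

The key design point is that the threshold and the normalization are both learned once, on the training scores, so new data is judged against the training baseline rather than against itself.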

----

15 changes: 14 additions & 1 deletion docs/api_cc.rst
@@ -6,13 +6,25 @@ The following APIs are applicable for all detector models for ease of
use.

- :func:`pythresh.thresholders.base.BaseDetector.eval`: evaluate a
single outlier or multiple outlier detection likelihood score sets
single outlier or multiple outlier detection likelihood score set
(Legacy method)

- :func:`pythresh.thresholders.base.BaseDetector.fit`: fit a
thresholder for a single outlier or multiple outlier detection
likelihood score set

- :func:`pythresh.thresholders.base.BaseDetector.predict`: predict the
binary labels using the fitted thresholder on a single outlier or
multiple outlier detection likelihood score set

Key Attributes of a fitted model:

- :attr:`pythresh.thresholds.base.BaseThresholder.thresh_`: threshold
  value from scores normalized between 0 and 1

- :attr:`pythresh.thresholds.base.BaseThresholder.labels_`: A binary
array of labels for the fitted thresholder on the fitted dataset

- :attr:`pythresh.thresholders.base.BaseDetector.confidence_interval_`:
Return the lower and upper confidence interval of the contamination
level. Only applies to the COMB thresholder
@@ -34,6 +46,7 @@ See base class definition below:

.. automodule:: pythresh.thresholds.base
:members:
:exclude-members: _data_setup, _set_norm, _set_attributes
:undoc-members:
:show-inheritance:
:inherited-members:
