Merge pull request #11 from KulikDM/sklearn-compat
Shift to V1
KulikDM authored Jan 27, 2025
2 parents 5d9b340 + 082bbd7 commit 7fa8fdc
Showing 117 changed files with 4,690 additions and 1,917 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -41,6 +41,6 @@ jobs:
run: |
pytest -vs --doctest-modules --cov-fail-under=90 --cov-branch --cov=pythresh --cov-report term-missing --pyargs pythresh --continue-on-collection-errors
- name: Codecov
uses: codecov/codecov-action@v4
uses: codecov/codecov-action@v5
env:
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
4 changes: 2 additions & 2 deletions .github/workflows/python-publish.yml
@@ -111,8 +111,8 @@ jobs:
run: |
printf "Preparing to install dependancies\n"
python -m pip install --upgrade pip setuptools wheel twine
python setup.py sdist
python -m pip install --upgrade pip setuptools wheel twine build
python -m build
printf "\nPackage created locally\n"
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -21,14 +21,14 @@ repos:
name: Format docstrings

- repo: https://github.com/asottile/pyupgrade
rev: v3.19.0
rev: v3.19.1
hooks:
- id: pyupgrade
args: [--py38-plus]
name: Upgrade code

- repo: https://github.com/hhatto/autopep8
rev: v2.3.1
rev: v2.3.2
hooks:
- id: autopep8
args: [--in-place]
@@ -42,7 +42,7 @@ repos:
name: Sort imports

- repo: https://github.com/charliermarsh/ruff-pre-commit
rev: v0.8.3
rev: v0.9.3
hooks:
- id: ruff
args: [--exit-non-zero-on-fix, --fix, --line-length=180]
11 changes: 10 additions & 1 deletion CHANGES.txt
@@ -86,4 +86,13 @@ v<0.3.7>, <08/18/2024> -- Fixed multi-peak error in FWFM
v<0.3.7>, <08/18/2024> -- Updated time complexity benchmarks
v<0.3.8>, <11/21/2024> -- Added factor arg to MAD and ZSCORE, contribution by @MalikAly
v<0.3.8>, <12/15/2024> -- Removed matplotlib as core dependency
v<0.3.9>, <12/19/2024> -- Fixed low contamination issue in RANK
v<1.0.0>, <12/19/2024> -- Fixed low contamination issue in RANK
v<1.0.0>, <01/27/2025> -- Added numpy random seed to all thresholders
v<1.0.0>, <01/27/2025> -- Added `fit` and `predict` methods to all thresholders
v<1.0.0>, <01/27/2025> -- Aligned MTT alpha arg with standard value
v<1.0.0>, <01/27/2025> -- Aligned all thresholders to be sklearn compatible
v<1.0.0>, <01/27/2025> -- Added new example notebooks
v<1.0.0>, <01/27/2025> -- Updated all thresholder tests
v<1.0.0>, <01/27/2025> -- Updated all thresholder examples
v<1.0.0>, <01/27/2025> -- Updated docs with shift to V1
v<1.0.0>, <01/27/2025> -- Updated docs with datatables
101 changes: 82 additions & 19 deletions README.rst
@@ -1,6 +1,6 @@
##################################################
Python Outlier Detection Thresholding (PyThresh)
##################################################
#####################################################
Python Outlier Detection Thresholding (PyThresh) V1
#####################################################

**Deployment, Stats, & License**

@@ -74,6 +74,43 @@ PyThresh includes more than 30 thresholding algorithms. These algorithms
range from using simple statistical analysis like the Z-score to more
complex mathematical methods that involve graph theory and topology.

***************************
What's New in PyThresh V1
***************************

The transition of PyThresh to V1 sees many new features!

**Sklearn Compatibility**:

- The `fit` and `predict` methods have been introduced, bringing the
  API in line with Sklearn conventions.
- These methods allow a thresholder to be fitted on training data and
evaluated on unseen data using the `predict` method.
- Previously, this functionality was cumbersome to implement using
  the `eval` method.
- Full backward compatibility with the `eval` method has been
maintained.
- Checks ensure that results remain consistent between `<V1` and `V1`.
- The `BaseEstimator` has been integrated into the `BaseThresholder`.
- This addition provides enhanced Sklearn compatibility to all
thresholders and better integration with existing Sklearn pipelines.
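As a rough illustration of the pattern described above, here is a minimal sketch of a Sklearn-style thresholder with ``fit``, ``predict``, ``thresh_``, and ``labels_``. The class name and the quantile cut rule are purely illustrative stand-ins, not PyThresh's actual algorithms:

```python
import numpy as np

class QuantileThresholder:
    """Illustrative stand-in for a Sklearn-compatible thresholder.

    The 0.9-quantile cut is a placeholder rule, not what any real
    PyThresh thresholder does.
    """

    def __init__(self, q=0.9):
        self.q = q

    def fit(self, scores):
        scores = np.asarray(scores, dtype=float)
        # Remember the training range so predict() normalizes consistently
        self._min, self._max = scores.min(), scores.max()
        norm = (scores - self._min) / (self._max - self._min)
        self.thresh_ = np.quantile(norm, self.q)
        self.labels_ = (norm > self.thresh_).astype(int)
        return self

    def predict(self, scores):
        norm = (np.asarray(scores, dtype=float) - self._min) / (self._max - self._min)
        return (norm > self.thresh_).astype(int)

rng = np.random.default_rng(1234)
train_scores = rng.normal(0, 1, 200)          # stand-in likelihood scores
thres = QuantileThresholder().fit(train_scores)
labels = thres.labels_                        # labels on the training scores
new_labels = thres.predict(rng.normal(0, 1, 50))  # labels on unseen scores
```

Because ``fit`` returns ``self`` and the learned state lives in trailing-underscore attributes, an object shaped like this slots naturally into Sklearn-style pipelines and checks.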

**Reproducibility Enhancements**:

- All thresholders now include a random seed to ensure better
reproducibility.
- Previously, some components in the thresholders differed due to
randomness.

**Improved Testing and Examples**:

- Much more robust tests have been added to ensure the functionality
and reliability of the code.
- These tests enhance confidence in the correctness of the
implementation and prevent regressions.
- All examples have been updated, and new Jupyter notebooks have been
  added to introduce all the capabilities of PyThresh.

************************
Documentation & Citing
************************
@@ -88,23 +88,25 @@ To cite this work you can visit `PyThresh Citation

----

**Outlier Detection Thresholding with 7 Lines of Code**:
**Outlier Detection Thresholding with 8 Lines of Code**:

.. code:: python
# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER
from pythresh.thresholds.karch import KARCH
clf = KNN()
clf.fit(X_train)
# get outlier scores
decision_scores = clf.decision_scores_ # raw outlier scores on the train data
# get outlier likelihood scores
decision_scores = clf.decision_scores_
# get outlier labels
thres = FILTER()
labels = thres.eval(decision_scores)
thres = KARCH()
thres.fit(decision_scores)
labels = thres.labels_ # or thres.predict(decision_scores)
or using multiple outlier detection score sets

@@ -114,18 +114,20 @@ or using multiple outlier detection score sets
import numpy as np
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pythresh.thresholds.filter import FILTER
from pythresh.thresholds.karch import KARCH
clfs = [KNN(), IForest(), PCA()]
# get outlier scores for each detector
# get outlier likelihood scores for each detector
scores = [clf.fit(X_train).decision_scores_ for clf in clfs]
scores = np.vstack(scores).T
# get outlier labels
thres = FILTER()
labels = thres.eval(scores)
thres = KARCH()
thres.fit(scores)
labels = thres.labels_ # or thres.predict(scores)
**************
Installation
@@ -180,7 +221,14 @@ Or with **pip**:
****************

- **eval(score)**: evaluate a single outlier or multiple outlier
detection likelihood score sets.
detection likelihood score set (Legacy method).

- **fit(score)**: fit a thresholder for a single outlier or multiple
outlier detection likelihood score set.

- **predict(score)**: predict the binary labels using the fitted
  thresholder on a single outlier or multiple outlier detection
  likelihood score set.
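To make the relationship between the legacy and new methods concrete, here is a hedged sketch (the class and its cut rule are invented for illustration; in PyThresh itself ``eval`` is the legacy one-shot call whose results are kept consistent with ``fit``):

```python
import numpy as np

class DemoThresholder:
    """Toy thresholder showing how eval() relates to fit()/labels_."""

    def fit(self, scores):
        s = np.asarray(scores, dtype=float)
        norm = (s - s.min()) / (s.max() - s.min())
        self.thresh_ = norm.mean() + 2 * norm.std()   # placeholder rule
        self.labels_ = (norm > self.thresh_).astype(int)
        return self

    def eval(self, scores):
        # Legacy path: fit and hand back the labels in one call
        return self.fit(scores).labels_

scores = np.linspace(0, 1, 100) ** 3                # skewed toy likelihood scores
old_labels = DemoThresholder().eval(scores)         # legacy style
new_labels = DemoThresholder().fit(scores).labels_  # V1 style
```

Both paths produce identical labels, mirroring the backward-compatibility guarantee between ``eval`` and ``fit``.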

Key Attributes of threshold:

@@ -189,12 +237,15 @@ Key Attributes of threshold:
value. Note the threshold value has been derived from likelihood
scores normalized between 0 and 1.

- **labels_**: A binary array of labels for the fitted thresholder on
the fitted dataset.

- **confidence_interval_**: Return the lower and upper confidence
interval of the contamination level. Only applies to the COMB
thresholder
thresholder.

- **dscores_**: 1D array of the TruncatedSVD decomposed decision scores
if multiple outlier detector score sets are passed
if multiple outlier detector score sets are passed.

- **mixture_**: fitted mixture model class of the selected model used
for thresholding. Only applies to MIXMOD. Attributes include:
@@ -312,11 +363,23 @@ method `thresholding confidence

----

For Jupyter Notebooks, please navigate to `notebooks
<https://github.com/KulikDM/pythresh/tree/main/notebooks>`_.
**Tutorial Notebooks**

+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| Notebook | Description |
+===================================================================================================================+=====================================================================================================+
| `Introduction <https://github.com/KulikDM/pythresh/tree/main/notebooks/00_Introduction.ipynb>`_ | Basic intro into outlier thresholding |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| `Advanced Thresholding <https://github.com/KulikDM/pythresh/tree/main/notebooks/01_Advanced.ipynb>`_ | Additional thresholding options for more advanced use |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| `Threshold Confidence <https://github.com/KulikDM/pythresh/tree/main/notebooks/02_Confidence.ipynb>`_ | Calculating the confidence levels around the threshold point |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+
| `Outlier Ranking <https://github.com/KulikDM/pythresh/tree/main/notebooks/03_Ranking.ipynb>`_ | Assisting in selecting the best performing outlier and thresholding method combo using ranking |
+-------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------+

A quick look at all the thresholders' performance can be found at
**"/notebooks/Compare All Models.ipynb"**
`Compare Thresholders
<https://github.com/KulikDM/pythresh/blob/main/notebooks/Compare%20All%20Thresholders.ipynb>`_

.. image:: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
:target: https://raw.githubusercontent.com/KulikDM/pythresh/main/imgs/All.png
37 changes: 2 additions & 35 deletions docs/FAQ.rst
@@ -66,41 +66,8 @@ can be done with many of the outlier methods (e.g. using the
``decision_function`` function of a fitted PyOD model). It is important
to note that not all outlier detection methods genuinely implement this
functionality correctly so best to check. The threshold method can be
independently called for both datasets with reasonable confidence that
the new data is getting thresholded with respected to the training
dataset simply based on the likelihood scores.

However, if this is not sufficient and you would like more control over
the thresholding you can try the above mentioned method with a few extra
steps.

- Fit an outlier detection model to a training dataset.

- MinMax normalize the likelihood scores.

- Evaluate the normalized likelihood scores with a thresholding method.

- Get the threshold point from the normalized scores using the fitted
thresholder from the ``.thresh_`` attribute as done in `Examples
<https://pythresh.readthedocs.io/en/latest/example.html>`_

- Apply the decision function of the fitted outlier detection method to
the new incoming data and get the likelihood scores.

- Normalize the new likelihood scores with the fitted MinMax from the
training dataset.

- Threshold these new scores using the ``thresh_`` value that you
obtained earlier like this: ``new_labels = cut(normalized_new_scores,
thresh_value)`` where the function ``cut`` can be imported from
``pythresh.thresholds.thresh_utility``

**Note** that if the training dataset was not meant to have outliers but
rather serve as a reference or baseline for the test data, the first
mentioned method is probably the better option. If the datasets,
training and test, both are suspected of having outliers and the data
drift between the two datasets is small, the second option should work
well.
fitted on the training data set and applied to any new data's computed
outlier likelihood scores using the ``predict`` function.
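Sketched with a stand-in quantile rule (the class name and the 0.95 cut are illustrative, not a real PyThresh thresholder), the fit-on-training / predict-on-new flow looks like this:

```python
import numpy as np

rng = np.random.default_rng(42)

class StandInThresholder:
    """Keeps the training normalization so new scores are thresholded
    relative to the training data, as a fitted thresholder's predict does."""

    def fit(self, scores):
        s = np.asarray(scores, dtype=float)
        self._min, self._max = s.min(), s.max()
        norm = (s - self._min) / (self._max - self._min)
        self.thresh_ = np.quantile(norm, 0.95)  # placeholder rule
        self.labels_ = (norm > self.thresh_).astype(int)
        return self

    def predict(self, scores):
        # Normalize NEW scores with the TRAINING range, then reuse thresh_
        norm = (np.asarray(scores, dtype=float) - self._min) / (self._max - self._min)
        return (norm > self.thresh_).astype(int)

train_scores = rng.normal(0, 1, 500)   # likelihood scores from training data
new_scores = np.concatenate([rng.normal(0, 1, 95),   # typical new points
                             rng.normal(8, 1, 5)])   # obvious outliers

thres = StandInThresholder().fit(train_scores)
new_labels = thres.predict(new_scores)  # flagged relative to the training fit
```

The key design point is that the threshold and the normalization are both learned once, on the training scores, so new data is judged against the training baseline rather than against itself.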

----

15 changes: 14 additions & 1 deletion docs/api_cc.rst
@@ -6,13 +6,25 @@ The following APIs are applicable for all detector models for ease of
use.

- :func:`pythresh.thresholders.base.BaseDetector.eval`: evaluate a
single outlier or multiple outlier detection likelihood score sets
single outlier or multiple outlier detection likelihood score set
(Legacy method)

- :func:`pythresh.thresholders.base.BaseDetector.fit`: fit a
thresholder for a single outlier or multiple outlier detection
likelihood score set

- :func:`pythresh.thresholders.base.BaseDetector.predict`: predict the
binary labels using the fitted thresholder on a single outlier or
multiple outlier detection likelihood score set

Key Attributes of a fitted model:

- :attr:`pythresh.thresholds.base.BaseThresholder.thresh_`: threshold
  value from scores normalized between 0 and 1

- :attr:`pythresh.thresholds.base.BaseThresholder.labels_`: A binary
array of labels for the fitted thresholder on the fitted dataset

- :attr:`pythresh.thresholders.base.BaseDetector.confidence_interval_`:
Return the lower and upper confidence interval of the contamination
level. Only applies to the COMB thresholder
@@ -34,6 +46,7 @@ See base class definition below:

.. automodule:: pythresh.thresholds.base
:members:
:exclude-members: _data_setup, _set_norm, _set_attributes
:undoc-members:
:show-inheritance:
:inherited-members:
