
Add probabilistic classification to hiclass #minor #119

Merged · 69 commits · Nov 25, 2024

Conversation

LukasDrews97 (Collaborator):
Add probabilistic classification via calibration to hiclass using the following methods:

  • Platt scaling
  • Isotonic regression
  • Beta calibration
  • (Inductive/Cross) Venn-ABERS calibration
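For readers unfamiliar with the first of these: Platt scaling fits a one-feature logistic regression that maps raw classifier scores to probabilities. A minimal, self-contained sketch on synthetic data (not hiclass's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic uncalibrated scores: the positive class tends to score higher.
scores = np.concatenate([rng.normal(-1.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])
labels = np.concatenate([np.zeros(500), np.ones(500)])

# Platt scaling: fit a logistic regression on the single score feature,
# then read off calibrated probabilities for the positive class.
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
calibrated = platt.predict_proba(scores.reshape(-1, 1))[:, 1]
```

Isotonic regression replaces the sigmoid with a monotone step function fitted to the same (score, label) pairs, and beta calibration generalizes the sigmoid family.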


    else:
        calibrators = Parallel(n_jobs=self.n_jobs)(
            delayed(logging_wrapper)(
Collaborator:

Have you tested the parallel logging in the cluster? It used to be the case that messages were repeated multiple times.

LukasDrews97 (Collaborator, author):

That was not the case in my experiments.

)
proba = calibrator.predict_proba(X)

y[:, 0] = calibrator.classes_[np.argmax(proba, axis=1)]
Collaborator:

If you need to use the predictions, wouldn't it be better to use the already implemented predict method? I imagine it could simplify the code here and avoid redundancy.

LukasDrews97 (Collaborator, author):

This is not the same as calling the implemented predict method: here the prediction is derived from calibrated probabilities, whereas predict uses the uncalibrated ones.
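A contrived illustration of why the two can disagree (made-up probabilities, not hiclass output): calibration can reorder the per-class probabilities, so the argmax over calibrated scores may select a different class than the argmax over raw scores.

```python
import numpy as np

classes = np.array(["animal", "plant"])
raw_proba = np.array([[0.45, 0.55]])         # uncalibrated: argmax picks "plant"
calibrated_proba = np.array([[0.60, 0.40]])  # calibrated: argmax picks "animal"

raw_pred = classes[np.argmax(raw_proba, axis=1)]
cal_pred = classes[np.argmax(calibrated_proba, axis=1)]
```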

hiclass/__init__.py (resolved)
Comment on lines +265 to +266
y_true = make_leveled(y_true)
y_true = classifier._disambiguate(y_true)
Collaborator:

I am not sure I follow why make_leveled and _disambiguate need to be called here.

LukasDrews97 (Collaborator, author):

Training and calibration samples are transformed using these methods in the HierarchicalClassifier class. I apply the same transformations to the test samples before calculating the metrics, to get the classes for each level separately (instead of a list of labels for each data point).
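For context, "leveling" pads hierarchical label paths of different depths to the same length, so that column k holds the level-k label for every sample. A rough sketch of the idea (hypothetical helper, not the actual make_leveled):

```python
import numpy as np

def make_leveled_sketch(y, placeholder=""):
    # Pad every label path to the maximum depth so that column k of the
    # result contains the level-k label for each sample.
    max_depth = max(len(row) for row in y)
    return np.array(
        [list(row) + [placeholder] * (max_depth - len(row)) for row in y]
    )

y = [["animal", "mammal", "dog"], ["plant"]]
leveled = make_leveled_sketch(y)
```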

@@ -189,10 +350,82 @@ def test_predict_sparse(fitted_logistic_regression):
assert_array_equal(ground_truth, prediction)


def test_predict_proba(fitted_logistic_regression):
Collaborator:

Maybe you can use pytest.mark.parametrize to reduce redundancy when the same tests appear in different files.
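A sketch of what the reviewer suggests (the calibration-method names are illustrative and the assertion body is a stand-in): one parametrized test replaces near-identical copies.

```python
import numpy as np
import pytest

# One test body shared across calibration methods instead of a copy per file.
@pytest.mark.parametrize("method", ["sigmoid", "isotonic", "beta", "ivap", "cvap"])
def test_predict_proba_rows_sum_to_one(method):
    proba = np.array([[0.2, 0.8], [0.5, 0.5]])  # stand-in for classifier output
    assert np.allclose(proba.sum(axis=1), 1.0)
```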

Collaborator:

Why is this wrapper necessary?

LukasDrews97 (Collaborator, author):

It is not necessary, but I think it improves the readability of the code and makes future customization easier. Extending _BinaryCalibrator also ensures that all calibrators expose the same methods.

Comment on lines +19 to +25
positive_label = 1
unique_labels = np.unique(y)
assert len(unique_labels) <= 2

y = np.where(y == positive_label, 1, 0)
y = y.reshape(-1) # make sure it's a 1D array

Collaborator:

Maybe these lines can be replaced with the binary_only estimator tag https://scikit-learn.org/stable/developers/develop.html#estimator-tags

Suggested change
positive_label = 1
unique_labels = np.unique(y)
assert len(unique_labels) <= 2
y = np.where(y == positive_label, 1, 0)
y = y.reshape(-1) # make sure it's a 1D array

LukasDrews97 (Collaborator, author):

I can't use estimator tags on this class because it does not extend BaseEstimator.

mirand863 (Collaborator) commented May 3, 2024:

Hi @LukasDrews97,

Just a quick request from someone in France who reached out to me via e-mail. Would it be possible to add a threshold that removes labels with low probability?
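One possible shape for such a threshold (a hypothetical helper, not part of this PR): keep the argmax label only when its calibrated probability clears the cutoff, and blank it out otherwise.

```python
import numpy as np

def threshold_labels(classes, proba, threshold=0.5, placeholder=""):
    # Pick the most probable class per row, then drop predictions whose
    # probability falls below the threshold.
    best = np.argmax(proba, axis=1)
    labels = classes[best].astype(object)
    confident = proba[np.arange(len(proba)), best] >= threshold
    labels[~confident] = placeholder
    return labels

classes = np.array(["animal", "plant"])
proba = np.array([[0.9, 0.1], [0.55, 0.45]])
filtered = threshold_labels(classes, proba, threshold=0.7)
```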

codecov-commenter commented Sep 8, 2024:

Codecov Report

Attention: Patch coverage is 90.49505% with 96 lines in your changes missing coverage. Please review.

Project coverage is 94.03%. Comparing base (4595264) to head (38c011a).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
hiclass/_calibration/VennAbersCalibrator.py 79.62% 43 Missing ⚠️
hiclass/HierarchicalClassifier.py 85.84% 16 Missing ⚠️
hiclass/metrics.py 95.45% 8 Missing ⚠️
hiclass/LocalClassifierPerLevel.py 92.00% 6 Missing ⚠️
hiclass/_calibration/Calibrator.py 91.89% 6 Missing ⚠️
hiclass/Pipeline.py 44.44% 5 Missing ⚠️
hiclass/LocalClassifierPerNode.py 95.78% 4 Missing ⚠️
hiclass/_calibration/BetaCalibrator.py 87.87% 4 Missing ⚠️
...iclass/probability_combiner/ProbabilityCombiner.py 88.88% 3 Missing ⚠️
hiclass/LocalClassifierPerParentNode.py 98.76% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
- Coverage   96.68%   94.03%   -2.66%     
==========================================
  Files          13       28      +15     
  Lines        1268     2297    +1029     
==========================================
+ Hits         1226     2160     +934     
- Misses         42      137      +95     


mirand863 changed the title from "Add probabilistic classification to hiclass" to "Add probabilistic classification to hiclass #minor" on Nov 25, 2024.
mirand863 merged commit ee8cb75 into scikit-learn-contrib:main on Nov 25, 2024.
14 checks passed
3 participants