[ENH] Allow fit_resample to receive metadata routed parameters #1111
Comments
In principle, I think it would indeed be nice to accept metadata. For your specific use case, though, I'm not sure that resampling is actually the best approach. While working on the scikit-learn project, we found that resampling breaks the calibration of the classifier, and what users are actually trying to solve can usually be done by post-tuning the decision threshold of the classifier. We recently added the We also worked on the following tutorial, which shows some internals that could be of interest to you: https://probabl-ai.github.io/calibration-cost-sensitive-learning/intro.html
So I think that we had an underlying bug in I got your example and made a minimal reproducer: So it means that it should work out of the box.
Thank you for your recommendation. A couple of days ago I watched your podcast with Vincent Warmerdam on the Probabl YouTube channel. It was quite insightful and prompted me to read the scikit-learn documentation you linked above, which definitely changed my perspective on the problem. I was planning to do some benchmarking of my own once I finished implementing some of the techniques I found in the literature, and I will definitely explore calibration further. I just tested version 0.13.0 and it works like a charm! Thank you for the quick implementation, and my best wishes for the holiday period <3
One small suggestion in relation to type checking. As of now type checkers will not recognize the
    from abc import ABCMeta
    from typing import TYPE_CHECKING

    class SamplerMixin(metaclass=ABCMeta):
        """Mixin class for samplers with abstract method.

        Warning: This class should not be used directly. Use the derived classes
        instead.
        """

        _estimator_type = "sampler"

        if TYPE_CHECKING:
            # Stub seen only by static type checkers; never defined at runtime.
            def set_fit_resample_request(self, **kwargs): pass
        ...
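To illustrate why such a stub helps, here is a toy sketch, not the scikit-learn or imbalanced-learn implementation. It assumes the request setters are attached to the class at class-creation time, which a static type checker cannot follow. The names `ToyRequester`, `ToySampler`, `_routed_methods`, and `_make_request_setter` are hypothetical.

```python
from typing import TYPE_CHECKING, Any


def _make_request_setter(method_name: str):
    """Build a setter that records which metadata `method_name` asks for."""

    def setter(self, **requests: Any):
        # Store requests per method on the instance, e.g. {"fit_resample": {"cost": True}}.
        store = self.__dict__.setdefault("_requests", {})
        store.setdefault(method_name, {}).update(requests)
        return self

    setter.__name__ = f"set_{method_name}_request"
    return setter


class ToyRequester:
    """Hypothetical base class that generates `set_<method>_request` setters."""

    _routed_methods: tuple = ()

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # The setters only exist after this loop runs, so they are invisible
        # to tools that analyse the class body statically.
        for name in cls._routed_methods:
            setattr(cls, f"set_{name}_request", _make_request_setter(name))


class ToySampler(ToyRequester):
    _routed_methods = ("fit_resample",)

    if TYPE_CHECKING:
        # Stub for static analysis only; at runtime the generated setter is used.
        def set_fit_resample_request(self, **kwargs: Any) -> "ToySampler": ...


sampler = ToySampler().set_fit_resample_request(cost=True)
print(sampler._requests)
```

Because `TYPE_CHECKING` is `False` at runtime, the stub never shadows the generated method. It only gives the type checker a signature to resolve.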
Let me reopen so we don't forget about this last issue. Thanks for reporting.
Is your feature request related to a problem? Please describe
In cost-sensitive learning, resampling techniques are used to address the asymmetric importance of data points. These techniques require the amount of resampling to depend on instance-specific parameters, such as cost weights associated with individual data points. These cost weights are usually arranged in a cost matrix for each data point $i$.
Since these cost weights depend on the data point, they cannot be fixed at initialization (`__init__`) but must instead adapt to the input data during the `fit_resample` process.
The current imbalanced-learn `Pipeline` object does not natively support passing metadata through its `fit_resample` method. Metadata routing, which would let instance-dependent parameters flow through the pipeline, is critical for implementing cost-sensitive learning workflows.
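To show why such a parameter has to arrive at `fit_resample` time, here is a minimal, library-free sketch of one simple scheme: sampling with replacement with probability proportional to each row's cost. This is only an illustration under that assumption, not necessarily the technique used in the package mentioned in this issue. The function name `cost_proportional_resample` and its parameters are hypothetical.

```python
import random


def cost_proportional_resample(X, y, cost, n_samples=None, seed=0):
    """Resample rows with replacement, weighting each row by its cost.

    `cost` must have one non-negative entry per row of `X`, so it can only
    be supplied together with the data, not at construction time.
    """
    if not (len(X) == len(y) == len(cost)):
        raise ValueError("X, y and cost must have the same length")
    rng = random.Random(seed)
    k = len(X) if n_samples is None else n_samples
    # Rows with a larger cost are drawn more often; zero-cost rows are never drawn.
    idx = rng.choices(range(len(X)), weights=cost, k=k)
    return [X[i] for i in idx], [y[i] for i in idx]


X = [[0.1], [0.2], [0.3]]
y = [0, 0, 1]
cost = [0.0, 0.0, 5.0]  # only the last row carries any cost
X_res, y_res = cost_proportional_resample(X, y, cost)
print(X_res, y_res)
```

The weights are aligned row by row with `X`, so they change whenever the data changes (for example, per cross-validation fold). That is why they cannot be stored in `__init__`.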
Desired workflow (DOES NOT CURRENTLY WORK)
Describe the solution you'd like
From what I understand of the metadata routing implementation of the Pipeline object, only a couple of changes have to be made:
1. The `SIMPLE_METHODS` constant found here needs to include `"fit_resample"`. Note that this does require imbalanced-learn to redefine the classes and functions which use the `SIMPLE_METHODS` constant internally. These are now imported from scikit-learn if scikit-learn version 1.4 or higher is installed. These include `MetadataRequest` and `_MetadataRequester`.
2. A method mapping from caller "fit" to callee "fit_resample" has to be added in the `get_metadata_routing(self)` method found here, and the `filter_resample` parameter of the `self._iter` method needs to be set to `False`.
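The idea behind the second change, a caller `fit` forwarding requested metadata to a callee `fit_resample`, can be pictured with a toy, library-free sketch. It is not the scikit-learn or imbalanced-learn routing code. The classes `ToyCostSampler`, `ToyClassifier`, and `ToyPipeline`, and the `requests` attribute, are hypothetical names used only for illustration.

```python
class ToyCostSampler:
    """Hypothetical sampler that requests `cost` for its fit_resample method."""

    requests = {"fit_resample": {"cost"}}

    def fit_resample(self, X, y, cost=None):
        # Keep only the rows that carry a positive cost.
        keep = [i for i, c in enumerate(cost) if c > 0]
        return [X[i] for i in keep], [y[i] for i in keep]


class ToyClassifier:
    """Hypothetical final estimator that requests no metadata."""

    requests = {"fit": set()}

    def fit(self, X, y):
        self.n_seen_ = len(X)
        return self


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    @staticmethod
    def _route(step, callee, params):
        # Forward only the metadata that this step requested for this callee.
        wanted = step.requests.get(callee, set())
        return {k: v for k, v in params.items() if k in wanted}

    def fit(self, X, y, **params):
        for step in self.steps[:-1]:
            # Caller "fit" maps to callee "fit_resample" for sampler steps.
            routed = self._route(step, "fit_resample", params)
            X, y = step.fit_resample(X, y, **routed)
        final = self.steps[-1]
        # Caller "fit" maps to callee "fit" for the final estimator.
        final.fit(X, y, **self._route(final, "fit", params))
        return self


clf = ToyClassifier()
pipe = ToyPipeline([ToyCostSampler(), clf])
pipe.fit([[0], [1], [2]], [0, 1, 0], cost=[1.0, 0.0, 2.0])
print(clf.n_seen_)
```

In this sketch the classifier never receives `cost`, because it did not request it. Only the sampler step does, which is the behaviour the proposed mapping is meant to enable.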
:Additional context
I am a PhD researcher who used these methods for a paper, and I am the author of the Python package Empulse, which implements samplers that require cost parameters to be passed to the `fit_resample` method, as in the dummy example (see Empulse/Samplers). I find the whole metadata routing implementation incredibly confusing, so apologies if I made mistakes in my reasoning.