paper: The Effect of Class Distribution on Classifier Learning: An Empirical Study #730
Could you elaborate on what the method is doing?
Sure. Empirical research has shown that class imbalance does not necessarily mean worse performance. The proposed method consists of grid searching over several (and reversed) re-sampling strategies to produce several classifiers and picking the best one. A bias correction is then made (easily in the case of trees) in the form of a higher or lower threshold per leaf, based on the original-to-resampled ratio (e.g. if you sampled a class twice as frequently, the threshold for classifying into that class should be twice as high per leaf). The annotated version shows what I believe to be the essence.
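The per-leaf threshold correction can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the dataset, the resampling factor, and the `rates` bookkeeping are all made up for the example. Dividing the leaf-level class frequencies by the per-class resampling rate is equivalent to raising the decision threshold for an oversampled class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch of the leaf-level bias correction described above.
# Assumption: we know the per-class resampled-to-original count ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Resample: duplicate the minority class once (class 1 is now twice as frequent).
idx_min = np.where(y == 1)[0]
X_res = np.vstack([X, X[idx_min]])
y_res = np.concatenate([y, y[idx_min]])
rates = np.array([1.0, 2.0])  # resampled-to-original count ratio per class

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_res, y_res)

# predict_proba returns the class frequencies of the leaf each sample
# falls into; dividing by the resampling rate undoes the induced bias,
# which is equivalent to a higher per-leaf threshold for class 1.
proba = tree.predict_proba(X) / rates
proba /= proba.sum(axis=1, keepdims=True)
y_pred = proba.argmax(axis=1)
```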
@ShaharKatz
As far as I can tell, the article itself doesn't come with source code. The intuition and example case are pretty straightforward: take an imbalanced binary-label dataset that is perfectly linearly separable. An SVM shouldn't have a problem with this dataset, and by downsampling the majority class you can lose the fine-tuning of the decision boundary. I think I'll first show that this actually works on a dataset with reproducible code, and then, if the results are good, incorporate it.
It looks to me as if it would be easy to do with ordinary scikit-learn/imbalanced-learn components in a few lines of code (pipeline + grid search):

```python
from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

data = load_iris()
X, y = data.data, data.target
X, y = make_imbalance(
    X, y, sampling_strategy={0: 10, 1: 20, 2: 50}, random_state=42
)

model = Pipeline(
    [("sampler", SMOTE()),
     ("scaler", StandardScaler()),
     ("clf", LogisticRegression())]
)
param_grid = {
    "sampler__sampling_strategy": [
        {0: 20, 1: 30}, {0: 30, 1: 30}, {0: 30, 1: 20},
    ]
}
grid = GridSearchCV(model, param_grid=param_grid).fit(X, y)
print(grid.best_params_)
```
So if I am not missing anything, IMO it would not be worth creating an estimator that wraps all possible parameters of these models, since it seems pretty easy to create a pipeline in this case. WDYT?
Maybe the current implementation already addresses this, but the over/under-sampling is only the first part; the second part (which might still need implementation) is the debiasing of the estimator's results. The article shows it for trees (which is very intuitive), but for logistic regression we need a different correction.
Yes |
What is the problem?
If @ShaharKatz wants to do it.
@Sandy4321 We have to be careful when adding a new algorithm to the source code. Code comes with the responsibility to maintain it, so we need to weigh the benefits and limitations of the current solution and decide whether it is worth adding. That said, I have not looked at the paper yet, so I cannot say whether it is worth it. When speaking about debiasing, I would think that this should be linked to the scoring used during the fit of the
OK, so I see that the debiasing is actually a ratio at the leaf level in the tree. |
One thing I am not sure about is how well this method works with deep trees, where you will have very few samples in each leaf.
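A quick illustration of that concern (a made-up dataset, not from the paper): a fully grown tree ends up with far more, and far smaller, leaves than a depth-limited one, and the class ratio in a leaf holding a handful of samples is too noisy to correct reliably.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

for depth in (3, None):  # shallow vs fully grown
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    leaf_ids = tree.apply(X)            # leaf index for each training sample
    counts = np.bincount(leaf_ids)
    counts = counts[counts > 0]         # keep only actual leaves
    print(f"max_depth={depth}: {len(counts)} leaves, "
          f"smallest leaf holds {counts.min()} samples")
```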
Regarding the implementation @glemaitre suggested: this isn't really a simple scorer, since it must know the resampling technique used in the pre-processing stage. On the other hand, it isn't really a preprocessing step, since it obviously takes action during inference. It also shouldn't be model-specific, since most models don't do the resampling internally (which is why this repo comes in handy), but on the other hand the model implementation is relevant, since the correction goes down to the leaf level (in trees).
This is the reason I think this repo is the best place for it: it deals specifically with imbalanced learning, and it can accommodate this "hybrid" component that doesn't necessarily play nicely with the existing interfaces.
@ShaharKatz |
Could this be generalized across predictors (1)?
If the answer to (1) is yes, then I suppose the above sentence indicates that the solution could be a meta-estimator, no?
Hi, regarding your question @chkoar: this is model-specific. We have a solution for trees, and I'm currently looking at a solution for logistic regression; we don't have a general framework yet.
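For logistic regression, one commonly used option is the classic prior (intercept) correction for models fitted under case-control-style resampling: shift the intercept by the log of the per-class sampling-rate ratio. This is my sketch of that idea, not the paper's method; the dataset and the factor-of-5 oversampling are made up, and the correction only fixes the prior shift (slope estimates are still perturbed in finite samples).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Oversample the minority class by a factor of 5 (rate_pos / rate_neg = 5).
idx_min = np.where(y == 1)[0]
X_res = np.vstack([X] + [X[idx_min]] * 4)
y_res = np.concatenate([y] + [y[idx_min]] * 4)
rate_pos, rate_neg = 5.0, 1.0  # resampled-to-original count ratio per class

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba_uncorrected = clf.predict_proba(X)[:, 1]

# Subtract log(rate_pos / rate_neg) from the intercept to undo the shift
# the oversampling induces in the log-odds.
clf.intercept_ = clf.intercept_ - np.log(rate_pos / rate_neg)
proba_corrected = clf.predict_proba(X)[:, 1]
```

The corrected probabilities are uniformly lower and their mean lands much closer to the original class prevalence than the uncorrected ones.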
Friends, there is an interesting discussion of what can be done better:
catboost/catboost#392 (comment)
@ShaharKatz, and a reference to this paper:
The Effect of Class Distribution on Classifier Learning: An Empirical Study
https://pdfs.semanticscholar.org/8939/585e7d464703fe0ec8ca9fc6acc3528ce601.pdf