paper: The Effect of Class Distribution on Classifier Learning: An Empirical Study #730
Could you elaborate on what the method is doing?
Sure. Empirical research has shown that class imbalance does not necessarily mean worse performance. The proposed method consists of grid searching over several (and reversed) re-sampling strategies to produce several classifiers and picking the best one. A bias correction is then made (easily in the case of trees) in the form of a higher or lower threshold per leaf, based on the original-to-resampled ratio (e.g. if you sampled a class twice as frequently, the threshold for classifying into that class should be twice as high per leaf). The annotated version shows what I believe to be the essence.
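The per-leaf threshold correction can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the dataset, the resampling factor, and the `rates` bookkeeping are all made up for the example. Dividing the leaf-level class frequencies by the per-class resampling rate is equivalent to raising the decision threshold for an oversampled class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative sketch of the leaf-level bias correction described above.
# Assumption: we know the per-class resampled-to-original count ratio.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Resample: duplicate the minority class once (class 1 is now twice as frequent).
idx_min = np.where(y == 1)[0]
X_res = np.vstack([X, X[idx_min]])
y_res = np.concatenate([y, y[idx_min]])
rates = np.array([1.0, 2.0])  # resampled-to-original count ratio per class

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_res, y_res)

# predict_proba returns the class frequencies of the leaf each sample
# falls into; dividing by the resampling rate undoes the induced bias,
# which is equivalent to a higher per-leaf threshold for class 1.
proba = tree.predict_proba(X) / rates
proba /= proba.sum(axis=1, keepdims=True)
y_pred = proba.argmax(axis=1)
```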
@ShaharKatz
As far as I can tell, the article itself doesn't come with source code. The intuition and example case are pretty straightforward: take an imbalanced binary-label dataset that is perfectly linearly separable. An SVM shouldn't have a problem with this dataset, and by downsampling the majority class you can lose the fine-tuning of the decision boundary. I think I'll first show that this actually works on a dataset with reproducible code, and then, if the results are good, incorporate it.
It looks to me as if it would be easy to do with ordinary scikit-learn/imbalanced-learn components in a few lines of code (pipeline + grid search):

```python
from sklearn.datasets import load_iris
from imblearn.datasets import make_imbalance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

data = load_iris()
X, y = data.data, data.target
X, y = make_imbalance(
    X, y, sampling_strategy={0: 10, 1: 20, 2: 50}, random_state=42
)

model = Pipeline(
    [("sampler", SMOTE()),
     ("scaler", StandardScaler()),
     ("clf", LogisticRegression())]
)
param_grid = {
    "sampler__sampling_strategy": [
        {0: 20, 1: 30}, {0: 30, 1: 30}, {0: 30, 1: 20},
    ]
}
grid = GridSearchCV(model, param_grid=param_grid).fit(X, y)
print(grid.best_params_)
```
So if I am not missing anything, IMO it would not be worth creating an estimator that wraps all possible parameters of these models, since it seems pretty easy to create a pipeline in this case. WDYT?
Maybe the current implementation already addresses this, but the over/under-sampling is only the first part; the second part (which might still need implementation) is the debiasing of the estimator's results. The article shows it for trees (which is very intuitive), but for logistic regression we need a different correction.
Yes |
What is the problem?
If @ShaharKatz wants to do it.
@Sandy4321 We have to be careful when adding a new algorithm to the source code. Code comes with the responsibility to maintain it, so we need to weigh the benefits and limitations of the current solution and decide whether it is worth adding. That said, I have not looked at the paper yet, so I cannot say whether it is worth it. When speaking about debiasing, I would think that this should be linked to the scoring used during the fit of the
OK, so I see that the debiasing is actually a ratio at the leaf level in the tree. |
One thing I am not sure about is how well this method works with deep trees, where you will have very few samples in each leaf.
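A quick illustration of that concern (a made-up dataset, not from the paper): a fully grown tree ends up with far more, and far smaller, leaves than a depth-limited one, and the class ratio in a leaf holding a handful of samples is too noisy to correct reliably.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

for depth in (3, None):  # shallow vs fully grown
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    leaf_ids = tree.apply(X)            # leaf index for each training sample
    counts = np.bincount(leaf_ids)
    counts = counts[counts > 0]         # keep only actual leaves
    print(f"max_depth={depth}: {len(counts)} leaves, "
          f"smallest leaf holds {counts.min()} samples")
```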
Regarding the implementation @glemaitre suggested: this isn't really a simple scorer, since it must know the resampling technique used in the pre-processing stage. On the other hand, it isn't really a preprocessing step, since it obviously takes action during inference. It also shouldn't be model-specific, since most models don't do the resampling internally (which is why this repo comes in handy), but on the other hand the model implementation is relevant, since the correction goes down to the leaf level (in trees).
This is the reason I think this repo is the best place for it: it deals specifically with imbalanced learning, and it can accommodate this "hybrid" component that doesn't necessarily play nicely with the existing interfaces.
@ShaharKatz |
Could this be generalized across predictors (1)?
If the answer to (1) is yes, then I suppose the above sentence indicates that the solution could be a meta-estimator, no?
Hi, regarding your question @chkoar: this is model-specific. We have a solution for trees, and I'm currently looking at a solution for logistic regression; we don't have a general framework yet.
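For logistic regression, one commonly used option is the classic prior (intercept) correction for models fitted under case-control-style resampling: shift the intercept by the log of the per-class sampling-rate ratio. This is my sketch of that idea, not the paper's method; the dataset and the factor-of-5 oversampling are made up, and the correction only fixes the prior shift (slope estimates are still perturbed in finite samples).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Oversample the minority class by a factor of 5 (rate_pos / rate_neg = 5).
idx_min = np.where(y == 1)[0]
X_res = np.vstack([X] + [X[idx_min]] * 4)
y_res = np.concatenate([y] + [y[idx_min]] * 4)
rate_pos, rate_neg = 5.0, 1.0  # resampled-to-original count ratio per class

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
proba_uncorrected = clf.predict_proba(X)[:, 1]

# Subtract log(rate_pos / rate_neg) from the intercept to undo the shift
# the oversampling induces in the log-odds.
clf.intercept_ = clf.intercept_ - np.log(rate_pos / rate_neg)
proba_corrected = clf.predict_proba(X)[:, 1]
```

The corrected probabilities are uniformly lower and their mean lands much closer to the original class prevalence than the uncorrected ones.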
Friends, there is an interesting discussion of what can be done better:
catboost/catboost#392 (comment)
@ShaharKatz, and a reference to this paper:
The Effect of Class Distribution on Classifier Learning: An Empirical Study
https://pdfs.semanticscholar.org/8939/585e7d464703fe0ec8ca9fc6acc3528ce601.pdf