Describe the bug
Undersampling with NearMiss version 3 does not return the requested number of samples when sampling_strategy is a dictionary.
A potential explanation is that the first step of the algorithm already undersamples heavily, so the number of observations left for the second step is lower than the number specified in the dictionary. As a consequence, the algorithm only seems to work if the desired number of samples is very low compared to the number of existing samples. The code examples below show that, for a class of 357 samples, NearMiss3 does not work if the desired number of samples is 300, but it does work if the desired number is 50.
I don't think this is desirable behaviour, especially since the documentation presents all 3 versions of NearMiss as methods that allow specifying the number of samples to keep in each class. At the very least, it would be good to explain this in the documentation to save time for people who run into this problem (I lost several hours trying to figure out what was happening).
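To illustrate the suspected cause, here is a minimal pure-Python sketch of the shortlist step as I understand it: each minority sample keeps only its m nearest majority neighbours, and the union of those neighbours is all that the second step can choose from. The function name, the toy data, and the value of `requested` are made up for illustration; this is not the library's implementation.

```python
from math import dist


def nearmiss3_candidate_pool(minority, majority, m_neighbors=3):
    """Return indices of majority points that are among the m nearest
    majority neighbours of at least one minority point (assumed step 1)."""
    pool = set()
    for point in minority:
        # Rank majority indices by distance to this minority point.
        ranked = sorted(range(len(majority)), key=lambda i: dist(point, majority[i]))
        pool.update(ranked[:m_neighbors])
    return sorted(pool)


# Hypothetical 1-D toy data: 2 minority points, 10 majority points.
minority = [(0.0,), (0.5,)]
majority = [(float(i),) for i in range(1, 11)]  # 1.0, 2.0, ..., 10.0

pool = nearmiss3_candidate_pool(minority, majority, m_neighbors=3)
requested = 8  # hypothetical target count for the majority class

print(f"{len(pool)} candidates survive step 1; {requested} were requested")
if requested > len(pool):
    print("The request cannot be met from the shortlisted candidates.")
```

Both minority points share the same three nearest majority neighbours, so the pool holds only 3 of the 10 majority samples even though 8 were requested. If the real algorithm behaves like this, the pool can never exceed the number of minority samples times m, and overlapping neighbourhoods shrink it further.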
Steps/Code to Reproduce
Example 1. Undersampling to 300 observations (this doesn't work):
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from imblearn.under_sampling import NearMiss

data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)
# class 1 clearly has more than 300 observations
np.unique(data.target, return_counts=True)
# request 300 samples for class 1
X_smt, y_smt = NearMiss(version=3, sampling_strategy={1: 300}).fit_resample(X, data.target)
Example 2. Undersampling to 50 observations (this works well):
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from imblearn.under_sampling import NearMiss

data = load_breast_cancer()
X = pd.DataFrame(data=data.data, columns=data.feature_names)
# request 50 samples for class 1
X_smt, y_smt = NearMiss(version=3, sampling_strategy={1: 50}).fit_resample(X, data.target)
np.unique(y_smt, return_counts=True)  # it worked
Expected Results
In the first example, the resulting dataset (X_smt, y_smt) should have 300 samples for class 1. In the second example, class 1 should have 50 samples.
Actual Results
The code in Example 1 raises:
"UserWarning: The number of the samples to be selected is larger than the number of samples available. The balancing ratio cannot be ensure and all samples will be returned."
The code in Example 2 works well.
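Because the failure only surfaces as a warning, it is easy to miss in a script or notebook. One way to detect it programmatically is to record warnings around the resampling call, sketched below with the standard library only. The helper name `run_and_collect_warnings` and the stand-in function `fake_resample` are hypothetical; the stand-in merely emits the same message so the sketch runs without any data.

```python
import warnings


def run_and_collect_warnings(func):
    """Call func() and return (result, list of warning messages as strings)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        result = func()
    return result, [str(w.message) for w in caught]


# Hypothetical stand-in for the resampler call in Example 1.
def fake_resample():
    warnings.warn(
        "The number of the samples to be selected is larger than the number "
        "of samples available. The balancing ratio cannot be ensure and all "
        "samples will be returned.",
        UserWarning,
    )
    return "all samples returned"


result, messages = run_and_collect_warnings(fake_resample)
undersampling_failed = any("samples to be selected is larger" in m for m in messages)
print("undersampling failed:", undersampling_failed)
```

In a real session, `fake_resample` would be replaced by something like `lambda: NearMiss(version=3, sampling_strategy={1: 300}).fit_resample(X, data.target)`, and the class counts in the returned `y_smt` can then be checked with `np.unique` as in the examples above.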
Versions
Linux-5.10.15-200.fc33.x86_64-x86_64-with-glibc2.2.5
Python 3.8.6 (default, Nov 10 2011, 15:00:00)
[GCC 10.2.0]
NumPy 1.19.5
SciPy 1.6.1
Scikit-Learn 0.24.1
Imbalanced-Learn 0.8.0