
ENH: append/concatenate pipelines #793

Closed · sitnarf opened this issue Feb 12, 2021 · 16 comments

Comments
sitnarf commented Feb 12, 2021

Describe the bug

Nested pipelines built with make_pipeline raise an exception.

Steps/Code to Reproduce

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    make_pipeline(make_pipeline(SMOTE())),
    make_pipeline(RandomForestClassifier()),
)

Expected Results

No exception is raised.

Actual Results

Exception raised: TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. 'Pipeline(steps=[('pipeline', Pipeline(steps=[('smote', SMOTE())]))])' implements both)

Versions

System:
python: 3.8.6 (default, Sep 25 2020, 09:36:53) [GCC 10.2.0]
executable: /home/sitnarf/.virtualenvs/backend-g-pApWmb/bin/python
machine: Linux-5.8.0-43-generic-x86_64-with-glibc2.32

Python dependencies:
pip: 20.1.1
setuptools: 44.0.0
sklearn: 0.23.2
numpy: 1.19.5
scipy: 1.6.0
Cython: None
pandas: 1.2.1
matplotlib: 3.3.4
joblib: 1.0.0
threadpoolctl: 2.1.0

chkoar (Member) commented Feb 12, 2021

I would not consider it a bug. It is probably a feature request.

sitnarf (Author) commented Feb 12, 2021

Hi @chkoar, in plain scikit-learn at least this works (but not with imblearn):

from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    make_pipeline(SimpleImputer()),
    make_pipeline(RandomForestClassifier()),
)

glemaitre (Member) commented:

Duplicate of #787.

The reason is that there is an ambiguity about whether to call fit_resample or transform here, so this is not a bug.
Scikit-learn does not have this ambiguity because resampling does not exist there.

Having a single sampler within a pipeline is useless. I assume that you have the same needs as in #787, where you would like to combine two resamplers in a pipeline and provide this pipeline to another pipeline.

If your use case is different, then we can decide to rename this issue and see if we can come up with a solution. Otherwise we will close it in favor of #787.

glemaitre (Member) commented:

As mentioned by @chkoar, the current workaround could be to use a FunctionSampler: #787 (comment)

sitnarf (Author) commented Feb 12, 2021

@glemaitre My use case is actually different. I want to flexibly combine pipeline blocks, e.g. append classifiers to a common preprocessing pipeline. I came up with this workaround (minimal example):

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

def get_preprocessing_pipeline():
    return [
        SimpleImputer(),
        RandomOverSampler(),
    ]

pipeline = make_pipeline(
    *get_preprocessing_pipeline(),
    RandomForestClassifier(),
)

However, it is not ideal: you always have to think about where to add the final make_pipeline in a complex codebase with many levels.

chkoar (Member) commented Feb 12, 2021

@sitnarf The above example works with the current pipeline object:

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(
    SimpleImputer(), RandomOverSampler(), RandomForestClassifier(),
)
pipeline.fit(X, y)

chkoar (Member) commented Feb 12, 2021

Hi @chkoar, in plain scikit-learn at least this works (but not with imblearn)

I understand. I said that it is not a bug since it is a handled exception with a proper message.

glemaitre (Member) commented:

So in the end, your use case is to append/concatenate pipelines, not really to nest them. You are using nesting only because scikit-learn does not provide a way to create pipelines like that either. I think it would be best to request this feature upstream; we would then inherit it here.

FYI: scikit-learn/scikit-learn#16301

@glemaitre glemaitre changed the title Nested pipelines raise TypeError [BUG] ENH: append/concatenate pipelines Feb 12, 2021
sitnarf (Author) commented Feb 12, 2021

@glemaitre One question, why is it not possible to run both transform and resample using the "nested" pipeline?

chkoar (Member) commented Feb 12, 2021

@glemaitre One question, why is it not possible to run both transform and resample using the "nested" pipeline?

It is possible. Actually, we chose to go in that direction to avoid fishy situations with nested pipelines or feature unions where a sampler exists in one branch but not in another.

Since the validation of pipelines is executed in __init__, we could detect this by checking all nested branches recursively. We have discussed it in another issue, if I recall correctly.

chkoar (Member) commented Feb 12, 2021

Actually, I said the same here.

sitnarf (Author) commented Feb 12, 2021

@glemaitre My use case is actually different. I want to flexibly combine pipeline blocks, e.g. append classifiers to a common preprocessing pipeline. I came up with this workaround (minimal example):

def get_preprocessing_pipeline():
    return [
        SimpleImputer(),
        RandomOverSampler(),
    ]

pipeline = make_pipeline(
    *get_preprocessing_pipeline(),
    RandomForestClassifier(),
)

However, it is not ideal: you always have to think about where to add the final make_pipeline in a complex codebase with many levels.

Another issue with the mentioned workaround is the clumsy way of combining make_pipeline and Pipeline syntax:

def get_preprocessing_pipeline():
    return [
        SimpleImputer(),
        RandomOverSampler(),
    ]

pipeline = Pipeline([
    # ??????
    *get_preprocessing_pipeline(),
    ('classifier', RandomForestClassifier()),
])
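One way to smooth that over is a small helper that turns bare estimators into named step tuples, roughly mimicking what make_pipeline does internally (name_steps is a hypothetical name, not a scikit-learn or imblearn API):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

def name_steps(estimators):
    # Hypothetical helper: derive a lowercase step name from each class,
    # numbering duplicates so that step names stay unique.
    names = [type(est).__name__.lower() for est in estimators]
    counts = {name: names.count(name) for name in names}
    seen = {}
    steps = []
    for name, est in zip(names, estimators):
        if counts[name] > 1:
            seen[name] = seen.get(name, 0) + 1
            name = f"{name}-{seen[name]}"
        steps.append((name, est))
    return steps

# Named tuples from the helper mix cleanly with hand-written Pipeline steps.
pipeline = Pipeline(
    name_steps([SimpleImputer()]) + [("classifier", RandomForestClassifier())]
)
```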

glemaitre (Member) commented:

Since the validation of pipelines is executed in __init__, we could detect this by checking all nested branches recursively. We have discussed it in another issue, if I recall correctly.

Validation in __init__ would go against the scikit-learn API itself. We delay it to fit for compatibility with grid search.
Pipeline is the only estimator that does that, and it is an annoying one :) Going in that direction does not seem wise.

chkoar (Member) commented Feb 12, 2021

Validation in __init__ would go against the scikit-learn API itself.

In general, yes. But for pipelines, no.

Pipeline is the only estimator that does that.

Correct. That's why I said it. In any case, we could detect the problem in fit. The intention is the same.

glemaitre (Member) commented Feb 12, 2021

Another issue with the mentioned workaround is the clumsy way of combining make_pipeline and Pipeline syntax:

To be honest, I still have trouble understanding what you are trying to achieve. It would still be possible to combine pipelines using steps, wouldn't it?

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf1 = make_pipeline(SVC())
clf2 = make_pipeline(RandomForestClassifier())
preprocessor = make_pipeline(StandardScaler(), SimpleImputer())
Pipeline(steps=preprocessor.steps + clf2.steps)
# Pipeline(steps=[('standardscaler', StandardScaler()),
#                 ('simpleimputer', SimpleImputer()),
#                 ('randomforestclassifier', RandomForestClassifier())])
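That steps concatenation could also be wrapped in a small reusable helper (concat_pipelines is a hypothetical name, not a scikit-learn API; it assumes the step names across the pipelines do not collide):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def concat_pipelines(*pipelines):
    # Hypothetical helper: flatten the steps of several pipelines
    # into a single flat Pipeline, avoiding any nesting.
    return Pipeline(steps=[step for p in pipelines for step in p.steps])

preprocessor = make_pipeline(StandardScaler(), SimpleImputer())
clf = make_pipeline(SVC())
pipeline = concat_pipelines(preprocessor, clf)
```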

sitnarf (Author) commented Feb 12, 2021

@glemaitre Yeah, that could be a nice workaround! I am closing this now, since pipeline nesting is currently not well supported even upstream.

3 participants