Pipeline - Parameterize transforms of MLDataset or xarray.Dataset #10

Closed
PeterDSteinberg opened this issue Sep 8, 2017 · 5 comments

@PeterDSteinberg
Contributor

This issue is about creating a Pipeline class to control the parameterization of transformations of MLDataset or xarray.Dataset objects. When this issue is done, we should be able to do the following:

from xarray_filters.pipeline import Pipeline
from xarray_filters.steps import Generic, Serialize

def step_1(dset, **kw):
    # a * (mean over 'x' and 'y') ** b
    return kw['a'] * dset.mean(dim=('x', 'y')) ** kw['b']

def step_2(dset, **kw):
    # a + dset * b
    return kw['a'] + dset * kw['b']

# Each step is a (name, step) pair, as in sklearn.pipeline.Pipeline.
steps = (('s1', Generic(step_1)),
         ('s2', Generic(step_2)),
         ('s3', Serialize('two_step_pipeline_out.nc')))
pipe = Pipeline(steps=steps)
# Parameters are addressed as <step name>__<parameter name>.
pipe.set_params(s1__a=2,
                s1__b=3,
                s2__a=0,
                s2__b=0,
                s3__fname='file_with_zeros.nc')
# X is an MLDataset or xarray.Dataset.
pipe.fit_transform(X)

See this example notebook and also look at the documentation on custom estimators in sklearn.
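
The `<step name>__<parameter>` naming is the convention that sklearn.pipeline.Pipeline already uses for set_params. A minimal sketch of how a custom transformer picks this up, assuming a hypothetical `Affine` class as a stand-in for a Generic step (the real Generic implementation is not shown in this issue) and a plain NumPy array in place of an MLDataset:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class Affine(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a Generic step: computes a + X * b."""

    def __init__(self, a=0.0, b=1.0):
        # sklearn discovers parameters from the __init__ signature,
        # so each argument is stored unchanged under the same name.
        self.a = a
        self.b = b

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return self.a + np.asarray(X) * self.b


pipe = Pipeline(steps=[("s1", Affine()), ("s2", Affine())])
# "<step name>__<parameter>" routes each value to the named step.
pipe.set_params(s1__a=2, s1__b=3, s2__a=0, s2__b=0)

X = np.arange(6.0).reshape(3, 2)
out = pipe.fit_transform(X)  # s2 multiplies by 0 and adds 0
```

Because the parameters are explicit `__init__` arguments, get_params and set_params work without any extra code in the step class.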

@gbrener
Contributor

gbrener commented Sep 11, 2017

Sample implementation:

https://github.com/ContinuumIO/xarray_filters/blob/param-mlds/notebooks/Parameterize-MLDataset.ipynb

Notes:

"Pipeline compatibility" doc

Illustrates how to write custom Estimators/Transformers to pass into a sklearn Pipeline: http://scikit-learn.org/stable/developers/contributing.html#pipeline-compatibility

Reusing sklearn.pipeline.Pipeline

To reuse as much of the sklearn.pipeline.Pipeline class as possible, the sample implementation (above) sets the final estimator to None, which makes the final step act as the identity function. Here is the relevant sklearn code that treats None as identity: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py#L165

  • @PeterDSteinberg commented:

    ...in xarray_filters, we will always be returning an MLDataset/Dataset (running the last step as well, which will be a func taking an MLDataset and returning one).
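
A small sketch of the None-as-identity behavior in sklearn itself, using StandardScaler on a NumPy array purely for illustration (the sample implementation linked above works on MLDataset objects instead):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# The final step is None, so the pipeline returns whatever the last
# real transformer produced.
pipe = Pipeline(steps=[("scale", StandardScaler()), ("final", None)])
out = pipe.fit_transform(X)

# Same result as running the scaler on its own.
expected = StandardScaler().fit_transform(X)
```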

Pipelining example from sklearn

Real-world example of Pipeline being used with GridSearchCV: http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py

  • We could substitute the xarray_filters Pipeline for the sklearn Pipeline here, and pass it into dask-searchcv (instead of GridSearchCV), for example
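
For reference, the pattern in that sklearn example looks roughly like the sketch below. It uses only sklearn's own Pipeline and GridSearchCV; the iris dataset, grid values, and step names are illustrative choices, and swapping in the xarray_filters Pipeline or dask-searchcv is not shown.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline(steps=[("pca", PCA()),
                       ("clf", LogisticRegression(max_iter=1000))])

# Grid keys use the same "<step name>__<parameter>" convention
# as Pipeline.set_params.
param_grid = {"pca__n_components": [2, 3], "clf__C": [0.1, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```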

sklearn.pipeline.FeatureUnion

Similar to Pipeline, but applies several transformers to the same input (optionally in parallel) and concatenates their outputs, allowing for optional per-transformer weighting: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
Example of where this is useful: unifying heterogeneous data sources (http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py)
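
A minimal FeatureUnion sketch with sklearn's own transformers; the dataset, the choice of PCA and SelectKBest, and the weights are illustrative only:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Each transformer sees the same input; outputs are stacked column-wise.
union = FeatureUnion(
    transformer_list=[("pca", PCA(n_components=2)),
                      ("kbest", SelectKBest(k=1))],
    # Optional multiplicative weight applied to each transformer's output.
    transformer_weights={"pca": 1.0, "kbest": 0.5},
)
combined = union.fit_transform(X, y)  # 2 PCA columns + 1 selected column
```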

@PeterDSteinberg
Contributor Author

A couple points:

  • Note also @gbrener there is some dask-related work with FeatureUnion here in dask_searchcv - we may want to take advantage of those ideas.
  • I'm working on a separate experiment with subclassing the parent (base) class of GridSearchCV and hope to have something to show later this week.
  • Your point under reusing Pipeline from sklearn and its None final identity transformer sounds good.
  • We should have tests in place confirming that we support the clone operation described in the pipeline compatibility doc.
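
A sketch of what such a clone test could look like, assuming a hypothetical `Scale` step and sklearn's Pipeline in place of the xarray_filters classes:

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline


class Scale(TransformerMixin, BaseEstimator):
    """Hypothetical step used only to exercise clone()."""

    def __init__(self, a=1.0):
        self.a = a

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.a * x for x in X]


def test_pipeline_supports_clone():
    pipe = Pipeline(steps=[("s1", Scale(a=2))])
    cloned = clone(pipe)
    # clone() builds new, unfitted objects with identical parameters.
    assert cloned is not pipe
    assert cloned.named_steps["s1"] is not pipe.named_steps["s1"]
    assert cloned.get_params()["s1__a"] == 2


test_pipeline_supports_clone()
```

clone relies on get_params and on `__init__` storing its arguments unmodified, so a test like this also catches steps that break that contract.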

@gbrener
Contributor

gbrener commented Sep 14, 2017

Thanks for the notes, @PeterDSteinberg. I haven't looked much into FeatureUnion yet, but I intend to soon. I added a unit test to confirm that clone works. Also, the build is now passing for Python 3.5 and Python 3.6, including the deploy section. Python 2.7 is still failing due to #12 - I believe @gpfreitas may be working on this.

I'm now adding builtin transformers from elm.sample_util (now located at earthio.filters) - hopefully this won't be too involved now that the hard part is done. After that I'll see how we might incorporate dask into the subclasses that inherit from Step; at the moment my plan is to use dask.delayed.

@PeterDSteinberg
Contributor Author

PR looks good - adding Step-based classes at this point sounds good - thanks @gbrener!

@PeterDSteinberg
Contributor Author

@gbrener I moved the FeatureUnion work to a separate issue, #22. I'm closing this one. If anything else on this issue has not been handled yet, please open separate issues.
