Pipeline - Parameterize transforms of MLDataset or xarray.Dataset #10

PeterDSteinberg · 2017-09-08T22:41:14Z

This issue is about creating a Pipeline class to control the parameterization of transformations of MLDataset or Dataset objects. When this issue is done, we should be able to do the following:

from xarray_filters.pipeline import Pipeline
from xarray_filters.steps import Generic, Serialize
def step_1(dset, **kw):
    return kw['a'] * dset.mean(dim=('x', 'y')) ** kw['b']

def step_2(dset, **kw):
    return kw['a'] + dset * kw['b']
    
steps = (('s1', Generic(step_1)),
         ('s2', Generic(step_2)),
         ('s3', Serialize('two_step_pipeline_out.nc')))
pipe = Pipeline(steps=steps)
pipe.set_params(s1__a=2,
                s1__b=3,
                s2__a=0,
                s2__b=0,
                s3__fname='file_with_zeros.nc')
pipe.fit_transform(X)

See this example notebook and also look at the documentation on custom estimators in sklearn.
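To make the `step__param` naming in the example above concrete, here is a minimal pure-Python sketch of how such routing could work. `MiniPipeline` and this `Generic` are hypothetical stand-ins written for illustration, not the xarray_filters implementation, and plain lists of floats stand in for MLDataset/Dataset objects.

```python
class Generic:
    """Hypothetical step: wraps a function and stores keyword parameters for it."""

    def __init__(self, func, **params):
        self.func = func
        self.params = dict(params)

    def set_params(self, **params):
        self.params.update(params)
        return self

    def transform(self, data):
        # Pass the stored parameters to the wrapped function as **kw.
        return self.func(data, **self.params)


class MiniPipeline:
    """Hypothetical stand-in for Pipeline: runs named steps in order."""

    def __init__(self, steps):
        self.steps = list(steps)

    def set_params(self, **params):
        named = dict(self.steps)
        for key, value in params.items():
            # 's1__a' is split into step name 's1' and parameter name 'a'.
            step_name, _, param_name = key.partition('__')
            if step_name not in named or not param_name:
                raise ValueError('Invalid parameter name: %r' % key)
            named[step_name].set_params(**{param_name: value})
        return self

    def fit_transform(self, data):
        for _, step in self.steps:
            data = step.transform(data)
        return data


def step_1(values, **kw):
    mean = sum(values) / len(values)
    return [kw['a'] * mean ** kw['b']]


def step_2(values, **kw):
    return [kw['a'] + v * kw['b'] for v in values]


pipe = MiniPipeline([('s1', Generic(step_1)), ('s2', Generic(step_2))])
pipe.set_params(s1__a=2, s1__b=3, s2__a=1, s2__b=10)
result = pipe.fit_transform([1.0, 2.0, 3.0])
```

The double underscore is the same convention sklearn uses for nested parameters, which is what would let a grid search address each step's parameters by name.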


gbrener · 2017-09-11T18:49:41Z

Sample implementation:

https://github.com/ContinuumIO/xarray_filters/blob/param-mlds/notebooks/Parameterize-MLDataset.ipynb

Notes:

"Pipeline compatibility" doc

Illustrates how to write custom Estimators/Transformers to pass into a sklearn Pipeline: http://scikit-learn.org/stable/developers/contributing.html#pipeline-compatibility

Reusing sklearn.pipeline.Pipeline

In order to reuse as much of the sklearn.pipeline.Pipeline class as possible, the sample implementation (above) sets the final estimator to None; this has the effect of the "identity" function. Here is the relevant sklearn code that treats None as identity: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py#L165
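A small sketch of that behaviour, using plain sklearn and a NumPy array rather than an MLDataset: with the final step set to None, `fit_transform` returns the output of the preceding transformers unchanged. The step names and the `double` function are made up for illustration; newer sklearn documentation also accepts the string 'passthrough' for the same purpose.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def double(X):
    """Toy transform standing in for a dataset-to-dataset step."""
    return X * 2


pipe = Pipeline(steps=[
    ('double', FunctionTransformer(double)),
    ('final', None),  # None as the final estimator acts as the identity
])

X = np.array([[1.0, 2.0], [3.0, 4.0]])
out = pipe.fit_transform(X)
```

Here `out` is simply the result of the `double` step, since the final step contributes nothing.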

@PeterDSteinberg commented:

...in xarray_filters, we will always be returning a MLDataset/Dataset (running the last step as well, which will be a func taking a MLDataset and returning one).

Pipelining example from sklearn

Real-world example of Pipeline being used with GridSearchCV: http://scikit-learn.org/stable/auto_examples/plot_digits_pipe.html#sphx-glr-auto-examples-plot-digits-pipe-py

We could substitute the xarray_filters Pipeline for the sklearn Pipeline here, and pass it into dask-searchcv (instead of GridSearchCV), for example.
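For reference, the pattern in that sklearn example looks roughly like the sketch below: a Pipeline is handed to GridSearchCV and its step parameters are addressed with `step__param` keys. This uses synthetic data and only sklearn's own classes; whether dask-searchcv or an xarray_filters Pipeline can be dropped in unchanged is an assumption not tested here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic classification problem so the search runs quickly.
X, y = make_classification(n_samples=120, n_features=10,
                           n_informative=5, random_state=0)

pipe = Pipeline(steps=[
    ('pca', PCA()),
    ('logistic', LogisticRegression(max_iter=1000)),
])

# Keys follow the <step name>__<parameter name> convention.
param_grid = {
    'pca__n_components': [2, 5],
    'logistic__C': [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```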

sklearn.pipeline.FeatureUnion

Similar to Pipeline, but applies several transformers to the same input in parallel and concatenates their outputs, allowing for optional per-transformer weighting: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html
Example of where this is useful: unifying heterogeneous data sources (http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py)
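A minimal FeatureUnion sketch on a NumPy array, to show the concatenation and weighting described above. The transformer names and the `row_means` helper are illustrative choices, not anything from xarray_filters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def row_means(X):
    """Toy feature: the mean of each row, kept as a single column."""
    return X.mean(axis=1, keepdims=True)


union = FeatureUnion(
    transformer_list=[
        ('pca', PCA(n_components=2)),
        ('row_mean', FunctionTransformer(row_means)),
    ],
    # Optional weights multiply the output of the named transformer.
    transformer_weights={'row_mean': 0.5},
    n_jobs=1,
)

rng = np.random.RandomState(0)
X = rng.rand(20, 4)
out = union.fit_transform(X)  # two PCA columns followed by one weighted mean column
```

Each transformer sees the full input `X`; the outputs are stacked column-wise, so the result here has three columns.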

PeterDSteinberg · 2017-09-12T15:24:10Z

A couple points:

Note also @gbrener there is some dask-related work with FeatureUnion here in dask_searchcv - we may want to take advantage of those ideas.
I'm working on a separate experiment with subclassing the parent (base) class of GridSearchCV and hope to have something to show later this week.
Your point under reusing Pipeline from sklearn and its None final identity transformer sounds good.
We should have tests in place confirming that we support the clone operation described in the pipeline compatibility doc.
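A test for clone support could look something like the sketch below. `ScaleShift` is a hypothetical transformer written for this illustration (it is not an xarray_filters class); the point is that an estimator whose `__init__` only stores its arguments can be copied by `sklearn.base.clone` with its parameters intact.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ScaleShift(TransformerMixin, BaseEstimator):
    """Hypothetical step computing a + X * b, mirroring step_2 in the issue."""

    def __init__(self, a=0.0, b=1.0):
        # Store constructor arguments unmodified so get_params/clone work.
        self.a = a
        self.b = b

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return self.a + X * self.b


def test_clone_preserves_params():
    original = ScaleShift(a=0, b=2)
    copy = clone(original)
    # clone builds a new, unfitted estimator with the same parameters.
    assert copy is not original
    assert copy.get_params() == original.get_params()
    # Changing the copy must not affect the original.
    copy.set_params(a=5)
    assert original.a == 0
    assert copy.a == 5
```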

gbrener · 2017-09-14T18:59:14Z

Thanks for the notes, @PeterDSteinberg. I haven't looked much into FeatureUnion yet but I intend to soon. I added a unit test to confirm that clone works. Also, the build is now passing for Python 3.5 and Python 3.6, including the deploy section. Python 2.7 is still failing due to #12 - I believe @gpfreitas may be working on this.

I'm now adding builtin transformers from elm.sample_util (now located at earthio.filters) - hopefully this shouldn't be too involved now that the hard part is done. After that I'll see how we might incorporate dask into the subclasses which inherit from Step; at the moment my plan is to use dask.delayed.

PeterDSteinberg · 2017-09-14T20:06:09Z

PR looks good - adding Step based classes at this point sounds good - thanks @gbrener !

PeterDSteinberg · 2017-10-11T16:45:38Z

@gbrener I moved the FeatureUnion work to a separate issue, #22. I'm closing this one. If anything else on this issue has not been handled yet, please open separate issues.

PeterDSteinberg assigned gbrener Sep 8, 2017

PeterDSteinberg mentioned this issue Sep 9, 2017

Fix the deploy section of .travis.yml #11

Closed

gbrener mentioned this issue Sep 14, 2017

Enable parameterization of dataset transforms #14

Merged

PeterDSteinberg mentioned this issue Sep 18, 2017

Use scikit-learn BaseEstimator as a base class for all pipeline steps ContinuumIO/elm#194

Closed

PeterDSteinberg mentioned this issue Oct 11, 2017

Parallelism of FeatureUnion for xarray_filters #22

Open

PeterDSteinberg closed this as completed Oct 11, 2017
