This repository has been archived by the owner on Oct 14, 2018. It is now read-only.

cross validation for xarray_filters.MLDataset - Elm PR 221 #61

Closed

Conversation

PeterDSteinberg
Collaborator

@PeterDSteinberg PeterDSteinberg commented Oct 26, 2017

[Work in progress] - This is a PR corresponding to ContinuumIO/elm#221 - cross validation for xarray-based data structures.

TODO:

  • Finish Elm PR 221
  • See how these changes relate to dask-ml, if at all
  • Remove print statements from here

Member

@TomAugspurger TomAugspurger left a comment


@PeterDSteinberg sorry for not getting around to looking at this sooner. Been busy :/

Could you give a high-level overview of how elm.Pipeline differs from an sklearn.Pipeline, or link to docs laying out the difference? Is this up to date?

More specifically, could you lay out why the changes here are necessary?

Just trying to think through this from a maintenance perspective. Currently the scope is limited to "drop-in replacement for scikit-learn Pipelines", and I want to understand how that would change with this. Additionally, I want to think through some future-proofing scenarios, and if this sets us up for conflicts with future scikit-learn enhancements.

Does elm.Pipeline only affect X? Scikit-learn has an issue for allowing transforms on y: scikit-learn/scikit-learn#4143

How does elm.Pipeline's sample_weight relate to the (unimplemented) property routing in scikit-learn/scikit-learn#9566? (That PR covers a lot, specifically the {'fit_params': sample_weights} bit.)

cc @jcrist

post_splits = getattr(self, '_post_splits', None)
if post_splits:
    result = post_splits(np.array(X)[inds])
self.cache[n, True, is_train] = result
Member

Does post_splits imply that you always have a cache? If so this can be rewritten as

if post_splits:
    result = ...
else:
    result = safe_indexing(...)

if self.cache is not None:
    self.cache[n, is_x, is_train] = result

Member

And what happens if a user passes GridSearchCV(..., cache_cv=False)?

Collaborator Author

Good question - not sure how usage of a post_splits callable relates to cache_cv=False. My Dataset/MLDataset work so far considered only cache_cv=True.

The idea of post_splits is shown more in CVCacheSampler. It is a way of calling a sampler function on the argument X given to fit. See the usage of the sampler argument to EaSearchCV in elm PR 221. That example uses a sampler that expects X to be a list of dates, and the sampler function determines how to build a Dataset/MLDataset for the list of dates. In other cases, X could be a list of file names, and the cv argument to EaSearchCV controls how to split those file names into test / train groups. I like the idea of a sampler function for cross validation / hyperparameterization of Dataset/MLDataset workflows because it makes no assumption about the shapes of the DataArrays (unlike cross validation in typical sklearn workflows, where the cv object is used to subset train/test row groupings of a large matrix).
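For illustration, a minimal sketch of the sampler idea: X given to fit is a list of keys (dates or file names), and a user-supplied sampler materializes the data for each train/test subset. The class and sampler below are hypothetical names, not the actual elm/dask-searchcv CVCacheSampler API.

```python
import numpy as np

def sampler(dates):
    # Stand-in for e.g. loading netCDF files for the given dates and
    # returning a (samples, features) array built from them.
    return np.array([[len(d), i] for i, d in enumerate(dates)], dtype=float)

class CVCacheSamplerSketch:
    """Cache CV splits that are materialized by calling a sampler."""
    def __init__(self, splits, sampler):
        self.splits = splits      # list of (train_inds, test_inds) pairs
        self.sampler = sampler
        self.cache = {}

    def extract(self, X, n, is_train=True):
        key = (n, is_train)
        if key not in self.cache:
            inds = self.splits[n][0 if is_train else 1]
            # Call the sampler on the selected keys instead of row-slicing
            # an existing matrix, as standard scikit-learn CV would.
            self.cache[key] = self.sampler([X[i] for i in inds])
        return self.cache[key]

dates = ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04']
cv = CVCacheSamplerSketch([([0, 1], [2, 3]), ([2, 3], [0, 1])], sampler)
print(cv.extract(dates, 0).shape)  # (2, 2)
```

Because splitting happens over the list of keys, nothing constrains the shape of what the sampler returns per split.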

TODO (for me):

  • Add a test for cache_cv=False with a sampler function - fix or document requirements there.

Member

the sampler function determines how to build a Dataset/MLDataset for the list of dates.

Is the returned X (and y?) constrained to be the same shape as the input?

Collaborator Author

More thought is needed on the sampler idea in general - currently it avoids the call to check_consistent_length. The sampler may return X or an (X, y) tuple. X may be a Dataset/MLDataset or array-like, and y is currently expected to be array-like.

Collaborator Author

I think I should just make sure I call check_consistent_length on samples of X or (X, y) returned by the sampler and return FIT_FAILURE if there are inconsistent lengths. I'll experiment with that today/tomorrow.
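A hedged sketch of that validation step. FIT_FAILURE and validate_sample are illustrative names standing in for the actual dask-searchcv internals:

```python
# Sentinel returned in place of a fitted estimator when a sample is bad.
FIT_FAILURE = object()

def validate_sample(sample):
    """Return (X, y) from a sampler result, or FIT_FAILURE on a length mismatch."""
    if isinstance(sample, tuple):
        X, y = sample
    else:
        X, y = sample, None
    # Mirror sklearn's check_consistent_length: X and y must agree when y exists.
    if y is not None and len(X) != len(y):
        return FIT_FAILURE
    return X, y

print(validate_sample(([1, 2, 3], [0, 1, 1])))          # ([1, 2, 3], [0, 1, 1])
print(validate_sample(([1, 2], [0, 1, 1])) is FIT_FAILURE)  # True
```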

@@ -226,6 +262,7 @@ def fit(est, X, y, error_score='raise', fields=None, params=None,

def fit_transform(est, X, y, error_score='raise', fields=None, params=None,
                  fit_params=None):
    new_y = None
Member

Where is this used?

Collaborator Author

Leftover from an old effort (new_y) - I just ran a PEP8 check and found that and a few other things to fix.

@mrocklin
Member

mrocklin commented Nov 7, 2017

@PeterDSteinberg it might be wise to answer some of the more general questions above, like

More specifically, could you lay out why the changes here are necessary?

This might be useful in order to determine if these changes are even in scope for the dask-searchcv project.

@PeterDSteinberg
Collaborator Author

Hi @TomAugspurger @mrocklin . A bit of background.

The docs pipeline.rst link is out of date; see test_xarray_cross_validation.py in elm PR 221 for current usage.

Why are the changes necessary?

This will allow cross validation with dask.distributed where cross validation is on Dataset/MLDataset structures. An example workflow would use a sampler function to load 100 netCDF files with xarray.open_mfdataset into a Dataset of 3D climate arrays, then use transformers that do stats / preprocessing on the 3D arrays before making a feature matrix passed to transformers/estimators that take a typical 2D tabular feature matrix, e.g. PCA or KMeans. In that example, the reduction from 3D arrays to 2D may involve time series averaging / binning, and it would be nice to hyperparameterize the averaging/binning parameters.
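For illustration only, a numpy sketch (not the elm API) of the kind of 3D-to-2D reduction described above, where the number of time bins is the sort of parameter one might hyperparameterize:

```python
import numpy as np

def time_bin_features(cube, n_bins):
    """Average a (time, lat, lon) cube over time bins, then flatten to a
    (samples, features) matrix: one row per spatial cell, one column per bin."""
    t, ny, nx = cube.shape
    assert t % n_bins == 0, "time axis must divide evenly into bins"
    # Mean within each time bin -> (n_bins, ny, nx)
    binned = cube.reshape(n_bins, t // n_bins, ny, nx).mean(axis=1)
    return binned.reshape(n_bins, ny * nx).T

cube = np.random.rand(12, 4, 5)        # e.g. 12 monthly 4x5 grids
X = time_bin_features(cube, n_bins=4)  # quarterly means per grid cell
print(X.shape)  # (20, 4)
```

Searching over n_bins (and similar reduction parameters) is then just another hyperparameter dimension, which is why the sampler/transformer steps need to run inside cross validation.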

Regarding:

Scikit-learn has an issue for allowing transforms on y

elm.pipeline aims to support transformers that return X, y tuples or just X. In the 3D feature engineering example above, there could be several steps passing an MLDataset of 3D arrays, then a step calling MLDataset.to_features() or other methods to create an X, y tuple. Passing y through a pipeline is required when using a sampler function. The changes in this PR related to _split_Xy are for passing X, y tuples or detecting whether they have been returned by a transformer (as opposed to just a Dataset/array-like X).
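A minimal sketch of that splitting logic, with split_Xy as a hypothetical name and a deliberately simple detection heuristic (the real _split_Xy may differ):

```python
import numpy as np

def split_Xy(result, y=None):
    """Return (X, y) from a transformer's output, taking y from the
    output when the step returned an (X, y) tuple."""
    if isinstance(result, tuple) and len(result) == 2:
        return result
    return result, y

# A transformer returning only X leaves y unchanged ...
X, y = split_Xy(np.arange(6).reshape(3, 2), y=np.array([0, 1, 1]))
# ... while one returning (X, y) replaces it.
X2, y2 = split_Xy((np.ones((3, 2)), np.zeros(3)))
print(y2)  # [0. 0. 0.]
```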

On:

How does elm.Pipeline's sample_weight relate to the (unimplemented) property routing in scikit-learn/scikit-learn#9566? (That PR covers a lot, specifically the {'fit_params': sample_weights} bit.)

elm.pipeline in Phase 1 (the time of that .rst file) supported passing sample_weight, but it was brittle and not consistent with the cross validation ideas of sklearn or this PR. I will read up on that sklearn issue 9566 (and scikit-learn/scikit-learn#4143).

I'm not in a major rush to get this merged and can work through any issues you think of.

@mrocklin
Member

mrocklin commented Nov 7, 2017

I'm not in a major rush to get this merged and can work through any issues you think of.

It's not so much a timing issue as an "is this in scope" issue.

Are there ways to accomplish your goals without directly injecting elm code into this project? For example are there things that this project should generalize or protocols that we might adhere to that might be a little less specific to your project but would still allow you to make progress?

@PeterDSteinberg
Collaborator Author

Agreed @mrocklin - I see your point and don't want to cause scope creep / tech debt here (the time I'm asking for is to look into that further). Currently the elm import is optional. I'll look at the issues above in further detail, but my impression now is that much of the logic here for splitting X, y tuples is also helpful for a pipeline where transformers can return y (unrelated to Dataset-like arguments).

@PeterDSteinberg
Collaborator Author

Closing this in favor of a less wordy solution in #61

@jbednar

jbednar commented Feb 1, 2018

Probably meant "in favor of #65", as this PR is #61.
