This repository has been archived by the owner on Oct 14, 2018. It is now read-only.

cross validation for xarray_filters.MLDataset - Elm PR 221 #61

Closed

Conversation

PeterDSteinberg
Collaborator

@PeterDSteinberg PeterDSteinberg commented Oct 26, 2017

[Work in progress] - This is a PR corresponding to ContinuumIO/elm#221 - cross validation for xarray-based data structures.

TODO:

  • Finish Elm PR 221
  • See how these changes relate to dask-ml, if at all
  • Remove print statements from here

Member

@TomAugspurger TomAugspurger left a comment


@PeterDSteinberg sorry for not getting around to looking at this sooner. Been busy :/

Could you give a high-level overview of how elm.Pipeline differs from an sklearn.Pipeline, or link to docs laying out the difference? Is this up to date?

More specifically, could you lay out why the changes here are necessary?

Just trying to think through this from a maintenance perspective. Currently the scope is limited to "drop-in replacement for scikit-learn Pipelines", and I want to understand how that would change with this. Additionally, I want to think through some future-proofing scenarios, and if this sets us up for conflicts with future scikit-learn enhancements.

Does elm.Pipeline only affect X? Scikit-learn has an issue for allowing transforms on y: scikit-learn/scikit-learn#4143

How does elm.Pipeline's sample_weight relate to the (unimplemented) property routing in scikit-learn/scikit-learn#9566? (That PR covers a lot, specifically the {'fit_params': sample_weights} bit.)

cc @jcrist

post_splits = getattr(self, '_post_splits', None)
if post_splits:
    result = post_splits(np.array(X)[inds])
self.cache[n, True, is_train] = result
Member

Does post_splits imply that you always have a cache? If so this can be rewritten as

if post_splits:
    result = ...
else:
    result = safe_indexing(...)

if self.cache is not None:
    self.cache[n, is_x, is_train] = result

Member

And what happens if a user passes GridSearchCV(..., cache_cv=False)?

Collaborator Author

Good question - not sure how usage of a post_splits callable relates to cache_cv=False. My Dataset/MLDataset work so far considered only cache_cv=True.

The idea of post_splits is shown more in CVCacheSampler. It is a way of calling a sampler function on the argument X given to fit. See the usage of the sampler argument to EaSearchCV in elm PR 221. That example uses a sampler that expects X to be a list of dates, and the sampler function determines how to build a Dataset/MLDataset for the list of dates. In other cases, X could be a list of file names, and the cv argument to EaSearchCV controls how to split those file names into test / train groups. I like the idea of a sampler function for cross validation / hyperparameterization of Dataset/MLDataset workflows because it makes no assumption about the shapes of the DataArrays (unlike cross validation in typical sklearn workflows, where the cv object is used to subset train/test row groupings of a large matrix).
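For illustration, a minimal sketch of the sampler idea: X given to fit is a list of keys (dates or file names), and a user-supplied sampler materializes the data for each train/test subset. The class and sampler below are hypothetical names, not the actual elm/dask-searchcv CVCacheSampler API.

```python
import numpy as np

def sampler(dates):
    # Stand-in for e.g. loading netCDF files for the given dates and
    # returning a (samples, features) array built from them.
    return np.array([[len(d), i] for i, d in enumerate(dates)], dtype=float)

class CVCacheSamplerSketch:
    """Cache CV splits that are materialized by calling a sampler."""
    def __init__(self, splits, sampler):
        self.splits = splits      # list of (train_inds, test_inds) pairs
        self.sampler = sampler
        self.cache = {}

    def extract(self, X, n, is_train=True):
        key = (n, is_train)
        if key not in self.cache:
            inds = self.splits[n][0 if is_train else 1]
            # Call the sampler on the selected keys instead of row-slicing
            # an existing matrix, as standard scikit-learn CV would.
            self.cache[key] = self.sampler([X[i] for i in inds])
        return self.cache[key]

dates = ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04']
cv = CVCacheSamplerSketch([([0, 1], [2, 3]), ([2, 3], [0, 1])], sampler)
print(cv.extract(dates, 0).shape)  # (2, 2)
```

Because splitting happens over the list of keys, nothing constrains the shape of what the sampler returns per split.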

TODO (for me):

  • Add a test for cache_cv=False with a sampler function - fix or document requirements there.

Member

the sampler function determines how to build a Dataset/MLDataset for the list of dates.

Is the returned X (and y?) constrained to be the same shape as the input?

Collaborator Author

More thought is needed on the sampler idea in general - currently it avoids the call to check_consistent_length. The sampler may return X or an (X, y) tuple. X may be a Dataset/MLDataset or array-like, and y is currently expected to be array-like.

Collaborator Author

I think I should just make sure I call check_consistent_length on samples of X or (X, y) returned by the sampler and return FIT_FAILURE if there are inconsistent lengths. I'll experiment with that today/tomorrow.
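A hedged sketch of that validation step. FIT_FAILURE and validate_sample are illustrative names standing in for the actual dask-searchcv internals:

```python
# Sentinel returned in place of a fitted estimator when a sample is bad.
FIT_FAILURE = object()

def validate_sample(sample):
    """Return (X, y) from a sampler result, or FIT_FAILURE on a length mismatch."""
    if isinstance(sample, tuple):
        X, y = sample
    else:
        X, y = sample, None
    # Mirror sklearn's check_consistent_length: X and y must agree when y exists.
    if y is not None and len(X) != len(y):
        return FIT_FAILURE
    return X, y

print(validate_sample(([1, 2, 3], [0, 1, 1])))          # ([1, 2, 3], [0, 1, 1])
print(validate_sample(([1, 2], [0, 1, 1])) is FIT_FAILURE)  # True
```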

@@ -226,6 +262,7 @@ def fit(est, X, y, error_score='raise', fields=None, params=None,

def fit_transform(est, X, y, error_score='raise', fields=None, params=None,
                  fit_params=None):
    new_y = None
Member

Where is this used?

Collaborator Author

Leftover from an old effort (new_y) - I just ran a PEP8 check and found that and a few other things to fix.

@mrocklin
Member

mrocklin commented Nov 7, 2017

@PeterDSteinberg it might be wise to answer some of the more general questions above, like

More specifically, could you lay out why the changes here are necessary?

This might be useful in order to determine if these changes are even in scope for the dask-searchcv project.

@PeterDSteinberg
Collaborator Author

Hi @TomAugspurger @mrocklin . A bit of background.

The docs pipeline.rst link is out of date; see test_xarray_cross_validation.py in elm PR 221 for current usage.

Why are the changes necessary?

This will allow cross validation with dask.distributed where cross validation is on Dataset/MLDataset structures. An example workflow would use a sampler function to load 100 netCDF files with xarray.open_mfdataset into a Dataset of 3D climate arrays, then use transformers that do stats / preprocessing on the 3D arrays before making a feature matrix passed to transformers/estimators that take a typical 2D tabular feature matrix, e.g. PCA or KMeans. In that example, the reduction from 3D arrays to 2D may involve time series averaging / binning, and it would be nice to hyperparameterize the averaging/binning parameters.
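For illustration only, a numpy sketch (not the elm API) of the kind of 3D-to-2D reduction described above, where the number of time bins is the sort of parameter one might hyperparameterize:

```python
import numpy as np

def time_bin_features(cube, n_bins):
    """Average a (time, lat, lon) cube over time bins, then flatten to a
    (samples, features) matrix: one row per spatial cell, one column per bin."""
    t, ny, nx = cube.shape
    assert t % n_bins == 0, "time axis must divide evenly into bins"
    # Mean within each time bin -> (n_bins, ny, nx)
    binned = cube.reshape(n_bins, t // n_bins, ny, nx).mean(axis=1)
    return binned.reshape(n_bins, ny * nx).T

cube = np.random.rand(12, 4, 5)        # e.g. 12 monthly 4x5 grids
X = time_bin_features(cube, n_bins=4)  # quarterly means per grid cell
print(X.shape)  # (20, 4)
```

Searching over n_bins (and similar reduction parameters) is then just another hyperparameter dimension, which is why the sampler/transformer steps need to run inside cross validation.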

Regarding:

Scikit-learn has an issue for allowing transforms on y

elm.pipeline aims to support transformers that return X, y tuples or just X. In the 3D feature engineering example above, there could be several steps passing an MLDataset of 3D arrays, then a step calling MLDataset.to_features() or other methods to create an X, y tuple. Passing y through a pipeline is required when using a sampler function. The changes in this PR related to _split_Xy are for passing X, y tuples or detecting whether they have been returned by a transformer (as opposed to just a Dataset/array-like X).
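A minimal sketch of that splitting logic, with split_Xy as a hypothetical name and a deliberately simple detection heuristic (the real _split_Xy may differ):

```python
import numpy as np

def split_Xy(result, y=None):
    """Return (X, y) from a transformer's output, taking y from the
    output when the step returned an (X, y) tuple."""
    if isinstance(result, tuple) and len(result) == 2:
        return result
    return result, y

# A transformer returning only X leaves y unchanged ...
X, y = split_Xy(np.arange(6).reshape(3, 2), y=np.array([0, 1, 1]))
# ... while one returning (X, y) replaces it.
X2, y2 = split_Xy((np.ones((3, 2)), np.zeros(3)))
print(y2)  # [0. 0. 0.]
```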

On:

How does elm.Pipeline's sample_weight relate to the (unimplemented) property routing in scikit-learn/scikit-learn#9566? (That PR covers a lot, specifically the {'fit_params': sample_weights} bit.)

elm.pipeline in Phase 1 (the time of that .rst file) supported passing sample_weight, but it was brittle and not consistent with the cross validation ideas of sklearn or this PR. I will read up on that sklearn issue 9566 (and scikit-learn/scikit-learn#4143).

I'm not in a major rush to get this merged and can work through any issues you think of.

@mrocklin
Member

mrocklin commented Nov 7, 2017

I'm not in a major rush to get this merged and can work through any issues you think of.

It's not so much a timing issue as an "is this in scope" issue.

Are there ways to accomplish your goals without directly injecting elm code into this project? For example are there things that this project should generalize or protocols that we might adhere to that might be a little less specific to your project but would still allow you to make progress?

@PeterDSteinberg
Collaborator Author

Agreed @mrocklin - I see your point and don't want to cause scope creep / tech debt here (the time I'm asking for is to look into that further). Currently the elm import is optional. I'll look at the issues above in further detail, but my impression now is that much of the logic here for splitting X, y tuples is also helpful for a pipeline where transformers can return y (unrelated to Dataset-like arguments).

@PeterDSteinberg
Collaborator Author

Closing this in favor of a less wordy solution in #61

@jbednar

jbednar commented Feb 1, 2018

Probably meant "in favor of #65", as this PR is #61.
