cross validation for xarray_filters.MLDataset - Elm PR 221 #61
Conversation
@PeterDSteinberg sorry for not getting around to looking at this sooner. Been busy :/
Could you give a high-level overview of how elm.Pipeline differs from an sklearn.Pipeline, or link to docs laying out the difference? Is this up to date?
More specifically, could you lay out why the changes here are necessary?
Just trying to think through this from a maintenance perspective. Currently the scope is limited to "drop-in replacement for scikit-learn Pipelines", and I want to understand how that would change with this. Additionally, I want to think through some future-proofing scenarios, and if this sets us up for conflicts with future scikit-learn enhancements.
Does elm.Pipeline only affect X? Scikit-learn has an issue for allowing transforms on y: scikit-learn/scikit-learn#4143
How does elm.Pipeline's sample_weight relate to the (unimplemented) property routing in scikit-learn/scikit-learn#9566? (That PR covers a lot, specifically the {'fit_params': sample_weights} bit.)
cc @jcrist
dask_searchcv/methods.py
Outdated
post_splits = getattr(self, '_post_splits', None)
if post_splits:
    result = post_splits(np.array(X)[inds])
    self.cache[n, True, is_train] = result
Does post_splits imply that you always have a cache? If so, this can be rewritten as
if post_splits:
    result = ...
else:
    result = safe_indexing(...)
if self.cache is not None:
    self.cache[n, is_x, is_train] = result
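The control flow suggested above can be sketched as a small, self-contained class (a hypothetical illustration only — `CVCacheSketch` is not the real dask-searchcv class, and `safe_indexing` is stood in by plain list indexing):

```python
# Sketch of the suggested refactor: apply post_splits when present,
# otherwise plain indexing, and only touch the cache if one exists
# (i.e. cache_cv=True). Names are illustrative, not the real API.
class CVCacheSketch:
    def __init__(self, cache=None, post_splits=None):
        self.cache = cache            # dict when cache_cv=True, else None
        self.post_splits = post_splits

    def extract(self, X, inds, n, is_x=True, is_train=True):
        key = (n, is_x, is_train)
        if self.cache is not None and key in self.cache:
            return self.cache[key]
        if self.post_splits is not None:
            result = self.post_splits([X[i] for i in inds])
        else:
            result = [X[i] for i in inds]   # stand-in for safe_indexing
        if self.cache is not None:
            self.cache[key] = result
        return result
```

With this shape, `cache_cv=False` simply means `cache is None` and every extraction is recomputed, whether or not `post_splits` is set.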
And what happens if a user passes GridSearchCV(..., cache_cv=False)?
Good question - not sure how usage of a post_splits callable relates to cache_cv=False. My Dataset/MLDataset work so far considered only cache_cv=True.
The idea of post_splits is shown more in the CVCacheSampler. It is a way of calling a sampler function on the argument X given to fit. See the usage of the sampler argument to EaSearchCV in this elm PR 221. That example uses a sampler that expects X to be a list of dates, and the sampler function determines how to build a Dataset/MLDataset for the list of dates. In other cases, X could be a list of file names, and the cv argument to EaSearchCV controls how to split up those file names into test/train groups. I like the idea of a sampler function for cross validation / hyperparameterization of Dataset/MLDataset workflows because there is no assumption about the shapes of the DataArrays (unlike cross validation in typical sklearn workflows, where the cv object is used to subset train/test row groupings of a large matrix).
TODO (for me):
- Add a test for cache_cv=False with a sampler function - fix or document requirements there.
> the sampler function determines how to build a Dataset/MLDataset for the list of dates.

Is the returned X (any y?) constrained to be the same shape as the input?
More thought needs to be done on the sampler idea in general - currently it avoids the call to check_consistent_length. The sampler may return X or an (X, y) tuple. X may be a Dataset/MLDataset or array-like, and y is currently expected to be array-like.
I think I should just make sure I call check_consistent_length on samples of X or (X, y) returned by sampler and return FIT_FAILURE if there are inconsistent lengths. I'll experiment with that today/tomorrow.
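The proposed guard might look like the following sketch. This is an assumption about the eventual implementation, not the real code: `FIT_FAILURE` here is a stand-in sentinel, and the length check is inlined rather than using `sklearn.utils.check_consistent_length` so the sketch is self-contained:

```python
# Hypothetical validation of a sampler's return value: accept either
# X or an (X, y) tuple, and return a FIT_FAILURE sentinel (rather
# than raising) when X and y have inconsistent lengths, mirroring
# how dask-searchcv records failed fits.
FIT_FAILURE = 'FIT_FAILURE'  # stand-in for the real sentinel

def check_sampler_output(out):
    """Normalize sampler output to (X, y) or return FIT_FAILURE."""
    if isinstance(out, tuple):
        X, y = out
    else:
        X, y = out, None
    if y is not None and len(X) != len(y):
        return FIT_FAILURE
    return X, y
```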
dask_searchcv/methods.py
Outdated
@@ -226,6 +262,7 @@ def fit(est, X, y, error_score='raise', fields=None, params=None,

def fit_transform(est, X, y, error_score='raise', fields=None, params=None,
                  fit_params=None):
    new_y = None
Where is this used?
Leftover from an old effort (new_y) - just ran a PEP8 check and found that and a few other things to fix.
@PeterDSteinberg it might be wise to answer some of the more general questions above. This might be useful in order to determine whether these changes are even in scope for the dask-searchcv project.
Hi @TomAugspurger @mrocklin. A bit of background: the docs pipeline.rst link is out of date; see test_xarray_cross_validation.py in elm PR 221 for current usage.
This will allow cross validation using dask.distributed where cross validation is on Dataset/MLDataset structures. Example workflows would be using a …

I'm not in a major rush to get this merged and can work through any issues you think of.
It's not so much a timing issue as an "is this in scope" issue. Are there ways to accomplish your goals without directly injecting elm code into this project? For example, are there things that this project should generalize, or protocols that we might adhere to, that might be a little less specific to your project but would still allow you to make progress?
Agreed @mrocklin - I see your point - I don't want to cause scope creep / tech debt here (the time request being to look into that further). Currently the …
Closing this one in favor of a less wordy solution in #61
[Work in progress] - This is a PR corresponding to ContinuumIO/elm#221 - cross validation for xarray-based data structures.
TODO: