Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tolerance is not working #11

Closed
willirath opened this issue Jun 30, 2020 · 7 comments
Closed

Tolerance is not working #11

willirath opened this issue Jun 30, 2020 · 7 comments
Labels
bug Something isn't working

Comments

@willirath
Copy link
Contributor

This https://github.com/ESM-VFC/xoak/blob/53a877582077fe076e35d13df6925167ef839a4c/tests/test_catesian_coords_random_data.py#L65 fails because currently, https://github.com/ESM-VFC/xoak/blob/ffa933406461256eced4a4bc776e098147b5045f/src/xoak/core.py#L95 modifies the size of the index array.

@willirath willirath added the bug Something isn't working label Jun 30, 2020
@willirath
Copy link
Contributor Author

I suggest we change the API here and drop the tolerance keyword, but allow for requesting the distances along with the selected data so that .xoak.sel(..., return_distances=True) will contain an additional coord called sel_distances with the distances calculated based on the metric.

One could argue that xarray.sel() has a tolerance keyword and that hence .xoak.sel() should have one, too. But the tolerance keyword of xarray.sel() is difficult to use in a meaningful way, because it just raises a KeyError whenever any of the indexer points violates the tolerance. So in real applications, I guess what people will do is calculate the distances and do masking / dropping outside of xarray.sel() anyway.

@willirath
Copy link
Contributor Author

The reason why I think allowing for .xoak.sel(..., return_distances=True) rather than just dropping the tolerance keyword is that in our case, the metric may be difficult to access and apply after the fact. (Users would need to import it from sklearn and figure out how to apply it themselves.)

@benbovy
Copy link
Member

benbovy commented Jun 30, 2020

Ah good catch.

I agree that returning distances as a coordinate (or data variable?) is useful. I'm not a big fan of methods returning new variables with hard-coded names, so I would rather lean towards .xoak.sel(..., neighbor_distances=None), and replace None by any string if one wants to get the distances.

That said, I think that supporting a tolerance might still be useful and possible. We could track the invalid indices and then set data as missing values (and maybe drop) after calling isel, e.g., using .where(reshaped_dist <= tolerance, drop=True).

@willirath
Copy link
Contributor Author

I agree that returning distances as a coordinate (or data variable?) is useful. I'm not a big fan of methods returning new variables with hard-coded names, so I would rather lean towards .xoak.sel(..., neighbor_distances=None), and replace None by any string if one wants to get the distances.

I like this.

set data as missing values (and maybe drop) after calling isel

This is also a good solution.

, e.g., using .where(drop=True).

IMO, dropping should be up to the user. Otherwise, it's not easy to tell which points of the indexer were / weren't used.

@benbovy
Copy link
Member

benbovy commented Jun 30, 2020

IMO, dropping should be up to the user

Agreed.

A. Another keyword argument .xoak.sel(drop=False)?
B. let users call .dropna() on the returned Dataset / DataArray?

Option A might be confusing, as there is already a drop argument in .sel() that has a different meaning.

It might be worth to try xarray advanced indexing, i.e., ds.sel(x=da, method='nearest')) to see what's the behavior...

@benbovy
Copy link
Member

benbovy commented Jun 30, 2020

It might be worth to try xarray advanced indexing, i.e., ds.sel(x=da, method='nearest')) to see what's the behavior...

Ah I could have read your comment above more carefully:

One could argue that xarray.sel() has a tolerance keyword and that hence .xoak.sel() should have one, too. But the tolerance keyword of xarray.sel() is difficult to use in a meaningful way, because it just raises a KeyError whenever any of the indexer points violates the tolerance.

Agreed. I think that this behavior will eventually need to be addressed in xarray, at least for point-wise indexing. It occurs to me too that returning elements with missing value seems more useful than raising a KeyError.

@benbovy
Copy link
Member

benbovy commented Dec 15, 2020

Tolerance has been removed in #18. I also opened #26 for the suggestion of returning the distances to the nearest neighbors.

@benbovy benbovy closed this as completed Dec 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants