Skip to content

Support extension array indexes #9671

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 242 commits into
base: main
Choose a base branch
from

Conversation

ilan-gold
Copy link
Contributor

Identical to kmuehlbauer#1 - probably not very helpful in terms of changes since https://github.com/kmuehlbauer/xarray/tree/any-time-resolution-2 contains most of it....

Sorry, something went wrong.

kmuehlbauer and others added 30 commits October 18, 2024 07:31

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
…ore/variable.py to use any-precision datetime/timedelta with autmatic inferring of resolution

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
…ocessing, raise now early

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
…_ref_date

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
…o fix mypy

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
…t resolution, fix code and tests to allow this

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
for more information, see https://pre-commit.ci

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
…-resolution

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
… more carefully, for now using pd.Series to covert `OMm` type datetimes/timedeltas (will result in ns precision)

Verified

This commit was signed with the committer’s verified signature.
kmuehlbauer Kai Mühlbauer
…rray` series creating an extension array when `.array` is accessed
@ilan-gold
Copy link
Contributor Author

@dcherian Could you give a bit of background into the changes you pushed? I'm not really following.

Sorry, this is a total mess. Apparently IndexVariable and Variable now behave differently, and I'm not sure why.

Did I do something wrong in the PR without knowing it i.e., bypassing the tests? It would be great to understand!

@dcherian
Copy link
Contributor

dcherian commented Apr 1, 2025

No you didn't do anything wrong per-se.

  1. I wanted the pandas-specific logic to live inside indexing.py as much as possible (and definitely not in namedarray/core.py, so moving that exposed some other warts. The solution right now is to expose PandasExtensionArray wrapper class.
  2. The groupby_bins tests needed to be updated because previously intervalarray got cast to a numpy object array of tuples.

@dcherian dcherian mentioned this pull request Apr 4, 2025
13 tasks
dcherian added 3 commits April 4, 2025 16:16
* main:
  Fix sparse dask repr test (pydata#10200)
  Apply ruff preview rule RUF046 (pydata#10199)
  DOC: Remove mention of netcdf pypi package (pydata#10197)
@@ -102,7 +102,7 @@ def replace_duck_with_extension_array(args) -> list:
return type(self)[type(res)](res)
return res

def __array_ufunc__(ufunc, method, *inputs, **kwargs):
def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test failure in test_units has to do with this implementation for the equals ufunc.

@ilan-gold can you take a look please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature is correct https://numpy.org/devdocs/user/basics.subclassing.html

I have seen that error before but I can't remember what it means now, maybe a segfault...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that, but we have self included for __array_ufunc__ inside Xarray 🤷🏾‍♂️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok! hopefully my fix helps, but it was hard to make heads or tails of what changed because the diff was super dirty (there was a merge from main or two). please open a PR into mine in the future, would make things easier :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the missing self in the declared signature is a documentation bug: just a couple of lines later in the example the self is included (which makes sense because __array_ufunc__ is supposed to be a method)

Comment on lines 337 to 340
try:
nbytes_str = f" {render_human_readable_nbytes(variable.nbytes)}"
except TypeError:
nbytes_str = " ?"
Copy link
Contributor

@dcherian dcherian Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit ugly, but we'd need to define nbytes on pandas wrappers, and it breaks many reprs in our tests & doctests :/

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@@ -1902,6 +1926,10 @@ def copy(self, deep: bool = True) -> Self:
array = self.array.copy(deep=True) if deep else self.array
return type(self)(array, self._dtype)

@property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change breaks a lot of tests. I would do it in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have something for this :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -192,8 +192,6 @@ def _maybe_wrap_data(data):
if isinstance(data, pd.Index):
return PandasIndexingAdapter(data)
if isinstance(data, pd.api.extensions.ExtensionArray):
if isinstance(data.dtype, pd.Int64Dtype | pd.Float64Dtype | pd.StringDtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This casting was intentional. We'd like these dtypes to be converted to numpy arrays.

Copy link
Contributor Author

@ilan-gold ilan-gold Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I had lost track of who did this or not, I thought I checked the blame but must not have. Could you explain a bit why so I can add a comment?

Copy link
Contributor

@dcherian dcherian Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its mostly backwards compatibility, and we aren't sure how to handle the possibility of both numpy and pandas handling very similar dtypes. So we are opting to not change behaviour

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Apr 22, 2025

I think the failing test is coming from zarr==3.0.7 but otherwise everything seems to pass

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless we want to special-case it in .data or anywhere else that gives access to the underlying array I don't think there's a way around exposing the extension array wrapper (I think the special-casing would make the code a bit more complicated, though). I believe that's fine, though, since that would just be a special kind of array that is compatible with the array API.

@@ -102,7 +102,7 @@ def replace_duck_with_extension_array(args) -> list:
return type(self)[type(res)](res)
return res

def __array_ufunc__(ufunc, method, *inputs, **kwargs):
def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the missing self in the declared signature is a documentation bug: just a couple of lines later in the example the self is included (which makes sense because __array_ufunc__ is supposed to be a method)

@ilan-gold
Copy link
Contributor Author

that is compatible with the array API.

It duck types but is not array-api compatible. That being said I would not be opposed to removing the visible wrapper from .data. I can at least try locally and see what comes of it

@ilan-gold
Copy link
Contributor Author

@keewis Excellent suggestion: ilan-gold#1. It seems to be crazy straightforward

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants