-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
(fix): no fill_value
on reindex
#10304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
xarray/tests/test_dataarray.py
Outdated
@@ -3075,6 +3075,24 @@ def test_propagate_attrs(self, func) -> None: | |||
with set_options(keep_attrs=True): | |||
assert func(da).attrs == da.attrs | |||
|
|||
def test_fillna_extension_array_int(self) -> None: | |||
srs: pd.Series = pd.Series( | |||
index=np.array([1, 2, 3]), data=pd.array([pd.NA, 1, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we aren't ready to support these yet. Can we instead autoconvert anything with pd.NA
to float
and convert to the numpy dtype otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcherian I think we should keep the two issues separate. It sounds like @richard-berg will have a PR that cleans this behavior up (i.e., convert float/int/string types from NumpyExtensionArray
, maintain the rest or something close). For now I would change the tests so that his life is easier, but at the end of the day, I think this PR is needed regardless since there is nothing specifically referencing numeric/string types. You can reproduce the issue in #10301 with any kind of extension array type:
import pandas as pd, numpy as np, xarray as xr
cat = pd.Series(pd.Categorical(["a", "b", "c", "b"]))
cat.to_xarray().reindex(index=[1, -1])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 cat.to_xarray().reindex(index=[1, -1])
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataarray.py:2218, in DataArray.reindex(self, indexers, method, tolerance, copy, fill_value, **indexers_kwargs)
2145 """Conform this object onto the indexes of another object, filling in
2146 missing values with ``fill_value``. The default fill value is NaN.
2147
(...) 2215 align
2216 """
2217 indexers = utils.either_dict_or_kwargs(indexers, indexers_kwargs, "reindex")
-> 2218 return alignment.reindex(
2219 self,
2220 indexers=indexers,
2221 method=method,
2222 tolerance=tolerance,
2223 copy=copy,
2224 fill_value=fill_value,
2225 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:984, in reindex(obj, indexers, method, tolerance, copy, fill_value, sparse, exclude_vars)
965 # TODO: (benbovy - explicit indexes): uncomment?
966 # --> from reindex docstrings: "any mismatched dimension is simply ignored"
967 # bad_keys = [k for k in indexers if k not in obj._indexes and k not in obj.dims]
(...) 971 # "or unindexed dimension in the object to reindex"
972 # )
974 aligner = Aligner(
975 (obj,),
976 indexes=indexers,
(...) 982 exclude_vars=exclude_vars,
983 )
--> 984 aligner.align()
985 return aligner.results[0]
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:567, in Aligner.align(self)
565 self.results = self.objects
566 else:
--> 567 self.reindex_all()
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:543, in Aligner.reindex_all(self)
542 def reindex_all(self) -> None:
--> 543 self.results = tuple(
544 self._reindex_one(obj, matching_indexes)
545 for obj, matching_indexes in zip(
546 self.objects, self.objects_matching_indexes, strict=True
547 )
548 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:544, in <genexpr>(.0)
542 def reindex_all(self) -> None:
543 self.results = tuple(
--> 544 self._reindex_one(obj, matching_indexes)
545 for obj, matching_indexes in zip(
546 self.objects, self.objects_matching_indexes, strict=True
547 )
548 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:532, in Aligner._reindex_one(self, obj, matching_indexes)
529 new_indexes, new_variables = self._get_indexes_and_vars(obj, matching_indexes)
530 dim_pos_indexers = self._get_dim_pos_indexers(matching_indexes)
--> 532 return obj._reindex_callback(
533 self,
534 dim_pos_indexers,
535 new_variables,
536 new_indexes,
537 self.fill_value,
538 self.exclude_dims,
539 self.exclude_vars,
540 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataarray.py:1934, in DataArray._reindex_callback(self, aligner, dim_pos_indexers, variables, indexes, fill_value, exclude_dims, exclude_vars)
1931 fill_value[_THIS_ARRAY] = value
1933 ds = self._to_temp_dataset()
-> 1934 reindexed = ds._reindex_callback(
1935 aligner,
1936 dim_pos_indexers,
1937 variables,
1938 indexes,
1939 fill_value,
1940 exclude_dims,
1941 exclude_vars,
1942 )
1944 da = self._from_temp_dataset(reindexed)
1945 da.encoding = self.encoding
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataset.py:3277, in Dataset._reindex_callback(self, aligner, dim_pos_indexers, variables, indexes, fill_value, exclude_dims, exclude_vars)
3271 else:
3272 to_reindex = {
3273 k: v
3274 for k, v in self.variables.items()
3275 if k not in variables and k not in exclude_vars
3276 }
-> 3277 reindexed_vars = alignment.reindex_variables(
3278 to_reindex,
3279 dim_pos_indexers,
3280 copy=aligner.copy,
3281 fill_value=fill_value,
3282 sparse=aligner.sparse,
3283 )
3284 new_variables.update(reindexed_vars)
3285 new_coord_names = self._coord_names | set(new_indexes)
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:83, in reindex_variables(variables, dim_pos_indexers, copy, fill_value, sparse)
80 needs_masking = any(d in masked_dims for d in var.dims)
82 if needs_masking:
---> 83 new_var = var._getitem_with_mask(indxr, fill_value=fill_value_)
84 elif all(is_full_slice(k) for k in indxr):
85 # no reindexing necessary
86 # here we need to manually deal with copying data, since
87 # we neither created a new ndarray nor used fancy indexing
88 new_var = var.copy(deep=copy)
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/variable.py:798, in Variable._getitem_with_mask(self, key, fill_value)
789 # TODO(shoyer): expose this method in public API somewhere (isel?) and
790 # use it for reindex.
791 # TODO(shoyer): add a sanity check that all other integers are
(...) 794 # that is actually indexed rather than mapping it to the last value
795 # along each axis.
797 if fill_value is dtypes.NA:
--> 798 fill_value = dtypes.get_fill_value(self.dtype)
800 dims, indexer, new_order = self._broadcast_indexes(key)
802 if self.size:
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dtypes.py:112, in get_fill_value(dtype)
101 def get_fill_value(dtype):
102 """Return an appropriate fill value for this dtype.
103
104 Parameters
(...) 110 fill_value : Missing value corresponding to this dtype.
111 """
--> 112 _, fill_value = maybe_promote(dtype)
113 return fill_value
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dtypes.py:66, in maybe_promote(dtype)
64 dtype_: np.typing.DTypeLike
65 fill_value: Any
---> 66 if HAS_STRING_DTYPE and np.issubdtype(dtype, np.dtypes.StringDType()):
67 # for now, we always promote string dtypes to object for consistency with existing behavior
68 # TODO: refactor this once we have a better way to handle numpy vlen-string dtypes
69 dtype_ = object
70 fill_value = np.nan
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/numpy/_core/numerictypes.py:530, in issubdtype(arg1, arg2)
473 r"""
474 Returns True if first argument is a typecode lower/equal in type hierarchy.
475
(...) 527
528 """
529 if not issubclass_(arg1, generic):
--> 530 arg1 = dtype(arg1).type
531 if not issubclass_(arg2, generic):
532 arg2 = dtype(arg2).type
TypeError: Cannot interpret 'CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)' as a data type
If @richard-berg does not get permission to handle this sort of thing, I can make a PR and hopefully he can review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a small fix here to use arrow types as the "other" extension array type to ensure some level of better coverage, which indeed revealed a small bug. I will update the reindexing tests as well to use them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two fixes:
- https://github.com/pydata/xarray/pull/10304/files#diff-c803294f5216cbbdffa30f0b0c9f16a7e39855d4dd309c88d654bc317a78adc0R54-R64 + https://github.com/pydata/xarray/pull/10304/files#diff-5803161603b7d9c554e24313b463cc6f826d644128cb9e58755de4e2a0ac7467R179-R202 We didn't have a
reshape
function registered so it was falling back tonumpy
which was problematic forrepr
once it was added here becauserepr
didn't know about extension arrays. That has been resolved. - The arrow dtype indexing implementation lacked handling of tuples: https://github.com/pydata/xarray/pull/10304/files#diff-c803294f5216cbbdffa30f0b0c9f16a7e39855d4dd309c88d654bc317a78adc0R124-R127
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue is a lot deeper than I realized: pandas-dev/pandas#61433, actually for the indexing issue with arrow
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, it looks like this doesn't have an impact here, as it turns out. It still seems like we're getting away with murder a bit because it is possible to yield 0-dimensional arrow arrays from some data types and not others. When it is possible, that gels with what xarray expects, but when it's not possible, we have problems. We probably need to look into a solution for this. Maybe making __getitem__
polymorphic, but fixing this issue + adding the test cases opened up a few different cans of worms here, so I will be able to track that issue and then can work on a fix here separately.
The broken test is similarly related to unexpected behavior around arrow
dtypes again...will look into that as well
I also added more extension array types (see issue) although these "always" passed in so far as the only change to the
xarray
codebase in this PR handles pandas extension array dtypes now explicitly with fill values, producingpd.NA
, or taking in a custom fill value (like 0, hence the changes to checking for a scalar, which had come up in a previous PR).whats-new.rst
api.rst