Skip to content

(fix): no fill_value on reindex #10304

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented May 9, 2025

I also added more extension array types (see issue) although these "always" passed in so far as the only change to the xarray codebase in this PR handles pandas extension array dtypes now explicitly with fill values, producing pd.NA, or taking in a custom fill value (like 0, hence the changes to checking for a scalar, which had come up in a previous PR).

@github-actions github-actions bot added topic-documentation topic-arrays related to flexible array support topic-hypothesis Strategies or tests using the hypothesis library labels May 9, 2025
@ilan-gold ilan-gold marked this pull request as draft May 9, 2025 12:03
@@ -3075,6 +3075,24 @@ def test_propagate_attrs(self, func) -> None:
with set_options(keep_attrs=True):
assert func(da).attrs == da.attrs

def test_fillna_extension_array_int(self) -> None:
srs: pd.Series = pd.Series(
index=np.array([1, 2, 3]), data=pd.array([pd.NA, 1, 1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we aren't ready to support these yet. Can we instead autoconvert anything with pd.NA to float and convert to the numpy dtype otherwise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcherian I think we should keep the two issues separate. It sounds like @richard-berg will have a PR that cleans this behavior up (i.e., convert float/int/string types from NumpyExtensionArray, maintain the rest or something close). For now I would change the tests so that his life is easier, but at the end of the day, I think this PR is needed regardless since there is nothing specifically referencing numeric/string types. You can reproduce the issue in #10301 with any kind of extension array type:

import pandas as pd, numpy as np, xarray as xr
cat = pd.Series(pd.Categorical(["a", "b", "c", "b"]))
cat.to_xarray().reindex(index=[1, -1])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 cat.to_xarray().reindex(index=[1, -1])

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataarray.py:2218, in DataArray.reindex(self, indexers, method, tolerance, copy, fill_value, **indexers_kwargs)
   2145 """Conform this object onto the indexes of another object, filling in
   2146 missing values with ``fill_value``. The default fill value is NaN.
   2147 
   (...)   2215 align
   2216 """
   2217 indexers = utils.either_dict_or_kwargs(indexers, indexers_kwargs, "reindex")
-> 2218 return alignment.reindex(
   2219     self,
   2220     indexers=indexers,
   2221     method=method,
   2222     tolerance=tolerance,
   2223     copy=copy,
   2224     fill_value=fill_value,
   2225 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:984, in reindex(obj, indexers, method, tolerance, copy, fill_value, sparse, exclude_vars)
    965 # TODO: (benbovy - explicit indexes): uncomment?
    966 # --> from reindex docstrings: "any mismatched dimension is simply ignored"
    967 # bad_keys = [k for k in indexers if k not in obj._indexes and k not in obj.dims]
   (...)    971 #         "or unindexed dimension in the object to reindex"
    972 #     )
    974 aligner = Aligner(
    975     (obj,),
    976     indexes=indexers,
   (...)    982     exclude_vars=exclude_vars,
    983 )
--> 984 aligner.align()
    985 return aligner.results[0]

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:567, in Aligner.align(self)
    565     self.results = self.objects
    566 else:
--> 567     self.reindex_all()

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:543, in Aligner.reindex_all(self)
    542 def reindex_all(self) -> None:
--> 543     self.results = tuple(
    544         self._reindex_one(obj, matching_indexes)
    545         for obj, matching_indexes in zip(
    546             self.objects, self.objects_matching_indexes, strict=True
    547         )
    548     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:544, in <genexpr>(.0)
    542 def reindex_all(self) -> None:
    543     self.results = tuple(
--> 544         self._reindex_one(obj, matching_indexes)
    545         for obj, matching_indexes in zip(
    546             self.objects, self.objects_matching_indexes, strict=True
    547         )
    548     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:532, in Aligner._reindex_one(self, obj, matching_indexes)
    529 new_indexes, new_variables = self._get_indexes_and_vars(obj, matching_indexes)
    530 dim_pos_indexers = self._get_dim_pos_indexers(matching_indexes)
--> 532 return obj._reindex_callback(
    533     self,
    534     dim_pos_indexers,
    535     new_variables,
    536     new_indexes,
    537     self.fill_value,
    538     self.exclude_dims,
    539     self.exclude_vars,
    540 )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataarray.py:1934, in DataArray._reindex_callback(self, aligner, dim_pos_indexers, variables, indexes, fill_value, exclude_dims, exclude_vars)
   1931         fill_value[_THIS_ARRAY] = value
   1933 ds = self._to_temp_dataset()
-> 1934 reindexed = ds._reindex_callback(
   1935     aligner,
   1936     dim_pos_indexers,
   1937     variables,
   1938     indexes,
   1939     fill_value,
   1940     exclude_dims,
   1941     exclude_vars,
   1942 )
   1944 da = self._from_temp_dataset(reindexed)
   1945 da.encoding = self.encoding

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataset.py:3277, in Dataset._reindex_callback(self, aligner, dim_pos_indexers, variables, indexes, fill_value, exclude_dims, exclude_vars)
   3271 else:
   3272     to_reindex = {
   3273         k: v
   3274         for k, v in self.variables.items()
   3275         if k not in variables and k not in exclude_vars
   3276     }
-> 3277     reindexed_vars = alignment.reindex_variables(
   3278         to_reindex,
   3279         dim_pos_indexers,
   3280         copy=aligner.copy,
   3281         fill_value=fill_value,
   3282         sparse=aligner.sparse,
   3283     )
   3284     new_variables.update(reindexed_vars)
   3285     new_coord_names = self._coord_names | set(new_indexes)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:83, in reindex_variables(variables, dim_pos_indexers, copy, fill_value, sparse)
     80 needs_masking = any(d in masked_dims for d in var.dims)
     82 if needs_masking:
---> 83     new_var = var._getitem_with_mask(indxr, fill_value=fill_value_)
     84 elif all(is_full_slice(k) for k in indxr):
     85     # no reindexing necessary
     86     # here we need to manually deal with copying data, since
     87     # we neither created a new ndarray nor used fancy indexing
     88     new_var = var.copy(deep=copy)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/variable.py:798, in Variable._getitem_with_mask(self, key, fill_value)
    789 # TODO(shoyer): expose this method in public API somewhere (isel?) and
    790 # use it for reindex.
    791 # TODO(shoyer): add a sanity check that all other integers are
   (...)    794 # that is actually indexed rather than mapping it to the last value
    795 # along each axis.
    797 if fill_value is dtypes.NA:
--> 798     fill_value = dtypes.get_fill_value(self.dtype)
    800 dims, indexer, new_order = self._broadcast_indexes(key)
    802 if self.size:

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dtypes.py:112, in get_fill_value(dtype)
    101 def get_fill_value(dtype):
    102     """Return an appropriate fill value for this dtype.
    103 
    104     Parameters
   (...)    110     fill_value : Missing value corresponding to this dtype.
    111     """
--> 112     _, fill_value = maybe_promote(dtype)
    113     return fill_value

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dtypes.py:66, in maybe_promote(dtype)
     64 dtype_: np.typing.DTypeLike
     65 fill_value: Any
---> 66 if HAS_STRING_DTYPE and np.issubdtype(dtype, np.dtypes.StringDType()):
     67     # for now, we always promote string dtypes to object for consistency with existing behavior
     68     # TODO: refactor this once we have a better way to handle numpy vlen-string dtypes
     69     dtype_ = object
     70     fill_value = np.nan

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/numpy/_core/numerictypes.py:530, in issubdtype(arg1, arg2)
    473 r"""
    474 Returns True if first argument is a typecode lower/equal in type hierarchy.
    475 
   (...)    527 
    528 """
    529 if not issubclass_(arg1, generic):
--> 530     arg1 = dtype(arg1).type
    531 if not issubclass_(arg2, generic):
    532     arg2 = dtype(arg2).type

TypeError: Cannot interpret 'CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)' as a data type

If @richard-berg does not get permission to handle this sort of thing, I can make a PR and hopefully he can review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a small fix here to use arrow types as the "other" extension array type to ensure some level of better coverage, which indeed revealed a small bug. I will update the reindexing tests as well to use them.

Copy link
Contributor Author

@ilan-gold ilan-gold May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two fixes:

  1. https://github.com/pydata/xarray/pull/10304/files#diff-c803294f5216cbbdffa30f0b0c9f16a7e39855d4dd309c88d654bc317a78adc0R54-R64 + https://github.com/pydata/xarray/pull/10304/files#diff-5803161603b7d9c554e24313b463cc6f826d644128cb9e58755de4e2a0ac7467R179-R202 We didn't have a reshape function registered so it was falling back to numpy which was problematic for repr once it was added here because repr didn't know about extension arrays. That has been resolved.
  2. The arrow dtype indexing implementation lacked handling of tuples: https://github.com/pydata/xarray/pull/10304/files#diff-c803294f5216cbbdffa30f0b0c9f16a7e39855d4dd309c88d654bc317a78adc0R124-R127

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue is a lot deeper than I realized: pandas-dev/pandas#61433, actually for the indexing issue with arrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, it looks like this doesn't have an impact here, as it turns out. It still seems like we're getting away with murder a bit because it is possible to yield 0-dimensional arrow arrays from some data types and not others. When it is possible, that gels with what xarray expects, but when it's not possible, we have problems. We probably need to look into a solution for this. Maybe making __getitem__ polymorphic, but fixing this issue + adding the test cases opened up a few different cans of worms here, so I will be able to track that issue and then can work on a fix here separately.

The broken test is similarly related to unexpected behavior around arrow dtypes again...will look into that as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-arrays related to flexible array support topic-documentation topic-hypothesis Strategies or tests using the hypothesis library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression in DataArrays created from Pandas
2 participants