(fix): no `fill_value` on reindex #10304

ilan-gold · 2025-05-09T12:00:10Z

I also added more extension array types (see issue) although these "always" passed in so far as the only change to the xarray codebase in this PR handles pandas extension array dtypes now explicitly with fill values, producing pd.NA, or taking in a custom fill value (like 0, hence the changes to checking for a scalar, which had come up in a previous PR).

Closes Regression in DataArrays created from Pandas #10301
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

dcherian · 2025-05-09T13:23:57Z

xarray/tests/test_dataarray.py

@@ -3075,6 +3075,24 @@ def test_propagate_attrs(self, func) -> None:
        with set_options(keep_attrs=True):
            assert func(da).attrs == da.attrs

+    def test_fillna_extension_array_int(self) -> None:
+        srs: pd.Series = pd.Series(
+            index=np.array([1, 2, 3]), data=pd.array([pd.NA, 1, 1])


we aren't ready to support these yet. Can we instead autoconvert anything with pd.NA to float and convert to the numpy dtype otherwise?

@dcherian I think we should keep the two issues separate. It sounds like @richard-berg will have a PR that cleans this behavior up (i.e., convert float/int/string types from NumpyExtensionArray, maintain the rest or something close). For now I would change the tests so that his life is easier, but at the end of the day, I think this PR is needed regardless since there is nothing specifically referencing numeric/string types. You can reproduce the issue in #10301 with any kind of extension array type:

import pandas as pd, numpy as np, xarray as xr cat = pd.Series(pd.Categorical(["a", "b", "c", "b"])) cat.to_xarray().reindex(index=[1, -1])

--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[3], line 1 ----> 1 cat.to_xarray().reindex(index=[1, -1]) File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataarray.py:2218, in DataArray.reindex(self, indexers, method, tolerance, copy, fill_value, **indexers_kwargs) 2145 """Conform this object onto the indexes of another object, filling in 2146 missing values with ``fill_value``. The default fill value is NaN. 2147 (...) 2215 align 2216 """ 2217 indexers = utils.either_dict_or_kwargs(indexers, indexers_kwargs, "reindex") -> 2218 return alignment.reindex( 2219 self, 2220 indexers=indexers, 2221 method=method, 2222 tolerance=tolerance, 2223 copy=copy, 2224 fill_value=fill_value, 2225 ) File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:984, in reindex(obj, indexers, method, tolerance, copy, fill_value, sparse, exclude_vars) 965 # TODO: (benbovy - explicit indexes): uncomment? 966 # --> from reindex docstrings: "any mismatched dimension is simply ignored" 967 # bad_keys = [k for k in indexers if k not in obj._indexes and k not in obj.dims] (...) 971 # "or unindexed dimension in the object to reindex" 972 # ) 974 aligner = Aligner( 975 (obj,), 976 indexes=indexers, (...) 982 exclude_vars=exclude_vars, 983 ) --> 984 aligner.align() 985 return aligner.results[0] File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:567, in Aligner.align(self) 565 self.results = self.objects 566 else: --> 567 self.reindex_all() File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:543, in Aligner.reindex_all(self) 542 def reindex_all(self) -> None: --> 543 self.results = tuple( 544 self._reindex_one(obj, matching_indexes) 545 for obj, matching_indexes in zip( 546 self.objects, self.objects_matching_indexes, strict=True 547 ) 548 ) File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:544, in <genexpr>(.0) 542 def reindex_all(self) -> None: 543 self.results = tuple( --> 544 self._reindex_one(obj, matching_indexes) 545 for obj, matching_indexes in zip( 546 self.objects, self.objects_matching_indexes, strict=True 547 ) 548 ) File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:532, in Aligner._reindex_one(self, obj, matching_indexes) 529 new_indexes, new_variables = self._get_indexes_and_vars(obj, matching_indexes) 530 dim_pos_indexers = self._get_dim_pos_indexers(matching_indexes) --> 532 return obj._reindex_callback( 533 self, 534 dim_pos_indexers, 535 new_variables, 536 new_indexes, 537 self.fill_value, 538 self.exclude_dims, 539 self.exclude_vars, 540 ) File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataarray.py:1934, in DataArray._reindex_callback(self, aligner, dim_pos_indexers, variables, indexes, fill_value, exclude_dims, exclude_vars) 1931 fill_value[_THIS_ARRAY] = value 1933 ds = self._to_temp_dataset() -> 1934 reindexed = ds._reindex_callback( 1935 aligner, 1936 dim_pos_indexers, 1937 variables, 1938 indexes, 1939 fill_value, 1940 exclude_dims, 1941 exclude_vars, 1942 ) 1944 da = self._from_temp_dataset(reindexed) 1945 da.encoding = self.encoding File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dataset.py:3277, in Dataset._reindex_callback(self, aligner, dim_pos_indexers, variables, indexes, fill_value, exclude_dims, exclude_vars) 3271 else: 3272 to_reindex = { 3273 k: v 3274 for k, v in self.variables.items() 3275 if k not in variables and k not in exclude_vars 3276 } -> 3277 reindexed_vars = alignment.reindex_variables( 3278 to_reindex, 3279 dim_pos_indexers, 3280 copy=aligner.copy, 3281 fill_value=fill_value, 3282 sparse=aligner.sparse, 3283 ) 3284 new_variables.update(reindexed_vars) 3285 new_coord_names = self._coord_names | set(new_indexes) File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/structure/alignment.py:83, in reindex_variables(variables, dim_pos_indexers, copy, fill_value, sparse) 80 needs_masking = any(d in masked_dims for d in var.dims) 82 if needs_masking: ---> 83 new_var = var._getitem_with_mask(indxr, fill_value=fill_value_) 84 elif all(is_full_slice(k) for k in indxr): 85 # no reindexing necessary 86 # here we need to manually deal with copying data, since 87 # we neither created a new ndarray nor used fancy indexing 88 new_var = var.copy(deep=copy) File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/variable.py:798, in Variable._getitem_with_mask(self, key, fill_value) 789 # TODO(shoyer): expose this method in public API somewhere (isel?) and 790 # use it for reindex. 791 # TODO(shoyer): add a sanity check that all other integers are (...) 794 # that is actually indexed rather than mapping it to the last value 795 # along each axis. 797 if fill_value is dtypes.NA: --> 798 fill_value = dtypes.get_fill_value(self.dtype) 800 dims, indexer, new_order = self._broadcast_indexes(key) 802 if self.size: File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dtypes.py:112, in get_fill_value(dtype) 101 def get_fill_value(dtype): 102 """Return an appropriate fill value for this dtype. 103 104 Parameters (...) 110 fill_value : Missing value corresponding to this dtype. 111 """ --> 112 _, fill_value = maybe_promote(dtype) 113 return fill_value File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/xarray/core/dtypes.py:66, in maybe_promote(dtype) 64 dtype_: np.typing.DTypeLike 65 fill_value: Any ---> 66 if HAS_STRING_DTYPE and np.issubdtype(dtype, np.dtypes.StringDType()): 67 # for now, we always promote string dtypes to object for consistency with existing behavior 68 # TODO: refactor this once we have a better way to handle numpy vlen-string dtypes 69 dtype_ = object 70 fill_value = np.nan File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/numpy/_core/numerictypes.py:530, in issubdtype(arg1, arg2) 473 r""" 474 Returns True if first argument is a typecode lower/equal in type hierarchy. 475 (...) 527 528 """ 529 if not issubclass_(arg1, generic): --> 530 arg1 = dtype(arg1).type 531 if not issubclass_(arg2, generic): 532 arg2 = dtype(arg2).type TypeError: Cannot interpret 'CategoricalDtype(categories=['a', 'b', 'c'], ordered=False, categories_dtype=object)' as a data type

If @richard-berg does not get permission to handle this sort of thing, I can make a PR and hopefully he can review.

I added a small fix here to use arrow types as the "other" extension array type to ensure some level of better coverage, which indeed revealed a small bug. I will update the reindexing tests as well to use them.

The two fixes:

https://github.com/pydata/xarray/pull/10304/files#diff-c803294f5216cbbdffa30f0b0c9f16a7e39855d4dd309c88d654bc317a78adc0R54-R64 + https://github.com/pydata/xarray/pull/10304/files#diff-5803161603b7d9c554e24313b463cc6f826d644128cb9e58755de4e2a0ac7467R179-R202 We didn't have a reshape function registered so it was falling back to numpy which was problematic for repr once it was added here because repr didn't know about extension arrays. That has been resolved.

The arrow dtype indexing implementation lacked handling of tuples: https://github.com/pydata/xarray/pull/10304/files#diff-c803294f5216cbbdffa30f0b0c9f16a7e39855d4dd309c88d654bc317a78adc0R124-R127

This issue is a lot deeper than I realized: pandas-dev/pandas#61433, actually for the indexing issue with arrow.

In any case, it looks like this doesn't have an impact here, as it turns out. It still seems like we're getting away with murder a bit because it is possible to yield 0-dimensional arrow arrays from some data types and not others. When it is possible, that gels with what xarray expects, but when it's not possible, we have problems. We probably need to look into a solution for this. Maybe making __getitem__ polymorphic, but fixing this issue + adding the test cases opened up a few different cans of worms here, so I will be able to track that issue and then can work on a fix here separately.

The broken test is similarly related to unexpected behavior around arrow dtypes again...will look into that as well

(fix): no fill_value on reindex

e19fe33

github-actions bot added topic-documentation topic-arrays related to flexible array support topic-hypothesis Strategies or tests using the hypothesis library labels May 9, 2025

ilan-gold marked this pull request as draft May 9, 2025 12:03

ilan-gold added 5 commits May 9, 2025 14:29

(chore): add fillna tests

05f37f6

(fix): arbitrary captialization

137c4ab

(chore): add interval array explicitly checked

3b499cf

(fix): ds values errors + dropna tests

4920782

(fix): mypy

6c97217

dcherian reviewed May 9, 2025

View reviewed changes

ilan-gold added 10 commits May 12, 2025 14:30

(fix): add in reshape implemented + remove integer tests

c561e4e

(fix): add different test cases for fillna + dropna

29131c2

(fix): dataset tests + marks for pyarrow

cf11804

(fix): remove error for 1d arrays

a7ab018

(fix): test deps for min versions

4a2170a

(fix): function name

cdedf04

(fix): ignore comment mypy

651152d

(fix): explicit skipped line

f168040

(fix): repr test

710404d

(fix): add copy + deepcopy

62aa485

This was referenced May 13, 2025

(fix): numeric arrow dtype deep copies #10315

Merged

(fix): pandas extension array repr for int64[pyarrow] #10317

Merged

ilan-gold added 2 commits May 14, 2025 10:45

Merge branch 'main' into ig/fix_no_fill_value

f7758ca

Merge branch 'main' into ig/fix_no_fill_value

4c66e3e

ilan-gold mentioned this pull request May 30, 2025

Regression in DataArrays created from Pandas #10301

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

(fix): no `fill_value` on reindex #10304

(fix): no `fill_value` on reindex #10304

Uh oh!

ilan-gold commented May 9, 2025 •

edited

Loading

Uh oh!

dcherian May 9, 2025

Uh oh!

ilan-gold May 12, 2025

Uh oh!

ilan-gold May 12, 2025

Uh oh!

ilan-gold May 12, 2025 •

edited

Loading

Uh oh!

ilan-gold May 12, 2025

Uh oh!

ilan-gold May 12, 2025

Uh oh!

Uh oh!

Uh oh!

(fix): no fill_value on reindex #10304

Are you sure you want to change the base?

(fix): no fill_value on reindex #10304

Uh oh!

Conversation

ilan-gold commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dcherian May 9, 2025

Choose a reason for hiding this comment

Uh oh!

ilan-gold May 12, 2025

Choose a reason for hiding this comment

Uh oh!

ilan-gold May 12, 2025

Choose a reason for hiding this comment

Uh oh!

ilan-gold May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilan-gold May 12, 2025

Choose a reason for hiding this comment

Uh oh!

ilan-gold May 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

(fix): no `fill_value` on reindex #10304

(fix): no `fill_value` on reindex #10304

ilan-gold commented May 9, 2025 •

edited

Loading

ilan-gold May 12, 2025 •

edited

Loading