(fix): disallow `NumpyExtensionArray` #10334

ilan-gold · 2025-05-19T10:32:07Z

This PR disallows `NumpyExtensionArray` both at the deepest level (`PandasExtensionArray`) and at every top-level entry point I could think of that should be auto-fixed. I still need to add tests for that assumption, and any help identifying other potential entry points would be appreciated. For now, this PR auto-converts wherever `as_compatible_data` is used (which seems to cover setting data, initialization, `__setitem__`, and `copy` on `Variable`) and also on dataframe conversion. Help with erroring or auto-converting elsewhere would be great.

I also whitelisted some other data types that I noticed we were touching, but I think those may require closer examination, so I will revert or investigate as needed.

Closes Regression in DataArrays created from Pandas #10301 (would happily close this if @richard-berg has something more robust/different)
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst


dcherian · 2025-05-27T17:26:28Z

properties/test_pandas_roundtrip.py

+    "extension_array",
+    [
+        pd.Categorical(["a", "b", "c"]),
+        pd.array([1, 2, 3], dtype="int64"),


Suggested change

-        pd.array([1, 2, 3], dtype="int64"),
+        pd.array([1, 2, 3], dtype="int64"),
+        pd.array([1, 2, 3], dtype="int64[pyarrow]"),

Can we make this more exhaustive please? with datetime, timedelta etc.

ilan-gold · 2025-05-28

Without opening a can of worms, datetime and timedelta are the two things that are currently (on main) whitelisted (without that, they don't roundtrip).

    import pandas as pd

    df = pd.DataFrame({"arr": pd.date_range("20130101", periods=4, tz="US/Eastern")})
    df.to_xarray()["arr"].to_pandas()

loses the timezone, for example, without the whitelist on from_dataframe.

So I don't think we want to add those here, which is also why I whitelisted them elsewhere. I was hoping that, before adding tests, we would settle on exactly what this PR should do (`NumpyExtensionArray`, of course, but should it cover anything else?).

I am happy to roll back that whitelist, i.e., leave it where it was on from_dataframe and then allow these types through anyway via other means into a Dataset.
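The timezone loss described above can be reproduced without xarray. The following is a minimal sketch, assuming the loss comes from casting the timezone-aware values to a plain NumPy `datetime64` array; it does not show xarray's actual conversion path.

```python
import numpy as np
import pandas as pd

# A timezone-aware index, matching the example in this comment.
idx = pd.date_range("20130101", periods=4, tz="US/Eastern")

# A plain NumPy datetime64 array keeps the instants (expressed in UTC)
# but has nowhere to store the timezone itself.
as_numpy = idx.to_numpy(dtype="datetime64[ns]")

# Rebuilding a pandas index from the NumPy array yields a tz-naive index.
rebuilt = pd.DatetimeIndex(as_numpy)

print(idx.tz)      # the original timezone
print(rebuilt.tz)  # None
```

The instants themselves survive: localizing `rebuilt` to UTC and converting back to `US/Eastern` recovers the original values, but only if the timezone is known from somewhere else.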

dcherian · 2025-05-28

I'd simply like a test or two that exhaustively records the current behaviour (whether cast to numpy or not), so we can be sure of what is going on. Two tests (one for preserved dtype, and one for numpy casting) would work fine.

There are also two cases: ExtensionArray in a data variable, and ExtensionArray as an indexed coordinate variable.

ilan-gold · 2025-05-30

> Two tests (one for preserved dtype, and one for numpy casting) would work fine.

I'm not 100% certain what this meant. Could you clarify?

> There's also two cases: ExtensionArray in a data variable, and ExtensionArray as an indexed coordinate variable.

I think this is covered here. When I check `xr.Dataset.from_dataframe(df)["arr"].variable`, it is an `IndexVariable`, as expected when it is an index on the original object.

ilan-gold · 2025-05-30

Also check out ilan-gold#2 for a related cleanup. As things stand here, the data types already round-trip, so that PR only cleans up the internals a bit and makes the tests and behavior a bit clearer (better-preserved date types). Other than that, there seem to be no changes in types or behavior.

dcherian · 2025-05-30T14:47:35Z

properties/test_pandas_roundtrip.py

+        np.array([1, 2, 3], dtype="int64"),
+    ]
+    + ([pd.array([1, 2, 3], dtype="int64[pyarrow]")] if has_pyarrow else []),
+    ids=["cat", "string", "interval", "timedelta", "datetime", "numpy"]
+    + (["pyarrow"] if has_pyarrow else []),


Suggested change

-        np.array([1, 2, 3], dtype="int64"),
-    ]
-    + ([pd.array([1, 2, 3], dtype="int64[pyarrow]")] if has_pyarrow else []),
-    ids=["cat", "string", "interval", "timedelta", "datetime", "numpy"]
-    + (["pyarrow"] if has_pyarrow else []),
+        np.array([1, 2, 3], dtype="int64"),
+        pytest.param(pd.array([1, 2, 3], dtype="int64[pyarrow]"), marks=pytest.mark.skipif(not has_pyarrow)),
+    ]

will need to add the ids to each param individually though
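A rough sketch of what per-param ids could look like, using `pytest.param(..., id=...)`. The list name `EXTENSION_ARRAY_PARAMS`, the way `has_pyarrow` is computed, and the placeholder test body are assumptions for illustration, not the PR's code. Two details are worth noting: `skipif` with a boolean condition needs a `reason`, and the pyarrow-backed array is only built when pyarrow is importable, because parameter values are evaluated at collection time even for skipped cases.

```python
import importlib.util

import numpy as np
import pandas as pd
import pytest

# Assumption: the PR's own `has_pyarrow` flag may be defined differently.
has_pyarrow = importlib.util.find_spec("pyarrow") is not None

EXTENSION_ARRAY_PARAMS = [
    pytest.param(pd.Categorical(["a", "b", "c"]), id="cat"),
    pytest.param(np.array([1, 2, 3], dtype="int64"), id="numpy"),
    pytest.param(
        # Only construct the array when pyarrow is available; the skipif
        # mark alone would not prevent an ImportError at collection time.
        pd.array([1, 2, 3], dtype="int64[pyarrow]") if has_pyarrow else None,
        id="pyarrow",
        marks=pytest.mark.skipif(not has_pyarrow, reason="requires pyarrow"),
    ),
]


@pytest.mark.parametrize("extension_array", EXTENSION_ARRAY_PARAMS)
def test_has_three_elements(extension_array):
    # Placeholder body; the real test round-trips through xarray.
    assert len(extension_array) == 3
```

With this layout each id sits next to its value, so adding or removing a case cannot shift the ids out of alignment with the values.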

dcherian · 2025-05-30T14:50:14Z

properties/test_pandas_roundtrip.py

+    df_arr_to_test = df.index if is_index else df["arr"]
+    assert (df_arr_to_test == roundtripped).all()
+    # `NumpyExtensionArray` types are not roundtripped, including `StringArray` which subtypes.
+    if isinstance(extension_array, pd.arrays.NumpyExtensionArray):  # type: ignore[attr-defined]


Let's cast the arrow ones for now too and relax it explicitly in a later PR please

Suggested change

-    if isinstance(extension_array, pd.arrays.NumpyExtensionArray):  # type: ignore[attr-defined]
+    if isinstance(extension_array, pd.arrays.NumpyExtensionArray | pd.arrays.ArrowExtensionArray):  # type: ignore[attr-defined]

dcherian · 2025-05-30T14:50:44Z

xarray/core/variable.py

+UNSUPPORTED_EXTENSION_ARRAY_TYPES = (
+    pd.arrays.DatetimeArray,
+    pd.arrays.TimedeltaArray,
+    pd.arrays.NumpyExtensionArray,  # type: ignore[attr-defined]


Suggested change

-    pd.arrays.NumpyExtensionArray,  # type: ignore[attr-defined]
+    pd.arrays.NumpyExtensionArray,  # type: ignore[attr-defined]
+    pd.arrays.ArrowExtensionArray,

ilan-gold added 4 commits May 14, 2025 16:37

(fix): disallow NumpyExtensionArray

9312d2b

(fix): clarify permitted extension array behavior in `as_compatible_d…

d09c0f5

…ata`

(refactor): centralize whitelist

88e4841

(fix): allow through other types

174274d

github-actions bot added the topic-indexing, topic-documentation, and topic-hypothesis labels May 19, 2025

ilan-gold added 2 commits May 22, 2025 16:12

Merge branch 'main' into ig/disallow_numpy_extension_array

d9388f0

Merge branch 'main' into ig/disallow_numpy_extension_array

b87b380

dcherian reviewed May 27, 2025


ilan-gold added 5 commits May 30, 2025 11:48

(chore): add thorough test cases

6329964

Merge branch 'main' into ig/disallow_numpy_extension_array

a29c526

(fix): require pyarrow

c6ac491

(fix): mypy

50843ca

(fix): pyarrow check

b959345

ilan-gold marked this pull request as ready for review May 30, 2025 12:40

(fix): remove extra ignore

2d33aaa

ilan-gold requested a review from dcherian May 30, 2025 14:13

dcherian reviewed May 30, 2025
