Skip to content

(fix): disallow NumpyExtensionArray #10334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented May 19, 2025

This PR disallows NumpyExtensionArray at both the deepest level (PandasExtensionArray) and also at all of the top-level entrance points I could think of that should be auto-fixed. I'll need to add some tests to test that assumption but any help coming up with other potential entrace points would be helpful. For now this PR auto-converts anywhere as_compatible_data is used (which seems to be setting data, initialization, __setitem__ and copy on Variable) as well as auto-converting on dataframe conversion. So any help erroring/auto-converting elsewhere would be great.

I also whitelisted some other data types that I noticed we were touching as well, but I think those may require closer examination so will revert/investigate as needed.

@github-actions github-actions bot added topic-indexing topic-documentation topic-hypothesis Strategies or tests using the hypothesis library labels May 19, 2025
"extension_array",
[
pd.Categorical(["a", "b", "c"]),
pd.array([1, 2, 3], dtype="int64"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pd.array([1, 2, 3], dtype="int64"),
pd.array([1, 2, 3], dtype="int64"),
pd.array([1, 2, 3], dtype="int64[pyarrow]"),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this more exhaustive please? with datetime, timedelta etc.

Copy link
Contributor Author

@ilan-gold ilan-gold May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without opening a can of worms, datetime and timedelta are the two things that are currently (on main) whitelisted (without that, they don't roundtrip).

import pandas as pd

df = pd.DataFrame({"arr": pd.date_range("20130101", periods=4, tz="US/Eastern")})
df.to_xarray()["arr"].to_pandas()

loses the timezone, for example, without the whitelist on from_dataframe.

So I don't think we want to add those here, which is also why I whitelisted them elsewhere. I was hoping that before adding tests, we would settle on what exactly this PR should be doing (of course NumpyExtensionArray but should it cover anything else).

I am happy to roll back that whitelist, i.e., leave it where it was on from_dataframe and then allow these types through anyway via other means into a Dataset.

Copy link
Contributor

@dcherian dcherian May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd simply like a test or two that exhaustively records the current behaviour (whether cast to numpy or not), so we can be sure of what is going on. Two tests (one for preserved dtype, and one for numpy casting) would work fine.

There's also two cases: ExtensionArray in a data variable, and ExtensionArray as an indexed coordinate variable.

Copy link
Contributor Author

@ilan-gold ilan-gold May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two tests (one for preserved dtype, and one for numpy casting) would work fine.

Not 100% certain what this meant - could you clarify?

There's also two cases: ExtensionArray in a data variable, and ExtensionArray as an indexed coordinate variable.

I think this covered here. When I check xr.Dataset.from_dataframe(df)["arr"].variable it's an IndexVariable as expected when it's an index on the original object

Copy link
Contributor Author

@ilan-gold ilan-gold May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check out ilan-gold#2 for a related cleanup. As is here, the data types already round trip so that PR only cleans up the internals a bit and the tests/behavior are also a bit clearer now (better preserved date types). Other than that, no changes in terms of types or behavior, it seems

@ilan-gold ilan-gold marked this pull request as ready for review May 30, 2025 12:40
@ilan-gold ilan-gold requested a review from dcherian May 30, 2025 14:13
Comment on lines +150 to +154
np.array([1, 2, 3], dtype="int64"),
]
+ ([pd.array([1, 2, 3], dtype="int64[pyarrow]")] if has_pyarrow else []),
ids=["cat", "string", "interval", "timedelta", "datetime", "numpy"]
+ (["pyarrow"] if has_pyarrow else []),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
np.array([1, 2, 3], dtype="int64"),
]
+ ([pd.array([1, 2, 3], dtype="int64[pyarrow]")] if has_pyarrow else []),
ids=["cat", "string", "interval", "timedelta", "datetime", "numpy"]
+ (["pyarrow"] if has_pyarrow else []),
np.array([1, 2, 3], dtype="int64"),
pytest.param(pd.array([1, 2, 3], dtype="int64[pyarrow]"), marks=pytest.mark.skipif(not has_pyarrow)),
]

will need to add the ids to each param individually though

df_arr_to_test = df.index if is_index else df["arr"]
assert (df_arr_to_test == roundtripped).all()
# `NumpyExtensionArray` types are not roundtripped, including `StringArray` which subtypes.
if isinstance(extension_array, pd.arrays.NumpyExtensionArray): # type: ignore[attr-defined]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's cast the arrow ones for now too and relax it explicitly in a later PR please

Suggested change
if isinstance(extension_array, pd.arrays.NumpyExtensionArray): # type: ignore[attr-defined]
if isinstance(extension_array, pd.arrays.NumpyExtensionArray | pd.arrays.ArrowExtensionArray): # type: ignore[attr-defined]

UNSUPPORTED_EXTENSION_ARRAY_TYPES = (
pd.arrays.DatetimeArray,
pd.arrays.TimedeltaArray,
pd.arrays.NumpyExtensionArray, # type: ignore[attr-defined]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pd.arrays.NumpyExtensionArray, # type: ignore[attr-defined]
pd.arrays.NumpyExtensionArray, # type: ignore[attr-defined]
pd.arrays.ArrowExtensionArray,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-documentation topic-hypothesis Strategies or tests using the hypothesis library topic-indexing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression in DataArrays created from Pandas
2 participants