-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
(fix): disallow NumpyExtensionArray
#10334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
(fix): disallow NumpyExtensionArray
#10334
Conversation
properties/test_pandas_roundtrip.py
Outdated
"extension_array", | ||
[ | ||
pd.Categorical(["a", "b", "c"]), | ||
pd.array([1, 2, 3], dtype="int64"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.array([1, 2, 3], dtype="int64"), | |
pd.array([1, 2, 3], dtype="int64"), | |
pd.array([1, 2, 3], dtype="int64[pyarrow]"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this more exhaustive please? with datetime, timedelta etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without opening a can of worms, datetime and timedelta are the two things that are currently (on main
) whitelisted (without that, they don't roundtrip).
import pandas as pd
df = pd.DataFrame({"arr": pd.date_range("20130101", periods=4, tz="US/Eastern")})
df.to_xarray()["arr"].to_pandas()
loses the timezone, for example, without the whitelist on from_dataframe
.
So I don't think we want to add those here, which is also why I whitelisted them elsewhere. I was hoping that before adding tests, we would settle on what exactly this PR should be doing (of course NumpyExtensionArray
but should it cover anything else).
I am happy to roll back that whitelist, i.e., leave it where it was on from_dataframe
and then allow these types through anyway via other means into a Dataset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd simply like a test or two that exhaustively records the current behaviour (whether cast to numpy or not), so we can be sure of what is going on. Two tests (one for preserved dtype, and one for numpy casting) would work fine.
There's also two cases: ExtensionArray in a data variable, and ExtensionArray as an indexed coordinate variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two tests (one for preserved dtype, and one for numpy casting) would work fine.
Not 100% certain what this meant - could you clarify?
There's also two cases: ExtensionArray in a data variable, and ExtensionArray as an indexed coordinate variable.
I think this covered here. When I check xr.Dataset.from_dataframe(df)["arr"].variable
it's an IndexVariable
as expected when it's an index on the original object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also check out ilan-gold#2 for a related cleanup. As is here, the data types already round trip so that PR only cleans up the internals a bit and the tests/behavior are also a bit clearer now (better preserved date types). Other than that, no changes in terms of types or behavior, it seems
np.array([1, 2, 3], dtype="int64"), | ||
] | ||
+ ([pd.array([1, 2, 3], dtype="int64[pyarrow]")] if has_pyarrow else []), | ||
ids=["cat", "string", "interval", "timedelta", "datetime", "numpy"] | ||
+ (["pyarrow"] if has_pyarrow else []), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.array([1, 2, 3], dtype="int64"), | |
] | |
+ ([pd.array([1, 2, 3], dtype="int64[pyarrow]")] if has_pyarrow else []), | |
ids=["cat", "string", "interval", "timedelta", "datetime", "numpy"] | |
+ (["pyarrow"] if has_pyarrow else []), | |
np.array([1, 2, 3], dtype="int64"), | |
pytest.param(pd.array([1, 2, 3], dtype="int64[pyarrow]"), marks=pytest.mark.skipif(not has_pyarrow)), | |
] |
will need to add the ids to each param individually though
df_arr_to_test = df.index if is_index else df["arr"] | ||
assert (df_arr_to_test == roundtripped).all() | ||
# `NumpyExtensionArray` types are not roundtripped, including `StringArray` which subtypes. | ||
if isinstance(extension_array, pd.arrays.NumpyExtensionArray): # type: ignore[attr-defined] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's cast the arrow ones for now too and relax it explicitly in a later PR please
if isinstance(extension_array, pd.arrays.NumpyExtensionArray): # type: ignore[attr-defined] | |
if isinstance(extension_array, pd.arrays.NumpyExtensionArray | pd.arrays.ArrowExtensionArray): # type: ignore[attr-defined] |
UNSUPPORTED_EXTENSION_ARRAY_TYPES = ( | ||
pd.arrays.DatetimeArray, | ||
pd.arrays.TimedeltaArray, | ||
pd.arrays.NumpyExtensionArray, # type: ignore[attr-defined] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.arrays.NumpyExtensionArray, # type: ignore[attr-defined] | |
pd.arrays.NumpyExtensionArray, # type: ignore[attr-defined] | |
pd.arrays.ArrowExtensionArray, |
This PR disallows
NumpyExtensionArray
at both the deepest level (PandasExtensionArray
) and also at all of the top-level entrance points I could think of that should be auto-fixed. I'll need to add some tests to test that assumption but any help coming up with other potential entrace points would be helpful. For now this PR auto-converts anywhereas_compatible_data
is used (which seems to be settingdata
, initialization,__setitem__
andcopy
onVariable
) as well as auto-converting on dataframe conversion. So any help erroring/auto-converting elsewhere would be great.I also whitelisted some other data types that I noticed we were touching as well, but I think those may require closer examination so will revert/investigate as needed.
whats-new.rst
api.rst