-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: fix DataFrame(data=[None, 1], dtype='timedelta64[ns]') raising ValueError #60081
Conversation
The problem comes from the branch pandas/pandas/core/arrays/timedeltas.py Line 1114 in 68d9dca
data here is a 2D array([[nan], [ 1.]]) .But cast_from_unit_vectorized only accepts 1D array as input, thus producing the "Buffer has wrong number of dimensions" error (see issue), if I understand correctly the code below:pandas/pandas/_libs/tslibs/conversion.pyx Lines 133 to 147 in 68d9dca
I notice that other branches for other dtypes in |
CI should be fixed now. Can you merge main. |
Thanks. I merged main and added in whatsnew. |
pandas/core/arrays/timedeltas.py
Outdated
@@ -1111,7 +1111,7 @@ def sequence_to_td64ns( | |||
else: | |||
mask = np.isnan(data) | |||
|
|||
data = cast_from_unit_vectorized(data, unit or "ns") | |||
data = cast_from_unit_vectorized(data.ravel(), unit or "ns").reshape(data.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me we shouldn't be doing this in array code, that should assume the data is already 1d. Can you move this to maybe_cast_to_datetime
and put it behind a check (that it's 2d with shape (N, 1)
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Moved.
pandas/core/dtypes/cast.py
Outdated
if getattr(value, "ndim", 1) == 2 and value.shape[1] == 1: | ||
res = TimedeltaArray._from_sequence(value.ravel(), dtype=dtype) | ||
return res.reshape(value.shape) | ||
res = TimedeltaArray._from_sequence(value, dtype=dtype) | ||
return res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we raise error here for other situations, e.g. when the shape is not (N, 1) or (1, )?
Or is a less strict check appropriate, like if getattr(value, "ndim", 1) > 1:
? @rhshadrach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should ever be reaching here with ndim 3 or higher. E.g.
arr = np.zeros((3, 3, 3))
pd.DataFrame(arr)
# ValueError: Must pass 2-d input. shape=(3, 3, 3)
pandas/core/dtypes/cast.py
Outdated
@@ -1225,6 +1225,9 @@ def maybe_cast_to_datetime( | |||
_ensure_nanosecond_dtype(dtype) | |||
|
|||
if lib.is_np_dtype(dtype, "m"): | |||
if isinstance(value, np.ndarray) and value.ndim == 2 and value.shape[1] == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would e.g. [None, 1]
be converted to a 2D array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the Traceback, the input list is converted to a 2D array in ndarray_to_mgr
, where _prep_ndarraylike
returns _ensure_2d(values)
.
pandas/pandas/core/internals/construction.py
Line 267 in 9e10119
values = _prep_ndarraylike(values, copy=copy) |
It seems for other types of input,
_ensure_2d
is also called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I see. Generally I think there would still be problems if a user passes a nested list (e.g. a "2x2" nested list) or a user passes dtype="datetime64[unit]"
I think generally maybe_cast_to_datetime
should assume the incoming value
is 1D since the _from_sequence
calls assume the values are 1D also, so the DataFrame code should apply this column column-wise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For nested list, nested_data_to_arrays
in DataFrame.__init__
processes the data column-wise, so there is no problem.
There is no error for datetime64, because DatetimeArray._from_sequence
actually happens to work with 2D array:
from pandas.core.arrays import DatetimeArray, TimedeltaArray
arr = DatetimeArray._from_sequence(np.array([[np.nan], [1]]), dtype="datetime64[ns]")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to move the ravel and reshape in _try_cast
below, sort of like the Unicode string dtype elif branch, so as to ensure the input of maybe_cast_to_datetime
is 1D.
pandas/pandas/core/construction.py
Lines 796 to 810 in 4651ddb
elif dtype.kind == "U": | |
# TODO: test cases with arr.dtype.kind in "mM" | |
if is_ndarray: | |
arr = cast(np.ndarray, arr) | |
shape = arr.shape | |
if arr.ndim > 1: | |
arr = arr.ravel() | |
else: | |
shape = (len(arr),) | |
return lib.ensure_string_array(arr, convert_na_value=False, copy=copy).reshape( | |
shape | |
) | |
elif dtype.kind in "mM": | |
return maybe_cast_to_datetime(arr, dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jbrockmendel if you have thoughts on this approach
pandas/core/construction.py
Outdated
@@ -807,6 +807,8 @@ def _try_cast( | |||
) | |||
|
|||
elif dtype.kind in "mM": | |||
if is_ndarray and arr.ndim == 2 and arr.shape[1] == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment about why you are special-casing the arr.shape[1] == 1 case?
pandas/core/construction.py
Outdated
@@ -807,6 +807,8 @@ def _try_cast( | |||
) | |||
|
|||
elif dtype.kind in "mM": | |||
if is_ndarray and arr.ndim == 2 and arr.shape[1] == 1: | |||
return maybe_cast_to_datetime(arr.ravel(), dtype).reshape(arr.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of arr.ravel()
(which can make a copy), can you do arr[:, 0]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a comment and applied this change.
My preference ideally would be to make cast_from_unit_vectorized correctly handle 2D inputs. Handling it in _try_cast seems like a fine second-best. |
7900612
to
c30dd74
Compare
c30dd74
to
bf680ce
Compare
@@ -807,6 +807,12 @@ def _try_cast( | |||
) | |||
|
|||
elif dtype.kind in "mM": | |||
if is_ndarray: | |||
arr = cast(np.ndarray, arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cast is to make mypy happy.
@@ -1205,7 +1205,7 @@ def maybe_infer_to_datetimelike( | |||
|
|||
def maybe_cast_to_datetime( | |||
value: np.ndarray | list, dtype: np.dtype | |||
) -> ExtensionArray | np.ndarray: | |||
) -> DatetimeArray | TimedeltaArray | np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid mypy error: Item "ExtensionArray" of "ExtensionArray | ndarray[Any, Any]" has no attribute "reshape".
Merge when ready @rhshadrach |
Friendly ping @rhshadrach |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @yuanx749! |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.