Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix DataFrame(data=[None, 1], dtype='timedelta64[ns]') raising ValueError #60081

Merged
merged 15 commits into from
Nov 7, 2024

Conversation

yuanx749
Copy link
Contributor

@yuanx749 yuanx749 commented Oct 22, 2024

@yuanx749
Copy link
Contributor Author

The problem comes from the branch elif is_float_dtype(data.dtype): in function sequence_to_td64ns.

data = cast_from_unit_vectorized(data, unit or "ns")

data here is a 2D array([[nan], [ 1.]]).
But cast_from_unit_vectorized only accepts 1D array as input, thus producing the "Buffer has wrong number of dimensions" error (see issue), if I understand correctly the code below:
cdef:
ndarray[int64_t] base, out
ndarray[float64_t] frac
tuple shape = (<object>values).shape
out = np.empty(shape, dtype="i8")
base = np.empty(shape, dtype="i8")
frac = np.empty(shape, dtype="f8")
for i in range(len(values)):
if is_nan(values[i]):
base[i] = NPY_NAT
else:
base[i] = <int64_t>values[i]
frac[i] = values[i] - base[i]

I notice that other branches for other dtypes in sequence_to_td64ns can work with 2D array data, so come up with this quick fix.

@yuanx749 yuanx749 marked this pull request as ready for review October 22, 2024 04:18
@rhshadrach
Copy link
Member

CI should be fixed now. Can you merge main.

@rhshadrach rhshadrach added Timedelta Timedelta data type Constructors Series/DataFrame/Index/pd.array Constructors labels Oct 26, 2024
@rhshadrach rhshadrach added this to the 3.0 milestone Oct 26, 2024
@yuanx749
Copy link
Contributor Author

CI should be fixed now. Can you merge main.

Thanks. I merged main and added in whatsnew.

@@ -1111,7 +1111,7 @@ def sequence_to_td64ns(
else:
mask = np.isnan(data)

data = cast_from_unit_vectorized(data, unit or "ns")
data = cast_from_unit_vectorized(data.ravel(), unit or "ns").reshape(data.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me we shouldn't be doing this in array code, that should assume the data is already 1d. Can you move this to maybe_cast_to_datetime and put it behind a check (that it's 2d with shape (N, 1)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Moved.

doc/source/whatsnew/v3.0.0.rst Outdated Show resolved Hide resolved
Comment on lines 1228 to 1229
if getattr(value, "ndim", 1) == 2 and value.shape[1] == 1:
res = TimedeltaArray._from_sequence(value.ravel(), dtype=dtype)
return res.reshape(value.shape)
res = TimedeltaArray._from_sequence(value, dtype=dtype)
return res
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we raise error here for other situations, e.g. when the shape is not (N, 1) or (1, )?

Or is a less strict check appropriate, like if getattr(value, "ndim", 1) > 1:? @rhshadrach

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should ever be reaching here with ndim 3 or higher. E.g.

arr = np.zeros((3, 3, 3))
pd.DataFrame(arr)
# ValueError: Must pass 2-d input. shape=(3, 3, 3)

@@ -1225,6 +1225,9 @@ def maybe_cast_to_datetime(
_ensure_nanosecond_dtype(dtype)

if lib.is_np_dtype(dtype, "m"):
if isinstance(value, np.ndarray) and value.ndim == 2 and value.shape[1] == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would e.g. [None, 1] be converted to a 2D array?

Copy link
Contributor Author

@yuanx749 yuanx749 Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the Traceback, the input list is converted to a 2D array in ndarray_to_mgr, where _prep_ndarraylike returns _ensure_2d(values).

values = _prep_ndarraylike(values, copy=copy)

It seems for other types of input, _ensure_2d is also called.

Copy link
Member

@mroeschke mroeschke Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I see. Generally I think there would still be problems if a user passes a nested list (e.g. a "2x2" nested list) or a user passes dtype="datetime64[unit]"

I think generally maybe_cast_to_datetime should assume the incoming value is 1D since the _from_sequence calls assume the values are 1D also, so the DataFrame code should apply this column column-wise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nested list, nested_data_to_arrays in DataFrame.__init__ processes the data column-wise, so there is no problem.

There is no error for datetime64, because DatetimeArray._from_sequence actually happens to work with 2D array:

from pandas.core.arrays import DatetimeArray, TimedeltaArray
arr = DatetimeArray._from_sequence(np.array([[np.nan], [1]]), dtype="datetime64[ns]")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to move the ravel and reshape in _try_cast below, sort of like the Unicode string dtype elif branch, so as to ensure the input of maybe_cast_to_datetime is 1D.

elif dtype.kind == "U":
# TODO: test cases with arr.dtype.kind in "mM"
if is_ndarray:
arr = cast(np.ndarray, arr)
shape = arr.shape
if arr.ndim > 1:
arr = arr.ravel()
else:
shape = (len(arr),)
return lib.ensure_string_array(arr, convert_na_value=False, copy=copy).reshape(
shape
)
elif dtype.kind in "mM":
return maybe_cast_to_datetime(arr, dtype)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jbrockmendel if you have thoughts on this approach

@@ -807,6 +807,8 @@ def _try_cast(
)

elif dtype.kind in "mM":
if is_ndarray and arr.ndim == 2 and arr.shape[1] == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about why you are special-casing the arr.shape[1] == 1 case?

@@ -807,6 +807,8 @@ def _try_cast(
)

elif dtype.kind in "mM":
if is_ndarray and arr.ndim == 2 and arr.shape[1] == 1:
return maybe_cast_to_datetime(arr.ravel(), dtype).reshape(arr.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of arr.ravel() (which can make a copy), can you do arr[:, 0]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a comment and applied this change.

@jbrockmendel
Copy link
Member

But cast_from_unit_vectorized only accepts 1D array as input

My preference ideally would be to make cast_from_unit_vectorized correctly handle 2D inputs. Handling it in _try_cast seems like a fine second-best.

@yuanx749 yuanx749 force-pushed the sequence_to_td64ns branch 2 times, most recently from 7900612 to c30dd74 Compare November 2, 2024 13:12
@@ -807,6 +807,12 @@ def _try_cast(
)

elif dtype.kind in "mM":
if is_ndarray:
arr = cast(np.ndarray, arr)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cast is to make mypy happy.

@@ -1205,7 +1205,7 @@ def maybe_infer_to_datetimelike(

def maybe_cast_to_datetime(
value: np.ndarray | list, dtype: np.dtype
) -> ExtensionArray | np.ndarray:
) -> DatetimeArray | TimedeltaArray | np.ndarray:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid mypy error: Item "ExtensionArray" of "ExtensionArray | ndarray[Any, Any]" has no attribute "reshape".

@mroeschke
Copy link
Member

Merge when ready @rhshadrach

@yuanx749
Copy link
Contributor Author

yuanx749 commented Nov 7, 2024

Friendly ping @rhshadrach

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit 04432f5 into pandas-dev:main Nov 7, 2024
50 of 51 checks passed
@rhshadrach
Copy link
Member

Thanks @yuanx749!

@yuanx749 yuanx749 deleted the sequence_to_td64ns branch November 8, 2024 02:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Timedelta Timedelta data type
Projects
None yet
4 participants