BUG: fix DataFrame(data=[None, 1], dtype='timedelta64[ns]') raising ValueError #60081

yuanx749 · 2024-10-22T03:56:45Z

closes #60064 (BUG: DataFrame(data=[None, 1], dtype='timedelta64[ns]') raises ValueError: Buffer has wrong number of dimensions)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

yuanx749 · 2024-10-22T04:17:19Z

The problem comes from the elif is_float_dtype(data.dtype): branch of the function sequence_to_td64ns.

Line 1114 in 68d9dca

data = cast_from_unit_vectorized(data, unit or "ns")

Here data is the 2D array([[nan], [ 1.]]).
However, cast_from_unit_vectorized only accepts a 1D array as input, which produces the "Buffer has wrong number of dimensions" error (see the issue), if I understand the code below correctly:

pandas/pandas/_libs/tslibs/conversion.pyx

Lines 133 to 147 in 68d9dca

    
           cdef: 
        
               ndarray[int64_t] base, out 
        
               ndarray[float64_t] frac 
        
               tuple shape = (<object>values).shape 
        
           out = np.empty(shape, dtype="i8") 
        
           base = np.empty(shape, dtype="i8") 
        
           frac = np.empty(shape, dtype="f8") 
        
           for i in range(len(values)): 
        
               if is_nan(values[i]): 
        
                   base[i] = NPY_NAT 
        
               else: 
        
                   base[i] = <int64_t>values[i] 
        
                   frac[i] = values[i] - base[i]

I noticed that the branches for other dtypes in sequence_to_td64ns already work with 2D array data, so I came up with this quick fix.
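The flatten-and-restore idea behind the quick fix can be sketched with NumPy alone. This is a minimal sketch, not the pandas implementation: cast_1d_only and cast_any_shape are hypothetical names, and the stand-in truncates fractions and ignores unit scaling, unlike cast_from_unit_vectorized.

```python
import numpy as np

NAT = np.iinfo(np.int64).min  # int64 sentinel that NumPy interprets as NaT


def cast_1d_only(values: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a routine typed to accept 1D input only."""
    if values.ndim != 1:
        raise ValueError("Buffer has wrong number of dimensions (expected 1)")
    out = np.empty(values.shape, dtype="i8")
    for i in range(len(values)):
        # NaN maps to the NaT sentinel; other values are truncated to int64.
        out[i] = NAT if np.isnan(values[i]) else np.int64(values[i])
    return out


def cast_any_shape(values: np.ndarray) -> np.ndarray:
    """Flatten, convert, then restore the original shape."""
    return cast_1d_only(values.ravel()).reshape(values.shape)


data = np.array([[np.nan], [1.0]])  # the 2D array from the report
result = cast_any_shape(data)       # keeps shape (2, 1)
```

Calling cast_1d_only(data) directly raises, while cast_any_shape(data) returns an int64 array with the same (2, 1) shape.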

rhshadrach · 2024-10-26T18:44:00Z

CI should be fixed now. Can you merge main?

yuanx749 · 2024-10-27T04:33:31Z

> CI should be fixed now. Can you merge main.

Thanks. I merged main and added a whatsnew entry.

rhshadrach · 2024-10-27T13:11:05Z

pandas/core/arrays/timedeltas.py

@@ -1111,7 +1111,7 @@ def sequence_to_td64ns(
        else:
            mask = np.isnan(data)

-        data = cast_from_unit_vectorized(data, unit or "ns")
+        data = cast_from_unit_vectorized(data.ravel(), unit or "ns").reshape(data.shape)


It seems to me we shouldn't be doing this in array code, which should assume the data is already 1D. Can you move this to maybe_cast_to_datetime and put it behind a check that the input is 2D with shape (N, 1)?

I see. Moved.

doc/source/whatsnew/v3.0.0.rst

yuanx749 · 2024-10-27T15:24:19Z

pandas/core/dtypes/cast.py

+        if getattr(value, "ndim", 1) == 2 and value.shape[1] == 1:
+            res = TimedeltaArray._from_sequence(value.ravel(), dtype=dtype)
+            return res.reshape(value.shape)
        res = TimedeltaArray._from_sequence(value, dtype=dtype)
        return res


Should we raise an error here in other situations, e.g. when the shape is not (N, 1) or (1,)?

Or would a less strict check be more appropriate, such as if getattr(value, "ndim", 1) > 1:? @rhshadrach

I don't think we should ever be reaching here with ndim 3 or higher. E.g.

    arr = np.zeros((3, 3, 3))
    pd.DataFrame(arr)  # ValueError: Must pass 2-d input. shape=(3, 3, 3)
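To make the shape check under discussion concrete, here is a small sketch of how the getattr-based condition behaves for different inputs. The helper name is_single_column_2d is hypothetical and exists only for illustration.

```python
import numpy as np


def is_single_column_2d(value) -> bool:
    """Hypothetical helper: True only for array-likes of shape (N, 1)."""
    # Plain lists have no `ndim`, so getattr falls back to 1 for them and
    # the `and` short-circuits before `.shape` is touched.
    return getattr(value, "ndim", 1) == 2 and value.shape[1] == 1


checks = {
    "list": is_single_column_2d([None, 1]),
    "1D array": is_single_column_2d(np.array([np.nan, 1.0])),
    "(2, 1) array": is_single_column_2d(np.array([[np.nan], [1.0]])),
    "(2, 2) array": is_single_column_2d(np.zeros((2, 2))),
}
```

Only the (2, 1) array satisfies the condition; lists, 1D arrays, and wider 2D arrays fall through to the existing code path.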

mroeschke · 2024-10-29T21:00:04Z

pandas/core/dtypes/cast.py

@@ -1225,6 +1225,9 @@ def maybe_cast_to_datetime(
    _ensure_nanosecond_dtype(dtype)

    if lib.is_np_dtype(dtype, "m"):
+        if isinstance(value, np.ndarray) and value.ndim == 2 and value.shape[1] == 1:


Why would e.g. [None, 1] be converted to a 2D array?

Following the traceback, the input list is converted to a 2D array in ndarray_to_mgr, where _prep_ndarraylike returns _ensure_2d(values).

pandas/pandas/core/internals/construction.py

Line 267 in 9e10119

values = _prep_ndarraylike(values, copy=copy)

It seems that _ensure_2d is also called for other types of input.
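As a rough illustration of why a flat list ends up as a single-column 2D array, the reshaping step can be sketched as follows. This is a simplified stand-in written for this thread, not the pandas source; ensure_2d is a hypothetical name, and the error message is copied from the example above.

```python
import numpy as np


def ensure_2d(values: np.ndarray) -> np.ndarray:
    """Simplified stand-in for the reshaping step (not the pandas source)."""
    if values.ndim == 1:
        # 1D input becomes a single column of shape (N, 1).
        values = values.reshape((values.shape[0], 1))
    elif values.ndim != 2:
        raise ValueError(f"Must pass 2-d input. shape={values.shape}")
    return values


column = ensure_2d(np.array([None, 1], dtype=object))  # shape (2, 1)
```

Any code that runs after this step and assumes 1D input therefore sees shape (N, 1) instead of (N,).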

OK, I see. Generally, I think there would still be problems if a user passes a nested list (e.g. a "2x2" nested list) or passes dtype="datetime64[unit]".

I think maybe_cast_to_datetime should generally assume the incoming value is 1D, since the _from_sequence calls also assume 1D values, so the DataFrame code should apply this column-wise.

For a nested list, nested_data_to_arrays in DataFrame.__init__ processes the data column-wise, so there is no problem.

There is no error for datetime64, because DatetimeArray._from_sequence happens to work with a 2D array:

    import numpy as np
    from pandas.core.arrays import DatetimeArray, TimedeltaArray

    arr = DatetimeArray._from_sequence(np.array([[np.nan], [1]]), dtype="datetime64[ns]")

I think it is better to move the ravel and reshape into _try_cast (shown below), similar to the elif branch for the Unicode string dtype, so that the input to maybe_cast_to_datetime is guaranteed to be 1D.

pandas/pandas/core/construction.py

Lines 796 to 810 in 4651ddb

    elif dtype.kind == "U":
        # TODO: test cases with arr.dtype.kind in "mM"
        if is_ndarray:
            arr = cast(np.ndarray, arr)
            shape = arr.shape
            if arr.ndim > 1:
                arr = arr.ravel()
        else:
            shape = (len(arr),)
        return lib.ensure_string_array(arr, convert_na_value=False, copy=copy).reshape(
            shape
        )

    elif dtype.kind in "mM":
        return maybe_cast_to_datetime(arr, dtype)

cc @jbrockmendel if you have thoughts on this approach

jbrockmendel · 2024-11-01T20:00:06Z

pandas/core/construction.py

@@ -807,6 +807,8 @@ def _try_cast(
        )

    elif dtype.kind in "mM":
+        if is_ndarray and arr.ndim == 2 and arr.shape[1] == 1:


Can you add a comment about why you are special-casing the arr.shape[1] == 1 case?

jbrockmendel · 2024-11-01T20:00:42Z

pandas/core/construction.py

@@ -807,6 +807,8 @@ def _try_cast(
        )

    elif dtype.kind in "mM":
+        if is_ndarray and arr.ndim == 2 and arr.shape[1] == 1:
+            return maybe_cast_to_datetime(arr.ravel(), dtype).reshape(arr.shape)


Instead of arr.ravel() (which can make a copy), can you use arr[:, 0]?

I have added a comment and applied this change.
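The difference between the two spellings can be seen on a strided single-column array. This is a small NumPy sketch with illustrative variable names; it assumes the column is a slice of a wider array, which is one case where ravel() has to copy.

```python
import numpy as np

base = np.arange(6.0).reshape(3, 2)
col = base[:, :1]          # shape (3, 1), not C-contiguous

flat_ravel = col.ravel()   # must copy because the memory is strided
flat_index = col[:, 0]     # basic indexing, always a view

ravel_shares = np.shares_memory(flat_ravel, base)
index_shares = np.shares_memory(flat_index, base)

# Either 1D result can be reshaped back to the original (N, 1) shape.
restored = flat_index.reshape(col.shape)
```

Here ravel_shares is False and index_shares is True. For a C-contiguous (N, 1) array, ravel() would also return a view, so the copy only occurs for some inputs.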

jbrockmendel · 2024-11-01T20:03:10Z

> But cast_from_unit_vectorized only accepts 1D array as input

My preference ideally would be to make cast_from_unit_vectorized correctly handle 2D inputs. Handling it in _try_cast seems like a fine second-best.
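For comparison, a shape-agnostic conversion might look like the sketch below. It is only an illustration of the preferred direction, not a proposal for conversion.pyx: floats_to_i8 is a hypothetical name, and the sketch truncates fractional parts and ignores unit scaling.

```python
import numpy as np

NAT = np.iinfo(np.int64).min  # int64 sentinel that NumPy interprets as NaT


def floats_to_i8(values: np.ndarray) -> np.ndarray:
    """Hypothetical shape-agnostic conversion of float data to int64 ticks."""
    mask = np.isnan(values)
    # Replace NaN before casting so the float -> int cast is well defined.
    out = np.where(mask, 0.0, values).astype("i8")
    out[mask] = NAT
    return out


ticks = floats_to_i8(np.array([[np.nan], [1.0]]))
as_td = ticks.view("m8[ns]")  # reinterpret the ticks as timedelta64[ns]
```

Because every operation is elementwise, the same function accepts 1D and 2D input without any reshaping by the caller.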

yuanx749 · 2024-11-02T14:31:17Z

pandas/core/construction.py

@@ -807,6 +807,12 @@ def _try_cast(
        )

    elif dtype.kind in "mM":
+        if is_ndarray:
+            arr = cast(np.ndarray, arr)


This cast is to make mypy happy.
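A minimal sketch of why such a cast can be needed: when the isinstance result is stored in a flag first, a type checker may not narrow the variable inside the branch. The function ndim_of is hypothetical and only mirrors the is_ndarray pattern.

```python
from typing import Union, cast

import numpy as np


def ndim_of(arr: Union[np.ndarray, list]) -> int:
    """Hypothetical helper showing narrowing through a boolean flag."""
    is_ndarray = isinstance(arr, np.ndarray)
    if is_ndarray:
        # cast() does nothing at runtime; it only tells the checker the type.
        arr = cast(np.ndarray, arr)
        return arr.ndim
    return 1
```

At runtime cast returns its argument unchanged, so the behavior of the branch is the same with or without it.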

yuanx749 · 2024-11-02T14:31:48Z

pandas/core/dtypes/cast.py

@@ -1205,7 +1205,7 @@ def maybe_infer_to_datetimelike(

 def maybe_cast_to_datetime(
    value: np.ndarray | list, dtype: np.dtype
-) -> ExtensionArray | np.ndarray:
+) -> DatetimeArray | TimedeltaArray | np.ndarray:


This avoids the mypy error: Item "ExtensionArray" of "ExtensionArray | ndarray[Any, Any]" has no attribute "reshape".

mroeschke · 2024-11-02T17:04:41Z

Merge when ready @rhshadrach

yuanx749 · 2024-11-07T13:33:37Z

Friendly ping @rhshadrach

rhshadrach

lgtm

rhshadrach · 2024-11-07T21:30:40Z

Thanks @yuanx749!

BUG: fix DataFrame(data=[None, 1], dtype='timedelta64[ns]') raising ValueError

06f317e

yuanx749 marked this pull request as ready for review October 22, 2024 04:18

ci

b5bec39

rhshadrach added the Timedelta (Timedelta data type) and Constructors (Series/DataFrame/Index/pd.array Constructors) labels Oct 26, 2024

rhshadrach added this to the 3.0 milestone Oct 26, 2024

yuanx749 added 2 commits October 27, 2024 10:54

Merge remote-tracking branch 'upstream/main' into sequence_to_td64ns

4fe0ffa

Add whatsnew

98dd50f

rhshadrach requested changes Oct 27, 2024

yuanx749 added 2 commits October 27, 2024 22:58

Move to maybe_downcast_to_dtype

9c34762

Merge remote-tracking branch 'upstream/main' into sequence_to_td64ns

65a48e1

yuanx749 commented Oct 27, 2024

Fix mypy failure

0627217

mroeschke reviewed Oct 29, 2024

yuanx749 added 2 commits October 31, 2024 21:26

Move upper level

d93d5b2

Merge remote-tracking branch 'upstream/main' into sequence_to_td64ns

84c01da

jbrockmendel reviewed Nov 1, 2024

yuanx749 added 2 commits November 2, 2024 13:03

Merge remote-tracking branch 'upstream/main' into sequence_to_td64ns

bc106df

Update

11a9a7e

yuanx749 force-pushed the sequence_to_td64ns branch 2 times, most recently from 7900612 to c30dd74 Compare November 2, 2024 13:12

mypy

bf680ce

yuanx749 force-pushed the sequence_to_td64ns branch from c30dd74 to bf680ce Compare November 2, 2024 14:08

yuanx749 commented Nov 2, 2024

mroeschke approved these changes Nov 2, 2024

yuanx749 added 3 commits November 3, 2024 10:37

Merge remote-tracking branch 'upstream/main' into sequence_to_td64ns

fc2405f

Merge remote-tracking branch 'upstream/main' into sequence_to_td64ns

5c36932

Merge remote-tracking branch 'upstream/main' into sequence_to_td64ns

e19bf19

rhshadrach approved these changes Nov 7, 2024

rhshadrach merged commit 04432f5 into pandas-dev:main Nov 7, 2024
50 of 51 checks passed

yuanx749 deleted the sequence_to_td64ns branch November 8, 2024 02:04
