Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix to_pandas writable bug for datetime and timedelta types #17913

Merged
merged 2 commits into from
Feb 11, 2025

Conversation

galipremsagar
Copy link
Contributor

Description

Fixes: #17743

This PR fixes writable flag for numpy arrays. This bug is only specific to these two types.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar added bug Something isn't working non-breaking Non-breaking change labels Feb 4, 2025
@galipremsagar galipremsagar self-assigned this Feb 4, 2025
@galipremsagar galipremsagar requested a review from a team as a code owner February 4, 2025 17:44
@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 4, 2025
@mroeschke
Copy link
Contributor

Curious how #17743 as a bug affects subsequent operations (i.e. is this a bug we really need to worry about?)

pandas.Index is not supposed to be publicly mutable, so the underlying numpy array not being writable doesn't seem that big of a deal, especially since the fix requires making a CPU copy of the data

@galipremsagar
Copy link
Contributor Author

galipremsagar commented Feb 4, 2025

Curious how #17743 as a bug affects subsequent operations (i.e. is this a bug we really need to worry about?)

pandas.Index is not supposed to be publicly mutable, so the underlying numpy array not being writable doesn't seem that big of a deal, especially since the fix requires making a CPU copy of the data

It's not serious. But AFAIK it is because ._data yields a DatetimeArray and that is being mutated in pandas pytests leading to failures in the test suite. If not this fix, we will need to try to address this for cudf.pandas case at least. But since users can run into this issue with cudf classic aswell I went ahead and opened a short-term fix until the arrow bug is fixed.

@mroeschke
Copy link
Contributor

mroeschke commented Feb 5, 2025

But since users can run into this issue with cudf classic

Yeah so I'm curious what subsequent issue a user would run into if a pandas.Index returned from to_pandas() if the underlying numpy array was not writable. (i.e. if no public APIs break when using this Index I'm just tempted to stay this is a won't fix to avoid the CPU copy)

@galipremsagar
Copy link
Contributor Author

But since users can run into this issue with cudf classic

Yeah so I'm curious what subsequent issue a user would run into if a pandas.Index returned from to_pandas() if the underlying numpy array was not writable. (i.e. if no public APIs break when using this Index I'm just tempted to stay this is a won't fix to avoid the CPU copy)

This is the user-facing case I'm talking about:

In [1]: import pandas as pd
i 
In [2]: i = pd.Index([1, 2, 3], dtype="datetime64[ns]")

In [4]: i.to_numpy()
Out[4]: 
array(['1970-01-01T00:00:00.000000001', '1970-01-01T00:00:00.000000002',
       '1970-01-01T00:00:00.000000003'], dtype='datetime64[ns]')

In [5]: x = i.to_numpy()

In [6]: x[0] = 1

In [7]: x
Out[7]: 
array(['1970-01-01T00:00:00.000000001', '1970-01-01T00:00:00.000000002',
       '1970-01-01T00:00:00.000000003'], dtype='datetime64[ns]')

In [8]: x[0] = 2

In [9]: x
Out[9]: 
array(['1970-01-01T00:00:00.000000002', '1970-01-01T00:00:00.000000002',
       '1970-01-01T00:00:00.000000003'], dtype='datetime64[ns]')

In [10]: import cudf

In [11]: i = cudf.Index([1, 2, 3], dtype="datetime64[ns]")

In [13]: x = i.to_pandas().to_numpy()

In [14]: x
Out[14]: 
array(['1970-01-01T00:00:00.000000001', '1970-01-01T00:00:00.000000002',
       '1970-01-01T00:00:00.000000003'], dtype='datetime64[ns]')

In [15]: x[0] = 2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 1
----> 1 x[0] = 2

ValueError: assignment destination is read-only

nullable: bool = False,
arrow_type: bool = False,
) -> pd.Index:
if arrow_type and nullable:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do something like

if not (arrow_type and nullable):
    return pd.Index(self.data_array_view(mode="read").copy_to_host())
return super().to_pandas(nullable=nullable, arrow_type=arrow_type)

instead? I think that should avoid the extra CPU copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would but this doesn't support nulls:

[left]:  TimedeltaIndex([NaT, NaT, NaT, NaT, NaT, NaT, NaT], dtype='timedelta64[ns]', freq=None)
[right]: TimedeltaIndex(['0 days 00:00:00.000000012', '0 days 00:00:00.000000011',
                '0 days 00:00:00.000000002', '0 days 00:00:00.000002234',
                '0 days 00:00:00.000002343', '0 days 00:00:00.000023432',
                '0 days 00:00:00.000023234'],
               dtype='timedelta64[ns]', freq=None)



def test_writable_numpy_array():
gi = cudf.Index([1, 2, 3], dtype="datetime64[ns]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just do

gi = cudf.Index(...).to_numpy()
pi = pd.Index(...).to_numpy()

assert gi.flags == pi.flags

instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ownable flag is not the same for both so we will not be able to this assertion.

Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK. Darn yeah I forgot about to_numpy. Left some other feedback on the implementation

@galipremsagar galipremsagar added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 11, 2025
@galipremsagar
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 18533b2 into rapidsai:branch-25.04 Feb 11, 2025
108 of 109 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] DatetimeArray is not writable in cudf.pandas
2 participants