Fix categorical-accessor support and testing in dask-cudf #15591

rjzamora · 2024-04-24T14:26:16Z

Description

Related to #15027

Adds a minor tokenization fix, and adjusts testing for categorical-accessor support.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rjzamora · 2024-04-24T14:29:45Z

python/cudf/cudf/core/series.py

+    @_cudf_nvtx_annotate
+    @_warn_no_dask_cudf
+    def __dask_tokenize__(self):
+        from dask.base import normalize_token
+
+        return [
+            type(self),
+            str(self.dtype),
+            normalize_token(self.to_pandas()),
+        ]


I'm honestly not sure why the Frame.__dask_tokenize__ definition (which looks very similar) isn't being used?

Whatever the reason, test_categorical_compare_ordered fails without this fix: different Series objects are tokenized to the same value, and the corresponding expressions are cached between tests. (General takeaway: unique, correct tokenization is very important when query-planning is active.)
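To illustrate the failure mode described above, here is a minimal, standard-library-only sketch. It does not use dask or cudf: ToySeries, toy_tokenize, and cached_max are hypothetical stand-ins, and the dict cache only imitates the way cached expressions are looked up by token.

```python
import hashlib


def toy_tokenize(*parts):
    """Hash the repr of each part into one deterministic token."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(repr(part).encode("utf-8"))
        digest.update(b"\x00")  # separator between parts
    return digest.hexdigest()


class ToySeries:
    """Hypothetical stand-in for a Series: a dtype name plus values."""

    def __init__(self, dtype, values):
        self.dtype = dtype
        self.values = list(values)

    def weak_token(self):
        # Ignores the data entirely, so distinct objects collide.
        return toy_tokenize(type(self).__name__, self.dtype)

    def strong_token(self):
        # Includes the data, so distinct objects get distinct tokens.
        return toy_tokenize(type(self).__name__, self.dtype, self.values)


def cached_max(series, cache, token_method):
    """Return max(series.values), reusing any result cached under the same token."""
    key = getattr(series, token_method)()
    if key not in cache:
        cache[key] = max(series.values)
    return cache[key]


a = ToySeries("category", ["a", "b"])
b = ToySeries("category", ["x", "y"])

weak_cache = {}
weak_results = (
    cached_max(a, weak_cache, "weak_token"),
    cached_max(b, weak_cache, "weak_token"),  # served from a's cache entry
)

strong_cache = {}
strong_results = (
    cached_max(a, strong_cache, "strong_token"),
    cached_max(b, strong_cache, "strong_token"),
)

print(weak_results)
print(strong_results)
```

With the weak token, the second lookup silently returns the result computed for the first object, which is the kind of cross-test contamination described above.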

Looks like before this was using IndexedFrame.__dask_tokenize__ where the data was being hashed as self.hash_values().values_host as opposed to self.to_pandas() (might be difference here for categorical)

Might be a better fix if IndexedFrame.__dask_tokenize__ uses to_pandas() instead of self.hash_values? Additionally if we want to use this fix as-is I think we would need to also incorporate the self.index?

> Looks like before this was using IndexedFrame.__dask_tokenize__ where the data was being hashed as self.hash_values().values_host as opposed to self.to_pandas() (might be difference here for categorical)

Aha! Thanks for pointing that out @mroeschke !

This is actually a problem we ran into before and fixed in Frame.__dask_tokenize__. It turns out that normalize_token(self._dtypes) doesn't work very well. The more reliable thing to do is actually use str(self._dtypes). With that said, dtypes with many categories may not be completely/well represented by str(self._dtypes). Therefore, I just added an extra line to explicitly normalize the actual categories for each categorical dtype.
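A rough sketch of why the string form of the dtypes may not be enough on its own, and how spelling out the categories helps. Everything here is a toy: ToyCategoricalDtype and its truncating repr are assumptions made for illustration, not cudf's actual dtype or repr, and the two token functions are hypothetical.

```python
import hashlib


class ToyCategoricalDtype:
    """Hypothetical dtype whose repr elides the middle of long category lists."""

    def __init__(self, categories, ordered=False):
        self.categories = list(categories)
        self.ordered = ordered

    def __repr__(self):
        cats = self.categories
        if len(cats) > 4:
            shown = f"{cats[0]!r}, {cats[1]!r}, ..., {cats[-1]!r}"
        else:
            shown = ", ".join(repr(c) for c in cats)
        return f"ToyCategoricalDtype([{shown}], ordered={self.ordered})"


def _digest(parts):
    joined = "\x00".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def token_from_str_only(dtypes):
    """Token built from str(dtypes) alone."""
    return _digest([str(dtypes)])


def token_with_categories(dtypes):
    """Token built from str(dtypes) plus every category, spelled out in full."""
    parts = [str(dtypes)]
    for name, dtype in dtypes.items():
        if isinstance(dtype, ToyCategoricalDtype):
            parts.append(name)
            parts.extend(repr(c) for c in dtype.categories)
    return _digest(parts)


# Two dtypes that differ only in the categories hidden by the truncated repr.
left = {"col": ToyCategoricalDtype(["a", "b", "c", "d", "z"])}
right = {"col": ToyCategoricalDtype(["a", "b", "q", "r", "z"])}

print(str(left) == str(right))
print(token_from_str_only(left) == token_from_str_only(right))
print(token_with_categories(left) == token_with_categories(right))
```

The string-only token collides because the differing categories are elided; adding the full category list to the token separates the two dtypes.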

> Might be a better fix if IndexedFrame.__dask_tokenize__ uses to_pandas() instead of self.hash_values?

I think you are right that this is probably the safest and most robust thing to do. However, I am still hesitant to remove the hash_values code path. Right now, we avoid moving more than two columns (the hashed values, and the index) to host memory when a cudf object is tokenized. The overhead difference may not be dramatic, but it would be nice to avoid moving the whole thing to pandas.
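The trade-off above can be sketched in pure Python. This is only an analogy: row_hashes and token_via_row_hashes are hypothetical helpers, and plain lists stand in for device columns. The point is that however wide the frame is, only two sequences (the per-row hashes and the index) feed the token.

```python
import hashlib


def row_hashes(columns):
    """Collapse any number of equal-length columns into one list of per-row hashes."""
    rows = zip(*columns.values())
    return [hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows]


def token_via_row_hashes(columns, index):
    """Build a token from two sequences only: the row hashes and the index."""
    hashed = row_hashes(columns)
    payload = repr((hashed, list(index)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


wide = {"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [0.5, 1.5, 2.5]}
changed = {"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [0.5, 1.5, 9.9]}
index = [10, 11, 12]

hashed_column = row_hashes(wide)  # one hash per row, regardless of column count
same = token_via_row_hashes(wide, index) == token_via_row_hashes(dict(wide), index)
different = token_via_row_hashes(wide, index) != token_via_row_hashes(changed, index)

print(len(hashed_column), same, different)
```

Whether a row hash captures everything that matters for a given dtype (categoricals in particular) is exactly the open question in this thread; the sketch does not settle it.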

python/cudf/cudf/core/frame.py

charlesbluca · 2024-04-29T14:01:21Z

python/dask_cudf/dask_cudf/tests/test_accessor.py

@@ -111,7 +111,8 @@ def test_categorical_accessor_initialization2(data):
        dsr.cat


-@xfail_dask_expr("TODO: Unexplained dask-expr failure")
+# TODO: Remove this once we are pinned to dask>=2024.5.0
+@xfail_dask_expr("Requires: https://github.com/dask/dask/pull/11059")


Wonder if the lt_version param of this marker should also account for the dask-core version, since the dask-expr doesn't have a super established release cycle yet?

That way, in addition to leaving this TODO we could also do something like lt_version=2024.5.0 to make sure that things fail loudly here once that dask-core version becomes available.

Yeah, I was thinking the same thing. I was actually going to submit a dedicated PR to revise the xfail_dask_expr utility, but might as well do it here :)

Okay - Thanks again for the suggestion. The xfail_dask_expr/skip_dask_expr utilities have been updated.
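For readers following along, the version-gating idea discussed above could look roughly like the sketch below. It is not the actual xfail_dask_expr implementation: parse_version and xfail_condition are hypothetical names, and the parser assumes purely numeric, dot-separated versions (no local or pre-release suffixes).

```python
def parse_version(text):
    """Turn '2024.5.0' into (2024, 5, 0); assumes numeric, dot-separated parts."""
    return tuple(int(part) for part in text.split("."))


def xfail_condition(installed_version, lt_version=None):
    """Return True while the test is still expected to fail.

    With lt_version=None the expectation never expires. Otherwise it holds
    only while installed_version < lt_version, so the test must pass on its
    own once the fixed release is installed.
    """
    if lt_version is None:
        return True
    return parse_version(installed_version) < parse_version(lt_version)


print(xfail_condition("2024.4.1", lt_version="2024.5.0"))
print(xfail_condition("2024.5.0", lt_version="2024.5.0"))
print(xfail_condition("2024.12.0", lt_version="2024.5.0"))
```

The resulting boolean could be passed as the condition to pytest.mark.xfail together with a reason string. Comparing integer tuples rather than raw strings matters here, since "2024.12.0" sorts before "2024.5.0" lexically.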

rjzamora · 2024-05-01T19:31:02Z

/merge

rjzamora added 4 commits April 19, 2024 12:31

fix categorical support for dask-expr - needs upstream fix to clear_known_categories

cb046bb

Merge remote-tracking branch 'upstream/branch-24.06' into fix-categorical-support

7a45bde

Merge remote-tracking branch 'upstream/branch-24.06' into fix-categorical-support

057fdcb

adjust tests

fde1651

rjzamora added the bug, 2 - In Progress, dask, and non-breaking labels Apr 24, 2024

rjzamora self-assigned this Apr 24, 2024

github-actions bot added the Python label Apr 24, 2024

rjzamora mentioned this pull request Apr 24, 2024

[FEA] Support "dataframe.query-planning" config in dask.dataframe #15027

rjzamora marked this pull request as ready for review April 24, 2024 15:18

rjzamora requested review from a team as code owners April 24, 2024 15:18

rjzamora requested review from mroeschke and brandon-b-miller April 24, 2024 15:18

rjzamora added the 3 - Ready for Review label and removed the 2 - In Progress label Apr 24, 2024

rjzamora added 3 commits April 25, 2024 07:16

Merge remote-tracking branch 'upstream/branch-24.06' into fix-categorical-support

c0d972f

roll back Series.__dask_tokenize__ change in favor of simpler fix

c2bc812

normalize categories just in case the list is too long for repr

0f01712

rjzamora added 3 commits April 25, 2024 10:19

Update python/cudf/cudf/core/frame.py

b4e7c66

Merge branch 'branch-24.06' into fix-categorical-support

d901e20

Merge branch 'branch-24.06' into fix-categorical-support

f38d62e

charlesbluca reviewed Apr 29, 2024

rjzamora added 3 commits April 29, 2024 15:46

Merge branch 'branch-24.06' into fix-categorical-support

e229267

Merge branch 'branch-24.06' into fix-categorical-support

ea0616b

Merge remote-tracking branch 'upstream/branch-24.06' into fix-categorical-support

6f0ee4c

rjzamora added 2 commits April 30, 2024 08:01

use dask version instead of dask-expr version for lt_version

5432d20

update test

164cc2d

charlesbluca approved these changes Apr 30, 2024

rjzamora added 3 commits April 30, 2024 12:31

Merge branch 'branch-24.06' into fix-categorical-support

5913c8d

Merge branch 'branch-24.06' into fix-categorical-support

e41ad69

Merge branch 'branch-24.06' into fix-categorical-support

88e8383

mroeschke approved these changes May 1, 2024

rjzamora added the 5 - Ready to Merge label and removed the 3 - Ready for Review label May 1, 2024

bdice approved these changes May 1, 2024

rapids-bot bot merged commit 67d427d into rapidsai:branch-24.06 May 1, 2024
69 checks passed

rjzamora deleted the fix-categorical-support branch May 1, 2024 21:09
