[FEA] Support "dataframe.query-planning" config in dask.dataframe
#15027
This was referenced Feb 12, 2024
rapids-bot (bot) pushed a commit to rapidsai/dask-cuda that referenced this issue on Feb 13, 2024:

Dask CUDA must use the deprecated `dask.dataframe` API until #1311 and rapidsai/cudf#15027 are both closed. This means that we must explicitly filter the following deprecation warning to avoid nightly CI failures:

```
DeprecationWarning: The current Dask DataFrame implementation is deprecated.
In a future release, Dask DataFrame will use new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues

  import dask.dataframe as dd
```

This PR adds the (temporarily) necessary warning filter.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #1312
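For readers who need a similar stopgap, here is a minimal sketch of such a warning filter using only the standard library. It is not the filter from the PR above: the function name `install_legacy_dataframe_filter` and the pattern constant are hypothetical, and the pattern assumes the warning text begins (after optional whitespace) with the sentence quoted in the commit message.

```python
import warnings

# Assumption: the deprecation message starts, after optional whitespace,
# with this sentence. `filterwarnings` matches the pattern as a regex
# against the beginning of the message text.
DEPRECATION_PATTERN = r"\s*The current Dask DataFrame implementation is deprecated"


def install_legacy_dataframe_filter() -> None:
    """Ignore only the legacy-API deprecation warning (hypothetical helper).

    Other DeprecationWarnings stay visible, so unrelated deprecations
    are still reported in CI.
    """
    warnings.filterwarnings(
        "ignore",
        message=DEPRECATION_PATTERN,
        category=DeprecationWarning,
    )
```

The filter has to be installed before the first `import dask.dataframe`, since that import is what emits the warning. Matching on the message rather than ignoring every `DeprecationWarning` keeps the workaround narrow and easy to delete once the migration is done.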
rapids-bot (bot) pushed a commit that referenced this issue on Mar 11, 2024:

Mostly addresses #15027

dask/dask-expr#728 exposed the necessary mechanisms for us to define a custom dask-expr backend for `cudf`. The new dispatching mechanisms are effectively the same as those in `dask.dataframe`. The only difference is that we are now registering/implementing "expression-based" collections.

This PR does the following:
- Defines a basic `DataFrameBackendEntrypoint` class for collection creation, and registers new collections using `get_collection_type`.
- Refactors the `dask_cudf` import structure to properly support the `"dataframe.query-planning"` configuration.
- Modifies CI to test dask-expr support for some of the `dask_cudf` tests. This coverage can be expanded in follow-up work.

~**Experimental Change**: This PR patches `dask_expr._expr.Expr.__new__` to enable type-based dispatching. This effectively allows us to surgically replace problematic `Expr` subclasses that do not work for cudf-backed data. For example, this PR replaces the upstream `TakeLast` expression to avoid using `squeeze` (since this method is not supported by cudf). This particular fix can be moved upstream relatively easily. However, having this kind of "patching" mechanism may be valuable for more complicated pandas/cudf discrepancies.~

## Usage example

```python
from dask import config
config.set({"dataframe.query-planning": True})
import dask_cudf

df = dask_cudf.DataFrame.from_dict(
    {"x": range(100), "y": [1, 2, 3, 4] * 25, "z": ["1", "2"] * 50},
    npartitions=10,
)
df["y2"] = df["x"] + df["y"]
agg = df.groupby("y").agg({"y2": "mean"})["y2"]
agg.simplify().pprint()
```

Dask cuDF should now be using dask-expr for "query planning":

```
Projection: columns='y2'
  GroupbyAggregation: arg={'y2': 'mean'} observed=True split_out=1'y'
    Assign: y2=
      Projection: columns=['y']
        FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
      Add:
        Projection: columns='x'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
        Projection: columns='y'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
```

## TODO

- [x] Add basic tests
- [x] Confirm that general design makes sense

**Follow Up Work**:
- Expand dask-expr test coverage
- Fix local and upstream bugs
- Add documentation once "critical mass" is reached

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)
- Lawrence Mitchell (https://github.com/wence-)
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Ray Douglass (https://github.com/raydouglass)

URL: #14805
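The type-based registration that the commit message describes can be pictured with a toy registry in plain Python. This is only an illustration of the pattern, not Dask's `get_collection_type` or its dispatch machinery: `TypeDispatch`, `HostFrameMeta`, `DeviceFrameMeta`, and `collection_type_for` are all hypothetical names invented for this sketch.

```python
class TypeDispatch:
    """Toy registry mapping a type to the function that handles it."""

    def __init__(self, name):
        self.name = name
        self._lookup = {}

    def register(self, typ):
        """Return a decorator that registers a handler for `typ`."""
        def decorator(func):
            self._lookup[typ] = func
            return func
        return decorator

    def __call__(self, obj):
        # Walk the MRO so subclasses fall back to a parent's handler.
        for cls in type(obj).__mro__:
            if cls in self._lookup:
                return self._lookup[cls](obj)
        raise TypeError(
            f"{self.name}: no implementation for {type(obj).__name__}"
        )


# Hypothetical stand-ins for pandas-like and cudf-like metadata objects.
class HostFrameMeta:
    pass


class DeviceFrameMeta:
    pass


collection_type_for = TypeDispatch("collection_type_for")


@collection_type_for.register(HostFrameMeta)
def _host_collection(meta):
    return "HostCollection"


@collection_type_for.register(DeviceFrameMeta)
def _device_collection(meta):
    return "DeviceCollection"
```

A backend only has to register handlers for its own types. Code that calls `collection_type_for(meta)` never needs to know which backend produced `meta`, which is the property that lets a cudf backend plug in without changes to the calling code.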
This was referenced Mar 26, 2024
rapids-bot (bot) pushed a commit that referenced this issue on Apr 1, 2024:

Addresses parts of #15027 (json and s3 testing).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15408
rapids-bot (bot) pushed a commit that referenced this issue on Apr 9, 2024:

Related to orc and text support in #15027

Follow-up work can enable predicate pushdown and column projection with ORC, but the goal of this PR is basic functionality (and parity with the legacy API).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15439
This was referenced Apr 24, 2024
rapids-bot (bot) pushed a commit that referenced this issue on May 1, 2024:

Related to #15027

Adds a minor tokenization fix, and adjusts testing for categorical-accessor support.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Matthew Roeschke (https://github.com/mroeschke)
- Bradley Dice (https://github.com/bdice)

URL: #15591
rapids-bot (bot) pushed a commit that referenced this issue on May 9, 2024:

Related to #15027

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #15639
PSA
To unblock CI failures related to the dask-expr migration, downstream RAPIDS libraries can set the following environment variable in CI (before `dask.dataframe`/`dask_cudf` is ever imported):

If you do this, please be sure to comment on the change, and link it to this meta issue. (So I can make the necessary changes/fixes, and turn query-planning back on)
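The variable itself is not shown in this copy of the issue. As a hedged sketch, assuming Dask's documented convention for configuration through the environment (a `DASK_` prefix, a double underscore for each level of nesting, and hyphens treated the same as underscores), the `"dataframe.query-planning"` key would be spelled as below. The helper `dask_env_var` is hypothetical and exists only to show the mapping.

```python
import os


def dask_env_var(config_key: str) -> str:
    """Spell a Dask config key as an environment variable name.

    Hypothetical helper illustrating the assumed convention:
    "DASK_" prefix, "." becomes "__", "-" becomes "_", upper-cased.
    """
    return "DASK_" + config_key.replace(".", "__").replace("-", "_").upper()


QUERY_PLANNING_VAR = dask_env_var("dataframe.query-planning")

# Assumption: Dask collects environment variables when its config module
# is first loaded, so this must run before the first `import dask` in the
# process. In CI, exporting the variable in the shell is the safer route.
os.environ[QUERY_PLANNING_VAR] = "False"
```

Setting the variable in the CI job definition, rather than in Python, avoids any dependence on import order inside the test suite.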
Background
The `2024.2.0` release of Dask has deprecated the "legacy" `dask.dataframe` API. Given that dask-cudf (and much of RAPIDS) is tightly integrated with `dask.dataframe`, it is critical that `dask_cudf` be updated to use the new `dask_expr` backend smoothly.

Most of the heavy lifting is already being done in #14805. However, there will also be some follow-up work to expand coverage/examples/documentation/benchmarks. We will also need to update dask-cuda/explicit-comms.
Action Items
Basics (to be covered by #14805):
- `DataFrameBackendEntrypoint` entrypoint for "cudf"
- `dask_cudf` imports with `dask.dataframe` for `"dataframe.query-planning"` support

Expected Follow-up:
- `read_json` support (Enable `dask_cudf` json and s3 tests with query-planning on #15408)
- `read_orc` support (Support orc and text IO with dask-expr using legacy conversion #15439)
- `read_parquet` should always return DataFrame (not currently the case in dask-expr if `columns=<str>`)
- `check_file_size` functionality from `dask_cudf.read_parquet`
- `dask_cudf` json and s3 tests with query-planning on #15408)
- `read_text` support (Support orc and text IO with dask-expr using legacy conversion #15439)
- `to_dask_dataframe` API in favor of `to_backend` (Deprecate `to/from_dask_dataframe` APIs in dask-cudf #15592)
- `set_index(..., divisions="quantile")` (Deprecate `divisions='quantile'` support in `set_index` #15804)
- `describe` support (seems to be working now? Just need to remove `xfail` markers)
- `groupby` "collect" support (Add "collect" aggregation support to dask-cudf #15593)
- `as_index` support to `groupby`
- `get_dummy` support (Generalize `get_dummies` dask/dask-expr#1053)
- `leftanti` merge support (Likely an error message in 24.06 and support in 24.08+)
- `to_datetime` support (Add cudf support to `to_datetime` and `_maybe_from_pandas` dask/dask-expr#1035)
- `melt` support (Add support for `DataFrame.melt` dask/dask-expr#1049 & Add `melt` support when query-planning is enabled dask/dask#11088)

cuDF / Dask cuDF doc build:
- `var` logic in dask-cudf #15347)

cuML support:

cuxfilter support:
- `dask_cudf.core` cuxfilter#593)

cugraph support:

Dask CUDA:
- `dask.dataframe` dask-cuda#1311)

Dask SQL:

NeMo Curator:

Merlin: