[QST] Parallelize multiple cuDF query calls #7053
Dask could be helpful here. Dask's implementation of `query` is a blocked version of the pandas `query` (is this true for cuDF?), so you could write something like the following:

```python
In [1]: import dask

In [2]: ddf = dask.datasets.timeseries()

In [3]: q1 = ddf.query('x > 0.5')

In [4]: q2 = ddf.query('x < 0')

In [5]: res = dask.compute([q1, q2])

In [6]: res
Out[6]:
([                       id     name         x         y
  timestamp
  2000-01-01 00:00:15   986    Sarah  0.577679 -0.792372
  2000-01-01 00:00:18   992   Xavier  0.757104 -0.926112
  2000-01-01 00:00:19  1005   Ursula  0.723600  0.459464
  2000-01-01 00:00:20   961   Ingrid  0.804942  0.001995
  2000-01-01 00:00:22   952    Alice  0.993793  0.806622
  ...                   ...      ...       ...       ...
  2000-01-30 23:59:45  1032    Quinn  0.919070 -0.472543
  2000-01-30 23:59:48  1009  Norbert  0.631444  0.779816
  2000-01-30 23:59:54   998      Tim  0.599235 -0.834612
  2000-01-30 23:59:55   967    Frank  0.769159  0.853424
  2000-01-30 23:59:58   966   Ursula  0.866536  0.793179

  [647943 rows x 4 columns],
                         id     name         x         y
  timestamp
  2000-01-01 00:00:01  1004  Michael -0.365752  0.700723
  2000-01-01 00:00:02   963   Yvonne -0.616555  0.462348
  2000-01-01 00:00:04  1022  Norbert -0.636567 -0.090011
  2000-01-01 00:00:05   976      Tim -0.305983 -0.539153
  2000-01-01 00:00:06   994   George -0.593397 -0.349443
  ...                   ...      ...       ...       ...
  2000-01-30 23:59:51   993    Kevin -0.988354  0.510483
  2000-01-30 23:59:52  1015    Quinn -0.686493  0.710052
  2000-01-30 23:59:56  1049   Hannah -0.569486  0.934169
  2000-01-30 23:59:57  1013    Jerry -0.281122 -0.786347
  2000-01-30 23:59:59   960      Ray -0.591232  0.570027

  [1296654 rows x 4 columns]],)
```

With N+1 workers we should see the benefits of parallelism -- looking at the task graph should give us a better idea.

For GPUs the graph above is somewhat limiting. Currently, with dask-cuda, we limit the number of threads/workers to 1 per GPU. This will hopefully change in the future as we are working towards stream support: rapidsai/dask-cuda#96. Still, I would expect to see some gains, in that the data will be chunked and Dask can leverage this chunking for execution on multiple GPUs.
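The snippet above uses the CPU-backed `dask.datasets.timeseries`. For cuDF, the same `dask.compute` pattern should carry over with `dask_cudf` partitions on a dask-cuda cluster. A rough, untested sketch (it assumes a machine with one or more NVIDIA GPUs and `cudf`, `dask_cudf`, and `dask-cuda` installed; the dataset construction here is illustrative, not from the original comment):

```
# Untested sketch: requires NVIDIA GPU(s) plus cudf, dask_cudf, dask-cuda.
import dask
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU (dask-cuda's current model, as noted above).
client = Client(LocalCUDACluster())

# Build a cuDF frame and partition it so Dask can schedule chunks.
gdf = cudf.datasets.timeseries()
ddf = dask_cudf.from_cudf(gdf, npartitions=8)

q1 = ddf.query('x > 0.5')
q2 = ddf.query('x < 0')

# Both queries are merged into a single task graph; shared input chunks
# are materialized once, and partitions can run on different GPUs.
res1, res2 = dask.compute(q1, q2)
```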
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
This issue has been labeled
I'm going to close this question as answered, feel free to reopen if necessary.
Hi, is there a CUDA way to parallelize multiple `query` calls on a cuDF dataframe, or multiple `query()` calls on different cuDF dataframes?
Here is an example code snippet; the following query doesn't take many GPU cycles, and I'd like to be able to launch multiple such queries on copies of the same dataframe or on different cuDF dataframes.
In pandas I can run one query (or any pandas function, for that matter) per CPU core, and I'd like to know what the equivalent model is in the CUDA world.
Thanks
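For reference, the per-core CPU model described here can be sketched with the standard library alone. The "dataframe" and "query" below are toy stand-ins (plain Python lists and predicates, not real pandas or cuDF calls), used only to show the fan-out/collect dispatch pattern:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for a dataframe: a list of row dicts.
rows = [{"id": i, "x": i / 10 - 0.5} for i in range(11)]

def query(rows, predicate):
    """Stand-in for DataFrame.query: keep rows matching a predicate."""
    return [r for r in rows if predicate(r)]

# Two independent "queries" to run concurrently.
predicates = [lambda r: r["x"] > 0.2, lambda r: r["x"] < 0.0]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(query, rows, p) for p in predicates]
    results = [f.result() for f in futures]
```

With real pandas frames you would typically reach for a `ProcessPoolExecutor` (or joblib/Dask) instead, since the GIL limits thread-level parallelism for pure-Python work; threads are used here only to keep the sketch self-contained.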