
[QST] Parallelize multiple cuDF query calls #7053

Closed
yuriy312 opened this issue Dec 29, 2020 · 5 comments
Labels: Python (Affects Python cuDF API), question (Further information is requested)

Comments

@yuriy312

yuriy312 commented Dec 29, 2020

Hi, is there a CUDA way to parallelize multiple query() calls on a single cuDF DataFrame, or query() calls across different cuDF DataFrames?

Here is an example code snippet. The following query doesn't take many GPU cycles, and I'd like to be able to launch multiple such queries on copies of the same DataFrame or on different cuDF DataFrames:


import cudf
import cupy as cp
import numpy as np

NUM_ELEMENTS = 100000

# Build a DataFrame of random values directly on the GPU
df = cudf.DataFrame()
df['value1'] = cp.random.sample(NUM_ELEMENTS)
df['value2'] = cp.random.sample(NUM_ELEMENTS)
df['value3'] = cp.random.sample(NUM_ELEMENTS)

# Scalar thresholds generated on the host
c1 = np.random.random()
c2 = np.random.random()
c3 = np.random.random()
res = df.query('((value1 < @c1) & (value2 > @c2) & (value3 < @c3))')

In Pandas I can run one query (or any Pandas function, for that matter) per CPU core, and I'd like to know what the equivalent model is in the CUDA world.

Thanks
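For reference, the "one query per CPU core" pattern mentioned above can be sketched with the standard library's thread pool. This is a minimal CPU-side sketch assuming pandas and numpy; the list of frames, the worker count, and `run_query` are illustrative and not part of the original snippet:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

NUM_ELEMENTS = 100_000

# Several independent DataFrames to query in parallel.
frames = [
    pd.DataFrame({
        "value1": np.random.random(NUM_ELEMENTS),
        "value2": np.random.random(NUM_ELEMENTS),
        "value3": np.random.random(NUM_ELEMENTS),
    })
    for _ in range(4)
]

c1, c2, c3 = np.random.random(3)

def run_query(df: pd.DataFrame) -> pd.DataFrame:
    # pandas query() resolves @-variables from the caller's scope;
    # with the numexpr engine installed, it can release the GIL,
    # letting the threads below overlap.
    return df.query("(value1 < @c1) & (value2 > @c2) & (value3 < @c3)")

# One query per worker thread, mirroring "one query per CPU core".
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_query, frames))
```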

@yuriy312 yuriy312 added Needs Triage Need team to review and classify question Further information is requested labels Dec 29, 2020
@yuriy312 yuriy312 changed the title Parallelize multiple cuDF query calls [QST] Parallelize multiple cuDF query calls Dec 29, 2020
@harrism
Member

harrism commented Jan 5, 2021

@kkraus14 @quasiben is there a dask approach that can be applied here?

@quasiben
Member

quasiben commented Jan 5, 2021

Dask could be helpful here. Dask's query is a blocked version of the pandas query (is this true for cuDF?), so you could write something like the following:

In [1]: import dask

In [2]: ddf = dask.datasets.timeseries()

In [3]: q1 = ddf.query('x > 0.5')

In [4]: q2 = ddf.query('x < 0')

In [5]: res = dask.compute([q1, q2])

In [6]: res
Out[6]:
([                       id     name         x         y
  timestamp
  2000-01-01 00:00:15   986    Sarah  0.577679 -0.792372
  2000-01-01 00:00:18   992   Xavier  0.757104 -0.926112
  2000-01-01 00:00:19  1005   Ursula  0.723600  0.459464
  2000-01-01 00:00:20   961   Ingrid  0.804942  0.001995
  2000-01-01 00:00:22   952    Alice  0.993793  0.806622
  ...                   ...      ...       ...       ...
  2000-01-30 23:59:45  1032    Quinn  0.919070 -0.472543
  2000-01-30 23:59:48  1009  Norbert  0.631444  0.779816
  2000-01-30 23:59:54   998      Tim  0.599235 -0.834612
  2000-01-30 23:59:55   967    Frank  0.769159  0.853424
  2000-01-30 23:59:58   966   Ursula  0.866536  0.793179

  [647943 rows x 4 columns],
                         id     name         x         y
  timestamp
  2000-01-01 00:00:01  1004  Michael -0.365752  0.700723
  2000-01-01 00:00:02   963   Yvonne -0.616555  0.462348
  2000-01-01 00:00:04  1022  Norbert -0.636567 -0.090011
  2000-01-01 00:00:05   976      Tim -0.305983 -0.539153
  2000-01-01 00:00:06   994   George -0.593397 -0.349443
  ...                   ...      ...       ...       ...
  2000-01-30 23:59:51   993    Kevin -0.988354  0.510483
  2000-01-30 23:59:52  1015    Quinn -0.686493  0.710052
  2000-01-30 23:59:56  1049   Hannah -0.569486  0.934169
  2000-01-30 23:59:57  1013    Jerry -0.281122 -0.786347
  2000-01-30 23:59:59   960      Ray -0.591232  0.570027

  [1296654 rows x 4 columns]],)

With N+1 workers we should see the benefits of parallelism. Looking at the task graph should give us a better idea:

[Screenshot: Dask task graph for the two query calls]

For GPUs, the graph above is somewhat limiting. Currently, with dask-cuda, we limit the number of threads/workers to 1 per GPU. This will hopefully change in the future as we work towards stream support: rapidsai/dask-cuda#96

Still, I would expect to see some gains, since the data will be chunked and Dask can leverage this chunking for execution across multiple GPUs.

@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jan 8, 2021
@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr
Contributor

vyasr commented Oct 17, 2022

I'm going to close this question as answered; feel free to reopen if necessary.

@vyasr vyasr closed this as completed Oct 17, 2022
5 participants