[FEA] CUDA stream pool #352
Comments
When we've discussed this in the past, it seemed like just using the per-thread default stream would be sufficient for Spark. Is that no longer the case?
I think we can live with the per-thread default stream, but my understanding is that RMM currently doesn't support it?
It does not, but that's a relatively short-term problem. I believe we'll have PTDS support much sooner than we'll have a stream pool plus the ability to use external streams in libcudf.
PTDS is tricky and warty. I don't like the way it creates multiple indistinguishable default streams, and I would rather not support it if we can avoid it. Adding a stream pool to RMM is relatively simple, and it can solve some of the ugly corner cases in RMM (e.g., if we require that streams used with RMM device_memory_resources be owned by the stream pool, we can avoid use-after-free bugs with streams in RMM pool allocators, which is difficult with external streams). I think we should avoid conflating the tasks of adding a stream pool to RMM and adding external-stream support to libcudf -- they are orthogonal.
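A rough sketch of the ownership idea above (the types and names here are hypothetical, not an existing RMM interface):

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: if memory resources only accept a pool-owned stream
// handle rather than a raw cudaStream_t, a stream that the caller created
// and later destroyed can never reach a pool allocator's internal state.
class stream_pool;  // owns every cudaStream_t for the pool's whole lifetime

class pooled_stream {
 public:
  cudaStream_t value() const noexcept { return stream_; }

 private:
  friend class stream_pool;  // only the pool can mint a handle
  explicit pooled_stream(cudaStream_t s) noexcept : stream_{s} {}
  cudaStream_t stream_;
};

// A device_memory_resource-style interface could then require the handle:
//   void* allocate(std::size_t bytes, pooled_stream stream);
```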
What's a user going to do with a stream pool if libcudf doesn't support streams?
Actually, I think I got PTDS working, at least on the couple of Spark jobs I tested. Let me send out some PRs and we can discuss there as well.
I think on the Python side we were counting on PTDS being supported; didn't realize this would be problematic. What should our plan for streams support be, then? cc @kkraus14
I was going to implement PTDS support in the new memory resources until I heard that it would not move the needle on current benchmarks...
Yep, that's fair. I had interpreted that to mean it was down-prioritized but still planned. The question I'm wondering about here is whether it is still planned, and if not, what people should do instead?
It's planned but has no priority currently, since Dask can't benefit.
Which people? My suggestion is to keep calm and carry on. :)
That's all I needed to know :) Will carry on ;)
To be clear, Dask can easily benefit just by starting multiple threads and/or processes per GPU. This just sometimes requires reworking problems to be consumed in smaller chunks than before, and it introduces new memory management challenges.
But the bottleneck is the Dask scheduler. Since this is not an easy fix, I am hesitant to spend time on something that will not produce a visible performance benefit.
Yes, I agree with this statement. I just wanted to be clear that if we had problems with much heavier compute that aren't task-graph heavy, then we could potentially use this. ETL problems typically don't follow this pattern though 😄.
Since we've decided to use per-thread default stream, closing this for now. A stream pool might still be useful, but at least for Spark we don't need it now. |
Is your feature request related to a problem? Please describe.
Spark runs one executor process per GPU. Each executor concurrently runs multiple tasks, each in its own CPU thread. A task is a natural unit of work. We'd like to have each task use its own asynchronous CUDA stream so that there is no unnecessary synchronization between tasks, and data copying and compute can overlap between tasks.
However, tasks can be very fine grained and short lived, so creating and destroying a stream in each task may be too expensive. Because of the dynamic nature of task execution, it's not feasible to create all the needed streams upfront and destroy them after the work is done.
It's desirable for RMM to provide a stream pool that can be shared between different tasks/threads.
Describe the solution you'd like
A stream pool that can be initialized to a fixed size. By default this size can probably be set to the maximum number of compute and copy engine concurrent connections (work queues), i.e. 32.
The user can request a stream from the pool, which is assigned round-robin.
There is no need to return a stream to the pool.
When the pool is destroyed, all streams are destroyed.
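A minimal sketch of what such a pool could look like (the class and its interface are illustrative, not an existing RMM API):

```cpp
#include <cuda_runtime.h>

#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative sketch only -- not an existing RMM API. Streams are created
// once at construction, handed out round-robin, never returned by callers,
// and destroyed only when the pool itself is destroyed.
class stream_pool {
 public:
  // Default size matches the maximum number of concurrent hardware work
  // queues mentioned above (32).
  explicit stream_pool(std::size_t size = 32) : streams_(size) {
    for (auto& s : streams_) {
      cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
  }

  ~stream_pool() {
    for (auto s : streams_) { cudaStreamDestroy(s); }
  }

  stream_pool(stream_pool const&) = delete;
  stream_pool& operator=(stream_pool const&) = delete;

  // Round-robin assignment; thread-safe via an atomic counter.
  cudaStream_t get_stream() noexcept {
    return streams_[next_.fetch_add(1, std::memory_order_relaxed) % streams_.size()];
  }

 private:
  std::vector<cudaStream_t> streams_;
  std::atomic<std::size_t> next_{0};
};
```

Each task thread would simply call get_stream() when it starts and use the returned stream for its copies and kernels; because the streams are pool-owned, there is nothing to release when the task finishes.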
Describe alternatives you've considered
Another option is to enable --default-stream per-thread with nvcc, or #define CUDA_API_PER_THREAD_DEFAULT_STREAM with gcc. But it is said this is not compatible with the RMM pool (rapidsai/dask-cuda#96 (comment)).
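For reference, enabling PTDS looks roughly like this; the nvcc flag and the macro come from the CUDA documentation, and the macro must appear before any CUDA header is included:

```cpp
// Enable per-thread default streams without the nvcc flag
// (equivalent to compiling with: nvcc --default-stream per-thread).
// The macro must be defined before the first CUDA header is included.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

// With PTDS enabled, work each host thread submits to the default stream
// (stream 0) runs on that thread's own stream, so launches from different
// threads can overlap without explicit stream management.
```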
Additional context
PyTorch has something similar.
Issue: pytorch/pytorch#9646
PR: pytorch/pytorch#9938