
[FEA] CUDA stream pool #352

Closed
rongou opened this issue Apr 21, 2020 · 15 comments
Labels
? - Needs Triage (Need team to review and classify)
feature request (New feature or request)

Comments

@rongou
Contributor

rongou commented Apr 21, 2020

Is your feature request related to a problem? Please describe.

Spark runs one executor process per GPU. Each executor concurrently runs multiple tasks, each in its own CPU thread. A task is a natural unit of work. We'd like to have each task use its own asynchronous CUDA stream so that there is no unnecessary synchronization between tasks, and data copying and compute can overlap between tasks.

However, tasks can be very fine grained and short lived, so creating and destroying a stream in each task may be too expensive. Because of the dynamic nature of task execution, it's not feasible to create all the needed streams upfront and destroy them after the work is done.

It's desirable for RMM to provide a stream pool that can be shared between different tasks/threads.

Describe the solution you'd like

A stream pool that can be initialized to a fixed size. By default this size could be set to the maximum number of concurrent connections (work queues) to the compute and copy engines, i.e. 32.

The user can request a stream from the pool, which is assigned round-robin.

There is no need to return a stream to the pool.

When the pool is destroyed, all streams are destroyed.
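
Taken together, a minimal sketch of what such a pool could look like (illustrative only; the class name, `get_stream()`, and the default size of 32 are assumptions, not an actual RMM API):

```cpp
#include <cuda_runtime.h>

#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical round-robin stream pool: streams are created once, handed out
// round-robin, never returned by callers, and destroyed with the pool.
class stream_pool {
 public:
  explicit stream_pool(std::size_t size = 32) : streams_(size) {
    for (auto& s : streams_) {
      cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
  }

  ~stream_pool() {
    for (auto& s : streams_) {
      cudaStreamDestroy(s);
    }
  }

  stream_pool(stream_pool const&) = delete;
  stream_pool& operator=(stream_pool const&) = delete;

  // Round-robin assignment; safe to call concurrently from task threads.
  cudaStream_t get_stream() noexcept {
    return streams_[next_.fetch_add(1, std::memory_order_relaxed) % streams_.size()];
  }

 private:
  std::vector<cudaStream_t> streams_;
  std::atomic<std::size_t> next_{0};
};
```

Each Spark task/thread would then call `get_stream()` once and issue all of its copies and kernels on the returned stream.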

Describe alternatives you've considered

Another option is to enable --default-stream per-thread with nvcc, or to #define CUDA_API_PER_THREAD_DEFAULT_STREAM before including the CUDA headers when compiling with gcc. However, this is reportedly not compatible with the RMM pool allocator (rapidsai/dask-cuda#96 (comment)).
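
For reference, a minimal sketch of that alternative using CUDA's documented per-thread default stream mechanisms (the example program itself is just an illustration):

```cpp
// Device code is compiled with `nvcc --default-stream per-thread`. Host-only
// translation units compiled with gcc/clang can instead define this macro
// before including any CUDA header:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1
#include <cuda_runtime.h>

#include <cstdio>

int main() {
  int* d = nullptr;
  cudaMalloc(&d, sizeof(int));

  // With PTDS, default-stream work issued by different host threads targets
  // per-thread streams and no longer serializes on a single legacy stream.
  cudaMemsetAsync(d, 0, sizeof(int));
  cudaStreamSynchronize(cudaStreamPerThread);

  cudaFree(d);
  std::printf("done\n");
  return 0;
}
```

As the linked dask-cuda comment notes, the open question is whether this plays well with the RMM pool allocator.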

Additional context

PyTorch has something similar.
Issue: pytorch/pytorch#9646
PR: pytorch/pytorch#9938

@rongou rongou added the ? - Needs Triage and feature request labels Apr 21, 2020
@jrhemstad
Contributor

When we've discussed this in the past, it seemed like just using per thread default stream would be sufficient for Spark. Is that no longer the case?

@rongou
Contributor Author

rongou commented Apr 22, 2020

I think we can live with per thread default stream, but my understanding is RMM currently doesn't support that?

@jrhemstad
Contributor

> I think we can live with per thread default stream, but my understanding is RMM currently doesn't support that?

It does not, but that's a relatively short-term problem. I believe we'll have PTDS support much sooner than we'll have a stream pool + ability to use external streams in libcudf.

@harrism
Member

harrism commented Apr 22, 2020

PTDS is tricky and warty. I don't like the way it creates multiple indistinguishable default streams. I would rather not support it if we can avoid it.

Adding a stream pool to RMM is relatively simple, and can solve some of the ugly corner cases in RMM (e.g. if we require that streams used with RMM device_memory_resources have to be owned by the stream pool, we can avoid use-after-free bugs with streams in RMM pool allocators, which is difficult with external streams).

I think we should avoid conflating the tasks of adding a stream pool to RMM and adding external streams support to libcudf -- I think they are orthogonal.

@jrhemstad
Contributor

> I think we should avoid conflating the tasks of adding a stream pool to RMM and adding external streams support to libcudf -- I think they are orthogonal.

What's a user going to do with a stream pool if libcudf doesn't support streams?

@rongou
Contributor Author

rongou commented Apr 22, 2020

Actually I think I got PTDS working, at least on the couple of Spark jobs I tested. Let me send out some PRs and we can discuss there as well.

@jakirkham
Member

> PTDS is tricky and warty. I don't like the way it creates multiple indistinguishable default streams. I would rather not support it if we can avoid it.

I think on the Python side we were counting on PTDS being supported. Didn't realize this would be problematic. What should our plan for streams support then be?

cc @kkraus14

@harrism
Member

harrism commented Apr 23, 2020

I was going to implement PTDS support in the new memory resources until I heard that it would not move the needle on current benchmarks...

@jakirkham
Member

Yep, that's fair. I had interpreted that to mean it was deprioritized but still planned. The question I'm wondering about here is whether it is still planned, and if not, what people should do instead?

@harrism
Member

harrism commented Apr 23, 2020

It's planned but has no priority currently, since Dask can't benefit.

> what people should do instead?

Which people? My suggestion is to keep calm and carry on. :)

@jakirkham
Member

> It's planned but has no priority currently, since Dask can't benefit.

That's all I needed to know :) Will carry on ;)

@kkraus14
Contributor

> It's planned but has no priority currently, since Dask can't benefit.

To be clear, Dask can easily benefit just by starting multiple threads and/or processes per GPU. This just sometimes requires reworking problems to be consumed in smaller chunks than before, and it introduces new memory management challenges.

@harrism
Member

harrism commented Apr 23, 2020

But the bottleneck is the Dask scheduler. Since this is not an easy fix, I am hesitant to spend time on something that will not produce a visible performance benefit.

@kkraus14
Contributor

> But the bottleneck is the Dask scheduler. Since this is not an easy fix, I am hesitant to spend time on something that will not produce a visible performance benefit.

Yes, I agree with this statement. I just wanted to be clear that if we had problems with much heavier compute that aren't task-graph heavy, then we could potentially use this. ETL problems typically don't follow this pattern though 😄.

@rongou
Contributor Author

rongou commented Jun 11, 2020

Since we've decided to use per-thread default streams, closing this for now. A stream pool might still be useful, but at least for Spark we don't need it now.

@rongou rongou closed this as completed Jun 11, 2020