[FEA] CUDA stream pool #352
Comments
When we've discussed this in the past, it seemed like just using the per-thread default stream would be sufficient for Spark. Is that no longer the case?
I think we can live with the per-thread default stream, but my understanding is that RMM currently doesn't support it?
It does not, but that's a relatively short-term problem. I believe we'll have PTDS support much sooner than we'll have a stream pool plus the ability to use external streams in libcudf.
PTDS is tricky and warty. I don't like the way it creates multiple indistinguishable default streams, and I would rather not support it if we can avoid it. Adding a stream pool to RMM is relatively simple, and it can solve some of the ugly corner cases in RMM (e.g., if we require that streams used with RMM device_memory_resources be owned by the stream pool, we can avoid use-after-free bugs with streams in RMM pool allocators, which is difficult with external streams). I think we should avoid conflating the tasks of adding a stream pool to RMM and adding external-stream support to libcudf -- they are orthogonal.
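A rough sketch of the ownership idea above (the types and names here are hypothetical, not an existing RMM interface):

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch: if memory resources only accept a pool-owned stream
// handle rather than a raw cudaStream_t, a stream that the caller created
// and later destroyed can never reach a pool allocator's internal state.
class stream_pool;  // owns every cudaStream_t for the pool's whole lifetime

class pooled_stream {
 public:
  cudaStream_t value() const noexcept { return stream_; }

 private:
  friend class stream_pool;  // only the pool can mint a handle
  explicit pooled_stream(cudaStream_t s) noexcept : stream_{s} {}
  cudaStream_t stream_;
};

// A device_memory_resource-style interface could then require the handle:
//   void* allocate(std::size_t bytes, pooled_stream stream);
```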
What's a user going to do with a stream pool if libcudf doesn't support streams?
Actually, I think I got PTDS working, at least on the couple of Spark jobs I tested. Let me send out some PRs and we can discuss there as well.
I think on the Python side we were counting on PTDS being supported; didn't realize this would be problematic. What should our plan for streams support be, then? cc @kkraus14
I was going to implement PTDS support in the new memory resources until I heard that it would not move the needle on current benchmarks...
Yep, that's fair. I had interpreted that to mean it was down-prioritized but still planned. The question I'm wondering about here is whether it is still planned, and if not, what people should do instead?
It's planned but has no priority currently, since Dask can't benefit.
Which people? My suggestion is to keep calm and carry on. :)
That's all I needed to know :) Will carry on ;)
To be clear, Dask can easily benefit just by starting multiple threads and/or processes per GPU. This just sometimes requires reworking problems to be consumed in smaller chunks than before, and it introduces new memory management challenges.
But the bottleneck is the Dask scheduler. Since this is not an easy fix, I am hesitant to spend time on something that will not produce a visible performance benefit.
Yes, I agree with this statement. I just wanted to be clear that if we had problems with much heavier compute that aren't task-graph heavy, then we could potentially use this. ETL problems typically don't follow this pattern though 😄.
Since we've decided to use per-thread default stream, closing this for now. A stream pool might still be useful, but at least for Spark we don't need it now. |
Is your feature request related to a problem? Please describe.
Spark runs one executor process per GPU. Each executor concurrently runs multiple tasks, each in its own CPU thread. A task is a natural unit of work. We'd like to have each task use its own asynchronous CUDA stream so that there is no unnecessary synchronization between tasks, and data copying and compute can overlap between tasks.
However, tasks can be very fine grained and short lived, so creating and destroying a stream in each task may be too expensive. Because of the dynamic nature of task execution, it's not feasible to create all the needed streams upfront and destroy them after the work is done.
It's desirable for RMM to provide a stream pool that can be shared between different tasks/threads.
Describe the solution you'd like
A stream pool that can be initialized to a fixed size. By default this size can probably be set to the maximum number of compute and copy engine concurrent connections (work queues), i.e. 32.
The user can request a stream from the pool, which is assigned round-robin.
There is no need to return a stream to the pool.
When the pool is destroyed, all streams are destroyed.
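A minimal sketch of what such a pool could look like (the class and its interface are illustrative, not an existing RMM API):

```cpp
#include <cuda_runtime.h>

#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative sketch only -- not an existing RMM API. Streams are created
// once at construction, handed out round-robin, never returned by callers,
// and destroyed only when the pool itself is destroyed.
class stream_pool {
 public:
  // Default size matches the maximum number of concurrent hardware work
  // queues mentioned above (32).
  explicit stream_pool(std::size_t size = 32) : streams_(size) {
    for (auto& s : streams_) {
      cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
  }

  ~stream_pool() {
    for (auto s : streams_) { cudaStreamDestroy(s); }
  }

  stream_pool(stream_pool const&) = delete;
  stream_pool& operator=(stream_pool const&) = delete;

  // Round-robin assignment; thread-safe via an atomic counter.
  cudaStream_t get_stream() noexcept {
    return streams_[next_.fetch_add(1, std::memory_order_relaxed) % streams_.size()];
  }

 private:
  std::vector<cudaStream_t> streams_;
  std::atomic<std::size_t> next_{0};
};
```

Each task thread would simply call get_stream() when it starts and use the returned stream for its copies and kernels; because the streams are pool-owned, there is nothing to release when the task finishes.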
Describe alternatives you've considered
Another option is to enable --default-stream per-thread with nvcc, or #define CUDA_API_PER_THREAD_DEFAULT_STREAM with gcc. But it is said this is not compatible with the RMM pool (rapidsai/dask-cuda#96 (comment)).
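For reference, enabling PTDS looks roughly like this; the nvcc flag and the macro come from the CUDA documentation, and the macro must appear before any CUDA header is included:

```cpp
// Enable per-thread default streams without the nvcc flag
// (equivalent to compiling with: nvcc --default-stream per-thread).
// The macro must be defined before the first CUDA header is included.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

// With PTDS enabled, work each host thread submits to the default stream
// (stream 0) runs on that thread's own stream, so launches from different
// threads can overlap without explicit stream management.
```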
Additional context
PyTorch has something similar.
Issue: pytorch/pytorch#9646
PR: pytorch/pytorch#9938