Improve pack/unpack performance if CUDA/ROCm enabled #10364
Conversation
Can one of the admins verify this patch?
The reason for this PR is that #9906 (comment) piqued my interest: I was wondering where the runtime penalty is, and the pack/unpack routines are the main place, though there are a few other places. I tested with osu_alltoall on 64 cores with shared memory, where this brings a small improvement (about halfway there for small messages between with and without CUDA, without actually involving a GPU); there may be better ways to test this! I am aware of #10069, which is mostly orthogonal to this patch. However, it uses the term "accelerator", whereas in this patch I used "gpu" to be consistent with the MEMCPY_GPU notation, which applies to both ROCm and CUDA. Of course I'm happy to change "gpu" to "accelerator" if that's preferred.
Instead of replacing individual memcpy()s with function pointer calls, use the existing checksum-related logic to compile two versions of every pack/unpack function, one without GPU and one with GPU support (using -DOPAL_DATATYPE_PACK_UNPACK_GPU). Signed-off-by: Bart Oldeman <[email protected]>
Force-pushed from 2e22c3c to 7d87712
ok to test
I am curious how much improvement you see with this approach?
@bosilca, sorry for the late reply; here is what osu_alltoall reports. Open MPI compiled without CUDA:
unpatched Open MPI compiled with CUDA:
patched Open MPI compiled with CUDA support:
If we look at message sizes 1, 2, 4, and 8: there is some noise in there, but the optimized version consistently falls between the non-CUDA and CUDA-enabled baselines. It's mainly about the overhead of calling
Not sure whether I am reading the data correctly, but it looks like for large message sizes this approach introduces some overhead:
262144  17247.35  16340.17  16789.86
There's too much noise there in the larger messages.
> large messages
The numbers are not consistent here: the CUDA path for large messages is sometimes slower than the non-CUDA path (without the patch!), which theoretically shouldn't happen. For bandwidth, the changes fall within the expected noise, but for latency there is a clear, consistent difference.
In the end #10069 made this pretty much obsolete, so I'll close this.