coll/accelerator add support for more functions #13006
base: main
Conversation
Force-pushed from ae0525b to 94a56ab
add support for MPI_Reduce_scatter
Signed-off-by: Edgar Gabriel <[email protected]>
Force-pushed from 8658265 to 5e0bcd7
/* Using sbufsize here on purpose to ensure symmetric decision for handling of GPU vs
   CPU buffers. The two buffer sizes are expected to be the same for pre-defined datatypes,
   but could vary due to layout issues/gaps for derived datatypes */
Maybe a better approach here is to multiply the size of one element with the number of elements? The span could be much larger if there are big gaps that we wouldn't transfer, skewing us towards using the device memory.
Yes, but that is explicitly meant to handle derived datatypes that contain gaps. Note that this is also the approach used by all other (already existing) functions in the coll/accelerator component.
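For illustration (not part of this PR; the vector layout and program are made up for the example), a minimal MPI-level sketch of how a derived datatype with gaps makes the extent/span larger than the number of bytes actually transferred:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* a vector type that uses every other int: 4 blocks of 1 int with stride 2 */
    MPI_Datatype vec;
    MPI_Type_vector(4, 1, 2, MPI_INT, &vec);
    MPI_Type_commit(&vec);

    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(vec, &size);              /* bytes actually transferred: typically 16 */
    MPI_Type_get_extent(vec, &lb, &extent); /* layout including gaps: typically 28      */

    printf("data bytes = %d, extent (incl. gaps) = %ld\n", size, (long)extent);

    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}
```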
/* Using sbufsize here on purpose to ensure symmetric decision for handling of GPU vs
   CPU buffers. The two buffer sizes are expected to be the same for pre-defined datatypes,
   but could vary due to layout issues/gaps for derived datatypes */
if ((rc > 0) && ((sbufsize * comm_size) <= (size_t)mca_coll_accelerator_alltoall_thresh)) {
Similar problem here: at higher process counts we would tend to use the device memory for many small send/recv operations. I'd expect the overhead of small transfers to become more pronounced at higher process counts. Why do we need to factor in the process count at all?
I have honestly been going back and forth on whether to use the parameter as a per-process parameter or as a parameter that limits the amount of temporary memory allocated on the CPU side. I am happy to use this as a per-process parameter (i.e. without the communicator size playing a role). My original (maybe incorrect) line of thinking had to do with the v-versions of these operations (which this commit doesn't tackle yet).
I have changed the parameter to be a per-process threshold (vs. the overall buffer size).
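For clarity, a minimal sketch of the two variants discussed above (made-up names, not the component's actual code; sbufsize is one process' contribution in bytes, thresh the MCA-controlled threshold):

```c
#include <stddef.h>

/* original variant: bound the total temporary CPU buffer,
 * so the decision depends on the communicator size */
int use_cpu_staging_total(size_t sbufsize, int comm_size, size_t thresh)
{
    return (sbufsize * (size_t)comm_size) <= thresh;
}

/* revised variant: per-process threshold, independent of the communicator size */
int use_cpu_staging_per_proc(size_t sbufsize, size_t thresh)
{
    return sbufsize <= thresh;
}
```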
mca_coll_accelerator_allgather_thresh = 65536;
(void) mca_base_component_var_register(&mca_coll_accelerator_component.super.collm_version,
                                       "allgather_thresh",
                                       "max. overall msg length for which to copy accelerator buffer to CPU for allgather operation",
I'm not sure that `overall msg length` communicates well that the threshold is compared to `send_size*comm_size`. See my comment above for my concerns about using the process count.
I removed the word 'overall' along with the adjustment of the meaning of the parameter. Thank you for the review!
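(Usage note, based on Open MPI's usual framework_component_variable naming rather than anything stated in this PR: the registered threshold should then be settable per process at run time, e.g. `mpirun --mca coll_accelerator_allgather_thresh 32768 ./app`.)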
add support for bcast, allgather and alltoall for device buffers using a temporary buffer on the CPU. The maximum msg length for each operation for which to use this approach can be controlled through an MCA parameter.
Signed-off-by: Edgar Gabriel <[email protected]>
owner file update to indicate that the component is active and ownership is shared between NVidia and AMD.
Signed-off-by: Edgar Gabriel <[email protected]>
Force-pushed from 5e0bcd7 to 167b8b2
Add support for more functions to the coll/accelerator component. Specifically:
- MPI_Reduce_scatter
- MPI_Bcast
- MPI_Allgather
- MPI_Alltoall
To motivate the work on the non-reduction operations: the main benefit is for short messages, since communication from GPU buffers typically has higher latency than communication from CPU buffers.
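As a rough illustration of the copy-to-host idea for a device buffer (a simplified sketch, not the component's code: it calls MPI_Bcast instead of the underlying collective module, and the two copy helpers are placeholders for the accelerator copy API):

```c
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* stand-ins for device<->host copies; a real implementation would use the
 * accelerator's copy API (e.g. hipMemcpy/cudaMemcpy) instead of memcpy */
static void copy_device_to_host(void *host, const void *dev, size_t n) { memcpy(host, dev, n); }
static void copy_host_to_device(void *dev, const void *host, size_t n) { memcpy(dev, host, n); }

int bcast_via_host_staging(void *dev_buf, size_t bytes, int count,
                           MPI_Datatype dtype, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    void *host_buf = malloc(bytes);                      /* temporary CPU buffer */
    if (NULL == host_buf) return MPI_ERR_NO_MEM;

    if (rank == root) {
        copy_device_to_host(host_buf, dev_buf, bytes);   /* root stages its data */
    }
    int rc = MPI_Bcast(host_buf, count, dtype, root, comm); /* bcast on host buffers */
    if (rank != root && MPI_SUCCESS == rc) {
        copy_host_to_device(dev_buf, host_buf, bytes);   /* unpack back to device */
    }
    free(host_buf);
    return rc;
}
```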
Here is some data to support the PR, gathered on an MI250X node with 16 GPUs:
(Benchmark charts for MPI_Bcast, MPI_Allgather, and MPI_Alltoall not reproduced here.)
This data demonstrates that there are benefits to using the copy-to-host approach, especially for short messages. Results in other settings might vary, but the user has the option to control the desired behavior using the MCA parameter (and later, using #12991, by explicitly specifying which collective operations this component should be registered for).