coll/accelerator add support for more functions #13006

Open
edgargabriel wants to merge 3 commits into main from topic/coll-accelerator-new-funcs

Conversation

edgargabriel
Member

Add support for more functions to the coll/accelerator component. Specifically:

  • add MPI_Reduce_scatter, which was the only reduction operation still missing from the component
  • introduce support for Bcast, Allgather, and Alltoall operations. The maximum message size for which to use the copy-through-CPU mechanism can be controlled by an MCA parameter for each function separately.

The motivation for extending this to non-reduction operations is that the main benefit shows up for short messages, since communication from GPU buffers typically has higher latency than communication from CPU buffers.

Here is some data to support the PR, gathered on an MI250X node with 16 GPUs:

MPI_Bcast:

# OSU MPI-ROCM Broadcast Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16         16         32       32
# copy to CPU?            no        yes         no      yes  
1                      12.24       2.84      14.86     3.44
2                      17.19       4.18	     21.37     4.64
4                      17.22       4.13	     21.13     4.65
8                      17.19       4.09	     21.08     4.63
16                     17.23       4.11	     21.11     4.63
32                     17.47       4.18	     28.43     4.71
64                     20.02       4.15	     28.03     4.74
128                    27.54       8.37	     45.03     8.05
256                    28.30      12.67	     44.73    13.38
512                    27.65      32.51	     44.65    28.39
1024                   27.86      57.93	     44.92    64.72
2048                   28.06      27.66	    207.97    34.89
4096                   54.40      32.14	    429.97    36.70
8192                   54.64      45.65	    430.32    57.30
16384                  55.76      56.71	    433.16    67.10
32768                  56.96      84.60	    459.17    93.21
65536                  59.84     141.01	    440.84   162.06
131072                 65.34     256.07	    466.58   279.53
262144                 76.69     474.92	    553.10   524.82
524288                 97.60     803.38	    674.06   955.30
1048576               271.92    2030.49	   1069.67  1844.79

MPI_Allgather:

# OSU MPI-ROCM Allgather Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16        16        32         32
# Copy to CPU?            no       yes        no        yes
1                      44.70      5.86     54.62       6.85
2                      45.41      8.68	   54.67       9.67
4                      45.88      8.76	   54.09       9.71
8                      44.98      8.84	   66.06      10.12
16                     59.17      9.18	   83.69      10.70
32                     72.29      9.69	   99.86      11.57
64                     83.31     10.35	  120.70      12.65
128                    95.16     16.67	  148.10      19.50
256                    95.92     28.52	  148.03      33.09
512                    95.41     72.15	  147.25      86.03
1024                  111.54    144.25	  393.00     146.43
2048                  112.29     84.30	  761.04     105.84
4096                  159.91    111.35	  794.44     132.48
8192                  163.18    164.20	  801.66     188.61
16384                 169.69    277.55	  806.56     326.65
32768                 178.70    544.21	  810.47     622.84
65536                 195.38    891.83	 1661.78    1925.92
131072                226.69   1527.13	 1643.53    9896.87
262144                298.12   6354.89	 1667.39   19147.58
524288                450.23  12872.46	 1910.64   40723.25
1048576               764.79  26134.81	 3752.21   81590.10

MPI_Alltoall:

# OSU MPI-ROCM All-to-All Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16        16        32        32
# copy to CPU?            no       yes        no       yes
1                     102.40     62.82    109.26     39.49
2                     119.86     65.76	  131.27     41.89
4                      70.86     27.01	  135.16     41.77
8                      75.00     26.99	  147.80     41.83
16                     88.56     27.45	  175.73     42.33
32                    112.92     27.56	  243.95     42.88
64                    184.78     27.92	  409.48     44.33
128                   363.13     34.33	  611.12     53.13
256                   484.21     46.77	  948.13     70.85
512                   205.33    135.85	  489.44    235.73
1024                  184.16    196.80	  482.84    295.78
2048                  182.26    154.94	  487.60    296.06
4096                  187.28    163.12	  538.29    311.60
8192                  189.87    184.98	  539.06    346.44
16384                 190.54    232.64	  544.14    431.32
32768                 191.31    336.01	  549.42    533.02
65536                 197.93    492.32	  557.00   1050.14
131072                199.14    889.92	  583.84  11395.43
262144                244.68   7316.06	  843.04  21582.68
524288                389.49  13314.65	 1481.85  40945.10
1048576               697.91  24498.35	 2753.87  80627.56

This data demonstrates that the copy-to-host approach is beneficial, especially for short messages. Results in other settings might vary, but the user has the option to control the desired behavior using the MCA parameters (and, later, using #12991 to explicitly specify which collective operations this component should be registered for).
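
As a usage sketch, the per-function thresholds are ordinary MCA parameters and could be set on the command line along the following lines. The allgather and alltoall parameter names are derived from the code quoted in the review below; the bcast name, the values, and the exact benchmark invocation are assumptions for illustration:

mpirun -np 16 \
    --mca coll_accelerator_bcast_thresh 65536 \
    --mca coll_accelerator_allgather_thresh 65536 \
    --mca coll_accelerator_alltoall_thresh 65536 \
    ./osu_bcast -d rocm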

@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from ae0525b to 94a56ab on December 30, 2024, 17:06
add support for MPI_Reduce_scatter

Signed-off-by: Edgar Gabriel <[email protected]>
@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 8658265 to 5e0bcd7 on December 30, 2024, 17:22
Comment on lines +66 to +68
/* Using sbufsize here on purpose to ensure symmetric decision for handling of GPU vs
CPU buffers. The two buffer sizes are expected to be the same for pre-defined datatypes,
but could vary due to layout issues/gaps for derived datatypes */
Contributor

Maybe a better approach here would be to multiply the size of one element by the number of elements? The span could be much larger if there are big gaps that we wouldn't transfer, skewing us towards using the device memory.
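
For illustration, a minimal sketch of the two size estimates being contrasted here, using Open MPI's standard datatype helpers; the function bodies and variable names below are illustrative and not taken from this PR:

#include "ompi/datatype/ompi_datatype.h"
#include "opal/datatype/opal_datatype.h"

/* Bytes that are actually transferred: element size times element count,
   independent of any gaps in the datatype layout. */
static size_t transfer_bytes(ompi_datatype_t *sdtype, size_t scount)
{
    size_t elem_size;
    ompi_datatype_type_size(sdtype, &elem_size);
    return elem_size * scount;
}

/* Memory span touched by the buffer: for a derived datatype with large
   gaps this can be much bigger than transfer_bytes(), which would bias a
   span-based threshold check toward the device-memory path. */
static size_t span_bytes(ompi_datatype_t *sdtype, size_t scount)
{
    ptrdiff_t gap;
    return (size_t) opal_datatype_span(&sdtype->super, scount, &gap);
}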

Member Author

Yes, but that is explicitly meant to handle derived datatypes that contain gaps. Note that this is also the approach used by all other (already existing) functions in the coll/accelerator component.

/* Using sbufsize here on purpose to ensure symmetric decision for handling of GPU vs
CPU buffers. The two buffer sizes are expected to be the same for pre-defined datatypes,
but could vary due to layout issues/gaps for derived datatypes */
if ((rc > 0) && ((sbufsize * comm_size) <= (size_t)mca_coll_accelerator_alltoall_thresh)) {
Contributor

Similar problem here: at higher process counts we would tend to use the device memory for many small send/recv. I'd expect that the overhead for small transfers becomes more pronounced for small messages at higher process counts. Why do we need to factor in the process count at all?

Member Author

I have honestly been going back and forth on whether to use the parameter as a per-process threshold or as a limit on the total amount of temporary memory allocated on the CPU side. I am happy to use it as a per-process parameter (i.e., without the communicator size playing a role). My original (maybe incorrect) line of thinking had to do with the v-versions of these operations (which this commit doesn't tackle yet).

Member Author

I have changed the parameter to be a per-process threshold (vs. the overall buffer size).
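
For reference, a sketch of what the adjusted check could look like, based on the alltoall condition quoted above; the actual change in the PR may differ:

/* before: threshold compared against the overall staging-buffer size */
if ((rc > 0) && ((sbufsize * comm_size) <= (size_t)mca_coll_accelerator_alltoall_thresh)) {

/* after (sketch): threshold compared against the per-process buffer size */
if ((rc > 0) && (sbufsize <= (size_t)mca_coll_accelerator_alltoall_thresh)) {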

mca_coll_accelerator_allgather_thresh = 65536;
(void) mca_base_component_var_register(&mca_coll_accelerator_component.super.collm_version,
"allgather_thresh",
"max. overall msg length for which to copy accelerator buffer to CPU for allgather operation",
Contributor

I'm not sure that "overall msg length" communicates well that the threshold is compared to send_size * comm_size. See my comment above for my concerns about using the process count.

Member Author

I removed the word 'overall' along with the adjustment of the parameter's meaning. Thank you for the review!

edgargabriel and others added 2 commits January 2, 2025 15:02
add support for bcast, allgather and alltoall for device buffers using
a temporary buffer on the CPU. The maximum msg length for each operation
for which to use this approach can be controlled through an mca
parameter.

Signed-off-by: Edgar Gabriel <[email protected]>
owner file update to indicate that the component is active and ownership
is shared between NVidia and AMD.

Signed-off-by: Edgar Gabriel <[email protected]>
@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 5e0bcd7 to 167b8b2 on January 2, 2025, 15:03