@nariaki3551 currently, reduce-scatter is supported only when SAT trees are enabled.
Can you please run with the following flags?
-x UCC_TL_SHARP_REG_THRESH=0 -x SHARP_COLL_SAT_THRESHOLD=4 -x SHARP_COLL_ENABLE_SAT=1
I am experiencing a hang when executing MPI_Reduce_scatter_block using UCC with NVIDIA SHARP. The application does not proceed past the first MPI_Reduce_scatter_block call. Below are the details of the issue, system configuration, and logs.
Reproduction Steps
Code:
MPI_Reduce_scatter_block execution code, implemented in C++
Build & Execution Commands:
mpirun -n 1 --host snail03:1 \
  -x UCC_TL_SHARP_DEVICES=mlx5_2 \
  -x LD_LIBRARY_PATH -x PATH \
  --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
  -x UCC_MIN_TEAM_SIZE=2 \
  -x UCC_CL_BASIC_TLS=sharp,ucp \
  -x UCC_TL_SHARP_TUNE=reduce_scatter:inf \
  -x SHARP_COLL_LOG_LEVEL=5 \
  ./test_reducescatter \
  : -n 1 --host snail01:1 \
  -x UCC_TL_SHARP_DEVICES=mlx5_1 \
  -x LD_LIBRARY_PATH -x PATH \
  --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
  -x UCC_MIN_TEAM_SIZE=2 \
  -x UCC_CL_BASIC_TLS=sharp,ucp \
  -x UCC_TL_SHARP_TUNE=reduce_scatter:inf \
  ./test_reducescatter
Expected Behavior:
The ReduceScatter operation should complete successfully across all ranks.
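For reference, the result a successful run should produce can be modelled without MPI. The sketch below is not the reproducer from this report; it is a single-process C++ model of MPI_Reduce_scatter_block with MPI_SUM, and the function name reduce_scatter_block_sum and the int element type are assumptions chosen for illustration. Each rank contributes nranks * recvcount elements, and rank r receives the element-wise sum of block r from every rank.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-process model of MPI_Reduce_scatter_block with MPI_SUM.
// Hypothetical helper for illustration; not part of MPI, UCC, or SHARP.
// send[r] is the send buffer of rank r and must hold nranks * recvcount elements.
// The returned recv[r] is the block that rank r would receive.
std::vector<std::vector<int>> reduce_scatter_block_sum(
    const std::vector<std::vector<int>>& send, std::size_t recvcount) {
    const std::size_t nranks = send.size();
    std::vector<std::vector<int>> recv(nranks, std::vector<int>(recvcount, 0));
    for (std::size_t src = 0; src < nranks; ++src) {
        // Every rank must supply one block of recvcount elements per rank.
        assert(send[src].size() == nranks * recvcount);
        for (std::size_t dst = 0; dst < nranks; ++dst) {
            for (std::size_t i = 0; i < recvcount; ++i) {
                // Block dst of rank src is reduced into the result of rank dst.
                recv[dst][i] += send[src][dst * recvcount + i];
            }
        }
    }
    return recv;
}
```

With two ranks, as in this report, send buffers {1, 2, 3, 4} and {10, 20, 30, 40}, and a recvcount of 2, rank 0 would receive {11, 22} and rank 1 would receive {33, 44}. Comparing the output of the real collective against such a model can help confirm correctness once the hang is resolved.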
Actual Behavior:
The application hangs after printing the following line. No further output is observed.
[snail03:0:207860 - allreduce.c:587][2025-02-20 07:00:53] DEBUG STREAM Reduce: len:2048
Logs
Here are the relevant logs at the time of the hang:
System Information
revision bc996dd
)hpcx-v2.21-gcc-inbox-ubuntu20.04-cuda12-x86_64/sharp
Thank you for your support and insights in advance!