
[Bug] UCC ReduceScatter with NVIDIA SHARP Hangs #1076

Open
nariaki3551 opened this issue Feb 20, 2025 · 3 comments

nariaki3551 commented Feb 20, 2025

I am experiencing a hang when executing MPI_Reduce_scatter_block using UCC with NVIDIA SHARP. The application does not proceed past the first MPI_Reduce_scatter_block call. Below are the details of the issue, system configuration, and logs.

Reproduction Steps

  1. Code:

    Minimal C reproducer calling MPI_Reduce_scatter_block (test_reducescatter.c):
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    int do_reduce_scatter(int rank, int size) {
        int length = 1024;
        int* send_buff = (int*)malloc(length * sizeof(int));
        for (int i = 0; i < length; ++i) { send_buff[i] = rank; }
        int recv_count = length / size;  /* assumes size divides length evenly */
        int* recv_buff = (int*)malloc(recv_count * sizeof(int));
    
        printf("Rank %d: Starting ReduceScatter\n", rank);
        for (int i = 0; i < 3; ++i) {
            MPI_Reduce_scatter_block(send_buff, recv_buff, recv_count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        }
        printf("Rank %d: Finished ReduceScatter\n", rank);
    
        free(send_buff);
        free(recv_buff);
    
        return 0;
    }
    
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Barrier(MPI_COMM_WORLD);
    
        do_reduce_scatter(rank, size);
    
        MPI_Finalize();
        return 0;
    }
  2. Build & Execution Commands:

    mpicc -o test_reducescatter test_reducescatter.c -lmpi
    mpirun -n 1 --host snail03:1 \
       -x UCC_TL_SHARP_DEVICES=mlx5_2 \
       -x LD_LIBRARY_PATH -x PATH \
       --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
       -x UCC_MIN_TEAM_SIZE=2 \
       -x UCC_CL_BASIC_TLS=sharp,ucp \
       -x UCC_TL_SHARP_TUNE=reduce_scatter:inf \
       -x SHARP_COLL_LOG_LEVEL=5 \
       ./test_reducescatter \
    : -n 1 --host snail01:1 \
       -x UCC_TL_SHARP_DEVICES=mlx5_1 \
       -x LD_LIBRARY_PATH -x PATH \
       --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
       -x UCC_MIN_TEAM_SIZE=2 \
       -x UCC_CL_BASIC_TLS=sharp,ucp \
       -x UCC_TL_SHARP_TUNE=reduce_scatter:inf \
       ./test_reducescatter
  3. Expected Behavior:
    The ReduceScatter operation should complete successfully across all ranks.

  4. Actual Behavior:
    The application hangs after printing [snail03:0:207860 - allreduce.c:587][2025-02-20 07:00:53] DEBUG STREAM Reduce: len:2048. No further output is observed.


Logs

Here are the relevant logs at the time of the hang:

[1740034853.221691] [snail03:207860:0]   tl_sharp_team.c:268  TL_SHARP DEBUG init coll task 0x55e2098279c0
[snail03:0:207860 - utils/mpool.c:231][2025-02-20 07:00:53] DEBUG mpool sharp_coll_handles: allocated chunk 0x55e209827be0 of 49304 bytes with 128 elements
[snail03:0:207860 - barrier.c:75][2025-02-20 07:00:53] DEBUG SHArP Barrier request posted buf_desc:0x0x7f9458d2aa20 group_id:0x0 seqnum:1 
[snail03:0:207860 - dev.c:1070][2025-02-20 07:00:53] DEBUG SEND completion buf_desc:0x7f9458d2aa20
[snail03:0:207860 - dev.c:1094][2025-02-20 07:00:53] DEBUG RECV:RC completion buf_desc:0x7f9458d2a3a0 byte_len:12
[snail03:0:207860 - coll.c:376][2025-02-20 07:00:53] DEBUG SHARP Barrier completed. status:0 seqnum:1
[1740034853.221778] [snail03:207860:0]   tl_sharp_team.c:220  TL_SHARP DEBUG finalizing coll task 0x55e2098279c0
rank: 0, start reducescatter
[1740034853.221810] [snail03:207860:0]   tl_sharp_team.c:268  TL_SHARP DEBUG init coll task 0x55e2098279c0
[snail03:0:207860 - context.c:1358][2025-02-20 07:00:53] DEBUG External memory register, addr:0x55e209833c80 len:4096 device:mlx5_2
[snail03:0:207860 - context.c:1358][2025-02-20 07:00:53] DEBUG External memory register, addr:0x55e209834c90 len:2048 device:mlx5_2
[snail03:0:207860 - reduce_scatter.c:162][2025-02-20 07:00:53] DEBUG [root:0]Reduce-scatter offset:0 send_size:4096 recv_size:2048, root_start:0 root_end:1 is_reduce:1 num_reqs:2 req_handle=0x55e209833a80
[snail03:0:207860 - reduce_scatter.c:199][2025-02-20 07:00:53] TRACE [root:0] Reduce-scatter: first_root_recv_size:2048 send_offset:0 pos:0 frag:0 root:0
[snail03:0:207860 - allreduce.c:587][2025-02-20 07:00:53] DEBUG STREAM Reduce: len:2048 

System Information

  • OS: Ubuntu 20.04.6 LTS
  • UCC Version: 1.4.0 (revision bc996dd)
  • Open MPI Version: 5.1.0a1
  • UCX Version: 1.19.0
  • SHARP: hpcx-v2.21-gcc-inbox-ubuntu20.04-cuda12-x86_64/sharp
  • Configuration Flags:
    --with-ucx=$UCX_DIR/build
    --with-tls=all
    --with-sharp=$SHARP_DIR
    --enable-debug
    

Thank you for your support and insights in advance!

@nariaki3551
Author

From the debugger’s call stack, the execution appears to be halted at sharp_coll_get_sharp_group.

(Screenshot: debugger call stack showing the hang in sharp_coll_get_sharp_group)
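For anyone reproducing this, the backtrace can be captured by attaching gdb to the hung rank (a sketch; the PID 207860 is taken from the log prefix above and will differ per run):

```shell
# Attach gdb to the hung MPI rank non-interactively and dump
# backtraces for all threads, then detach.
gdb -p 207860 -batch -ex "thread apply all bt"
```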

@nariaki3551 nariaki3551 changed the title UCC ReduceScatter with NVIDIA SHARP Hangs [Bug] UCC ReduceScatter with NVIDIA SHARP Hangs Feb 20, 2025
@bureddy
Collaborator

bureddy commented Feb 21, 2025

@nariaki3551 reduce-scatter is currently supported only when SAT trees are enabled.
Could you please run with the following flags?
-x UCC_TL_SHARP_REG_THRESH=0 -x SHARP_COLL_SAT_THRESHOLD=4 -x SHARP_COLL_ENABLE_SAT=1
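Merged into the reproduction command from the issue, that would look roughly like the following (a sketch, untested here; it assumes the three extra variables need to be exported in both halves of the MPMD launch):

```shell
mpirun -n 1 --host snail03:1 \
   -x UCC_TL_SHARP_DEVICES=mlx5_2 \
   -x LD_LIBRARY_PATH -x PATH \
   --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
   -x UCC_MIN_TEAM_SIZE=2 \
   -x UCC_CL_BASIC_TLS=sharp,ucp \
   -x UCC_TL_SHARP_TUNE=reduce_scatter:inf \
   -x UCC_TL_SHARP_REG_THRESH=0 \
   -x SHARP_COLL_SAT_THRESHOLD=4 \
   -x SHARP_COLL_ENABLE_SAT=1 \
   ./test_reducescatter \
: -n 1 --host snail01:1 \
   -x UCC_TL_SHARP_DEVICES=mlx5_1 \
   -x LD_LIBRARY_PATH -x PATH \
   --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
   -x UCC_MIN_TEAM_SIZE=2 \
   -x UCC_CL_BASIC_TLS=sharp,ucp \
   -x UCC_TL_SHARP_TUNE=reduce_scatter:inf \
   -x UCC_TL_SHARP_REG_THRESH=0 \
   -x SHARP_COLL_SAT_THRESHOLD=4 \
   -x SHARP_COLL_ENABLE_SAT=1 \
   ./test_reducescatter
```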

@nariaki3551
Author

@bureddy

reduce-scatter is currently supported only when SAT trees are enabled.

Now I see why.
With SHARP_COLL_ENABLE_SAT=1, Reduce-Scatter now runs successfully.
Thank you for your support!
