Application hangs when using sm btl with OpenMPI v5.0.6 #12979

Open
mshanthagit opened this issue Dec 12, 2024 · 0 comments
Thank you for taking the time to submit an issue!

Background information

One of our MPI-based applications (tachyon) consistently hangs when the number of ranks is greater than 64 on a single node (a 2-socket AMD Genoa system with 192 cores). We see the hangs when the pml is ob1 and the btl is sm.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.0.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was installed from a git clone of the v5.0.x branch.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status:

42c93d64485a006a19c5c8622cbd8341e3e95392 3rd-party/openpmix (v5.0.5rc1)
35270a9230139a64d01e133f262b6ed0c40e1fbb 3rd-party/prrte (v3.0.8rc1)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running

  • Operating system/version: Rocky Linux 9.2
  • Computer hardware: AMD Genoa 2-socket system (AMD EPYC 9654 96-Core Processor)
  • Network type: IB

The application is run on a single node.


Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Here's the Open MPI configuration:

./configure --prefix=<> --with-ucx=<>1.17.0 --enable-mpi1-compatibility --without-hcoll --with-knem=<> --enable-mca-no-build=btl-uct --with-xpmem=<>/2.6.5 --with-hwloc=<>/2.4.1 --with-slurm CC=gcc CXX=g++ FC=gfortran --enable-debug

Here's how we can reproduce the problem:

$ wget http://jedi.ks.uiuc.edu/~johns/raytracer/files/0.98.7/tachyon-0.98.7.tar.gz
$ tar xfa tachyon-0.98.7.tar.gz
$ cd tachyon/unix
$ make linux-mpi-64
$ cd ../scenes
$ sed -i -e 's/Resolution 669 834/Resolution 13380 16680/g' dnadof.dat
$ mpirun -np 192 ../compile/linux-mpi-64/tachyon dnadof.dat

Most ranks are in finalize, but some ranks are stuck in the wait call. The following is a backtrace from one of the hung ranks (Open MPI was built in debug mode):

#0  0x0000155554b86b3d in mca_btl_sm_check_fboxes () at ../../../../opal/mca/btl/sm/btl_sm_fbox.h:233
#1  0x0000155554b88fa6 in mca_btl_sm_component_progress () at btl_sm_component.c:553
#2  0x0000155554ad91ca in opal_progress () at runtime/opal_progress.c:224
#3  0x0000155554f25fc0 in sync_wait_st (sync=0x7fffffff6270) at ../opal/mca/threads/wait_sync.h:104
#4  0x0000155554f26d16 in ompi_request_default_wait_all (count=261, requests=0x969dc0, statuses=0x98a710)
    at request/req_wait.c:274
#5  0x0000155554fc4bc0 in PMPI_Waitall (count=261, requests=0x969dc0, statuses=0x98a710) at waitall.c:78
#6  0x0000000000414914 in ?? ()
#7  0x000000000041613e in ?? ()
#8  0x00000000004026a1 in ?? ()
#9  0x0000155554c3feb0 in __libc_start_call_main () from /lib64/libc.so.6
#10 0x0000155554c3ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#11 0x0000000000402da5 in ?? ()
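To make the backtrace easier to read: each hung rank is spinning in sync_wait_st, repeatedly driving opal_progress (which polls the sm fast boxes in mca_btl_sm_check_fboxes) until all of its outstanding requests complete. A toy model of that busy-wait loop in Python (the names and data structures here are illustrative only, not Open MPI's actual API) shows why a single undelivered completion is enough to hang the rank forever:

```python
from collections import deque

def wait_all(requests, progress_queue, max_spins=100_000):
    """Toy model of a wait-all loop: spin on the progress engine
    until every request is marked complete. The real code has no
    spin bound; max_spins exists only so this model can detect a
    'hang' instead of looping forever."""
    spins = 0
    while not all(r["complete"] for r in requests):
        # One pass of the progress engine: drain delivered messages
        # and mark the matching requests complete.
        while progress_queue:
            idx = progress_queue.popleft()
            requests[idx]["complete"] = True
        spins += 1
        if spins > max_spins:
            raise TimeoutError("hang: a request never completed")
    return spins

# All completions delivered: the wait returns promptly.
reqs = [{"complete": False} for _ in range(3)]
wait_all(reqs, deque([0, 1, 2]))

# One completion never arrives: models the observed hang,
# where the rank polls the fast boxes indefinitely.
reqs = [{"complete": False} for _ in range(3)]
try:
    wait_all(reqs, deque([0, 2]))  # message for request 1 lost
except TimeoutError:
    pass
```

This is only a model of the symptom, not a claim about the root cause; the question is why, at >64 ranks with ob1/sm, some completion apparently never reaches the waiting rank.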

We tested the following runtime parameters:

$ mpirun -np 192 --mca pml ^ucx ../compile/linux-mpi-64/tachyon dnadof.dat — **hangs**
$ mpirun -np 192 --mca pml ^ucx --mca btl ^sm ../compile/linux-mpi-64/tachyon dnadof.dat — **runs to completion**
$ mpirun -np 192 ../compile/linux-mpi-64/tachyon dnadof.dat — **runs to completion**
