Thank you for taking the time to submit an issue!

Background information

One of our MPI-based applications (tachyon) consistently hangs when the number of ranks is greater than 64 on a single node (a 2-socket Genoa system with 192 cores). We see the hangs when the pml is ob1 and the btl is sm.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
5.0.6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI was installed from a git clone of the v5.0.x branch.
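For context, a build along these lines matches that description; the clone URL, configure flags, install prefix, and parallelism below are assumptions for illustration (the backtrace later in this report suggests a debug build, hence --enable-debug), not the exact commands we ran:

$ git clone --recurse-submodules -b v5.0.x https://github.com/open-mpi/ompi.git
$ cd ompi
$ ./autogen.pl
$ ./configure --prefix=$HOME/ompi-v5.0.x --enable-debug
$ make -j 32 install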
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status:
42c93d64485a006a19c5c8622cbd8341e3e95392 3rd-party/openpmix (v5.0.5rc1)
35270a9230139a64d01e133f262b6ed0c40e1fbb 3rd-party/prrte (v3.0.8rc1)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)
Please describe the system on which you are running

The application is run on a single node.

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Here's the OpenMPI configuration:

Here's how we can reproduce the problem:
$ wget http://jedi.ks.uiuc.edu/~johns/raytracer/files/0.98.7/tachyon-0.98.7.tar.gz
$ tar xfa tachyon-0.98.7.tar.gz
$ cd tachyon/unix
$ make linux-mpi-64
$ cd ../scenes
$ sed -i -e 's/Resolution 669 834/Resolution 13380 16680/g' dnadof.dat
$ mpirun -np 192 ../compile/linux-mpi-64/tachyon dnadof.dat
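The hang is only seen with the ob1 pml and the sm btl, so forcing that component selection explicitly makes the reproducer deterministic. The command below is a sketch rather than a verbatim transcript (the MCA selectors are the standard ones; everything else matches the command above):

$ mpirun -np 192 --mca pml ob1 --mca btl sm,self ../compile/linux-mpi-64/tachyon dnadof.dat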
Most ranks are in MPI_Finalize, but some ranks are stuck in a wait call. The following is a backtrace from one of the stuck ranks (Open MPI was built in debug mode):
#0 0x0000155554b86b3d in mca_btl_sm_check_fboxes () at ../../../../opal/mca/btl/sm/btl_sm_fbox.h:233
#1 0x0000155554b88fa6 in mca_btl_sm_component_progress () at btl_sm_component.c:553
#2 0x0000155554ad91ca in opal_progress () at runtime/opal_progress.c:224
#3 0x0000155554f25fc0 in sync_wait_st (sync=0x7fffffff6270) at ../opal/mca/threads/wait_sync.h:104
#4 0x0000155554f26d16 in ompi_request_default_wait_all (count=261, requests=0x969dc0, statuses=0x98a710)
at request/req_wait.c:274
#5 0x0000155554fc4bc0 in PMPI_Waitall (count=261, requests=0x969dc0, statuses=0x98a710) at waitall.c:78
#6 0x0000000000414914 in ?? ()
#7 0x000000000041613e in ?? ()
#8 0x00000000004026a1 in ?? ()
#9 0x0000155554c3feb0 in __libc_start_call_main () from /lib64/libc.so.6
#10 0x0000155554c3ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#11 0x0000000000402da5 in ?? ()
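Backtraces like the one above can be collected from every rank of the hung job with gdb; this one-liner is only a sketch (it assumes gdb is available on the node and that the process name matches tachyon):

$ for pid in $(pgrep tachyon); do echo "=== PID $pid ==="; gdb -q -batch -p "$pid" -ex 'thread apply all bt'; done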
Tested the following runtime parameters:
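(The specific parameters and values we tried are not listed here.) As a general illustration of how such parameters are inspected and set, MCA parameters for the components involved can be listed with ompi_info and overridden on the mpirun command line; the placeholder below is not one of the parameters we actually tested:

$ ompi_info --param btl sm --level 9
$ ompi_info --param pml ob1 --level 9
$ mpirun -np 192 --mca pml ob1 --mca btl sm,self --mca <parameter> <value> ../compile/linux-mpi-64/tachyon dnadof.dat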