Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v4.1.6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From an operating system distribution package: Ubuntu APT
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
N/A
Please describe the system on which you are running
Operating system/version: Ubuntu 24.04.1 with uname -a output: Linux hpc01 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Computer hardware: Intel(R) Xeon(R) Silver 4210R CPU (10 cores, 20 threads), 2 sockets per node, 8 nodes
Network type: Wired, with ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01), one adapter per node
Details of the problem
I am using Open MPI and slurm-wlm 23.11.4 with PMIx to perform distributed parallel computing. All software and dependencies are installed from the official Ubuntu APT repositories and have been upgraded to the latest versions. My cluster has 8 nodes, each with 2 CPUs (10 cores, 20 threads each), so it should theoretically be able to launch up to 20 * 2 * 8 = 320 processes simultaneously. Running non-MPI programs such as "hostname" works fine. However, when running MPI programs with NUM_PROCS greater than 308, the program hangs without producing any output, and the cluster monitor shows low CPU and memory utilization. With NUM_PROCS less than or equal to 308, the program behaves correctly.
Tested programs: HPL, HPCG, and various simple testing programs like array addition and summation.
The command is as follows:
$ srun -n NUM_PROCS --mpi=pmix program
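For reference, a minimal reproducer in the spirit of the array-summation tests mentioned above might look like the following. This is only a sketch; the file name and variable names are illustrative, not taken from the actual test programs:

```c
/* sum_test.c - hypothetical minimal MPI reproducer.
 * Each rank contributes one value; MPI_Allreduce combines them.
 * A hang before the final print would match the symptom described. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Trivial per-rank contribution; the sum of 0..size-1. */
    long local = rank;
    long total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d total=%ld\n", size, total);

    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched via srun -n NUM_PROCS --mpi=pmix ./sum_test, a program like this should exhibit the same behavior as the larger benchmarks if the problem is in the launch/wire-up path rather than in any particular application.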