Program Hangs When Number of Process Exceeds 308 #12993

Open
fzhwenzhou opened this issue Dec 23, 2024 · 0 comments

fzhwenzhou commented Dec 23, 2024

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From an operating system distribution package: Ubuntu APT

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running

  • Operating system/version: Ubuntu 24.04.1 with uname -a output: Linux hpc01 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Computer hardware: 8 nodes, each with two Intel(R) Xeon(R) Silver 4210R CPUs (10 cores, 20 threads per CPU)
  • Network type: Wired Ethernet, with one Intel Corporation I350 Gigabit Network Connection (rev 01) controller per node

Details of the problem

I am using Open MPI with slurm-wlm 23.11.4 and PMIx for distributed parallel computing. All software and dependencies were installed from the official Ubuntu APT repositories and have been upgraded to the latest available versions. My cluster has 8 nodes, and each node has 2 CPUs (10 cores, 20 threads each), so in theory it should be able to run at most 20 * 2 * 8 = 320 processes simultaneously. Non-MPI programs such as "hostname" run without problems. However, when I run an MPI program with NUM_PROCS greater than 308, it hangs without producing any output, and the cluster monitor shows low CPU and memory utilization. With NUM_PROCS less than or equal to 308, the program behaves correctly.
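
For reference, the slot arithmetic above can be written out as a small Python sketch. The node, socket, and thread counts come from the hardware description; the block placement (fill one node before moving to the next) is only an assumption for illustration, and the actual Slurm distribution on this cluster may differ. The helper name block_layout is hypothetical.

```python
# Hardware figures taken from the system description above.
NODES = 8
SOCKETS_PER_NODE = 2
THREADS_PER_SOCKET = 20  # 10 cores x 2 hardware threads

SLOTS_PER_NODE = SOCKETS_PER_NODE * THREADS_PER_SOCKET  # 40
TOTAL_SLOTS = NODES * SLOTS_PER_NODE                    # 320


def block_layout(num_procs, nodes=NODES, slots_per_node=SLOTS_PER_NODE):
    """Return ranks per node, ASSUMING nodes are filled one after another.

    This is an illustrative assumption, not a statement about how Slurm
    actually placed the ranks in the failing runs.
    """
    if num_procs > nodes * slots_per_node:
        raise ValueError("more processes than hardware threads")
    layout = []
    remaining = num_procs
    for _ in range(nodes):
        placed = min(remaining, slots_per_node)
        layout.append(placed)
        remaining -= placed
    return layout


print(TOTAL_SLOTS)        # 320
print(block_layout(308))  # [40, 40, 40, 40, 40, 40, 40, 28]
print(block_layout(309))  # [40, 40, 40, 40, 40, 40, 40, 29]
```

Under this assumption the last working count leaves 12 hardware threads unused on the final node, so the hang starts well before the cluster is nominally full.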

Tested programs: HPL, HPCG, and several simple test programs such as array addition and summation.
The command is as follows:

$ srun -n NUM_PROCS --mpi=pmix program
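
A threshold like 308 can be located with a binary search over NUM_PROCS. The following is a minimal sketch, not the exact procedure used for this report. It reuses the srun options from the command above; the program path "./program", the 120-second timeout, and the function names run_ok and largest_working are assumptions chosen for illustration. It also assumes that every count below the threshold works and every count above it hangs.

```python
import subprocess
from typing import Callable


def run_ok(num_procs: int, program: str = "./program",
           timeout_s: float = 120.0) -> bool:
    """Return True if the srun job exits with status 0 within timeout_s.

    A run that exceeds the timeout is treated as a hang. The program path
    and timeout are placeholder assumptions; adjust them for the real job.
    """
    cmd = ["srun", "-n", str(num_procs), "--mpi=pmix", program]
    try:
        result = subprocess.run(cmd, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def largest_working(lo: int, hi: int, works: Callable[[int], bool]) -> int:
    """Binary search for the largest n in [lo, hi] with works(n) true.

    Assumes works(lo) is true and that works flips from true to false at
    most once as n grows.
    """
    if not works(lo):
        raise ValueError("the lower bound itself does not work")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if works(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# On the cluster: largest_working(1, 320, run_ok)
# Offline check with a stand-in predicate that mimics the reported behavior:
print(largest_working(1, 320, lambda n: n <= 308))  # 308
```

A timed-out srun may leave a job step behind, so it can be worth checking the queue between probes.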