Bug report using gcc and impi on NOAA hera system #123

Open
thomas-robinson opened this issue Sep 3, 2024 · 2 comments

@thomas-robinson

While trying to run e4s-cl init, I received an error saying that it was an e4s-cl bug and that I should report the contents of a debug file on GitHub. The contents of that file are pasted below:

$ cat /home/Thomas.Robinson/.local/e4s_cl/logs/debug_log
 [Debug root:483] 
########################################################################################################################################################
E4S CONTAINER LAUNCHER LOGGING INITIALIZED

Timestamp         : 2024-09-03 13:17:23.794833
Hostname          : hfe05
Platform          : Linux-4.18.0-477.27.1.el8_8.88ciq_lts.0.1.x86_64-x86_64-with-glibc2.28
Version           : 1.0.5.dev1+g35e5e6a
Python Version    : 3.12.4
Working Directory : /scratch2/GFDL/e4s/Thomas.Robinson/containers
Terminal Size     : 152x32
Frozen            : False
Log ID            : 0ad97a938a1a609713e75e9db3edb9d99134fa24084f2412fc43ef7cb0037359
########################################################################################################################################################

[Debug e4s_cl.cli.commands.__main__:77] e4s-cl args: Namespace(command='init', options=['--profile', 'gfdl2024.01', '--launcher', 'srun', '--backend', 'singularity', '--image', '/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif'], dry_run=None)
[Debug e4s_cl.cli.commands.init:77] e4s-cl init args: Namespace(profile_name='gfdl2024.01', launcher='/apps/slurm/default/bin/srun', backend='singularity', image='/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif', cmd=[])
[Debug e4s_cl.cf.storage.local_file:50] '/home/Thomas.Robinson/.local/e4s_cl/user.json' opened read-write
[Debug e4s_cl.cf.storage.local_file:170] Initialized user database '/home/Thomas.Robinson/.local/e4s_cl/user.json'
[+] Tracing MPI execution using:
[+] '/apps/slurm/default/bin/srun /scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'
[Debug e4s_cl.cli.commands.profile.detect:77] e4s-cl profile detect args: Namespace(profile_name=None, cmd=['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'])
[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
Failed to determine necessary libraries: program exited with code 1
[+] Attach <PtraceProcess #2397590> to debugger
[+] Set <PtraceProcess #2397590> options to 1
[+] Created profile gfdl2024.01
[Debug root:483] 
########################################################################################################################################################
E4S CONTAINER LAUNCHER LOGGING INITIALIZED

Timestamp         : 2024-09-03 13:18:38.549017
Hostname          : h11c53
Platform          : Linux-4.18.0-477.27.1.el8_8.88ciq_lts.0.1.x86_64-x86_64-with-glibc2.28
Version           : 1.0.5.dev1+g35e5e6a
Python Version    : 3.12.4
Working Directory : /scratch2/GFDL/e4s/Thomas.Robinson/containers
Terminal Size     : 152x32
Frozen            : False
Log ID            : befd6bd2404fe811dc2d9f4e42d7d451d4c2ba762934fc55b94067328a6687f0
########################################################################################################################################################

[Debug e4s_cl.cli.commands.__main__:77] e4s-cl args: Namespace(command='init', options=['--profile', 'gfdl2024.01', '--launcher', 'srun', '--backend', 'singularity', '--image', '/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif'], dry_run=None)
[Debug e4s_cl.cli.commands.init:77] e4s-cl init args: Namespace(profile_name='gfdl2024.01', launcher='/apps/slurm/default/bin/srun', backend='singularity', image='/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif', cmd=[])
[Debug e4s_cl.cf.storage.local_file:50] '/home/Thomas.Robinson/.local/e4s_cl/user.json' opened read-write
[Debug e4s_cl.cf.storage.local_file:170] Initialized user database '/home/Thomas.Robinson/.local/e4s_cl/user.json'
[+] Tracing MPI execution using:
[+] '/apps/slurm/default/bin/srun /scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'
[Debug e4s_cl.cli.commands.profile.detect:77] e4s-cl profile detect args: Namespace(profile_name=None, cmd=['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'])
[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
[+] Attach <PtraceProcess #1470539> to debugger
[+] Set <PtraceProcess #1470539> options to 1
[+] Created profile gfdl2024.01
[Debug root:483] 
########################################################################################################################################################
E4S CONTAINER LAUNCHER LOGGING INITIALIZED

Timestamp         : 2024-09-03 13:20:20.583304
Hostname          : h11c53
Platform          : Linux-4.18.0-477.27.1.el8_8.88ciq_lts.0.1.x86_64-x86_64-with-glibc2.28
Version           : 1.0.5.dev1+g35e5e6a
Python Version    : 3.12.4
Working Directory : /scratch2/GFDL/e4s/Thomas.Robinson/containers
Terminal Size     : 152x32
Frozen            : False
Log ID            : 0762531179c2b4d0051837a22b8316642f05373133dfabc70dca9d4f1093cea8
########################################################################################################################################################

[Debug e4s_cl.cli.commands.__main__:77] e4s-cl args: Namespace(command='init', options=['--profile', 'gfdl2024.01', '--launcher', 'srun', '--backend', 'singularity', '--image', '/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif'], dry_run=None)
[Debug e4s_cl.cli.commands.init:77] e4s-cl init args: Namespace(profile_name='gfdl2024.01', launcher='/apps/slurm/default/bin/srun', backend='singularity', image='/scratch2/GFDL/e4s/Thomas.Robinson/containers/gfdlsoftware_2024.01-gcc13.sif', cmd=[])
[Debug e4s_cl.cf.storage.local_file:50] '/home/Thomas.Robinson/.local/e4s_cl/user.json' opened read-write
[Debug e4s_cl.cf.storage.local_file:170] Initialized user database '/home/Thomas.Robinson/.local/e4s_cl/user.json'
[+] Tracing MPI execution using:
[+] '/apps/slurm/default/bin/srun /scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'
[Debug e4s_cl.cli.commands.profile.detect:77] e4s-cl profile detect args: Namespace(profile_name=None, cmd=['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester'])
[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
Failed to determine necessary libraries: program exited with code 156
[+] Attach <PtraceProcess #1470679> to debugger
[+] Set <PtraceProcess #1470679> options to 1
[+] Created profile gfdl2024.01

Here are the modules I have loaded:

$ module list
Currently Loaded Modules:
  1) gnu/9.2.0   2) impi/2020

My container uses GCC 13 and MPICH, both installed with Spack.

@FrederickDeny
Collaborator

Hi Thomas, the fact that the created profile's name doesn't specify an MPI vendor ("[+] Created profile gfdl2024.01") indicates that e4s-cl failed to find any of libmpi.so.12, libmpi_cray.so.12, or libmpi.so.40. e4s-cl tries to locate one of these three and names the newly created profile accordingly.

Could you check if the correct libmpi.so is in your LD_LIBRARY_PATH?
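
One quick way to run that check is sketched below. It only walks the directories listed in LD_LIBRARY_PATH and looks for the three sonames named above, so it does not account for RPATH entries or the system linker cache. The function name and the vendor labels are illustrative assumptions and are not taken from e4s-cl itself.

```python
import os
from pathlib import Path

# Sonames mentioned in this thread. The vendor labels are illustrative
# assumptions, not e4s-cl's actual profile-naming scheme.
KNOWN_MPI_SONAMES = {
    "libmpi.so.12": "MPICH ABI (e.g. MPICH, Intel MPI)",
    "libmpi_cray.so.12": "Cray MPICH",
    "libmpi.so.40": "Open MPI",
}


def find_mpi_libraries(ld_library_path=None):
    """Return (soname, vendor_label, full_path) for each known soname found.

    Only directories listed in LD_LIBRARY_PATH are searched.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        for soname, vendor in KNOWN_MPI_SONAMES.items():
            candidate = Path(directory) / soname
            if candidate.exists():
                matches.append((soname, vendor, str(candidate)))
    return matches


if __name__ == "__main__":
    found = find_mpi_libraries()
    if not found:
        print("No known MPI soname found in LD_LIBRARY_PATH")
    for soname, vendor, path in found:
        print(f"{soname} ({vendor}): {path}")
```

Running this with the same modules loaded as for e4s-cl init should show whether any of the three sonames is visible through LD_LIBRARY_PATH.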

@spoutn1k
Collaborator

spoutn1k commented Sep 5, 2024

The idea behind the init command is to understand what the MPI environment is and save the detected configuration to avoid computing it every time.

This is done using a Python script that accesses an MPI library from the environment, then loads and uses well-known symbols to run basic operations. This ensures the library is working properly and loads all the libraries it needs to function (they can sometimes be lazy-loaded).
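
As a rough illustration of the first half of that idea, the sketch below uses Python's ctypes to try loading a library by soname and to check whether a few standard MPI symbols are exported. This is not the e4s-cl tester: the helper name probe_library is hypothetical, and the sketch deliberately stops short of calling MPI_Init, which is the step that would actually trigger any lazy loading.

```python
import ctypes


def probe_library(soname, symbols):
    """Try to load `soname` and report which of `symbols` it exports.

    Returns None if the dynamic linker cannot load the library,
    otherwise a dict mapping each symbol name to True or False.
    """
    try:
        lib = ctypes.CDLL(soname)
    except OSError:
        # The library (or one of its dependencies) could not be loaded.
        return None
    # Attribute lookup on a CDLL raises AttributeError for missing symbols,
    # so hasattr() is enough to test for presence without calling anything.
    return {name: hasattr(lib, name) for name in symbols}


if __name__ == "__main__":
    # libmpi.so.12 is one of the sonames mentioned in this thread.
    report = probe_library(
        "libmpi.so.12", ["MPI_Init", "MPI_Comm_rank", "MPI_Finalize"]
    )
    if report is None:
        print("libmpi.so.12 could not be loaded in this environment")
    else:
        for name, present in report.items():
            print(f"{name}: {'found' if present else 'missing'}")
```

If the load fails here with the same modules loaded, that points at the environment rather than at e4s-cl.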

You can see this in action here:

[Debug e4s_cl.util:211] Running with parent status: ['/apps/slurm/default/bin/srun', '/scratch2/GFDL/e4s/bin/conda/bin/python', '/scratch2/GFDL/e4s/bin/bin/e4s-cl', 'profile', 'detect', '/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester']
Failed to determine necessary libraries: program exited with code 156

You can see how this is done here. Intel MPI is treated as MPICH as they share ABI and sonames.

As Frederick suggested, something is preventing the proper analysis of your MPI environment. Please share the contents of the created profile and, if possible, compile a sample MPI program with this environment and share the output of ldd on it. What often happens is that either an RPATH or an arbitrary soname goes against standard MPI practices, and e4s-cl cannot adjust for that.
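
To make the ldd output easier to read, a small parser like the following can pull out the MPI entries and flag anything unresolved. It is a minimal sketch that assumes the usual "soname => path (address)" line format printed by ldd on Linux; the function names are hypothetical.

```python
import subprocess
import sys


def parse_ldd(output):
    """Map each soname in `ldd` output to its resolved path (None if unresolved)."""
    resolved = {}
    for line in output.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # e.g. the vDSO or the dynamic loader line
        soname, _, target = line.partition("=>")
        target = target.strip()
        if target.startswith("not found"):
            resolved[soname.strip()] = None
        else:
            # Drop the trailing load address, e.g. "(0x00007f...)".
            resolved[soname.strip()] = target.split("(")[0].strip() or None
    return resolved


def mpi_entries(resolved):
    """Keep only the entries whose soname starts with 'libmpi'."""
    return {k: v for k, v in resolved.items() if k.startswith("libmpi")}


if __name__ == "__main__":
    # Usage: python parse_ldd.py ./my_mpi_program
    result = subprocess.run(
        ["ldd", sys.argv[1]], capture_output=True, text=True, check=False
    )
    for soname, path in mpi_entries(parse_ldd(result.stdout)).items():
        print(f"{soname} -> {path or 'NOT FOUND'}")
```

An MPI soname other than the three listed earlier, or one resolved from an unexpected directory, would be worth including in the report.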

If you can, try running the tester script (/scratch2/GFDL/e4s/bin/conda/bin/e4s-cl-mpi-tester) in your desired MPI environment and see whether it gives you any information about what is failing.
