
Using the stack in a container on a parallel FS #38


Closed
ocaisa opened this issue Sep 16, 2020 · 4 comments

ocaisa (Member) commented Sep 16, 2020

As part of #37, I was testing that setup by running GROMACS. Things work fine if I don't use too many MPI tasks per node, but once I go above 4 I get errors:

[ocais1@juwels03 test]$  OMP_NUM_THREADS=6 srun --time=00:05:00 --nodes=1 --ntasks-per-node=6 --cpus-per-task=6 singularity exec --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi mdrun -s ion_channel.tpr -maxh 0.50 -resethway -noconfout -nsteps 1000 -g logfile
srun: job 2622253 queued and waiting for resources
srun: job 2622253 has been allocated resources
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
CernVM-FS: pre-mounted on file descriptor 3
Failed to initialize loader socket
Failed to initialize loader socket
Failed to initialize loader socket
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
FATAL:   stat /cvmfs/pilot.eessi-hpc.org/2020.08/software/x86_64/intel/haswell/software/GROMACS/2020.1-foss-2020a-Python-3.8.2/bin/gmx_mpi: transport endpoint is not connected
CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... CernVM-FS: loading Fuse module... srun: error: jwc04n178: tasks 0-5: Exited with exit code 255

I suspect the alien cache alone is not enough here, and that we also need a local cache on the node for this use case.
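
For the record, a hedged sketch of what that could look like in the client's default.local (the paths here are placeholders, not our actual setup):

# hypothetical default.local: alien cache on the parallel FS, workspace on the node
CVMFS_HTTP_PROXY=DIRECT
CVMFS_ALIEN_CACHE=/p/project/cecam/cvmfs/alien-cache   # shared cache on the parallel FS
CVMFS_WORKSPACE=/tmp/cvmfs-workspace                   # node-local lock files and transaction dirs
CVMFS_QUOTA_LIMIT=-1                                   # an alien cache is not managed by the client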

ocaisa (Member, Author) commented Sep 22, 2020

This is now fixed with the updated script in #37. The important point is that we should not bind mount /var/lib/cvmfs and /var/run/cvmfs; instead, these should be created using the --scratch option to Singularity.
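
Concretely (a sketch only, based on the command from my test above; adjust paths to your own setup), the call becomes something like:

# use per-container scratch directories for the CVMFS state instead of bind mounts
singularity exec \
  --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" \
  --scratch /var/run/cvmfs --scratch /var/lib/cvmfs \
  /p/project/cecam/singularity/cecam/ocais1/client-pilot_centos7-2020.08.sif \
  <gmx_mpi command as above>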

ocaisa closed this as completed Sep 22, 2020

boegel (Contributor) commented Sep 24, 2020

@ocaisa Do you know what the key difference is between using --scratch and bind mounting /var/*/cvmfs?

Should we also change this in the pilot instructions at https://eessi.github.io/docs/pilot/?

ocaisa (Member, Author) commented Sep 24, 2020

Well, know is a strong word :P

I think what is important here is that each MPI process will get a unique space for /var/*/cvmfs if we use the --scratch option. This is required since each MPI process is doing a separate mount, and with bind mounting they step on each other's toes. I would have thought that this was only required for /var/run/cvmfs (coupled with using unique cache workspaces in default.local), but this didn't work out for me.
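
To make that concrete: with bind mounts you would have to give every task its own host directories by hand, for example keyed on SLURM_PROCID in a wrapper script launched by srun (a hypothetical sketch; the /tmp paths and $IMAGE are placeholders), which is exactly the bookkeeping --scratch does for you:

# hypothetical per-task bind mounts; with --scratch none of this is needed
RUN_DIR=/tmp/$USER/cvmfs-run-$SLURM_PROCID
LIB_DIR=/tmp/$USER/cvmfs-lib-$SLURM_PROCID
mkdir -p "$RUN_DIR" "$LIB_DIR"
singularity exec \
  --fusemount "$EESSI_CONFIG" --fusemount "$EESSI_PILOT" \
  --bind "$RUN_DIR":/var/run/cvmfs --bind "$LIB_DIR":/var/lib/cvmfs \
  "$IMAGE" <command>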

To change the docs, you'd need to be really sure, but my suspicion is that what is described in the docs is only good for serial use.

boegel (Contributor) commented Sep 24, 2020

I actually ran into the "Failed to initialize loader socket" issue just now as well when trying to just mount the pilot repo, without MPI being involved at all, while using --scratch worked fine.

So I'll change it in the docs; using --scratch seems more robust to me.

@trz42 also ran into trouble when trying to use a different filesystem than /tmp for the bind mounts, using --scratch would've probably prevented that from happening too...
