
cli: slurm: pulling Singularity image doesn’t work if SLURM node has no internet #919

Open
wtraylor opened this issue Sep 8, 2020 · 8 comments

Comments

@wtraylor (Contributor) commented Sep 8, 2020

On the computer cluster I am using (Goethe HLR), only the login node has internet access, not the computing nodes. Therefore, building or pulling a Singularity image on the computing node does not work.

I run the workflow like this:

popper run --engine singularity --config config.yml --file wf.yml

The details of the workflow are not relevant. What matters is that Popper (version 2020.09.01) now tries to execute singularity pull through SLURM, using srun. This happens in src/popper/runner_slurm.py; the behavior was apparently introduced in pull request #912.
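Roughly speaking, Popper ends up submitting something like the following to SLURM (the image URL here is just an example), which then fails on a compute node that has no internet access:

srun singularity pull docker://alpine:3.9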

I don’t have a good suggestion. It seems like some people want their Singularity images built on the computing node, and others (like me) on the login node.

@JayjeetAtGithub (Collaborator)

Thanks for opening this issue @wtraylor. This helps us understand different Slurm environments better. We can change to building images on the login node by default and give an option to use the compute node (if singularity is not present on the login node). What do you think @wtraylor and @ivotron?

@ivotron (Collaborator) commented Sep 9, 2020

We could also make use of the --skip-pull flag to tell the slurm runner not to pull images when it executes.

@wtraylor (Contributor, Author)

My hunch is that it is more common to prepare everything for an experiment on the login node, while the actual execution happens on the computing nodes. For example, I would compile my application on the login node. So from my perspective it would be more intuitive to also prepare the container image on the login node.

But I am not a very seasoned HPC user.

@ivotron (Collaborator) commented Sep 21, 2020

> My hunch is that it is more common to prepare everything for an experiment on the login node, while the actual execution happens on the computing nodes. For example, I would compile my application on the login node. So from my perspective it would be more intuitive to also prepare the container image on the login node.
>
> But I am not a very seasoned HPC user.

Another pattern I've heard from people working in HPC scenarios is that they have no network connectivity to the outside world from the slurm cluster at all, not even on the login node, so they need to scp images to the login node and then run from there. In these scenarios, doing --skip-pull and --skip-clone would help. Maybe we also need a --skip-build?
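For example, the fully offline pattern would look something like this (the hostname, path, and image are illustrative):

singularity pull alpine_3.9.sif docker://alpine:3.9    # on a machine with internet access
scp alpine_3.9.sif user@cluster-login:images/          # copy the image to the login node
popper run --engine singularity --skip-pull --skip-clone --file wf.yml    # on the login node, no network needed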

So for this issue, do we agree that we can do this:

> We can change to building images on the login node by default and give an option to use the compute node (if singularity is not present on the login node). What do you think @wtraylor and @ivotron?

Given what @wtraylor mentioned above, this would address this issue, right?

@wtraylor (Contributor, Author)

> Maybe we also need a --skip-build?

Since building an image typically involves installing software into that image (which itself needs network access), I think an option to skip the build would be required, too.

@ivotron (Collaborator) commented Sep 24, 2020

Sorry, I didn't explain that well. What I had in mind was running the entire workflow on the frontend node once, so that it builds the containers, but in single-node mode (i.e. just doing popper run with no -r flag), and then subsequently running popper run --skip-pull -r slurm. But I agree that it would be better to build containers locally rather than on each node, and to control that behavior via the config file.
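Concretely, with the workflow file from above, that two-step pattern would be:

popper run --engine singularity --file wf.yml    # once, on the login node: pulls/builds the images
popper run --engine singularity --skip-pull -r slurm --file wf.yml    # then through SLURM, reusing the images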

Also, since the folder one uses on the login node is shared with all the nodes, building multiple times on each node is redundant.

@JayjeetAtGithub (Collaborator)

Yeah, building redundantly on multiple nodes is wrong. We can make changes to have two modes controlled through the config: i) build on the login node (as in the local case, without srun), or ii) build on a single compute node (through srun). What do you think @ivotron?
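For instance (the option name and its exact placement in config.yml are just a sketch, not a final design), mode i) could be selected like this:

# in config.yml; build_on_login_node is a hypothetical option name
build_on_login_node: true    # i) build on the login node; false would select ii), building via srun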

@ivotron (Collaborator) commented Sep 27, 2020

Yeah, that sounds good. I'd go further and say we should not implement ii) until users request it.

@JayjeetAtGithub changed the title from "Pulling Singularity image doesn’t work if SLURM node has no internet" to "cli: slurm: pulling Singularity image doesn’t work if SLURM node has no internet" on Nov 16, 2020