
cli: slurm: pulling Singularity image doesn’t work if SLURM node has no internet #919

Open
wtraylor opened this issue Sep 8, 2020 · 8 comments

Comments

@wtraylor (Contributor) commented Sep 8, 2020

On the computer cluster I am using (Goethe HLR), only the login node has internet access, not the computing nodes. Therefore, building or pulling a Singularity image on the computing node does not work.

I run the workflow like this:

popper run --engine singularity --config config.yml --file wf.yml

The details of the workflow are not relevant. What matters is that Popper (version 2020.09.01) now tries to execute singularity pull through SLURM, using srun. This happens in src/popper/runner_slurm.py; the behavior was apparently introduced in pull request #912.
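Roughly speaking, Popper ends up submitting something like the following to SLURM (the image URL here is just an example), which then fails on a compute node that has no internet access:

srun singularity pull docker://alpine:3.9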

I don’t have a good suggestion. It seems like some people want their Singularity images built on the computing node, and others (like me) on the login node.

@JayjeetAtGithub (Collaborator)

Thanks for opening this issue @wtraylor. This helps us understand different Slurm environments better. We can change to building images on the login node by default and give an option to use the compute node (if singularity is not present on the login node). What do you think @wtraylor and @ivotron?

@ivotron (Collaborator) commented Sep 9, 2020

We could also make use of the --skip-pull flag to tell the slurm runner not to pull images when it executes.

@wtraylor (Contributor, Author)

My hunch is that it is more common to prepare everything for an experiment on the login node, while the actual execution happens on the computing nodes. For example, I would compile my application on the login node. So from my perspective it would be more intuitive to also prepare the container image on the login node.

But I am not a very seasoned HPC user.

@ivotron (Collaborator) commented Sep 21, 2020

> My hunch is that it is more common to prepare everything for an experiment on the login node, while the actual execution happens on the computing nodes. For example, I would compile my application on the login node. So from my perspective it would be more intuitive to also prepare the container image on the login node.
>
> But I am not a very seasoned HPC user.

Another pattern I've heard from people working in HPC scenarios is that they have no network connectivity to the outside world from the slurm cluster at all, not even on the login node, so they need to scp images to the login node and then run from there. In these scenarios, doing --skip-pull and --skip-clone would help. Maybe we also need a --skip-build?
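For example, the fully offline pattern would look something like this (the hostname, path, and image are illustrative):

singularity pull alpine_3.9.sif docker://alpine:3.9    # on a machine with internet access
scp alpine_3.9.sif user@cluster-login:images/          # copy the image to the login node
popper run --engine singularity --skip-pull --skip-clone --file wf.yml    # on the login node, no network needed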

So for this issue, do we agree that we can do this:

> We can change to building images on the login node by default and give an option to use the compute node (if singularity is not present on the login node). What do you think @wtraylor and @ivotron?

Given what @wtraylor mentioned above, this would address this issue, right?

@wtraylor (Contributor, Author)

> Maybe we also need a --skip-build?

Since building an image typically involves installing software into that image (which itself needs network access), I think an option to skip the build would be required, too.

@ivotron (Collaborator) commented Sep 24, 2020

Sorry, I didn't explain that well. What I had in mind was running the entire workflow on the frontend node once, so that it builds the containers, but in single-node mode (i.e. just doing popper run with no -r flag), and then subsequently running popper run --skip-pull -r slurm. But I agree that it would be better to build containers locally rather than on each node, and to control that behavior via the config file.
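Concretely, with the workflow file from above, that two-step pattern would be:

popper run --engine singularity --file wf.yml    # once, on the login node: pulls/builds the images
popper run --engine singularity --skip-pull -r slurm --file wf.yml    # then through SLURM, reusing the images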

Also, since the folder one uses on the login node is shared with all the nodes, building multiple times on each node is redundant.

@JayjeetAtGithub (Collaborator)

Yeah, building redundantly on multiple nodes is wrong. We can make changes to have two modes controlled through the config: i) build on the login node (as in the local case, without srun), or ii) build on a single compute node (through srun). What do you think @ivotron?
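For instance (the option name and its exact placement in config.yml are just a sketch, not a final design), mode i) could be selected like this:

# in config.yml; build_on_login_node is a hypothetical option name
build_on_login_node: true    # i) build on the login node; false would select ii), building via srun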

@ivotron (Collaborator) commented Sep 27, 2020

Yeah, that sounds good. I'd go further and say we should not implement ii) until users request it.

@JayjeetAtGithub changed the title from "Pulling Singularity image doesn’t work if SLURM node has no internet" to "cli: slurm: pulling Singularity image doesn’t work if SLURM node has no internet" on Nov 16, 2020