# Docker-composed Slurm cluster for local development

This directory contains the configuration for a minimal but functioning Slurm cluster, running on Docker Compose, intended for developing and testing Slurm-integrated code, e.g. job submission/running/polling via PySlurm.

It also includes helper configuration for running Prefect flows on the Slurm cluster.

## Running it

From the parent directory, `docker compose --profile slurm up` will build and start the Slurm cluster. This includes the `docker-compose.yaml` file in this `slurm` dir.
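As a sketch of how the profile gating might look (hypothetical: the actual compose files may differ), the services in this directory's `docker-compose.yaml` would carry the `slurm` profile, so they stay down unless `--profile slurm` is passed:

```yaml
# Hypothetical sketch of slurm/docker-compose.yaml (service details are illustrative).
services:
  slurm_db:
    image: mariadb
    profiles: ["slurm"]
    ports:
      - "3306:3306"      # expose the Slurm job DB to the host
  slurm_node:
    build: .
    profiles: ["slurm"]  # started only with `docker compose --profile slurm up`
```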

## What you get

It starts:

- `slurm_db`: a MariaDB host for the Slurm database. Port 3306 is mapped to your host machine, so you can inspect the Slurm job DB at `mariadb://slurm:slurm@localhost:3306/slurm` (see e.g. `donco_job_table`).
- `slurm_node`: a container running the Slurm database daemon, controller, and a worker node (see below).
- The alternative setup (profile `slurm_full`, not started by default) splits these into separate containers for a more realistic setup:
  - `slurm_db_daemon`: a container running the Slurm database daemon. This is the interface between Slurm and the above DB.
  - `slurm_controller`: a container running the Slurm controller / central manager daemon. This is the node workers poll for jobs.
  - `slurm_worker`: a container running the Slurm compute node daemon. This container executes jobs in the queue.
  - Note the difference in `configs/` and `entrypoints/` for the single-node vs. full setups.

Apart from the MariaDB container, the others share a common Ubuntu-based image built from the `Dockerfile`. The `*-entrypoint.sh` scripts start the respective daemons in the respective containers.

`submitter-entrypoint.sh` is not used by any of these containers, but can be used to create a "submission node" (AKA "login node"): a node that has a Slurm installation and runs `munged`, so it can submit jobs, but does not execute any work itself. The parent docker-compose setup uses it to submit Slurm jobs from Prefect flows.
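A submission-node service built on that entrypoint might look roughly like this (a hypothetical sketch; the service name, build context, and entrypoint path are illustrative, not taken from the actual parent compose file):

```yaml
# Hypothetical compose service for a submission/login node (illustrative only).
services:
  slurm_submitter:
    build: ./slurm                              # reuse the shared Ubuntu-based Slurm image
    entrypoint: ./submitter-entrypoint.sh       # starts munged; no slurmd, so it runs no jobs
```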

`slurm*.conf` are configs generated with the Slurm configurator, for this docker compose setup only.
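For orientation, the partition layout visible in `sinfo` (a default `debug` partition plus `datamover`, both backed by `slurm_node`) corresponds to config lines along these lines (a sketch, not a copy of the actual file):

```
# Sketch of the partition definitions in the single-node slurm.conf.
NodeName=slurm_node CPUs=1 State=UNKNOWN
PartitionName=debug     Nodes=slurm_node Default=YES MaxTime=INFINITE State=UP
PartitionName=datamover Nodes=slurm_node Default=NO  MaxTime=INFINITE State=UP
```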

`slurm_prolog.sh` and `slurm_epilog.sh` are scripts that execute before and after each job, respectively. We use them to pretend that `/nfs/public` is only available to jobs running in the `datamover` partition. This helps with writing jobs for the EBI Codon Slurm infrastructure, which looks a bit like this.
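The prolog's partition check can be sketched as follows (a minimal sketch, assuming the `SLURM_JOB_PARTITION` environment variable that Slurm exports to prolog/epilog scripts; the real scripts and mount mechanics likely differ):

```shell
#!/bin/sh
# Sketch of the slurm_prolog.sh decision: only datamover jobs should see /nfs/public.

# should_expose_public PARTITION -> prints "yes" if /nfs/public should be visible
should_expose_public() {
    if [ "$1" = "datamover" ]; then
        echo yes
    else
        echo no
    fi
}

# In the real prolog, Slurm sets SLURM_JOB_PARTITION for the starting job, and a
# "yes" here would trigger e.g. a bind mount of the dummy /nfs/public tree
# (with the epilog undoing it afterwards).
should_expose_public "${SLURM_JOB_PARTITION:-debug}"
```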

## Data and filesystems

There are dummy HPS and NFS filesystems at `/hps` and `/nfs/production`. Jobs running on the `datamover` partition also see `/nfs/public`.

## Using it

### Interactively on the nodes

The Slurm cluster is called `donco` (not `codon` :-] ). You can dispatch Slurm commands on one of the nodes, e.g. `slurm_node` (single-node setup), or `slurm_worker` or `slurm_controller` in the full setup.

```console
user@host:/# task slurm
root@slurm_node:/# sinfo
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# debug*       up   infinite      1   idle slurm_node
# datamover    up   infinite      1   idle slurm_node

root@slurm_node:/# sbatch --wait -t 00:00:30 --mem 10M --wrap="ls" -o listing.txt
# Submitted batch job 54
root@slurm_node:/# sacct -j 54
# JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
# ------------ ---------- ---------- ---------- ---------- ---------- --------
# 54                 wrap      debug       root          1  COMPLETED      0:0
# 54.batch          batch                  root          1  COMPLETED      0:0
root@slurm_node:/# srun -t 1:00:00 --mem=100M --partition=datamover --pty bash -l
root@slurm_node:/# cd /nfs/public
root@slurm_node:/nfs/public#
# Note that this extra /nfs/public mount is available because we are on the datamover partition
```

### Via Taskfile convenience wrapper

From the parent dir, where there is a Taskfile: `task slurm` will open a shell on the Slurm controller node, where you can run things like `sinfo`, `squeue`, `srun`, `sbatch`, etc. (as above).

`task sbatch -- --wait -t 00:00:30 --mem 10M --wrap="ls" -o listing.txt` will dispatch a job directly.

Use e.g. `docker logs slurm_node -f` to see the job execute.
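The tasks above could be defined roughly like this in the parent Taskfile (a hypothetical sketch; the actual task definitions may differ — `{{.CLI_ARGS}}` is Task's standard way of forwarding everything after `--`):

```yaml
# Hypothetical Taskfile.yml excerpt (illustrative only).
version: "3"

tasks:
  slurm:
    # Open an interactive shell on the Slurm controller node.
    cmds:
      - docker exec -it slurm_node bash
  sbatch:
    # Forward arguments after `--` straight to sbatch inside the container.
    cmds:
      - docker exec slurm_node sbatch {{.CLI_ARGS}}
```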