Adding Slurm jobs #20
base: main
Conversation
…c version and StringConstraints
This looks great overall. I added a draft version of ssh tunneling based on a list of ports and an ssh key. It hasn't been tested much, but if you have time to take a further look it could be a good starting point.
For the key, I would expect this to become an environment variable of the worker. If it is not set, it probably makes sense to use ~/.ssh/id_rsa as a default.
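A minimal sketch of that default, assuming a hypothetical worker environment variable named SUBMISSION_SSH_KEY (not something this PR defines):

```shell
# Hedged sketch: SUBMISSION_SSH_KEY is a hypothetical worker environment
# variable. Fall back to ~/.ssh/id_rsa when it is unset or empty.
submission_ssh_key="${SUBMISSION_SSH_KEY:-$HOME/.ssh/id_rsa}"
echo "Using ssh key: $submission_ssh_key"
```

The `:-` form (rather than `-`) also covers the case where the variable is set but empty.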
flows/slurm/run_slurm.sh
Outdated
echo $BATCH_SCRIPT

# Submit the Slurm batch script and capture the job ID
JOB_ID=$(sbatch $BATCH_SCRIPT | awk '{print $4}')
In order to forward ports between nodes, we need to pass information about the submission host to the script, and optionally specify a key to be used for SSH authentication. This key must have no passphrase.
- JOB_ID=$(sbatch $BATCH_SCRIPT | awk '{print $4}')
+ JOB_ID=$(sbatch --export=ALL,JOB_SUBMISSION_HOST=$(hostname),JOB_SUBMISSION_SSH_KEY="$submission_ssh_key" $BATCH_SCRIPT | awk '{print $4}')
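A rough illustration of what `--export=ALL,VAR=value` does for the batch script: the listed variables become part of the job's environment, alongside the submitter's environment (`ALL`). Simulated here with env(1), since no Slurm cluster is assumed to be available:

```shell
# Simulate sbatch's --export behavior with env(1): the child process
# (standing in for the batch script) sees JOB_SUBMISSION_HOST.
host_seen=$(env JOB_SUBMISSION_HOST="$(hostname)" sh -c 'echo "$JOB_SUBMISSION_HOST"')
echo "$host_seen"
```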
flows/slurm/run_slurm.sh
Outdated
# Write the Slurm batch script
echo "#!/bin/bash" > $BATCH_SCRIPT

if [ "$partitions" = "" ]
I think

- if [ "$partitions" = "" ]
+ if [ -z "$partitions" ]

is often used instead of the check for an empty string, as it covers both uninitialized and empty variables. This isn't relevant here because we require each variable to be passed, but I thought it could be an interesting side note.
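A small demonstration of that side note, with `check` as a throwaway helper for the demo:

```shell
# [ -z ... ] is true for both unset and empty variables.
unset never_set
empty=""

check() {
    if [ -z "$1" ]; then
        echo "blank"
    else
        echo "set"
    fi
}

check "$never_set"   # blank
check "$empty"       # blank
check "value"        # set
```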
echo "source /etc/profile.d/modules.sh" >> $BATCH_SCRIPT
echo "module load maxwell mamba" >> $BATCH_SCRIPT
echo ". mamba-init" >> $BATCH_SCRIPT
echo "mamba activate $conda_env" >> $BATCH_SCRIPT
  echo "mamba activate $conda_env" >> $BATCH_SCRIPT
+ if [ ! -z "$forward_ports" ]; then
+     echo -n "ssh " >> $BATCH_SCRIPT
+     # If an ssh key was specified, pass it as an additional argument
+     if [ ! -z "$submission_ssh_key" ]; then
+         echo -n "-i $submission_ssh_key " >> $BATCH_SCRIPT
+     fi
+     # Loop over forward_ports and append each forwarding rule
+     for port in $forward_ports; do
+         echo -n "-L $port:localhost:8000 " >> $BATCH_SCRIPT
+     done
+     # -N tells SSH to not execute a remote command
+     echo -n "-N " >> $BATCH_SCRIPT
+     # Specify the host to tunnel to; the trailing & runs the tunnel in
+     # the background (plain echo, so the script line ends with a newline)
+     echo "$USER@$JOB_SUBMISSION_HOST &" >> $BATCH_SCRIPT
+ fi
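Illustration only: for example values, this builds the same ssh command that the loop above writes into the batch script. The host, user, ports, and key path are placeholders; 8000 is the fixed remote port used by the draft.

```shell
# Build the generated ssh line into a variable instead of a file,
# just to show what the batch script ends up containing.
forward_ports="8888 8787"
submission_ssh_key="$HOME/.ssh/id_rsa"
JOB_SUBMISSION_HOST="submit-host"

cmd="ssh "
if [ ! -z "$submission_ssh_key" ]; then
    cmd="${cmd}-i $submission_ssh_key "
fi
for port in $forward_ports; do
    cmd="${cmd}-L $port:localhost:8000 "
done
cmd="${cmd}-N $USER@$JOB_SUBMISSION_HOST &"
echo "$cmd"
```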
flows/slurm/run_slurm.sh
Outdated
python_file=$7
yaml_file=$8
Introduce the new variables forward_ports and submission_ssh_key:
- python_file=$7
- yaml_file=$8
+ forward_ports=$7
+ submission_ssh_key=$8
+ python_file=$9
+ yaml_file=${10}
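A side note on the new argument positions: positional parameters beyond $9 need braces in POSIX shells, so $10 is parsed as $1 followed by a literal 0. A quick demonstration with a throwaway function:

```shell
# Without braces, "$10" expands $1 and appends the character "0";
# with braces, "${10}" expands the tenth argument.
demo() {
    echo "$10"     # $1 plus "0"
    echo "${10}"   # tenth argument
}
demo a b c d e f g h i j
```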
flows/slurm/run_slurm.sh
Outdated
if [ $# -ne 8 ]; then
    echo "Usage: $0 <job_name> <num_nodes> <partitions> <reservations> <max_time> <conda_env> <python_file> <yaml_file>"
Account for the new variables forward_ports and submission_ssh_key in the argument check and usage message:
- if [ $# -ne 8 ]; then
-     echo "Usage: $0 <job_name> <num_nodes> <partitions> <reservations> <max_time> <conda_env> <python_file> <yaml_file>"
+ if [ $# -ne 10 ]; then
+     echo "Usage: $0 <job_name> <num_nodes> <partitions> <reservations> <max_time> <conda_env> <forward_ports> <submission_ssh_key> <python_file> <yaml_file>"
    exit 1
else
    # If the job is neither completed nor failed, print its status
    echo "Job $JOB_ID is $JOB_STATUS"
Something that currently needs more work is retrieving the Slurm job logs while the job is running. We could do this by reading the slurm-{job_id}.out file, but I wonder if there are alternative ways to do this.
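A hedged sketch of the file-based approach: poll the output file and print only the lines appended since the last poll. The temp file and fixed writes below stand in for a running job's slurm-{job_id}.out.

```shell
# Incrementally read new lines from a (simulated) Slurm log file
# between polls; a real flow would loop while the job is running.
LOG_FILE=$(mktemp)
LAST_LINE=0

read_new_lines() {
    total=$(wc -l < "$LOG_FILE")
    if [ "$total" -gt "$LAST_LINE" ]; then
        # Print everything after the last line seen so far
        tail -n +"$((LAST_LINE + 1))" "$LOG_FILE"
        LAST_LINE=$total
    fi
}

echo "job started" >> "$LOG_FILE"
read_new_lines     # prints "job started"
echo "epoch 1 done" >> "$LOG_FILE"
read_new_lines     # prints only "epoch 1 done"
rm -f "$LOG_FILE"
```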
@Wiebke I think you already added this. If you did, can you add it to this PR please?
from flows.slurm.schema import SlurmParams


class Logger:
Can this be refactored into something common across the flows?
I have a proposed change in there about making the logging we're doing for both conda and prefect its own module so that we can be more DRY.
But I want to ask here: the script run_slurm.sh contains some very specific code for running on a very specific cluster. I think this script should be renamed run_slurm_maxwell.sh for now, with an eye towards a configuration mechanism that lets us choose between multiple Slurm implementations. We could make the name of the Slurm script an input parameter to the flow, or use Prefect blocks to configure it.
Merging this, as the requested changes were addressed but fell off this PR due to a GitHub branch disconnect. Will create a PR for adding the logger changes and the additional Slurm logging changes mentioned above.
This PR adds the option to run Slurm flows at Maxwell. In addition to standard Slurm parameters, it adds the option to set up port forwarding when ML jobs require access to MLEx services located on a different node (e.g. tiled).