
Adding Slurm jobs #20

Open · wants to merge 13 commits into main
Conversation

taxe10 (Member) commented Apr 2, 2024

This PR adds the option to run Slurm flows at Maxwell. In addition to standard Slurm parameters, it adds the option to set up port forwarding for cases where the ML jobs require access to MLEx services located on a different node (e.g. tiled).

taxe10 requested a review from Wiebke on Apr 4, 2024
Wiebke (Member) left a comment

This looks great overall. I added a draft version of ssh tunneling based on a list of ports and an ssh key. It isn't thoroughly tested, but if you have time to take a further look it could be a good starting point.
For the key, I would expect it to become an environment variable of the worker. If it is not set, it probably makes sense to default to ~/.ssh/id_rsa.
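As a sketch of that default, the worker-side fallback could look like this (the variable and environment-variable names here are assumptions for illustration, not part of the PR):

```shell
# Hypothetical sketch: the worker may export SUBMISSION_SSH_KEY; when it
# is unset or empty, fall back to the conventional default key path.
submission_ssh_key="${SUBMISSION_SSH_KEY:-$HOME/.ssh/id_rsa}"
```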

echo $BATCH_SCRIPT

# Submit the Slurm batch script and capture the job ID
JOB_ID=$(sbatch $BATCH_SCRIPT | awk '{print $4}')
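For context on the awk call: sbatch prints a confirmation line of the form "Submitted batch job 12345", so the job ID is the fourth whitespace-separated field. A standalone sketch of just the parsing step (no sbatch invoked here):

```shell
# Stand-in for the line sbatch prints on success; awk extracts the
# fourth field, which is the numeric job ID.
sbatch_output="Submitted batch job 12345"
JOB_ID=$(echo "$sbatch_output" | awk '{print $4}')
```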
Wiebke (Member) commented Apr 5, 2024

In order to forward ports between nodes, we need to pass information about the submission host to the script, and optionally specify a key to use for ssh authentication. This key must have no passphrase.

Suggested change
JOB_ID=$(sbatch $BATCH_SCRIPT | awk '{print $4}')
JOB_ID=$(sbatch --export=ALL,JOB_SUBMISSION_HOST=$(hostname),JOB_SUBMISSION_SSH_KEY="$submission_ssh_key" $BATCH_SCRIPT | awk '{print $4}')
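To illustrate what --export=ALL,VAR=... does: the batch script runs with the submitter's full environment plus the extra variables. A sketch of that mechanism, with env standing in for the sbatch step (sbatch itself is not invoked here):

```shell
# Simulate the batch job's environment: the child process receives
# JOB_SUBMISSION_HOST just as a job launched with --export=ALL,... would.
JOB_SUBMISSION_HOST=$(hostname)
seen_in_job=$(env JOB_SUBMISSION_HOST="$JOB_SUBMISSION_HOST" \
    sh -c 'echo "$JOB_SUBMISSION_HOST"')
```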

# Write the Slurm batch script
echo "#!/bin/bash" > $BATCH_SCRIPT

if [ "$partitions" = "" ]

I think

Suggested change
if [ "$partitions" = "" ]
if [ -z "$partitions" ]

is often used instead of the comparison against an empty string, as it covers both uninitialized and empty strings. This isn't strictly necessary here because we require each variable to be passed, but I thought it could be an interesting side note.
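A quick demonstration of the behavior, runnable as-is:

```shell
# -z is true when the string has zero length, which covers both an
# unset variable and one explicitly assigned the empty string.
unset partitions
[ -z "$partitions" ] && echo "unset: treated as empty"
partitions=""
[ -z "$partitions" ] && echo "empty string: treated as empty"
partitions="allcpu"
[ -z "$partitions" ] || echo "non-empty: check fails"
```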

echo "source /etc/profile.d/modules.sh" >> $BATCH_SCRIPT
echo "module load maxwell mamba" >> $BATCH_SCRIPT
echo ". mamba-init" >> $BATCH_SCRIPT
echo "mamba activate $conda_env" >> $BATCH_SCRIPT

Suggested change
echo "mamba activate $conda_env" >> $BATCH_SCRIPT
echo "mamba activate $conda_env" >> $BATCH_SCRIPT
if [ ! -z "$forward_ports" ]; then
    echo -n "ssh " >> $BATCH_SCRIPT
    # If an ssh key was specified, pass it as an additional argument
    if [ ! -z "$submission_ssh_key" ]; then
        echo -n "-i $submission_ssh_key " >> $BATCH_SCRIPT
    fi
    # Loop over forward_ports and append one forwarding rule per port
    for port in $forward_ports; do
        echo -n "-L $port:localhost:$port " >> $BATCH_SCRIPT
    done
    # -N tells SSH not to execute a remote command
    echo -n "-N " >> $BATCH_SCRIPT
    # Finally, specify the host to tunnel to; JOB_SUBMISSION_HOST is
    # escaped so it expands inside the job, where sbatch --export set it.
    # & at the end of the line makes the command run in the background
    echo "$USER@\$JOB_SUBMISSION_HOST &" >> $BATCH_SCRIPT
fi
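For reference, with two forwarded ports and an explicit key, the generated tunnel line would have roughly this shape. The sketch below builds the same command in a variable instead of a batch file, purely to show the result; the hostname and ports are made-up examples and nothing is executed:

```shell
# Assemble the ssh tunnel command the snippet above would emit.
forward_ports="8000 8888"              # example ports
submission_ssh_key="$HOME/.ssh/id_rsa"
JOB_SUBMISSION_HOST="max-display001"   # example submission host
cmd="ssh -i $submission_ssh_key"
for port in $forward_ports; do
    cmd="$cmd -L $port:localhost:$port"
done
cmd="$cmd -N $USER@$JOB_SUBMISSION_HOST &"
echo "$cmd"
```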

Comment on lines 16 to 17
python_file=$7
yaml_file=$8

Introduce new variables forward_ports and submission_ssh_key

Suggested change
python_file=$7
yaml_file=$8
forward_ports=$7
submission_ssh_key=$8
python_file=$9
yaml_file=${10}
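One thing to watch once there are ten positional parameters: $10 is parsed as $1 followed by a literal 0, so the tenth argument must be written with braces as ${10}. A quick demonstration:

```shell
# With ten arguments set, "$10" expands to the first argument plus a
# literal "0", while "${10}" expands to the tenth argument.
set -- a b c d e f g h i j
unbraced="$10"    # "a0"
braced="${10}"    # "j"
```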

Comment on lines 4 to 5
if [ $# -ne 8 ]; then
echo "Usage: $0 <job_name> <num_nodes> <partitions> <reservations> <max_time> <conda_env> <python_file> <yaml_file>"

Introduce new variables forward_ports and submission_ssh_key

Suggested change
if [ $# -ne 8 ]; then
echo "Usage: $0 <job_name> <num_nodes> <partitions> <reservations> <max_time> <conda_env> <python_file> <yaml_file>"
if [ $# -ne 10 ]; then
echo "Usage: $0 <job_name> <num_nodes> <partitions> <reservations> <max_time> <conda_env> <forward_ports> <submission_ssh_key> <python_file> <yaml_file>"

flows/slurm/schema.py (resolved)
exit 1
else
# If the job is neither completed nor failed, print its status
echo "Job $JOB_ID is $JOB_STATUS"
taxe10 (Member, Author)

Something that currently needs more work is retrieving the Slurm job logs while the job is running. We could do this by reading the slurm-{job_id}.out file, but I wonder if there are alternative ways to do this.
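One possible approach (an assumption for illustration, not something this PR implements) is to poll the slurm-{job_id}.out file between status checks and print only the lines added since the last poll:

```shell
# Hypothetical log-polling helper: remembers how many lines have already
# been printed and emits only the new ones on each call.
JOB_ID=12345                       # example job ID
LOG_FILE="slurm-${JOB_ID}.out"
LINES_SEEN=0
poll_log() {
    # The file may not exist until the job starts writing output
    [ -f "$LOG_FILE" ] || return 0
    total=$(wc -l < "$LOG_FILE")
    if [ "$total" -gt "$LINES_SEEN" ]; then
        tail -n +"$((LINES_SEEN + 1))" "$LOG_FILE"
        LINES_SEEN=$total
    fi
}
```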

taxe10 (Member, Author)

@Wiebke I think you already added this. If you did, can you add it to this PR please?

from flows.slurm.schema import SlurmParams


class Logger:

Can this be refactored into something common across the flows?

dylanmcreynolds (Member) left a comment

I have a proposed change in there about making the logging we're doing for both conda and Prefect its own module so that we can be more DRY.

But I want to ask here: the script run_slurm.sh contains some very specific code for running on one very specific cluster. I think this script should be renamed run_slurm_maxwell.sh for now, with an eye toward a configuration mechanism that lets us choose between multiple Slurm implementations.

We could make the name of the Slurm script an input parameter to the flow, or use Prefect blocks to configure it.

Wiebke (Member) commented Jul 25, 2024

Merging this, as the requested changes were addressed but fell off this PR due to a GitHub branch disconnect. Will create a PR to add the logger changes and the additional Slurm logging changes mentioned above.

3 participants