Support submitit Hydra launcher, to launch runs on SLURM clusters #69

Open
wants to merge 3 commits into base: dev
Conversation

nathanpainchaud (Member)

The use of this launcher, once it has been tested over a few weeks launching real jobs, is meant to completely replace the `vital.utils.jobs` package, which will eventually be deleted.

@nathanpainchaud (Member Author)

@lemairecarl @ThierryJudge I'm trying to test this new launcher to launch a few jobs on Beluga, but I'm having trouble creating an environment with up-to-date dependencies on the cluster. That's why I decided not to delay this PR any longer and to gather your feedback ASAP. I looked at the output config generated by Hydra, and as far as that goes everything should be okay 🤞

@ThierryJudge (Collaborator) left a comment

I think this might require some extra documentation (maybe another readme) but looks good. I added a few questions and might add some more when I test it out.

timeout_min: ${oc.select:run_time_min,60}
setup:
- "module load httpproxy" # load module allowing to connect to whitelisted domains
- "source $ALLIANCE_VENV_PATH/bin/activate" # activate the pre-installed virtual environment
Collaborator

I am not very familiar with venvs, but does this activate the venv from a path that is not on the compute node?

@lemairecarl (Contributor) · Jul 25, 2022

From the assignment in .env.example, I suppose that the venv will typically be in project storage. That is what is meant by "pre-installed". Doing it this way works pretty well and is the most flexible. However, there is some debate about what the best practice is. I pretty much always use venvs from shared storage instead of installing them every time on the compute node.

@nathanpainchaud (Member Author)

@lemairecarl explained pretty much the whole thing. On the compute node, you still have access to everything on the cluster's distributed filesystem, e.g. your home folder, etc. It's just that some folders of the filesystem are physically on the compute node itself, and some are on other machines altogether. So, depending on how much you expect to read/write to some files, it can be a good idea to move them to the folders physically located on the compute node.

Regarding the Python environment, unless I'm very much mistaken, the Python code is only read from disk once when you launch the interpreter (at which point all the imported code is loaded into RAM), and afterwards everything is read from RAM. So I don't think there are major performance drawbacks to keeping the Python environment on a distributed filesystem, since it will only be accessed once per job, and this way you avoid potential issues with reproducing exactly the same environment each time you run a job.

The dataset is another matter entirely, since it's big, probably won't fit in RAM, and you'll be reading from it constantly throughout the job's runtime. That's why, in that case, it's worth going through the hassle of copying it to the compute node, to avoid having to access it remotely.
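For illustration, that staging step could be added to the launcher's setup commands. Below is a minimal sketch, assuming a SLURM cluster exposing node-local storage via $SLURM_TMPDIR; the $PROJECT_DATA_DIR variable and dataset folder name are placeholders, not fields defined anywhere in this PR:

# Hypothetical extension of the alliance launcher config: stage the dataset on
# node-local storage before the job starts.
# $PROJECT_DATA_DIR and my_dataset are placeholders for illustration only.
hydra:
  launcher:
    setup:
      - "module load httpproxy"
      - "source $ALLIANCE_VENV_PATH/bin/activate"
      - "cp -r $PROJECT_DATA_DIR/my_dataset $SLURM_TMPDIR/"  # job then reads the dataset locally from $SLURM_TMPDIR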

vital/config/hydra/launcher/alliance.yaml (review thread resolved; outdated)
vital/runner.py (review thread resolved; outdated)

hydra:
launcher:
timeout_min: ${oc.select:run_time_min,60}
Contributor

Could you explain this line to me, please?

@nathanpainchaud (Member Author)

Absolutely! The hydra.launcher.timeout_min field is provided out of the box by submitit to specify the duration (in minutes) to request for the job. As for how the field's value itself is interpreted, oc.select tries to interpolate another value from the Hydra config (in this case run_time_min), but falls back to the second, default value (in this case 60) if the first value does not correspond to a key in the Hydra config.

There is no example configuration in vital that takes advantage of this option, but it's the way I found to easily specialize the runtime of each Hydra "experiment". In my project, I specified values adapted to the runtime of my models in my experiment configs (e.g. 60 for an ENet, 500 for a VAE, etc.), but if the user doesn't specify it, the job gets a default runtime of 60 minutes. That's submitit's default value, so I simply copied it here to keep the default behavior.

Below is an excerpt from the config of my vae.yaml experiment, to better illustrate what I mean:

# @package _global_

defaults:
  [...]

trainer:
  max_epochs: 200
  precision: 16

data:
  batch_size: 128
  [...]

task:
  optim:
    lr: 5e-4
    weight_decay: 1e-3

run_time_min: 500

@lemairecarl (Contributor) left a comment

A few questions, but it looks good to me!

@nathanpainchaud (Member Author) commented Jul 26, 2022

In the end, when testing the output configs one last time before moving on to real-world tests on Beluga, I couldn't reproduce the correct output config I had initially gotten, due to a bug I discovered in some edge-case behavior in Hydra.

I opened an issue with Hydra (#2316), which we should wait to have resolved before finalizing + merging the current PR.

The use of this launcher is meant, in time, to completely replace the `vital.utils.jobs` package
@nathanpainchaud nathanpainchaud marked this pull request as ready for review February 21, 2023 16:11
@nathanpainchaud (Member Author)

@lemairecarl @ThierryJudge I recently received an answer to the issue I had opened with Hydra regarding the composition order of the config, and how built-in nodes are a special case.

The short answer is that Hydra's current behavior is a feature, not a bug, so the way I had previously designed the switching between local/SLURM runs wouldn't work. Therefore, I changed the design a bit to introduce a new custom launcher config group (not to be confused with the built-in hydra/launcher config group).

The purpose of the new config group is to, in a way, "override" the built-in launcher, so that we use it rather than hydra/launcher to switch between local/SLURM runs. This is less elegant than before, but it was necessary because, from my discussions with the Hydra dev, we essentially can't override application options from within the override of a built-in config group. So I had no choice but to create a new config group (which I could set to be resolved after application-level options), to enable the necessary overrides.
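To give an idea of the shape this takes, here is a minimal sketch of what an option of the custom launcher group could look like. The file name (launcher/alliance.yaml), the submitit_slurm override, and the field values are assumptions pieced together from the snippets above, not the exact contents of this PR:

# Hypothetical option of the custom `launcher` config group, e.g. launcher/alliance.yaml
# (location and values are illustrative assumptions)
# @package _global_

defaults:
  - override /hydra/launcher: submitit_slurm  # swap the built-in launcher for submitit's SLURM backend

hydra:
  launcher:
    timeout_min: ${oc.select:run_time_min,60}  # same run_time_min fallback discussed above
    setup:
      - "module load httpproxy"
      - "source $ALLIANCE_VENV_PATH/bin/activate"

With something along these lines, selecting launcher=alliance on the command line would switch the run to SLURM, while leaving it unset would keep the default local behavior.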

I know it's been a long time (and I'm in no hurry to get this merged), but if you ever have the time I would appreciate if you could give your opinions on the new design. It would be nice to finally have support for launching vital jobs on Slurm clusters again 🙂

…tom `launcher`

This allows us to revert the custom checks on the config that made sure required launcher config fields exist (because now we can rely on Hydra's built-in parsing)
Labels: enhancement (New feature or request) · 3 participants