Skip to content

Latest commit

 

History

History
82 lines (60 loc) · 2.81 KB

README.md

File metadata and controls

82 lines (60 loc) · 2.81 KB

HLRN-Slurm: A tool for submitting jobs with singularity on the HLRN Cluster 🔨

Features ⭐

  • Automatic generation of SBATCH scripts based on a given .yaml configuration file and/or command-line options ✅
  • Submission of multiple jobs with one command ✅
  • Automatic start of a singularity container with bindings based on the configuartion file ✅
  • Execute a given runscript in the container... ✅
  • ... or create an idle container to connect to interactively ✅
  • Generate a hyperparameter sweep (grid-search) based on a single experiment-config and execute each generated config as a separate job ✅

Usage 🚀

First, you need to install the package via pip:

    $ pip install hlrn_slurm

Then starting a job is very simple.

Running an experiment can be as simple as:

    $ hlrn_slurm --cfg ~/prod_config.yaml --run_args "--experiment <my_experiment>"

In this case, I have a set of slurm configurations that I want applied for every experiment and the only thing that changes for each run is the experiment specification that is passed to the runscript through --run_args.

Starting a container for development (i.e. a container that allows an interactive from outside) is even simpler:

    $ hlrn_slurm --cfg ~/dev_config.yaml

Once the container is started, using it for development (e.g. through VS-Code) is easily possible, but requires some prior setup. A detailed guide on how to set this up is documented here (note that starting the container is already taken care of by hlrn-slurm).

Config 🔧

Example configuration files can be found here. The available options and default values can be inspected via $ hlrn_slurm --help.

Sweep 📈

The sweep option expects that you provide a sweep file. This is a .yaml file that contains the sweep options on top of your normal experiment configuration. This could look something like this:

dataset:
    dataset_cls: ImageNet
    batch_size: 128
model:
    _sweep_model_cls: [resnet18, resnet50]
trainer:
    max_epochs: 200 
    optimizer:
        name: Lion
        _sweep_lr: [0.0001, 0.001, 0.0005]
        weight_decay: 0.05

Important here is the keyword _sweep_ in front of keys that should be included in the sweep. The tool will generate configs and start jobs for every combination of hyperparameters that is provided through this way. In the example, we would get six configs, one of them would be:

dataset:
    dataset_cls: ImageNet
    batch_size: 128
model:
    model_cls: resnet18
trainer:
    max_epochs: 200 
    optimizer:
        name: Lion
        lr: 0.001
        weight_decay: 0.05