NNI with SLURM and W&B

This is a patch for NNI that builds on version v2.10. Now a new training service is available: slurm!

Who might be interested? [1]

  • Do you use NNI to run your machine learning experiments?
  • Do your ML experiments run on compute nodes without internet access (for example, under a batch system)?
  • Do your compute nodes and your head/login node (with internet) have access to a shared file system?

Then this package might be useful. If you are a Weights & Biases user, you might also be interested in this alternative.

SLURM

Currently, this patch only supports SLURM, but it's quite simple to extend to other workload managers (e.g. PBS).

Usage

This package is built upon NNI. If you are new to NNI, please refer to the official documentation.

If you are an experienced NNI user, simply change your training service to slurm (other platforms are not affected; change it back if needed):

trainingService:
  platform: slurm
  resource:
    gres: gpu:NVIDIAGeForceRTX2080Ti:1  # request 1 2080Ti for each trial
    time: 1000                          # wall time for each trial (SLURM reads a bare number as minutes)
    partition: critical                 # request partition critical for resource allocation
  useSbatch: false
  useWandb: true

Or, if you use a Python script:

from nni.experiment import Experiment

experiment = Experiment('slurm')
experiment.config.training_service.resource = {
    'gres': 'gpu:NVIDIAGeForceRTX2080Ti:1',  # request 1 2080Ti for each trial
    'time': 1000,                            # wall time for each trial (SLURM reads a bare number as minutes)
    'partition': 'critical'                  # request partition critical for resource allocation
}
experiment.config.training_service.useSbatch = False
experiment.config.training_service.useWandb = True

Then run nnictl create --config config.yaml or execute the Python script on the login node. This starts the NNI server on the login node and submits the SLURM jobs.

SLURM Training Service

The slurm training service has only four parameters:

  • platform: str. Must be slurm.
  • resource: Dict[str, str]. Arguments used to submit a single job to the SLURM system. Do not add leading hyphens (- or --). Depending on how the job is submitted (srun or sbatch), the available options may differ slightly. See the SLURM docs (srun, sbatch) for more details. Feel free to use numbers: they are automatically converted to strings when the config is read.
  • useSbatch: Optional[bool]. Use sbatch to submit jobs instead of srun. Advantage: if the login node crashes unexpectedly, your jobs are not affected. Disadvantage: sbatch buffers output, so trial output appears with a delay (metrics are not affected). Default: False.
  • useWandb: Optional[bool]. Submit the trial logs to W&B. If you have previously logged in to W&B on your system, your account is used. Otherwise, an anonymous account is created for you automatically. Default: True.
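
To make the resource convention concrete, here is a minimal sketch of how such a dictionary could be turned into command-line options. The helper name build_slurm_command and the exact flag formatting are assumptions made for illustration, not the patch's actual implementation. It only mirrors the rules stated above: keys carry no leading hyphens, numbers are converted to strings, and useSbatch selects between srun and sbatch.

```python
from typing import Dict, List, Union


def build_slurm_command(resource: Dict[str, Union[str, int]],
                        use_sbatch: bool = False) -> List[str]:
    """Hypothetical helper: map a resource dict to srun/sbatch arguments."""
    command = ['sbatch' if use_sbatch else 'srun']
    for key, value in resource.items():
        if key.startswith('-'):
            raise ValueError(f'remove the leading hyphens from {key!r}')
        # Assumption: single-letter keys become short options,
        # everything else becomes a long option.
        if len(key) == 1:
            command += [f'-{key}', str(value)]
        else:
            command.append(f'--{key}={value}')
    return command


args = build_slurm_command({
    'gres': 'gpu:NVIDIAGeForceRTX2080Ti:1',
    'time': 1000,            # a number is fine, it is converted to a string
    'partition': 'critical',
})
print(' '.join(args))
```

With the configuration from the Usage section, this sketch produces an srun invocation with one option per resource entry.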

Example

You can find a complete project example here. It is modified from the NNI official tutorial HPO Quickstart with PyTorch.

W&B Support

W&B provides more detailed experiment analysis (e.g. parameter importance, machine status, etc.).

image

If useWandb is enabled (the default), you should see a new tab on the navigation bar:

image

This is the web link to the W&B project of this experiment. After a trial succeeds, you can also see a link to that trial:

image

Caution: The W&B link is only valid if at least one trial succeeds. Only successful trials are recorded by W&B.

By default, the W&B link is available for 7 days. If you want to keep the data for future analysis, you can claim the experiment to your account. If you are logged in to a W&B account on the login node, the experiment is saved to your account automatically.

How to Install

Download the patch wheel and install it with pip:

wget https://github.com/whyNLP/nni-slurm/releases/download/v2.11.1/nni-2.11.1-py3-none-manylinux1_x86_64.whl
pip install nni-2.11.1-py3-none-manylinux1_x86_64.whl

How to Uninstall

Running pip uninstall nni completely removes this patch from your system.

Troubleshooting

Error: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found

This is probably caused by an old version of gcc. You might want to install a newer gcc through spack, which needs only a few commands to install packages. See microsoft#4914 for more details.

Failed to establish a new connection: [Errno 111] Connection refused

This problem might be fixed in the upcoming NNI 3.0. This patch uses a temporary fix: it retries the connection more times. See microsoft#3496 for more details.
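
The idea behind that temporary fix can be sketched as a plain retry loop. This is only an illustration: call_with_retries, its parameters and the fake flaky_connect function are hypothetical and do not reflect the code in NNI or in this patch. The built-in ConnectionRefusedError stands in for whatever exception the real HTTP client raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def call_with_retries(func: Callable[[], T], retries: int = 5,
                      delay: float = 0.0) -> T:
    """Call func, retrying on ConnectionRefusedError up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return func()
        except ConnectionRefusedError:
            if attempt == retries:
                raise  # out of retries, surface the original error
            time.sleep(delay)
    raise ValueError('retries must not be negative')


attempts = {'count': 0}


def flaky_connect() -> str:
    """Fake connection that is refused twice before it succeeds."""
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionRefusedError(111, 'Connection refused')
    return 'connected'


result = call_with_retries(flaky_connect, retries=5)
print(result, 'after', attempts['count'], 'attempts')
```

A larger retry budget helps when the server is merely slow to start, as on a busy login node. It does not help when the server never comes up.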

ValueError: ExperimentConfig: type of experiment_name (None) is not typing.Optional[str]

It has been fixed in this patch. See microsoft#5468 for more details.

Error: tuner_command_channel: Tuner did not connect in 10 seconds. Please check tuner (dispatcher) log.

In most cases, this error means that the login node is too slow (heavy CPU and memory load). This patch extends the connection timeout to 120 seconds.

PermissionError: [Errno 13] Permission denied: '/tmp/debug-cli.log'

This is a bug in wandb. Pointing the temporary directory at a folder under your home directory can be a quick fix:

mkdir ~/tmp
export TMPDIR=~/tmp

See wandb#2845 for more details.
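
If you launch the experiment from a Python script, the same workaround can be sketched in Python. This assumes that the log path is derived from Python's tempfile module, which honours TMPDIR, and that the variable is set before wandb is imported. use_private_tmpdir is a hypothetical helper, not part of wandb or of this patch.

```python
import os
import tempfile
from pathlib import Path


def use_private_tmpdir(path: str = '~/tmp') -> str:
    """Hypothetical helper: create `path` and make it the temp directory."""
    tmp_dir = Path(path).expanduser()
    tmp_dir.mkdir(parents=True, exist_ok=True)   # like `mkdir -p ~/tmp`
    os.environ['TMPDIR'] = str(tmp_dir)          # like `export TMPDIR=~/tmp`
    tempfile.tempdir = None                      # drop tempfile's cached value
    return tempfile.gettempdir()
```

Calling use_private_tmpdir() at the top of the script, before any other import that touches temporary files, keeps the change local to that process and the child processes it starts afterwards.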

Questions

Can I run experiments using NNI without this patch?

It depends. There are some solutions:

  1. Run in local mode with the srun command. Potential problem: the login node cannot use tail-stream, so listening for file changes fails and no metrics are updated.
  2. Run in remote mode with the srun command, but connect to localhost. Potential problem: the tmp folder is not synced to the compute node. Also, the compute node might not be able to reach the login node.

See microsoft#1939, microsoft#3717 for more details.

An advantage of this patch is that it supports safe resume. Thanks to SLURM, your trials are not affected by a failure of the NNI manager: if NNI stops due to an error, your existing trials continue to run. You can resume the experiment later using nnictl resume <experiment_id> (docs) or a Python script (docs). On resume, this patch reads the trial logs and updates the status of the trials that were running, instead of ignoring them. Below is an example NNI timeline with a trial concurrency of 2.

Time line ---+--------------+-----------+---------+-----------+----------------+------------------------------

User      Start NNI -- Go to sleep                                      Wake up, Resume ----------------------

NNI          +--------------------------+------ Error                       Resume -----------------+---------

Trial 1      +---------------------- Finish
Trial 2      +----------------------------------------------Finish      Register Result
Trial 3                                 +----------------------------- Register Progress -------- Finish
Trial 4                                                                        +------------------------------
Trial 5                                                                                             +---------

Note that if you stop the experiment manually (e.g. with the command nnictl stop ...), NNI cancels all running trials to release the resources.

Will you create a pull request to NNI?

I have no plan to create a pull request. This patch is not fully tested, and the code style is not fully consistent with NNI requirements. I developed this patch for personal use only.

Reference