Optuna HPO & Lightning Multi-GPU Training using DDP on SLURM - ValueError: World Size does not Match #19924
Unanswered
eTuDpy asked this question in DDP / multi-GPU / multi-node
Replies: 3 comments · 1 reply
- Ran into the same issue.
- Same problem.
- Same issue.
I'm trying to use all of the computational resources on a SLURM cluster to speed up my hyperparameter optimization with Optuna and PyTorch Lightning.
My code works fine with the PyTorch Lightning Trainer alone, without the Optuna HPO loop; however, when everything is used together, the world_size fails to be set to the correct value.
I know that Optuna prefers "ddp_spawn"; however, as far as I can tell, that gives hardly any performance gain from the additional GPUs.
But when the PyTorch Lightning Trainer starts, I get the ValueError from the title (the world size does not match).
As the code runs smoothly without Optuna, AND runs smoothly with Optuna but with just 1 GPU / no DDP, I assume something else must be missing or wrong.
My Slurm file looks like this:
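(Only a rough sketch of the kind of batch script described here; the job name, CPU count, walltime, and the entry-point name hpo.py are assumed placeholders, not the exact file.)

```bash
#!/bin/bash
#SBATCH --job-name=optuna_hpo     # assumed job name
#SBATCH --nodes=1                 # single node
#SBATCH --ntasks-per-node=1       # one task per node (see the note below)
#SBATCH --gres=gpu:4              # assumed: 4 GPUs requested
#SBATCH --cpus-per-task=8         # assumed CPU count
#SBATCH --time=24:00:00           # assumed walltime

# hpo.py stands in for the actual Optuna/Lightning training script
srun python hpo.py
```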
My Trainer looks like this:
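(The exact snippet is not reproduced; a minimal sketch of a Lightning Trainer wrapped in an Optuna objective, as described above, could look like the following. MyLightningModule, MyDataModule, the search space, and the "val_loss" metric name are placeholders.)

```python
import optuna
import pytorch_lightning as pl

# Placeholders for the actual model and data code
from my_project import MyLightningModule, MyDataModule


def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)

    model = MyLightningModule(lr=lr)
    datamodule = MyDataModule()

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,          # assumed: 4 GPUs per node
        num_nodes=1,
        strategy="ddp",     # regular DDP; the world-size ValueError appears when this Trainer starts
        max_epochs=10,
        enable_progress_bar=False,
    )
    trainer.fit(model, datamodule=datamodule)
    return trainer.callback_metrics["val_loss"].item()


if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
```

Swapping strategy="ddp" for "ddp_spawn" would be the Optuna-preferred alternative mentioned above, but, as noted, it brings little gain from the additional GPUs.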
Small Notes:
- Changing #SBATCH --ntasks-per-node=1 to 4 does not do the trick (variant sketched below).
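(For reference, the variant that note refers to would change only the tasks-per-node directive, with the other values as assumed above:)

```bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # changed from 1 to 4; per the note above, this does not fix the error
#SBATCH --gres=gpu:4

srun python hpo.py            # hpo.py remains a placeholder
```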