Optuna HPO & Lightning Multi-GPU Training using DDP on SLURM - ValueError: World Size does not Match #19924
Unanswered
eTuDpy asked this question in DDP / multi-GPU / multi-node
Replies: 3 comments · 1 reply
- Ran into the same issue.
- Same problem.
- Same issue.
I'm trying to use all of the computational resources on a SLURM cluster to speed up my hyperparameter optimization with Optuna and PyTorch Lightning.
My code works fine with the PyTorch Lightning Trainer alone, without the Optuna HPO loop; however, when everything is used together, the world_size fails to be set to the correct value.
I know that Optuna prefers "ddp_spawn"; however, as far as I can tell, that gives hardly any performance gain from the additional GPUs.
But when the PyTorch Lightning Trainer starts, I get the ValueError from the title (the world size does not match).
As the code runs smoothly without Optuna, AND runs smoothly with Optuna but with just 1 GPU / no DDP, I assume something else must be missing or wrong.
My Slurm file looks like this:
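(Only a rough sketch of the kind of batch script described here; the job name, CPU count, walltime, and the entry-point name hpo.py are assumed placeholders, not the exact file.)

```bash
#!/bin/bash
#SBATCH --job-name=optuna_hpo     # assumed job name
#SBATCH --nodes=1                 # single node
#SBATCH --ntasks-per-node=1       # one task per node (see the note below)
#SBATCH --gres=gpu:4              # assumed: 4 GPUs requested
#SBATCH --cpus-per-task=8         # assumed CPU count
#SBATCH --time=24:00:00           # assumed walltime

# hpo.py stands in for the actual Optuna/Lightning training script
srun python hpo.py
```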
My Trainer looks like this:
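(The exact snippet is not reproduced; a minimal sketch of a Lightning Trainer wrapped in an Optuna objective, as described above, could look like the following. MyLightningModule, MyDataModule, the search space, and the "val_loss" metric name are placeholders.)

```python
import optuna
import pytorch_lightning as pl

# Placeholders for the actual model and data code
from my_project import MyLightningModule, MyDataModule


def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)

    model = MyLightningModule(lr=lr)
    datamodule = MyDataModule()

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=4,          # assumed: 4 GPUs per node
        num_nodes=1,
        strategy="ddp",     # regular DDP; the world-size ValueError appears when this Trainer starts
        max_epochs=10,
        enable_progress_bar=False,
    )
    trainer.fit(model, datamodule=datamodule)
    return trainer.callback_metrics["val_loss"].item()


if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
```

Swapping strategy="ddp" for "ddp_spawn" would be the Optuna-preferred alternative mentioned above, but, as noted, it brings little gain from the additional GPUs.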
Small Notes:
- Changing #SBATCH --ntasks-per-node=1 to 4 does not do the trick (variant sketched below).
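(For reference, the variant that note refers to would change only the tasks-per-node directive, with the other values as assumed above:)

```bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # changed from 1 to 4; per the note above, this does not fix the error
#SBATCH --gres=gpu:4

srun python hpo.py            # hpo.py remains a placeholder
```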