Not sure how to make DP and DDP work on a single-node, 2-GPU setup #268
Replies: 3 comments
-
I think the documentation and examples cover this case. Did those not work for you? The speedup with dp varies according to what you're doing. With DDP you (mostly) double the speed every time you double the number of GPUs.
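For reference, a minimal sketch of how the two backends are selected, assuming the same test_tube Experiment / Trainer API used elsewhere in this thread (untested here, and the exact arguments depend on your Lightning version):
import os
from pytorch_lightning import Trainer
from test_tube import Experiment

exp = Experiment(save_dir=os.getcwd())
# DP: a single process, each batch is split across the listed GPUs
trainer = Trainer(experiment=exp, gpus=[0, 1], distributed_backend='dp')
# DDP: one process per GPU, each working on its own shard of the data
# trainer = Trainer(experiment=exp, gpus=[0, 1], distributed_backend='ddp')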
-
Will reopen if you are still having issues.
-
Yes, this is most likely what you want. Otherwise you run the same batch on both accelerators.
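A minimal sketch of what that could look like, giving each DDP process its own shard of the training set via a DistributedSampler (the dataset attribute, batch size, and dataloader hook name below are placeholders, and the hook name has changed across Lightning versions):
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_dataloader(self):
    # method on the LightningModule (e.g. ConvNet in this thread);
    # rank and world size are picked up from the initialized process group,
    # so each DDP process draws a disjoint subset of the data
    sampler = DistributedSampler(self.train_dataset)
    return DataLoader(self.train_dataset, batch_size=32, sampler=sampler)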
-
When I use DDP, the process hangs and no metric log files are created for me to view in TensorBoard; I just see two tf events files.
On the other hand, when I use DP, the code runs and I can watch the loss go down in TensorBoard, but I don't see any training speedup. Running with 1 GPU and running with DP on 2 GPUs take the same time: 12 minutes. I've tried different batch sizes, and there is virtually no difference in training time.
Do I have to create a DistributedSampler or do something else to see accelerated training with DP?
Code
My code is as follows:
import os

from pytorch_lightning import Trainer
from test_tube import Experiment

model = ConvNet()  # my LightningModule
# most basic trainer, uses good defaults
exp = Experiment(save_dir=os.getcwd())
trainer = Trainer(experiment=exp, gpus=[0, 1], max_nb_epochs=20, distributed_backend='dp')
trainer.fit(model)
What's your environment?