Issue on data parallel training #60

suvigy · 2020-08-27T15:20:46Z

Hi,

I'm trying to train on 160,000 images using multiple GPUs.
The machine has 8 GPUs, and I want to use GPUs 1, 2, 3, 4 and 5, since GPUs 0, 6 and 7 are busy. So I set export CUDA_VISIBLE_DEVICES=1,2,3,4,5
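
For reference, with this setting the process sees the five GPUs renumbered from 0, so the device indices in the error message are the in-process ones, not the physical ones. A minimal sketch of that mapping follows; the helper name logical_to_physical is made up for illustration, and it assumes the variable holds plain integer ids (not GPU UUIDs):

```python
def logical_to_physical(visible_devices):
    """Map in-process CUDA indices to physical GPU ids.

    `visible_devices` is the value of CUDA_VISIBLE_DEVICES, e.g. "1,2,3,4,5".
    Assumption: the value contains integer ids only.
    """
    ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}


mapping = logical_to_physical("1,2,3,4,5")
for logical, physical in mapping.items():
    print(f"cuda:{logical} -> physical GPU {physical}")
```

So "device 0" inside the process would be physical GPU 1, and "device 1" would be physical GPU 2.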

I get the following error message:

  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 116, in forward
    outputs = self.decoder(features)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 91, in forward
    x = self.convs[("upconv", i, 0)](x)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 23, in forward
    out = self.conv(x)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 41, in forward
    out = self.conv(out)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
.....
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

So it seems the data tensor and the model are not on the same GPU?
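
One quick way to check this kind of suspicion is to compare the device of the input with the devices of the layer's parameters right before the call. The snippet below is only a sketch on a toy layer: device_report is a hypothetical helper, and the Conv2d is a stand-in, not the DispResNet decoder from the traceback.

```python
import torch
import torch.nn as nn


def device_report(module, tensor):
    """Return (input device, set of parameter devices) for a sanity check."""
    param_devices = {p.device for p in module.parameters()}
    return tensor.device, param_devices


# Toy layer and input (stand-ins for the real decoder conv and feature map).
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(2, 3, 16, 16)

# Use the first visible GPU if there is one, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
conv, x = conv.to(device), x.to(device)

in_dev, w_devs = device_report(conv, x)
print("input on:", in_dev, "| weights on:", w_devs)

# The convolution only runs if every weight lives on the input's device.
same = w_devs == {in_dev}
out = conv(x)
```

If the two sides differ at the failing layer, the input was moved to one GPU while the weights stayed on another, which matches what the error message reports.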

Thanks

suvigy · 2020-08-27T15:57:19Z

Oh, I see this is already solved in the pinned issue "Training with my own data". I will take a look at it.

suvigy · 2020-08-28T13:13:21Z

The solution in the pinned issue works, but the load on the GPUs is unbalanced. (Maybe the loss should be computed inside the model so that it also runs in parallel on each GPU.)
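
A rough sketch of what "loss inside the model" could look like with nn.DataParallel is below. This is not the repository's code: ModelWithLoss, the placeholder L1 loss, and the toy Conv2d network are all assumptions made for illustration. The idea is that each replica returns only its own loss value, so the full-size predictions are not gathered onto the first GPU.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Toy wrapper: runs the network and computes the loss in forward()."""

    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, inputs, targets):
        pred = self.net(inputs)
        # Placeholder L1 loss; the real training loss would go here.
        loss = torch.abs(pred - targets).mean()
        # Return shape (1,) so DataParallel can concatenate one value per GPU.
        return loss.unsqueeze(0)


net = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in, not DispResNet
wrapped = nn.DataParallel(ModelWithLoss(net))
if torch.cuda.is_available():
    wrapped = wrapped.cuda()

inputs = torch.randn(4, 3, 16, 16)
targets = torch.randn(4, 1, 16, 16)

# Each replica computes its loss on its own chunk; average them afterwards.
loss = wrapped(inputs, targets).mean()
loss.backward()
```

Note that averaging the per-GPU values equals the loss over the whole batch only when every GPU receives a chunk of the same size.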

JiawangBian · 2020-08-29T05:14:14Z

I have tried to re-implement the loss. However, the training is still slow. I suggest using a single GPU for training, which takes about 2 days.

suvigy closed this as completed Aug 28, 2020
