Issue on data parallel training #60

Closed
suvigy opened this issue Aug 27, 2020 · 3 comments

suvigy commented Aug 27, 2020

Hi,

I'm trying to train on 160,000 images using multiple GPUs.
The machine has 8 GPUs, and I want to use GPUs 1, 2, 3, 4, and 5, since GPUs 0, 6, and 7 are busy. So I set export CUDA_VISIBLE_DEVICES=1,2,3,4,5

I get the following error message:

  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 116, in forward
    outputs = self.decoder(features)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 91, in forward
    x = self.convs[("upconv", i, 0)](x)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 23, in forward
    out = self.conv(x)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/raid/home/CW01/uia64053/external-algos/unsupervised-depth/SC-SfMLearner-Release/models/DispResNet.py", line 41, in forward
    out = self.conv(out)
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 345, in forward
    .....
  File "/home/CW01/uia64053/anaconda3/envs/sc_sfmlearner/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

So it seems the data tensor and the model are not on the same GPU?

Thanks
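
A note on reading the device numbers in the error: when CUDA_VISIBLE_DEVICES is set, the process sees only the listed GPUs, renumbered from 0. So "device 1" and "0" in the message are logical indices inside the process, not the IDs typed into the environment variable. The sketch below illustrates that renumbering. The helper name `visible_device_map` is hypothetical, and it assumes the variable holds plain integer IDs (it can also hold UUIDs). The IDs are the ones CUDA enumerates, which may differ from the order shown by nvidia-smi unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```python
def visible_device_map(cuda_visible_devices: str) -> dict:
    """Map logical CUDA indices (as seen by the process) to the listed GPU IDs.

    Hypothetical helper for illustration; assumes comma-separated integer IDs.
    """
    listed_ids = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    # CUDA renumbers the visible devices consecutively, starting at 0.
    return {logical: listed for logical, listed in enumerate(listed_ids)}


mapping = visible_device_map("1,2,3,4,5")
for logical, listed in mapping.items():
    print(f"cuda:{logical} -> GPU {listed}")
```

Under these assumptions, the "weight" on device 0 sits on the first listed GPU and the "input" on device 1 sits on the second one, which is consistent with a replica receiving data on its own GPU while still using weights left on the first.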

suvigy commented Aug 27, 2020

Oh, I see that this is solved in the pinned issue "Training with my own data". I will take a look at it.

suvigy commented Aug 28, 2020

The solution in the pinned issue works, but the load across the GPUs is unbalanced. (Maybe the loss should be computed inside the model for data parallelism.)
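
The idea of computing the loss inside the model could be sketched roughly as follows. With nn.DataParallel, outputs are gathered onto the first GPU, so a loss computed outside the model runs there alone; wrapping the network and the loss in one module lets each replica return just a small loss tensor. This is a minimal sketch, not the repository's code: `ModelWithLoss` is a hypothetical wrapper, and the Conv2d layer and L1 loss are stand-ins for the actual SC-SfMLearner networks and losses.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: runs the network and the loss on the same replica."""

    def __init__(self, net: nn.Module, loss_fn):
        super().__init__()
        self.net = net
        self.loss_fn = loss_fn

    def forward(self, inputs, targets):
        outputs = self.net(inputs)
        # Return shape (1,) so DataParallel gathers one value per replica.
        return self.loss_fn(outputs, targets).reshape(1)


# Stand-in network and loss, NOT the SC-SfMLearner models or losses.
net = nn.Conv2d(3, 1, kernel_size=3, padding=1)
wrapped = ModelWithLoss(net, nn.functional.l1_loss)

# Use DataParallel only when several GPUs are visible; otherwise stay on CPU.
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped).cuda()

device = next(wrapped.parameters()).device
inputs = torch.randn(4, 3, 8, 8, device=device)
targets = torch.randn(4, 1, 8, 8, device=device)

# Average the per-replica losses (equal weight per replica) and backpropagate.
loss = wrapped(inputs, targets).mean()
loss.backward()
```

This only moves the loss computation onto the replicas; it does not by itself guarantee faster training.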

suvigy closed this as completed Aug 28, 2020

JiawangBian (Owner) commented

I have tried to re-implement the loss. However, the training is still slow. I suggest using a single GPU for training, which takes about 2 days.
