
Trouble on Parallel Training #55

Closed
SenZHANG-GitHub opened this issue Jul 20, 2020 · 3 comments

Comments

@SenZHANG-GitHub

Thanks for your work!

I recently found that if the machine contains more than one GPU, a runtime error occurs: "RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)"

This error disappears if I use CUDA_VISIBLE_DEVICES=0 to pin training to a single GPU. I am using torch 1.5.1 with CUDA 10.0.
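The same workaround can be applied from inside the script instead of the shell. A minimal sketch, assuming the environment variable is set before torch is first imported (setting it afterwards has no effect, since the CUDA runtime reads it at initialization):

```python
import os

# Hide every GPU except device 0 from the CUDA runtime. With only one
# visible device, nn.DataParallel has nothing to split the batch across,
# so the cross-device mismatch cannot occur. Must run BEFORE torch is
# imported anywhere in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import torch  # the import must come after the line above
```

This is equivalent to launching the script with `CUDA_VISIBLE_DEVICES=0 python train.py`.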

May I ask whether the data-parallel functionality has been successfully tested with multiple GPUs? Thanks.

Below is the full error log for your reference:

Traceback (most recent call last):
  File "train.py", line 458, in <module>
    main()
  File "train.py", line 206, in main
    train_loss = train(args, train_loader, disp_net, pose_net, optimizer, args.epoch_size, logger, training_writer)
  File "train.py", line 269, in train
    tgt_depth, ref_depths = compute_depth(disp_net, tgt_img, ref_imgs)
  File "train.py", line 437, in compute_depth
    tgt_depth = [1/disp for disp in disp_net(tgt_img)]
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 117, in forward
    outputs = self.decoder(features)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 92, in forward
    x = self.convs[("upconv", i, 0)](x)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 24, in forward
    out = self.conv(x)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 42, in forward
    out = self.conv(out)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 353, in forward
    return self._conv_forward(input, self.weight)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 350, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
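For context, this kind of error typically appears under nn.DataParallel when a module creates a tensor or holds a parameter pinned to a fixed device, so replica 1 (whose inputs live on cuda:1) multiplies against weights still on cuda:0. A hedged sketch of the pattern and its usual fix; the `Buggy`/`Fixed` module names are hypothetical and not from the project's code:

```python
import torch
import torch.nn as nn


class Buggy(nn.Module):
    # Anti-pattern: a tensor hard-coded to cuda:0 inside forward().
    # Under nn.DataParallel, replica 1 receives `x` on cuda:1, so the
    # multiply mixes devices and raises the error quoted above.
    def forward(self, x):
        mask = torch.ones(x.shape, device="cuda:0")
        return x * mask


class Fixed(nn.Module):
    # Fix: derive the device from the input so every replica stays
    # self-consistent, whichever GPU it was placed on.
    def forward(self, x):
        mask = torch.ones(x.shape, device=x.device)
        return x * mask


x = torch.randn(2, 3)   # CPU tensor, enough to exercise the pattern
out = Fixed()(x)
print(out.device)       # always matches the input's device
```

Whether this specific pattern is the cause here depends on the model code; see the linked issue for the project's own recommendation.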
@JiawangBian
Owner

See the below issue

#48

@SenZHANG-GitHub
Author

See the below issue

#48

Thanks : ) Exactly the answer I'm looking for!

@SenZHANG-GitHub
Author

SenZHANG-GitHub commented Jul 21, 2020

See the below issue

#48

An update:

As pointed out in #48, parallel training is unnecessary for a batch size of 4. In my tests it actually slows training down because of the data-transfer overhead between GPUs.

For an epoch size of 1000 on V100 GPUs:

  • Two GPUs: 19 min for training and 5 min for evaluation per epoch
  • Single GPU: 15 min for training and 4 min for evaluation per epoch

Just for reference.
