
Trouble on Parallel Training #55

Closed
SenZHANG-GitHub opened this issue Jul 20, 2020 · 3 comments

Comments

@SenZHANG-GitHub

Thanks for your work!

I recently found that if the machine contains more than one GPU, a runtime error occurs: "RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)"

This error disappears if I use CUDA_VISIBLE_DEVICES=0 to pin training to a single GPU. I am using torch 1.5.1 with CUDA 10.0.
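The same workaround can be applied from inside the script instead of the shell. A minimal sketch, assuming the environment variable is set before torch is first imported (setting it afterwards has no effect, since the CUDA runtime reads it at initialization):

```python
import os

# Hide every GPU except device 0 from the CUDA runtime. With only one
# visible device, nn.DataParallel has nothing to split the batch across,
# so the cross-device mismatch cannot occur. Must run BEFORE torch is
# imported anywhere in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import torch  # the import must come after the line above
```

This is equivalent to launching the script with `CUDA_VISIBLE_DEVICES=0 python train.py`.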

May I ask whether the data-parallel functionality has been successfully tested with multiple GPUs? Thanks.

Below is the full error log for your reference:

Traceback (most recent call last):
  File "train.py", line 458, in <module>
    main()
  File "train.py", line 206, in main
    train_loss = train(args, train_loader, disp_net, pose_net, optimizer, args.epoch_size, logger, training_writer)
  File "train.py", line 269, in train
    tgt_depth, ref_depths = compute_depth(disp_net, tgt_img, ref_imgs)
  File "train.py", line 437, in compute_depth
    tgt_depth = [1/disp for disp in disp_net(tgt_img)]
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 117, in forward
    outputs = self.decoder(features)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 92, in forward
    x = self.convs[("upconv", i, 0)](x)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 24, in forward
    out = self.conv(x)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/src/odo-bench/sc-sfmlearner/models/DispResNet.py", line 42, in forward
    out = self.conv(out)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 353, in forward
    return self._conv_forward(input, self.weight)
  File "/hdd/senzhang/venv/odo-bench/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 350, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
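For context, this kind of error typically appears under nn.DataParallel when a module creates a tensor or holds a parameter pinned to a fixed device, so replica 1 (whose inputs live on cuda:1) multiplies against weights still on cuda:0. A hedged sketch of the pattern and its usual fix; the `Buggy`/`Fixed` module names are hypothetical and not from the project's code:

```python
import torch
import torch.nn as nn


class Buggy(nn.Module):
    # Anti-pattern: a tensor hard-coded to cuda:0 inside forward().
    # Under nn.DataParallel, replica 1 receives `x` on cuda:1, so the
    # multiply mixes devices and raises the error quoted above.
    def forward(self, x):
        mask = torch.ones(x.shape, device="cuda:0")
        return x * mask


class Fixed(nn.Module):
    # Fix: derive the device from the input so every replica stays
    # self-consistent, whichever GPU it was placed on.
    def forward(self, x):
        mask = torch.ones(x.shape, device=x.device)
        return x * mask


x = torch.randn(2, 3)   # CPU tensor, enough to exercise the pattern
out = Fixed()(x)
print(out.device)       # always matches the input's device
```

Whether this specific pattern is the cause here depends on the model code; see the linked issue for the project's own recommendation.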
@JiawangBian
Owner

See the below issue

#48

@SenZHANG-GitHub
Author

See the below issue

#48

Thanks : ) Exactly the answer I'm looking for!

@SenZHANG-GitHub
Author

SenZHANG-GitHub commented Jul 21, 2020

See the below issue

#48

An update:

As pointed out in #48, parallel training is unnecessary for a batch size of 4. In my tests it actually slows training down because of the data-transfer overhead between GPUs.

For an epoch size of 1000 on V100 GPUs:

  • Two GPUs: 19 min for training and 5 min for evaluation per epoch
  • Single GPU: 15 min for training and 4 min for evaluation per epoch

Just for reference.
