CUDA Error when pytorch distribution training... #63

Open
liming-ai opened this issue Dec 28, 2020 · 6 comments

@liming-ai

Hi, thanks for your contribution. When I use distributed training (torch.nn.DataParallel), I always get RuntimeError: CUDA error: invalid device function. Here is my test code:

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32

input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)

correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)

model = torch.nn.DataParallel(correlation_sampler, device_ids=[0,1,2]).cuda()

out = model(input1, input2)

print(out.shape)

My environment is:

Ubuntu 18.04.5 LTS
PyTorch -- 1.6.0
torchvision -- 0.7.0
gcc -- 7.5.0
CUDA -- 10.2

The whole error info is:

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    out = model(input1, input2)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/liming/anaconda3/envs/motionsqueeze/lib/python3.8/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1]    20866 segmentation fault (core dumped)  python test.py

For non-distributed (single-GPU) training there is no error, but the script still ends with a segmentation fault:

torch.Size([1, 1, 1, 3, 3])
[1]    22742 segmentation fault (core dumped)  python test.py
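
One way to narrow down where the segmentation fault comes from, as a minimal sketch: Python's standard faulthandler module can print the Python-level traceback when the process crashes. Putting these lines at the very top of test.py is an assumption about how the script is run. It does not fix anything; it only shows which Python call was active when the crash happened.

```python
import faulthandler

# Ask the interpreter to dump the Python traceback of every thread to stderr
# if the process receives a fatal signal such as SIGSEGV.
faulthandler.enable(all_threads=True)

# Sanity check that the handler is installed before the rest of the script runs.
print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without editing the script by running it as python -X faulthandler test.py.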
@ClementPinard
Owner

Hi, what hardware are you using?
It looks like you have different GPUs and the module was built only for the first GPU, whose compute capability differs from that of at least one of your other two GPUs.

See an interesting PR about it here
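
A quick way to check this hypothesis, as a rough sketch: list the compute capability of every visible GPU and compare each one against GPU 0. The helper names find_mismatched_gpus and query_capabilities are made up for this example; torch.cuda.get_device_capability is the PyTorch call that returns the (major, minor) pair.

```python
def find_mismatched_gpus(capabilities):
    """Return indices of GPUs whose compute capability differs from GPU 0.

    `capabilities` is a list of (major, minor) tuples, one per device.
    """
    if not capabilities:
        return []
    reference = capabilities[0]
    return [i for i, cap in enumerate(capabilities) if cap != reference]


def query_capabilities():
    """Collect (major, minor) for every visible CUDA device, or [] if unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    caps = query_capabilities()
    for i, (major, minor) in enumerate(caps):
        print(f"GPU {i}: compute capability {major}.{minor}")
    print("devices differing from GPU 0:", find_mismatched_gpus(caps))
```

If the list of differing devices is empty, all GPUs share one compute capability and a per-architecture build mismatch between devices is unlikely to be the cause.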

@liming-ai
Author

Hi, I have three NVIDIA 1080 Ti GPUs, and I am sure they have the same compute capability. Here is my GPU info:

[image: GPU info screenshot]

@ClementPinard
Owner

OK, so that is not the problem.

I just tested your code on my computer, which has a single 1080 Ti, and I didn't get the "segmentation fault" at the end of your script.

How did you install the correlation module? From pip? From source?

It might not be the root cause, but I can only advise you to upgrade to PyTorch 1.7 for now and try installing from this repo with setup.py.

@liming-ai
Author

Thanks a lot! I installed the correlation module from pip. I will upgrade PyTorch to 1.7 tomorrow and report back!

@liming-ai
Author

liming-ai commented Dec 29, 2020

Hi @ClementPinard, I installed PyTorch 1.7.1 and then used pip to install the module. There was no warning or error during installation, but I cannot import the module:

ImportError: /home/liming/anaconda3/envs/ms/lib/python3.8/site-packages/spatial_correlation_sampler_backend.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_11DispatchKeyE
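
To see what the missing symbol refers to, a minimal sketch: demangle it with the c++filt tool from binutils, if that tool happens to be installed (an assumption; the helper below returns the symbol unchanged otherwise). The c10 prefix in the mangled name points at PyTorch's core library. An undefined symbol from that library often means the compiled extension was built against a different PyTorch version than the one now installed, though this is a guess and not a confirmed diagnosis for this setup.

```python
import shutil
import subprocess

SYMBOL = "_ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_11DispatchKeyE"


def demangle(symbol):
    """Demangle a C++ symbol with c++filt; return it unchanged if the tool is missing."""
    tool = shutil.which("c++filt")
    if tool is None:
        return symbol
    result = subprocess.run(
        [tool, symbol], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(demangle(SYMBOL))
```

If the version-mismatch guess is right, rebuilding the extension in the same environment as the installed PyTorch, without reusing a cached build, would be the thing to try.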

When I install the module via python setup.py install, the same error as before happens. For distributed training:

Traceback (most recent call last):
  File "test.py", line 24, in <module>
    out = model(input1, input2)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/liming/anaconda3/envs/ms/lib/python3.7/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: invalid device function
[1]    23818 segmentation fault (core dumped)  python test.py

and for single-GPU training:

torch.Size([1, 1, 1, 3, 3])
[1]    22410 segmentation fault (core dumped)  python test.py

@liming-ai
Author

When I use PyTorch 1.1 and install via python setup.py install, there is a different error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/liming/anaconda3/envs/1/lib/python3.7/site-packages/spatial_correlation_sampler-0.3.0-py3.7-linux-x86_64.egg/spatial_correlation_sampler/__init__.py", line 1, in <module>
    from .spatial_correlation_sampler import SpatialCorrelationSampler, spatial_correlation_sample
  File "/home/liming/anaconda3/envs/1/lib/python3.7/site-packages/spatial_correlation_sampler-0.3.0-py3.7-linux-x86_64.egg/spatial_correlation_sampler/spatial_correlation_sampler.py", line 6, in <module>
    import spatial_correlation_sampler_backend as correlation
ImportError: libtorch.so: cannot open shared object file: No such file or directory

It is really odd and I do not know how to deal with it. Could you provide some suggestions?
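
A hedged sketch for checking whether the PyTorch installation in the active environment actually ships the shared library named in the ImportError. The helper names torch_lib_dir and has_library are hypothetical. The sketch assumes PyTorch keeps its shared libraries in a lib directory next to the package's __init__.py, which is the usual layout for pip and conda installs but may not hold for every build.

```python
import importlib.util
from pathlib import Path


def torch_lib_dir():
    """Return the assumed <torch package>/lib directory, or None if torch is not installed."""
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "lib"


def has_library(lib_dir, name):
    """Report whether a file called `name` exists inside `lib_dir`."""
    return lib_dir is not None and (Path(lib_dir) / name).exists()


if __name__ == "__main__":
    lib_dir = torch_lib_dir()
    print("torch lib directory:", lib_dir)
    print("libtorch.so present:", has_library(lib_dir, "libtorch.so"))
```

If the file is present but the import still fails, the dynamic loader is probably not searching that directory when the backend is loaded. If it is absent, the extension was likely built against a different PyTorch installation than the one in this environment.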
