No available image for pytorch on NVIDIA driver 418 #95
Comments
I used the "Step-by-Step Tutorial" (Single Machine Training).
Which GPU are you using? Can you paste the full docker command you used to start the container?

Once you install a new PyTorch version, you must uninstall byteps and install it again. So it's expected that your 2nd case does not work.

Can you verify that, in your 1st and 3rd cases, PyTorch does not work even without BytePS? I have seen people hit the same issue with the RTX 2080 GPU: https://discuss.pytorch.org/t/a-error-when-using-gpu/32761/14 We also had a similar issue before, but it was later resolved.

For your 4th and 5th cases, did you try starting with just 1 GPU? Does it still deadlock?
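A minimal sketch of the "PyTorch without BytePS" check for the 1st and 3rd cases, to be run inside the same container. The function name `check_pytorch_cuda` and the report layout are made up for illustration; only standard `torch` calls are used, and BytePS is never imported.

```python
def check_pytorch_cuda():
    """Report whether plain PyTorch (no BytePS imported) can use the GPU."""
    report = {
        "torch_imported": False,
        "torch_version": None,
        "cuda_version": None,
        "cuda_available": False,
        "error": None,
    }
    try:
        import torch
    except (ImportError, OSError) as exc:
        # PyTorch itself is missing or broken in this environment.
        report["error"] = "import failed: {}".format(exc)
        return report

    report["torch_imported"] = True
    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["cuda_version"] = torch.version.cuda

    try:
        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            # Force real CUDA work: allocate on GPU 0 and reduce.
            x = torch.ones(4, device="cuda:0")
            report["gpu_sum"] = float((x * 2).sum().item())
            report["device_name"] = torch.cuda.get_device_name(0)
    except RuntimeError as exc:
        # A "cuda runtime error" here would point at PyTorch/driver/GPU,
        # not at BytePS.
        report["error"] = "cuda error: {}".format(exc)
    return report


if __name__ == "__main__":
    print(check_pytorch_cuda())
```

If the `error` field already shows a CUDA runtime error here, the problem is most likely in the PyTorch, CUDA, and driver combination rather than in BytePS.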
Thank you very much for your detailed explanation.
My NVIDIA driver version is 418. I tried to use BytePS with Docker, but it failed.
The behavior in each case is as follows:
1, CUDA 10 + PyTorch 1.0.1 (image: bytepsimage/worker_pytorch_rdma:latest)
error: RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
2, CUDA 10 + PyTorch 1.2 (image: bytepsimage/worker_pytorch_rdma:latest; PyTorch updated inside the container)
error: ImportError: /usr/local/lib/python2.7/dist-packages/byteps-0.1.0-py2.7-linux-x86_64.egg/byteps/torch/c_lib.so: undefined symbol: _ZN2at19UndefinedTensorImpl10_singletonE
3, CUDA 9 + PyTorch 1.0.1 (image: bytepsimage/worker_pytorch:latest)
error: RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405
4, CUDA 10 + PyTorch 1.0.1 (built from Dockerfile.worker.pytorch.cu100; the only change is "apt-get install libcudnn7" without a pinned version)
error: bps.init() never finishes (deadlock)
5, CUDA 10 + PyTorch 1.2 (built from Dockerfile.worker.pytorch.cu100; changed libcudnn7 and the Python version)
error: bps.init() never finishes (deadlock)
So it seems that the PyTorch version does not work, while the TensorFlow version works well.
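For cases 4 and 5, one way to tell a real deadlock from a slow start is to run the init step in a child process with a timeout, restricted to a single GPU. This is only a sketch under assumptions: the helper `run_with_timeout` is hypothetical, the 120-second limit is arbitrary, it needs Python 3, and it assumes the BytePS environment variables from the tutorial are already exported in the shell, since the child process inherits them.

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, visible_gpus="0"):
    """Run cmd limited to the given GPUs.

    Returns 'finished' (exit code 0), 'failed' (non-zero exit code),
    or 'timeout' (still running after timeout_s seconds; the child is killed).
    """
    env = dict(os.environ)
    # Standard CUDA variable that limits which GPUs the process can see.
    env["CUDA_VISIBLE_DEVICES"] = visible_gpus
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "finished" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Assumption: BytePS is installed for this interpreter and its
    # environment variables are set as in the tutorial.
    init_cmd = [sys.executable, "-c", "import byteps.torch as bps; bps.init()"]
    print(run_with_timeout(init_cmd, timeout_s=120))
```

A result of `timeout` with one visible GPU would suggest the hang does not depend on the number of GPUs, while `failed` means the process exited with an error instead of deadlocking.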