Problem with CUDA error 59: Device-side assert triggered #1
Hi André, happy to hear that you are interested in this work! The issue seems to be an underflow that occurs in the convolutions of at least one of the regression heads when float16 is used. In the mentioned demo notebook you can change this line
to
Please let me know if this helps. Best regards,
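The underflow Eric describes can be reproduced outside of PyTorch: any value smaller than half the smallest float16 subnormal (~6e-8) rounds to zero in half precision. A minimal pure-Python sketch using the `struct` module's half-precision format (`"e"`), which follows the same IEEE 754 rounding as float16 tensors:

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

# A typically-sized activation survives the round trip (with some rounding)...
print(to_float16(1e-3))   # ~0.001
# ...but tiny values below the float16 subnormal range flush to zero,
# which is the kind of silent underflow that can break a regression head.
print(to_float16(1e-8))   # 0.0
```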
Thank you very much Eric! It works fine now! It would also be great to find a way to still use AMP, but even without it I can start training the model with promising results. Best regards,
I agree, AMP is nice to have! Best regards,
Even when I set amp=False, running the code from the demo notebooks "demo-binary.ipynb" and "demo-multiclass.ipynb" on the GPU gives me this error (on the CPU it runs without problems):

```
~/anaconda3/envs/py38torch19/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
RuntimeError: transform: failed to synchronize: cudaErrorAssert: device-side assert triggered
```

Can you help me? Thank you very much in advance!
Hi, thanks for posting! Could you please run
It seems to be a problem in the loss calculation of the score head.
@andyco98 Could you try to update to the latest version via
and add the following line after the model definition:
This practically disables AMP for the readout layers, but still allows you to use AMP everywhere else.
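The general pattern behind this suggestion (selectively opting a submodule out of mixed precision) can be sketched as follows. Note this is a hedged, generic sketch, not celldetection's actual API: `Float32Head` and the wrapped `nn.Conv2d` head are made up for illustration, and it assumes PyTorch ≥ 1.10 for the device-agnostic `torch.autocast`:

```python
import torch
import torch.nn as nn

class Float32Head(nn.Module):
    """Wrap a head module so it always computes in float32, even under autocast."""
    def __init__(self, head: nn.Module):
        super().__init__()
        self.head = head.float()  # ensure parameters are float32

    def forward(self, x):
        # Disable autocast locally and upcast the incoming activations,
        # so this head is immune to float16/bfloat16 underflow.
        with torch.autocast(device_type=x.device.type, enabled=False):
            return self.head(x.float())

head = Float32Head(nn.Conv2d(8, 4, 1))  # stand-in for a regression head
x = torch.randn(1, 8, 16, 16)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = head(x)  # runs in float32 despite the surrounding autocast region
print(y.dtype)  # torch.float32
```

The rest of the network still benefits from AMP; only the wrapped head pays the full-precision cost.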
Hello Eric, I also get the same error while trying to run the code in the notebooks on the GPU, but it runs without any issue on the CPU. I have set amp=False in the config. Here is the output from running

```
Collecting environment information...
OS: Ubuntu 20.04.4 LTS (x86_64)
Python version: 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0] (64-bit runtime)
Nvidia driver version: 510.47.03
Versions of relevant libraries:
```

And here is the complete stacktrace

Many thanks for your help in advance!
Hello @tommy2k0, to me it looks like the version combination of PyTorch, CUDA, and cuDNN might be causing this problem. Could you try reinstalling PyTorch with the desired CUDA version using the install command from here?
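After reinstalling, a quick sanity check helps confirm that the installed PyTorch build actually matches the CUDA toolkit and driver. A small sketch using only public `torch` attributes:

```python
import torch

# PyTorch version and the CUDA version it was built against
# (torch.version.cuda is None for CPU-only builds).
print(torch.__version__)
print(torch.version.cuda)

# Whether a compatible driver and GPU are visible at runtime.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.backends.cudnn.version())
```

If `torch.version.cuda` disagrees with the driver's supported CUDA version, device-side asserts and similar runtime errors are a common symptom.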
Seems that was indeed the problem. Many thanks!
Hello Eric!
First, thank you very much for this very interesting work!
I was trying to reproduce the code described in the demo "Cell Detection with Contour Proposal Networks.ipynb" and everything works fine until I start training the model. After 2-3 epochs I get the error:
```
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [37,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:97: operator(): block: [37,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
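As the error message itself notes, device-side asserts are reported asynchronously, so the Python traceback often points at an unrelated call. Setting `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous so the traceback lands on the real offender. A minimal sketch; the variable must be set before CUDA is initialized, i.e. in practice before importing torch (or in the shell before launching Python):

```python
import os

# Must be set before the first CUDA call; an already-initialized
# CUDA context ignores this variable.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ...only now import torch and run the failing code to get an
# accurate stacktrace for the device-side assert.
```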
I have read on the internet that this common PyTorch error can be caused by an indexing problem with the labels, but I was unable to solve it. Do you know how it can be fixed?
Thank you very much in advance!
André