CUDA error: uncorrectable ECC error encountered #21
-
Hello, I have a dataset of 50 data points in xyz and npz format. When I used the nequip-train command to train the model, I got the error "RuntimeError: CUDA error: uncorrectable ECC error encountered". I would very much appreciate it if anyone could take a look at my input files and help me with this problem. Files are attached.
Replies: 1 comment 2 replies
-
Hi @tienmng, thanks for your questions, a few points below:
According to Stack Overflow and this GitHub discussion, the error you're seeing is related to a hardware issue. In the GitHub discussion, people reported that reinstalling helped fix it. Can you share the full stack trace as well as the details of your install (PyTorch version, etc.)?
The one reason I can see for why you're not getting the error in the non-PBC version is that it may be out-of-memory related. In that case you would need much less memory without PBC, since you're not computing edges across the periodic boundaries. But please share the full stack trace first and try reinstalling. Based on some googling, this sounds hardware related.
One other thing to try, to find out whether it's memory related: can you run the same setup but with an r_max of 5 Å? 8 Å will require quite a bit of memory, and you have a large structure plus a semi-large network.
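To give a rough feel for why the cutoff matters for memory, here is a minimal sketch that counts how many neighbors (edges) one atom has within r_max. It is a toy model, not the structure from this thread: it assumes an infinite simple-cubic lattice with a made-up 2.5 Å spacing, and the function names `neighbors_per_atom` and `cutoff_volume_ratio` are hypothetical helpers that do not come from nequip.

```python
import itertools
import math


def neighbors_per_atom(r_max, spacing=2.5):
    """Count lattice sites within r_max of one atom.

    Assumption: infinite simple-cubic lattice (fully periodic), with a
    hypothetical spacing given in the same length unit as r_max.
    """
    n = int(math.floor(r_max / spacing))
    limit = (r_max / spacing) ** 2  # compare squared distances in lattice units
    count = 0
    for i, j, k in itertools.product(range(-n, n + 1), repeat=3):
        if (i, j, k) == (0, 0, 0):
            continue  # skip the atom itself
        if i * i + j * j + k * k <= limit:
            count += 1
    return count


def cutoff_volume_ratio(r_large, r_small):
    """Continuum estimate: neighbor count scales with the cutoff sphere volume."""
    return (r_large / r_small) ** 3


if __name__ == "__main__":
    for r in (5.0, 8.0):
        print(f"r_max = {r} A: {neighbors_per_atom(r)} neighbors per atom")
    print(f"volume ratio 8 A vs 5 A: {cutoff_volume_ratio(8.0, 5.0):.2f}")
```

Since the number of edges grows roughly with the cube of r_max, going from 5 Å to 8 Å means about four times as many edges per atom in a dense periodic system, which is why a smaller r_max is a quick way to test whether the failure is memory related.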