CUDA errors #50
Comments
This may have something to do with the change to cuda version 8. Switching back to cudacap 5 runs jobs without CUDA errors, and the occurrence of NaNs during acoustic sims may also be fixed (#48). By chance, the following cuda 8 GPU gave me errors even during water simulations. Perhaps it is broken? I don't know why.
Now regularly getting this when trying to run with qsub (…), and also with SLURM (…).
@jkosciessa based on the error messages in the comment above (e.g. …), it could be a driver issue, or you are loading a too-new CUDA library. It is a complex compatibility match between the kernel driver, the CUDA library, and the user application to make a GPU application work. We haven't upgraded the NVIDIA Linux kernel driver since the node was first installed, and newly installed GPU nodes might have a newer kernel driver. This is why I was asking on which compute node you run into this issue. You should be able to tell the compute node by running …
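As a minimal sketch of how a job could record this itself (the exact commands are an assumption, not documented cluster tooling), printing the host and driver at the start of the job script makes it possible to match failing runs to specific nodes:

```bash
# Sketch: log the compute node and NVIDIA kernel driver from inside a job,
# so failing runs can be matched to specific hosts and driver versions.
hostname -f                                               # e.g. dccn-cXXX.dccn.nl
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
head -n 1 /proc/driver/nvidia/version                     # kernel module build line
```

The nvidia-smi banner also reports the highest CUDA version the installed driver supports, which is the number the loaded CUDA library has to stay below.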
@hurngchunlee It would indeed explain the observed patterns if some jobs get launched on compute nodes with varying drivers etc., some of which MATLAB takes offense at. Here are some nodes on which jobs successfully loaded a GPU, and some that crashed: Success: … Fail: …
@hurngchunlee Looking at multiple output logs, the above hosts very reliably either detect the GPU or crash. In most crashes, …
@hurngchunlee Here is a script that an admin could use to check for systematic differences between those nodes. Can someone directly access the nodes without scheduling them first? Or directly address the desired node in a job?
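As a rough sketch of what such a per-node check could look like (this is not the script referenced above; it assumes direct ssh access and uses the node names from this thread as examples):

```bash
#!/bin/bash
# Hypothetical per-node check: compare kernel, NVIDIA driver, and GPU state
# across hosts that behaved differently in this thread.
for node in dccn-c077 dccn-c078; do
    echo "== ${node} =="
    ssh "${node}" '
        uname -r
        head -n 1 /proc/driver/nvidia/version
        nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
    '
done
```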
Weird ... the hardware and configuration on dccn-c078 are identical to dccn-c077, so I don't see why it runs fine on dccn-c077 but fails on dccn-c078. Do you know which CUDA library you are using? On those nodes (and Torque GPU nodes in general), the kernel driver supports only up to CUDA 11.2. If you happen to use a newer version, you will get a CUDA error.
@hurngchunlee Interesting, according to … But this is also the case for the successful deployments.
Here is the result: …
In my environment, I don't load CUDA by default; therefore "nvcc" is not available.
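For reference, a minimal sketch of how a user could check which toolkit a job picks up versus what the driver supports (the module name is an assumption for this cluster):

```bash
# Sketch: load a CUDA toolkit module and compare toolkit vs. driver support.
module load cuda/11.2     # assumed module name; must not exceed what the driver supports
nvcc --version            # toolkit (compiler) version user code compiles against
nvidia-smi | head -n 4    # banner shows the highest CUDA version the driver supports
```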
@hurngchunlee Here is an updated file that loads the …
I don't know what DriverVersion it refers to. If it is about the kernel driver version, it is something like …
@hurngchunlee Ok, some clarity: it seems that MATLAB R2024a dropped support for the currently installed … This still doesn't solve the mystery of …
MATLAB R2022a should then be the most robust version for use with GPUs with drivers and toolkit (i.e., supplementary tools) at …
It does make sense to read the Release Notes 😅.
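As a quick per-release check (a sketch; the module-naming convention follows the job logs in this thread, and gpuDevice is the standard Parallel Computing Toolbox query):

```bash
# Sketch: per-release GPU smoke test from the command line.
module load matlab/R2022b
matlab -nodisplay -batch \
  "d = gpuDevice; fprintf('%s | driver %.1f | toolkit %.1f\n', d.Name, d.DriverVersion, d.ToolkitVersion)"
```

If the release's supported CUDA version exceeds what the kernel driver provides, this call is where the error should surface.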
Any new setup will be applied to Slurm. The Torque cluster is managed as-is until we migrate the full cluster to Slurm. There is no plan to perform any upgrade on the current Torque cluster because the OS running on it is already EOL. On Slurm, the OS is new and the driver supports up to CUDA 12.2. I would encourage using Slurm. You mentioned that there is a limit of 2 GPUs per user; how many concurrent GPUs do you need? We could temporarily increase it to 4 for you, but I think there should be a limit to avoid one user blocking all GPUs (especially since in Slurm we have only 11 GPUs at the moment).
Could be ... but I cannot guarantee it. I have never run any GPU programs through MATLAB.
@hurngchunlee That's good to hear... so MATLAB R2024a is supported on SLURM then, which explains why our jobs all run fine there. I wouldn't create special rules for me right now. There will be multiple people running GPU-dependent simulations soon, hence a shared scheduling bottleneck. That's why we need to create some recommendations on what combinations work. I am in full favor of migrating GPUs toward SLURM. We can for now try to specify …
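For the Slurm side, a minimal submission sketch (job name and module names are assumptions; the per-user GPU limit discussed above still applies):

```bash
#!/bin/bash
#SBATCH --job-name=tusim_gpu
#SBATCH --gres=gpu:1           # one GPU; per-user limit discussed above applies
#SBATCH --mem=8G
#SBATCH --time=01:00:00
module load matlab/R2022b      # pin the MATLAB release known to match the driver
matlab -nodisplay -batch "disp(gpuDevice)"
```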
If you need us to increase it, please send a ticket to [email protected] with me in c.c. |
@hurngchunlee On R2022b (CUDA driver spec 11.2), GPU jobs run fine except for node dccn-c078.
@jkosciessa thanks for pinning it down to this particular node. I also checked it with a GPU sample program from Nvidia. The same CUDA executable runs fine on …
@jkosciessa after restarting the server dccn-c078, the issue appears to be resolved.
@hurngchunlee That's good to hear. Is there a way to specifically schedule this server? If not, we can look out for it in new jobs. @sirmrmarty could you look out for this server in new jobs?
You could add nodes=dccn-c078.dccn.nl to the resource request of your job.
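For example, a sketch of such a submission (the script name run_simulation.sh is hypothetical; the resource values mirror the "Asked resources" line in the log below):

```bash
# Sketch: pin a Torque/PBS job to the restarted node by naming it explicitly.
qsub -l nodes=dccn-c078.dccn.nl:gpus=1,mem=8gb,walltime=01:00:00 run_simulation.sh
```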
Hey @hurngchunlee and @jkosciessa, I ran a few things on the cluster and specified node=dccn-c078.dccn.nl. The transducer positioning scripts didn't run successfully (error below). The simulations seem to run for now (I will give an update if that changes). Here is an output file of a transducer positioning run, but I assume the error is based on the script, not CUDA:
----------------------------------------
Begin PBS Prologue Wed Oct 9 13:16:38 CEST 2024 1728472598
Job ID: 54309434.dccn-l029.dccn.nl
Username: marwim
Group: neuromod
Asked resources: nodes=1:gpus=1,mem=8gb,walltime=01:00:00,neednodes=1:gpus=1
Queue: short
Nodes: dccn-c078.dccn.nl
----------------------------------------
Limiting memory+swap to 9126805504 bytes ...
End PBS Prologue Wed Oct 9 13:16:38 CEST 2024 1728472598
----------------------------------------
Starting matlab/R2022b
Executing /opt/matlab/R2022b/bin/matlab -singleCompThread -nodisplay -batch tp39b22298_7ae2_4afd_b003_e44c04711a39
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/functions/../functions
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/functions/../toolboxes and subfolders
Adding /project/2424103.01/thalstim_simulations/thalstim_sim/tools/PRESTUS/toolboxes/k-wave/k-Wave
Current target: left_PUL
status =
0
result =
0x0 empty char array
simnibs_coords =
-16.2782 -25.9278 7.4805
target =
91 97 155
ans =
Columns 1 through 7
181.9321 -5.9618 236.1018 31.4229 172.0989 8.2863 226.2686
Column 8
45.6710
----------------------------------------
Begin PBS Epilogue Wed Oct 9 13:17:08 CEST 2024 1728472628
Job ID: 54309434.dccn-l029.dccn.nl
Job Exit Code: 1
Username: marwim
Group: neuromod
Job Name: tusim_tp_sub-002
Session: 11348
Asked resources: nodes=1:gpus=1,mem=8gb,walltime=01:00:00,neednodes=1:gpus=1
Used resources: cput=00:00:19,walltime=00:00:26,mem=2975182848b
Queue: short
Nodes: dccn-c078.dccn.nl
End PBS Epilogue Wed Oct 9 13:17:08 CEST 2024 1728472628
----------------------------------------
Here is the error log:
Caught "std::exception"
Exception message is:
merge_sort: failed to synchronize
The full simulation appears to have completed without error; perhaps the problem above arises from rerunning the script over existing outputs? I will close this issue for now. With the updated documentation and the node debugging, I am optimistic that the jobs should now consistently run with qsub again (as well as with SLURM).
I encounter occasional CUDA errors during acoustic simulations. I find this error hard to debug, because a comparable simulation in CPU mode appears to run without problems.