How to debug "Can't find an address, check slurm.conf" errors when launching jobs?
#1591
Replies: 3 comments
-
Thank you for the detailed description! I have been unable to reproduce the issue so far (I'm running into unrelated errors, e.g. quota limits). I will continue to look into this on Monday.
-
Not sure if your cluster is still around, but here is what could be done:
scontrol show node alpha5-ultra-ghpc-28
sudo grep alpha5-ultra-ghpc /slurm/scripts/log/slurmsync.log
We could try to reset the node by manually killing the instance, waiting until it's gone, then executing
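Since the error mentions an address, one thing that may be worth checking in the scontrol output is which address Slurm has recorded for the node. The following is a minimal sketch, assuming the usual KEY=VALUE layout of `scontrol show node` output; `node_fields` is a hypothetical helper name, not part of Slurm, and the choice of fields is only a guess at what is relevant here.

```shell
#!/usr/bin/env bash
# Sketch: pull the address-related fields out of `scontrol show node` output.
# node_fields is a hypothetical helper, not a Slurm command.
node_fields() {
  # Put each KEY=VALUE pair on its own line, then keep only the fields of interest.
  tr -s ' ' '\n' | grep -E '^(NodeName|NodeAddr|NodeHostName|State)='
}

# Usage on the cluster (node name taken from the commands above):
#   scontrol show node alpha5-ultra-ghpc-28 | node_fields
```

If NodeAddr still shows only the node name rather than an IP address for a node that should be running, that might be a hint that the address was never updated after the instance came up, but this is an assumption to verify against the slurmsync log rather than a confirmed diagnosis.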
-
Closing the discussion; please reopen if new info becomes available.
-
Our .yaml file is here
Trying a job with --nodes=10 works if the job is small enough to launch on static nodes, but fails when dynamic nodes are used. Any tips on how to troubleshoot this?
After waiting 10 mins for nodes to allocate, the failure looks like this
Things I checked: