Is it possible to modify the slurm.conf files on the compute nodes of a "hpc-slurm"-based cluster? #2559
Replies: 9 comments 1 reply
-
Hi @noahharrison64,
I don't know your use case, but would using an "exclusive" partition work for you?
Manually modifying slurm.conf on the running nodes isn't recommended. If you want to have a custom slurm.conf, you can supply your own template to the controller module via the slurm_conf_tpl setting.
"Re-terraforming" is the preferable way of making modifications (though it doesn't always work as expected).
-
Hi @mr0re1, Thanks for your reply.
I don't see why this wouldn't work for my use case! How would you go about achieving this? My current toolkit blueprint is based off the hpc-slurm blueprint.
So would I create a custom slurm.conf file based on the existing one but with RebootProgram set, save it in my Cloud Shell workspace, and supply the path to this file as the variable slurm_conf_tpl?
Cheers,
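For example, the copied template might differ from the stock one only by the reboot line mentioned above (sketch; everything else in the template left as shipped):

```
# Added to a copy of the default slurm.conf template; all other lines unchanged.
RebootProgram=/sbin/reboot
```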
-
Giving it a second thought, an "exclusive" partition may not be the best solution for your problem; there would still be a chance of consecutive execution of jobs (back-to-back) on the same node.
Yes, please either use an absolute path to the file or stage it so it is available at deployment time.
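For reference, the "exclusive" partition discussed here corresponds roughly to a partition setting like the following (sketch only, assuming the SlurmGCP V5 partition module; ids are placeholders):

```yaml
# Sketch only: exclusive gives jobs dedicated nodes (no sharing between
# concurrently running jobs), but as noted above it does not guarantee a
# fresh node for every job.
- id: compute_partition
  source: community/modules/compute/schedmd-slurm-gcp-v5-partition
  settings:
    partition_name: compute
    exclusive: true
```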
-
Hi @mr0re1, Thanks for the info. I'm hoping I've managed to implement a fix that avoids the need to reboot. I'm running some tests over the weekend; if they fail, I'll have a look at the advice you've given in more detail so I can properly modify the compute node conf files.
Thanks,
-
Hi @mr0re1, Do you suggest supplying the slurm_conf_tpl filepath in the blueprint / slurm-controller / settings section? Will this automatically be propagated to the compute nodes' slurm.conf files even if we just update the slurm-controller module?
Also, is it possible to save the current state of my compute cluster login node (i.e. the file system) and then load this when I re-terraform the new compute partition group (which I assume will be necessary, since the controller node will need updating)?
-
Also, is the ghpc_stage function available now?
-
Not yet, it will be available in the next release, though you can use an absolute path in the meantime. I didn't realize you use SlurmGCP V5; I would recommend switching to V6. The reconfigure in V6 works much better.
Yes, the slurm_conf_tpl filepath goes in the blueprint / slurm-controller / settings section.
Yes, it will be propagated to the compute nodes. Please let us know if it doesn't WAI (work as intended); once again, I would advise switching to V6.
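Once ghpc_stage ships, usage would look roughly like this (sketch only, assuming the V6 controller module accepts slurm_conf_tpl the same way V5 does; the file name is a placeholder):

```yaml
# Sketch only: ghpc_stage copies a local file into the deployment folder so
# modules can reference it relative to the deployment.
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
  settings:
    slurm_conf_tpl: $(ghpc_stage("slurm.conf.tpl"))
```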
-
UPD.
-
Thanks @mr0re1
I more meant: is it possible to save the current state of my compute cluster before re-terraforming and making a new cluster? I'd like to be able to directly transfer the files that already exist onto my new login node when it is created. If I understand correctly, your suggestion would only allow me to save the state of any new clusters created with "disk_auto_delete: false", rather than the one I have running at the moment.
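For reference, the setting mentioned above would look roughly like this in a blueprint (sketch only, assuming the SlurmGCP V5 login module exposes disk_auto_delete; ids are placeholders):

```yaml
# Sketch only: keeps the login node's boot disk when the VM is destroyed,
# so its contents can be copied or reattached to a new instance later.
- id: slurm_login
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-login
  use: [network1, slurm_controller]
  settings:
    disk_auto_delete: false
```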
-
Hi,
I'm sure this is quite a naive question since I don't have a deep understanding of slurm or the hpc-toolkit.
I want to force my compute nodes to reboot after each job is completed. I assumed this would be somewhat trivial, and attempted to achieve it by running sbatch with the --reboot flag. I also modified the slurm.conf file on the controller node to include a RebootProgram=/sbin/reboot line. However, when testing this, the jobs appeared to get stuck in a continuous "CF/CONFIGURING" state. Checking the slurmctld.log I saw the following line:
This made me realise I probably need to update the slurm.conf file on the compute nodes. Since these nodes are dynamic and booted from an image when requested, it's unclear to me how I should go about this process. This leads me to 3 questions:
Many thanks,
Noah