This repository has been archived by the owner on Jan 22, 2024. It is now read-only.
How many nodes are available now, for how long?
(and maybe, which partitions are they in)
i.e. at any time there should be some idle nodes, because the next job scheduled on them needs more nodes to become free before it can start; so we should be able to see what size of job can backfill right now
Some mechanisms:
scontrol show -o nodes | grep State=IDLE .. gets the list of currently-idle nodes; from each record we want NodeName, Partitions, and maybe ActiveFeatures
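As a sketch of that extraction, assuming the one-record-per-line format that scontrol's -o flag produces (the node names and sample lines below are hypothetical, not real cluster output):

```python
import re

def idle_nodes(scontrol_output):
    """Extract (NodeName, partition list) for nodes whose State is IDLE
    from `scontrol show -o nodes` output (one node per line)."""
    nodes = []
    for line in scontrol_output.splitlines():
        # Substring match mirrors the grep above (so it would also catch
        # states like IDLE+DRAIN; filter further if that matters).
        if "State=IDLE" not in line:
            continue
        name = re.search(r"NodeName=(\S+)", line)
        parts = re.search(r"Partitions=(\S+)", line)
        if name:
            nodes.append((name.group(1),
                          parts.group(1).split(",") if parts else []))
    return nodes

# Hypothetical sample of the one-line-per-node format:
sample_scontrol = (
    "NodeName=cn01 Partitions=batch State=IDLE\n"
    "NodeName=cn02 Partitions=batch,debug State=ALLOCATED\n"
    "NodeName=cn03 Partitions=debug State=IDLE\n"
)
print(idle_nodes(sample_scontrol))
```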
For scheduled start times and nodelists: SLURM_TIME_FORMAT='%s' squeue -t PD -O jobid,partition,state,starttime,schednodes:50 | awk '$5!="(null)" { print }'
(need to make the schednodes field wide enough that the whole node list fits)
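Parsing that squeue output could look like the following sketch; the sample rows, job IDs, and node lists are invented, and the scheduled node list is kept as the compressed hostlist string squeue prints:

```python
def pending_schedule(squeue_output):
    """Parse `squeue -t PD -O jobid,partition,state,starttime,schednodes`
    output (with SLURM_TIME_FORMAT='%s', so start times are epoch seconds),
    keeping only rows whose scheduled node list is known — the same filter
    as the awk '$5!="(null)"' above."""
    jobs = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "JOBID":  # skip header/odd rows
            continue
        jobid, partition, state, start, nodes = fields
        if nodes == "(null)" or not start.isdigit():
            continue
        jobs.append({"jobid": jobid, "partition": partition,
                     "start": int(start), "nodes": nodes})
    return jobs

# Hypothetical sample output (epoch start times, compressed node lists):
sample_squeue = (
    "JOBID PARTITION STATE START_TIME SCHEDNODES\n"
    "101 batch PENDING 1700000600 cn[01-02]\n"
    "102 batch PENDING 1700000300 (null)\n"
)
print(pending_schedule(sample_squeue))
```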
Output might look like a table of node counts binned by time-until-free (note that node counts ought to be cumulative: a node free for an hour is also free for five minutes)
So we need to bin nodes by number of seconds until free
so first get a timestamp, the list of idle nodes, and the list of allocated and pending jobs
for each node, find the earliest reference to it in the schedule, compute the time until then, and bin it by that number of seconds
then for each bin (starting with the longest), split by partition; for each partition in each bin, count how many nodes are in this and all longer bins
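The three steps above could be sketched like this; the bin edges, data shapes, and node names are my own assumptions, not anything Slurm defines:

```python
from collections import defaultdict

def backfill_bins(now, idle, schedule, bin_edges=(300, 3600, 86400)):
    """For each idle node, find the earliest scheduled job that claims it;
    the node is free for (start - now) seconds. Bin nodes by that duration,
    then report per-partition cumulative counts: a node in a longer bin
    also counts toward every shorter bin."""
    # Step 1/2: earliest claim time per node from the pending-job schedule.
    claimed = {}
    for job in schedule:
        for node in job["nodes"]:
            claimed[node] = min(claimed.get(node, job["start"]), job["start"])
    # Bin by seconds-until-claimed; never-claimed nodes land in the last bin.
    counts = defaultdict(lambda: [0] * (len(bin_edges) + 1))
    for name, partitions in idle:
        free_for = claimed.get(name, float("inf")) - now
        i = sum(free_for >= edge for edge in bin_edges)  # bin index
        for part in partitions:
            counts[part][i] += 1
    # Step 3: cumulative counts, summing from the longest bin downward.
    result = {}
    for part, bins in counts.items():
        cum, total = [], 0
        for c in reversed(bins):
            total += c
            cum.append(total)
        result[part] = list(reversed(cum))
    return result

# Hypothetical inputs: three idle nodes, one pending job claiming cn01 in 400s.
idle = [("cn01", ["batch"]), ("cn02", ["batch"]), ("cn03", ["debug"])]
schedule = [{"start": 1400, "nodes": ["cn01"]}]
print(backfill_bins(1000, idle, schedule))
```

Here cn01 is only free for 400 seconds, so it drops out of the hour-and-longer bins, while the unclaimed nodes count toward every bin.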
How to test it (in terms of writing unit tests)?
sinfo -h -o '%F %b %N' to get allocated/idle/other/total node counts for each group of nodes
("other" means down/drained/etc)