This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Add a "qgaps" utility #6

Open
sleak-lbl opened this issue Apr 5, 2018 · 0 comments

@sleak-lbl
Member

How many nodes are available now, and for how long?
(and maybe: which partitions are they in?)

I.e., at any time there should be some idle nodes, because the next job scheduled on them needs more nodes to become free before it can start, so we should be able to see what size job could backfill right now.

Some mechanisms:

  • scontrol show -o nodes | grep State=IDLE gets the list of currently-idle nodes
    • fields of interest: NodeName, Partitions, and maybe ActiveFeatures
  • For scheduled start times and node lists:
    SLURM_TIME_FORMAT='%s' squeue -t PD -O jobid,partition,state,starttime,schednodes:50 | awk '$5!="(null)" { print }'
    (the schednodes field needs to be wide enough that the whole node list fits)
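A minimal sketch of turning the squeue output above into (start time, node list) pairs. The sample row is made up, and the hostlist expansion here only handles a single bracketed range or comma list (real Slurm hostlists can be more complex):

```python
import re

def expand_hostlist(hosts):
    """Expand a simple Slurm hostlist like 'nid[00010-00012]' into
    individual node names. Only handles one [..] group; a real
    implementation would use scontrol show hostnames or python-hostlist."""
    m = re.match(r'^([^\[]+)\[([^\]]+)\]$', hosts)
    if not m:
        return [hosts]
    prefix, body = m.groups()
    nodes = []
    for part in body.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            width = len(lo)  # preserve zero-padding
            nodes.extend('%s%0*d' % (prefix, width, i)
                         for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(prefix + part)
    return nodes

def parse_squeue(lines):
    """Parse 'jobid partition state starttime schednodes' rows.
    With SLURM_TIME_FORMAT='%s', starttime is epoch seconds."""
    jobs = []
    for line in lines:
        jobid, partition, state, start, sched = line.split()
        if sched == '(null)':   # no scheduled nodes yet
            continue
        jobs.append((int(start), expand_hostlist(sched)))
    return jobs

# Hypothetical sample row:
sample = ["1234 regular PENDING 1522950000 nid[00010-00012]"]
print(parse_squeue(sample))
```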

Output might look like (note that node counts ought to be cumulative):

     timespan   #nodes  partitions
     (none)          1  regular,debug, ..
     2:05:14         8  knl,...
     1:40:01        20

So we need to bin nodes by number of seconds until free:

  • first get a timestamp, the list of idle nodes, and the list of allocated and pending jobs
  • for each node, find the earliest reference to it in the schedule, compute the time until then, and bin by number of seconds
  • then for each bin (starting with the longest), group by partition. For each partition in each bin, count the nodes in this bin and in all longer bins (so counts are cumulative)
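The steps above might be sketched like this; the node and partition names are made up, and it assumes the idle-node list and per-node earliest scheduled start times have already been collected:

```python
from collections import defaultdict

def bin_gaps(now, idle_nodes, node_partitions, earliest_start):
    """Bin idle nodes by seconds-until-needed, then build cumulative
    per-partition counts, longest gap first.

    idle_nodes:      names of currently-idle nodes
    node_partitions: node name -> partition name
    earliest_start:  node name -> epoch seconds of the earliest
                     scheduled job referencing it (absent = never needed)
    """
    bins = defaultdict(list)   # seconds-until-needed (None = never) -> nodes
    for node in idle_nodes:
        start = earliest_start.get(node)
        bins[None if start is None else start - now].append(node)

    # Longest gap first; None (never needed) sorts before everything.
    order = sorted(bins, key=lambda g: (g is not None, -(g or 0)))
    rows, cumulative = [], defaultdict(int)
    for gap in order:
        for node in bins[gap]:
            cumulative[node_partitions[node]] += 1
        # counts accumulate from longer bins, matching the sample output
        rows.append((gap, sum(cumulative.values()), dict(cumulative)))
    return rows

# Hypothetical data: three idle nodes, one never referenced in the schedule.
for gap, total, parts in bin_gaps(
        1000, ['n1', 'n2', 'n3'],
        {'n1': 'regular', 'n2': 'knl', 'n3': 'knl'},
        {'n2': 9514, 'n3': 7000}):
    print(gap, total, parts)
```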

How to test it (in terms of writing unit tests)?
sinfo -h -o '%F %b %N' gets counts of nodes by state (allocated/idle/other/total) plus active features
("other" means down/drained/etc)
