This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Add a "qgaps" utility #6

Open
sleak-lbl opened this issue Apr 5, 2018 · 0 comments

@sleak-lbl
Member

How many nodes are available now, and for how long?
(and maybe: which partitions are they in?)

I.e., at any time there should be some idle nodes, because the next job scheduled on them needs more nodes to become free before it can start, so we should be able to see what size job could backfill right now.

Some mechanisms:

  • scontrol show -o nodes | grep State=IDLE gets the list of currently-idle nodes
    • fields of interest: NodeName, Partitions, and maybe ActiveFeatures
  • For scheduled start times and node lists:
    SLURM_TIME_FORMAT='%s' squeue -t PD -O jobid,partition,state,starttime,schednodes:50 | awk '$5!="(null)" { print }'
    (the schednodes field needs to be wide enough that the whole node list fits)
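A minimal sketch of turning the squeue output above into (start time, node list) pairs. The sample row is made up, and the hostlist expansion here only handles a single bracketed range or comma list (real Slurm hostlists can be more complex):

```python
import re

def expand_hostlist(hosts):
    """Expand a simple Slurm hostlist like 'nid[00010-00012]' into
    individual node names. Only handles one [..] group; a real
    implementation would use scontrol show hostnames or python-hostlist."""
    m = re.match(r'^([^\[]+)\[([^\]]+)\]$', hosts)
    if not m:
        return [hosts]
    prefix, body = m.groups()
    nodes = []
    for part in body.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            width = len(lo)  # preserve zero-padding
            nodes.extend('%s%0*d' % (prefix, width, i)
                         for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(prefix + part)
    return nodes

def parse_squeue(lines):
    """Parse 'jobid partition state starttime schednodes' rows.
    With SLURM_TIME_FORMAT='%s', starttime is epoch seconds."""
    jobs = []
    for line in lines:
        jobid, partition, state, start, sched = line.split()
        if sched == '(null)':   # no scheduled nodes yet
            continue
        jobs.append((int(start), expand_hostlist(sched)))
    return jobs

# Hypothetical sample row:
sample = ["1234 regular PENDING 1522950000 nid[00010-00012]"]
print(parse_squeue(sample))
```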

Output might look like (note that node counts ought to be cumulative):

     timespan   #nodes  partitions
     (none)          1  regular,debug, ..
     2:05:14         8  knl,...
     1:40:01        20

So we need to bin nodes by number of seconds until free:

  • first get a timestamp, the list of idle nodes, and the list of allocated and pending jobs
  • for each node, find the earliest reference to it in the schedule, compute the time until then, and bin by number of seconds
  • then for each bin (starting with the longest), group by partition. For each partition in each bin, count the nodes in this bin and in all longer bins (so counts are cumulative)
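The steps above might be sketched like this; the node and partition names are made up, and it assumes the idle-node list and per-node earliest scheduled start times have already been collected:

```python
from collections import defaultdict

def bin_gaps(now, idle_nodes, node_partitions, earliest_start):
    """Bin idle nodes by seconds-until-needed, then build cumulative
    per-partition counts, longest gap first.

    idle_nodes:      names of currently-idle nodes
    node_partitions: node name -> partition name
    earliest_start:  node name -> epoch seconds of the earliest
                     scheduled job referencing it (absent = never needed)
    """
    bins = defaultdict(list)   # seconds-until-needed (None = never) -> nodes
    for node in idle_nodes:
        start = earliest_start.get(node)
        bins[None if start is None else start - now].append(node)

    # Longest gap first; None (never needed) sorts before everything.
    order = sorted(bins, key=lambda g: (g is not None, -(g or 0)))
    rows, cumulative = [], defaultdict(int)
    for gap in order:
        for node in bins[gap]:
            cumulative[node_partitions[node]] += 1
        # counts accumulate from longer bins, matching the sample output
        rows.append((gap, sum(cumulative.values()), dict(cumulative)))
    return rows

# Hypothetical data: three idle nodes, one never referenced in the schedule.
for gap, total, parts in bin_gaps(
        1000, ['n1', 'n2', 'n3'],
        {'n1': 'regular', 'n2': 'knl', 'n3': 'knl'},
        {'n2': 9514, 'n3': 7000}):
    print(gap, total, parts)
```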

How to test it (in terms of writing unit tests)?
sinfo -h -o '%F %b %N' gets counts of nodes by state (allocated/idle/other/total) plus active features
("other" means down/drained/etc)
