DM-42530: usdf batch updated for latest allocateNodes auto #657


Merged: 2 commits, Feb 22, 2024
36 changes: 19 additions & 17 deletions usdf/batch.rst
@@ -81,11 +81,11 @@ The ``allocateNodes.py`` utility has the following options::
-c CPUS, --cpus CPUS cores / cpus per glidein
-a ACCOUNT, --account ACCOUNT
Slurm account for glidein job
-s QOS, --qos QOS Slurm qos for glidein job
-m MAXIMUMWALLCLOCK, --maximum-wall-clock MAXIMUMWALLCLOCK
maximum wall clock time; e.g., 3600, 10:00:00, 6-00:00:00, etc
-q QUEUE, --queue QUEUE
Slurm queue / partition name
-O OUTPUTLOG, --output-log OUTPUTLOG
Output log filename; this option for PBS, unused with Slurm
-E ERRORLOG, --error-log ERRORLOG
@@ -107,16 +107,19 @@ The ``allocateNodes.py`` utility requires a small measure of configuration in th

A typical ``allocateNodes.py`` command line for obtaining resources for a BPS workflow could be::

$ allocateNodes.py -v -n 20 -c 32 -m 4-00:00:00 -q roma -g 600 s3df

``s3df`` is specified as the target platform.
The ``-q roma`` option specifies that the glide-in jobs should run in the ``roma`` partition (a good alternative value is ``milano``).
The ``-n 20 -c 32`` options request 20 individual glide-in slots of 32 cores each (640 total cores; each glide-in is a Slurm job that obtains a partial node).
The ``-c`` option is no longer required on the command line, as it defaults to a value of 16 cores.
allocateNodes now has an encoded upper bound of 8000 cores to prevent a runaway scenario, and best collaborative usage
is generally in the 1000-2000 total core range given current qos limits.
The maximum possible time is set to 4 days via ``-m 4-00:00:00``.
The glide-in Slurm jobs may not run for the full 4 days, however, as the option ``-g 600`` specifies a
condor glide-in shutdown time of 600 seconds, or 10 minutes. This means that the htcondor daemons will shut themselves
down after 10 minutes of inactivity (for example, after the workflow is complete), and the glide-in Slurm jobs
will exit at that time to avoid wasting idle resources.
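
Once the request is in, the glide-ins can be monitored from both the Slurm and htcondor sides (a sketch; the output will of course depend on the state of the cluster)::

   $ squeue --me        # the glide-ins appear as ordinary Slurm jobs
   $ condor_status      # the corresponding htcondor slots, labeled by username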

There is support for setting USDF S3DF Slurm account, repo and qos values. By default the account ``rubin``
with the ``developers`` repo (``--account rubin:developers``) will be used.
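
For example, restating the default account explicitly, the command line above becomes::

   $ allocateNodes.py -v -n 20 -c 32 -m 4-00:00:00 -q roma -g 600 --account rubin:developers s3df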
@@ -181,12 +184,11 @@ After submitting the ``allocateNodes.py`` command line above, the user may see S
Total 19 0 0 19 0 0 0 0

The htcondor slots will be labeled with the username, so that one user's glide-ins may be distinguished from another's. In this case the glide-in slots are partial-node 32-core chunks, so more than one slot can appear on a given node. The decision whether to request full nodes or partial nodes depends on the general load on the cluster: if the cluster is populated with numerous other single-core jobs that partially fill nodes, it will be necessary to request partial nodes to acquire available resources.
Larger ``-c`` values (and hence smaller ``-n`` values for the same total number of cores) will entail less process overhead, but there may be inefficient unused cores within a slot/"node", and slots may be harder to schedule. The ``-c`` option has a default value of 16.
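
As a back-of-the-envelope check when choosing a slot shape, the totals work out as follows (a plain-bash sketch; the numbers mirror the example above and are not otherwise special):

```shell
#!/bin/bash
# Glide-in sizing sketch: total cores = (number of glide-ins) x (cores per glide-in).
n=20    # -n: number of glide-in slots
c=32    # -c: cores per glide-in (defaults to 16 if omitted)
total=$((n * c))
echo "total cores requested: ${total}"    # 640, well under the 8000-core bound
```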

The ``allocateNodes.py`` utility is set up to be run in a maintenance or cron-type manner, in which reissuing the exact same command line request for 20 glide-ins will not directly issue 20 additional glide-ins. Rather, ``allocateNodes.py`` will strive to maintain 20 glide-ins for the workflow, checking whether that number of glide-ins is in the queue and resubmitting any missing glide-ins that may have exited due to lulls in activity within the workflow.
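
Because rerunning the same request only tops up missing glide-ins, the command is safe to drive from cron. A hypothetical crontab entry (the script path is a placeholder; the script itself would need to set up the stack before calling ``allocateNodes.py``)::

   # m h dom mon dow   command
   */10 * * * *        /sdf/home/u/username/maintain_glideins.sh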

With htcondor slots present and visible with ``condor_status``, one may proceed with running ``ctrl_bps`` ``ctrl_bps_htcondor`` workflows.
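
A minimal submission then looks like the following (a sketch; the yaml filename is a placeholder for the user's BPS submit configuration)::

   $ bps submit my_pipeline.yaml
   $ bps report                    # check on running workflows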

Usage of the ``ctrl_bps_htcondor`` plugin and module has been extensively documented at

@@ -206,22 +208,22 @@ For running at S3DF, the following ``site`` specification can be used in the BPS
allocateNodes auto
------------------

The ``ctrl_execute`` package now provides an ``allocateNodes --auto`` mode in which the user does not have to specify either the number of glideins to run or the size of the glideins. This mode is not the default and must be explicitly invoked. In this mode the user's idle jobs in the htcondor queue are detected and an appropriate number of glideins is submitted. The current version of ``allocateNodes --auto`` works exclusively with BPS workflows, and the ``-c`` option is ignored. ``allocateNodes --auto`` searches for "large" jobs (taken to be larger than 16 cores, or the equivalent in memory); for each large job a customized glidein is created and submitted, while for smaller jobs 16-core glideins are submitted in the quantity needed. At this stage of development ``allocateNodes --auto`` is used in conjunction with a bash script that runs alongside a BPS workflow or workflows. The script invokes ``allocateNodes --auto`` at regular intervals to submit the number of glideins needed by the workflow(s) at that particular time. A sample ``service.sh`` script is::

#!/bin/bash
export LSST_TAG=w_2024_08
lsstsw_root=/sdf/group/rubin/sw
source ${lsstsw_root}/loadLSST.bash
setup -v lsst_distrib -t ${LSST_TAG}

# Loop for a long time, executing "allocateNodes auto" every 10 minutes.
for i in {1..500}
do
allocateNodes.py --auto --account rubin:developers -n 50 -m 4-00:00:00 -q milano -g 240 s3df
sleep 600
done

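One way to keep the script running alongside the workflow is to launch it in the background and capture its output (a sketch; the log filename is arbitrary)::

   $ chmod +x service.sh
   $ nohup ./service.sh > service.log 2>&1 &
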
On the allocateNodes auto command line the option ``-n 50`` no longer specifies the desired number of glideins, but rather an upper bound. allocateNodes itself has an upper bound on resource usage of 8000 cores, but the user may constrain resource utilization further with this setting. There are two time scales in the script above. The first is the glidein shutdown-with-inactivity time ``-g 240``; this can be fairly short (here 240 seconds, or four minutes) to avoid idle cpus, since new glideins will be resubmitted for the user if needed in later cycles. The second is the sleep time ``sleep 600``, which sets the frequency with which allocateNodes is run; a typical time scale is 600 seconds, or ten minutes. Each invocation queries the htcondor schedd on the current development machine (e.g., ``sdfrome002``) as well as the Slurm scheduler, so it is best not to run it with unnecessary frequency.

After the workflow is complete all of the glideins will expire, and the ``service.sh`` process can be stopped with Ctrl-C, by killing the process, etc. If a user has executed a ``bps submit``, acquired resources via ``service.sh`` / ``allocateNodes``, and everything is running, but then wishes to terminate everything, how best to proceed? A good path is to issue a ``bps cancel``, which takes the precise form ``bps cancel --id <condor ID or path to run submit dir (including timestamp)>``. After the cancel, all htcondor jobs will soon be terminated, and the glideins will become idle and expire shortly after the glidein shutdown-with-inactivity time. The last item that might remain is to stop the ``service.sh`` script, as described above. For the future we are investigating whether BPS itself can manage the ``allocateNodes --auto`` invocations that a workflow requires, eliminating the need for the user to manage the ``service.sh`` script.
