diff --git a/usdf/batch.rst b/usdf/batch.rst
index 7ef5b60d..efd2fc0d 100644
--- a/usdf/batch.rst
+++ b/usdf/batch.rst
@@ -111,8 +111,10 @@ A typical ``allocateNodes.py`` command line for obtaining resources for a BPS wo
 
 ``s3df`` is specified as the target platform.
 The ``-q roma`` option specifies that the glide-in jobs should run in the ``roma`` partition (a good alternative value is ``milano``).
-The ``-n 20 -c 32`` options request 20 individual glide-in slots of size 32 cores each (each is a Slurm job that obtains a partial node).
+The ``-n 20 -c 32`` options request 20 individual glide-in slots of size 32 cores each (640 total cores; each glide-in is a Slurm job that obtains a partial node).
 The ``-c`` option is no longer a required command line option, as it will default to a value of 16 cores.
+In allocateNodes there is now a built-in upper bound of 8000 cores to prevent a runaway scenario, and best collaborative usage
+is generally in the 1000-2000 total core range given current QOS limits.
 The maximum possible time is set to 4 days via ``-m 4-00:00:00``.
 The glide-in Slurm jobs may not run for the full 4 days however, as the option ``-g 600`` specifies a condor glide-in shutdown time of 600 seconds or 10 minutes.
 This means that the htcondor daemons will shut themselves
@@ -209,7 +211,7 @@ allocateNodes auto
 The ``ctrl_execute`` package now provides an ``allocateNodes --auto`` mode in which the user 1) does not have to specify the number of glideins to run and 2) does not have to specify the size of the glideins. This mode is not the default, and must be explicitly invoked. In this mode the user's idle jobs in the htcondor queue will be detected and an appropriate number of glideins submitted. The current version of ``allocateNodes --auto`` works with BPS workflows exclusively and the ``-c`` option is ignored. ``allocateNodes --auto`` searches for "large" jobs (taken to be larger than 16 cores or equivalent memory) and for each of the large jobs a customized glidein is created and submitted; for smaller jobs 16 core glideins will be submitted in the quantity needed. At this stage of development the allocateNodes auto is used in conjuction with a bash script that runs alongside a BPS workflow or workflows. The script will invoke allocateNodes auto at regular intervals to submit the number of glideins needed by the workflow(s) at the particular time. A sample ``service.sh`` script is::
 
    #!/bin/bash
-   export LSST_TAG=d_2024_02_02
+   export LSST_TAG=w_2024_08
    lsstsw_root=/sdf/group/rubin/sw
    source ${lsstsw_root}/loadLSST.bash
    setup -v lsst_distrib -t ${LSST_TAG}
@@ -217,11 +219,11 @@ The ``ctrl_execute`` package now provides an ``allocateNodes --auto`` mode in wh
    # Loop for a long time, executing "allocateNodes auto" every 10 minutes.
    for i in {1..500}
    do
-      allocateNodes.py --auto --account rubin:developers -n 100 -m 4-00:00:00 -q milano -g 240 s3df
+      allocateNodes.py --auto --account rubin:developers -n 50 -m 4-00:00:00 -q milano -g 240 s3df
       sleep 600
    done
 
-On the allocateNodes auto command line the option ``-n 100`` no longer specifies the desired number of glideins, but rather specifies an upper bound. There are two time scales in the script above, the first is the glidein shutdown with inactivity time ``-g 240``. This can be fairly short (here 240 seconds / four minutes) to avoid idle cpus, since new glideins will be resubmitted for the user if needed in later cycles. The second time scale is the sleep time ``sleep 600``. This provides the frequency with which to run allocateNodes, and a typical time scale is 600 seconds / ten minutes. With each invocation queries are made to the htcondor schedd and the Slurm scheduler, so it is best not run with unnecessary frequency. Each invocation of allocateNodes queries the htcondor schedd on the current development machine (e.g., ``sdfrome002``).
+On the allocateNodes auto command line the option ``-n 50`` no longer specifies the desired number of glideins, but rather specifies an upper bound. allocateNodes itself has an upper bound of 8000 cores on total resource usage, but the user may constrain usage further with this setting. There are two time scales in the script above: the first is the glidein shutdown with inactivity time ``-g 240``. This can be fairly short (here 240 seconds / four minutes) to avoid idle CPUs, since new glideins will be resubmitted for the user if needed in later cycles. The second time scale is the sleep time ``sleep 600``. This provides the frequency with which to run allocateNodes, and a typical time scale is 600 seconds / ten minutes. With each invocation queries are made to the htcondor schedd and the Slurm scheduler, so it is best not to run it with unnecessary frequency. Each invocation of allocateNodes queries the htcondor schedd on the current development machine (e.g., ``sdfrome002``).
 After the workflow is complete all of the glideins will expire and the ``service.sh`` process can be removed with Ctrl-C, killing the process, etc. If a user has executed a ``bps submit`` and acquired resources via the ``service.sh`` / ``allocateNodes`` and everything is running, but then wishes to terminate everything, how best to proceed? A good path is to issue a ``bps cancel``, which would take the precise form ``bps cancel --id ``. After the cancel all htcondor jobs will be terminated soon, and the glideins will become idle and expire shortly after the glidein shutdown time with inactivity. The last item that might remain is to stop the ``service.sh`` script, as described above.
 
 For the future we are investigating if BPS itself can manage the allocateNodes auto invocations that a workflow requires, eliminating the need for the user to manage the ``service.sh`` script.
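The final paragraph of the changed text describes the shutdown sequence only in prose; as an illustrative sketch (not part of the patch above), the steps might look like the following, where the run ID is a placeholder to be taken from ``bps report`` and the commands shown are the standard BPS, Slurm, and HTCondor tools already referenced in the document::

   # Identify the run to terminate (the ID below is a placeholder).
   bps report

   # Cancel the workflow; its htcondor jobs will be terminated soon after.
   bps cancel --id <run-id>

   # The glideins go idle and expire after the -g inactivity timeout
   # (240 seconds in the sample service.sh).  Optionally watch them drain:
   condor_q
   squeue -u $USER

   # Last, stop the service.sh loop: Ctrl-C in its terminal, or if it was
   # started in the background (e.g. "./service.sh &"), kill that process:
   kill %1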