From 578ecad5e659fdcacd5c2a06d2828fdd0bc0ff0e Mon Sep 17 00:00:00 2001
From: Ryan Scherbarth
Date: Wed, 5 Jun 2024 07:09:19 -0600
Subject: [PATCH 1/2] revised intro to slurm quickbyte

---
 Intro_to_slurm.md | 91 +++++++++++++++++++++++++----------------------
 1 file changed, 49 insertions(+), 42 deletions(-)

diff --git a/Intro_to_slurm.md b/Intro_to_slurm.md
index 321a9a6..dd152bb 100644
--- a/Intro_to_slurm.md
+++ b/Intro_to_slurm.md
@@ -6,58 +6,69 @@ Slurm is a resource manager and job scheduler designed for scheduling and alloca
 
 `sinfo` provides information regarding the resources that are available on the cluster. Example:
 
-    user@taos:~$ sinfo
+    [rdscher@hopper ~]$ sinfo
     PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
-    normal*      up 2-00:00:00      2    mix taos[01,09]
-    normal*      up 2-00:00:00      1  alloc taos02
-    normal*      up 2-00:00:00      6   idle taos[03-08]
+    general*     up 2-00:00:00      2  alloc hopper[002-003]
+    general*     up 2-00:00:00      8   idle hopper[001,004-010]
+    debug        up    4:00:00      2   idle hopper[011-012]
+    condo        up 2-00:00:00      4  idle* hopper[064-067]
+    condo        up 2-00:00:00      2  down* hopper[033,051]
+    condo        up 2-00:00:00     23    mix hopper[017-025,028,032,036,038,045,055-063]
+    condo        up 2-00:00:00     14  alloc hopper[013-016,029-030,035,042,047,049-050,052-054]
+    condo        up 2-00:00:00     12   idle hopper[026-027,031,034,037,039-041,043-044,046,048]
+    geodef       up 7-00:00:00      1  idle* hopper065
+    geodef       up 7-00:00:00      1  alloc hopper047
+    geodef       up 7-00:00:00      2   idle hopper[046,048]
+    biocomp      up 7-00:00:00      1  alloc hopper052
+
 
-From the output above we can see that one node (taos02) is allocated under a normal partition. Similarly, we can see that two nodes (taos01 and taos09) are in a mixed state meaning multiple users have resources allocated on the same node. The final line in the output shows that all other nodes (taos03-08) are currently idle.
 
-`sinfo –N –l` provides more detailed information about individuals nodes including CPU count, memory, temporary disk space and so on.
+From the output above, we can see that 2 nodes on the general partition are allocated and 8 are idle. The corresponding node IDs are also listed; in this case, hopper002 and hopper003 are the two allocated nodes on the general partition.
+
+`sinfo -N -l` provides more detailed information about each individual node, including CPU count, memory, temporary disk space, and so on.
 
-```
-    user@taos:~$ sinfo -N -l
-    Tue Feb 19 19:08:16 2019
+    [rdscher@hopper ~]$ sinfo -N -l
+    Wed Jun 05 05:58:34 2024
     NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
-    taos01         1   normal*       mixed   80   2:20:2 386868   690861     10   (null) none
-    taos02         1   normal*   allocated   40   2:10:2  64181   309479      1   (null) none
-    taos03         1   normal*        idle   40   2:10:2  64181   309479      1   (null) none
-    taos04         1   normal*        idle   40   2:10:2  64181   358607      1   (null) none
-    taos05         1   normal*        idle   40   2:10:2  64181   309479      1   (null) none
-    taos06         1   normal*        idle   40   2:10:2  64181   309479      1   (null) none
-    taos07         1   normal*        idle   40   2:10:2  64181   309479      1   (null) none
-    taos08         1   normal*        idle   40   2:10:2  64181   309479      1   (null) none
-    taos09         1   normal*       mixed   40   2:10:2 103198  6864803     20   (null) none
-```
-
-More information regarding `sinfo` can be found by typing `man sinfo` at the command prompt while logged in to Taos.
+    hopper001      1  general*        idle   32   2:16:1  95027        0      1   (null) none
+    hopper002      1  general*   allocated   32   2:16:1  95027        0      1   (null) none
+    hopper003      1  general*   allocated   32   2:16:1  95027        0      1   (null) none
+    hopper004      1  general*        idle   32   2:16:1  95027        0      1   (null) none
+    hopper005      1  general*        idle   32   2:16:1  95027        0      1   (null) none
+    hopper006      1  general*        idle   32   2:16:1  95027        0      1   (null) none
+    hopper007      1  general*        idle   32   2:16:1  95027        0      1   (null) none
+    hopper008      1  general*        idle   32   2:16:1  95027        0      1   (null) none
+
+
+More information regarding `sinfo` can be found by typing `man sinfo` at the command prompt while logged in to Hopper.
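+
+If you are only interested in a particular partition, `sinfo` can also be narrowed down and reformatted. As a rough sketch (the column choice here is arbitrary; `%n`, `%t`, `%c`, and `%m` are the hostname, state, CPU-count, and memory fields described in `man sinfo`):
+
+    # one line per node in the general partition, with a custom set of columns
+    sinfo -p general -N -o "%10n %8t %4c %8m"
+
+    # one-line summary per partition (allocated/idle/other/total node counts)
+    sinfo -s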
 
 `squeue` provides information regarding currently running jobs and the resources allocated to those jobs.
 
-```
-    user@taos:~$ squeue
+    [rdscher@hopper ~]$ squeue
     JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
-    22632    normal BinPacke  username PD       0:00      1 (Resources)
-    22548    normal    minia  username  R 1-07:30:18      1 taos09
-    22562    normal unicycle  username  R   22:34:59      1 taos02
-    22567    normal   sspace  username  R   22:00:50      1 taos01
-    22576    normal  megahit  username  R    7:22:53      1 taos09
-```
+    2554110     condo 200-7.00    lzhang  R    4:52:39      1 hopper050
+    2462342     debug ethylben    eattah PD       0:00      8 (PartitionNodeLimit)
+    2548555   general  Ag-freq     yinrr  R 1-20:31:48      2 hopper[002-003]
+
 
 The output from `squeue` shows you the JobID, the partition, the name of the job, which user owns the job, the job's state, the total elapsed time of the job, how many nodes are allocated to that job, and which nodes those are.
 
+There are additional flags you can add to `squeue` to help narrow its output down to the information you need:
+
+- `squeue --me` will show only your own jobs in the queue.
+- `squeue -p general` will show only jobs in the general partition.
+
+A list of all flags can be found with the `man squeue` command.
 
-To cancel a job, you can use `scancel <JobID>` where `<JobID>` refers to the JobID assigned to your job by Slurm.
+To cancel a job, you can use `scancel <JobID>`, where `<JobID>` refers to the JobID assigned to your job by Slurm. Similar to `squeue`, you can also use `scancel --me` to cancel all jobs you have scheduled.
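+
+Putting these together, a typical pattern is to list only your own jobs and then cancel the ones you no longer need. A minimal sketch (the JobID below is just a placeholder; substitute one from your own `squeue --me` output):
+
+    # list only my jobs, restricted to the general partition
+    squeue --me -p general
+
+    # cancel one specific job by its JobID (placeholder value)
+    scancel 2554110
+
+    # or cancel every job I currently have queued or running
+    scancel --me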
 
 ## Slurm Job Submission
 
 To submit a job in Slurm, you submit a shell script that outlines the resources you are requesting from the scheduler, the software needed for your job, and the commands you wish to run. The beginning of your submission script usually contains the shebang, which specifies the interpreter to be used for the rest of the script; in this case we are using a `bash` shell, as indicated by `#!/bin/bash`. The next portion of your submission script tells Slurm what resources you are requesting and is always preceded by `#SBATCH`, followed by flags for the various parameters detailed below.
 
-Example of a Slurm submission script : `slurm_submission.sh`
+Example of a Slurm submission script: `slurm_submission.slurm`
+
 
-```
     #!/bin/bash
     #
     #SBATCH --job-name=demo
@@ -66,27 +77,23 @@ Example of a Slurm submission script : `slurm_submission.sh`
     #SBATCH --ntasks=4
     #SBATCH --time=00:10:00
     #SBATCH --mem-per-cpu=100
-    #SBATCH --partition=partition_name
+    #SBATCH --partition=general
 
     srun hostname
     srun sleep
-```
-The above script is requesting from the scheduler an allocation of 4 nodes for 10 minutes with 100MB of ram per CPU. Note that we are requesting resources for four tasks, `--ntasks=4`, but not four nodes specifically. The default behavior of the scheduler is to provide one node per task, but this can be changed with the `--cpus-per-task` flag.
 Once the scheduler allocates the requested resources, the job starts to run and the commands not preceded by `#SBATCH` are interpreted and executed. The script first executes the `srun hostname` command, followed by the `srun sleep` command.
+The above script requests 4 CPU cores (one per task, `--ntasks=4`) for up to 10 minutes, with 100MB of memory per core, on the general partition. When we request resources this way, the scheduler will give us the first 4 CPU cores it can find, whether that ends up being 1 core per node across 4 nodes or all 4 cores on the same node. You can add additional placement constraints, such as `--ntasks-per-node`, but keep in mind that the more constraints you add, the longer your queue times may be: 4 cores spread across nodes might become free instantly, while asking for all of those cores to be on the same node could take significantly longer.
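+
+For example, if you do want all four cores to come from a single node, the resource request can be tightened as sketched below (only the `#SBATCH` resource lines are shown; the rest of the script is unchanged):
+
+    #SBATCH --nodes=1            # all tasks must fit on a single node
+    #SBATCH --ntasks=4
+    #SBATCH --ntasks-per-node=4  # place all 4 tasks on that node
+    #SBATCH --time=00:10:00
+    #SBATCH --mem-per-cpu=100
+    #SBATCH --partition=general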
 
 The arguments `--job-name` and `--output` correspond to the name of the job you are submitting and the name of the output file where any output not otherwise captured by your program is saved. For example, anything printed to `stdout` will be saved in your `--output` file.
 
-Of note here is the `--partition=partition_name` (or `-p partition_name`) command. This command specifies which partition, or queue, to submit your job to. If you are a member of a specific partition you likely are aware of the name of your partition, however you can see which partition you have access to with the `sinfo` command. If you leave this blank you will be submitted to the default or community partition.
+Of note here is the `--partition=general` (or `-p general`) flag. This flag specifies which partition, or queue, to submit your job to. If you are a member of a specific partition you are likely aware of its name; you can also see which partitions you have access to with the `sinfo` command. If you leave this blank your job will be submitted to the default or community partition.
 
 To submit the job you execute the `sbatch` command followed by the name of your submission script, for example:
 
-`sbatch submission.sh`
+`sbatch slurm_submission.slurm`
 
 Once you execute the above command the job is queued until the requested resources are available to be allocated to your job.
 
-Once your job is submitted you can use `sstat` command to see information about memory usage, CPU usage, and other metrics related to the jobs you own.
-
-
 Below is an example of a Slurm submission script that runs a small Python program that takes an integer as an argument, creates a random-number matrix with the dimensions defined by the integer you provided, then inverts that matrix and writes it to a CSV file.
 
 Below is our small Python program named `demo.py` that we will be invoking.
@@ -113,7 +120,7 @@ Below is our small python program named `demo.py` that we will be invoking.
 
     numpy.savetxt("%d.csv" % args.matrix, out, delimiter=",")
 
-Below is the Slurm submission script to submit our python program named `submission_python.sh`. This job can be submitted by typing `sbatch submission_python.sh` at the command prompt. Note the `module load` command that loads the software environment that contains the `numpy` package necessary to run our program.
+Below is the Slurm submission script to submit our Python program, named `submission_python.slurm`. This job can be submitted by typing `sbatch submission_python.slurm` at the command prompt. Note the `module load` command that loads the software environment that contains the `numpy` package necessary to run our program.
 
     #!/bin/bash
     #
@@ -123,9 +130,9 @@ Below is the Slurm submission script to submit our python program named `submiss
     #SBATCH --ntasks=4
     #SBATCH --time=10:00
    #SBATCH --mem-per-cpu=100
-    #SBATCH --partition=ceti
+    #SBATCH --partition=debug
 
-    module load anaconda3
+    module load miniconda3
     python demo.py 34
 
 This brief tutorial should provide the basics necessary for submitting jobs to the Slurm Workload Manager on CARC machines.

From 749d4d7fbd8b5a43c7566e83f47604c0f57c0d5e Mon Sep 17 00:00:00 2001
From: Ryan Scherbarth
Date: Wed, 5 Jun 2024 07:10:05 -0600
Subject: [PATCH 2/2] revision tag

---
 Intro_to_slurm.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Intro_to_slurm.md b/Intro_to_slurm.md
index dd152bb..e51b722 100644
--- a/Intro_to_slurm.md
+++ b/Intro_to_slurm.md
@@ -137,5 +137,4 @@ Below is the Slurm submission script to submit our python program named `submiss
 
 This brief tutorial should provide the basics necessary for submitting jobs to the Slurm Workload Manager on CARC machines.
 
-
-
+*This quickbyte was validated on 6/5/2024*
\ No newline at end of file