From 1d3bba3108a01c87b3cbcede3d84da80bba85f57 Mon Sep 17 00:00:00 2001
From: Mitja Sainio
Date: Tue, 1 Oct 2024 09:35:37 +0300
Subject: [PATCH] small misc edits

---
 materials/batch_job.md           | 112 ++++++++++++++++++++++---------
 materials/job_monitoring.md      |  27 +++++---
 materials/supercomputer_setup.md |  26 +++----
 materials/supercomputing.md      |   4 +-
 4 files changed, 114 insertions(+), 55 deletions(-)

diff --git a/materials/batch_job.md b/materials/batch_job.md
index 18e6ce5f..33d262ed 100644
--- a/materials/batch_job.md
+++ b/materials/batch_job.md
@@ -1,9 +1,18 @@
 # Batch jobs
 
-On our own computer, we are used to start a program (job) and the program starts instantly. In an supercomputing environment, the computer is **shared among hundreds of other users**. All heavy computing must be done on compute nodes, see [Usage policy](https://docs.csc.fi/computing/overview/#usage-policy). For using compute nodes, the user first asks for the computing resources, then waits to have access to the requested resources and first then the job starts.
+On our own computer, we are used to a program (job) starting instantly when
+we launch it. In a supercomputing environment, the computer is **shared among
+hundreds of users**. All heavy computing must be done on compute nodes
+(see [Usage policy](https://docs.csc.fi/computing/overview/#usage-policy)). To
+use compute nodes, the user first asks for the computing resources and then
+waits for the job to start when the requested resources become available.
 
 ## SLURM - job management system
 
-A job management system keeps track of the available and requested computing resources. It aims to share the resources in an efficient and fair way among all users. It optimizes resource usage by filling the compute nodes so that there will be as little idling resources as possible. CSC uses a job management system called SLURM.
+A job management system keeps track of the available and requested computing
+resources. It aims to share the resources in an efficient and fair way among
+all users. It optimizes resource usage by filling the compute nodes so that
+there will be as little idling resources as possible. CSC uses a job
+management system called SLURM.
 ```{figure} images/slurm-sketch.svg
 :alt: How batch jobs are distributed on compute nodes in terms of number of CPU cores, time and memory
@@ -12,17 +21,20 @@ A job management system keeps track of the available and requested computing res
 SLURM job allocations
 ```
 
-
-It is important to request only the resources you need and ensure that the resources are used efficiently. Resources allocated to a job are not available for others to use. If a job is _not_ using the cores or memory it reserved, resources are wasted.
+It is important to request only the resources you need and ensure that the
+resources are used efficiently. Resources allocated to a job are not available
+for others to use. If a job is _not_ using the cores or memory it reserved,
+resources are wasted.
 
 ## Batch job script
 
 A **batch job script** is used to request resources for a job. It consists of two parts:
 
-* The resource request: computing time, number of cores, amount of memory and other resources like GPUs, local disk, etc.
+* The resource request: computing time, number of cores, amount of memory and
+  other resources like GPUs, local disk, etc.
 * Instructions for computing: what tool or script to run.
 
-Example minimal batch script:
+Minimal example of a batch script:
 
 ```bash title="simple.sh"
 #!/bin/bash
@@ -38,58 +50,98 @@ srun python myscript.py # The script to run
 
 * Submit the job for computation: `sbatch simple.sh`
 * Cancel a job after job submission during queueing or runtime: `scancel jobid`.
 
-When we submit a batch job script, the job is not started directly, but is sent into a **queue**. Depending on the requested resources and load, the job may need to wait to get started.
+When we submit a batch job script, the job is not started directly, but is
+sent into a **queue**. Depending on the requested resources and load, the job
+may need to wait to get started.
 
 :::{admonition} How many resources to request?
 :class: seealso
 
-* If you have run the code on some other machine (your laptop?), as a first guess you can reserve the same amount of CPUs and memory as that machine has.
-* You can also check more closely what resources are used with `top` on Mac and Linux or `task manager` on Windows when running on the other machine.
-* If your program does the same or similar thing more than once, you can estimate that the total run time by multiplying the one time run time with number of runs.
-* The first resource reservation on supercomputer is often a guess, do not worry too much, just adjust it later.
-* Before reserving multiple CPUs, check if your code can make use them.
-* Before reserving multiple nodes, check if your code can make use them. Most GIS tools can not.
-* When you double the number of cores, the job should run at least 1.5x faster.
-* Some tools run both on CPU and GPU, if unsure which to use, a good rule of thumb is to compare the billing unit (BU) usage and select the one using less. A GPU uses 60 times more billing units than a single CPU core.
-* You should always monitor jobs to find out what were the actual resources you requested.
+* If you have run the code on some other machine (your laptop?), as a first
+  guess, you can reserve the same number of CPUs and amount of memory as that
+  machine has.
+* You can also monitor resource usage more closely with `top` on Mac and Linux
+  or `task manager` on Windows when running on the other machine.
+* If your program does the same thing (or similar things) more than once, you
+  can estimate the total run time by multiplying the duration of one run with
+  the total number of runs.
+* An initial resource reservation on a supercomputer is often a guess; do not
+  worry too much, just adjust it later.
+* Before reserving multiple CPUs, check if your code can make use of them.
+* Before reserving multiple nodes, check if your code can make use of them.
+  Most GIS tools can not.
+* When you double the number of cores, the job should run at least 1.5x
+  faster.
+* Some tools run on both CPU and GPU. If unsure which to use, a good rule of
+  thumb is to compare the billing unit (BU) usage and select the one consuming
+  fewer units. A GPU uses 60 times more billing units than a single CPU core.
+* You should always monitor your jobs to find out how much of the requested
+  resources they actually used.
 
 Partly adapted from [Aalto Scientific Computing](https://scicomp.aalto.fi/triton/usage/program-size/)
 :::
 
 ## Partitions
 
-A **partition** is a set of compute nodes, grouped logically. Resource limitations for a job are defined by the partition (or queue) the job is submitted to. The limitations affect the **maximum run time, available memory and the number of CPU/GPU cores**. Jobs should be submitted to the smallest partition that matches the required resources.
+A **partition** is a logically grouped set of compute nodes. Resource
+limitations for a job are defined by the partition (or queue) the job is
+submitted to. The limitations affect the **maximum run time, available memory
+and the number of CPU/GPU cores**. Jobs should be submitted to the smallest
+partition that matches the required resources.
 
 - [CSC Docs: Available batch job partitions](https://docs.csc.fi/computing/running/batch-job-partitions/)
 - [LUMI Docs: Slurm partitions](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/partitions/)
 
-
 ## Job types
 
-* **Interactive jobs** for working with some tool interactively, for example graphical tools, writing code, testing. For interactive jobs allocate the resource via the the [interactive partition](https://docs.csc.fi/computing/running/interactive-usage/). This way your work is performed in a compute node, not on the login node. Interactive partition is often used for applications in the web interface. The resources are limited in interactive partition, but it should have no or very short queue.
-* **Serial jobs** work on only one task at a time following a sequence of instructions, while only using one core.
-* **Parallel jobs** distribute the work over several cores or nodes in order to achieve a shorter wall time (and/or a larger allocatable memory).
-* **GPU jobs** for tools that can benefit from running on GPUs. In spatial analysis context, GPUs are most often used for deep learning.
+* **Interactive jobs** are used for e.g. working interactively with
+  tools that have a graphical UI, writing code (using graphical development
+  environments) and testing whether a program runs as intended. For
+  interactive jobs, allocate the resources from the
+  [interactive partition](https://docs.csc.fi/computing/running/interactive-usage/).
+  This way your work is performed on a compute node, not on the login node.
+  The interactive partition is often used for applications in the web interface.
+  The resources are limited on this partition, but it should have very
+  short queuing times.
+* **Serial jobs** work on only one task at a time following a sequence of
+  instructions and only using one core.
+* **Parallel jobs** distribute the work over several cores or nodes in order
+  to achieve a shorter wall time (and/or more allocatable memory).
+* **GPU jobs** are for tools that can benefit from running on GPUs. In a
+  spatial analysis context, GPUs are most often used for deep learning.
 
 :::{admonition} Which partition to choose?
 :class: tip
 
 Check [CSC Docs: Available batch job partitions](https://docs.csc.fi/computing/running/batch-job-partitions/) and find suitable partitions for these tasks:
 
-1. Through trial and error Anna has determined that her image processing process takes about 60 min and 16 GB of memory.
-2. Laura has profiled her code, and determined that it can run efficiently on 20 cores with 12 GB of memory each. The complete process should be done within 4 days.
+1. Through trial and error, Anna has determined that her image processing task
+   takes about 60 min and 16 GB of memory.
+2. Laura has profiled her code, and determined that it can run efficiently on
+   20 cores with 12 GB of memory each. The complete process should be done
+   within 4 days.
 3. Ben wants to visualize a 2 GB file in QGIS.
-4. Neha has written and run some Python code on her own machine. She now wants to move to Puhti and, before running her full pipeline, test that her code executes correctly with a minimal dataset.
-5. Josh wants to run 4 memory heavy tasks (100GB) in parallel. Each job takes about 30 minutes to execute.
+4. Neha has written and run some Python code on her own machine. She now wants
+   to move to Puhti and, before running her full pipeline, test that her code
+   executes correctly with a minimal dataset.
+5. Josh wants to run 4 memory-heavy tasks (100 GB each) in parallel. Each task
+   takes about 30 minutes to execute.
 
 :::{admonition} Solution
 :class: dropdown
 
 1. She does not need interactive access to her process, so `small` suits best.
-2. She needs to choose `longrun` or adapt her code to get under 3 days runtime (which she might want to do in order to avoid exessively long queueing times).
-3. For the webinterface,`interactive` suits best and should be the first choice.
-4. This is a very good idea and should always be done first. Neha can get the best and fast experience using `test` partition. This means to keep the runtime under 15 min and the memory needs below 190 GiB at a maximum of 80 tasks.
-5. 400GB memory in total is more than most partitions can take. If this is the least memory possible for the jobs, it has to be run on `hugemem`.
+2. She needs to choose `longrun` or adapt her code to get under 3 days runtime
+   (which she might want to do in order to avoid excessively long queueing
+   times).
+3. For the web interface, `interactive` suits best and should be the first
+   choice.
+4. This is a very good idea and should always be done first. Neha can get the
+   testing done quickly (= with limited queuing overhead) using the `test`
+   partition. This means keeping the runtime under 15 min and the memory needs
+   below 190 GiB, at a maximum of 80 tasks.
+5. 400 GB of memory in total is more than most partitions can provide. If this
+   is the minimum the jobs need, they have to be run on `hugemem`.
 :::
 
 :::
diff --git a/materials/job_monitoring.md b/materials/job_monitoring.md
index 79a42671..d3a6dd64 100644
--- a/materials/job_monitoring.md
+++ b/materials/job_monitoring.md
@@ -4,7 +4,10 @@ Check the status of your job: `squeue --me`
 
 ## Job output
 
-By default, the standard output (e.g. things that you print as part of your script) and standard error (e.g. error messages from Slurm, your tool or package) are written to the file `slurm-.out` in the same folder as the batch job script.
+By default, the standard output (e.g. things that you print as part of your
+script) and standard error (e.g. error messages from Slurm, your tool or
+package) are written to the file `slurm-<jobid>.out` in the same directory as
+the batch job script.
 
 :::{admonition} What to do if a job fails?
 :class: seealso
@@ -69,7 +72,10 @@ JobID Partition State ReqMem MaxRSS AveRSS Elapsed
 22361601.0 COMPLETED 145493K 139994035 00:06:17
 ```
 
-**Note!** Querying data from the Slurm accounting database with `sacct` can be a very heavy operation. **Don't** query long time intervals or run `sacct` in a loop/using `watch` as this will degrade the performance of the system for all users.
+**Note!** Querying data from the Slurm accounting database with `sacct` can be
+a very heavy operation. **Don't** query long time intervals or run `sacct` in
+a loop/using `watch` as this will degrade the performance of the system for
+all users.
 
 Important aspects to monitor are:
 - Memory efficiency
@@ -85,21 +91,22 @@ Important aspects to monitor are:
 - Perform a scaling test.
 - GPU efficiency
   - If low GPU usage:
-      better to use CPUs?
+    - Better to use CPUs?
     - Is disk I/O the bottleneck?
 - Disk workload
   - If a lot of I/O, use [local disks on compute nodes](https://docs.csc.fi/computing/running/creating-job-scripts-puhti/#local-storage)
 
 :::{admonition} Monitoring interactive jobs
 :class: tip
 
-If you want to monitor real-time resource usage of interactive job:
- - Open a new terminal on the same compute node as where the tool/script is running:
+If you want to monitor the real-time resource usage of an interactive job:
+ - Open a new shell on the same compute node as where the tool/script is running:
    - Jupyter and RStudio have Terminal windows.
-   - If it is some other tool, open another terminal to the copmpute node:
-     - Find out the compute node name from the prompt of the interactive job, something like: `r18c02`
-     - Open a new terminal to login node login node, 
-     - Connect to compute node, for example: `ssh r18c02`
-  - Use Linux `top -u $USER` command, it gives rough estimate of memory and CPU usage. 
+   - If it is some other tool, manually open another shell on the compute node:
+      - Find out the compute node name from the prompt of the interactive
+        job or the output of `squeue --me` (it's something like `r18c02`)
+      - Open a new terminal window or tab on your device and log into the supercomputer
+      - Connect to the compute node from the login node by running `ssh <nodename>` (for example `ssh r18c02`)
+ - Use the Linux `top -u $USER` command; it gives a rough estimate of the memory and CPU usage of the job.
 :::
 
diff --git a/materials/supercomputer_setup.md b/materials/supercomputer_setup.md
index 74145281..25edd532 100644
--- a/materials/supercomputer_setup.md
+++ b/materials/supercomputer_setup.md
@@ -9,29 +9,29 @@ Typical physical parts of a supercomputer:
 * High-speed networks between these
 
 ## Login-nodes
-- Login nodes are used for moving data and scripts, script editing and for starting jobs. 
+- Login nodes are used for moving data and scripts, editing scripts, and starting jobs.
 - When you log in to CSC's supercomputers, you enter one of the login nodes of the computer
-- There is only a few login-nodes and they are shared by all users, so they are [not intended for heavy computing.](https://docs.csc.fi/computing/overview/#usage-policy)
+- There are only a few login-nodes and they are shared by all users, so they are [not intended for heavy computing.](https://docs.csc.fi/computing/overview/#usage-policy)
 
 ![](./images/HPC_nodes.png)
 
 ## Compute-nodes
-- The heavy computing should be done on compute-nodes. 
-- Each node has **memory**, which is used for storing information about a current task. 
-- Compute nodes have 2 main type of processors:
+- Heavy computing should be done on compute-nodes.
+- Each node has **memory**, which is used for storing information about the current task.
+- Compute nodes can be classified based on the types of processors they have:
   * **CPU-nodes** have only CPUs (central processing unit).
  * **GPU-nodes** have both GPUs (graphical processing unit) and CPUs. GPUs are widely used for deep learning.
-  * Each CPU-node includes **cores**, which are the basic computing. There are 40 cores in Puhti and 128 in Mahti and LUMI CPU-nodes.
-  * It depends on the used software, if it benefits from GPU or not. Most GIS-tools can not use GPUs.
+  * Each CPU has multiple **cores**, which are the basic computing resource. There are 40 cores in Puhti and 128 in Mahti and LUMI CPU-nodes.
+  * Whether your task benefits from a GPU depends on the software used. Most GIS-tools can not use GPUs.
  * GPUs are more expensive, so in general the software should run at least 3x faster on a GPU for it to be reasonable to use GPU-nodes.
-- While using compute nodes the compute resources have to be defined in advance, and specified if CPU or/and GPU is needed, how many cores or nodes and how much memory.
+- When using compute nodes, the compute resources have to be defined in advance. You must specify e.g. the number of GPUs, the amount of memory and the number of nodes or CPU cores.
 - Specifics of [Puhti](https://docs.csc.fi/computing/systems-puhti/#nodes), [Mahti](https://docs.csc.fi/computing/systems-mahti/) and [LUMI](https://docs.lumi-supercomputer.eu/hardware/lumic/) compute nodes.
 
 ## Storage
 
 ![](./images/HPC_disks.png)
 
-- **Disk** refers to all storage that can be accessed like a file system. This is generally storage that can hold data permanently, i.e. data is still there even if the computer has been restarted.
+- **Disk** refers to all storage that can be accessed as a file system. This is generally storage that can hold data permanently, i.e. data is still there even if the computer has been restarted.
 - CSC supercomputers use Lustre as the **parallel distributed file system**
 
 ### Puhti disk areas
@@ -45,13 +45,13 @@ Typical physical parts of a supercomputer:
   * `scratch` space can be extended, but it would use billing units then.
 
 #### Temporary fast disks
-- Some nodes might have also **local disk space** for temporary use. 
+- Some nodes might also have **local disk space** for temporary use.
 - [CSC Docs: Login node local tmp](https://docs.csc.fi/computing/disk/#login-nodes) `$TMPDIR` for compiling, cleaned frequently.
 - [CSC Docs: NVMe](https://docs.csc.fi/computing/running/creating-job-scripts-puhti/#local-storage)
   - `$LOCAL_SCRATCH` in batch jobs, 
-  - NVMe is accessible only during your job allocation, inc. interactive job
+  - NVMe is accessible only during your job allocation (including any interactive jobs)
   - You must copy data in and out during your batch job
-  - If your job reads or writes a lot of small files, using this can give 10x performance boost
+  - If your job reads or writes lots of small files, using this can give 10x performance boost
 
 :::{admonition} Avoid unnecessary reading and writing
 :class: seealso
@@ -83,7 +83,7 @@ Which of the following tasks would suit to run on the login node?
 
 >Note that script names do not always reflect their contents: before launching #3, please check what is inside create_directories.sh and make sure it does what the name suggests.
 
-Running resource-intensive applications on the login node are forbidden. Unless you are sure it will not affect other users, do not run jobs like #1 (python) or #4 (a software). You will anyway want more resources for these, than the login node can provide.
+Running resource-intensive applications on the login node is forbidden. Unless you are sure it will not affect other users, do not run jobs like #1 (python) or #4 (an application). You will in any case want more resources for these than the login node can provide.
 :::
 
 :::
diff --git a/materials/supercomputing.md b/materials/supercomputing.md
index 8b95120b..2d30a43c 100644
--- a/materials/supercomputing.md
+++ b/materials/supercomputing.md
@@ -33,8 +33,8 @@
 * Deep learning libraries run much faster on GPU.
 
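+As a sketch of how this looks in practice, a GPU batch job on Puhti is
+requested roughly like this (the `gpu` partition and `--gres=gpu:v100:1`
+syntax follow CSC's Puhti documentation; the project name, script name and
+resource values below are placeholders, not part of the course materials):
+
+```bash
+#!/bin/bash
+#SBATCH --account=project_2001234  # placeholder project, use your own
+#SBATCH --partition=gpu
+#SBATCH --gres=gpu:v100:1          # request one NVIDIA V100 GPU
+#SBATCH --time=01:00:00
+#SBATCH --mem=16G
+
+srun python train_model.py         # placeholder deep learning script
+```
+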
 #### Parallel computing
-* Only few GIS tools have built-in support for parallization
-* With scrips and diving the data any tool can be run in parallel
+* Only a few GIS tools have built-in support for parallelization
+* By using scripts and dividing the data, any tool can be run in parallel
 
 #### Embarrassingly parallel analyses
 * Many similar, but independent tasks.
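+
+One common way to run such embarrassingly parallel tasks is a Slurm array job.
+A minimal sketch, assuming ten input files named input_1.tif ... input_10.tif
+(the project name, partition, file naming and resource values are placeholder
+assumptions, not from the course materials):
+
+```bash
+#!/bin/bash
+#SBATCH --account=project_2001234   # placeholder project, use your own
+#SBATCH --partition=small
+#SBATCH --time=00:30:00
+#SBATCH --mem=4G
+#SBATCH --array=1-10                # ten independent tasks, one per array index
+
+# Each task in the array gets its own $SLURM_ARRAY_TASK_ID and input file
+srun python myscript.py input_${SLURM_ARRAY_TASK_ID}.tif
+```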