small misc edits (#73)
msainio authored Oct 1, 2024
1 parent 361a793 commit fb2ac06
Showing 4 changed files with 114 additions and 55 deletions.
112 changes: 82 additions & 30 deletions materials/batch_job.md
@@ -1,9 +1,18 @@
# Batch jobs

On our own computer, we are used to start a program (job) and the program starts instantly. In an supercomputing environment, the computer is **shared among hundreds of other users**. All heavy computing must be done on compute nodes, see [Usage policy](https://docs.csc.fi/computing/overview/#usage-policy). For using compute nodes, the user first asks for the computing resources, then waits to have access to the requested resources and first then the job starts.
On our own computer, we are used to a started program (job) starting
instantly. In a supercomputing environment, the computer is **shared among
hundreds of users**. All heavy computing must be done on compute nodes
(see [Usage policy](https://docs.csc.fi/computing/overview/#usage-policy)). To
use compute nodes, the user first asks for the computing resources and then
waits for the job to start when the requested resources become available.

## SLURM - job management system
A job management system keeps track of the available and requested computing resources. It aims to share the resources in an efficient and fair way among all users. It optimizes resource usage by filling the compute nodes so that there will be as little idling resources as possible. CSC uses a job management system called SLURM.
A job management system keeps track of the available and requested computing
resources. It aims to share the resources in an efficient and fair way among
all users. It optimizes resource usage by filling the compute nodes so that
there will be as little idling resources as possible. CSC uses a job
management system called SLURM.

```{figure} images/slurm-sketch.svg
:alt: How batch jobs are distributed on compute nodes in terms of number of CPU cores, time and memory
@@ -12,17 +21,20 @@ A job management system keeps track of the available and requested computing res
SLURM job allocations
```


It is important to request only the resources you need and ensure that the resources are used efficiently. Resources allocated to a job are not available for others to use. If a job is _not_ using the cores or memory it reserved, resources are wasted.
It is important to request only the resources you need and ensure that the
resources are used efficiently. Resources allocated to a job are not available
for others to use. If a job is _not_ using the cores or memory it reserved,
resources are wasted.

## Batch job script

A **batch job script** is used to request resources for a job. It consists of two parts:

* The resource request: computing time, number of cores, amount of memory and other resources like GPUs, local disk, etc.
* The resource request: computing time, number of cores, amount of memory and
other resources like GPUs, local disk, etc.
* Instructions for computing: what tool or script to run.

Example minimal batch script:
Minimal example of a batch script:

```bash title="simple.sh"
#!/bin/bash
@@ -38,58 +50,98 @@ srun python myscript.py # The script to run
* Submit the job for computation: `sbatch simple.sh`
* Cancel a job after job submission during queueing or runtime: `scancel jobid`.

When we submit a batch job script, the job is not started directly, but is sent into a **queue**. Depending on the requested resources and load, the job may need to wait to get started.
When we submit a batch job script, the job is not started directly, but is
sent into a **queue**. Depending on the requested resources and load, the job
may need to wait to get started.
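
As a rough sketch of the two parts described above, the snippet below writes a small batch job script to disk. The file name `sketch_job.sh`, the job name and all resource values are illustrative assumptions, and `project_XXXXXXX` is a placeholder for your own project. It is not a reconstruction of `simple.sh`.

```bash
#!/bin/bash
# Write a hypothetical batch job script; every value is a placeholder to adapt.
cat > sketch_job.sh <<'EOF'
#!/bin/bash
# --- Part 1: the resource request ---
# Hypothetical job name shown in the queue
#SBATCH --job-name=sketch
# Placeholder for your own project
#SBATCH --account=project_XXXXXXX
# Partition (queue) to submit to
#SBATCH --partition=small
# Maximum run time as hh:mm:ss
#SBATCH --time=00:15:00
# One task using one core
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Memory per reserved core
#SBATCH --mem-per-cpu=2G

# --- Part 2: instructions for computing ---
srun python myscript.py
EOF

echo "Wrote sketch_job.sh; submit it with: sbatch sketch_job.sh"
```

On the supercomputer, the file would then be submitted with `sbatch sketch_job.sh`, and `squeue --me` would show its state in the queue.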

:::{admonition} How many resources to request?
:class: seealso

* If you have run the code on some other machine (your laptop?), as a first guess you can reserve the same amount of CPUs and memory as that machine has.
* You can also check more closely what resources are used with `top` on Mac and Linux or `task manager` on Windows when running on the other machine.
* If your program does the same or similar thing more than once, you can estimate that the total run time by multiplying the one time run time with number of runs.
* The first resource reservation on supercomputer is often a guess, do not worry too much, just adjust it later.
* Before reserving multiple CPUs, check if your code can make use them.
* Before reserving multiple nodes, check if your code can make use them. Most GIS tools can not.
* When you double the number of cores, the job should run at least 1.5x faster.
* Some tools run both on CPU and GPU, if unsure which to use, a good rule of thumb is to compare the billing unit (BU) usage and select the one using less. A GPU uses 60 times more billing units than a single CPU core.
* You should always monitor jobs to find out what were the actual resources you requested.
* If you have run the code on some other machine (your laptop?), as a first
guess, you can reserve the same amount of CPUs and memory as on that
machine.
* You can also monitor resource usage more closely with `top` on Mac and Linux
or `task manager` on Windows when running on the other machine.
* If your program does the same thing (or similar things) more than once, you
can estimate the total run time by multiplying the duration of one run with
the total number of runs.
* An initial resource reservation on a supercomputer is often a guess. Do not
worry too much, just adjust it later.
* Before reserving multiple CPUs, check if your code can make use of them.
* Before reserving multiple nodes, check if your code can make use of them.
Most GIS tools can not.
* When you double the number of cores, the job should run at least 1.5x
faster.
* Some tools run on both CPU and GPU. If unsure which to use, a good rule of
thumb is to compare the billing unit (BU) usage and select the one consuming
fewer units. A GPU uses 60 times more billing units than a single CPU core.
* You should always monitor jobs to find out how much of the requested
resources they actually used.

Partly adapted from [Aalto Scientific Computing](https://scicomp.aalto.fi/triton/usage/program-size/)
:::
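
Three of the rules of thumb above are simple arithmetic, so they can be sketched as small shell functions. The function names are made up for this illustration, and the CPU/GPU comparison is a deliberate simplification: it only applies the factor of 60 mentioned above to a single GPU and ignores memory, disk and the CPU cores reserved alongside the GPU.

```bash
#!/bin/bash
# estimate_time SECONDS_PER_RUN NUMBER_OF_RUNS
# Total run time = duration of one run x number of runs,
# printed as hh:mm:ss, a form accepted by the Slurm --time option.
estimate_time() {
    total=$(( $1 * $2 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# scales_well SECONDS_BEFORE SECONDS_AFTER_DOUBLING_CORES
# Succeeds if doubling the cores made the job at least 1.5x faster.
scales_well() {
    # t_before / t_after >= 1.5  is the same as  2 * t_before >= 3 * t_after
    [ $(( 2 * $1 )) -ge $(( 3 * $2 )) ]
}

# cpu_or_gpu CPU_CORES CPU_SECONDS GPU_SECONDS
# Compares core-seconds on CPU with GPU-seconds weighted by 60 (one GPU).
cpu_or_gpu() {
    cpu_cost=$(( $1 * $2 ))
    gpu_cost=$(( 60 * $3 ))
    if [ "$cpu_cost" -le "$gpu_cost" ]; then echo cpu; else echo gpu; fi
}
```

For example, `estimate_time 90 10` prints `00:15:00` for ten runs of 90 seconds each, and `scales_well 100 80` fails because going from 100 s to 80 s is only a 1.25x speed-up.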

## Partitions

A **partition** is a set of compute nodes, grouped logically. Resource limitations for a job are defined by the partition (or queue) the job is submitted to. The limitations affect the **maximum run time, available memory and the number of CPU/GPU cores**. Jobs should be submitted to the smallest partition that matches the required resources.
A **partition** is a logically grouped set of compute nodes. Resource
limitations for a job are defined by the partition (or queue) the job is
submitted to. The limitations affect the **maximum run time, available memory
and the number of CPU/GPU cores**. Jobs should be submitted to the smallest
partition that matches the required resources.

- [CSC Docs: Available batch job partitions](https://docs.csc.fi/computing/running/batch-job-partitions/)
- [LUMI Docs: Slurm partitions](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/partitions/)


## Job types

* **Interactive jobs** for working with some tool interactively, for example graphical tools, writing code, testing. For interactive jobs allocate the resource via the the [interactive partition](https://docs.csc.fi/computing/running/interactive-usage/). This way your work is performed in a compute node, not on the login node. Interactive partition is often used for applications in the web interface. The resources are limited in interactive partition, but it should have no or very short queue.
* **Serial jobs** work on only one task at a time following a sequence of instructions, while only using one core.
* **Parallel jobs** distribute the work over several cores or nodes in order to achieve a shorter wall time (and/or a larger allocatable memory).
* **GPU jobs** for tools that can benefit from running on GPUs. In spatial analysis context, GPUs are most often used for deep learning.
* **Interactive jobs** are used for e.g. working interactively with
tools that have a graphical UI, writing code (using graphical development
environments) and testing whether a program runs as intended. For
interactive jobs, allocate the resources from the
[interactive partition](https://docs.csc.fi/computing/running/interactive-usage/).
This way your work is performed on a compute node, not on the login node.
The interactive partition is often used for applications in the web interface.
The resources are limited on this partition, but it should have very
short queuing times.
* **Serial jobs** work on only one task at a time following a sequence of
instructions and only using one core.
* **Parallel jobs** distribute the work over several cores or nodes in order
to achieve a shorter wall time (and/or more allocatable memory).
* **GPU jobs** for tools that can benefit from running on GPUs. In a spatial
analysis context, GPUs are most often used for deep learning.

:::{admonition} Which partition to choose?
:class: tip

Check [CSC Docs: Available batch job partitions](https://docs.csc.fi/computing/running/batch-job-partitions/) and find suitable partitions for these tasks:

1. Through trial and error Anna has determined that her image processing process takes about 60 min and 16 GB of memory.
2. Laura has profiled her code, and determined that it can run efficiently on 20 cores with 12 GB of memory each. The complete process should be done within 4 days.
1. Through trial and error, Anna has determined that her image processing task
takes about 60 min and 16 GB of memory.
2. Laura has profiled her code, and determined that it can run efficiently on
20 cores with 12 GB of memory each. The complete process should be done
within 4 days.
3. Ben wants to visualize a 2 GB file in QGIS.
4. Neha has written and run some Python code on her own machine. She now wants to move to Puhti and, before running her full pipeline, test that her code executes correctly with a minimal dataset.
5. Josh wants to run 4 memory heavy tasks (100GB) in parallel. Each job takes about 30 minutes to execute.
4. Neha has written and run some Python code on her own machine. She now wants
to move to Puhti and, before running her full pipeline, test that her code
executes correctly with a minimal dataset.
5. Josh wants to run 4 memory heavy tasks (100GB) in parallel. Each job takes
about 30 minutes to execute.

:::{admonition} Solution
:class: dropdown

1. She does not need interactive access to her process, so `small` suits best.
2. She needs to choose `longrun` or adapt her code to get under 3 days runtime (which she might want to do in order to avoid exessively long queueing times).
3. For the webinterface,`interactive` suits best and should be the first choice.
4. This is a very good idea and should always be done first. Neha can get the best and fast experience using `test` partition. This means to keep the runtime under 15 min and the memory needs below 190 GiB at a maximum of 80 tasks.
5. 400GB memory in total is more than most partitions can take. If this is the least memory possible for the jobs, it has to be run on `hugemem`.
2. She needs to choose `longrun` or adapt her code to get under 3 days runtime
(which she might want to do in order to avoid excessively long queueing
times).
3. For the web interface, `interactive` suits best and should be the first
choice.
4. This is a very good idea and should always be done first. Neha can get the
testing done quickly (= with limited queuing overhead) using the `test`
partition. This means keeping the runtime under 15 min and the memory needs
below 190 GiB at a maximum of 80 tasks.
5. 400GB memory in total is more than most partitions can provide. If this is the
least memory possible for the jobs, it has to be run on `hugemem`.
:::
:::

27 changes: 17 additions & 10 deletions materials/job_monitoring.md
@@ -4,7 +4,10 @@
Check the status of your job: `squeue --me`

## Job output
By default, the standard output (e.g. things that you print as part of your script) and standard error (e.g. error messages from Slurm, your tool or package) are written to the file `slurm-<jobid>.out` in the same folder as the batch job script.
By default, the standard output (e.g. things that you print as part of your
script) and standard error (e.g. error messages from Slurm, your tool or
package) are written to the file `slurm-<jobid>.out` in the same directory as the
batch job script.

:::{admonition} What to do if a job fails?
:class: seealso
@@ -69,7 +72,10 @@ JobID Partition State ReqMem MaxRSS AveRSS Elapsed
22361601.0 COMPLETED 145493K 139994035 00:06:17
```

**Note!** Querying data from the Slurm accounting database with `sacct` can be a very heavy operation. **Don't** query long time intervals or run `sacct` in a loop/using `watch` as this will degrade the performance of the system for all users.
**Note!** Querying data from the Slurm accounting database with `sacct` can be
a very heavy operation. **Don't** query long time intervals or run `sacct` in
a loop/using `watch` as this will degrade the performance of the system for
all users.

Important aspects to monitor are:
- Memory efficiency
@@ -85,21 +91,22 @@ Important aspects to monitor are:
- Perform a scaling test.
- GPU efficiency
- If low GPU usage:
- better to use CPUs?
- Better to use CPUs?
- Is disk I/O the bottleneck?
- Disk workload
- If there is a lot of I/O, use [local disks on compute nodes](https://docs.csc.fi/computing/running/creating-job-scripts-puhti/#local-storage)
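
To make the memory efficiency check concrete, the sketch below compares the peak memory use of a job with the memory it requested. The function name `mem_efficiency` and the 2 GiB request are assumptions for illustration. The value 145493K comes from the sample `sacct` output above and is read here as `MaxRSS`.

```bash
#!/bin/bash
# The two inputs could be looked up for a single finished job with, e.g.:
#   sacct -j <jobid> -o JobID,State,ReqMem,MaxRSS,Elapsed

# mem_efficiency MAXRSS_KIB REQUESTED_KIB
# Prints the share of the requested memory that was used, in whole percent.
mem_efficiency() {
    echo $(( 100 * $1 / $2 ))
}

# Peak use of 145493 KiB against a hypothetical request of 2 GiB (2097152 KiB)
mem_efficiency 145493 2097152
```

A result this low would suggest that the memory request can be reduced considerably for the next run.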

:::{admonition} Monitoring interactive jobs
:class: tip
If you want to monitor real-time resource usage of interactive job:
- Open a new terminal on the same compute node as where the tool/script is running:
If you want to monitor the real-time resource usage of an interactive job:
- Open a new shell on the same compute node as where the tool/script is running:
- Jupyter and RStudio have Terminal windows.
- If it is some other tool, open another terminal to the copmpute node:
- Find out the compute node name from the prompt of the interactive job, something like: `r18c02`
- Open a new terminal to login node login node,
- Connect to compute node, for example: `ssh r18c02`
- Use Linux `top -u $USER` command, it gives rough estimate of memory and CPU usage.
- If it is some other tool, manually open another shell on the compute node:
- Find out the compute node name from the prompt of the interactive
job or the output of `squeue --me` (it's something like `r18c02`)
- Open a new terminal window or tab on your device and log into the supercomputer
- Connect to the compute node from the login node by running `ssh <comp-node-id>` (for example `ssh r18c02`)
- Use the Linux `top -u $USER` command, which gives a rough estimate of the memory and CPU usage of the job.
:::


26 changes: 13 additions & 13 deletions materials/supercomputer_setup.md
@@ -9,29 +9,29 @@ Typical physical parts of a supercomputer:
* High-speed networks between these

## Login-nodes
- Login nodes are used for moving data and scripts, script editing and for starting jobs.
- Login nodes are used for moving data and scripts, editing scripts, and starting jobs.
- When you login to CSC's supercomputers, you enter one of the login nodes of the computer
- There is only a few login-nodes and they are shared by all users, so they are [not intended for heavy computing.](https://docs.csc.fi/computing/overview/#usage-policy)
- There are only a few login-nodes and they are shared by all users, so they are [not intended for heavy computing.](https://docs.csc.fi/computing/overview/#usage-policy)

![](./images/HPC_nodes.png)

## Compute-nodes
- The heavy computing should be done on compute-nodes.
- Each node has **memory**, which is used for storing information about a current task.
- Compute nodes have 2 main type of processors:
- Heavy computing should be done on compute-nodes.
- Each node has **memory**, which is used for storing information about the current task.
- Compute nodes can be classified based on the types of processors they have:
* **CPU-nodes** have only CPUs (central processing unit).
* **GPU-nodes** have both GPUs (graphics processing unit) and CPUs. GPUs are widely used for deep learning.
* Each CPU-node includes **cores**, which are the basic computing. There are 40 cores in Puhti and 128 in Mahti and LUMI CPU-nodes.
* It depends on the used software, if it benefits from GPU or not. Most GIS-tools can not use GPUs.
* Each CPU has multiple **cores**, which are the basic computing resource. There are 40 cores in Puhti and 128 in Mahti and LUMI CPU-nodes.
* Whether your task benefits from a GPU depends on the software used. Most GIS-tools can not use GPUs.
* GPUs are more expensive, so in general the software should run at least 3x faster on a GPU for the use of GPU-nodes to be reasonable.
- While using compute nodes the compute resources have to be defined in advance, and specified if CPU or/and GPU is needed, how many cores or nodes and how much memory.
- When using compute nodes, the compute resources have to be defined in advance. You must specify e.g. the amount of GPUs, memory and nodes or CPU cores.
- Specifics of [Puhti](https://docs.csc.fi/computing/systems-puhti/#nodes), [Mahti](https://docs.csc.fi/computing/systems-mahti/) and [LUMI](https://docs.lumi-supercomputer.eu/hardware/lumic/) compute nodes.


## Storage
![](./images/HPC_disks.png)

- **Disk** refers to all storage that can be accessed like a file system. This is generally storage that can hold data permanently, i.e. data is still there even if the computer has been restarted.
- **Disk** refers to all storage that can be accessed as a file system. This is generally storage that can hold data permanently, i.e. data is still there even if the computer has been restarted.
- CSC supercomputers use Lustre as the **parallel distributed file system**

### Puhti disk areas
@@ -45,13 +45,13 @@ Typical physical parts of a supercomputer:
* `scratch` space can be extended, but it would use billing units then.

#### Temporary fast disks
- Some nodes might have also **local disk space** for temporary use.
- Some nodes might also have **local disk space** for temporary use.
- [CSC Docs: Login node local tmp](https://docs.csc.fi/computing/disk/#login-nodes) `$TMPDIR` for compiling, cleaned frequently.

- [CSC Docs: NVMe](https://docs.csc.fi/computing/running/creating-job-scripts-puhti/#local-storage) - `$LOCAL_SCRATCH` in batch jobs,
- NVMe is accessible only during your job allocation, inc. interactive job
- NVMe is accessible only during your job allocation (including any interactive jobs)
- You must copy data in and out during your batch job
- If your job reads or writes a lot of small files, using this can give 10x performance boost
- If your job reads or writes lots of small files, using this can give 10x performance boost
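
The copy-in, compute, copy-out pattern for the local disk can be sketched as follows. This assumes the job has been allocated local disk so that `$LOCAL_SCRATCH` is set; outside such a job the sketch falls back to a temporary directory. The function name `stage_and_run` and the stand-in processing step (recording file sizes) are hypothetical.

```bash
#!/bin/bash
# stage_and_run INPUT_DIR OUTPUT_DIR
stage_and_run() {
    # Fast local disk inside a job; a temporary directory otherwise
    work="${LOCAL_SCRATCH:-$(mktemp -d)}"
    mkdir -p "$work/in" "$work/out" "$2"

    # 1. Copy the input data to the local disk
    cp -r "$1"/. "$work/in/"

    # 2. Do all reading and writing on the local disk.
    #    Stand-in for the real tool: store the size of each input file.
    for f in "$work"/in/*; do
        wc -c < "$f" > "$work/out/$(basename "$f").size"
    done

    # 3. Copy the results back before the job ends
    cp -r "$work/out"/. "$2"/
}
```

Inside a batch job it would be called once, for example as `stage_and_run /scratch/<project>/input /scratch/<project>/output`, where both paths are placeholders.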

:::{admonition} Avoid unnecessary reading and writing
:class: seealso
@@ -83,7 +83,7 @@ Which of the following tasks would suit to run on the login node?

>Note that script names do not always reflect their contents: before launching #3, please check what is inside create_directories.sh and make sure it does what the name suggests.
Running resource-intensive applications on the login node are forbidden. Unless you are sure it will not affect other users, do not run jobs like #1 (python) or #4 (a software). You will anyway want more resources for these, than the login node can provide.
Running resource-intensive applications on the login node is forbidden. Unless you are sure it will not affect other users, do not run jobs like #1 (python) or #4 (a software). You will in any case want more resources for these than the login node can provide.

:::
:::
4 changes: 2 additions & 2 deletions materials/supercomputing.md
@@ -33,8 +33,8 @@
* Deep learning libraries run much faster on GPU.

#### Parallel computing
* Only few GIS tools have built-in support for parallization
* With scrips and diving the data any tool can be run in parallel
* Only a few GIS tools have built-in support for parallelization
* By using scripts and dividing the data, any tool can be run in parallel
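
A minimal sketch of the "scripts and divided data" approach: the data is assumed to be split into one `.txt` file per piece, and a hypothetical `process_one` function stands in for the real tool. Each file is handled by its own background process, so the number of files should not exceed the number of reserved cores.

```bash
#!/bin/bash
# Hypothetical per-file step; replace the body with a call to the real tool.
# Stand-in "analysis": count the lines of the input file.
process_one() {
    wc -l < "$1" > "${1%.txt}.out"
}

# run_parallel DATA_DIR
# Starts one background process per input file, then waits for all of them.
run_parallel() {
    for f in "$1"/*.txt; do
        process_one "$f" &
    done
    wait
}
```

Calling `run_parallel data_pieces` (a placeholder directory name) would leave one `.out` file next to each input file.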

#### Embarrassingly parallel analyses
* Many similar, but independent tasks.