Terminology page reorganization and other suggestions from KE (#50)
* warning on job limits

* fix terminology page

* Update materials/job_types.md

* Update materials/terminology.md

* Update materials/terminology.md

* Update materials/job_types.md

Co-authored-by: Samantha Wittke <[email protected]>

* Add elements of supercomputing

---------

Co-authored-by: Kylli Ek <[email protected]>
Co-authored-by: Rasmus Kronberg <[email protected]>
3 people authored Oct 11, 2023
1 parent 436baa9 commit 4c2c940
Showing 5 changed files with 33 additions and 36 deletions.
12 changes: 10 additions & 2 deletions materials/job_types.md
@@ -9,7 +9,8 @@ You already got to know the [interactive web interface for Puhti](https://docs.c

Disadvantages of interactive jobs:
* Blocks your shell until it finishes
-* Connection interruption means that job is gone
+* Connection interruption means that the job is gone.
+* Note: With a persistent `compute node shell` from the web interface, or with the Linux tool [screen](https://www.geeksforgeeks.org/screen-command-in-linux-with-examples/), it is possible to keep a job running while the terminal is closed (see the sketch below).
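
For the `screen` route mentioned in the note above, a minimal workflow could look like the following sketch; `analysis.py` is a hypothetical placeholder for any long-running command:

```bash
screen -S myjob        # start a new named screen session
python analysis.py     # launch the long-running command inside it
# Detach with Ctrl-A d; the session (and the command) keeps running
# even after you close the terminal
screen -r myjob        # reattach later to check on progress
```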

Apart from interactive jobs, a job can be classified as **serial, parallel or GPU**, depending on the main requested resource. A serial job is the simplest type of job, whereas parallel and GPU jobs may require some advanced methods to fully utilise their capacity. So instead of starting 10 shells to run 10 things at once, get to know serial jobs:

@@ -85,7 +86,14 @@ In this course we will focus on **embarrassingly/naturally/delightfully parallel

## Array jobs

-[Array jobs](https://docs.csc.fi/computing/running/array-jobs/) are another way of taking advantage of Puhti's parallel processing capabilities for embarrassingly parallel tasks. Array jobs are useful when same code is executed many times for different datasets or with different parameters without the need to change the Python code. In GIS context a typical use case would be to run some model on study area split into multiple files where output from one file doesn't have an impact on the result of an other area.
+[Array jobs](https://docs.csc.fi/computing/running/array-jobs/) are another way of taking advantage of Puhti's parallel processing capabilities for embarrassingly parallel tasks. Array jobs are useful when the same code is executed many times for different datasets or with different parameters, without the need to change your code. In a GIS context, a typical use case would be to run some model on a study area split into multiple files, where the output for one file has no impact on the result for another area.
+
+:::{admonition} Maximum job limits
+:class: warning
+
+Submitting an array job with 100 members counts the same as submitting 100 individual jobs from the batch queue system's perspective. On Puhti, one can submit/run a maximum of 400/200 jobs at the same time (except for the `interactive`, `test` and `gputest` partitions, where the limits are one or two). In addition, each user should keep the number of submitted jobs below one thousand per month.
+
+:::
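
As a rough sketch of how such an array job could look in a batch script (the account, partition choice, and `process.py` with its `split_*.gpkg` inputs are hypothetical placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=project_2001234   # hypothetical project number, use your own
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=4G
#SBATCH --array=1-10                # 10 members, counted as 10 jobs by the queue

# Each member gets a unique SLURM_ARRAY_TASK_ID and processes one input file
srun python process.py split_${SLURM_ARRAY_TASK_ID}.gpkg
```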

## GPU jobs

4 changes: 3 additions & 1 deletion materials/partitions.md
@@ -1,6 +1,6 @@
# Partitions

-Partitions are logical sets of nodes. Resource limitations for a job are defined by the partition (or queue) the job is submitted to. The limitations affect the maximum run time, the amount of memory, and the number of available CPU cores (which are called CPUs in Slurm). In addition, partitions may also define default resources that are automatically allocated for jobs if nothing has been specified.
+A **partition** is a set of compute nodes, grouped logically. Resource limitations for a job are defined by the partition (or queue) the job is submitted to. The limitations affect the maximum run time, the amount of memory, and the number of available CPU cores (which are called CPUs in Slurm). In addition, partitions may also define default resources that are automatically allocated for jobs if nothing has been specified.

Jobs should be submitted to the partition that best matches the required resources. That way, as few resources as possible are blocked, and another user with a higher demand for memory can run a job earlier. Of course, other considerations may also influence the choice of a partition.
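
One way to see what the partitions actually offer is to ask Slurm itself. Below is a hedged sketch using standard `sinfo` options; [CSC Docs](https://docs.csc.fi/computing/running/batch-job-partitions/) remains the authoritative reference for Puhti's exact limits:

```bash
sinfo --summarize                  # one summary line per partition
sinfo -p small -o "%P %l %c %m"    # partition, time limit, CPUs and memory per node
```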

@@ -12,6 +12,8 @@ Jobs should be submitted to the partition that best matches the required resourc
:::{admonition} Which partition to choose?
:class: tip

+Check [CSC Docs: Available batch job partitions](https://docs.csc.fi/computing/running/batch-job-partitions/) and find suitable partitions for these tasks:
+
1. Through trial and error, Anna has determined that her image processing workflow takes about 60 min and 16 GB of memory on a single CPU.
2. Kalika has profiled her code and determined that it can run efficiently on 20 cores with 12 GB of memory each. The complete process should be done within 4 days.
3. Ben wants to visualize an 80 GB file in QGIS.
3 changes: 3 additions & 0 deletions materials/supercomputer_setup.md
@@ -2,6 +2,9 @@

![](./images/puhti_overview.png)


+A supercomputer has many **nodes** (you can roughly think of one node as a single computer), which have the same components as your laptop or desktop computer: CPUs (sometimes also called processors or cores), memory (or RAM), and disk space. However, a supercomputer also has some additional/specialized components:

- Login nodes are used to set up jobs (and to launch them)
- Jobs are run on the compute nodes
- A batch job system (scheduler) is used to run and manage the jobs
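
In practice, you talk to the scheduler from a login node with a handful of Slurm commands; a minimal sketch, where the script name and job ID are placeholders:

```bash
sbatch my_job.sh       # submit a batch job to the scheduler
squeue -u $USER        # list your queued and running jobs
scancel 1234567        # cancel a job by its ID
seff 1234567           # after a job finishes: check how efficiently it used resources
```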
49 changes: 16 additions & 33 deletions materials/terminology.md
@@ -6,49 +6,32 @@
While for some there might be differences, the terms "computing cluster", "High Performance Computer (HPC)" and "supercomputer" are often used interchangeably.
:::

-**Cluster**
-A cluster is all resources wired together for the purpose of high performance computing, which includes computational devices, networking devices (switches) and storage devices combined.
+The term **HPC system** refers to a stand-alone resource for computationally intensive workloads.
+Such systems are typically composed of a multitude of integrated processing and
+storage elements, designed to handle high volumes of data and/or large numbers of floating-point
+operations ([FLOPS](https://en.wikipedia.org/wiki/FLOPS)) with the highest possible performance.

-**Node**
-You can roughly think that one **node** is a single computer.
+To support these constraints, an HPC resource must exist in a specific, fixed location: networking
+cables can only stretch so far, and electrical and optical signals can travel only so fast.

-**Core**
-A node contains one or more central or graphical processing units (**CPUs** or **GPUs**) with many **cores** plus shared memory.
+**CPUs** (central processing units) are a computer’s processors for actually running programs and calculations.

-**Job**
-When you want to execute a program on the supercomputer, it has to be boxed into an abstraction layer called "job".
+A **GPU** (graphics processing unit) performs certain linear algebra operations extremely efficiently, such as those encountered when processing computer graphics. GPUs are also widely used for speeding up model training in deep learning.

-**Partition**
-A partition is a set of compute nodes, grouped logically. We separate our computational resources based on the features of their hardware and the nature of the job.
-For instance, there is an interactive computation partition called `interactive` and a CUDA enabled GPU based partition `gpu`.
+Both CPUs and GPUs contain many **cores** plus shared memory.

-**Task**
-It maybe confusing, but tasks in Slurm means processor resource. By default, 1 task uses 1 core. However, this behavior can be altered.
+Information about a current task is stored in the computer’s **memory**.

-Adapted from [ODU Research Computing Wiki](https://wiki.hpc.odu.edu/)
+**Disk** refers to all storage that can be accessed like a file system. This is generally storage that can hold data permanently, i.e. the data is still there even if the computer has been restarted. While this storage can be local (a hard drive installed inside the node), it is more common for nodes to connect to a shared, remote fileserver or cluster of servers.

+A **cluster** is all resources wired together for the purpose of high performance computing: computational devices, networking devices (switches) and storage devices combined.

-All of the nodes in an HPC (High Performance Computing) system have the same components as your own laptop or desktop: CPUs (sometimes also called processors or cores), memory (or RAM), and disk space. CPUs are a computer’s tool for actually running programs and calculations. A Graphical Processing Units (GPU) does certain linear algebra operations extremely efficiently, such as those encountered when processing computer graphics. Information about a current task is stored in the computer’s memory. Disk refers to all storage that can be accessed like a file system. This is generally storage that can hold data permanently, i.e. data is still there even if the computer has been restarted. While this storage can be local (a hard drive installed inside of it), it is more common for nodes to connect to a shared, remote fileserver or cluster of servers.
+When you want to execute a program on the supercomputer, it has to be boxed into an abstraction layer called a **job**.

+Adapted from [ODU Research Computing Wiki](https://wiki.hpc.odu.edu/) and [NRIS](https://training.pages.sigma2.no).

-:::{admonition} What is an HPC system?
-:class: seealso
-
-The term *HPC system* is a stand-alone resource for computationally intensive workloads.
-They are typically comprised of a multitude of integrated processing and
-storage elements, designed to handle high volumes of data and/or large numbers of floating-point
-operations ([FLOPS](https://en.wikipedia.org/wiki/FLOPS)) with the highest possible performance.
-For example, all the machines on the [Top-500](https://www.top500.org) list are HPC systems. To
-support these constraints, an HPC resource must exist in a specific, fixed location: networking
-cables can only stretch so far, and electrical and optical signals can travel only so fast.
-
-The word `cluster` is often used for small to moderate scale HPC resources. Clusters are often maintained in computing centers that support several such systems, all sharing common networking and storage to support common compute intensive tasks.
-
-From [NRIS](https://training.pages.sigma2.no).
-
-:::

-:::{admonition} Difference between an HPC computing cluster and the cloud
+:::{admonition} Difference between an HPC system and the cloud
:class: seealso, dropdown


@@ -61,7 +44,7 @@ From [NRIS](https://training.pages.sigma2.no).
* All computers have a synchronised clock.
* A scheduler is involved (later lesson)

-*A cloud service could have access to an HPC cluster as part of the service as well*
+A cloud service could have access to an HPC system as part of the service as well.

From [NRIS](https://training.pages.sigma2.no).
:::
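
To tie these terms back to Slurm, here is a hedged sketch of how a single batch job request touches each of them (`my_program` is a placeholder):

```bash
#!/bin/bash
#SBATCH --partition=small      # partition: the logical group of nodes to run in
#SBATCH --nodes=1              # node: one "computer" of the cluster
#SBATCH --ntasks=1             # task: one process (by default, 1 task uses 1 core)
#SBATCH --cpus-per-task=4      # cores: four cores for that one process
#SBATCH --mem=8G               # memory: RAM shared by those cores
#SBATCH --time=01:00:00        # everything above is requested as one *job*

srun my_program                # runs on a compute node, not the login node
```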
1 change: 1 addition & 0 deletions materials/where_to_go.md
@@ -56,6 +56,7 @@ If you used any of our resources for your research, please acknowledge CSC and G
* [CSC geoinformatics training materials](https://research.csc.fi/gis-learning-materials)
* [CSC Earth Observation tutorial](https://docs.csc.fi/support/tutorials/gis/eo_guide/)
* [LUMI 1-day training](https://lumi-supercomputer.github.io/LUMI-training-materials/1day-20230921/)
+* [Elements of supercomputing](https://edukamu.fi/elements-of-supercomputing)
* ['Research data management' self-study course](https://ssl.eventilla.com/event/v8B6B)
* [CodeRefinery materials on FAIR research software development](https://coderefinery.org/lessons/core/)

