fix typos, rephrase some info and instructions (#74)

msainio authored Oct 1, 2024
1 parent fb2ac06 commit 542be82
Showing 25 changed files with 232 additions and 157 deletions.
11 changes: 5 additions & 6 deletions materials/allas.md
@@ -1,6 +1,6 @@
# Allas – object storage

What it is?
What is it?

* Allas is a **storage service**, technically object storage
* **For CSC project lifetime: 1-5 years**
@@ -13,7 +13,7 @@ What it is?
* LUMI-O is very similar to Allas
* [LUMI Docs: LUMI-O](https://docs.lumi-supercomputer.eu/storage/lumio/)

What it is NOT?
What is it NOT?

- A file system (even though many tools try to fool you into thinking so). It is just a place to store static data objects.
- A data management environment. Tools for e.g. search, metadata, version control and access management are minimal.
@@ -30,7 +30,7 @@ What it is NOT?
- For data organization and access administration
- Data is stored as **objects** within a bucket
- Practically: object = file
- In reality, there is no hierarcical directory structure within a bucket, although it sometimes looks like that.
- In reality, there is no hierarchical directory structure within a bucket, although it sometimes looks like that.
- An object name can be `/data/myfile.zip`, and some tools may display it as a `data` folder with a `myfile.zip` file (see the sketch below).
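
For illustration, here is a minimal sketch of how such a pseudo-folder looks through `s3cmd` (the bucket name `my-bucket` and an already-configured S3 connection are assumptions made for the example):

```bash
# Upload a local file as an object whose name contains a slash
s3cmd put myfile.zip s3://my-bucket/data/myfile.zip

# Listing the bucket root shows "data/" as if it were a folder...
s3cmd ls s3://my-bucket/

# ...but there is no real folder, only an object named data/myfile.zip
s3cmd ls s3://my-bucket/data/
```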

### Things to consider
@@ -44,9 +44,8 @@ What it is NOT?

- S3 and SWIFT.
- **For new projects S3 is recommended**
- SWIFT might be soon depricated.
- Avoid cross-using SWIFT and S3 based objects!

- SWIFT might be soon deprecated.
- Avoid cross-using SWIFT- and S3-based objects!
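
As a rough sketch of what the choice looks like in practice on Puhti (assuming the default `allas-conf` mode is still the SWIFT-based one, as the CSC tooling has traditionally worked):

```bash
module load allas

# Recommended for new projects: configure an S3 connection
allas-conf --mode s3cmd

# Default mode (SWIFT-based); avoid mixing it with S3 on the same objects
allas-conf
```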

## Tools for Allas

4 changes: 3 additions & 1 deletion materials/cheatsheet.md
@@ -1,7 +1,9 @@
# CSC and Unix cheatsheet
Adapted from [CSC Quick Reference](https://docs.csc.fi/img/csc-quick-reference/csc-quick-reference.pdf)

Note that this is simplified for beginners usage, once you get more experienced, you'll notice that there is more (and better) options for everything, and that not everything written here is "the whole truth".
Note that this is simplified for beginners' usage. Once you get more
experienced, you'll notice that there are more (and better) options for
everything, and that not everything written here is "the whole truth".

## Service names

15 changes: 11 additions & 4 deletions materials/connecting.md
@@ -26,7 +26,10 @@

## Connecting to the supercomputer via SSH

During the course we will access the supercomputer via the webinterface in order to not overwhelm you with setups before the course. However, this way may not always be the most convenient. You can also connect to the supercomputer via SSH.
During the course we will access the supercomputer via the web interface in
order to not overwhelm you with setups before the course. However, this way
may not always be the most convenient. You can also connect to the
supercomputer via SSH.
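
For example, a plain SSH login from a local terminal could look like this (replace `cscusername` with your own CSC username; `puhti.csc.fi` is the Puhti login address):

```bash
# Log in to Puhti with your CSC username and password
ssh cscusername@puhti.csc.fi

# Optional: create an SSH key pair locally and copy the public key to Puhti,
# so that later logins no longer ask for the password
ssh-keygen -t ed25519
ssh-copy-id -i ~/.ssh/id_ed25519.pub cscusername@puhti.csc.fi
```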

:::{admonition} Connecting with SSH clients
:class: seealso, dropdown
@@ -38,15 +38,19 @@ During the course we will access the supercomputer via the webinterface in order
- In Windows:
- `Command Prompt` or `Powershell` are always available and can be used for basic connections.
- Special tools like [PuTTY](https://www.putty.org/) or [MobaXterm](https://mobaxterm.mobatek.net/) provide more options, including the possibility to save settings, but need installation.
- To avoid typing your password every time again and to make your connection more secure, you can [set up SSH-keys](https://docs.csc.fi/computing/connecting/#setting-up-ssh-keys).
- To avoid typing your password every time again and to make your connection more secure, you can [set up SSH-keys](https://docs.csc.fi/computing/connecting/ssh-keys/).
- [CSC Docs: Connecting to CSC supercomputers](https://docs.csc.fi/computing/connecting/)
- [LUMI Docs: Get started](https://docs.lumi-supercomputer.eu/firststeps/).

:::

## Developing scripts remotely

Instead of developing code on your local machine and moving it as files to the supercomputer for testing, you can also consider to use a local editor and push edited files directly into the supercomputer.
This works for example with **Visual Studio Code** or **Notepad++**. Note that [Visual Studio Code](https://docs.csc.fi/computing/webinterface/vscode/) is also available through the Puhti web interface.
Instead of developing code on your local machine and moving it as files to the
supercomputer for testing, you can also consider using a local editor and
push edited files directly to the supercomputer. This works for example
with **Visual Studio Code** or **Notepad++**. Note that [Visual Studio
Code](https://docs.csc.fi/computing/webinterface/vscode/) is also available
through the Puhti web interface.

- [CSC Docs: Developing scripts remotely](https://docs.csc.fi/support/tutorials/remote-dev/)
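
One way to make remote editing smoother is to add a host entry to `~/.ssh/config`, which editors that connect over SSH (such as the VS Code Remote-SSH extension) can reuse. This is only a sketch, assuming OpenSSH on the local machine; the alias `puhti` and the username are made-up examples:

```bash
# Append a host alias for Puhti to the local SSH configuration
cat >> ~/.ssh/config <<'EOF'
Host puhti
    HostName puhti.csc.fi
    User cscusername
EOF

# After this, a plain "ssh puhti" works, and the same alias can be
# selected in editors that open files over SSH
ssh puhti
```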
72 changes: 52 additions & 20 deletions materials/course_plan.md
@@ -42,30 +42,62 @@
:::{admonition} Painter analogy
:class: tip

Suppose that we want to paint the four walls in a room. This is our problem. We can divide our problem in 4 different tasks: paint each of the walls. In principle, our 4 tasks are independent from each other in the sense that we don’t need to finish one to start another. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks.
Concurrent vs. parallel execution

If there is only one painter, they could work for a while in one wall, then start painting another one, then work for a little bit in the third one, and so on. The tasks are being executed concurrently but not in parallel. Only one task is being performed at a time. If we have 2 or more painters for the job, then the tasks can be performed in parallel.

In our analogy, the painters represent CPU cores in your computer. The number of CPU cores available determines the maximum number of tasks that can be performed in parallel. The number of concurrent tasks that can be started at the same time, however, is unlimited.
Synchronous vs. asynchronous execution

Now imagine that all workers have to obtain their paint form a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. However, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser — one should wait while the other is being serviced.

In our analogy, the paint dispenser represents access to the memory in your computer. Depending on how a program is written, access to data in memory can be synchronous or asynchronous.
Distributed vs. shared memory

Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete its task totally on their own. They don’t even have to be in the same room, they could be painting walls of different rooms in the house, on different houses in the city, and different cities in the country. In many cases, however, we need a communication system in place. Suppose that worker A, needs a colour that is only available in the dispenser of worker B — worker A should request the paint to worker B, and worker B should respond by sending the required colour.

Think of the memory distributed on each node/computer of a cluster as the different dispensers for your workers. A fine-grained parallel program needs lots of communication/synchronisation between tasks, in contrast with a course-grained one that barely communicates at all. An embarrassingly/massively parallel problem is one where all tasks can be executed completely independent from each other (no communications required).
Suppose that we want to paint the four walls in a room. This is our problem.
We can divide our problem in 4 different tasks: paint each of the walls. In
principle, our 4 tasks are independent from each other in the sense that we
don’t need to finish one to start another. However, this does not mean that
the tasks can be executed simultaneously or in parallel. It all depends on the
amount of resources that we have for the tasks.
Concurrent vs. parallel execution

If there is only one painter, they could work for a while on one wall, then
start painting another one, then work for a little bit on the third one, and
so on. The tasks are being executed concurrently but not in parallel. Only one
task is being performed at a time. If we have 2 or more painters for the job,
then the tasks can be performed in parallel.

In our analogy, the painters represent CPU cores in your computer. The number
of CPU cores available determines the maximum number of tasks that can be
performed in parallel. The number of concurrent tasks that can be started at
the same time, however, is unlimited.
Synchronous vs. asynchronous execution

Now imagine that all workers have to obtain their paint from a central
dispenser located in the middle of the room. If each worker is using a
different colour, then they can work asynchronously. However, if they use the
same colour, and two of them run out of paint at the same time, then they have
to synchronise to use the dispenser — one should wait while the other is being
serviced.

In our analogy, the paint dispenser represents access to the memory in your
computer. Depending on how a program is written, access to data in memory can
be synchronous or asynchronous.
Distributed vs. shared memory

Finally, imagine that we have 4 paint dispensers, one for each worker. In this
scenario, each worker can complete their task totally on their own. They don’t
even have to be in the same room, they could be painting walls of different
rooms in the house, in different houses in the city, and different cities in
the country. In many cases, however, we need a communication system in place.
Suppose that worker A needs a colour that is only available in the dispenser
of worker B — worker A should request the paint from worker B, and worker B
should respond by sending the required colour.

Think of the memory distributed on each node/computer of a cluster as the
different dispensers for your workers. A fine-grained parallel program needs
lots of communication/synchronisation between tasks, in contrast with a
coarse-grained one that barely communicates at all. An
embarrassingly/massively parallel problem is one where all tasks can be
executed completely independent from each other (no communication required).
Processes vs. threads

Our example painters have two arms, and could potentially paint with both arms at the same time. Technically, the work being done by each arm is the work of a single painter.
Our example painters have two arms, and could potentially paint with both arms
at the same time. Technically, the work being done by each arm is the work of
a single painter.

In this example, each painter would be a process (an individual instance of a program). The painters’ arms represent a “thread” of a program. Threads are separate points of execution within a single program, and can be executed either synchronously or asynchronously.
In this example, each painter would be a process (an individual instance of a
program). The painters’ arms represent a “thread” of a program. Threads are
separate points of execution within a single program, and can be executed
either synchronously or asynchronously.

From [HPC Carpentry](http://www.hpc-carpentry.org/hpc-python/06-parallel).
:::
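
A toy shell illustration of the serial-versus-parallel part of the analogy (just a sketch; each `sleep` stands in for painting one wall):

```bash
# Serial: one "painter" does the four walls one after another (about 4 time units)
for wall in 1 2 3 4; do
    sleep 1        # paint one wall
done

# Parallel: four "painters" work at the same time (about 1 time unit),
# provided there are enough CPU cores available
for wall in 1 2 3 4; do
    sleep 1 &      # paint one wall in the background
done
wait               # wait until all background tasks have finished
```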



2 changes: 1 addition & 1 deletion materials/csc_services.md
@@ -25,5 +25,5 @@ See also [CSC service catalog](https://research.csc.fi/en/service-catalog)
:::{admonition} Sensitive data
:class: important

Sensitive data should be saved and processed only in services for sensitive data: [SD services](https://research.csc.fi/sensitive-data-services-for-research), [ePouta](https://research.csc.fi/-/epouta). Encrypted files can be stored also to [Allas](https://research.csc.fi/-/allas). Supercomputers and cPouta should not be used for sensitive data.
Sensitive data should be saved and processed only in services for sensitive data: [SD services](https://research.csc.fi/sensitive-data-services-for-research), [ePouta](https://research.csc.fi/-/epouta). Encrypted files can be stored also in [Allas](https://research.csc.fi/-/allas). Supercomputers and cPouta should not be used for sensitive data.
:::
6 changes: 3 additions & 3 deletions materials/data_tips.md
@@ -1,16 +1,16 @@
# Best practice tips for data

- Take **backups** of important files. Data on Puhti disks is not backed up.
- Allas is best CSC option for back-up.
- Github or similar for code.
- Allas is the best option for backups at CSC.
- GitHub or similar for code.
- Supercomputer disks do not work well with **too many small files**
- Plan your analysis in a way that avoids creating a huge number of files.
- Keep the small files in one zip file and unzip it only on local fast disks during the analysis (see the sketch after this list).
- Don't create a lot of files in one folder
- [CSC Docs: Best practice performance tips for using Lustre](https://docs.csc.fi/computing/lustre/#best-practices)
- Keep data that is needed longer also in Allas.
- **Databases**:
- Only file databases (SQLite, GeoPackage) can be kept in supercomputer disks.
- Only file databases (SQLite, GeoPackage) can be kept on supercomputer disks.
- For PostgreSQL and PostGIS use [CSC Pukki Database-as-service](https://docs.csc.fi/cloud/dbaas/).
- For any other database, set up a virtual machine in cPouta.
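
A rough sketch of the zip-file pattern in a batch job, assuming the Puhti convention of requesting node-local NVMe disk with `--gres=nvme` and the `$LOCAL_SCRATCH` variable (paths and the analysis command are placeholders):

```bash
# In the resource request part of the batch job script, ask for fast local disk:
#SBATCH --gres=nvme:10                      # 10 GB of node-local scratch

# In the command part, unpack the many small files onto the local disk,
# work there, and copy only the packed results back to the shared file system:
unzip /scratch/project_200xxxx/input_files.zip -d "$LOCAL_SCRATCH"
my_analysis "$LOCAL_SCRATCH"/input_files     # placeholder for the actual tool
zip -r /scratch/project_200xxxx/results.zip "$LOCAL_SCRATCH"/results
```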

8 changes: 4 additions & 4 deletions materials/exercise_allas.md
@@ -13,7 +13,7 @@
Learn how to:
* Configure connection to Allas and get S3 credentials
* Sync local folder to Allas (manual back-up)
* See what data is Allas
* See what data is in Allas
* Use s3cmd.
* [CSC Docs: s3cmd](https://docs.csc.fi/data/Allas/using_allas/s3_client/)
:::
@@ -28,8 +28,8 @@ Learn how to:

:::{admonition} Change the default project and username

* `project_200xxxx` is example project name, replace with your own CSC project name.
* `cscusername` is example username, replace with your username.
* `project_200xxxx` is an example project name, replace with your own CSC project name.
* `cscusername` is an example username, replace with your username.
:::

* Open [Puhti web interface](https://puhti.csc.fi) and log in
@@ -40,7 +40,7 @@ Learn how to:
```bash
module load allas
allas-conf --mode s3cmd
# It asks to select the project, select the project by number.
# It asks to select the project, select the project with the corresponding number.
# The configuration takes a moment, please wait.
```
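
Once the configuration is in place, syncing a local folder and checking what is in Allas could look roughly like this (the bucket and folder names are only examples):

```bash
# Create a bucket for the backup (bucket names must be unique within Allas)
s3cmd mb s3://project_200xxxx-backup

# Sync a local folder into the bucket (manual backup)
s3cmd sync my_data/ s3://project_200xxxx-backup/my_data/

# See what data is in Allas
s3cmd ls                                   # list your buckets
s3cmd ls s3://project_200xxxx-backup/      # list objects in one bucket
```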

18 changes: 9 additions & 9 deletions materials/exercise_basics.md
@@ -47,7 +47,7 @@ On the **login node**: Start an interactive job with `srun`, e.g.:
srun --time=00:10:00 --pty --account=project_200xxxx --partition=interactive bash ## replace xxxx with your project number; for the course you can also add --reservation=geocomputing_thu (not available at other times) and change the partition to small
```

**or** on Puhti you can also use `sinteractive` wrapper to start an interactive session from the **login node**, which simplifies the call and asks you for the resources step by step:
**or** on Puhti you can also use the `sinteractive` wrapper to start an interactive session from the **login node**, which simplifies the call and asks you for the resources step by step:

```bash
sinteractive -i
@@ -91,7 +91,7 @@ Try out some other command line tool, or maybe even start a `python` or `R` sess

-> This way you can work interactively for an extended period, using e.g. lots of memory without creating load on the login nodes.

Note that above we only asked for 10 minutes of time. Once that is up, you will be automatically logged out from the compute node.
Note that above we only asked for 10 minutes of time. Once that is up, you will be automatically logged out of the compute node.

Running `exit` on the login node will log you out from Puhti.
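
For example, inside the interactive session you could load a module and try the tools out without burdening the login node (a sketch; `geoconda` is used here simply because it appears later in these exercises):

```bash
# On the compute node, inside the interactive session
module load geoconda          # makes Python, GDAL and friends available
gdalinfo --version            # quick check that the command-line tools work
python3 -c "print('hello from a compute node')"
exit                          # leave the interactive session
```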

@@ -141,7 +141,7 @@ nano my_serial.bash
#SBATCH --account=project_200xxxx # Choose the billing project. Has to be defined!
#SBATCH --time=00:02:00 # Maximum duration of the job. Upper limit depends on the partition.
#SBATCH --partition=test # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set 1!
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set to 1!

echo -n "We are running on"
hostname # Run hostname-command, that will print the name of the Puhti compute node that has been allocated for this particular job
@@ -162,22 +162,22 @@ sbatch my_serial.bash
squeue --me
```

5. Once the job is done, check how much of the resources have been used with `seff jobid` (replace jobid with the number that was displayed after you ran sbatch command).
5. Once the job is completed, check how much of the resources have been used with `seff jobid` (replace jobid with the number that was displayed after you ran the `sbatch` command).

:::{admonition} Additional exercises
:class: tip

1. Where can you find the hostname print?
1. Where can you find the output of the `hostname` command?
2. How could you add a name to the job for easier identification?
3. What happens if you run the same script from above, but request only one minute and sleep for 2 minutes?
4. Can you run the gdalinfo command from the interactive job above via a non interactive job? What do you need to change in the sbatch job script?
4. Can you run the `gdalinfo` command from the interactive job above via a non-interactive job? What do you need to change in the sbatch job script?


:::{admonition} Solution
:class: dropdown

1. `slurm-jobid.out` in the directory from where you submitted the batch job. You can also change that location by specifying it in your batch job script with `#SBATCH --output=/your/path/slurm-%j.out`.
2. Add `#SBATCH --job-name=myname` to the resource request in the top of your sbatch script, to rename the job to myname.
2. Add `#SBATCH --job-name=myname` to the resource request at the top of your sbatch script to rename the job to "myname".
3. After the job has finished, check the log file with `cat slurm-<jobid>.out`. You should see an error at the end: `slurmstepd: error: *** JOB xxx ON xxx CANCELLED AT xDATE-TIMEx DUE TO TIME LIMIT ***`. This means that our job was killed for exceeding the amount of resources requested. Although this appears harsh, it is actually a feature. Strict adherence to resource requests allows the scheduler to find the best possible place for your jobs. It also ensures fair sharing of the computing resources.
4. Since `gdalinfo` is quite a fast command to run, you will only need to change the script part of your sbatch script; the resource request can stay the same. First we need to make `gdal` available within the job with `module load geoconda`, then we can run the `gdalinfo` command. After the job is done, you can find the information again in the `slurm-jobid.out` file.

@@ -186,7 +186,7 @@ squeue --me
#SBATCH --account=<project> # Choose the billing project. Has to be defined!
#SBATCH --time=00:02:00 # Maximum duration of the job. Upper limit depends on the partition.
#SBATCH --partition=test # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set 1!
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set to 1!

module load geoconda

@@ -201,6 +201,6 @@ gdalinfo /appl/data/geo/luke/forest_wind_damage_sensitivity/2017/windmap2017_int

* A batch job script combines resource estimates and computation steps
* Resource request lines start with `#SBATCH`
* You can find the jobs output, errors and prints in `slurm-jobid.out`
* You can find the job's output and errors in `slurm-jobid.out`

:::
2 changes: 1 addition & 1 deletion materials/exercise_gdal.md
@@ -10,7 +10,7 @@
:::{admonition} Goals
:class: note

Learn how to use commandline tools:
Learn how to use command-line tools:
* Interactively
* With several files in serial mode
* With several files in parallel (see the sketch below)
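
As a rough preview of the serial-versus-parallel idea with a command-line tool (a sketch only; the file pattern and the number of parallel tasks are placeholders):

```bash
module load geoconda                  # provides the gdalinfo command

# Serial: inspect the files one at a time
for f in *.tif; do
    gdalinfo "$f"
done

# Parallel: inspect up to 4 files at the same time
ls *.tif | xargs -n 1 -P 4 gdalinfo
```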