fix typos, rephrase some info and instructions (#74)

msainio authored Oct 1, 2024
1 parent fb2ac06 commit 542be82
Showing 25 changed files with 232 additions and 157 deletions.
11 changes: 5 additions & 6 deletions materials/allas.md
@@ -1,6 +1,6 @@
# Allas – object storage

What it is?
What is it?

* Allas is a **storage service**, technically object storage
* **For CSC project lifetime: 1-5 years**
@@ -13,7 +13,7 @@ What it is?
* LUMI-O is very similar to Allas
* [LUMI Docs: LUMI-O](https://docs.lumi-supercomputer.eu/storage/lumio/)

What it is NOT?
What is it NOT?

- A file system (even though many tools try to fool you into thinking so). It is just a place to store static data objects.
- A data management environment. Tools for e.g. search, metadata, version control and access management are minimal.
@@ -30,7 +30,7 @@ What it is NOT?
- For data organization and access administration
- Data is stored as **objects** within a bucket
- Practically: object = file
- In reality, there is no hierarcical directory structure within a bucket, although it sometimes looks like that.
- In reality, there is no hierarchical directory structure within a bucket, although it sometimes looks like that.
- An object name can be `/data/myfile.zip`, and some tools may display it as a `data` folder with a `myfile.zip` file (see the sketch below).
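
For illustration, here is a minimal sketch of how such a pseudo-folder looks through `s3cmd` (the bucket name `my-bucket` and an already-configured S3 connection are assumptions made for the example):

```bash
# Upload a local file as an object whose name contains a slash
s3cmd put myfile.zip s3://my-bucket/data/myfile.zip

# Listing the bucket root shows "data/" as if it were a folder...
s3cmd ls s3://my-bucket/

# ...but there is no real folder, only an object named data/myfile.zip
s3cmd ls s3://my-bucket/data/
```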

### Things to consider
@@ -44,9 +44,8 @@ What it is NOT?

- S3 and SWIFT.
- **For new projects S3 is recommended**
- SWIFT might be soon depricated.
- Avoid cross-using SWIFT and S3 based objects!

- SWIFT might be soon deprecated.
- Avoid cross-using SWIFT- and S3-based objects!
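
As a rough sketch of what the choice looks like in practice on Puhti (assuming the default `allas-conf` mode is still the SWIFT-based one, as the CSC tooling has traditionally worked):

```bash
module load allas

# Recommended for new projects: configure an S3 connection
allas-conf --mode s3cmd

# Default mode (SWIFT-based); avoid mixing it with S3 on the same objects
allas-conf
```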

## Tools for Allas

4 changes: 3 additions & 1 deletion materials/cheatsheet.md
@@ -1,7 +1,9 @@
# CSC and Unix cheatsheet
Adapted from [CSC Quick Reference](https://docs.csc.fi/img/csc-quick-reference/csc-quick-reference.pdf)

Note that this is simplified for beginners usage, once you get more experienced, you'll notice that there is more (and better) options for everything, and that not everything written here is "the whole truth".
Note that this is simplified for beginners' usage. Once you get more
experienced, you'll notice that there are more (and better) options for
everything, and that not everything written here is "the whole truth".

## Service names

15 changes: 11 additions & 4 deletions materials/connecting.md
@@ -26,7 +26,10 @@

## Connecting to the supercomputer via SSH

During the course we will access the supercomputer via the webinterface in order to not overwhelm you with setups before the course. However, this way may not always be the most convenient. You can also connect to the supercomputer via SSH.
During the course we will access the supercomputer via the web interface in
order to not overwhelm you with setups before the course. However, this way
may not always be the most convenient. You can also connect to the
supercomputer via SSH.
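
For example, a plain SSH login from a local terminal could look like this (replace `cscusername` with your own CSC username; `puhti.csc.fi` is the Puhti login address):

```bash
# Log in to Puhti with your CSC username and password
ssh cscusername@puhti.csc.fi

# Optional: create an SSH key pair locally and copy the public key to Puhti,
# so that later logins no longer ask for the password
ssh-keygen -t ed25519
ssh-copy-id -i ~/.ssh/id_ed25519.pub cscusername@puhti.csc.fi
```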

:::{admonition} Connecting with SSH clients
:class: seealso, dropdown
@@ -38,15 +38,19 @@ During the course we will access the supercomputer via the webinterface in order
- In Windows:
- `Command Prompt` or `Powershell` are always available and can be used for basic connections.
- Special tools like [PuTTY](https://www.putty.org/) or [MobaXterm](https://mobaxterm.mobatek.net/) provide more options, including the possibility to save settings, but need installation.
- To avoid typing your password every time again and to make your connection more secure, you can [set up SSH-keys](https://docs.csc.fi/computing/connecting/#setting-up-ssh-keys).
- To avoid typing your password every time again and to make your connection more secure, you can [set up SSH-keys](https://docs.csc.fi/computing/connecting/ssh-keys/).
- [CSC Docs: Connecting to CSC supercomputers](https://docs.csc.fi/computing/connecting/)
- [LUMI Docs: Get started](https://docs.lumi-supercomputer.eu/firststeps/).

:::

## Developing scripts remotely

Instead of developing code on your local machine and moving it as files to the supercomputer for testing, you can also consider to use a local editor and push edited files directly into the supercomputer.
This works for example with **Visual Studio Code** or **Notepad++**. Note that [Visual Studio Code](https://docs.csc.fi/computing/webinterface/vscode/) is also available through the Puhti web interface.
Instead of developing code on your local machine and moving it as files to the
supercomputer for testing, you can also consider using a local editor and
push edited files directly to the supercomputer. This works for example
with **Visual Studio Code** or **Notepad++**. Note that [Visual Studio
Code](https://docs.csc.fi/computing/webinterface/vscode/) is also available
through the Puhti web interface.

- [CSC Docs: Developing scripts remotely](https://docs.csc.fi/support/tutorials/remote-dev/)
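
One way to make remote editing smoother is to add a host entry to `~/.ssh/config`, which editors that connect over SSH (such as the VS Code Remote-SSH extension) can reuse. This is only a sketch, assuming OpenSSH on the local machine; the alias `puhti` and the username are made-up examples:

```bash
# Append a host alias for Puhti to the local SSH configuration
cat >> ~/.ssh/config <<'EOF'
Host puhti
    HostName puhti.csc.fi
    User cscusername
EOF

# After this, a plain "ssh puhti" works, and the same alias can be
# selected in editors that open files over SSH
ssh puhti
```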
72 changes: 52 additions & 20 deletions materials/course_plan.md
@@ -42,30 +42,62 @@
:::{admonition} Painter analogy
:class: tip

Suppose that we want to paint the four walls in a room. This is our problem. We can divide our problem in 4 different tasks: paint each of the walls. In principle, our 4 tasks are independent from each other in the sense that we don’t need to finish one to start another. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks.
Concurrent vs. parallel execution

If there is only one painter, they could work for a while in one wall, then start painting another one, then work for a little bit in the third one, and so on. The tasks are being executed concurrently but not in parallel. Only one task is being performed at a time. If we have 2 or more painters for the job, then the tasks can be performed in parallel.

In our analogy, the painters represent CPU cores in your computer. The number of CPU cores available determines the maximum number of tasks that can be performed in parallel. The number of concurrent tasks that can be started at the same time, however, is unlimited.
Synchronous vs. asynchronous execution

Now imagine that all workers have to obtain their paint form a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. However, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser — one should wait while the other is being serviced.

In our analogy, the paint dispenser represents access to the memory in your computer. Depending on how a program is written, access to data in memory can be synchronous or asynchronous.
Distributed vs. shared memory

Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete its task totally on their own. They don’t even have to be in the same room, they could be painting walls of different rooms in the house, on different houses in the city, and different cities in the country. In many cases, however, we need a communication system in place. Suppose that worker A, needs a colour that is only available in the dispenser of worker B — worker A should request the paint to worker B, and worker B should respond by sending the required colour.

Think of the memory distributed on each node/computer of a cluster as the different dispensers for your workers. A fine-grained parallel program needs lots of communication/synchronisation between tasks, in contrast with a course-grained one that barely communicates at all. An embarrassingly/massively parallel problem is one where all tasks can be executed completely independent from each other (no communications required).
Suppose that we want to paint the four walls in a room. This is our problem.
We can divide our problem in 4 different tasks: paint each of the walls. In
principle, our 4 tasks are independent from each other in the sense that we
don’t need to finish one to start another. However, this does not mean that
the tasks can be executed simultaneously or in parallel. It all depends on the
amount of resources that we have for the tasks.
Concurrent vs. parallel execution

If there is only one painter, they could work for a while on one wall, then
start painting another one, then work for a little bit on the third one, and
so on. The tasks are being executed concurrently but not in parallel. Only one
task is being performed at a time. If we have 2 or more painters for the job,
then the tasks can be performed in parallel.

In our analogy, the painters represent CPU cores in your computer. The number
of CPU cores available determines the maximum number of tasks that can be
performed in parallel. The number of concurrent tasks that can be started at
the same time, however, is unlimited.
Synchronous vs. asynchronous execution

Now imagine that all workers have to obtain their paint from a central
dispenser located in the middle of the room. If each worker is using a
different colour, then they can work asynchronously. However, if they use the
same colour, and two of them run out of paint at the same time, then they have
to synchronise to use the dispenser — one should wait while the other is being
serviced.

In our analogy, the paint dispenser represents access to the memory in your
computer. Depending on how a program is written, access to data in memory can
be synchronous or asynchronous.
Distributed vs. shared memory

Finally, imagine that we have 4 paint dispensers, one for each worker. In this
scenario, each worker can complete their task totally on their own. They don’t
even have to be in the same room, they could be painting walls of different
rooms in the house, in different houses in the city, and different cities in
the country. In many cases, however, we need a communication system in place.
Suppose that worker A needs a colour that is only available in the dispenser
of worker B — worker A should request the paint from worker B, and worker B
should respond by sending the required colour.

Think of the memory distributed on each node/computer of a cluster as the
different dispensers for your workers. A fine-grained parallel program needs
lots of communication/synchronisation between tasks, in contrast with a
coarse-grained one that barely communicates at all. An
embarrassingly/massively parallel problem is one where all tasks can be
executed completely independent from each other (no communication required).
Processes vs. threads

Our example painters have two arms, and could potentially paint with both arms at the same time. Technically, the work being done by each arm is the work of a single painter.
Our example painters have two arms, and could potentially paint with both arms
at the same time. Technically, the work being done by each arm is the work of
a single painter.

In this example, each painter would be a process (an individual instance of a program). The painters’ arms represent a “thread” of a program. Threads are separate points of execution within a single program, and can be executed either synchronously or asynchronously.
In this example, each painter would be a process (an individual instance of a
program). The painters’ arms represent a “thread” of a program. Threads are
separate points of execution within a single program, and can be executed
either synchronously or asynchronously.

From [HPC Carpentry](http://www.hpc-carpentry.org/hpc-python/06-parallel).
:::
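
A toy shell illustration of the serial-versus-parallel part of the analogy (just a sketch; each `sleep` stands in for painting one wall):

```bash
# Serial: one "painter" does the four walls one after another (about 4 time units)
for wall in 1 2 3 4; do
    sleep 1        # paint one wall
done

# Parallel: four "painters" work at the same time (about 1 time unit),
# provided there are enough CPU cores available
for wall in 1 2 3 4; do
    sleep 1 &      # paint one wall in the background
done
wait               # wait until all background tasks have finished
```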



2 changes: 1 addition & 1 deletion materials/csc_services.md
@@ -25,5 +25,5 @@ See also [CSC service catalog](https://research.csc.fi/en/service-catalog)
:::{admonition} Sensitive data
:class: important

Sensitive data should be saved and processed only in services for sensitive data: [SD services](https://research.csc.fi/sensitive-data-services-for-research), [ePouta](https://research.csc.fi/-/epouta). Encrypted files can be stored also to [Allas](https://research.csc.fi/-/allas). Supercomputers and cPouta should not be used for sensitive data.
Sensitive data should be saved and processed only in services for sensitive data: [SD services](https://research.csc.fi/sensitive-data-services-for-research), [ePouta](https://research.csc.fi/-/epouta). Encrypted files can be stored also in [Allas](https://research.csc.fi/-/allas). Supercomputers and cPouta should not be used for sensitive data.
:::
6 changes: 3 additions & 3 deletions materials/data_tips.md
@@ -1,16 +1,16 @@
# Best practice tips for data

- Take **backups** of important files. Data on Puhti disks is not backed up.
- Allas is best CSC option for back-up.
- Github or similar for code.
- Allas is the best option for backups at CSC.
- GitHub or similar for code.
- Supercomputer disks do not work well with **too many small files**
- Plan your analysis in a way that avoids creating a huge number of files.
- Keep the small files in one zip file and unzip it only on local fast disks during the analysis (see the sketch after this list).
- Don't create a lot of files in one folder
- [CSC Docs: Best practice performance tips for using Lustre](https://docs.csc.fi/computing/lustre/#best-practices)
- Keep data that is needed longer also in Allas.
- **Databases**:
- Only file databases (SQLite, GeoPackage) can be kept in supercomputer disks.
- Only file databases (SQLite, GeoPackage) can be kept on supercomputer disks.
- For PostgreSQL and PostGIS use [CSC Pukki Database-as-service](https://docs.csc.fi/cloud/dbaas/).
- For any other database, set up a virtual machine in cPouta.
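
A rough sketch of the zip-file pattern in a batch job, assuming the Puhti convention of requesting node-local NVMe disk with `--gres=nvme` and the `$LOCAL_SCRATCH` variable (paths and the analysis command are placeholders):

```bash
# In the resource request part of the batch job script, ask for fast local disk:
#SBATCH --gres=nvme:10                      # 10 GB of node-local scratch

# In the command part, unpack the many small files onto the local disk,
# work there, and copy only the packed results back to the shared file system:
unzip /scratch/project_200xxxx/input_files.zip -d "$LOCAL_SCRATCH"
my_analysis "$LOCAL_SCRATCH"/input_files     # placeholder for the actual tool
zip -r /scratch/project_200xxxx/results.zip "$LOCAL_SCRATCH"/results
```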

8 changes: 4 additions & 4 deletions materials/exercise_allas.md
@@ -13,7 +13,7 @@
Learn how to:
* Configure connection to Allas and get S3 credentials
* Sync local folder to Allas (manual back-up)
* See what data is Allas
* See what data is in Allas
* Use s3cmd.
* [CSC Docs: s3cmd](https://docs.csc.fi/data/Allas/using_allas/s3_client/)
:::
@@ -28,8 +28,8 @@ Learn how to:

:::{admonition} Change the default project and username

* `project_200xxxx` is example project name, replace with your own CSC project name.
* `cscusername` is example username, replace with your username.
* `project_200xxxx` is an example project name, replace with your own CSC project name.
* `cscusername` is an example username, replace with your username.
:::

* Open [Puhti web interface](https://puhti.csc.fi) and log in
@@ -40,7 +40,7 @@ Learn how to:
```bash
module load allas
allas-conf --mode s3cmd
# It asks to select the project, select the project by number.
# It asks to select the project, select the project with the corresponding number.
# The configuration takes a moment, please wait.
```
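
Once the configuration is in place, syncing a local folder and checking what is in Allas could look roughly like this (the bucket and folder names are only examples):

```bash
# Create a bucket for the backup (bucket names must be unique within Allas)
s3cmd mb s3://project_200xxxx-backup

# Sync a local folder into the bucket (manual backup)
s3cmd sync my_data/ s3://project_200xxxx-backup/my_data/

# See what data is in Allas
s3cmd ls                                   # list your buckets
s3cmd ls s3://project_200xxxx-backup/      # list objects in one bucket
```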

18 changes: 9 additions & 9 deletions materials/exercise_basics.md
@@ -47,7 +47,7 @@ On the **login node**: Start an interactive job with `srun`, e.g.:
srun --time=00:10:00 --pty --account=project_200xxxx --partition=interactive bash ## replace xxxx with your project number; for the course you can also add --reservation=geocomputing_thu (not available at other times) and change the partition to small
```

**or** on Puhti you can also use `sinteractive` wrapper to start an interactive session from the **login node**, which simplifies the call and asks you for the resources step by step:
**or** on Puhti you can also use the `sinteractive` wrapper to start an interactive session from the **login node**, which simplifies the call and asks you for the resources step by step:

```bash
sinteractive -i
@@ -91,7 +91,7 @@ Try out some other command line tool, or maybe even start a `python` or `R` sess

-> This way you can work interactively for an extended period, using e.g. lots of memory without creating load on the login nodes.

Note that above we only asked for 10 minutes of time. Once that is up, you will be automatically logged out from the compute node.
Note that above we only asked for 10 minutes of time. Once that is up, you will be automatically logged out of the compute node.

Running `exit` on the login node will log you out from Puhti.
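
For example, inside the interactive session you could load a module and try the tools out without burdening the login node (a sketch; `geoconda` is used here simply because it appears later in these exercises):

```bash
# On the compute node, inside the interactive session
module load geoconda          # makes Python, GDAL and friends available
gdalinfo --version            # quick check that the command-line tools work
python3 -c "print('hello from a compute node')"
exit                          # leave the interactive session
```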

@@ -141,7 +141,7 @@ nano my_serial.bash
#SBATCH --account=project_200xxxx # Choose the billing project. Has to be defined!
#SBATCH --time=00:02:00 # Maximum duration of the job. Upper limit depends on the partition.
#SBATCH --partition=test # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set 1!
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set to 1!

echo -n "We are running on"
hostname # Run hostname-command, that will print the name of the Puhti compute node that has been allocated for this particular job
@@ -162,22 +162,22 @@ sbatch my_serial.bash
squeue --me
```

5. Once the job is done, check how much of the resources have been used with `seff jobid` (replace jobid with the number that was displayed after you ran sbatch command).
5. Once the job is completed, check how much of the resources have been used with `seff jobid` (replace jobid with the number that was displayed after you ran the `sbatch` command).

:::{admonition} Additional exercises
:class: tip

1. Where can you find the hostname print?
1. Where can you find the output of the `hostname` command?
2. How could you add a name to the job for easier identification?
3. What happens if you run the same script from above, but request only one minute and sleep for 2 minutes?
4. Can you run the gdalinfo command from the interactive job above via a non interactive job? What do you need to change in the sbatch job script?
4. Can you run the `gdalinfo` command from the interactive job above via a non-interactive job? What do you need to change in the sbatch job script?


:::{admonition} Solution
:class: dropdown

1. `slurm-jobid.out` in the directory from where you submitted the batch job. You can also change that location by specifying it in your batch job script with `#SBATCH --output=/your/path/slurm-%j.out`.
2. Add `#SBATCH --job-name=myname` to the resource request in the top of your sbatch script, to rename the job to myname.
2. Add `#SBATCH --job-name=myname` to the resource request at the top of your sbatch script to rename the job to "myname".
3. After the job has finished, check the log file with `cat slurm-<jobid>.out`. You should see an error at the end: `slurmstepd: error: *** JOB xxx ON xxx CANCELLED AT xDATE-TIMEx DUE TO TIME LIMIT ***`. This means that our job was killed for exceeding the amount of resources requested. Although this appears harsh, it is actually a feature. Strict adherence to resource requests allows the scheduler to find the best possible place for your jobs. It also ensures fair sharing of the computing resources.
4. Since `gdalinfo` is quite a fast command to run, you will only need to change the script part of your sbatch script; the resource request can stay the same. First we need to make `gdal` available within the job with `module load geoconda`, then we can run the `gdalinfo` command. After the job is done, you can find the information again in the `slurm-jobid.out` file.

@@ -186,7 +186,7 @@ squeue --me
#SBATCH --account=<project> # Choose the billing project. Has to be defined!
#SBATCH --time=00:02:00 # Maximum duration of the job. Upper limit depends on the partition.
#SBATCH --partition=test # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set 1!
#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set to 1!

module load geoconda

@@ -201,6 +201,6 @@ gdalinfo /appl/data/geo/luke/forest_wind_damage_sensitivity/2017/windmap2017_int

* A batch job script combines resource estimates and computation steps
* Resource request lines start with `#SBATCH`
* You can find the jobs output, errors and prints in `slurm-jobid.out`
* You can find the job's output and errors in `slurm-jobid.out`

:::
2 changes: 1 addition & 1 deletion materials/exercise_gdal.md
@@ -10,7 +10,7 @@
:::{admonition} Goals
:class: note

Learn how to use commandline tools:
Learn how to use command-line tools:
* Interactively
* With several files in serial mode
* With several files in parallel (see the sketch below)
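
As a rough preview of the serial-versus-parallel idea with a command-line tool (a sketch only; the file pattern and the number of parallel tasks are placeholders):

```bash
module load geoconda                  # provides the gdalinfo command

# Serial: inspect the files one at a time
for f in *.tif; do
    gdalinfo "$f"
done

# Parallel: inspect up to 4 files at the same time
ls *.tif | xargs -n 1 -P 4 gdalinfo
```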