From 542be826a0af3da37f23245c63cf052ffc57a33e Mon Sep 17 00:00:00 2001 From: Mitja Sainio <91546703+msainio@users.noreply.github.com> Date: Tue, 1 Oct 2024 12:24:29 +0300 Subject: [PATCH] fix typos, rephrase some info and instructions (#74) --- materials/allas.md | 11 +++-- materials/cheatsheet.md | 4 +- materials/connecting.md | 15 +++++-- materials/course_plan.md | 72 +++++++++++++++++++++--------- materials/csc_services.md | 2 +- materials/data_tips.md | 6 +-- materials/exercise_allas.md | 8 ++-- materials/exercise_basics.md | 18 ++++---- materials/exercise_gdal.md | 2 +- materials/exercise_python.md | 8 ++-- materials/exercise_r.md | 10 ++--- materials/exercise_stac.md | 10 ++--- materials/exercise_webinterface.md | 12 ++--- materials/installations.md | 33 ++++++++------ materials/moving_data.md | 12 ++--- materials/orga.md | 8 ++-- materials/parallel.md | 30 +++++++++---- materials/parallel_python.md | 42 ++++++++++------- materials/parallel_r.md | 19 ++++---- materials/prerequisites.md | 8 ++-- materials/software.md | 33 +++++++------- materials/spatial_data_at_csc.md | 4 +- materials/support.md | 4 +- materials/terminal.md | 16 +++---- materials/where_to_go.md | 2 +- 25 files changed, 232 insertions(+), 157 deletions(-) diff --git a/materials/allas.md b/materials/allas.md index ec6969fb..3b1ed0fa 100644 --- a/materials/allas.md +++ b/materials/allas.md @@ -1,6 +1,6 @@ # Allas – object storage -What it is? +What is it? * Allas is a **storage service**, technically object storage * **For CSC project lifetime: 1-5 years** @@ -13,7 +13,7 @@ What it is? * LUMI-O is very similar to Allas * [LUMI Docs: LUMI-O](https://docs.lumi-supercomputer.eu/storage/lumio/) -What it is NOT? +What is it NOT? - A file system (even though many tools try to fool you to think so). It is just a place to store static data objects. - A data management environment. Tools for etc. search, metadata, version control and access management are minimal. @@ -30,7 +30,7 @@ What it is NOT? - For data organization and access administration - Data is stored as **objects** within a bucket - Practically: object = file - - In reality, there is no hierarcical directory structure within a bucket, although it sometimes looks like that. + - In reality, there is no hierarchical directory structure within a bucket, although it sometimes looks like that. - Object name can be `/data/myfile.zip` and some tools may display it as `data` folder with `myfile.zip` file. ### Things to consider @@ -44,9 +44,8 @@ What it is NOT? - S3 and SWIFT. - **For new projects S3 is recommended** - - SWIFT might be soon depricated. - - Avoid cross-using SWIFT and S3 based objects! - + - SWIFT might be soon deprecated. + - Avoid cross-using SWIFT- and S3-based objects! ## Tools for Allas diff --git a/materials/cheatsheet.md b/materials/cheatsheet.md index 8729a138..c1a50539 100644 --- a/materials/cheatsheet.md +++ b/materials/cheatsheet.md @@ -1,7 +1,9 @@ # CSC and Unix cheatsheet Adapted from [CSC Quick Reference](https://docs.csc.fi/img/csc-quick-reference/csc-quick-reference.pdf) -Note that this is simplified for beginners usage, once you get more experienced, you'll notice that there is more (and better) options for everything, and that not everything written here is "the whole truth". +Note that this is simplified for beginners' usage. Once you get more +experienced, you'll notice that there are more (and better) options for +everything, and that not everything written here is "the whole truth". 
## Service names
diff --git a/materials/connecting.md b/materials/connecting.md
index fec1831f..a3ddd494 100644
--- a/materials/connecting.md
+++ b/materials/connecting.md
@@ -26,7 +26,10 @@

## Connecting to the supercomputer via SSH

-During the course we will access the supercomputer via the webinterface in order to not overwhelm you with setups before the course. However, this way may not always be the most convenient. You can also connect to the supercomputer via SSH.
+During the course we will access the supercomputer via the web interface in
+order to not overwhelm you with setups before the course. However, this way
+may not always be the most convenient. You can also connect to the
+supercomputer via SSH.

:::{admonition} Connecting with SSH clients
:class: seealso, dropdown
@@ -38,7 +41,7 @@
    - In Windows:
        - `Command Prompt` or `Powershell` are always avaialbe and can be used for basic connections.
        - Special tools like [PuTTY](https://www.putty.org/) or [MobaXterm](https://mobaxterm.mobatek.net/) provide more options, inc possibility to save settings, but need installation.
-- To avoid typing your password every time again and to make your connection more secure, you can [set up SSH-keys](https://docs.csc.fi/computing/connecting/#setting-up-ssh-keys).
+- To avoid typing your password every time and to make your connection more secure, you can [set up SSH-keys](https://docs.csc.fi/computing/connecting/ssh-keys/).
- [CSC Docs: Connecting to CSC supercomputers](https://docs.csc.fi/computing/connecting/)
- [LUMI Docs: Get started](https://docs.lumi-supercomputer.eu/firststeps/).

@@ -46,7 +49,11 @@

## Developing scripts remotely

-Instead of developing code on your local machine and moving it as files to the supercomputer for testing, you can also consider to use a local editor and push edited files directly into the supercomputer.
-This works for example with **Visual Studio Code** or **Notepad++**. Note that [Visual Studio Code](https://docs.csc.fi/computing/webinterface/vscode/) is also available through the Puhti web interface.
+Instead of developing code on your local machine and moving it as files to the
+supercomputer for testing, you can also consider using a local editor and
+push edited files directly to the supercomputer. This works for example
+with **Visual Studio Code** or **Notepad++**. Note that [Visual Studio
+Code](https://docs.csc.fi/computing/webinterface/vscode/) is also available
+through the Puhti web interface.

- [CSC Docs: Developing scripts remotely](https://docs.csc.fi/support/tutorials/remote-dev/)

diff --git a/materials/course_plan.md b/materials/course_plan.md
index b92baf85..087c6693 100644
--- a/materials/course_plan.md
+++ b/materials/course_plan.md
@@ -42,30 +42,62 @@

:::{admonition} Painter analogy
:class: tip

-Suppose that we want to paint the four walls in a room. This is our problem. We can divide our problem in 4 different tasks: paint each of the walls. In principle, our 4 tasks are independent from each other in the sense that we don’t need to finish one to start another. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks.
-Concurrent vs.
parallel execution - -If there is only one painter, they could work for a while in one wall, then start painting another one, then work for a little bit in the third one, and so on. The tasks are being executed concurrently but not in parallel. Only one task is being performed at a time. If we have 2 or more painters for the job, then the tasks can be performed in parallel. - -In our analogy, the painters represent CPU cores in your computer. The number of CPU cores available determines the maximum number of tasks that can be performed in parallel. The number of concurrent tasks that can be started at the same time, however, is unlimited. -Synchronous vs. asynchronous execution - -Now imagine that all workers have to obtain their paint form a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. However, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser — one should wait while the other is being serviced. - -In our analogy, the paint dispenser represents access to the memory in your computer. Depending on how a program is written, access to data in memory can be synchronous or asynchronous. -Distributed vs. shared memory - -Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete its task totally on their own. They don’t even have to be in the same room, they could be painting walls of different rooms in the house, on different houses in the city, and different cities in the country. In many cases, however, we need a communication system in place. Suppose that worker A, needs a colour that is only available in the dispenser of worker B — worker A should request the paint to worker B, and worker B should respond by sending the required colour. - -Think of the memory distributed on each node/computer of a cluster as the different dispensers for your workers. A fine-grained parallel program needs lots of communication/synchronisation between tasks, in contrast with a course-grained one that barely communicates at all. An embarrassingly/massively parallel problem is one where all tasks can be executed completely independent from each other (no communications required). +Suppose that we want to paint the four walls in a room. This is our problem. +We can divide our problem in 4 different tasks: paint each of the walls. In +principle, our 4 tasks are independent from each other in the sense that we +don’t need to finish one to start another. However, this does not mean that +the tasks can be executed simultaneously or in parallel. It all depends on the +amount of resources that we have for the tasks. Concurrent vs. parallel +execution + +If there is only one painter, they could work for a while on one wall, then +start painting another one, then work for a little bit on the third one, and +so on. The tasks are being executed concurrently but not in parallel. Only one +task is being performed at a time. If we have 2 or more painters for the job, +then the tasks can be performed in parallel. + +In our analogy, the painters represent CPU cores in your computer. The number +of CPU cores available determines the maximum number of tasks that can be +performed in parallel. The number of concurrent tasks that can be started at +the same time, however, is unlimited. Synchronous vs. 
asynchronous execution + +Now imagine that all workers have to obtain their paint from a central +dispenser located in the middle of the room. If each worker is using a +different colour, then they can work asynchronously. However, if they use the +same colour, and two of them run out of paint at the same time, then they have +to synchronise to use the dispenser — one should wait while the other is being +serviced. + +In our analogy, the paint dispenser represents access to the memory in your +computer. Depending on how a program is written, access to data in memory can +be synchronous or asynchronous. Distributed vs. shared memory + +Finally, imagine that we have 4 paint dispensers, one for each worker. In this +scenario, each worker can complete their task totally on their own. They don’t +even have to be in the same room, they could be painting walls of different +rooms in the house, in different houses in the city, and different cities in +the country. In many cases, however, we need a communication system in place. +Suppose that worker A needs a colour that is only available in the dispenser +of worker B — worker A should request the paint from worker B, and worker B +should respond by sending the required colour. + +Think of the memory distributed on each node/computer of a cluster as the +different dispensers for your workers. A fine-grained parallel program needs +lots of communication/synchronisation between tasks, in contrast with a +coarse-grained one that barely communicates at all. An +embarrassingly/massively parallel problem is one where all tasks can be +executed completely independent from each other (no communication required). Processes vs. threads -Our example painters have two arms, and could potentially paint with both arms at the same time. Technically, the work being done by each arm is the work of a single painter. +Our example painters have two arms, and could potentially paint with both arms +at the same time. Technically, the work being done by each arm is the work of +a single painter. -In this example, each painter would be a process (an individual instance of a program). The painters’ arms represent a “thread” of a program. Threads are separate points of execution within a single program, and can be executed either synchronously or asynchronously. +In this example, each painter would be a process (an individual instance of a +program). The painters’ arms represent a “thread” of a program. Threads are +separate points of execution within a single program, and can be executed +either synchronously or asynchronously. From [HPC Carpentry](http://www.hpc-carpentry.org/hpc-python/06-parallel). ::: - - diff --git a/materials/csc_services.md b/materials/csc_services.md index fe0ee0fc..cdfe0c6d 100644 --- a/materials/csc_services.md +++ b/materials/csc_services.md @@ -25,5 +25,5 @@ See also [CSC service catalog](https://research.csc.fi/en/service-catalog) :::{admonition} Sensitive data :class: important -Sensitive data should be saved and processed only in services for sensitive data: [SD services](https://research.csc.fi/sensitive-data-services-for-research), [ePouta](https://research.csc.fi/-/epouta). Encrypted files can be stored also to [Allas](https://research.csc.fi/-/allas). Supercomputers and cPouta should not be used for sensitive data. +Sensitive data should be saved and processed only in services for sensitive data: [SD services](https://research.csc.fi/sensitive-data-services-for-research), [ePouta](https://research.csc.fi/-/epouta). 
Encrypted files can be stored also in [Allas](https://research.csc.fi/-/allas). Supercomputers and cPouta should not be used for sensitive data. ::: diff --git a/materials/data_tips.md b/materials/data_tips.md index 7d7526d3..ed53dd6b 100644 --- a/materials/data_tips.md +++ b/materials/data_tips.md @@ -1,8 +1,8 @@ # Best practice tips for data - Take **backups** of important files. Data on Puhti disks is not backed up. - - Allas is best CSC option for back-up. - - Github or similar for code. + - Allas is the best option for backups at CSC. + - GitHub or similar for code. - Supercomputer disks do not work well with **too many small files** - Plan your analysis in a way that too many files are not needed. - Keep the small files in one zip-file, unzip it only on local fast disks during the analysis. @@ -10,7 +10,7 @@ - [CSC Docs: Best practice performance tips for using Lustre](https://docs.csc.fi/computing/lustre/#best-practices) - Keep data that is needed longer also in Allas. - **Databases**: - - Only file databases (SQLite, GeoPackage) can be kept in supercomputer disks. + - Only file databases (SQLite, GeoPackage) can be kept on supercomputer disks. - For PostgreSQL and PostGIS use [CSC Pukki Database-as-service](https://docs.csc.fi/cloud/dbaas/). - For any other database set up virtual machine in cPouta. diff --git a/materials/exercise_allas.md b/materials/exercise_allas.md index aa5dd02d..eee02692 100644 --- a/materials/exercise_allas.md +++ b/materials/exercise_allas.md @@ -13,7 +13,7 @@ Learn how to: * Configure connection to Allas and get S3 credentials * Sync local folder to Allas (manual back-up) -* See what data is Allas +* See what data is in Allas * Use s3cmd. * [CSC Docs: s3cmd](https://docs.csc.fi/data/Allas/using_allas/s3_client/) ::: @@ -28,8 +28,8 @@ Learn how to: :::{admonition} Change the default project and username -* `project_200xxxx` is example project name, replace with your own CSC project name. -* `cscusername` is example username, replace with your username. +* `project_200xxxx` is an example project name, replace with your own CSC project name. +* `cscusername` is an example username, replace with your username. ::: * Open [Puhti web interface](https://puhti.csc.fi) and log in @@ -40,7 +40,7 @@ Learn how to: ```bash module load allas allas-conf --mode s3cmd -# It asks to select the project, select the project by number. +# It asks to select the project, select the project with the corresponding number. # The configuration takes a moment, please wait. 
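# After the configuration has finished, the connection can be tested with
# a few s3cmd commands. A minimal sketch - the bucket and folder names below
# are examples only, replace them with your own:
#   s3cmd mb s3://project_200xxxx-backup              # make a new bucket
#   s3cmd sync my_folder s3://project_200xxxx-backup  # sync a local folder to the bucket
#   s3cmd ls s3://project_200xxxx-backup              # list objects in the bucket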
``` diff --git a/materials/exercise_basics.md b/materials/exercise_basics.md index 7a80cb1f..330171b1 100644 --- a/materials/exercise_basics.md +++ b/materials/exercise_basics.md @@ -47,7 +47,7 @@ On the **login node**: Start an interactive job with `srun`, e.g.: srun --time=00:10:00 --pty --account=project_200xxxx --partition=interactive bash ##replace xxxx with your project number; you can also add --reservation=geocomputing_thu here for the course (not available at other times), change partition to small then ``` -**or** on Puhti you can also use `sinteractive` wrapper to start an interactive session from the **login node**, which simplifies the call and asks you for the resources step by step: +**or** on Puhti you can also use the `sinteractive` wrapper to start an interactive session from the **login node**, which simplifies the call and asks you for the resources step by step: ```bash sinteractive -i @@ -91,7 +91,7 @@ Try out some other command line tool, or maybe even start a `python` or `R` sess -> This way you can work interactively for an extended period, using e.g. lots of memory without creating load on the login nodes. -Note that above we only asked for 10 minutes of time. Once that is up, you will be automatically logged out from the compute node. +Note that above we only asked for 10 minutes of time. Once that is up, you will be automatically logged out of the compute node. Running `exit` on the login node will log you out from Puhti. @@ -141,7 +141,7 @@ nano my_serial.bash #SBATCH --account=project_200xxxx # Choose the billing project. Has to be defined! #SBATCH --time=00:02:00 # Maximum duration of the job. Upper limit depends on the partition. #SBATCH --partition=test # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun -#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set 1! +#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set to 1! echo -n "We are running on" hostname # Run hostname-command, that will print the name of the Puhti compute node that has been allocated for this particular job @@ -162,22 +162,22 @@ sbatch my_serial.bash squeue --me ``` -5. Once the job is done, check how much of the resources have been used with `seff jobid` (replace jobid with the number that was displayed after you ran sbatch command). +5. Once the job is completed, check how much of the resources have been used with `seff jobid` (replace jobid with the number that was displayed after you ran the `sbatch` command). :::{admonition} Additional exercises :class: tip -1. Where can you find the hostname print? +1. Where can you find the output of the `hostname` command? 2. How could you add a name to the job for easier identification? 3. What happens if you run the same script from above, but we request only one minute, and sleep for 2 minutes? -4. Can you run the gdalinfo command from the interactive job above via a non interactive job? What do you need to change in the sbatch job script? +4. Can you run the `gdalinfo` command from the interactive job above via a non-interactive job? What do you need to change in the sbatch job script? :::{admonition} Solution :class: dropdown 1. `slurm-jobid.out` in the directory from where you submitted the batch job. You can also change that location by specifying it in your batch job script with `#SBATCH --output=/your/path/slurm-%j.out`. -2. 
Add `#SBATCH --job-name=myname` to the resource request in the top of your sbatch script, to rename the job to myname. +2. Add `#SBATCH --job-name=myname` to the resource request at the top of your sbatch script to rename the job to "myname". 3. After the job finished, check the log file with `cat slurm-.out`. You should see an an error in the end `slurmstepd: error: *** JOB xxx ON xxx CANCELLED AT xDATE-TIMEx DUE TO TIME LIMIT ***`. This means that our job was killed for exceeding the amount of resources requested. Although this appears harsh, this is actually a feature. Strict adherence to resource requests allows the scheduler to find the best possible place for your jobs. It also ensures the fair share of use of the computing resources. 4. Since gdalinfo is quite a fast command to run, you will only need to change the script part of your sbatch script, the resources request can stay the same. First we will need to make `gdal` available within the job with `module load geoconda`, then we can run the `gdalinfo` command. After the job is done, you can find the information again in the `slurm-jobid.out` file. @@ -186,7 +186,7 @@ squeue --me #SBATCH --account= # Choose the billing project. Has to be defined! #SBATCH --time=00:02:00 # Maximum duration of the job. Upper limit depends on the partition. #SBATCH --partition=test # Job queues: test, interactive, small, large, longrun, hugemem, hugemem_longrun -#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set 1! +#SBATCH --ntasks=1 # Number of tasks. Upper limit depends on partition. For a serial job this should be set to 1! module load geoconda @@ -201,6 +201,6 @@ gdalinfo /appl/data/geo/luke/forest_wind_damage_sensitivity/2017/windmap2017_int * A batch job script combines resource estimates and computation steps * Resource request lines start with `#SBATCH` -* You can find the jobs output, errors and prints in `slurm-jobid.out` +* You can find the job's output and errors in `slurm-jobid.out` ::: diff --git a/materials/exercise_gdal.md b/materials/exercise_gdal.md index bb573a4c..18b3e361 100644 --- a/materials/exercise_gdal.md +++ b/materials/exercise_gdal.md @@ -10,7 +10,7 @@ :::{admonition} Goals :class: note -Learn how to use commandline tools: +Learn how to use command-line tools: * Interactively * With several files in serial mode * With several files in parallel diff --git a/materials/exercise_python.md b/materials/exercise_python.md index 3d4fd2ae..cd66c16d 100644 --- a/materials/exercise_python.md +++ b/materials/exercise_python.md @@ -10,7 +10,7 @@ :::{admonition} Goals :class: note -* Get to know geoconda Python environment on Puhti +* Get to know the geoconda Python environment on Puhti * Try out different ways of parallelizing Python code * Understand when to go for internal vs external parallelization @@ -19,7 +19,7 @@ :::{admonition} Prerequisites :class: important -* Access to Puhti webinterface +* Access to the Puhti web interface * Some experience with Python and GIS Python tools ::: @@ -29,12 +29,12 @@ Check out at least the sections about [serial jobs](https://github.com/csc-training/geocomputing/blob/master/python/puhti/README.md#serial-job) and [parallelizing within Python](https://github.com/csc-training/geocomputing/blob/master/python/puhti/README.md#internal-parallelization). -Additional, you can also check out some of the other Python examples in [CSC geocomputing repository](https://github.com/csc-training/geocomputing/blob/master/python). 
+Additionally, you can check out some of the other Python examples in the [CSC geocomputing repository](https://github.com/csc-training/geocomputing/blob/master/python).

:::{admonition} Key points
:class: important

* GNU parallel for embarassingly parallel tasks, without changing the Python code
-* `dask.delayed` or `multiprocessing` can be used to parallelize for loop
+* `dask.delayed` or `multiprocessing` can be used to parallelize a for-loop

:::
diff --git a/materials/exercise_r.md b/materials/exercise_r.md
index 8682035d..8d3b38bc 100644
--- a/materials/exercise_r.md
+++ b/materials/exercise_r.md
@@ -1,7 +1,7 @@
# Exercise: R

## R in supercomputers
-* `r-env` is the only R module in Puhti with ~1300 packages for all fields of science.
+* `r-env` is the only R module in Puhti. It has ~1300 packages for all fields of science.
* Mahti does not have R.
* LUMI has only [EasyBuild recepy for R](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/r/R/)
* [CSC Docs: r-env](https://docs.csc.fi/apps/r-env/)
@@ -17,8 +17,8 @@

:::{admonition} Goals
:class: note

-* Get to know `r-env` R environment on Puhti
-* Running R code interactively and as batch job
+* Get to know the `r-env` R environment on Puhti
+* Running R code interactively and as a batch job
* Try out different ways of parallelizing R code


@@ -40,12 +40,12 @@
* Interactive working
* Simple batch job
* Parallel job with `future` library
-* Additional, you can also check out some of the other R examples in [CSC geocomputing repository](https://github.com/csc-training/geocomputing/blob/master/R).
+* Additionally, you can check out some of the other R examples in the [CSC geocomputing repository](https://github.com/csc-training/geocomputing/blob/master/R).

:::{admonition} Key points
:class: important

* Puhti web interface enables working with RStudio interactively
-* `future` can be used to parallelization
+* `future` can be used for parallelization

:::
diff --git a/materials/exercise_stac.md b/materials/exercise_stac.md
index 26a14dd0..8842452c 100644
--- a/materials/exercise_stac.md
+++ b/materials/exercise_stac.md
@@ -27,8 +27,8 @@ with [access to Puhti](https://docs.csc.fi/accounts/how-to-add-service-access-fo

:::{admonition} Change the default project and username

-* `project_200xxxx` is example project name, replace with your own CSC project name.
-* `cscusername` is example username, replace with your username.
+* `project_200xxxx` is an example project name, replace with your own CSC project name.
+* `cscusername` is an example username, replace with your username.

:::

* [CSC Docs: Jupyter](https://docs.csc.fi/computing/webinterface/jupyter/)
@@ -50,7 +50,7 @@
* Open

## Preparations
-* Open new Termianl window
+* Open a new Terminal window
* Make a folder for the exercise materials and make it your working directory
* Change the project name and username.
```
git clone https://github.com/csc-training/geocomputing.git
```

## STAC Notebook

-1. In file exporer open: `students/cscusername/geocomputing/python/STAC
+1. In the file explorer, open `students/cscusername/geocomputing/python/STAC`
2. Open `STAC_CSC_example.ipynb` notebook
3. Follow the notebook, use `Shift+Enter` for running cells.
## End the session * Close the web tab - * In Active sessions view: `Cancel` + * Find the session in the Active sessions view and select `Cancel` :::{admonition} Key points :class: important diff --git a/materials/exercise_webinterface.md b/materials/exercise_webinterface.md index bf08389b..506324b7 100644 --- a/materials/exercise_webinterface.md +++ b/materials/exercise_webinterface.md @@ -29,8 +29,8 @@ :::{admonition} Change the default project and username -* `project_200xxxx` is example project name, replace with your own CSC project name. -* `cscusername` is example username, replace with your username. +* `project_200xxxx` is an example project name, replace with your own CSC project name. +* `cscusername` is an example username, replace with your username. ::: ### Info @@ -55,7 +55,7 @@ :::{admonition} Moving data -Web interface is for moving up to 10Gb data, if you have more data use other tools. More info in [moving data](moving_data.md) +The web interface can be used for moving up to 10GB of data. If you have more data, use other tools. More info on [moving data](moving_data.md). ::: @@ -94,13 +94,13 @@ Web interface is for moving up to 10Gb data, if you have more data use other too * End the session: * Close QGIS. * Close the web tab - * In Active sessions view: `Delete` + * Find the session in the Active sessions view and select `Cancel` * This only ends the Desktop session, any files written during the session would be available also afterwards. :::{admonition} QGIS in practice on supercomputer -* QGIS is designed for desktop use and it mostly uses only 1 core, so running it on supercomputer is rather slower than on desktop. QGIS is in Puhti and LUMI mainly for easy viewing of input and output data. -* With `qgis_processing` or PyQGIS scripts it is possible to paralellize your data analysis. In general other Python packages are faster, but if you have these scripts already available, they can be used. +* QGIS is designed for desktop use and it mostly uses only 1 core, so running it on a supercomputer is rather slower than on desktop. QGIS is in Puhti and LUMI mainly for easy viewing of input and output data. +* With `qgis_processing` or PyQGIS scripts it is possible to parallelize your data analysis. In general other Python packages are faster, but if you have these scripts already available, they can be used. ::: diff --git a/materials/installations.md b/materials/installations.md index 678d2c3e..69c8437b 100644 --- a/materials/installations.md +++ b/materials/installations.md @@ -2,8 +2,8 @@ ## Adding some packages to existing modules -* Generally easiest option. -* [CSC Docs: Installing **Python** packages to existing modules](https://docs.csc.fi/apps/python/#installing-python-packages-to-existing-modules) +* Generally the easiest option. +* [CSC Docs: Installing **Python** packages to existing modules](https://docs.csc.fi/support/tutorials/python-usage-guide/#installing-python-packages-to-existing-modules) * geoconda, tensorflow, pytorch, python-data etc. * The added package must be available via `pip`. * [CSC Docs: **R** package installations](https://docs.csc.fi/apps/r-env/#r-package-installations) @@ -11,7 +11,7 @@ ## Tykky -* The easiest way to create an own installation is with Tykky +* The easiest way to create a custom installation is with Tykky * Tykky has 3 options, new installation based on: * `conda` .yml file * `pip` requirement file @@ -38,40 +38,44 @@ * We will install `lastools` based on [pydo's lastools Docker image](https://hub.docker.com/r/pydo/lastools). 
* We use the `projappl` disk, which is the best place for software installations.
-Lastools is in Puhti already available, also as newer Linux-native installation.
-During the course we will use interactive job for doing the installation because of 50 persons doing it at the same time. Usually installations are done on login node.
+Lastools is already available on Puhti, also as a newer Linux-native installation.
+During the course we will use an interactive job for doing the installation,
+ because we will have 50 users doing it at the same time, which stresses the
+ shared file system. Usually installations are done on a login node.

:::{admonition} Change the default project and username

-* `project_200xxxx` is example project name, replace with your own CSC project name.
-* `cscusername` is example username, replace with your username.
+* `project_200xxxx` is an example project name, replace with your own CSC project name.
+* `cscusername` is an example username, replace with your username.

:::

* Open [Puhti web interface](https://puhti.csc.fi) and log in
* Open Login node shell
-* During the course only, open interactive job:
+* Open an interactive job using:

```
srun --reservation=geocomputing_fri --account=project_2008648 --mem=4000 --ntasks=1 --time=0:20:00 --gres=nvme:4 --pty bash -i
```

+:::{admonition} Note
+:class: note
+
+Normally we use `sinteractive` to start interactive jobs. We used the previous command in order to use our course's resource reservation.
+:::

Make Tykky tools available

```
module load tykky
```

-Create a new directory for the installation and make the folder **above** it to your working directory
+Create a new directory for the installation and make the folder **above** it your working directory

```
mkdir -p /projappl/project_200xxxx/students/cscusername/lastools
cd /projappl/project_200xxxx/students/cscusername
```

-Create the new instalaltion
+Create the new installation

```
wrap-container -w /opt/LAStools docker://pydo/lastools:latest --prefix lastools
```

-* `-w /opt/LAStools` - where are the tools located inside the container, that should be available
-* `docker//:pydo/lastools:latest` - the existing Docker iamge
+* `-w /opt/LAStools` - where the tools are located inside the container, it should be accessible
+* `docker://pydo/lastools:latest` - the existing Docker image
* `--prefix lastools` - location of the new installation

Add the location of your new installation to your PATH. Note that Tykky prints out the correct command for you.
@@ -86,6 +90,9 @@
lasinfo -i /appl/data/geo/mml/laserkeilaus/2008_latest/2018/W444/1/W4444G4.laz

:::{admonition} PATH setting

-PATH defines where system is looking for tools. Changing PATH like above is valid until the Puhti session is alive. PATH (or PYTHONPATH) has to be set each session again, so it is good to add it to your batch job file.
+PATH defines where the system looks for tools. Changes to PATH, like those
+above, are in effect while the Puhti session is alive. PATH (or PYTHONPATH)
+has to be set again in each session, so it is good to add it to your batch job
+file.

:::

diff --git a/materials/moving_data.md b/materials/moving_data.md
index e4f0f7ca..945e5794 100644
--- a/materials/moving_data.md
+++ b/materials/moving_data.md
@@ -40,9 +40,9 @@
scp -r /path/to/directory cscusername@puhti.csc.fi:/scratch/project_200xxxx/directory

#### rsync

-- Best for big data transfers: does not copy what is already there, can resume a copy process which disconnected.
+- Best for big data transfers: does not copy what is already there, can resume a copy process which has disconnected. - Can warn against accidental over-writes. -- Available on Linux, Mac and Windows Subsystem Linux (WSL). +- Available on Linux, Mac and Windows Subsystem for Linux (WSL). - Windows Powershell does not have `rsync`, MobaXterm has `rsync`, but it removes write permissions of copied files - [CSC Docs: `rsync`](https://docs.csc.fi/data/moving/rsync/) @@ -58,7 +58,7 @@ rsync --info=progress2 -a /path/to/directory cscusername@puhti.csc.fi:/scratch/p :::{admonition} Firewall limitations -Some organizations, for example research institutes with IT-services from Valtori, have stricter rules and need to use proxy for connecting to CSC. Ask your IT-service or other Puhti users in your organization for extra-guidelines. +Some organizations, for example research institutes with IT-services from Valtori, have stricter rules and need to use a proxy for connecting to CSC servers. In this case, ask your IT service or other Puhti users in your organization for additional guidelines. ::: @@ -66,12 +66,12 @@ Some organizations, for example research institutes with IT-services from Valtor ## External data services -> supercomputer -- When downloading from exernal services try to download directly to CSC, not via your local computer +- When downloading data from external services, try to download directly to CSC servers, not via your local computer - Check what APIs/tools the service supports: - Standard APIs: OGC APIs, [STAC](https://csc-training.github.io/geocomputing_course/materials/stac.html) - Custom service APIs - ftp, rsync - - wget/curl if HTTP-urls avaialable + - wget/curl if there is a URL for the data ### wget @@ -153,7 +153,7 @@ chmod 777 example ls -l example ``` -How might we give ourselves and our colleagues within the same project permission to do everything with a file, but allow no one else to do anything with it. +How might we give ourselves and our colleagues within the same project permission to do everything with a file, but allow no one else to do anything with it? From [HPC Carpentry](http://www.hpc-carpentry.org/hpc-shell) diff --git a/materials/orga.md b/materials/orga.md index 4e2fa7ed..cd6ff100 100644 --- a/materials/orga.md +++ b/materials/orga.md @@ -5,7 +5,7 @@ This course is organized by CSC - IT Center for Science and funded by Geoportti * Non-profit company producing IT services for research and higher education * Owned by ministry of education and culture (70%) and higher education institutions (30%) -* Headquaters in Keilaniemi, Espoo +* Headquarters in Keilaniemi, Espoo * Side offices and supercomputers in Kajaani * ~700 people @@ -13,7 +13,7 @@ This course is organized by CSC - IT Center for Science and funded by Geoportti ## Geoportti Research Infrastructure -Geoportti Research Infrastructure (RI) is a shared service for researchers, teachers and students using geospatial data and geocomputing tools. Geoportti RI helps the researchers in Finland to use, to refine, to preserve and to share their geospatial resources. +Geoportti Research Infrastructure (RI) is a shared service for researchers, teachers and students using geospatial data and geocomputing tools. Geoportti RI helps researchers in Finland use, refine, preserve and share their geospatial resources. * [GeoPortti web portal](https://www.geoportti.fi) * GeoPortti (oGIIR) projects has been very important for developing geocomputing usage of CSC supercomputers and cloud services. 
@@ -33,7 +33,7 @@

## Location Innovation Hub

-The Location Innovation Hub (LIH) is a centre of excellence in location information coordinated by the Finnish Geospatial Research Institute. Our services are produced in conjunction with a partner network. We help companies to grow their business with location information. We also serve the public sector.
+The Location Innovation Hub (LIH) is a centre of excellence in location information coordinated by the Finnish Geospatial Research Institute. Our services are produced in conjunction with a partner network. We help companies grow their business with location information. We also serve the public sector.

* [Location Innovation Hub](https://locationinnovationhub.eu)
* Consulting to companies and other organizations how to use GIS, trainings, test environments
@@ -44,6 +44,6 @@

## Citing

-If you used CSC computing resources and GIS tools or data for your research, please acknowledge CSC and Geoportti in your publications, it is important for project continuation and funding reports. As an example, you can write:
+If you used CSC computing resources and GIS tools or data for your research, please acknowledge CSC and Geoportti in your publications. It is important for project continuation and funding reports. As an example, you can write:

> "The authors wish to thank CSC - IT Center for Science, Finland (urn:nbn:fi:research-infras-2016072531) and the Open Geospatial Information Infrastructure for Research (Geoportti, urn:nbn:fi:research-infras-2016072513) for computational resources and support".
diff --git a/materials/parallel.md b/materials/parallel.md
index b2f0d831..22a7ed73 100644
--- a/materials/parallel.md
+++ b/materials/parallel.md
@@ -4,7 +4,7 @@

For fast computation, supercomputers utilize parallelism.
![](./images/parallel.png)

-## What to paralellize?
+## What to parallelize?

Spatial data analysis often allows splitting at least some parts of the analysis to independent parts, that could be run in parallel:

* Dividing data into parts:
@@ -33,15 +33,15 @@

3) Use external tools to run the scripts in parallel
4) Write your own parallel code

-From practical point of view in supercomptuers, it is also important to understand, if the the tool/script supports:
+When working on supercomputers, it is also important to understand whether the tool/script supports:

* **Multi-core** - it runs in parallel only inside one node of the supercomputer.
* **Multi-node** - it can distribute the work to several nodes of the supercomputer.

-For multi-core there is clearly more options. The number of cores in a single node has been recently been increasing, so also multi-core tools can be very useful.
+For multi-core there are clearly more options. The number of cores in a single node has recently been increasing, so multi-core tools can also be very useful.

## Tools with built-in parallel support

-Look from the tool's manual, if it has built-in support for using multiple CPUs/cores. For command line tools, look for `number_of_cores`, `cores`, `cpu)`, `jobs`, `threads` or similar. Unfortunatelly not many GIS-tools have such option.
+Check the tool's manual to see if it has built-in support for using multiple CPUs/cores. For command line tools, look for `number_of_cores`, `cores`, `cpus`, `jobs`, `threads` or similar.
Unfortunately, not many GIS-tools have such options. Some example geospatial tools with built-in parallel support: * GDAL, some commands e.g. `gdalwarp -multi -wo NUM_THREADS=val/ALL_CPUS ...` @@ -59,7 +59,7 @@ All of these tools are multi-core, but not multi-node. :::{admonition} Define number of cores explicitly :class: warning -The GIS-tools are not written for supercomputers, so they might not understand HPC specifics correctly and may think that they can use more cores than they actually can. It usually is better to define the number of cores to use explicitly, rather than "use all available cores". +GIS-tools are typically not written for supercomputers, so they might not understand HPC specifics correctly and may think that they can use more cores than is actually possible. It usually is better to define the number of cores to use explicitly, rather than "use all available cores". ::: @@ -75,11 +75,25 @@ Many programming languages have packages for parallel computing. ## External tools to run the scripts in parallel -The external tools enable running the scripts in parallel, with minimal changes to the scripts. This way of running programs is also called task farming or high throughput computing. The tools have different complexity and different features. The simpler ones are for running same script with different input paramaters, for example different input files, scenarios, time frames etc. More complicated tools support managing the whole workflow with several steps and with dependecies between steps. Workflow tools also help with making your work more reproducible by recording the computational steps and data. See [CSC Docs: High-throughput computing and workflows](https://docs.csc.fi/computing/running/throughput/) for more information. +Some external tools enable running scripts in parallel, with minimal changes +to the scripts. This way of running programs is also called task farming or +high-throughput computing. These tools have different levels of complexity and +different features. The simpler ones are for running the same script with +different input parameters, for example different input files, scenarios, time +frames etc. More complicated tools support managing the whole workflow with +several steps and with dependencies between steps. Workflow tools also help +with making your work more reproducible by recording the computational steps +and data. See +[CSC Docs: High-throughput computing and workflows](https://docs.csc.fi/computing/running/throughput/) +for more information. ### GNU Parallel -GNU parallel is a general Linux tool for executing commands or scripts in parallel in one node. It iterates over an input list, which can be a list of files or list of input parameters. The number of tasks may be higher than number of cores, it waits with execution as resources become available. GNU Parellel does not support dependecies between the tasks. +GNU parallel is a general Linux tool for executing commands or scripts in +parallel on one node. It iterates over an input list, which can be a list of +files or input parameters. If the number of tasks is higher than the number of +available cores, the process executes the tasks as the resources become +available. GNU Parallel does not support dependencies between tasks. 
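To give a concrete feel of the usage, here is a minimal sketch; the `-j` limit, the file pattern and the choice of `gdalinfo` are placeholder examples, not part of the course exercises:

```
# Run gdalinfo for every .tif file in the current directory,
# with at most 4 tasks running at the same time
ls *.tif | parallel -j 4 gdalinfo {}
```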
* [GNU Parallel documentation](https://www.gnu.org/software/parallel/)
* Geocomputing examples: [GDAL with GNU-parallel](https://github.com/csc-training/geocomputing/tree/master/gdal) and [PDAL with GNU-parallel](https://github.com/csc-training/geocomputing/tree/master/pdal)

### Snakemake

-Snakemake is a scientific workflow management system, that supports running for example R, bash and Python scripts. It can handle dependecies between the tasks and can be used both multi-core and multi-node set-ups. Snakemake is one of the easiest tools for workflow management.
+Snakemake is a scientific workflow management system that supports running for example R, bash and Python scripts. It can handle dependencies between the tasks and can be used with both multi-core and multi-node setups. Snakemake is one of the easiest tools for workflow management.

* [CSC Docs: Running Snakemake workflow on Puhti](https://docs.csc.fi/support/tutorials/snakemake-puhti/)

diff --git a/materials/parallel_python.md b/materials/parallel_python.md
index 4919d5bc..561eb6bc 100644
--- a/materials/parallel_python.md
+++ b/materials/parallel_python.md
@@ -1,6 +1,6 @@
# Parallel Python

## Spatial libraries with parallel support
-If starting with a new code, the first option could be to look for spatial libraries that have parallelization already built in:
+If starting from scratch with a new program, the first option would be to look for spatial libraries that have parallelization already built in:

* [Dask-geopandas](https://dask-geopandas.readthedocs.io/) for vector data analysis, still a lot more limited than `geopandas`
* [xarray](http://xarray.pydata.org/) for basic raster data analysis
@@ -10,19 +10,27 @@
* [osmnx](https://osmnx.readthedocs.io/en/stable/index.html) for routing

## Python parallel libraries
-The parallel spatial libraries are still developing and do not support very wide range of functionality, so often these do not fit all requirements. Or if you are changing an existing serial code to parallel. Then the next option is to write parallel coude yourself. The basic Python code runs in serial mode, so usually some changes to code are needed to benefit from parallel computing.
+The parallel spatial libraries are still developing and do not support a very
+wide range of functionality, so often these do not fit all requirements. The
+next option is to write parallel code yourself. The basic Python code runs in
+serial mode, so usually some changes to code are needed to benefit from
+parallel computing.

Python has many libraries to support parallelization:

* Multi-core: `multiprocessing` and `joblib`
* Multi-core or multi-node: **`dask`** and `mpi4py`

-If unsure, start with Dask, it is one of the newest, most versatile and easy to use. But Dask has many options and alternatives, `multiprocessing` might be easier to learn as first. All of the above mentioned spatial libraries use Dask, except `osmnx`, which uses `multiprocessing`.
+If unsure, start with Dask. It is one of the newest, most versatile and
+easiest to use. There are, of course, many options and alternatives to Dask.
+`multiprocessing` might be the easiest to learn first. All of the
+above-mentioned spatial libraries use Dask, except `osmnx`, which uses
+`multiprocessing`.

-:::{admonition} How many cores I can use?
+:::{admonition} How many cores can I use?
:class: tip
-If you need to check in code, how many cores you can use, use:
+To check in your code how many cores you can use, run:
```
import os
len(os.sched_getaffinity(0))
```

Do not use `multiprocessing.cpu_count()`, that counts only hardware cores, but does not take into account how many cores your Slurm job has reserved.

:::{admonition} Delayed computing
:class: tip

-One general feature of Dask is that it delays computing to the point when the result is actually needed, for example for plotting or saving to a file. So for example when running the code in Jupyter, cells that actually require longer time may run instantly, but later some cell may take a lot of time.
+One general feature of Dask is that it delays computing to the point when the result is actually needed, for example for plotting or saving to a file. So for example when running the code in Jupyter, cells that actually require a longer runtime may run instantly, but another cell may run much later.

:::

@@ -48,15 +56,15 @@
When using Dask, two main decisions have to be made for running code in parallel:
1. How to run the parallel code?
2. How to make the code parallel?

### How to run the parallel code?
-[Dask](https://docs.dask.org/en/stable/) supports different set-ups for parallel computing, from supercomptuers point of view, the main options are:
+[Dask](https://docs.dask.org/en/stable/) supports different set-ups for parallel computing. From a supercomputing point of view, the main options are:

* [Default schedulers](https://docs.dask.org/en/stable/scheduler-overview.html) for multi-core jobs.
* [LocalCluster](https://distributed.dask.org/en/latest/index.html) for multi-core jobs.
* [SLURMCluster](https://jobqueue.dask.org/en/latest/index.html) for multi-node jobs.

-While developing the code, it might be good to start with default scheduler or `LocalCluster` parallelization and then if needed change it to `SLURMCluster`. The required changes to code are small, when changing the parallelization set-up.
+While developing the code, it might be good to start with the default scheduler or `LocalCluster` parallelization and then, if needed, change it to `SLURMCluster`. The required changes to code are small when changing the parallelization setup.

-One of the advantages of using `LocalCluster`, is that then in Jupyter the [Dask-extension](https://github.com/dask/dask-labextension) is able to show progress and resource usage.
+One of the advantages of using `LocalCluster` is that the Jupyter [Dask-extension](https://github.com/dask/dask-labextension) is able to show progress and resource usage.

**Default scheduler** is started automatically, when Dask objects or functions are used.
@@ -94,9 +102,9 @@
client = Client(cluster)
```

### How to make the code parallel?
Dask provides several options, inc [Dask DataFrames](https://docs.dask.org/en/stable/dataframe.html), [Dask Arrays](https://docs.dask.org/en/stable/array.html), [Dask Bags](https://docs.dask.org/en/stable/bag.html), [Dask Delayed](https://docs.dask.org/en/stable/delayed.html) and [Dask Futures](https://docs.dask.org/en/stable/futures.html). This decision depends on the type of analyzed data and already existing code. Additionally Dask has support for scalable machine learning with [DaskML](https://ml.dask.org/).

-In this course we use Delayed functions. Delayed functions are useful in parallelising existing code. This approach delays function calls and creates a graph of the computing process. From the graph, Dask can then divide the work tasks to different workers whenever parallel computing is possible.
Keep in mind that the other ways of code parallelisation might suit better in different use cases.
+In this course we use delayed functions. Delayed functions are useful in parallelising existing code. This approach delays function calls and creates a graph of the computing process. From the graph, Dask can then divide the work tasks to different workers whenever parallel computing is possible. Keep in mind that the other ways of code parallelisation might suit you better in different use cases.

-The changes to code are exactly the same for all parallization set-ups. The most simple changes could be:
+The changes to code are exactly the same for all parallelization setups. The simplest changes could be:

* For-loops:
  * Change to Dask's delayed functions,
@@ -104,11 +112,11 @@

```
# Example of changing for-loop and map() to Dask
import time

-# Just a demo slow function, that waits for 2 seconds
+# Just a slow demo function that waits for 2 seconds
def slow_function(i):
    time.sleep(2)
    return(i)
-# Input data vector, the slow function is run for each element.
+# Input data vector. The slow function is run for each element.
input = range(0, 7)
```
**Serial**
@@ -151,7 +159,11 @@
print(a)

* Dask exports needed variables and libraries automatically to the parallel processes
* The variables must be serializable.
-* Avoid moving big size variables from main to parallel process. Spatial data analysis often includes significant amounts of data. It is better to read the data inside the parallel function. Give as input the file name or compute area coordinates etc.
+* Avoid moving variables that refer to large objects from the main serial
+  process to a parallel process. Spatial data analysis often involves
+  significant amounts of data. It is better to read the data inside the
+  parallel function: give the file name as input, compute area coordinates,
+  etc.
:::

### Batch job scripts
In batch job scripts it is important to set correctly:
* `nodes` - How many nodes to reserve? 1 for default schedulers or `LocalCluster`, more than 1 for `SLURMCluster`
-* `cpus-per-task` - How many cores to reserver? Depending on needs, 1-n. n depends on number of available CPUs per node.
+* `cpus-per-task` - How many cores to reserve? Depending on the task, something between 1 and the total number of available CPUs per node.

```
#SBATCH --nodes=1
diff --git a/materials/parallel_r.md b/materials/parallel_r.md
index e1815d96..b35172ce 100644
--- a/materials/parallel_r.md
+++ b/materials/parallel_r.md
@@ -1,7 +1,7 @@
# Parallel R

## Spatial libraries with parallel support
-If starting with a new code, the first option could be to look for spatial libraries that have parallelization already built in:
+If starting from scratch with new code, the first option would be to look for spatial libraries that have parallelization already built in:

* `terra` has some functions in parallel for raster processing
* `gdalcubes` for multi-dimensional spatial data analysis

## R parallel libraries

-The parallel spatial libraries cover only very limited functionality, so often these do not fit all requirements. Or if you are changing an existing serial code to parallel. Then the next option is to write parallel coude yourself.
+The parallel spatial libraries cover only very limited functionality, so often
+these do not fit all requirements.
They also cannot be used to easily
+parallelize existing serial code. Then the next option is to write parallel
+code yourself.

R has many libraries to support parallelization:

* Multi-core: `parallel`
* Multi-core or multi-node: **`future`**, `snow`, `foreach`, `Rmpi`, `pbdMPI`..

-If unsure, start with `future`, it is one of the newest, most versatile and easy to use.
+If unsure, start with `future`. It is one of the newest, most versatile and easiest to use.

:::{admonition} Supercomputer usage
:class: warning

-Some of the packages require specific settings in Puhti, see [CSC Docs, r-env, Parallel batch jobs](https://docs.csc.fi/apps/r-env/#parallel-batch-jobs) for details about some of these packages. These might differ from package's general instructions.
+Some of the packages require specific settings in Puhti, see [CSC Docs, r-env, Parallel batch jobs](https://docs.csc.fi/apps/r-env/#parallel-batch-jobs) for details about some of these packages. These might differ from the package's general instructions.

:::

@@ -41,7 +44,7 @@
multisession | multi-core, background R sessions, limited to one node
multicore | multi-core, forked R processes, limited to one node, not in Windows nor in RStudio
cluster | multi-node, external R sessions

-While developing the code, it might be good to start with `multisession` or `multicore` parallelization and then if needed change it to `cluster`. The required changes to code are small, when changing the parallelization set-up.
+While developing the code, it might be good to start with `multisession` or `multicore` parallelization and then if needed change it to `cluster`. The required changes to code are small when changing the parallelization set-up.

```
# Multi-core, use one of them
@@ -68,12 +71,12 @@
The most simple changes could be:

```
# Example of changing for-loop and purrr's map() to furrr's future_map()
-# Just a demo slow function, that waits for 5 seconds
+# Just a slow demo function that waits for 5 seconds
slow_function<-function(i) {
  Sys.sleep(5)
  return(i)
}
-# Input data vector, the slow function is run for each element.
+# Input data vector. The slow function is run for each element.
input = 1:7

# SERIAL options
@@ -113,7 +116,7 @@
d <- future_lapply(input, slow_function)

* `future` exports needed variables and libraries automatically to the parallel processes
* The variables must be serializable. Terra's raster objects are not serializable, see [Terra library's recommendations](https://github.com/rspatial/terra/issues/36)
-* Avoid moving big size variables from main to parallel process. Spatial data analysis often includes significant amounts of data. It is better to read the data inside the parallel function. Give as input the file name or compute area coordinates etc.
+* Avoid moving variables that refer to large objects from the serial main process to a parallel process. Spatial data analysis often involves significant amounts of data. It is better to read the data inside the parallel function. Give the file name as input, compute area coordinates, etc.

diff --git a/materials/prerequisites.md b/materials/prerequisites.md
index 7da3a386..e23c2fea 100644
--- a/materials/prerequisites.md
+++ b/materials/prerequisites.md
@@ -2,14 +2,14 @@

## Experience

-To make this course as enjoyable as possible for you and to make sure you can get something out of it, we expect you to know about the following:
+To make this course as enjoyable as possible for you, and to make sure you can get something out of it, we expect you to know about the following:

* General understanding of geoinformatics, vector and raster data, coordinate systems.
* General understanding of either Python, R or use of command line tools, e.g. GDAL, PDAL, ...
-* Basic Unix commands (know how to use these commands in a terminal): `cd, ls, mv, cp, rm, chmod, less, tail, echo, mkdir, pwd`. If you need to refresh these commands, go through before the [Terminal](terminal.md) page.
+* Basic Unix commands (know how to use these commands in a terminal): `cd, ls, mv, cp, rm, chmod, less, tail, echo, mkdir, pwd`. If you need to refresh these commands, read through the [Terminal](terminal.md) page.

-This also means that this course is not an introduction to either of the topics. Please refer to [Where to go from here page](where_to_go.md) to learn more these topics.
+This also means that this course is not an introduction to any of the topics above. Please refer to the [Where to go from here](where_to_go.md) page for suggestions for further learning about the course topics.

## CSC account

-For the exercises, [CSC user account](https://docs.csc.fi/accounts/how-to-create-new-user-account/) and [project](https://docs.csc.fi/accounts/how-to-create-new-project/) with [access to Puhti](https://docs.csc.fi/accounts/how-to-add-service-access-for-project/). For Allas exercise also Allas service must be enabled for the project.
+For the exercises, a [CSC user account](https://docs.csc.fi/accounts/how-to-create-new-user-account/) and a [CSC project](https://docs.csc.fi/accounts/how-to-create-new-project/) with [access to Puhti](https://docs.csc.fi/accounts/how-to-add-service-access-for-project/) are required. For the Allas exercise, the Allas service must be enabled for the project.
diff --git a/materials/software.md b/materials/software.md
index 4280059b..71059b6b 100644
--- a/materials/software.md
+++ b/materials/software.md
@@ -1,7 +1,7 @@
# GIS tools

-* Puhti has the widest GIS software portfolio of CSC supercomputers, and possibly of all supercomputers in the world
-* Pre-installed software makes it easy to start working with Puhti.
+* Puhti has the widest GIS software portfolio of the CSC supercomputers, and possibly of all supercomputers in the world.
+* Pre-installed software makes it easy to start working on Puhti.
* GIS tools are originally not planned for supercomputers -> limited ability to utilize the computing power.
* GIS installations at CSC have been done partly with GeoPortti funding.
@@ -9,7 +9,7 @@

* [Ames Stereo Pipeline](https://docs.csc.fi/apps/ames-stereo/) for processing stereo images
* [ArcGIS Python API](https://docs.csc.fi/apps/arcgis/)
-* [CloudCompare](https://docs.csc.fi/apps/cloudcompare/) for visualizing, editing and processing poing clouds
+* [CloudCompare](https://docs.csc.fi/apps/cloudcompare/) for visualizing, editing and processing point clouds
* [FORCE](https://docs.csc.fi/apps/force/) for mass-processing of medium-resolution satellite images
* [GDAL](https://docs.csc.fi/apps/gdal/) for geospatial data formats
* **[Geoconda](https://docs.csc.fi/apps/geoconda/)** - Python spatial analysis libraries
@@ -21,10 +21,10 @@
* [PCL](https://docs.csc.fi/apps/pcl/) for 2D/3D image and point cloud processing
* [PDAL](https://docs.csc.fi/apps/pdal/) for point cloud translations and processing
* [QGIS](https://docs.csc.fi/apps/qgis/) General purpose GIS software family for viewing, editing and analysing geospatial data
-* **[R for GIS](https://docs.csc.fi/apps/r-env-for-gis/)** R spataial analysis libraries
+* **[R for GIS](https://docs.csc.fi/apps/r-env-for-gis/)** R spatial analysis libraries
* [SAGA GIS](https://docs.csc.fi/apps/saga-gis/) General purpose GIS software family for viewing, editing and analysing geospatial data
-* [Sen2Cor](https://docs.csc.fi/apps/sen2cor/) for atmospheric-, terrain and cirrus correction of the Sentinel-2 products
+* [Sen2Cor](https://docs.csc.fi/apps/sen2cor/) for atmospheric, terrain and cirrus correction of Sentinel-2 products
-* [Sen2mosaic](https://docs.csc.fi/apps/sen2mosaic/) for download, preprocessing and mosaicing of Sentinel-2 products
+* [Sen2mosaic](https://docs.csc.fi/apps/sen2mosaic/) for downloading, preprocessing and mosaicing Sentinel-2 products
* **[SNAP](https://docs.csc.fi/apps/snap/)** for remote sensing applications
* [WhiteboxTools](https://docs.csc.fi/apps/whiteboxtools/) an advanced geospatial data analysis platform
* [Zonation](https://docs.csc.fi/apps/zonation/) Spatial conservation prioritization framework
@@ -50,7 +50,7 @@

* Additional, easy to install yourself [EasyBuild recipes](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs) for CGAL, GDAL, GEOS, ncview, PROJ, R.

-:::{admonition} GIS tools NOT available in supercomputers
+:::{admonition} GIS tools NOT available on supercomputers
:class: caution

* **Servers** -> these can be run in cPouta
@@ -75,10 +75,10 @@
* LasTools:
   * Some tools available for "free"
   * No license from CSC
-   * Easy to use own license
+   * Easy to use your own license
* ArcGIS Python API:
-   * For many tools connection to ArcGIS Online required.
+   * For many tools, a connection to ArcGIS Online is required.
-* In general only tools with floating licenses possible
+* In general, only tools with floating licenses can be used
* Tools with node-locked licenses -> cPouta

:::
@@ -95,25 +95,24 @@
| SNAP | - | [snappy](https://senbox.atlassian.net/wiki/spaces/SNAP/pages/19300362/How+to+use+the+SNAP+API+from+Python), [snapista](https://snap-contrib.github.io/snapista/gettingstarted.html) | [GPT](https://senbox.atlassian.net/wiki/spaces/SNAP/pages/70503475/Bulk+Processing+with+GPT) |
| WhiteboxTools | [whiteboxR*](https://github.com/opengeos/whiteboxR) | [WhiteboxTools](https://www.whiteboxgeo.com/manual/wbt_book/python_scripting/using_whitebox_tools.html)| [Yes](https://www.whiteboxgeo.com/manual/wbt_book/command_prompt.html) |

-Additionally Ames Stereo Pipeline, FORCE, LasTools, OpenDroneMap, PCL and Zonation have commandline interface.
-
-\* are not available in Puhti currently, but should be possible to install, ask if you need.
+Additionally, Ames Stereo Pipeline, FORCE, LasTools, OpenDroneMap, PCL and Zonation have a command-line interface.

+\* are not available in Puhti currently, but should be possible to install; ask if you need any of them.
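Many of the tools above are used through their command-line interface, and such tools can also be called from a script. A minimal sketch in Python with the standard `subprocess` module; it assumes the relevant module has been loaded so that `gdalinfo` is on the `PATH`, and `input.tif` is a placeholder file name.

```
import subprocess

# Run gdalinfo on a raster and capture its report
result = subprocess.run(
    ["gdalinfo", "input.tif"],  # placeholder file name
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError if the tool fails
)
print(result.stdout)
```

The same pattern works for any of the command-line tools listed above; only the argument list changes.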
:::{admonition} Modules
:class: important

-* Supercomputers are shared computing environment with many mutually incompatible tools installed
+* Supercomputers are a shared computing environment with many mutually incompatible tools installed
* By default only basic Linux tools are available
* Pre-installed tools are available via modules
* One module: single program or group of similar programs
* Usage:
   * Check documentation for available module names and versions.
-    * Load a module(s) -> the system can find the tools provided by the module
-    * Use tools from the loaded module
+    * Load one or more modules -> the system can find the provided tools
+    * Use tools from the loaded module(s)

-Example. Loading module for GDAL
+Example: loading the module for GDAL

```
module load gdal
diff --git a/materials/spatial_data_at_csc.md b/materials/spatial_data_at_csc.md
index 633e1b40..5cec9a82 100644
--- a/materials/spatial_data_at_csc.md
+++ b/materials/spatial_data_at_csc.md
@@ -1,8 +1,8 @@
# Spatial data in Puhti

-* Large commonly used **Finnish geospatial datasets with open license**
+* Large commonly used **Finnish geospatial datasets with open licenses**
* Removes transfer bottleneck
-* Located at: `/appl/data/geo/`
+* Located in: `/appl/data/geo/`
* **All Puhti users have read access**
* ~13 TB of datasets available:
   * **Paituli data**, with virtual mosaics for raster data
diff --git a/materials/support.md b/materials/support.md
index 15249482..f0add68e 100644
--- a/materials/support.md
+++ b/materials/support.md
@@ -10,8 +10,8 @@
   - A descriptive title
   - What you wanted to achieve and on which computer
   - Which commands you have given
-  - What error messages did you get
-  - [More tips to help us to more quickly solve your issue](https://docs.csc.fi/support/support-howto/)
+  - What error messages you got
+  - [More tips to help us solve your issue quickly](https://docs.csc.fi/support/support-howto/)

## Support channels
diff --git a/materials/terminal.md b/materials/terminal.md
index e14d0924..77c9e7d5 100644
--- a/materials/terminal.md
+++ b/materials/terminal.md
@@ -9,11 +9,11 @@ A typing-based interface is often called a command-line interface, or CLI, to di

Shell is the standard way to interact with a supercomputer. It is worth learning the basics and getting comfortable with the "black box", to make efficient use of the resources.

-The terms terminal, command line and shell are often used interchangably, even though they mean slightly different things.
+The terms terminal, command line and shell are often used interchangeably, even though they mean slightly different things.

## Basic Linux commands

-Do you remember on how you edited some files in the web interface? Let's do the same thing again; only from the command line:
+Do you remember how you edited some files in the web interface? Let's do the same thing again, only now from the command line:

### Navigating folders

@@ -23,14 +23,14 @@ Do you remember on how you edited some files in the web interface? Let's do the

pwd
```

-2. We would like to create a new directory in our projects scratch students directory with our name, let's move there:
+2. In our project's scratch storage, there is a directory called `students`. We would like to create a new subdirectory and give our username as its name. Let's move to `students`:

```bash
cd /scratch/project_200xxxx/students
```

-3. Make a directory with your name (you can either type it or use the variable $USER) and see if it appears:
+3. Make a directory with your username (you can either type it or use the variable $USER) and see if it appears:

```bash
mkdir $USER
```

cd $USER
```

:::{admonition} Auto complete
:class: seealso

-If you just type `cd` and the first letter of the folder name, then hit `tab` key, the terminal completes the name. Handy!
+If you just type `cd` and the first letter of the folder name, then hit the `tab` key, the terminal completes the name. Handy!
:::

## Exploring files

@@ -63,13 +63,13 @@ wget https://raw.githubusercontent.com/csc-training/csc-env-eff/master/part-1/pr
ls -l # options are l for long format,
```

-3. Use the `less` command to check out what the file looks like:
+3. Use the `less` command to check what the contents of the file look like:

```bash
less my-first-file.txt
```

-4. To exit the `less` preview of the file, hit `q`.
+4. To exit the `less` view of the file, hit `q`.

5. Make a copy of this file:

@@ -88,7 +88,7 @@
ls -l
```

:::{admonition} Moving files
:class: seealso

-If you don't want to have duplicate files you can use `mv` to 'move/rename' the file. Syntax is the same: `mv /path/to/source/oldname /path/to/destination/newname`.
+If you don't want to have duplicate files, you can use `mv` to 'move/rename' the file. The syntax is the same: `mv /path/to/source/oldname /path/to/destination/newname`.
:::

diff --git a/materials/where_to_go.md b/materials/where_to_go.md
index 381a828a..42f27e9f 100644
--- a/materials/where_to_go.md
+++ b/materials/where_to_go.md
@@ -5,7 +5,7 @@
* (Create or get access to a CSC project with your CSC user account)
* Choose a suitable service for your task
* [Sign-up for CSC gis-hpc mailing list ](https://postit.csc.fi/sympa/subscribe/gis-hpc)
-* Ask for help, if needed, we don't bite :)
+* Ask for help if you need any; we don't bite :)

## Training calendars
* [CSC](https://csc.fi/koulutukset/koulutuskalenteri/)