Spatial data analysis often allows splitting at least some parts of the analysis into independent tasks, which can then be run in parallel.

:::{admonition} Think about your own work
:class: tip

* Do you need to run a lot of steps one after another?
* Or few steps that need a lot of memory?
* Do steps depend on each other?
* Which steps could be run in parallel?
* Which steps cannot be run in parallel?
* How to split your data?

:::

## How to parallelize?

There are four main options for doing analysis in parallel:

1) Use spatial analysis tools with built-in parallel support
2) Write your own scripts using parallel libraries of different scripting languages
3) Use external tools to run the scripts in parallel
4) Write your own parallel code

From a practical point of view on supercomputers, it is also important to understand whether the tool/script supports:

* multi-core parallelization (within a single node)
* multi-node parallelization

For multi-core there are clearly more options. The number of cores in a single node also sets the upper limit for how far a multi-core tool can scale.

## Tools with built-in parallel support

Check from the tool's manual, if it has built-in support for using multiple CPUs/cores. For command line tools, look for options such as `number_of_cores`, `cores`, `cpu`, `jobs`, `threads` or similar. Unfortunately, not many GIS tools have such an option.

Some example geospatial tools with built-in parallel support:
* GDAL, some commands e.g. `gdalwarp -multi -wo NUM_THREADS=val/ALL_CPUS ...`
* FORCE
* LAStools
* OpenDroneMap
* Orfeo ToolBox
* PDAL-wrench
* SNAP
* Zonation
* WhiteboxTools

All of these tools are multi-core, but not multi-node.
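
For example, GDAL's multithreaded warping can also be switched on through its Python bindings. A minimal sketch, assuming the bindings are installed; the file names and target CRS are placeholders, not from the course material:

```python
# GDAL's built-in multithreading from Python; equivalent to
# `gdalwarp -multi -wo NUM_THREADS=ALL_CPUS` on the command line.
from osgeo import gdal

gdal.UseExceptions()  # raise Python exceptions instead of returning None

gdal.Warp(
    "output.tif",                          # placeholder output file
    "input.tif",                           # placeholder input file
    dstSRS="EPSG:3067",                    # example target CRS
    multithread=True,                      # corresponds to -multi
    warpOptions=["NUM_THREADS=ALL_CPUS"],  # corresponds to -wo NUM_THREADS=ALL_CPUS
)
```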

The deep learning libraries have options for multi-GPU and multi-node machine learning.

## Write your own scripts using parallel libraries

Many programming languages have packages for parallel computing.

* **Python** and **R** have several packages for multi-core and multi-node parallelization, see [Parallel Python](parallel_python.md) and [Parallel R](parallel_r.md) for more details.
* [Julia](https://docs.csc.fi/apps/julia/#usage)
* [MATLAB](https://docs.csc.fi/apps/matlab/#parallel-computing-on-matlab)
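
As a small taste of the scripting approach, here is a minimal multi-core sketch with Python's standard `multiprocessing` package; `process_file`, the `data` folder and the worker count are assumptions for illustration:

```python
# Run the same analysis function for many input files in parallel.
from multiprocessing import Pool
from pathlib import Path

def process_file(path: Path) -> str:
    # Placeholder for the real per-file analysis step.
    return f"processed {path.name}"

if __name__ == "__main__":
    files = sorted(Path("data").glob("*.tif"))  # assumed input files
    with Pool(processes=4) as pool:  # match this to the cores you reserve
        for result in pool.map(process_file, files):
            print(result)
```

This pattern is multi-core only; for multi-node options, see the Parallel Python and Parallel R pages linked above.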

## External tools to run the scripts in parallel

The external tools enable running the scripts in parallel, with minimal changes to the scripts. This way of running programs is also called task farming or high throughput computing. The tools differ in complexity and features. The simpler ones run the same script with different input parameters, for example different input files, scenarios or time frames. More complicated tools support managing the whole workflow with several steps and with dependencies between steps. Workflow tools also help with making your work more reproducible by recording the computational steps and data. See [CSC Docs: High-throughput computing and workflows](https://docs.csc.fi/computing/running/throughput/) for more information.

### GNU Parallel

[GNU Parallel](https://www.gnu.org/software/parallel/) is a shell tool for executing commands in parallel, for example for running the same script for many input files.

## Write your own parallel code

Parallel programs are typically parallelized with the [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) and/or [OpenMP](https://en.wikipedia.org/wiki/OpenMP) standards.
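
To give an idea of what such code looks like, here is a minimal sketch with the `mpi4py` package (one possible choice, assumed here); each process runs the same script and picks its own share of the work:

```python
# Minimal MPI sketch: every process executes this same script.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # id of this process, 0..size-1
size = comm.Get_size()  # total number of processes, possibly on many nodes

tasks = list(range(100))      # hypothetical task list
my_tasks = tasks[rank::size]  # round-robin split between processes
print(f"Rank {rank}/{size} handles {len(my_tasks)} tasks")
```

Launched with e.g. `mpirun -n 4 python script.py`, or `srun python script.py` inside a batch job, the same script runs as several cooperating processes.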

* [CSC training calendar](https://www.csc.fi/en/training#training-calendar), look for advanced coding courses.
* [CSC Computing Environment, How to speed up jobs](https://a3s.fi/CSC_training/11_speed_up_jobs.html#/how-to-speed-up-jobs)


:::{admonition} Advanced topics - MPI/OpenMP
:class: dropdown, seealso

**What is MPI?**

- MPI (Message Passing Interface) is a widely used standard for writing software that runs in parallel
- MPI utilizes parallel **processes** that _do not share memory_
- To exchange information, processes pass data messages back and forth between the cores
- Communication can be a performance bottleneck
- MPI is required when running on multiple nodes
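
A hedged sketch of such message passing, using the `mpi4py` package as one example binding:

```python
# Two MPI processes exchanging one message;
# run with e.g. `mpirun -n 2 python demo.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"answer": 42}, dest=1)  # rank 0 sends a message to rank 1
elif rank == 1:
    data = comm.recv(source=0)         # rank 1 waits for the message
    print("rank 1 received:", data)
```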

**What is OpenMP?**

- OpenMP (Open Multi-Processing) is a standard that utilizes compute cores that share memory, i.e. **threads**
- They do not need to send messages between each other
- OpenMP is easier for beginners, but problems quickly arise with so-called _race conditions_
- These occur when different compute cores process and update the same data without proper synchronization
- OpenMP is restricted to a single node

:::
