From 21434c82a8898dd9ba26ae8ede029f7d1f36ed0a Mon Sep 17 00:00:00 2001
From: Kylli Ek
Date: Fri, 6 Sep 2024 08:52:53 +0300
Subject: [PATCH] Update parallel.md

---
 materials/parallel.md | 46 +++++++++++++------------------------------
 1 file changed, 14 insertions(+), 32 deletions(-)

diff --git a/materials/parallel.md b/materials/parallel.md
index 82144d97..b2f0d831 100644
--- a/materials/parallel.md
+++ b/materials/parallel.md
@@ -15,7 +15,12 @@ Spatial data analysis often allows splitting at least some parts of the analysis
 :::{admonition} Think about your own work
 :class: tip
 
-Do you need to run a lot of steps one after another? Or few steps that need a lot of memory? Do steps depend on each other? Which steps could be run in parallel? Which steps cannot be run in parallel? How to split your data?
+* Do you need to run a lot of steps one after another?
+* Or a few steps that need a lot of memory?
+* Do steps depend on each other?
+* Which steps could be run in parallel?
+* Which steps cannot be run in parallel?
+* How to split your data?
 
 :::
 
@@ -23,9 +28,9 @@ For doing analysis in parallel there are four main options:
 
-1) Tools with built-in parallel support
+1) Use spatial analysis tools with built-in parallel support
 2) Write your own scripts using parallel libraries of different scripting languages
-3) External tools to run the scripts in parallel
+3) Use external tools to run the scripts in parallel
 4) Write your own parallel code
 
 From a practical point of view on supercomputers, it is also important to understand if the tool/script supports:
 
 For multi-core there are clearly more options. The number of cores in a single node
@@ -36,18 +41,18 @@
 ## Tools with built-in parallel support
 
-Look from the tool's manual, if it has built-in support for using multiple CPUs/cores. For command line tools, look for `-n(umber of)_cores`, `-c(ores/pu)`, `-j(obs)`, `-t(hreads)` or similar. Unfortunatelly not many GIS-tools have such option.
+Check the tool's manual to see if it has built-in support for using multiple CPUs/cores. For command line tools, look for options named `number_of_cores`, `cores`, `cpu`, `jobs`, `threads` or similar. Unfortunately, not many GIS tools have such an option.
 
 Some example geospatial tools with built-in parallel support:
 * GDAL, some commands e.g. `gdalwarp -multi -wo NUM_THREADS=val/ALL_CPUS ...`
 * FORCE
-* Lastools; many tools support parallel execution by setting `-cores`
+* Lastools
 * OpenDronemap
 * OrfeoToolBox
 * PDAL-wrench
 * SNAP
 * Zonation
-* Whiteboxtools; many tools support parallel execution without extra action
+* Whiteboxtools
 
 All of these tools are multi-core, but not multi-node.
 
 The deep learning libraries have options for [Multi-GPU and multi-node machine learning](https://docs.csc.fi/support/tutorials/ml-multi/).
@@ -65,12 +70,12 @@
 Many programming languages have packages for parallel computing.
 
 * **Python** and **R** have several packages for multi-core and multi-node parallelization, see [Parallel Python](parallel_python.md) and [Parallel R](parallel_r.md) for more details.
-* [Julia multi-threading](https://docs.julialang.org/en/v1/manual/multi-threading/#man-multithreading)
-* [MATLAB Parallel Computing Toolbox](https://se.mathworks.com/products/parallel-computing.html)
+* [Julia](https://docs.csc.fi/apps/julia/#usage)
+* [MATLAB](https://docs.csc.fi/apps/matlab/#parallel-computing-on-matlab)
 
 ## External tools to run the scripts in parallel
 
-The external tools enable running the scripts in parallel. 
-This way of running programs is also called task farming or high throughput computing. The tools have different complexity and different features. The simpler ones of these tools are for running same script with different input paramaters, for example different input files, scenarios, time frames etc. More complicated tools support managing the whole workflow with several steps and with dependecies between steps. Workflow tools also help with making your work more reproducible by recording the computational steps and data. See [CSC Docs: High-throughput computing and workflows](https://docs.csc.fi/computing/running/throughput/) for more information.
+The external tools enable running the scripts in parallel, with minimal changes to the scripts. This way of running programs is also called task farming or high throughput computing. The tools differ in complexity and features. The simpler ones are for running the same script with different input parameters, for example different input files, scenarios, time frames etc. More complicated tools support managing a whole workflow with several steps and dependencies between steps. Workflow tools also help with making your work more reproducible by recording the computational steps and data. See [CSC Docs: High-throughput computing and workflows](https://docs.csc.fi/computing/running/throughput/) for more information.
 
 ### GNU Parallel
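+
+For example, processing a set of input files could look like this (a minimal sketch; `process_file.py` and the `.tif` files are placeholders for your own script and data):
+
+```bash
+# Run the same script once per input file, at most 4 tasks at a time;
+# {} is replaced by one file name from the list given after :::
+parallel -j 4 python process_file.py {} ::: data/*.tif
+```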
@@ -92,26 +97,3 @@ Parallel programs are typically parallelized with the [MPI](https://en.wikipedia
 
 * [CSC training calendar](https://www.csc.fi/en/training#training-calendar), look for advanced coding courses.
 * [CSC Computing Environment, How to speed up jobs](https://a3s.fi/CSC_training/11_speed_up_jobs.html#/how-to-speed-up-jobs)
-
-
-:::{admonition} Advanced topics - MPI/OpenMP
-:class: dropdown, seealso
-
-**What is MPI?**
-
-- MPI (Message Passing Interface) is a widely used standard for writing software that runs in parallel
-- MPI utilizes parallel **processes** that _do not share memory_
-  - To exchange information, processes pass data messages back and forth between the cores
-  - Communication can be a performance bottleneck
-- MPI is required when running on multiple nodes
-
-**What is OpenMP?**
-
-- OpenMP (Open Multi-Processing) is a standard that utilizes compute cores that share memory, i.e. **threads**
-  - They do not need to send messages between each other
-- OpenMP is easier for beginners, but problems quickly arise with so-called _race conditions_
-  - This appears when different compute cores process and update the same data without proper synchronization
-- OpenMP is restricted to a single node
-
-:::
-
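+
+As a sketch of what running such a program can look like, a minimal Slurm batch script reserving two full nodes for an MPI program (the account, partition, task count and program name are placeholders; check CSC Docs for the values matching your own project and system):
+
+```bash
+#!/bin/bash
+#SBATCH --account=project_xxx   # placeholder CSC project number
+#SBATCH --partition=large       # example partition, check CSC Docs
+#SBATCH --nodes=2               # MPI programs can use several nodes
+#SBATCH --ntasks-per-node=40    # one MPI process per core, placeholder value
+
+# srun launches the MPI processes on all reserved cores
+srun ./my_mpi_program
+```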