Merge pull request idaholab#16272 from crswong888/step07-15232

Tutorial 1: Step 7
bylore · Nov 24, 2020 · db57ca9
2 parents 455bf9e + 09de5f4
commit db57ca9

Showing 10 changed files with 164 additions and 11 deletions.
diff --git a/framework/doc/acronyms.yml b/framework/doc/acronyms.yml
@@ -20,7 +20,7 @@ JSON: JavaScript Object Notation
 LGPL: GNU Lesser General Public License
 MMS: Method of Manufactured Solutions
 MWR: Method of Mean Weighted Residuals
-MPI: Method Passing Interface
+MPI: Message Passing Interface
 MOOSE: Multiphysics Object Oriented Simulation Environment
 NE: Nuclear Energy
 NQA-1: Nuclear Quality Assurance Level 1

diff --git a/framework/doc/globals.yml b/framework/doc/globals.yml
@@ -1,4 +1,5 @@
 libMesh: http://libmesh.github.io/
+PETSc: https://www.mcs.anl.gov/petsc/
 MOOSE: http://www.mooseframework.org
 YAML: http://yaml.org/
 python: https://www.python.org/

diff --git a/large_media b/large_media
diff --git a/...g_started/examples_and_tutorials/tutorial01_app_development/step01_moose_app.md b/...g_started/examples_and_tutorials/tutorial01_app_development/step01_moose_app.md
@@ -64,8 +64,6 @@ Ran 1 tests in 0.3 seconds.
 
 Later in this tutorial, the testing system will be explored in greater detail and tests will be created for the Babbler application.
 
-*For more information about the MOOSE testing system, please visit the [application_development/test_system.md] page.*
-
 ## Enable Use of GitHub id=git
 
 [Git](https://git-scm.com) is a version control system that enables teams of software developers to manage contributions to a single code base. When using Git, a `commit` is an update to the repository that marks a checkpoint to be revisited even after further changes are made. A repository's *commit log* shows the history of commits, and helps track the progression of code. A `push` uploads the local version of the repository to the remote (online) one.

diff --git a/..._started/examples_and_tutorials/tutorial01_app_development/step02_input_file.md b/..._started/examples_and_tutorials/tutorial01_app_development/step02_input_file.md
@@ -1,4 +1,4 @@
-# Step 2: Creating an Input File
+# Step 2: Write an Input File
 
 In this step, the concept of an input file is introduced. These files provide the means for controlling [!ac](FE) simulations with MOOSE. To demonstrate this concept, a steady-state diffusion of pressure from one end of the pipe, between the pressure vessels, to the other (see the [tutorial01_app_development/problem_statement.md] page) will be considered. The goal, here, is to create an input file that solves this simple [!ac](BVP). This problem is detailed in the [#demo] section, but, first, some basic information regarding input files and their execution are provided. As for many steps of this tutorial, concepts will be introduced and a hands-on demonstration will follow.
 

diff --git a/...g_started/examples_and_tutorials/tutorial01_app_development/step04_weak_form.md b/...g_started/examples_and_tutorials/tutorial01_app_development/step04_weak_form.md
@@ -1,4 +1,4 @@
-# Step 4: Generating a Weak Form
+# Step 4: Generate a Weak Form
 
 The first question to ask when presented with a [!ac](PDE) that governs a problem's physics is: "How do I solve this equation?" The MOOSE answer to this question is to use [Galerkin's Method](#galerkin), which involves expressing the *strong form* of a governing [!ac](PDE) in its *weak form*.
 

diff --git a/...arted/examples_and_tutorials/tutorial01_app_development/step05_kernel_object.md b/...arted/examples_and_tutorials/tutorial01_app_development/step05_kernel_object.md
@@ -1,4 +1,4 @@
-# Step 5: Creating a Kernel Object
+# Step 5: Develop a Kernel Object
 
 In this step, the basic components of [#kernels] will be presented. To demonstrate their use, a new `Kernel` will be created to solve Darcy's Pressure equation, whose weak form was derived in the [previous step](tutorial01_app_development/step04_weak_form.md#demo). The concept of class *inheritance* shall also be demonstrated, as the object to solve Darcy's equation will inherit from the `ADKernel` class.
 

diff --git a/...ng_started/examples_and_tutorials/tutorial01_app_development/step07_parallel.md b/...ng_started/examples_and_tutorials/tutorial01_app_development/step07_parallel.md
@@ -1,6 +1,154 @@
 # Step 7: Execute in Parallel
 
-!alert construction
-The remainder of this tutorial is currently being developed. More content should be available soon. For now, refer back to the [examples_and_tutorials/index.md] page for other helpful training materials or check out the MOOSE [application_development/index.md] pages for more information.
+A major objective of MOOSE is performance. This step briefly introduces parallel processing and demonstrates the basic commands used for running an application in parallel. A few basic tips on how to evaluate and improve performance are also given.
+
+## MOOSE Multiprocessing
+
+There are two types of parallelism supported by MOOSE: [multiprocessing](https://en.wikipedia.org/wiki/Multiprocessing) and [multithreading](https://en.wikipedia.org/wiki/Thread_(computing%29). At its core, MOOSE is designed to run in parallel by using the [Message Passing Interface](https://en.wikipedia.org/wiki/Message_Passing_Interface) protocol. [!ac](MPI) is a library of programming tools for accessing hardware and controlling how multiple CPUs exchange information while working simultaneously to run a single computer program. Shared memory parallelism is also supported through various threading libraries and can be used in conjunction with [!ac](MPI).
+
+The general approach to solving a [!ac](FE) simulation in parallel is to partition the mesh and run an individual process that assembles and solves the system of equations for each of those mesh partitions. In general, the duration of the solve decreases as the number of CPUs increases.
+
+### Basic Commands id=commands
+
+The `mpiexec` command is used to execute a MOOSE-based application using [!ac](MPI). For example,
+the tutorial application can be executed as follows, where the `-n 4` is an argument supplied to
+the `mpiexec` command that indicates to use 4 processors for execution.
+
+```bash
+cd ~/projects/babbler
+mpiexec -n 4 ./babbler-opt -i test/tests/kernels/simple_diffusion/simple_diffusion.i
+```
+
+For most cases, using [!ac](MPI) alone is the best course of action. If threading is desired, it may
+be enabled using the `--n-threads` option, which is supplied directly to the application executable.
+For example, the following runs the babbler application with 4 threads.
+
+```bash
+cd ~/projects/babbler
+./babbler-opt -i test/tests/kernels/simple_diffusion/simple_diffusion.i --n-threads 4
+```
+
+As mentioned, it is possible to run using both [!ac](MPI) and threading. This is accomplished
+by combining the two methods described above.
+
+!alert tip title=Optimum numbers are hardware and problem dependent
+The number of processors and threads available for execution is hardware dependent. A modern laptop
+typically has 4 processors, with 2 threads each. In general, it is recommended to begin with
+using just [!ac](MPI). Thus, it is typical to use between 4 and 8 processors for the `mpiexec`
+command. If threading is added then using 4 processors for [!ac](MPI) and 2 for threading would be
+typical. The optimum arrangement for parallel execution will be hardware and problem dependent, so it
+may be worthwhile to explore differing arrangements before running a full-scale problem.
+
+!alert note title=Parallel Execution in Peacock
+In the "Execute" tab of Peacock, the `mpiexec` and `--n-threads` options can be used by selecting the "Use MPI" and "Use Threads" checkboxes and specifying the command syntax. These options can be set and enabled by default in the Peacock preferences.
+
+*For more information about command-line options, please visit the [application_usage/command_line_usage.md] page.*
+
+### Model Setup id=model-setup
+
+The [Mesh] System in MOOSE provides several strategies for configuring a [!ac](FE) model to be solved in parallel. Most end-users won't have to alter the default settings. Even application developers need not worry about writing parallel code, since this is handled by the core systems of MOOSE, [libMesh], and [PETSc]. However, advanced users are likely to encounter situations in which the default parallelization techniques are not suitable for the problem they are solving. Such situations are beyond the scope of this tutorial and interested readers may refer to the following for more information:
+
+- [syntax/Mesh/Partitioner/index.md]
+- [syntax/Mesh/index.md#replicated-and-distributed-mesh]
+- [syntax/Mesh/splitting.md]
+- [source/partitioner/PetscExternalPartitioner.md]
+
+
+### Evaluating and Enhancing Performance
+
+MOOSE includes a tool for evaluating performance: [PerfGraphOutput.md]. This enables a report to be printed to the terminal that details the amount of time spent processing different parts of the program as well as the total execution time. By evaluating performance reports, the ideal [model setup](#model-setup) and [parallel type](#commands) can be found. This feature can be enabled in an input file as follows, or from the command line using `--timing`.
+
+```
+[Outputs]
+  perf_graph = true
+[]
+```
+
+There is an entire field of science devoted to [!ac](HPC) and massively parallel processing. Although it is a valuable one, a formal discussion is beyond the scope of this tutorial. One concept worth mentioning is [scalable parallelism](https://en.wikipedia.org/wiki/Scalable_parallelism), which refers to software that performs at the same level for larger problems that use more processes as it does for smaller problems that use fewer processes. In MOOSE, selecting a number of processes based on the number of [!ac](DOFs) in the system is a simple way to try to achieve scalability.
+
+!alert tip title=Try to target 20,000 [!ac](DOFs)-per-process
+MOOSE developers tend to agree that 20,000 is the ideal number of [!ac](DOFs) that a single process may be responsible for. The number of [!ac](DOFs) assigned to a process is reported as "`Num Local DOFs`" in the terminal printout at the beginning of every execution.
+
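As a rough illustration of the 20,000 [!ac](DOFs)-per-process guideline, the following Python sketch estimates a process count from a total [!ac](DOFs) count. The helper name `suggested_processes` is hypothetical and is not part of MOOSE; the example total of 257,761 is the `Num DOFs` value reported for the demonstration problem in this step. The result is only a starting point, since the best count also depends on the hardware available.

```python
import math

# Rule of thumb quoted in this step (an approximate target, not a hard limit).
TARGET_DOFS_PER_PROCESS = 20_000


def suggested_processes(total_dofs: int, target: int = TARGET_DOFS_PER_PROCESS) -> int:
    """Smallest process count whose average load does not exceed the target.

    Hypothetical helper for illustration only; MOOSE does not provide it.
    """
    return max(1, math.ceil(total_dofs / target))


total_dofs = 257_761  # "Num DOFs" reported for the demonstration problem
n_procs = suggested_processes(total_dofs)
print(f"{n_procs} processes, about {total_dofs / n_procs:.0f} DOFs each")
```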
+*For more information about application performance, please visit the [application_development/performance_benchmarking.md] page.*
+
+## Demonstration
+
+To demonstrate the importance of parallel execution, the current Darcy pressure input file will be
+utilized, but two additional command-line options will be applied. First, performance
+information will be included using the `--timing` option and, second, the mesh will be uniformly
+refined 4 times to make the problem large enough for analysis.
+
+```bash
+cd ~/projects/babbler/problems
+../babbler-opt -i pressure_diffusion.i -r 4 --timing
+```
+
+!alert warning title=Use less refinement for older hardware
+Running this problem with 4 levels of refinement may be too much for older systems. It is still
+possible to follow along with this example using fewer levels of refinement.
+
+The `-r 4` option will split each quadrilateral element into 4 elements, 4 times. Therefore, the
+resulting mesh will contain 4^4^ = 256 times as many elements. The original input file results in 1000
+elements, thus the version executed with this command contains 256,000 elements. This change is evident
+in the mesh section of the terminal output. In addition, the number of [!ac](DOFs) is reported, which
+is the important number to consider when selecting the number of processors.
+
+```
+Nonlinear System:
+  AD size required:        4
+  Num DOFs:                257761
+  Num Local DOFs:          257761
+  Num Partitions:          1
+```
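The element and [!ac](DOFs) counts quoted here can be cross-checked with a short calculation. The Python sketch below assumes a structured 100-by-10 grid of quadrilaterals and a single first-order nodal variable, so that there is one [!ac](DOF) per node. Neither assumption is stated in this step, although they do reproduce the reported numbers. The function `refined_counts` is a hypothetical helper, not a MOOSE or libMesh call.

```python
def refined_counts(nx: int, ny: int, levels: int) -> tuple[int, int]:
    """Element and node counts of a structured quad grid after uniform refinement.

    Each refinement level splits every quadrilateral into 4, which doubles the
    number of elements along each direction.
    """
    factor = 2 ** levels
    rx, ry = nx * factor, ny * factor
    elements = rx * ry
    nodes = (rx + 1) * (ry + 1)  # one DOF per node for a single nodal variable
    return elements, nodes


# Assumed base grid: 100 x 10 = 1000 elements, refined 4 times as with `-r 4`.
elements, nodes = refined_counts(100, 10, 4)
print(elements, nodes)  # 256000 257761
```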
+
+The number to consider is the number of local [!ac](DOFs), which is the number of [!ac](DOFs) on
+the root processor and is roughly equivalent to the number on the other processors. In addition,
+performance information is printed at the end of the simulation.
+
+
+```
+Performance Graph:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------
+|                  Section                 | Calls |   Self(s)  |   Avg(s)   |    %   | Children(s) |   Avg(s)   |    %   |  Total(s)  |   Avg(s)   |    %   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------
+| BabblerTestApp (main)                    |     1 |      0.006 |      0.006 |   0.04 |      15.048 |     15.048 |  99.96 |     15.054 |     15.054 | 100.00 |
+|   FEProblem::outputStep                  |     2 |      0.001 |      0.000 |   0.00 |       0.708 |      0.354 |   4.70 |      0.708 |      0.354 |   4.71 |
+|   Steady::PicardSolve                    |     1 |      0.000 |      0.000 |   0.00 |       7.463 |      7.463 |  49.57 |      7.463 |      7.463 |  49.57 |
+|     FEProblem::solve                     |     1 |      1.111 |      1.111 |   7.38 |       6.351 |      6.351 |  42.19 |      7.462 |      7.462 |  49.57 |
+|       FEProblem::computeResidualInternal |     4 |      0.000 |      0.000 |   0.00 |       1.753 |      0.438 |  11.64 |      1.753 |      0.438 |  11.64 |
+|       FEProblem::computeJacobianInternal |     2 |      0.000 |      0.000 |   0.00 |       4.598 |      2.299 |  30.54 |      4.598 |      2.299 |  30.54 |
+|     FEProblem::outputStep                |     1 |      0.000 |      0.000 |   0.00 |       0.000 |      0.000 |   0.00 |      0.000 |      0.000 |   0.00 |
+|   Steady::final                          |     1 |      0.000 |      0.000 |   0.00 |       0.000 |      0.000 |   0.00 |      0.000 |      0.000 |   0.00 |
+|     FEProblem::outputStep                |     1 |      0.000 |      0.000 |   0.00 |       0.000 |      0.000 |   0.00 |      0.000 |      0.000 |   0.00 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------
+```
+
+The report indicates that the total duration of the execution was approximately 15 seconds (this
+will vary depending on hardware) and that the solve time was approximately 7.5 seconds.
+
+To test the parallel scaling of this [!ac](FE) model, it can be executed with an increasing number
+of processors. For example, the following executes the same problem with two processors. If the
+problem is scalable, then the +solve time+ should be expected to be roughly half as long.
+
+```bash
+cd ~/projects/babbler/problems
+mpiexec -n 2 ../babbler-opt -i pressure_diffusion.i -r 4 --timing
+```
+
+The data presented in [scale] shows decreasing solve time as the number of processors increases.
+This problem was executed on a 2019 Mac Pro with a 2.5 GHz 28-Core Intel Xeon W. For perfect
+scaling, the 8-core run should be 8 times faster than the serial execution. Of course, perfect
+scaling is not possible due to the necessity of performing parallel communication during the solve.
+
+!table id=scale caption=Problem solve time with increasing numbers of processors.
+| Num. Processors | Local [!ac](DOFs) | Solve Time (sec.) |
+| - | - | - |
+| 1 | 257,761 | 7.5 |
+| 2 | 128,968 | 4.0 |
+| 4 |  64,575 | 2.1 |
+| 8 |  32,382 | 1.2 |
+
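The solve times in [scale] can be re-expressed as speedup and parallel efficiency. These two quantities are not defined in this step; the Python sketch below assumes the common definitions, where speedup is the serial time divided by the parallel time and efficiency is the speedup divided by the number of processes. Under those definitions, the 8-process run is 6.25 times faster than the serial run, which is about 78% of perfect scaling.

```python
# Solve times (seconds) copied from the table; keys are process counts.
solve_times = {1: 7.5, 2: 4.0, 4: 2.1, 8: 1.2}


def scaling(times: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (processes, speedup, efficiency) rows relative to the serial run."""
    serial = times[1]
    rows = []
    for n in sorted(times):
        speedup = serial / times[n]          # assumed definition: T(1) / T(n)
        rows.append((n, speedup, speedup / n))  # efficiency: speedup / n
    return rows


for n, speedup, efficiency in scaling(solve_times):
    print(f"{n:>2} processes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```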
+In practice, a single process is sufficient for any MOOSE [!ac](FE) problem that has fewer than 20,000 total [!ac](DOFs).
 
 !content pagination previous=tutorial01_app_development/step06_input_params.md
+                    next=tutorial01_app_development/step08_test_harness.md
diff --git a/...tarted/examples_and_tutorials/tutorial01_app_development/step08_test_harness.md b/...tarted/examples_and_tutorials/tutorial01_app_development/step08_test_harness.md
@@ -0,0 +1,6 @@
+# Step 8: Write a Regression Test
+
+!alert construction
+The remainder of this tutorial is currently being developed. More content should be available soon. For now, refer back to the [examples_and_tutorials/index.md] page for other helpful training materials or check out the MOOSE [application_development/index.md] pages for more information.
+
+!content pagination previous=tutorial01_app_development/step07_parallel.md
diff --git a/python/MooseDocs/extensions/core.py b/python/MooseDocs/extensions/core.py
@@ -351,9 +351,9 @@ def createToken(self, parent, info, page):
 
         # Sub/super script must have word before the rest cannot
         if (tok == '^') or (tok == '@'):
-            if not parent.children or (not parent.children[-1].name == 'Word'):
+            if not parent.children or (parent.children[-1].name not in ('Word', 'Number')):
                 return None
-        elif parent.children and (parent.children[-1].name == 'Word'):
+        elif parent.children and (parent.children[-1].name in ('Word', 'Number')):
             return None
 
         if tok == '@':