Poor performance in large mesh simulations #1

Closed
sandroos opened this issue Jan 23, 2015 · 7 comments

@sandroos

Certain dccrg functions have really poor performance in simulations with a large mesh. In the following example a 33x33x33 mesh was used and partitioned across 20 cores using various configurations (pure MPI, or MPI+OpenMP).

One bottleneck that came up is the operator[] function. In the following, either version 1 or version 2 is used, not both:

phiprof::start("Copy Old Potential");
// VERSION 1
const vector<CellID>& cells = getLocalCells();
for (size_t c=0; c<cells.size(); ++c) {
   mpiGrid[cells[c]]->parameters[CellParams::PHI_TMP] = mpiGrid[cells[c]]->parameters[CellParams::PHI];
}

// VERSION 2
for (size_t c=0; c<Poisson::localCellParams.size(); ++c) {
   Poisson::localCellParams[c][CellParams::PHI_TMP] = Poisson::localCellParams[c][CellParams::PHI];
}
phiprof::stop("Copy Old Potential");

Essentially the difference is that version 1 uses operator[] to access the copied values, while in version 2 pointers to the parameters have been cached in a vector prior to entering this code snippet.
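For reference, a minimal sketch of how such a pointer cache could be built before the loop; the container name and the assumption that parameters is a plain Real array per cell are mine, the actual Vlasiator code may differ:

// Sketch only: cache a raw pointer to each local cell's parameter array once,
// so the hashtable lookup behind operator[] is paid once per cell instead of
// once per access inside the hot loop.
const vector<CellID>& cells = getLocalCells();
vector<Real*> localCellParams(cells.size());
for (size_t c=0; c<cells.size(); ++c) {
   localCellParams[c] = mpiGrid[cells[c]]->parameters;
}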

Here's the data from profiling when run using 2 MPI processes and 10 OpenMP threads per process:

Version 1 37.49 seconds
Version 2 0.21 seconds

That's a factor of 179 difference!!

After optimizing the solver, the MPI copies remain a huge bottleneck:

4                  Poisson Solver (Total)                     1          24.59     94.12     24.6      1     24.58     0     100
5                    Cache Cell Parameters                    1          0.5046    2.052     0.5075    0     0.5017    1     1
5                    Pointer Caching                          1          0.06571   0.2672    0.06599   0     0.06543   1     1
5                    MPI (RHOQ)                               1          0.685     2.786     0.6952    1     0.6748    0     100
5                    Evaluate potential                       10         0.7689    3.127     0.7761    0     0.7617    1     7040
5                    MPI (start copy)                         10         19.11     77.74     19.23     0     19        1     3520
5                    MPI (wait copy)                          10         3.308     13.45     3.409     1     3.207     0     3520
5                    Copy Old Potential                       10         0.02075   0.0844    0.02077   1     0.02074   0     176
5                    Potential Change                         1          0.08296   0.3374    0.08335   1     0.08256   0     176
5                    MPI                                      1          0.02614   0.1063    0.05087   1     0.001408  0     176
5                    Other                                    1          0.01089   0.04428   0.01094   1     0.01084   0     100

MPI takes 91.3% of the total simulation time here. Those profiler regions are:

// Exchange new potential values on process boundaries
phiprof::start("MPI (start copy)");
mpiGrid.start_remote_neighbor_copy_updates(POISSON_NEIGHBORHOOD_ID);
phiprof::stop("MPI (start copy)");

(stuff)

phiprof::start("MPI (wait copy)");
mpiGrid.wait_remote_neighbor_copy_updates(POISSON_NEIGHBORHOOD_ID);
phiprof::stop("MPI (wait copy)");

That is, they are measuring the time spent inside dccrg; the MPI library itself will not use that much time. I suspect operator[] is used in those dccrg functions and that is what is killing the performance.

@iljah
Contributor

iljah commented Jan 24, 2015

This is due to the fact that dccrg internally uses unordered sets and maps in many places. In my experience accessing a specific key takes around 1 µs, and iteration probably isn't much faster. The fix would be to switch to a vector wherever that speeds the code up enough. Note that in most or all cases where this problem exists the work is embarrassingly parallel, so a trivial fix is to use more MPI processes.
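The per-key cost is easy to check with a standalone micro-benchmark; the sketch below is not dccrg code, it only contrasts looking cells up by id in an unordered_map against iterating cached pointers in a vector:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    const std::size_t N = 1 << 20;
    std::unordered_map<std::uint64_t, double> cell_data;
    std::vector<std::uint64_t> cell_ids(N);
    for (std::size_t i = 0; i < N; ++i) {
        cell_ids[i] = i * 7 + 1;          // arbitrary non-contiguous cell ids
        cell_data[cell_ids[i]] = 0.0;
    }

    // Access every cell through the hash map, as operator[] effectively does.
    auto t0 = std::chrono::steady_clock::now();
    for (const auto id : cell_ids) cell_data[id] += 1.0;
    auto t1 = std::chrono::steady_clock::now();

    // Access the same data through pointers cached in a vector.
    std::vector<double*> cached(N);
    for (std::size_t i = 0; i < N; ++i) cached[i] = &cell_data[cell_ids[i]];
    auto t2 = std::chrono::steady_clock::now();
    for (double* p : cached) *p += 1.0;
    auto t3 = std::chrono::steady_clock::now();

    const double map_ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    const double vec_ns = std::chrono::duration<double, std::nano>(t3 - t2).count() / N;
    std::cout << "map lookup: " << map_ns << " ns/cell, cached pointer: "
              << vec_ns << " ns/cell\n";
}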

@sandroos
Author

Well, using more MPI processes doesn't reduce the ratio of MPI overhead to useful computation that much. What I haven't actually profiled yet is how much time is spent in the SpatialCell class function where the pointers and block sizes (for the MPI datatype) are created.

@iljah
Contributor

iljah commented Aug 27, 2015

I started solving this in 0a857c3, which adds a cache of pointers to cells' and their neighbors' data. Serial performance of example/game_of_life_optimized.cpp increased by almost a factor of 4 compared to example/game_of_life.cpp, and parallel performance using 2 processes also increased by a factor of about 3. I'll add optimized test programs at some point.
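A rough sketch of what such a cache can look like; the type and member names below are illustrative, not the actual structures added in 0a857c3:

#include <vector>

// Cell_Data stands in for whatever the user stores in each cell.
struct Cell_Data { double variables[8]; };

// One entry per local cell, rebuilt whenever the mesh or partition changes,
// so the solver loop never has to look cells up by id in a hashtable.
struct Cell_Cache_Entry {
    Cell_Data* data;                         // the cell's own data
    std::vector<Cell_Data*> neighbor_data;   // data of the cell's stencil neighbors
};

std::vector<Cell_Cache_Entry> cell_cache;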

@iljah iljah self-assigned this Aug 27, 2015
@galfthan
Member

Iljah, did this fix affect MPI performance in any way? I suspect (i.e., have not profiled in enough detail) that to get good MPI performance one should cache the MPI datatypes. Right now MPI performance is quite poor when sending small messages. Since there are many stencils, and the cells can change the MPI datatype they return dynamically, this will require additions to the API. In principle the cached datatypes would only be invalidated when cell data changes size, at load balance, and when refining/unrefining. In Vlasiator's case this means we could reuse the datatypes for fields ~2500 times.

One should first profile to see if this is actually a bottleneck.
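For what it's worth, a sketch of what per-cell datatype caching could look like; the function and struct names are hypothetical and not part of dccrg's API:

#include <mpi.h>
#include <vector>

// Build and commit the derived datatype once, reuse it for every transfer, and
// rebuild it only when the cell's data layout changes (resize, load balance,
// refining/unrefining).
struct Cached_Datatype {
    MPI_Datatype type = MPI_DATATYPE_NULL;
    bool valid = false;
};

void refresh_datatype(Cached_Datatype& cache,
                      const std::vector<int>& block_lengths,
                      const std::vector<MPI_Aint>& displacements) {
    if (cache.valid) {
        MPI_Type_free(&cache.type);
    }
    // One hindexed type describing all blocks of this cell's data in bytes.
    MPI_Type_create_hindexed(
        static_cast<int>(block_lengths.size()),
        block_lengths.data(),
        displacements.data(),
        MPI_BYTE,
        &cache.type);
    MPI_Type_commit(&cache.type);
    cache.valid = true;
}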

@iljah
Contributor

iljah commented Aug 29, 2015

No, my changes shouldn't affect MPI things. I suspect cells would be better off caching their datatypes, especially if they use several, as handling that in dccrg might get too tricky. Currently the bottleneck is probably elsewhere anyway, in the code around get_mpi_datatype()s to be exact. A quick glance shows the sending function referring to at least 3 different hashtables (https://github.com/fmihpc/dccrg/blob/master/dccrg.hpp#L9131), of which apparently two are accessed separately for every cell, which adds ~1e-6 s per cell, similarly to operator[]. I think it would take a couple of days to optimize that, but no idea when I'll have the time.
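To illustrate the kind of change meant here (purely illustrative, not the actual dccrg code): the per-cell hashtable lookups can be hoisted out of the send loop into a flat list that is rebuilt only at load balance:

#include <cstdint>
#include <vector>

// Resolve the per-cell hashtable lookups once per load balance into a flat
// list; the send loop then iterates this list without touching any hashtable.
struct Send_Item {
    std::uint64_t cell_id;
    void* cell_data;        // result of the first per-cell lookup
    int destination_rank;   // result of the second per-cell lookup
};

std::vector<Send_Item> send_list;   // rebuilt at load balance, reused every transfer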

@iljah
Contributor

iljah commented Aug 29, 2015

Closing this one to separate technically unrelated issues: #8 #9

@iljah iljah closed this as completed Aug 29, 2015
@iljah
Contributor

iljah commented Aug 29, 2015

These issues were also mentioned: #6 #10
