Poor performance in large mesh simulations #1

Closed
sandroos opened this issue Jan 23, 2015 · 7 comments

@sandroos

Certain dccrg functions have really poor performance in simulations with a large mesh. In the following example a 33x33x33 mesh was used and partitioned across 20 cores using various configurations (pure MPI, or MPI+OpenMP).

One bottleneck that came up is the operator[] function. In the following, either version 1 or version 2 is used, not both:

phiprof::start("Copy Old Potential");
// VERSION 1
const vector<CellID>& cells = getLocalCells();
for (size_t c=0; c<cells.size(); ++c) {
   mpiGrid[cells[c]]->parameters[CellParams::PHI_TMP] = mpiGrid[cells[c]]->parameters[CellParams::PHI];
}

// VERSION 2
for (size_t c=0; c<Poisson::localCellParams.size(); ++c) {
   Poisson::localCellParams[c][CellParams::PHI_TMP] = Poisson::localCellParams[c][CellParams::PHI];
}
phiprof::stop("Copy Old Potential");

Essentially the difference is that version 1 uses operator[] to access the copied values, while in version 2 pointers to the parameters have been cached in a vector prior to entering this code snippet.
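For reference, a minimal sketch of how such a pointer cache could be built before the loop; the container name and the assumption that parameters is a plain Real array per cell are mine, the actual Vlasiator code may differ:

// Sketch only: cache a raw pointer to each local cell's parameter array once,
// so the hashtable lookup behind operator[] is paid once per cell instead of
// once per access inside the hot loop.
const vector<CellID>& cells = getLocalCells();
vector<Real*> localCellParams(cells.size());
for (size_t c=0; c<cells.size(); ++c) {
   localCellParams[c] = mpiGrid[cells[c]]->parameters;
}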

Here's the data from profiling when run using 2 MPI processes and 10 OpenMP threads per process:

Version 1 37.49 seconds
Version 2 0.21 seconds

That's a factor of 179 difference!!

After optimizing the solver, the MPI copies remain a huge bottleneck:

4                  Poisson Solver (Total)                     1          24.59     94.12     24.6      1     24.58     0     100
5                    Cache Cell Parameters                    1          0.5046    2.052     0.5075    0     0.5017    1     1
5                    Pointer Caching                          1          0.06571   0.2672    0.06599   0     0.06543   1     1
5                    MPI (RHOQ)                               1          0.685     2.786     0.6952    1     0.6748    0     100
5                    Evaluate potential                       10         0.7689    3.127     0.7761    0     0.7617    1     7040
5                    MPI (start copy)                         10         19.11     77.74     19.23     0     19        1     3520
5                    MPI (wait copy)                          10         3.308     13.45     3.409     1     3.207     0     3520
5                    Copy Old Potential                       10         0.02075   0.0844    0.02077   1     0.02074   0     176
5                    Potential Change                         1          0.08296   0.3374    0.08335   1     0.08256   0     176
5                    MPI                                      1          0.02614   0.1063    0.05087   1     0.001408  0     176
5                    Other                                    1          0.01089   0.04428   0.01094   1     0.01084   0     100

MPI takes 91.3% of the total simulation time here. Those profiler regions are:

// Exchange new potential values on process boundaries
phiprof::start("MPI (start copy)");
mpiGrid.start_remote_neighbor_copy_updates(POISSON_NEIGHBORHOOD_ID);
phiprof::stop("MPI (start copy)");

(stuff)

phiprof::start("MPI (wait copy)");
mpiGrid.wait_remote_neighbor_copy_updates(POISSON_NEIGHBORHOOD_ID);
phiprof::stop("MPI (wait copy)");

That is, they are measuring the time spent inside dccrg; the MPI library itself will not use that much time. I suspect operator[] is used in those dccrg functions and that is what is killing the performance.

@iljah
Contributor

iljah commented Jan 24, 2015

This is due to the fact that dccrg internally uses unordered sets and maps in many places. In my experience accessing a specific key takes around 1 µs, and iteration probably isn't much faster. The fix would be to switch to a vector wherever that speeds the code up enough. Note that in most or all cases where this problem exists the work is embarrassingly parallel, so a trivial fix is to use more MPI processes.
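The per-key cost is easy to check with a standalone micro-benchmark; the sketch below is not dccrg code, it only contrasts looking cells up by id in an unordered_map against iterating cached pointers in a vector:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    const std::size_t N = 1 << 20;
    std::unordered_map<std::uint64_t, double> cell_data;
    std::vector<std::uint64_t> cell_ids(N);
    for (std::size_t i = 0; i < N; ++i) {
        cell_ids[i] = i * 7 + 1;          // arbitrary non-contiguous cell ids
        cell_data[cell_ids[i]] = 0.0;
    }

    // Access every cell through the hash map, as operator[] effectively does.
    auto t0 = std::chrono::steady_clock::now();
    for (const auto id : cell_ids) cell_data[id] += 1.0;
    auto t1 = std::chrono::steady_clock::now();

    // Access the same data through pointers cached in a vector.
    std::vector<double*> cached(N);
    for (std::size_t i = 0; i < N; ++i) cached[i] = &cell_data[cell_ids[i]];
    auto t2 = std::chrono::steady_clock::now();
    for (double* p : cached) *p += 1.0;
    auto t3 = std::chrono::steady_clock::now();

    const double map_ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    const double vec_ns = std::chrono::duration<double, std::nano>(t3 - t2).count() / N;
    std::cout << "map lookup: " << map_ns << " ns/cell, cached pointer: "
              << vec_ns << " ns/cell\n";
}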

@sandroos
Author

Well, using more MPI processes doesn't reduce the ratio of MPI overhead to useful computation that much. What I haven't actually profiled yet is how much time is spent in the SpatialCell class function where the pointers and block sizes (for the MPI datatype) are created.

@iljah
Contributor

iljah commented Aug 27, 2015

I started solving this in 0a857c3, which adds a cache of pointers to cells' and their neighbors' data. Serial performance of example/game_of_life_optimized.cpp increased by almost a factor of 4 compared to example/game_of_life.cpp, and parallel performance using 2 processes also increased by a factor of about 3. I'll add optimized test programs at some point.
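A rough sketch of what such a cache can look like; the type and member names below are illustrative, not the actual structures added in 0a857c3:

#include <vector>

// Cell_Data stands in for whatever the user stores in each cell.
struct Cell_Data { double variables[8]; };

// One entry per local cell, rebuilt whenever the mesh or partition changes,
// so the solver loop never has to look cells up by id in a hashtable.
struct Cell_Cache_Entry {
    Cell_Data* data;                         // the cell's own data
    std::vector<Cell_Data*> neighbor_data;   // data of the cell's stencil neighbors
};

std::vector<Cell_Cache_Entry> cell_cache;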

@iljah iljah self-assigned this Aug 27, 2015
@galfthan
Member

Iljah, did this fix affect MPI performance in any way? I suspect (i.e., have not profiled in enough detail) that to get good MPI performance one should cache the MPI datatypes. Right now MPI performance is quite poor when sending small messages. Since there are many stencils, and the cells can change the MPI datatype they return dynamically, this will require additions to the API. In principle the cached datatypes would only be invalidated when cell data changes size, at load balance, and when refining/unrefining. In Vlasiator's case this means we could reuse the datatypes for fields ~2500 times.

One should first profile to see if this is actually a bottleneck.
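For what it's worth, a sketch of what per-cell datatype caching could look like; the function and struct names are hypothetical and not part of dccrg's API:

#include <mpi.h>
#include <vector>

// Build and commit the derived datatype once, reuse it for every transfer, and
// rebuild it only when the cell's data layout changes (resize, load balance,
// refining/unrefining).
struct Cached_Datatype {
    MPI_Datatype type = MPI_DATATYPE_NULL;
    bool valid = false;
};

void refresh_datatype(Cached_Datatype& cache,
                      const std::vector<int>& block_lengths,
                      const std::vector<MPI_Aint>& displacements) {
    if (cache.valid) {
        MPI_Type_free(&cache.type);
    }
    // One hindexed type describing all blocks of this cell's data in bytes.
    MPI_Type_create_hindexed(
        static_cast<int>(block_lengths.size()),
        block_lengths.data(),
        displacements.data(),
        MPI_BYTE,
        &cache.type);
    MPI_Type_commit(&cache.type);
    cache.valid = true;
}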

@iljah
Contributor

iljah commented Aug 29, 2015

No, my changes shouldn't affect MPI things. I suspect cells would be better off caching their datatypes, especially if they use several, as handling that in dccrg might get too tricky. Currently the bottleneck is probably elsewhere anyway, in the code around get_mpi_datatype()s to be exact. A quick glance shows the sending function referring to at least 3 different hashtables (https://github.com/fmihpc/dccrg/blob/master/dccrg.hpp#L9131), of which apparently two are accessed separately for every cell, which adds ~1e-6 s per cell, similarly to operator[]. I think it would take a couple of days to optimize that, but no idea when I'll have the time.
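To illustrate the kind of change meant here (purely illustrative, not the actual dccrg code): the per-cell hashtable lookups can be hoisted out of the send loop into a flat list that is rebuilt only at load balance:

#include <cstdint>
#include <vector>

// Resolve the per-cell hashtable lookups once per load balance into a flat
// list; the send loop then iterates this list without touching any hashtable.
struct Send_Item {
    std::uint64_t cell_id;
    void* cell_data;        // result of the first per-cell lookup
    int destination_rank;   // result of the second per-cell lookup
};

std::vector<Send_Item> send_list;   // rebuilt at load balance, reused every transfer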

@iljah
Contributor

iljah commented Aug 29, 2015

Closing this one to separate technically unrelated issues: #8 #9

@iljah iljah closed this as completed Aug 29, 2015
@iljah
Contributor

iljah commented Aug 29, 2015

These issues were also mentioned: #6 #10
