Poor performance in large mesh simulations #1
This is due to the fact that internally dccrg uses unordered sets and …
Well, using more MPI processes doesn't reduce the ratio of MPI overhead to useful computation that much. What I haven't profiled yet is how much time is spent in the SpatialCell class function where the pointers and block sizes (for the MPI datatype) are created.
I started solving this in 0a857c3, which adds a cache of pointers to cells' and their neighbors' data. Serial performance of example/game_of_life_optimized.cpp increased by almost a factor of 4 relative to example/game_of_life.cpp, and parallel performance using 2 processes also increased by a factor of about 3. I'll add optimized test programs at some point.
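For illustration, here is a minimal, self-contained sketch of that pointer-cache idea. This is not the actual 0a857c3 code: Cell_Data, update_cell and the hash-table grid below are hypothetical stand-ins for dccrg's internals.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Cell_Data { double value = 0; };  // hypothetical cell payload

struct Cell_And_Neighbors {
    Cell_Data* cell;
    std::vector<Cell_Data*> neighbors;
};

// Stand-in for dccrg's internal cell storage: a hash table keyed by cell id.
std::unordered_map<std::uint64_t, Cell_Data> grid;

void update_cell(Cell_Data& cell, const std::vector<Cell_Data*>& neighbors) {
    for (const Cell_Data* n : neighbors) {
        cell.value += n->value;  // placeholder update rule
    }
}

void solve(
    const std::vector<std::uint64_t>& cell_ids,
    const std::unordered_map<std::uint64_t, std::vector<std::uint64_t>>& neighbor_ids
) {
    // Build the cache once: hash lookups are paid here, not in the hot loop.
    std::vector<Cell_And_Neighbors> cache;
    for (const std::uint64_t id : cell_ids) {
        Cell_And_Neighbors entry{&grid.at(id), {}};
        for (const std::uint64_t n : neighbor_ids.at(id)) {
            entry.neighbors.push_back(&grid.at(n));
        }
        cache.push_back(entry);
    }
    // Hot loop: pure pointer access, no hash tables.
    for (auto& entry : cache) {
        update_cell(*entry.cell, entry.neighbors);
    }
}
```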
Iljah, did this fix affect MPI performance in any way? I suspect (= have not profiled in enough detail) that to get good MPI performance one should cache the MPI datatypes; right now MPI performance is quite poor when sending small messages. Since there are many stencils, and cells can change the MPI datatype they return dynamically, this will require additions to the API. In principle the cached datatypes would only be invalidated when cell data changes size, at load balance, and when refining/unrefining. In Vlasiator's case this means we could reuse the datatypes for fields ~2500 times. One should first profile to see whether this is actually a bottleneck.
No, my changes shouldn't affect MPI things. I suspect cells would be better off caching their datatypes, especially if they use several, as handling that in dccrg might get too tricky. Currently the bottleneck is probably elsewhere anyway, in the code around get_mpi_datatype()s to be exact. A quick glance shows the sending function referring to at least 3 different hash tables: https://github.com/fmihpc/dccrg/blob/master/dccrg.hpp#L9131 of which apparently two are accessed separately for every cell, which adds ~1e-6 s/cell, similarly to operator[]. I think it would take a couple of days to optimize that, but no idea when I'll have the time.
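A minimal sketch of the datatype caching discussed above, assuming a cell that owns a contiguous vector of doubles. Cached_Cell and its members are hypothetical, not dccrg or Vlasiator API, and the caller would still need to ensure the rebuild happens at load balance and refinement:

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Hypothetical cell that memoizes its MPI datatype; the datatype is rebuilt
// only when the underlying data changes size.
struct Cached_Cell {
    std::vector<double> data;
    MPI_Datatype cached_type = MPI_DATATYPE_NULL;
    std::size_t cached_count = 0;

    // Plays the role of get_mpi_datatype(), but reuses the committed
    // datatype across transfers instead of rebuilding it per message.
    MPI_Datatype mpi_datatype() {
        if (cached_type == MPI_DATATYPE_NULL || cached_count != data.size()) {
            if (cached_type != MPI_DATATYPE_NULL) {
                MPI_Type_free(&cached_type);
            }
            MPI_Type_contiguous(
                static_cast<int>(data.size()), MPI_DOUBLE, &cached_type);
            MPI_Type_commit(&cached_type);
            cached_count = data.size();
        }
        return cached_type;
    }

    ~Cached_Cell() {
        if (cached_type != MPI_DATATYPE_NULL) {
            MPI_Type_free(&cached_type);
        }
    }
};
```

With the ~2500 reuses per invalidation mentioned above, per-message datatype construction would be amortized almost entirely.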
Certain dccrg functions have really poor performance in simulations with a large mesh. In the following example a 33x33x33 mesh was used and partitioned across 20 cores using various methods (pure MPI, or MPI+OpenMP).
One bottleneck that came up is the operator[] function. In the following, either version 1 or version 2 is used, not both. Essentially the difference is that in version 1 operator[] is used to access the copied values, while in version 2 pointers to the parameters have been cached in a vector prior to entering the code snippet. Here's the data from profiling when run using 2 MPI processes and 10 OpenMP threads per process:
That's a factor of 179 difference!!
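For concreteness, here is a hedged sketch of the two versions as described; the actual profiled snippet is not reproduced above, and Cell_Data and grid below are hypothetical stand-ins, with the hash-based .at() lookup playing the role of dccrg's operator[]:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Cell_Data { double parameters[2] = {0.0, 0.0}; };  // hypothetical

// Hash-based storage standing in for dccrg's internal cell container.
std::unordered_map<std::uint64_t, Cell_Data> grid;

// Version 1: operator[]-style hash lookup inside the hot loop.
void version_1(const std::vector<std::uint64_t>& cells, double dt) {
    for (const std::uint64_t cell : cells) {
        Cell_Data& data = grid.at(cell);  // one hash lookup per access
        data.parameters[0] += dt * data.parameters[1];
    }
}

// Version 2: pointers cached to a vector before entering the hot loop.
std::vector<Cell_Data*> build_cache(const std::vector<std::uint64_t>& cells) {
    std::vector<Cell_Data*> cached;
    cached.reserve(cells.size());
    for (const std::uint64_t cell : cells) {
        cached.push_back(&grid.at(cell));  // lookups paid once, up front
    }
    return cached;
}

void version_2(const std::vector<Cell_Data*>& cached, double dt) {
    for (Cell_Data* data : cached) {  // no hashing in the hot loop
        data->parameters[0] += dt * data->parameters[1];
    }
}
```

Version 2 pays the lookups once when the cache is built, so every subsequent pass over the cells touches only the cached pointers.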
After optimizing the solver, the MPI copies remain a huge bottleneck:
MPI takes 91.3% of total simulation time here. Those profiler calls are wrapped around the dccrg calls, that is, they are measuring the time spent in dccrg; the MPI library itself will not use that much time. I suspect operator[] is used in those dccrg functions and that is killing the performance.