Description
Hi,
I am working on a distributed 3D agent-based simulation tool and noticed that sometimes non-blocking point-to-point operations take a very long time to complete.
I created a simplified reproducer that demonstrates the behavior (reproducer.cc
). Each rank exchanges messages with its neighbors using MPI_Isend
and MPI_Irecv
calls. In the reproducer, the number of neighbors and the message size is the same for each rank. I measure the maximum time needed to send/receive all messages across all ranks. I performed the benchmarks on 2-node systems with Ethernet and Infiniband links. Half the neighbors (13) are on the same node and half on the opposite (13).
Benchmark systems
System A:
2-nodes each with 4 Intel E7-8890 v3 processors (72 physical cores per node), 1 Gbps Ethernet link, Alma Linux 9, OpenMPI 5.0.2 installed with (spack install [email protected] schedulers=slurm +legacylaunchers
)
System B (Dutch National Supercomputer Snellius):
2-nodes each with 2 AMD Genoa 9654 processors (192 physical cores per node), Infiniband ConnectX-7 (200Gbps within a rack, 100Gbps outside the rack), RHEL 8.6, OpenMPI 4.1.5 (provided by the HPC center)
https://servicedesk.surf.nl/wiki/display/WIKI/Snellius+hardware
Nodes were used exclusively.
Compilation
mpicxx -Wall -std=c++17 -O3 -g reproducer.cc -o reproducer
Results
Sometimes, I observe very different results for the exact same message size.
The total amount of data transferred between the two nodes can be calculated with num_ranks * num_neighbors / 2 * message_size
.
e.g. system A with 711 byte messages: 144 * 13 * 711 bytes = 1330992 bytes = 1.27 MiB.
Given the small amount of data, I did not expect to see transfer times of 216ms.
see systemA_from-512_to-1024_step-+1.log
max runtime 215 ms 711 byte messages, it 0
max runtime 216 ms 711 byte messages, it 1
max runtime 216 ms 711 byte messages, it 2
max runtime 216 ms 711 byte messages, it 3
max runtime 216 ms 711 byte messages, it 4
max runtime 216 ms 711 byte messages, it 5
max runtime 7 ms 711 byte messages, it 6
max runtime 7 ms 711 byte messages, it 7
max runtime 7 ms 711 byte messages, it 8
max runtime 7 ms 711 byte messages, it 9
systemB_from-1_to-2097152_step-x2.log
Total amount of data transferred: 384 * 13 * 65536 = 327155712 bytes = 312 MiB
I did not expect such a high variance nor transfer times >50ms given a line rate of 200 Gbps.
max runtime 1424 ms 65536 byte messages, it 0
max runtime 214 ms 65536 byte messages, it 1
max runtime 84 ms 65536 byte messages, it 2
max runtime 149 ms 65536 byte messages, it 3
max runtime 132 ms 65536 byte messages, it 4
max runtime 108 ms 65536 byte messages, it 5
max runtime 91 ms 65536 byte messages, it 6
max runtime 163 ms 65536 byte messages, it 7
max runtime 167 ms 65536 byte messages, it 8
max runtime 51 ms 65536 byte messages, it 9
Sometimes, a specific message size is fast 10x in a row.
see systemA_from-512_to-1024_step-+1.log
max runtime 215 ms 775 byte messages, it 0
max runtime 219 ms 775 byte messages, it 1
max runtime 213 ms 775 byte messages, it 2
max runtime 216 ms 775 byte messages, it 3
max runtime 216 ms 775 byte messages, it 4
max runtime 216 ms 775 byte messages, it 5
max runtime 216 ms 775 byte messages, it 6
max runtime 219 ms 775 byte messages, it 7
max runtime 213 ms 775 byte messages, it 8
max runtime 216 ms 775 byte messages, it 9
max runtime 8 ms 776 byte messages, it 0
max runtime 7 ms 776 byte messages, it 1
max runtime 8 ms 776 byte messages, it 2
max runtime 7 ms 776 byte messages, it 3
max runtime 8 ms 776 byte messages, it 4
max runtime 7 ms 776 byte messages, it 5
max runtime 8 ms 776 byte messages, it 6
max runtime 7 ms 776 byte messages, it 7
max runtime 8 ms 776 byte messages, it 8
max runtime 7 ms 776 byte messages, it 9
max runtime 211 ms 777 byte messages, it 0
max runtime 216 ms 777 byte messages, it 1
max runtime 216 ms 777 byte messages, it 2
max runtime 216 ms 777 byte messages, it 3
max runtime 216 ms 777 byte messages, it 4
max runtime 216 ms 777 byte messages, it 5
max runtime 216 ms 777 byte messages, it 6
max runtime 216 ms 777 byte messages, it 7
max runtime 219 ms 777 byte messages, it 8
max runtime 216 ms 777 byte messages, it 9
I used Intel traceanalyzer to get more insights. Below the histogram and CDF of the message arrival times on System A.
Do you have an idea of what could be causing this behavior and subsequently how to improve the performance?
I tried to keep the description short, but I am happy to provide any further information that you may need.
Many thanks in advance for your help!
Best,
Lukas
reproducer.log (GH disallows attachments with cc ending -> rename reproducer.log
to reproducer.cc
)
systemB_from-1_to-2097152_step-x2.log
systemA_from-512_to-1024_step-+1.log
systemA_from-1_to-8192_step-x2.log