Skip to content

Non-blocking point-to-point performance issue (MPI_Isend, MPI_Irecv) #12414

Open
@LukasBreitwieser

Description

@LukasBreitwieser

Hi,

I am working on a distributed 3D agent-based simulation tool and noticed that sometimes non-blocking point-to-point operations take a very long time to complete.

I created a simplified reproducer that demonstrates the behavior (reproducer.cc). Each rank exchanges messages with its neighbors using MPI_Isend and MPI_Irecv calls. In the reproducer, the number of neighbors and the message size is the same for each rank. I measure the maximum time needed to send/receive all messages across all ranks. I performed the benchmarks on 2-node systems with Ethernet and Infiniband links. Half the neighbors (13) are on the same node and half on the opposite (13).

Benchmark systems

System A:
2-nodes each with 4 Intel E7-8890 v3 processors (72 physical cores per node), 1 Gbps Ethernet link, Alma Linux 9, OpenMPI 5.0.2 installed with (spack install [email protected] schedulers=slurm +legacylaunchers)

System B (Dutch National Supercomputer Snellius):
2-nodes each with 2 AMD Genoa 9654 processors (192 physical cores per node), Infiniband ConnectX-7 (200Gbps within a rack, 100Gbps outside the rack), RHEL 8.6, OpenMPI 4.1.5 (provided by the HPC center)
https://servicedesk.surf.nl/wiki/display/WIKI/Snellius+hardware

Nodes were used exclusively.

Compilation

mpicxx -Wall -std=c++17 -O3 -g reproducer.cc -o reproducer

Results

Sometimes, I observe very different results for the exact same message size.
The total amount of data transferred between the two nodes can be calculated with num_ranks * num_neighbors / 2 * message_size.
e.g. system A with 711 byte messages: 144 * 13 * 711 bytes = 1330992 bytes = 1.27 MiB.
Given the small amount of data, I did not expect to see transfer times of 216ms.

see systemA_from-512_to-1024_step-+1.log

max runtime 215 ms       711 byte messages, it 0
max runtime 216 ms       711 byte messages, it 1
max runtime 216 ms       711 byte messages, it 2
max runtime 216 ms       711 byte messages, it 3
max runtime 216 ms       711 byte messages, it 4
max runtime 216 ms       711 byte messages, it 5
max runtime 7 ms         711 byte messages, it 6
max runtime 7 ms         711 byte messages, it 7
max runtime 7 ms         711 byte messages, it 8
max runtime 7 ms         711 byte messages, it 9

systemB_from-1_to-2097152_step-x2.log
Total amount of data transferred: 384 * 13 * 65536 = 327155712 bytes = 312 MiB
I did not expect such a high variance nor transfer times >50ms given a line rate of 200 Gbps.

max runtime 1424 ms      65536 byte messages, it 0
max runtime 214 ms       65536 byte messages, it 1
max runtime 84 ms        65536 byte messages, it 2
max runtime 149 ms       65536 byte messages, it 3
max runtime 132 ms       65536 byte messages, it 4
max runtime 108 ms       65536 byte messages, it 5
max runtime 91 ms        65536 byte messages, it 6
max runtime 163 ms       65536 byte messages, it 7
max runtime 167 ms       65536 byte messages, it 8
max runtime 51 ms        65536 byte messages, it 9

Sometimes, a specific message size is fast 10x in a row.
see systemA_from-512_to-1024_step-+1.log

max runtime 215 ms       775 byte messages, it 0
max runtime 219 ms       775 byte messages, it 1
max runtime 213 ms       775 byte messages, it 2
max runtime 216 ms       775 byte messages, it 3
max runtime 216 ms       775 byte messages, it 4
max runtime 216 ms       775 byte messages, it 5
max runtime 216 ms       775 byte messages, it 6
max runtime 219 ms       775 byte messages, it 7
max runtime 213 ms       775 byte messages, it 8
max runtime 216 ms       775 byte messages, it 9

max runtime 8 ms         776 byte messages, it 0
max runtime 7 ms         776 byte messages, it 1
max runtime 8 ms         776 byte messages, it 2
max runtime 7 ms         776 byte messages, it 3
max runtime 8 ms         776 byte messages, it 4
max runtime 7 ms         776 byte messages, it 5
max runtime 8 ms         776 byte messages, it 6
max runtime 7 ms         776 byte messages, it 7
max runtime 8 ms         776 byte messages, it 8
max runtime 7 ms         776 byte messages, it 9

max runtime 211 ms       777 byte messages, it 0
max runtime 216 ms       777 byte messages, it 1
max runtime 216 ms       777 byte messages, it 2
max runtime 216 ms       777 byte messages, it 3
max runtime 216 ms       777 byte messages, it 4
max runtime 216 ms       777 byte messages, it 5
max runtime 216 ms       777 byte messages, it 6
max runtime 216 ms       777 byte messages, it 7
max runtime 219 ms       777 byte messages, it 8
max runtime 216 ms       777 byte messages, it 9

I used Intel traceanalyzer to get more insights. Below the histogram and CDF of the message arrival times on System A.

histogram
cdf

Do you have an idea of what could be causing this behavior and subsequently how to improve the performance?
I tried to keep the description short, but I am happy to provide any further information that you may need.

Many thanks in advance for your help!

Best,
Lukas

reproducer.log (GH disallows attachments with cc ending -> rename reproducer.log to reproducer.cc)
systemB_from-1_to-2097152_step-x2.log
systemA_from-512_to-1024_step-+1.log
systemA_from-1_to-8192_step-x2.log

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions