
Benchmark and analysis for possible reasons that hybrid is slower than MPI

ZHG2017 edited this page Jul 30, 2019 · 1 revision

Benchmark for MPI+OpenMP with different numbers of threads on one node of Dahu

# CORES   N     B     # THREADS   TOTAL
32        500   100   4           19.9467
32        500   100   8           17.3162
32        500   100   16          18.9274
32        500   100   32          19.4621
32        500   100   64          19.4981
32        500   100   128         24.2367

Benchmark for MPI+OpenMP with different numbers of threads on a single desktop with 4 cores

# CORES   N     B     # THREADS   TOTAL
32        500   100   2           41.8546
32        500   100   4           37.4821
32        500   100   8           35.8083
32        500   100   16          35.8087
32        500   100   32          40.0238
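
To make the two tables above easier to compare, the following is a minimal sketch that finds the thread count with the lowest TOTAL in each table and expresses every other row as a multiple of it. The TOTAL values are copied from the tables; the dictionary names and the helper `relative_to_best` are illustrative and not part of the benchmark code.

```python
# TOTAL values copied from the two tables above, keyed by thread count.
dahu = {4: 19.9467, 8: 17.3162, 16: 18.9274, 32: 19.4621, 64: 19.4981, 128: 24.2367}
desktop = {2: 41.8546, 4: 37.4821, 8: 35.8083, 16: 35.8087, 32: 40.0238}


def relative_to_best(totals):
    """Return (best_threads, {threads: total / best_total})."""
    best_threads = min(totals, key=totals.get)
    best = totals[best_threads]
    return best_threads, {t: v / best for t, v in totals.items()}


for name, totals in (("Dahu node", dahu), ("desktop", desktop)):
    best_threads, ratios = relative_to_best(totals)
    print(f"{name}: lowest TOTAL at {best_threads} threads")
    for threads, ratio in sorted(ratios.items()):
        print(f"  {threads:>3} threads: {ratio:.3f}x the lowest TOTAL")
```

Each table holds a single measurement per thread count, so small differences between neighbouring rows may be within run-to-run noise.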

Event counting for the MPI-only implementation on the single desktop with 4 cores

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Distributed -M CRA -s 1564490910':

     1 431 603 007      cache-misses                                                
        36 953 889      dTLB-load-misses                                            
         8 702 084      iTLB-load-misses                                            

     405,456098902 seconds time elapsed

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Distributed -M CRA -s 1564490910':

     1 450 247 295      cache-misses                                                
        36 825 319      dTLB-load-misses                                            
         9 923 298      iTLB-load-misses                                            

     406,176510782 seconds time elapsed

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Distributed -M CRA -s 1564490910':

     1 443 507 799      cache-misses                                                
        35 004 060      dTLB-load-misses                                            
       177 700 722      iTLB-load-misses                                            

     406,753558858 seconds time elapsed

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Distributed -M CRA -s 1564490910':

     1 446 514 817      cache-misses                                                
        37 051 342      dTLB-load-misses                                            
        38 164 972      iTLB-load-misses                                            

     406,876224780 seconds time elapsed
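
The `perf stat` blocks on this page use a locale that groups thousands with spaces and uses a comma as the decimal separator, which makes the numbers awkward to load directly. The sketch below shows one way to turn such a block into plain numbers. The function name `parse_perf_stat` and the regular expressions are assumptions made for illustration, and they only handle the line shapes that appear on this page.

```python
import re

# A counter line looks like "  1 431 603 007      cache-misses".
COUNTER_RE = re.compile(r"^\s*(\d[\d\s]*?)\s+([A-Za-z][\w-]*)\s*$")
# The elapsed line looks like "  405,456098902 seconds time elapsed".
ELAPSED_RE = re.compile(r"^\s*(\d+)[,.](\d+)\s+seconds time elapsed\s*$")


def parse_perf_stat(text):
    """Return ({event_name: count}, elapsed_seconds) for one perf stat block."""
    counters = {}
    elapsed = None
    for line in text.splitlines():
        m = ELAPSED_RE.match(line)
        if m:
            # Rebuild the number with a dot so float() accepts it.
            elapsed = float(f"{m.group(1)}.{m.group(2)}")
            continue
        m = COUNTER_RE.match(line)
        if m:
            # Drop the whitespace used as thousands separator.
            counters[m.group(2)] = int("".join(m.group(1).split()))
    return counters, elapsed


# First MPI-only run, copied from the output above.
sample = """
 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Distributed -M CRA -s 1564490910':

     1 431 603 007      cache-misses
        36 953 889      dTLB-load-misses
         8 702 084      iTLB-load-misses

     405,456098902 seconds time elapsed
"""

counters, elapsed = parse_perf_stat(sample)
print(counters)
print(elapsed)
```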

Event counting for the MPI+OMP implementation on the single desktop with 4 cores

Using 2 threads per worker process


 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 2':

     1 181 055 424      cache-misses                                                
        29 891 598      dTLB-load-misses                                            
         5 144 597      iTLB-load-misses                                            

     250,105811705 seconds time elapsed

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 2':

     1 188 476 630      cache-misses                                                
        30 662 497      dTLB-load-misses                                            
         9 799 303      iTLB-load-misses                                            

     252,132564516 seconds time elapsed

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 2':

     1 203 900 822      cache-misses                                                
        31 508 846      dTLB-load-misses                                            
         5 358 814      iTLB-load-misses                                            

     252,838338050 seconds time elapsed

Using 4 threads per worker process


 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 4':

     1 097 648 012      cache-misses                                                
        28 477 946      dTLB-load-misses                                            
         3 887 310      iTLB-load-misses                                            

     248,689815914 seconds time elapsed


 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 4':

     1 104 745 687      cache-misses                                                
        30 894 988      dTLB-load-misses                                            
        13 148 579      iTLB-load-misses                                            

     249,487985907 seconds time elapsed


 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 4':

     1 094 830 287      cache-misses                                                
        27 483 445      dTLB-load-misses                                            
        16 707 693      iTLB-load-misses                                            

     250,090669714 seconds time elapsed


Using 8 threads per worker process

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 8':

     1 183 348 122      cache-misses                                                
        28 191 602      dTLB-load-misses                                            
         4 249 035      iTLB-load-misses                                            

     249,730256828 seconds time elapsed

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 8':

     1 187 749 563      cache-misses                                                
        26 973 310      dTLB-load-misses                                            
         4 091 277      iTLB-load-misses                                            

     250,610157321 seconds time elapsed

 Performance counter stats for './benchmark-dense-solve -n 500 -b 100 -d Combined -M CRA -s 1564490910 -t 8':

     1 170 709 431      cache-misses                                                
        26 327 385      dTLB-load-misses                                            
         4 242 060      iTLB-load-misses                                            

     251,144711325 seconds time elapsed
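
To compare the configurations measured above, the following sketch averages the elapsed time and the cache-miss count over the repeated runs of each configuration and prints them relative to the MPI-only average. The numbers are copied from the `perf stat` output on this page; the configuration labels and the helper `summarize` are illustrative. With only three or four runs per configuration, the averages are a rough indication rather than a statistical result.

```python
from statistics import mean

# (elapsed seconds, cache-misses) per run, copied from the perf stat output above.
runs = {
    "MPI only": [
        (405.456098902, 1_431_603_007),
        (406.176510782, 1_450_247_295),
        (406.753558858, 1_443_507_799),
        (406.876224780, 1_446_514_817),
    ],
    "MPI+OMP, 2 threads": [
        (250.105811705, 1_181_055_424),
        (252.132564516, 1_188_476_630),
        (252.838338050, 1_203_900_822),
    ],
    "MPI+OMP, 4 threads": [
        (248.689815914, 1_097_648_012),
        (249.487985907, 1_104_745_687),
        (250.090669714, 1_094_830_287),
    ],
    "MPI+OMP, 8 threads": [
        (249.730256828, 1_183_348_122),
        (250.610157321, 1_187_749_563),
        (251.144711325, 1_170_709_431),
    ],
}


def summarize(samples):
    """Return (mean elapsed seconds, mean cache-misses) for one configuration."""
    elapsed, misses = zip(*samples)
    return mean(elapsed), mean(misses)


summary = {name: summarize(samples) for name, samples in runs.items()}
base_elapsed, base_misses = summary["MPI only"]

for name, (elapsed, misses) in summary.items():
    print(f"{name:<20} {elapsed:9.3f} s ({elapsed / base_elapsed:.3f}x)  "
          f"{misses:15,.0f} cache-misses ({misses / base_misses:.3f}x)")
```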