Skip to content

[ODK] RNS Dixon parallelized benchmarks

A. Breust edited this page Jul 10, 2019 · 15 revisions

2019-07-10

Boree (BLIS)

DIM BSIZE PRIMES THRDS RNSFGEMM USER (R) REAL (R)
500 100 1 4 BothParallel 112.3800 31.3896
500 100 2 4 BothParallel 48.1640 13.2718
500 100 3 4 BothParallel 35.5600 10.2058
500 100 4 4 BothParallel 27.3280 7.9700
500 100 6 4 BothParallel 22.7920 6.9586
500 100 8 4 BothParallel 19.1320 6.1989
500 100 12 4 BothParallel 15.5600 5.4505
500 100 16 4 BothParallel 14.0360 5.0601
500 100 20 4 BothParallel 13.6080 5.2611
500 100 24 4 BothParallel 12.7600 5.0058
500 100 28 4 BothParallel 12.7240 4.9811
500 100 32 4 BothParallel 12.7200 4.9913
500 100 36 4 BothParallel 12.8000 5.3516
500 100 40 4 BothParallel 12.9360 5.3580
  • Pourquoi le ralentissement (reproductible) à p = 20 ?
DIM BSIZE PRIMES THRDS R_TOT R_INIT R_LIFT R_CRT_RAT
500 100 1 4 29.7002 .1759 29.1154 .4071
500 100 2 4 13.7751 .1924 12.9467 .6342
500 100 3 4 10.3130 .2017 9.2142 .8953
500 100 4 4 8.9845 .2166 7.9184 .8473
500 100 6 4 7.6334 .2880 6.2613 1.0818
500 100 8 4 6.1640 .3085 4.8471 1.0060
500 100 12 4 6.0381 .4121 4.4215 1.2019
500 100 16 4 5.3471 .4997 3.7399 1.1043
500 100 20 4 5.4405 .6030 3.3833 1.4507
500 100 24 4 5.0931 .6895 3.1115 1.2883
500 100 28 4 5.0154 .7767 3.0238 1.2107
500 100 32 4 4.9846 .8715 2.9518 1.1568
500 100 36 4 5.3823 .9762 2.9195 1.4819
500 100 40 4 5.6153 1.0543 3.0158 1.5403
500 100 44 4 5.4004 1.1510 2.8320 1.4120
500 100 48 4 5.4033 1.2592 2.7961 1.3431
500 100 52 4 5.4471 1.3523 2.7674 1.3216
500 100 56 4 5.5458 1.4470 2.8158 1.2766
500 100 60 4 5.4927 1.5231 2.7218 1.2412
500 100 64 4 5.5324 1.6058 2.6910 1.2298
500 100 68 4 6.0481 1.7179 2.6463 1.6777
500 100 72 4 6.0070 1.8015 2.6233 1.5760
500 100 76 4 6.0631 1.9172 2.6064 1.5332
500 100 80 4 6.1072 2.0084 2.5915 1.5005
500 100 84 4 6.3278 2.1375 2.6368 1.5469
500 100 88 4 6.2854 2.1727 2.6138 1.4916
500 100 92 4 6.3047 2.3038 2.5624 1.4316
500 100 96 4 6.2986 2.3486 2.5501 1.3928
500 100 100 4 6.3985 2.4235 2.5477 1.4201
500 100 104 4 6.4521 2.5241 2.5341 1.3869
500 100 108 4 6.5842 2.6276 2.5588 1.3909
500 100 112 4 6.6160 2.7266 2.5416 1.3407
500 100 116 4 6.6710 2.8107 2.5459 1.3067
500 100 120 4 6.7433 2.9029 2.5369 1.2957
500 100 124 4 6.8303 2.9827 2.5515 1.2883
500 100 128 4 6.9664 3.0790 2.6000 1.2790
  • DIM 100 BSIZE 100

100_100_boree

  • DIM 200 BSIZE 100

200_100_boree

  • DIM 300 BSIZE 100

300_100_boree

  • DIM 500 BSIZE 100

500_100-boree

  • DIM 100 BSIZE 1000

100_1000_boree

  • Proposition : ???

2019-07-02

These runs have been done with the following setup:

  • Parallel init for the inverses precomputations.
  • Parallel for the euclidian division and apply of inverses.
  • Parallel FGEMM for residues update with runtime configurable ParSeqHelper::Compose
  • Parallel and cache-friendly R[j] <= R[j] / pj
  • Matrix-based fconvert_rns and addin(R, Q).

NOTE: Previous runs (2019-06-25) were done with ./test-solve-full, these are now done with ./benchmark-dense-solve. Each result is the median value over three iterations.

Boree (BLIS)

DIM BSIZE PRIMES THRDS RNSFGEMM USER (D/R) REAL (D/R)
100 100 12 4 ParallelRnsOnly .1760/.3800 .1729/.1308
100 100 12 2 ParallelRnsOnly .1640/.2720 .1599/.1541
100 100 12 1 ParallelRnsOnly .1520/.2000 .1551/.1970
500 100 12 4 ParallelRnsOnly 9.8840/17.1360 9.9488/5.8152
500 100 12 2 ParallelRnsOnly 10.8560/10.0880 10.8792/6.0366
500 100 12 1 ParallelRnsOnly 10.2360/7.4880 10.2764/7.5018
  • Faire omp_set_num_threads() a un petit impact sur le Dixon séquentiel.
  • Dès n = 500, DixonRNS est plus performant que Dixon séquentiel, même sans threading.
DIM BSIZE PRIMES THRDS RNSFGEMM USER (D/R) REAL (D/R)
500 100 1 4 ParallelRnsOnly 12.0240/120.2600 12.0441/32.0655
500 100 4 4 ParallelRnsOnly 10.3920/31.0160 10.4289/9.0184
500 100 8 4 ParallelRnsOnly 11.8560/20.5560 11.8805/6.3450
500 100 16 4 ParallelRnsOnly 11.0520/14.9640 11.0828/5.2746
500 100 32 4 ParallelRnsOnly 9.9400/13.8520 9.9576/5.7068
500 100 64 4 ParallelRnsOnly 10.9080/15.8160 10.9519/8.8123
  • Le nombre de nombres premiers à utiliser dépend du problème. Il n'y a pas de valeur magique.
DIM BSIZE PRIMES THRDS RNSFGEMM USER (D/R) REAL (D/R)
500 100 12 4 ParallelRnsOnly 11.0280/17.1680 11.0519/6.0090
500 100 12 4 ParallelFgemmOnly 10.8480/18.9560 10.8860/6.3640
500 100 12 4 BothSequential 10.8560/14.1320 10.8713/6.0654
500 100 12 4 BothParallel 10.7680/16.5080 10.8045/5.8532
500 100 12 8 ParallelRnsOnly 11.1320/13.3560 11.2400/6.0732
500 100 12 8 ParallelFgemmOnly 9.4920/10.2120 9.5009/6.3665
500 100 12 8 BothSequential 12.1280/10.1600 12.1606/6.3014
500 100 12 8 BothParallel 10.2440/13.5720 10.2528/6.2492
  • Difficile de dire quoi que ce soit, il y a pas mal de variance en fonction des nombres premiers tirés.
DIM BSIZE PRIMES THRDS D_IT D_LIFT D_TOT R_TOT R_INIT R_LIFT R_CRT_RAT
500 100 12 4 4112 10.7063 10.8053 6.1891 .4569 4.4476 1.2825
500 100 16 4 4058 11.6784 11.8271 5.6998 .4931 4.0675 1.1370
500 100 32 4 4029 11.2154 11.3647 5.5958 .8869 3.4842 1.2208
500 100 64 4 4039 11.2943 11.4433 5.7470 1.6033 2.8665 1.2717
500 100 128 4 4063 11.3224 11.4718 7.0930 3.0806 2.6821 1.3212
  • Chercher R_INIT = R_LIFT n'est pas le meilleur. Mais sur une machine très parallèle, R_LIFT gagnera beaucoup à avoir beaucoup de nombres premiers.

PALADIN

DIM BSIZE PRIMES THRDS RNSFGEMM USER (D/R) REAL (D/R)
500 100 12 4 ParallelRnsOnly 10.6400/17.1400 10.6651/5.9145
500 100 12 8 ParallelRnsOnly 10.3560/13.9040 10.3891/5.8261
  • Pas de slow-down avec PALADIN. Merci Zhu!

HPAC (OpenBLAS)

DIM BSIZE PRIMES THRDS RNSFGEMM USER (D/R) REAL (D/R)
100 100 32 32 ParallelRnsOnly .3440/29.4120 .2657/1.6594
500 100 32 32 ParallelRnsOnly 12.6920/198.9280 12.6396/22.8117
1000 100 32 32 ParallelRnsOnly 205.2720/575.9480 205.4940/94.2071
1000 100 256 32 ParallelRnsOnly /365.5400 /84.3257
2000 100 256 32 ParallelRnsOnly 1259.9800/1763.0900 1262.3600/389.6000

2019-06-25

These runs have been done with the following setup:

  • Nothing within the Init is done in parallel.
    • COULD HAVE PARALLELIZATION ON THE COMPUTATION OF INVERSES

  • Parallel for the euclidian division and apply of inverses.
  • Parallel FGEMM for residues update with FFLAS::ParSeqHelper::Compose<RNSParallel, FGEMMSequential>(4, 4)
    • COULD BE IMPROVED BY NOT HARDCODING ANYTHING

  • R[j] <= R[j] / pj is not parallel
    • FIND OUT WHY IT IS NOT WORKING SIMPLY

  • fconvert_rns to get R in ZZ is done on matrix.

Naming conventions:

  • PRIMES: The number of primes used for RNS Dixon. Basically how much we can parallelized.
  • D_LIFT: Classic Dixon lifting + RatRecon
  • D_IT: Classic Dixon number of iterations
  • R_IT: RNS Dixon number of iterations
  • R_INIT: RNS Dixon precomputing inverses mod all pj
  • R_LIFT: RNS Dixon lifting accumulations
  • R_CRTP: RNS Dixon CRT all progress()
  • R_CRT: RNS Dixon CRT result()
  • R_RAT: RNS Dixon rational reconstruction

HPAC (OpenBLAS)

DIM BITSIZE PRIMES D_IT D_LIFT R_IT R_INIT R_LIFT R_CRTP R_CRT R_RAT
100 100 256 800 .3171 4 .6772 .2350 .0379 .0128 .0248
100 100 128 805 .3123 8 .3398 .2805 .0285 .0128 .0250
100 100 64 807 .3143 16 .1727 .3388 .0237 .0128 .0248
100 100 32 802 .3106 31 .0890 .5142 .0193 .0125 .0242
100 100 16 798 .3036 61 .0477 .9056 .0161 .0116 .0243
100 100 8 801 .3153 121 .0273 1.5501 .0125 .0094 .0242
100 100 4 822 .3083 242 .0172 2.9304 .0105 .0060 .0237
100 100 2 817 .3133 489 .0118 5.8470 .0086 .0001 .0236
100 100 1 813 .3108 978 .0095 11.5726 .0001 .0001 .0238
100 100 32 802 .3106 31 .0890 .5142 .0193 .0125 .0242
100 200 32 1575 .6803 60 .0915 1.0971 .0460 .0348 .0751
200 100 32 1631 1.2162 61 .3204 1.3498 .0915 .0721 .1044
300 100 32 2421 3.0914 91 .7353 2.6728 .2338 .1972 .2580
400 100 32 3250 14.3534 122 2.0951 10.9242 .6531 .6323 .5688

Boree (BLIS)

DIM BITSIZE PRIMES D_IT D_LIFT R_IT R_INIT R_LIFT R_CRTP R_CRT R_RAT
100 100 32 807 .1426 31 .1494 .1103 .0105 .0066 .0135
100 200 32 1571 .3465 60 .1532 .3258 .0249 .0182 .0408
200 100 32 1608 .6997 61 .4154 .5350 .0492 .0371 .0557
300 100 32 2452 1.8967 91 .7970 1.2881 .1215 .1020 .1354
400 100 32 3227 5.2680 122 1.3045 2.5307 .2442 .2108 .2635
1000 100 32 8109 90.4061 307 7.5955 25.1093 2.1447 1.8424 2.1459

One D_LIFT details:

DIM BITSIZE PRIMES DIV and c = A^{-1} r FGEMM R <= R - Ac R <= R / p CONVERT r <= Q + R
1000 100 32 0.0219879 0.0230379 0.014802 0.0106769
Clone this wiki locally