[ODK] RNS Dixon parallelized benchmarks

2019-07-10

Boree (BLIS)

DIM	BSIZE	PRIMES	THRDS	RNSFGEMM	USER (R)	REAL (R)
500	100	1	4	BothParallel	112.3800	31.3896
500	100	2	4	BothParallel	48.1640	13.2718
500	100	3	4	BothParallel	35.5600	10.2058
500	100	4	4	BothParallel	27.3280	7.9700
500	100	6	4	BothParallel	22.7920	6.9586
500	100	8	4	BothParallel	19.1320	6.1989
500	100	12	4	BothParallel	15.5600	5.4505
500	100	16	4	BothParallel	14.0360	5.0601
500	100	20	4	BothParallel	13.6080	5.2611
500	100	24	4	BothParallel	12.7600	5.0058
500	100	28	4	BothParallel	12.7240	4.9811
500	100	32	4	BothParallel	12.7200	4.9913
500	100	36	4	BothParallel	12.8000	5.3516
500	100	40	4	BothParallel	12.9360	5.3580

Pourquoi le ralentissement (reproductible) à p = 20 ?

DIM	BSIZE	PRIMES	THRDS	R_TOT	R_INIT	R_LIFT	R_CRT_RAT
500	100	1	4	29.7002	.1759	29.1154	.4071
500	100	2	4	13.7751	.1924	12.9467	.6342
500	100	3	4	10.3130	.2017	9.2142	.8953
500	100	4	4	8.9845	.2166	7.9184	.8473
500	100	6	4	7.6334	.2880	6.2613	1.0818
500	100	8	4	6.1640	.3085	4.8471	1.0060
500	100	12	4	6.0381	.4121	4.4215	1.2019
500	100	16	4	5.3471	.4997	3.7399	1.1043
500	100	20	4	5.4405	.6030	3.3833	1.4507
500	100	24	4	5.0931	.6895	3.1115	1.2883
500	100	28	4	5.0154	.7767	3.0238	1.2107
500	100	32	4	4.9846	.8715	2.9518	1.1568
500	100	36	4	5.3823	.9762	2.9195	1.4819
500	100	40	4	5.6153	1.0543	3.0158	1.5403
500	100	44	4	5.4004	1.1510	2.8320	1.4120
500	100	48	4	5.4033	1.2592	2.7961	1.3431
500	100	52	4	5.4471	1.3523	2.7674	1.3216
500	100	56	4	5.5458	1.4470	2.8158	1.2766
500	100	60	4	5.4927	1.5231	2.7218	1.2412
500	100	64	4	5.5324	1.6058	2.6910	1.2298
500	100	68	4	6.0481	1.7179	2.6463	1.6777
500	100	72	4	6.0070	1.8015	2.6233	1.5760
500	100	76	4	6.0631	1.9172	2.6064	1.5332
500	100	80	4	6.1072	2.0084	2.5915	1.5005
500	100	84	4	6.3278	2.1375	2.6368	1.5469
500	100	88	4	6.2854	2.1727	2.6138	1.4916
500	100	92	4	6.3047	2.3038	2.5624	1.4316
500	100	96	4	6.2986	2.3486	2.5501	1.3928
500	100	100	4	6.3985	2.4235	2.5477	1.4201
500	100	104	4	6.4521	2.5241	2.5341	1.3869
500	100	108	4	6.5842	2.6276	2.5588	1.3909
500	100	112	4	6.6160	2.7266	2.5416	1.3407
500	100	116	4	6.6710	2.8107	2.5459	1.3067
500	100	120	4	6.7433	2.9029	2.5369	1.2957
500	100	124	4	6.8303	2.9827	2.5515	1.2883
500	100	128	4	6.9664	3.0790	2.6000	1.2790

DIM 100 BSIZE 100

100_100_boree

DIM 200 BSIZE 100

200_100_boree

DIM 300 BSIZE 100

300_100_boree

DIM 500 BSIZE 100

500_100-boree

DIM 100 BSIZE 1000

100_1000_boree

Proposition : ???

2019-07-02

These runs have been done with the following setup:

Parallel init for the inverses precomputations.
Parallel for the euclidian division and apply of inverses.
Parallel FGEMM for residues update with runtime configurable ParSeqHelper::Compose
Parallel and cache-friendly R[j] <= R[j] / pj
Matrix-based fconvert_rns and addin(R, Q).

NOTE: Previous runs (2019-06-25) were done with ./test-solve-full, these are now done with ./benchmark-dense-solve. Each result is the median value over three iterations.

Boree (BLIS)

DIM	BSIZE	PRIMES	THRDS	RNSFGEMM	USER (D/R)	REAL (D/R)
100	100	12	4	ParallelRnsOnly	.1760/.3800	.1729/.1308
100	100	12	2	ParallelRnsOnly	.1640/.2720	.1599/.1541
100	100	12	1	ParallelRnsOnly	.1520/.2000	.1551/.1970

500	100	12	4	ParallelRnsOnly	9.8840/17.1360	9.9488/5.8152
500	100	12	2	ParallelRnsOnly	10.8560/10.0880	10.8792/6.0366
500	100	12	1	ParallelRnsOnly	10.2360/7.4880	10.2764/7.5018

Faire omp_set_num_threads() a un petit impact sur le Dixon séquentiel.
Dès n = 500, DixonRNS est plus performant que Dixon séquentiel, même sans threading.

DIM	BSIZE	PRIMES	THRDS	RNSFGEMM	USER (D/R)	REAL (D/R)
500	100	1	4	ParallelRnsOnly	12.0240/120.2600	12.0441/32.0655
500	100	4	4	ParallelRnsOnly	10.3920/31.0160	10.4289/9.0184
500	100	8	4	ParallelRnsOnly	11.8560/20.5560	11.8805/6.3450
500	100	16	4	ParallelRnsOnly	11.0520/14.9640	11.0828/5.2746
500	100	32	4	ParallelRnsOnly	9.9400/13.8520	9.9576/5.7068
500	100	64	4	ParallelRnsOnly	10.9080/15.8160	10.9519/8.8123

Le nombre de nombres premiers à utiliser dépend du problème. Il n'y a pas de valeur magique.

DIM	BSIZE	PRIMES	THRDS	RNSFGEMM	USER (D/R)	REAL (D/R)
500	100	12	4	ParallelRnsOnly	11.0280/17.1680	11.0519/6.0090
500	100	12	4	ParallelFgemmOnly	10.8480/18.9560	10.8860/6.3640
500	100	12	4	BothSequential	10.8560/14.1320	10.8713/6.0654
500	100	12	4	BothParallel	10.7680/16.5080	10.8045/5.8532

500	100	12	8	ParallelRnsOnly	11.1320/13.3560	11.2400/6.0732
500	100	12	8	ParallelFgemmOnly	9.4920/10.2120	9.5009/6.3665
500	100	12	8	BothSequential	12.1280/10.1600	12.1606/6.3014
500	100	12	8	BothParallel	10.2440/13.5720	10.2528/6.2492

Difficile de dire quoi que ce soit, il y a pas mal de variance en fonction des nombres premiers tirés.

DIM	BSIZE	PRIMES	THRDS	D_IT	D_LIFT	D_TOT	R_TOT	R_INIT	R_LIFT	R_CRT_RAT
500	100	12	4	4112	10.7063	10.8053	6.1891	.4569	4.4476	1.2825
500	100	16	4	4058	11.6784	11.8271	5.6998	.4931	4.0675	1.1370
500	100	32	4	4029	11.2154	11.3647	5.5958	.8869	3.4842	1.2208
500	100	64	4	4039	11.2943	11.4433	5.7470	1.6033	2.8665	1.2717
500	100	128	4	4063	11.3224	11.4718	7.0930	3.0806	2.6821	1.3212

Chercher R_INIT = R_LIFT n'est pas le meilleur. Mais sur une machine très parallèle, R_LIFT gagnera beaucoup à avoir beaucoup de nombres premiers.

PALADIN

DIM	BSIZE	PRIMES	THRDS	RNSFGEMM	USER (D/R)	REAL (D/R)
500	100	12	4	ParallelRnsOnly	10.6400/17.1400	10.6651/5.9145
500	100	12	8	ParallelRnsOnly	10.3560/13.9040	10.3891/5.8261

Pas de slow-down avec PALADIN. Merci Zhu!

HPAC (OpenBLAS)

DIM	BSIZE	PRIMES	THRDS	RNSFGEMM	USER (D/R)	REAL (D/R)
100	100	32	32	ParallelRnsOnly	.3440/29.4120	.2657/1.6594
500	100	32	32	ParallelRnsOnly	12.6920/198.9280	12.6396/22.8117
1000	100	32	32	ParallelRnsOnly	205.2720/575.9480	205.4940/94.2071
1000	100	256	32	ParallelRnsOnly	/365.5400	/84.3257

2000	100	256	32	ParallelRnsOnly	1259.9800/1763.0900	1262.3600/389.6000

2019-06-25

These runs have been done with the following setup:

Nothing within the Init is done in parallel.
- COULD HAVE PARALLELIZATION ON THE COMPUTATION OF INVERSES
Parallel for the euclidian division and apply of inverses.
Parallel FGEMM for residues update with FFLAS::ParSeqHelper::Compose<RNSParallel, FGEMMSequential>(4, 4)
- COULD BE IMPROVED BY NOT HARDCODING ANYTHING
R[j] <= R[j] / pj is not parallel
- FIND OUT WHY IT IS NOT WORKING SIMPLY
fconvert_rns to get R in ZZ is done on matrix.

Naming conventions:

PRIMES: The number of primes used for RNS Dixon. Basically how much we can parallelized.
D_LIFT: Classic Dixon lifting + RatRecon
D_IT: Classic Dixon number of iterations
R_IT: RNS Dixon number of iterations
R_INIT: RNS Dixon precomputing inverses mod all pj
R_LIFT: RNS Dixon lifting accumulations
R_CRTP: RNS Dixon CRT all progress()
R_CRT: RNS Dixon CRT result()
R_RAT: RNS Dixon rational reconstruction

HPAC (OpenBLAS)

DIM	BITSIZE	PRIMES	D_IT	D_LIFT	R_IT	R_INIT	R_LIFT	R_CRTP	R_CRT	R_RAT
100	100	256	800	.3171	4	.6772	.2350	.0379	.0128	.0248
100	100	128	805	.3123	8	.3398	.2805	.0285	.0128	.0250
100	100	64	807	.3143	16	.1727	.3388	.0237	.0128	.0248
100	100	32	802	.3106	31	.0890	.5142	.0193	.0125	.0242
100	100	16	798	.3036	61	.0477	.9056	.0161	.0116	.0243
100	100	8	801	.3153	121	.0273	1.5501	.0125	.0094	.0242
100	100	4	822	.3083	242	.0172	2.9304	.0105	.0060	.0237
100	100	2	817	.3133	489	.0118	5.8470	.0086	.0001	.0236
100	100	1	813	.3108	978	.0095	11.5726	.0001	.0001	.0238

100	100	32	802	.3106	31	.0890	.5142	.0193	.0125	.0242
100	200	32	1575	.6803	60	.0915	1.0971	.0460	.0348	.0751
200	100	32	1631	1.2162	61	.3204	1.3498	.0915	.0721	.1044
300	100	32	2421	3.0914	91	.7353	2.6728	.2338	.1972	.2580
400	100	32	3250	14.3534	122	2.0951	10.9242	.6531	.6323	.5688

Boree (BLIS)

DIM	BITSIZE	PRIMES	D_IT	D_LIFT	R_IT	R_INIT	R_LIFT	R_CRTP	R_CRT	R_RAT
100	100	32	807	.1426	31	.1494	.1103	.0105	.0066	.0135
100	200	32	1571	.3465	60	.1532	.3258	.0249	.0182	.0408
200	100	32	1608	.6997	61	.4154	.5350	.0492	.0371	.0557
300	100	32	2452	1.8967	91	.7970	1.2881	.1215	.1020	.1354
400	100	32	3227	5.2680	122	1.3045	2.5307	.2442	.2108	.2635

1000	100	32	8109	90.4061	307	7.5955	25.1093	2.1447	1.8424	2.1459

One D_LIFT details:

DIM	BITSIZE	PRIMES	DIV and c = A^{-1} r	FGEMM R <= R - Ac	R <= R / p	CONVERT r <= Q + R
1000	100	32	0.0219879	0.0230379	0.014802	0.0106769

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ODK] RNS Dixon parallelized benchmarks

2019-07-10

Boree (BLIS)

2019-07-02

Boree (BLIS)

PALADIN

HPAC (OpenBLAS)

2019-06-25

HPAC (OpenBLAS)

Boree (BLIS)

Clone this wiki locally