Unroll update_accumulator_refresh() #5125

mstembera · 2024-03-21T08:30:11Z

Unroll update_accumulator_refresh() to process two active indices simultaneously. (The compiler can't unroll effectively because the number of active indices isn't known at compile time.)

STC https://tests.stockfishchess.org/tests/view/65faa8850ec64f0526c4fca9
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 130464 W: 33882 L: 33431 D: 63151
Ptnml(0-2): 539, 14591, 34501, 15082, 519

No functional change
bench: 2109005

No functional change bench: 2109005

github-actions · 2024-03-21T08:30:53Z

clang-format 17 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/mantic/clang-format-17.

(execution 8371882331 / attempt 1)

Disservin · 2024-03-21T08:41:41Z

I’d be glad if someone could run this through a speedup script for various compilers

Torom · 2024-03-21T10:57:54Z

x86-64-bmi2 clang 17.0.6

Result of  50 runs
==================
base (...es/stockfish) =    1076007  +/- 2065
test (....UARunrollPR) =    1077021  +/- 1894
diff                   =      +1014  +/- 2988

speedup        = +0.0009
P(speedup > 0) =  0.7467

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

x86-64-bmi2 gcc 13.2.1

Result of  50 runs
==================
base (...tockfish_gcc) =    1064075  +/- 2034
test (...unrollPR_gcc) =    1063413  +/- 1871
diff                   =       -662  +/- 2986

speedup        = -0.0006
P(speedup > 0) =  0.3322

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

armv8-dotprod clang 16.0.6

Result of  30 runs
==================
base (../stockfish   ) =     317332  +/- 582
test (....UARunrollPR) =     316970  +/- 638
diff                   =       -362  +/- 1124

speedup        = -0.0011
P(speedup > 0) =  0.2641

CPU: 4 x aarch64
Hyperthreading: off

armv8-dotprod gcc 13.2.0

Result of  30 runs
==================
base (...tockfish_gcc) =     312348  +/- 729
test (...unrollPR_gcc) =     324944  +/- 1236
diff                   =     +12596  +/- 1785

speedup        = +0.0403
P(speedup > 0) =  1.0000

CPU: 4 x aarch64
Hyperthreading: off

XInTheDark · 2024-03-21T12:16:19Z

apple-silicon clang 17.0.6

Result of  30 runs
==================
base (...rc/stockfish) =    1319470  +/- 2945
test (...rc/stockfish) =    1312130  +/- 2728
diff                   =      -7340  +/- 748

speedup        = -0.0056
P(speedup > 0) =  0.0000

CPU: 8 x arm
Hyperthreading: off

Torom · 2024-03-22T11:02:22Z

I have grouped the Fishtest result by compiler and major compiler versions (sorted by number of games):

g++       ELO:    1.19   +/-    0.97   LOS:  99.2%   [529, 14325, 33805, 14795, 514]
clang++   ELO:    1.51   +/-    6.67   LOS:  67.2%   [10, 266, 696, 287, 5]

g++ 10       ELO:   -0.23   +/-    1.32   LOS:  36.7%   [300, 8010, 18342, 8020, 272]
g++ 13       ELO:    3.47   +/-    2.35   LOS:  99.8%   [81, 2320, 5736, 2533, 82]
g++ 11       ELO:    3.47   +/-    2.71   LOS:  99.4%   [65, 1784, 4354, 1932, 73]
g++ 12       ELO:    1.76   +/-    2.84   LOS:  88.7%   [63, 1670, 3984, 1734, 69]
g++ 9        ELO:    1.50   +/-    6.06   LOS:  68.7%   [16, 348, 869, 372, 11]
g++ 7        ELO:    3.18   +/-    7.70   LOS:  79.1%   [4, 193, 520, 204, 7]
clang++ 17   ELO:   -2.90   +/-   11.08   LOS:  30.4%   [5, 109, 257, 107, 2]
clang++ 14   ELO:   -4.12   +/-   10.76   LOS:  22.6%   [5, 97, 267, 94, 1]
clang++ 15   ELO:   19.70   +/-   14.85   LOS:  99.5%   [0, 48, 132, 75, 1]
clang++ 18   ELO:    2.71   +/-   27.70   LOS:  57.6%   [0, 12, 40, 11, 1]

vondele · 2024-03-22T14:54:18Z

Mixed results for me, nothing stands out particularly.

Base:  ./stockfish-master-g++-9
Test:  ./stockfish-patch-g++-9
run       base       test     diff
  1    1412687    1421394    +8707
  2    1415243    1411025    -4218
  3    1407051    1381335   -25716
  4    1415689    1411357    -4332
  5    1371958    1394291   +22333
  6    1396564    1402444    +5880
  7    1384844    1396239   +11395
  8    1409146    1402444    -6702
  9    1400369    1387087   -13282
 10    1396239    1409919   +13680

Result of  10 runs
==================
base (./stockfish-master-g++-9 ) =    1400979  +/- 8766
test (./stockfish-patch-g++-9  ) =    1401754  +/- 7579
diff                             =       +774  +/- 8848

speedup        = +0.0006
P(speedup > 0) =  0.5680

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 

Base:  ./stockfish-master-g++-10
Test:  ./stockfish-patch-g++-10
run       base       test     diff
  1    1442944    1444219    +1275
  2    1416358    1390841   -25517
  3    1435230    1427599    -7631
  4    1438674    1439365     +691
  5    1415466    1420384    +4918
  6    1431917    1420048   -11869
  7    1405073    1413131    +8058
  8    1413797    1394291   -19506
  9    1435116    1418256   -16860
 10    1412465    1410140    -2325

Result of  10 runs
==================
base (./stockfish-master-g++-10) =    1424704  +/- 8273
test (./stockfish-patch-g++-10 ) =    1417827  +/- 10582
diff                             =      -6877  +/- 6962

speedup        = -0.0048
P(speedup > 0) =  0.0266

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 

Base:  ./stockfish-master-g++-11
Test:  ./stockfish-patch-g++-11
run       base       test     diff
  1    1427032    1421057    -5975
  2    1429187    1409919   -19268
  3    1411911    1417921    +6010
  4    1426014    1415689   -10325
  5    1426240    1411689   -14551
  6    1398410    1402116    +3706
  7    1407932    1405402    -2530
  8    1423644    1414909    -8735
  9    1429187    1408153   -21034
 10    1418144    1424208    +6064

Result of  10 runs
==================
base (./stockfish-master-g++-11) =    1419770  +/- 6481
test (./stockfish-patch-g++-11 ) =    1413106  +/- 4310
diff                             =      -6664  +/- 6165

speedup        = -0.0047
P(speedup > 0) =  0.0172

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 

Base:  ./stockfish-master-g++-12
Test:  ./stockfish-patch-g++-12
run       base       test     diff
  1    1416916    1418144    +1228
  2    1404305    1409477    +5172
  3    1407051    1408814    +1763
  4    1404963    1406941    +1978
  5    1401133    1405183    +4050
  6    1410693    1407932    -2761
  7    1404415    1403976     -439
  8    1417362    1420048    +2686
  9    1407822    1419935   +12113
 10    1410471    1428506   +18035

Result of  10 runs
==================
base (./stockfish-master-g++-12) =    1408513  +/- 3337
test (./stockfish-patch-g++-12 ) =    1412896  +/- 5051
diff                             =      +4382  +/- 3841

speedup        = +0.0031
P(speedup > 0) =  0.9872

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on

mstembera · 2024-03-22T20:08:42Z

How do you suggest we proceed? Should I run another test? Maybe another STC w/ LTC bounds?

Disservin · 2024-03-23T09:58:27Z

I can reproduce a speedup with latest gcc, profile build of x86-64-bmi2
gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

➜  src git:(master) ✗ ./bench_parallel.sh ./stockfish ./stockfish-pr-5125 20 10

sf_base =  1644140 +/-  20670 (95%)
sf_test =  1695293 +/-  15120 (95%)
diff    =    51153 +/-  22407 (95%)
speedup = 3.11124% +/- 1.363% (95%)

Though @Torom earlier posted some benchmarks for g++ 13.2 which look rather different.. did you do a profile build?

Disservin · 2024-03-23T10:00:59Z

On a shorter depth bench the speedup is even larger..
➜ src git:(master) ✗ ./bench_parallel.sh ./stockfish ./stockfish-pr-5125 13 50

sf_base =  1560713 +/-   7500 (95%)
sf_test =  1674767 +/-   7859 (95%)
diff    =   114053 +/-   9184 (95%)
speedup = 7.30780% +/- 0.588% (95%)

Torom · 2024-03-23T10:02:55Z

Though @Torom earlier posted some benchmarks for g++ 13.2 which look rather different.. did you do a profile build?

Yes, these were all profile builds.
On the Raspberry Pi with gcc 13.2 it is interestingly enough also a significant speedup.

Disservin · 2024-03-23T10:34:12Z

Even with clang++-18 I get a huge speedup.. (13 50)

sf_base =  1574218 +/-   6819 (95%)
sf_test =  1702423 +/-   5995 (95%)
diff    =   128205 +/-   7367 (95%)
speedup = 8.14405% +/- 0.468% (95%)

Compiled by                : clang++ 18.1.3 on Linux
Compilation architecture   : x86-64-bmi2
Compilation settings       : 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Ubuntu Clang 18.1.3 (++20240322073236+ef6d1ec07c69-1~exp1~20240322193248.98)

Disservin · 2024-03-23T10:59:27Z

Though @Torom earlier posted some benchmarks for g++ 13.2 which look rather different.. did you do a profile build?

Yes, these were all profile builds. On the Raspberry Pi with gcc 13.2 it is interestingly enough also a significant speedup.

Also in the range of ~7% ?

Torom · 2024-03-23T11:06:13Z

armv8-dotprod gcc 13.2.0

Result of  30 runs
==================
base (...tockfish_gcc) =     312348  +/- 729
test (...unrollPR_gcc) =     324944  +/- 1236
diff                   =     +12596  +/- 1785

speedup        = +0.0403
P(speedup > 0) =  1.0000

CPU: 4 x aarch64
Hyperthreading: off

~4%

Disservin · 2024-03-23T11:12:48Z

Ah didn't see that earlier, thanks

vondele · 2024-03-23T12:13:06Z

It is well possible that it is not just compiler but also specific CPU dependent. I no case there seems to be a measurable slowdown, so it seems that would all support merging.

Disservin · 2024-03-23T12:33:25Z

Agreed.

Disservin · 2024-03-23T14:16:50Z

@mstembera I will merge this as is, probably add a comment to point out that we are manually unrolling the loop which seems to have a positive effect with some compilers/cpus. The speedup is pretty significant in my tests... was this also a speedup for you locally?
I tested it with my Ryzen 9 5950X.

Disservin · 2024-03-23T14:17:30Z

@PGG106 could you test this with your cpu?

PGG106 · 2024-03-23T14:20:07Z

sure, compiled with make -j profile-build ARCH=x86-64-avx512 on a ryzen 9 7950X, gcc 11.4
20 10:

sf_base =  1947474 +/-  15148 (95%)
sf_test =  2006258 +/-  18821 (95%)
diff    =    58783 +/-   4737 (95%)
speedup = 3.01846% +/- 0.243% (95%)

13 50:

sf_base =  1804101 +/-   3419 (95%)
sf_test =  1954879 +/-   4033 (95%)
diff    =   150778 +/-   1881 (95%)
speedup = 8.35754% +/- 0.104% (95%)

mstembera · 2024-03-23T17:06:30Z

@Disservin On my i9-7980XE it is neutral. The reason I thought it would be a speedup on some machines is because it is a similar idea to #4816 which was also a speedup but also neutral on my machine. I think @vondele is correct about it being CPU dependent. Thanks to everyone who tested!

Disservin · 2024-03-23T17:21:16Z

Anyhow nice speedup for my setup (and others) ^^
Though fishtest did not make it so clear.. so I’m suggesting that in the future it might make sense to open a draft pr for possible speedups and then we can decide if it is worth it to merge it. In this case I would have probably merged it even if it failed fishtest. Though I suspect you would not have opened the pr if it failed fishtest..?

mstembera · 2024-03-23T17:46:47Z

Thanks. Yes I would not have opened if it failed fishtest but if I think something really should be a speed up in the future I will do as you suggest. I just don't want to cause unnecessary work.

Disservin · 2024-03-23T18:06:08Z

Yeah ^^ I find this one rather interesting because it benefits apparently newer cpu even with an up to date compiler. Though regarding speedups in general I'd like to keep them in sync with (somewhat) up to date software/hardware, meaning we shouldn't aim to optimize something for a particular old compiler

mstembera · 2024-03-23T23:02:21Z

@Disservin Actually I just remembered that I have another unroll speedup that finished yellow so I never opened a PR.
https://tests.stockfishchess.org/tests/view/65a077f479aa8af82b96b850 Let me know what you think or if you want me to open a draft PR for it as you suggested above.

bench 1883927

Disservin · 2024-03-24T09:37:15Z

@mstembera It seems I was fooled by git, sorry.. I didn't realize that this patch is one commit behind master, rebasing to master and comparing with master shows no speedup, so in reality I was just testing the effect of the #5130.

(I also tested the rebased UnrollPropagate2 locally but it also showed no speedup)

Disservin · 2024-03-24T11:05:26Z

@Torom did you make the same mistake with the 4% speedup test or was that one genuinely a 4% speedup?

Torom · 2024-03-24T11:08:59Z

Since my tests were 3 days ago and the VVLTC tune was merged only two days ago, my tests against the master at that time are probably correct.

Disservin · 2024-03-24T11:13:15Z

Ah indeed.

PGG106 · 2024-03-24T11:37:58Z

I can confirm my tests were incorrent in the same vein disservin's tests were.

No functional change bench: 2103324

Unroll update_accumulator_refresh to process two active indices simultaneously. The compiler might not unroll effectively because the number of active indices isn't known at compile time. STC https://tests.stockfishchess.org/tests/view/65faa8850ec64f0526c4fca9 LLR: 2.93 (-2.94,2.94) <0.00,2.00> Total: 130464 W: 33882 L: 33431 D: 63151 Ptnml(0-2): 539, 14591, 34501, 15082, 519 closes official-stockfish#5125 No functional change

No functional change bench: 2103324

…onal change bench: 1823302

Unroll update_accumulator_refresh()

b44f329

No functional change bench: 2109005

Disservin added the no-functional-change label Mar 23, 2024

Disservin added the 🚀 gainer Gains elo label Mar 23, 2024

mstembera referenced this pull request in linrock/Stockfish Mar 23, 2024

3072 nn-fe11b1ade947.nnue

25afc72

bench 1883927

Disservin added the to be merged Will be merged shortly label Mar 26, 2024

Disservin closed this in 5001d49 Mar 26, 2024

mstembera added a commit to mstembera/Stockfish that referenced this pull request Mar 27, 2024

Reduce refresh cost after PR official-stockfish#5125

b96370d

No functional change bench: 2103324

mstembera added a commit to mstembera/Stockfish that referenced this pull request Apr 2, 2024

Reduce refresh cost after PR official-stockfish#5125

855921a

No functional change bench: 2103324

mstembera added a commit to mstembera/Stockfish that referenced this pull request Apr 2, 2024

Reduce refresh cost by 5/8 after PR official-stockfish#5125 No functi…

173c073

…onal change bench: 1823302

mstembera added a commit to mstembera/Stockfish that referenced this pull request Apr 2, 2024

Reduce refresh cost by 7/8 after PR official-stockfish#5125 No functi…

f115ce1

…onal change bench: 1823302

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unroll update_accumulator_refresh() #5125

Unroll update_accumulator_refresh() #5125

mstembera commented Mar 21, 2024

github-actions bot commented Mar 21, 2024

Disservin commented Mar 21, 2024

Torom commented Mar 21, 2024

XInTheDark commented Mar 21, 2024

Torom commented Mar 22, 2024

vondele commented Mar 22, 2024

mstembera commented Mar 22, 2024

Disservin commented Mar 23, 2024

Disservin commented Mar 23, 2024

Torom commented Mar 23, 2024

Disservin commented Mar 23, 2024 •

edited

Loading

Disservin commented Mar 23, 2024

Torom commented Mar 23, 2024

Disservin commented Mar 23, 2024

vondele commented Mar 23, 2024

Disservin commented Mar 23, 2024

Disservin commented Mar 23, 2024

Disservin commented Mar 23, 2024

PGG106 commented Mar 23, 2024 •

edited

Loading

mstembera commented Mar 23, 2024 •

edited

Loading

Disservin commented Mar 23, 2024

mstembera commented Mar 23, 2024

Disservin commented Mar 23, 2024

mstembera commented Mar 23, 2024

Disservin commented Mar 24, 2024 •

edited

Loading

Disservin commented Mar 24, 2024 •

edited

Loading

Torom commented Mar 24, 2024

Disservin commented Mar 24, 2024

PGG106 commented Mar 24, 2024

Unroll update_accumulator_refresh() #5125

Unroll update_accumulator_refresh() #5125

Conversation

mstembera commented Mar 21, 2024

github-actions bot commented Mar 21, 2024

Disservin commented Mar 21, 2024

Torom commented Mar 21, 2024

XInTheDark commented Mar 21, 2024

Torom commented Mar 22, 2024

vondele commented Mar 22, 2024

mstembera commented Mar 22, 2024

Disservin commented Mar 23, 2024

Disservin commented Mar 23, 2024

Torom commented Mar 23, 2024

Disservin commented Mar 23, 2024 • edited Loading

Disservin commented Mar 23, 2024

Torom commented Mar 23, 2024

Disservin commented Mar 23, 2024

vondele commented Mar 23, 2024

Disservin commented Mar 23, 2024

Disservin commented Mar 23, 2024

Disservin commented Mar 23, 2024

PGG106 commented Mar 23, 2024 • edited Loading

mstembera commented Mar 23, 2024 • edited Loading

Disservin commented Mar 23, 2024

mstembera commented Mar 23, 2024

Disservin commented Mar 23, 2024

mstembera commented Mar 23, 2024

Disservin commented Mar 24, 2024 • edited Loading

Disservin commented Mar 24, 2024 • edited Loading

Torom commented Mar 24, 2024

Disservin commented Mar 24, 2024

PGG106 commented Mar 24, 2024

Disservin commented Mar 23, 2024 •

edited

Loading

PGG106 commented Mar 23, 2024 •

edited

Loading

mstembera commented Mar 23, 2024 •

edited

Loading

Disservin commented Mar 24, 2024 •

edited

Loading

Disservin commented Mar 24, 2024 •

edited

Loading