Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unroll update_accumulator_refresh() #5125

Closed
wants to merge 1 commit into from

Conversation

mstembera
Copy link
Contributor

Unroll update_accumulator_refresh() to process two active indices simultaneously. (The compiler can't unroll effectively because the number of active indices isn't known at compile time.)

STC https://tests.stockfishchess.org/tests/view/65faa8850ec64f0526c4fca9
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 130464 W: 33882 L: 33431 D: 63151
Ptnml(0-2): 539, 14591, 34501, 15082, 519

No functional change
bench: 2109005

No functional change
bench: 2109005
Copy link

clang-format 17 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/mantic/clang-format-17.

(execution 8371882331 / attempt 1)

@Disservin
Copy link
Member

I’d be glad if someone could run this through a speedup script for various compilers

@Torom
Copy link
Contributor

Torom commented Mar 21, 2024

x86-64-bmi2 clang 17.0.6

Result of  50 runs
==================
base (...es/stockfish) =    1076007  +/- 2065
test (....UARunrollPR) =    1077021  +/- 1894
diff                   =      +1014  +/- 2988

speedup        = +0.0009
P(speedup > 0) =  0.7467

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

x86-64-bmi2 gcc 13.2.1

Result of  50 runs
==================
base (...tockfish_gcc) =    1064075  +/- 2034
test (...unrollPR_gcc) =    1063413  +/- 1871
diff                   =       -662  +/- 2986

speedup        = -0.0006
P(speedup > 0) =  0.3322

CPU: 4 x Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Hyperthreading: on

armv8-dotprod clang 16.0.6

Result of  30 runs
==================
base (../stockfish   ) =     317332  +/- 582
test (....UARunrollPR) =     316970  +/- 638
diff                   =       -362  +/- 1124

speedup        = -0.0011
P(speedup > 0) =  0.2641

CPU: 4 x aarch64
Hyperthreading: off

armv8-dotprod gcc 13.2.0

Result of  30 runs
==================
base (...tockfish_gcc) =     312348  +/- 729
test (...unrollPR_gcc) =     324944  +/- 1236
diff                   =     +12596  +/- 1785

speedup        = +0.0403
P(speedup > 0) =  1.0000

CPU: 4 x aarch64
Hyperthreading: off

@XInTheDark
Copy link
Contributor

apple-silicon clang 17.0.6

Result of  30 runs
==================
base (...rc/stockfish) =    1319470  +/- 2945
test (...rc/stockfish) =    1312130  +/- 2728
diff                   =      -7340  +/- 748

speedup        = -0.0056
P(speedup > 0) =  0.0000

CPU: 8 x arm
Hyperthreading: off

@Torom
Copy link
Contributor

Torom commented Mar 22, 2024

I have grouped the Fishtest result by compiler and major compiler versions (sorted by number of games):

g++       ELO:    1.19   +/-    0.97   LOS:  99.2%   [529, 14325, 33805, 14795, 514]
clang++   ELO:    1.51   +/-    6.67   LOS:  67.2%   [10, 266, 696, 287, 5]
g++ 10       ELO:   -0.23   +/-    1.32   LOS:  36.7%   [300, 8010, 18342, 8020, 272]
g++ 13       ELO:    3.47   +/-    2.35   LOS:  99.8%   [81, 2320, 5736, 2533, 82]
g++ 11       ELO:    3.47   +/-    2.71   LOS:  99.4%   [65, 1784, 4354, 1932, 73]
g++ 12       ELO:    1.76   +/-    2.84   LOS:  88.7%   [63, 1670, 3984, 1734, 69]
g++ 9        ELO:    1.50   +/-    6.06   LOS:  68.7%   [16, 348, 869, 372, 11]
g++ 7        ELO:    3.18   +/-    7.70   LOS:  79.1%   [4, 193, 520, 204, 7]
clang++ 17   ELO:   -2.90   +/-   11.08   LOS:  30.4%   [5, 109, 257, 107, 2]
clang++ 14   ELO:   -4.12   +/-   10.76   LOS:  22.6%   [5, 97, 267, 94, 1]
clang++ 15   ELO:   19.70   +/-   14.85   LOS:  99.5%   [0, 48, 132, 75, 1]
clang++ 18   ELO:    2.71   +/-   27.70   LOS:  57.6%   [0, 12, 40, 11, 1]

@vondele
Copy link
Member

vondele commented Mar 22, 2024

Mixed results for me, nothing stands out particularly.

Base:  ./stockfish-master-g++-9
Test:  ./stockfish-patch-g++-9
run       base       test     diff
  1    1412687    1421394    +8707
  2    1415243    1411025    -4218
  3    1407051    1381335   -25716
  4    1415689    1411357    -4332
  5    1371958    1394291   +22333
  6    1396564    1402444    +5880
  7    1384844    1396239   +11395
  8    1409146    1402444    -6702
  9    1400369    1387087   -13282
 10    1396239    1409919   +13680

Result of  10 runs
==================
base (./stockfish-master-g++-9 ) =    1400979  +/- 8766
test (./stockfish-patch-g++-9  ) =    1401754  +/- 7579
diff                             =       +774  +/- 8848

speedup        = +0.0006
P(speedup > 0) =  0.5680

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 

Base:  ./stockfish-master-g++-10
Test:  ./stockfish-patch-g++-10
run       base       test     diff
  1    1442944    1444219    +1275
  2    1416358    1390841   -25517
  3    1435230    1427599    -7631
  4    1438674    1439365     +691
  5    1415466    1420384    +4918
  6    1431917    1420048   -11869
  7    1405073    1413131    +8058
  8    1413797    1394291   -19506
  9    1435116    1418256   -16860
 10    1412465    1410140    -2325

Result of  10 runs
==================
base (./stockfish-master-g++-10) =    1424704  +/- 8273
test (./stockfish-patch-g++-10 ) =    1417827  +/- 10582
diff                             =      -6877  +/- 6962

speedup        = -0.0048
P(speedup > 0) =  0.0266

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 

Base:  ./stockfish-master-g++-11
Test:  ./stockfish-patch-g++-11
run       base       test     diff
  1    1427032    1421057    -5975
  2    1429187    1409919   -19268
  3    1411911    1417921    +6010
  4    1426014    1415689   -10325
  5    1426240    1411689   -14551
  6    1398410    1402116    +3706
  7    1407932    1405402    -2530
  8    1423644    1414909    -8735
  9    1429187    1408153   -21034
 10    1418144    1424208    +6064

Result of  10 runs
==================
base (./stockfish-master-g++-11) =    1419770  +/- 6481
test (./stockfish-patch-g++-11 ) =    1413106  +/- 4310
diff                             =      -6664  +/- 6165

speedup        = -0.0047
P(speedup > 0) =  0.0172

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 

Base:  ./stockfish-master-g++-12
Test:  ./stockfish-patch-g++-12
run       base       test     diff
  1    1416916    1418144    +1228
  2    1404305    1409477    +5172
  3    1407051    1408814    +1763
  4    1404963    1406941    +1978
  5    1401133    1405183    +4050
  6    1410693    1407932    -2761
  7    1404415    1403976     -439
  8    1417362    1420048    +2686
  9    1407822    1419935   +12113
 10    1410471    1428506   +18035

Result of  10 runs
==================
base (./stockfish-master-g++-12) =    1408513  +/- 3337
test (./stockfish-patch-g++-12 ) =    1412896  +/- 5051
diff                             =      +4382  +/- 3841

speedup        = +0.0031
P(speedup > 0) =  0.9872

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 

@mstembera
Copy link
Contributor Author

How do you suggest we proceed? Should I run another test? Maybe another STC w/ LTC bounds?

@Disservin
Copy link
Member

I can reproduce a speedup with latest gcc, profile build of x86-64-bmi2
gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

➜  src git:(master) ✗ ./bench_parallel.sh ./stockfish ./stockfish-pr-5125 20 10

sf_base =  1644140 +/-  20670 (95%)
sf_test =  1695293 +/-  15120 (95%)
diff    =    51153 +/-  22407 (95%)
speedup = 3.11124% +/- 1.363% (95%)

Though @Torom earlier posted some benchmarks for g++ 13.2 which look rather different.. did you do a profile build?

@Disservin
Copy link
Member

On a shorter depth bench the speedup is even larger..
➜ src git:(master) ✗ ./bench_parallel.sh ./stockfish ./stockfish-pr-5125 13 50

sf_base =  1560713 +/-   7500 (95%)
sf_test =  1674767 +/-   7859 (95%)
diff    =   114053 +/-   9184 (95%)
speedup = 7.30780% +/- 0.588% (95%)

@Torom
Copy link
Contributor

Torom commented Mar 23, 2024

Though @Torom earlier posted some benchmarks for g++ 13.2 which look rather different.. did you do a profile build?

Yes, these were all profile builds.
On the Raspberry Pi with gcc 13.2 it is interestingly enough also a significant speedup.

@Disservin
Copy link
Member

Disservin commented Mar 23, 2024

Even with clang++-18 I get a huge speedup.. (13 50)

sf_base =  1574218 +/-   6819 (95%)
sf_test =  1702423 +/-   5995 (95%)
diff    =   128205 +/-   7367 (95%)
speedup = 8.14405% +/- 0.468% (95%)
Compiled by                : clang++ 18.1.3 on Linux
Compilation architecture   : x86-64-bmi2
Compilation settings       : 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Ubuntu Clang 18.1.3 (++20240322073236+ef6d1ec07c69-1~exp1~20240322193248.98)

@Disservin
Copy link
Member

Though @Torom earlier posted some benchmarks for g++ 13.2 which look rather different.. did you do a profile build?

Yes, these were all profile builds. On the Raspberry Pi with gcc 13.2 it is interestingly enough also a significant speedup.

Also in the range of ~7% ?

@Torom
Copy link
Contributor

Torom commented Mar 23, 2024

armv8-dotprod gcc 13.2.0

Result of  30 runs
==================
base (...tockfish_gcc) =     312348  +/- 729
test (...unrollPR_gcc) =     324944  +/- 1236
diff                   =     +12596  +/- 1785

speedup        = +0.0403
P(speedup > 0) =  1.0000

CPU: 4 x aarch64
Hyperthreading: off

~4%

@Disservin
Copy link
Member

Ah didn't see that earlier, thanks

@vondele
Copy link
Member

vondele commented Mar 23, 2024

It is well possible that it is not just compiler but also specific CPU dependent. I no case there seems to be a measurable slowdown, so it seems that would all support merging.

@Disservin
Copy link
Member

Agreed.

@Disservin Disservin added the 🚀 gainer Gains elo label Mar 23, 2024
@Disservin
Copy link
Member

@mstembera I will merge this as is, probably add a comment to point out that we are manually unrolling the loop which seems to have a positive effect with some compilers/cpus. The speedup is pretty significant in my tests... was this also a speedup for you locally?
I tested it with my Ryzen 9 5950X.

@Disservin
Copy link
Member

@PGG106 could you test this with your cpu?

@PGG106
Copy link

PGG106 commented Mar 23, 2024

sure, compiled with make -j profile-build ARCH=x86-64-avx512 on a ryzen 9 7950X, gcc 11.4
20 10:

sf_base =  1947474 +/-  15148 (95%)
sf_test =  2006258 +/-  18821 (95%)
diff    =    58783 +/-   4737 (95%)
speedup = 3.01846% +/- 0.243% (95%)

13 50:

sf_base =  1804101 +/-   3419 (95%)
sf_test =  1954879 +/-   4033 (95%)
diff    =   150778 +/-   1881 (95%)
speedup = 8.35754% +/- 0.104% (95%)

@mstembera
Copy link
Contributor Author

mstembera commented Mar 23, 2024

@Disservin On my i9-7980XE it is neutral. The reason I thought it would be a speedup on some machines is because it is a similar idea to #4816 which was also a speedup but also neutral on my machine. I think @vondele is correct about it being CPU dependent. Thanks to everyone who tested!

@Disservin
Copy link
Member

Anyhow nice speedup for my setup (and others) ^^
Though fishtest did not make it so clear.. so I’m suggesting that in the future it might make sense to open a draft pr for possible speedups and then we can decide if it is worth it to merge it. In this case I would have probably merged it even if it failed fishtest. Though I suspect you would not have opened the pr if it failed fishtest..?

@mstembera
Copy link
Contributor Author

Thanks. Yes I would not have opened if it failed fishtest but if I think something really should be a speed up in the future I will do as you suggest. I just don't want to cause unnecessary work.

@Disservin
Copy link
Member

Yeah ^^ I find this one rather interesting because it benefits apparently newer cpu even with an up to date compiler. Though regarding speedups in general I'd like to keep them in sync with (somewhat) up to date software/hardware, meaning we shouldn't aim to optimize something for a particular old compiler

@mstembera
Copy link
Contributor Author

@Disservin Actually I just remembered that I have another unroll speedup that finished yellow so I never opened a PR.
https://tests.stockfishchess.org/tests/view/65a077f479aa8af82b96b850 Let me know what you think or if you want me to open a draft PR for it as you suggested above.

mstembera referenced this pull request in linrock/Stockfish Mar 23, 2024
@Disservin
Copy link
Member

Disservin commented Mar 24, 2024

@mstembera It seems I was fooled by git, sorry.. I didn't realize that this patch is one commit behind master, rebasing to master and comparing with master shows no speedup, so in reality I was just testing the effect of the #5130.

(I also tested the rebased UnrollPropagate2 locally but it also showed no speedup)

@Disservin
Copy link
Member

Disservin commented Mar 24, 2024

@Torom did you make the same mistake with the 4% speedup test or was that one genuinely a 4% speedup?

@Torom
Copy link
Contributor

Torom commented Mar 24, 2024

Since my tests were 3 days ago and the VVLTC tune was merged only two days ago, my tests against the master at that time are probably correct.

@Disservin
Copy link
Member

Ah indeed.

@PGG106
Copy link

PGG106 commented Mar 24, 2024

I can confirm my tests were incorrent in the same vein disservin's tests were.

@Disservin Disservin added the to be merged Will be merged shortly label Mar 26, 2024
@Disservin Disservin closed this in 5001d49 Mar 26, 2024
mstembera added a commit to mstembera/Stockfish that referenced this pull request Mar 27, 2024
No functional change
bench: 2103324
linrock pushed a commit to linrock/Stockfish that referenced this pull request Mar 27, 2024
Unroll update_accumulator_refresh to process two
active indices simultaneously.

The compiler might not unroll effectively because
the number of active indices isn't known at
compile time.

STC https://tests.stockfishchess.org/tests/view/65faa8850ec64f0526c4fca9
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 130464 W: 33882 L: 33431 D: 63151
Ptnml(0-2): 539, 14591, 34501, 15082, 519

closes official-stockfish#5125

No functional change
mstembera added a commit to mstembera/Stockfish that referenced this pull request Apr 2, 2024
No functional change
bench: 2103324
mstembera added a commit to mstembera/Stockfish that referenced this pull request Apr 2, 2024
mstembera added a commit to mstembera/Stockfish that referenced this pull request Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants