-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unroll update_accumulator_refresh() #5125
Conversation
No functional change bench: 2109005
clang-format 17 needs to be run on this PR. (execution 8371882331 / attempt 1) |
I’d be glad if someone could run this through a speedup script for various compilers |
|
|
I have grouped the Fishtest result by compiler and major compiler versions (sorted by number of games):
|
Mixed results for me, nothing stands out particularly.
|
How do you suggest we proceed? Should I run another test? Maybe another STC w/ LTC bounds? |
I can reproduce a speedup with latest gcc, profile build of
Though @Torom earlier posted some benchmarks for g++ 13.2 which look rather different.. did you do a profile build? |
On a shorter depth bench the speedup is even larger..
|
Yes, these were all profile builds. |
Even with clang++-18 I get a huge speedup.. (
|
Also in the range of ~7% ? |
~4% |
Ah didn't see that earlier, thanks |
It is well possible that it is not just compiler but also specific CPU dependent. I no case there seems to be a measurable slowdown, so it seems that would all support merging. |
Agreed. |
@mstembera I will merge this as is, probably add a comment to point out that we are manually unrolling the loop which seems to have a positive effect with some compilers/cpus. The speedup is pretty significant in my tests... was this also a speedup for you locally? |
@PGG106 could you test this with your cpu? |
sure, compiled with
13 50:
|
@Disservin On my i9-7980XE it is neutral. The reason I thought it would be a speedup on some machines is because it is a similar idea to #4816 which was also a speedup but also neutral on my machine. I think @vondele is correct about it being CPU dependent. Thanks to everyone who tested! |
Anyhow nice speedup for my setup (and others) ^^ |
Thanks. Yes I would not have opened if it failed fishtest but if I think something really should be a speed up in the future I will do as you suggest. I just don't want to cause unnecessary work. |
Yeah ^^ I find this one rather interesting because it benefits apparently newer cpu even with an up to date compiler. Though regarding speedups in general I'd like to keep them in sync with (somewhat) up to date software/hardware, meaning we shouldn't aim to optimize something for a particular old compiler |
@Disservin Actually I just remembered that I have another unroll speedup that finished yellow so I never opened a PR. |
@mstembera It seems I was fooled by git, sorry.. I didn't realize that this patch is one commit behind master, rebasing to master and comparing with master shows no speedup, so in reality I was just testing the effect of the #5130. (I also tested the rebased UnrollPropagate2 locally but it also showed no speedup) |
@Torom did you make the same mistake with the 4% speedup test or was that one genuinely a 4% speedup? |
Since my tests were 3 days ago and the VVLTC tune was merged only two days ago, my tests against the master at that time are probably correct. |
Ah indeed. |
I can confirm my tests were incorrent in the same vein disservin's tests were. |
No functional change bench: 2103324
Unroll update_accumulator_refresh to process two active indices simultaneously. The compiler might not unroll effectively because the number of active indices isn't known at compile time. STC https://tests.stockfishchess.org/tests/view/65faa8850ec64f0526c4fca9 LLR: 2.93 (-2.94,2.94) <0.00,2.00> Total: 130464 W: 33882 L: 33431 D: 63151 Ptnml(0-2): 539, 14591, 34501, 15082, 519 closes official-stockfish#5125 No functional change
No functional change bench: 2103324
…onal change bench: 1823302
…onal change bench: 1823302
Unroll update_accumulator_refresh() to process two active indices simultaneously. (The compiler can't unroll effectively because the number of active indices isn't known at compile time.)
STC https://tests.stockfishchess.org/tests/view/65faa8850ec64f0526c4fca9
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 130464 W: 33882 L: 33431 D: 63151
Ptnml(0-2): 539, 14591, 34501, 15082, 519
No functional change
bench: 2109005