-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a standardized, consistent benchmark for hardware performance testing. #5354
Conversation
clang-format 18 needs to be run on this PR. (execution 11084913654 / attempt 1) |
Overall I'm in favor. Quibbles:
Either way I think this will be a big benefit to have |
I can see LP cores on new intel laptop chips being an issue, but it will need some testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea, but I don't have an opinion yet about the entire idea of a common benchmark. In the past we always said that a user should configure bench correctly.. so what do we actually get from this?
Already possible:
- threads can be set for bench
- movetime can be configured
- hash can be configured
- custom positions can be configured as well
New:
- Sends ucinewgame before every game
- ?
Advertising this has a hardware benchmark? I'm not sure...
- Regarding the naming, I'm unsure I think it'd be a bit confusing to have bench and benchmark..
- Regarding the implementation, can be ignored for now to avoid unnecessary effort, I think it makes more sense to have this complete decoupled from uci.. meaning a new wrapper for the sole purpose of running this benchmark, which has it's entrypoint in main.cpp and exists right after.
Reason: I'd like to avoid leaking settings which were set for the benchmark to the engine or at least make it only be invocable from the cli.
The biggest issue with this is the gross lack of suitable "realistic play" conditions in regular bench, namely FENs. It's highly nontrivial for the average user to come up with a set of FENs vaguely resembling real gameplay. Also, too much configuration = users not able to consistently hit the target -- no useful "standard". If we added the positions in this PR as, say, |
Stockfish is already being used as a hardware benchmark. A standardized way to do it would be nice. However, it doesn't have to be this. A note in the docs about a suggested way to benchmark would do. |
Some ideas to be considered
basically, you want to extract/replay the uci commands from a game played (debug log) ? |
well, yes, though I'd argue misleadingly
yea, on one hand it's less diversity, on the other it's more realistic. Hard choice under time constraints.
I don't consider such hardware relevant to be honest, doesn't matter much if it gets 100k nps or 1M.
and how did this work out? consider this more about standardization than functionality
should be done for
solid consideration
alright, maybe would be better indeed, will have to see how large it would have to be on typical hardware
I think the only difference is in the frontend? wouldn't change anything measurable afaik
right, possibly explore some time curve
would be very close, yea, but not sure if exactly matching them is desirable |
What you consider relevant has nothing to do with the fact that users will run this on such low hardware, and if they have bad experiences with default settings on crap hardware, that still affects their overall opinion of Stockfish. It's something we need to be careful about, no matter how much we don't actually care about such hardware. That said, a per-thread default is likely a big improvement, probably the best way to do it.
exactly, exactly.
In my experience ("manually kibitizing" tcec games), sending just FENs is perfectly fine. Stockfish still reuses TT with no issue whether or not it has move history.
good point, but probably not needed for a first version of this thing. The two major concerns, at present, are 1) standardizing a performance test, and 2) getting a game-realistic spread of positions. The improvements from these two things are much greater than the improvement of adding game-realistic thinking time. This is ultimately a speed-test, not an elo-test, and we can get reliable speed measurements without worrying about thinking time. (Note "reliable" not "accurate", just needs to be comparable rather than having a non-arbitrary meaning) |
Alright, let's revive because otherwise I'm just gonna forget about this. More diagnostics has been added so the result looks like this
No easy way to get the system configuration so that's left out. Renamed to Silenced all engine output. On a fresh windows console this was actually close to 10% performance difference. With how the buffering is done this is potentially all variance depending on the setup, which we want to minimize. Added a small warmup step, currently 3 first positions. Split setting options and running the bench. Mostly due to warmup interfering with the logic here. Did a rough regression on some fishtest LTC to figure out a good time curve. Ended up with this:
This is objectionable. I agree with some comments earlier that it's some what desirable. I don't think it matters that much what exact curve we use, as long as there's some higher weight for early-mid game positions. The movetimes end up looking like this
I have not tested if it's more stable than bench yet. I'll do it once I have some idle time on my machine. Other tests are welcome too. I think the list of positions is final, unless something bad pops up. |
I have played around with it, and most things look good to me (e.g. positions, scheme to select time etc). I still need to do some benchmarking on larger hardware, but it looks like it gives pretty stable nps. A few comments for consideration.
The current runtime is ~5min, which is a bit long. I suggest default runtime and hash are reduced 2x. (sites like openbenchmarking run at least 3 times to measure the variance, with additional rounds until the average is well converged). For the user input right now it is Finally, personally I don't like so much the |
Some benchmark results on 288 threads:
< 1% variance, which I think is very good for a threaded run. |
Alright, I'll rename it to
yea, I think that's a nice solution, slightly outside of this PR but probably best to do it now
good idea
I thought that this isn't really important information, but I guess it doesn't hurt to include it, if only as a sanity check.
The current hashfull calculation is not very useful, because it completely ignores prior searches. Would have to use a custom implementation to get a more meaningful value (ignore generation). It's not ideal because there's some positions that are no longer reachable, but better than getting 3-5% hashfull just because the current search is small. Would it be reasonable?
Would
alright, should still be fairly reasonable, the last searches would be around 250ms. is this being considered for sf17? provided I finish it in a timely manner |
I don't think we should do something special, but aren't we sending essentially the same uci commands as we would do during a game?
I don't think so, to be tried.
While it would have been nice, I'd rather take the conservative approach, and merge it as one of the first things of SF17dev. |
yes, and the hashful has the same meaning, so imo meaningless. It's fine for singular position analysis, but as soon as you do start playing games it's worthless. But sure, can include. |
yeah, but that still means it is fine to compare to https://github.com/official-stockfish/Stockfish/wiki/Useful-data#elo-cost-of-small-hash which was captured during game play. |
I'd be happy to see this merged sooner than later, would be great if you could implement the last missing pieces. |
Just to recall here that we discussed on discord a possible alternative name for this: |
Personally I don't think speedtest Is a good name. Speedtest might imply that what is being tested is the speed of stockfish itself. While with benchmark it is a standard name that means test the performance of this hardware. As benchmark is already used by bench, perf seems more standard to me. Also it is shorter. |
4e7df95
to
0236ca1
Compare
Ok, I think I addressed everything now, leaving commits for clarity on what was changed since then. I will squash for the final merge. Output looks like this now
I felt compelled to include more detailed hashfull information, since the basic hashfull doesn't paint the whole picture for gameplay. |
Is it intended to have |
I have tested the
|
good catch |
What is the default thread count? Maybe for this it would make sense to default to all threads? |
it does default to all threads 3 runs on almost idle 7800x3d
NPS stddev: 8360 (0.05%) |
Some final things from my point of view, if this is fixed, please undraft and it can be merged.
|
Thread count as in thread used or available? |
I understand that it may be a little too much. Would it perhaps make more sense to just display hashful as is commonly understood along with total TT occupation (i.e. the amount of entries touched during the search)? Otherwise I'll remove that output.
that's what I wanted to use initially but it doesn't output a normalized value
makes sense
good point. it may be best to actually use the numa output we get normally for more info |
I would just provide the output that our standard hashfull provide, or none at all. Also for the user experience, I think having nodes/second as the last line output makes sense, it somehow is the final result of the run.
usual numa ouput is good. |
Alright, current candidate
If the |
thanks, looks good to me. Please squash in one commit with a good commit message and undraft. |
`speedtest [threads] [hash_MiB] [time_s]`. `threads` default to system concurrency. `hash_MiB` defaults to `threads*128`. `time_s` defaults to 150. Intended to be used with default parameters, as a stable hardware benchmark. Example: ``` C:\dev\stockfish-master\src>stockfish.exe speedtest Stockfish dev-20240928-nogit by the Stockfish developers (see AUTHORS file) info string Using 16 threads Warmup position 3/3 Position 258/258 =========================== Version : Stockfish dev-20240928-nogit Compiled by : g++ (GNUC) 13.2.0 on MinGW64 Compilation architecture : x86-64-vnni256 Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT Compiler __VERSION__ macro : 13.2.0 Large pages : yes User invocation : speedtest Filled invocation : speedtest 16 2048 150 Available processors : 0-15 Thread count : 16 Thread binding : none TT size [MiB] : 2048 Hash max, avg [per mille] : single search : 40, 21 single game : 631, 428 Total nodes searched : 2099917842 Total search time [s] : 153.937 Nodes/second : 13641410 ``` ------------------------------- Small unrelated tweaks: - Network verification output is now handled as a callback. - TT hashfull queries allow specifying maximum entry age. No functional changes.
c592215
to
5a71324
Compare
`speedtest [threads] [hash_MiB] [time_s]`. `threads` default to system concurrency. `hash_MiB` defaults to `threads*128`. `time_s` defaults to 150. Intended to be used with default parameters, as a stable hardware benchmark. Example: ``` C:\dev\stockfish-master\src>stockfish.exe speedtest Stockfish dev-20240928-nogit by the Stockfish developers (see AUTHORS file) info string Using 16 threads Warmup position 3/3 Position 258/258 =========================== Version : Stockfish dev-20240928-nogit Compiled by : g++ (GNUC) 13.2.0 on MinGW64 Compilation architecture : x86-64-vnni256 Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT Compiler __VERSION__ macro : 13.2.0 Large pages : yes User invocation : speedtest Filled invocation : speedtest 16 2048 150 Available processors : 0-15 Thread count : 16 Thread binding : none TT size [MiB] : 2048 Hash max, avg [per mille] : single search : 40, 21 single game : 631, 428 Total nodes searched : 2099917842 Total search time [s] : 153.937 Nodes/second : 13641410 ``` ------------------------------- Small unrelated tweaks: - Network verification output is now handled as a callback. - TT hashfull queries allow specifying maximum entry age. closes official-stockfish#5354 No functional change.
This is an idea I need more feedback on. Recently, while testing the new NUMA awareness code, we've discovered multiple issues with the current common ways to test performance.
So far, everyone relies on either search from
startpos
orbench
for performance testing. It's all flawed in some way or another. Searching fromstartpos
is not representative of the overall average performance in games, andbench
is tuned for single-threaded execution - variance hits +-15% on high thread counts.The idea is to have a single simple command that by default (no specific arguments) tests the maximum performance attainable on the target machine in common workloads. The purpose would be presence on popular benchmarks, like ipmanchess, openbenchmarking, and potentially more in the future.
Replacing
bench
is not desirable, as it serves its purpose well, so a new command would be introduced. The current working name isbenchmark
.Operation outline:
The bench has the same operating principle as the current bench, executing
go
commands on a preselected set of positions.Position selection:
Settings selection:
movetime
per position, to minimize effects of nondeterministic multithreaded search as much as possible. Selected as1000
ms, to take a bit less than 5 minutes in total.Other considerations:
ucinewgame
before every gamego
to search end is measuredI need some feedback on this direction, whether it's desired, and if so whether the implementation is in the desired shape, before further testing and tuning.
The choice of positions may need to change, we need to find a set of 5 or at most 6 games that produce minimal variance across runs while providing good coverage of positions. It is also important that we avoid positions that lead to search explosions that take long time to resolve, or otherwise reach near-cyclic (fortress) setups, or positions that reach maximum depth and terminate early. The current set of positions is preliminary, remains untested.