[RFC] Considerations for "Skill Level" #3635
Comments
I've played a few games against it with RandomEvalPerturb in the upper quarter, and could at least win. Instead of percent I would probably use per mille, so it can be tuned a bit more finely. If we see this as an alternative to Skill Level (which has clear deficiencies), it would be interesting to know whether other places rely on Skill Level (e.g. Lichess?). It would be nice if some people tried playing a few games against it to see whether it would make an interesting opponent at a suitable value of RandomEvalPerturb.
Thanks! I'm going to test it over a few games and write back here. Also, we need to establish an anchor chess engine to measure the Elo. What about measuring it against one of the Elo-calibrated Maia versions?
I played a few games. I could easily beat level 100, but suffered losses to level 50 after some close games. It looked like level 50 is somewhere around the 2000 to 2200 level. Overall it looks cool.
100 has an effectively completely random evaluation, so that's about as expected. Did you play the games at some time control, or at fixed nodes like in my tests? Anything particularly visible about the "style"?
I tried it with fixed 5-minute games, not nodes. As a human playing vs the AI, I could only do a few games. It was clearly still "computer style", meaning there was strong consistency throughout the games. Not sure there is anything you can do about that. However, the ability to assess the position also became a little worse, which is welcome. Next, we should calibrate against an engine with a well-defined human rating, like Maia.
Lichess playing vs computer levels 1-8 relies on search depth, UCI_Elo and move time values.
…s the user to take the RandomEvalPerturb value and loosely connect it to a desired Elo, based on his quick test results. Through testing, I was able to determine the random function produces more reliable results using 400K nodes/move (as opposed to 100K nodes/move). Adjusted accordingly. This will be used in my Android Beth Harmon Chess app (a clone of DroidFish). The beauty of this method is that it uses move randomization so that a book is not needed but can still be used. Also, turned off pure mode, as that did have some undesired weakness at weaker levels.

Calculated RandomEvalPerturb (REP) value based on Elo:

| (A) Factor 1 | (B) UCI_Elo | (C) (A)-(B) | (D) Factor 2 | (E) (C)/(D) | (F) Factor 3 | (G) (E)/(F) | (H) Factor 4 | (I) (B)/(H) | (J) (G)+(I) | Adj. REP | Sopel test (official-stockfish#3635) | REP | Elo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3200 | 3000 | 200 | 2.8 | 71 | 10 | 7 | 225 | 13 | 20 | 23 | stockfish_pure_20_100k | 20 | 3011 |
| 3200 | 2900 | 300 | 2.8 | 107 | 10 | 10 | 225 | 12 | 22 | | | | |
| 3200 | 2800 | 400 | 2.8 | 142 | 10 | 14 | 225 | 12 | 26 | 25 | stockfish_pure_25_100k | 25 | 2896 |
| 3200 | 2700 | 500 | 2.8 | 178 | 10 | 17 | 225 | 12 | 29 | 26 | stockfish_pure_30_100k | 30 | 2748 |
| 3200 | 2600 | 600 | 2.8 | 214 | 10 | 21 | 225 | 11 | 32 | | | | |
| 3200 | 2500 | 700 | 2.8 | 250 | 10 | 25 | 225 | 11 | 36 | 28 | stockfish_pure_35_100k | 35 | 2592 |
| 3200 | 2400 | 800 | 2.8 | 285 | 10 | 28 | 225 | 10 | 38 | | | | |
| 3200 | 2300 | 900 | 2.8 | 321 | 10 | 32 | 225 | 10 | 42 | 29 | stockfish_pure_40_100k | 40 | 2430 |
| 3200 | 2200 | 1000 | 2.8 | 357 | 10 | 35 | 225 | 9 | 44 | 32 | stockfish_pure_45_100k | 45 | 2295 |
| 3200 | 2100 | 1100 | 2.8 | 392 | 10 | 39 | 225 | 9 | 48 | | | | |
| 3200 | 2000 | 1200 | 2.8 | 428 | 10 | 42 | 225 | 8 | 50 | 34 | stockfish_pure_50_100k | 50 | 2100 * |
| 3200 | 1900 | 1300 | 2.8 | 464 | 10 | 46 | 225 | 8 | 54 | 36 | stockfish_pure_55_100k | 55 | 1928 |
| 3200 | 1800 | 1400 | 2.8 | 500 | 10 | 50 | 225 | 8 | 58 | | | | |
| 3200 | 1700 | 1500 | 2.8 | 535 | 10 | 53 | 225 | 7 | 60 | 38 | stockfish_pure_60_100k | 60 | 1797 |
| 3200 | 1600 | 1600 | 2.8 | 571 | 10 | 57 | 225 | 7 | 64 | | | | |
| 3200 | 1500 | 1700 | 2.8 | 607 | 10 | 60 | 225 | 6 | 66 | 39 | stockfish_pure_65_100k | 65 | 1570 |
| 3200 | 1400 | 1800 | 2.8 | 642 | 10 | 64 | 225 | 6 | 70 | 40 | stockfish_pure_70_100k | 70 | 1325 |
| 3200 | 1300 | 1900 | 2.8 | 678 | 10 | 67 | 225 | 5 | 72 | | | | |
| 3200 | 1200 | 2000 | 2.8 | 714 | 10 | 71 | 225 | 5 | 76 | 41 | stockfish_pure_75_100k | 75 | 1184 |
| 3200 | 1100 | 2100 | 2.8 | 750 | 10 | 75 | 225 | 4 | 79 | | | | |
| 3200 | 1000 | 2200 | 2.8 | 785 | 10 | 78 | 225 | 4 | 82 | | | | |

\* Elo 2000 anchored to REP 50
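Reading off the factor columns, the whole table reduces to a single closed-form mapping. A minimal sketch of that calculation (my reading of the table above, not code from the commit):

```cpp
// REP(Elo) = floor((3200 - Elo) / 2.8 / 10) + floor(Elo / 225),
// where 3200, 2.8, 10 and 225 are the table's Factors 1-4.
// Reproduces column (J): e.g. repForElo(2000) == 50, repForElo(3000) == 20.
int repForElo(int uciElo) {
    int e = static_cast<int>((3200 - uciElo) / 2.8); // column (E)
    int g = e / 10;                                  // column (G)
    int i = uciElo / 225;                            // column (I)
    return g + i;                                    // column (J)
}
```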
Based on Sopel's initial implementation discussed in official-stockfish#3635. In this new scheme, the strength of the engine is limited by replacing a (varying) part of the evaluation with a random perturbation. This scheme is easier to implement than our current skill level implementation, and has the advantage of a wider Elo range, being both weaker than skill level 1 and stronger than skill level 19. The skill level option is removed; instead, UCI_Elo and UCI_LimitStrength are the only options available. UCI_Elo is calibrated such that 1500 Elo is equivalent in strength to the engine maia1 (https://lichess.org/@/maia1), which has a blitz rating on lichess of 1500 (based on nearly 600k human games). The full Elo range (750 - 5200) is obtained by playing games between engines roughly 100-200 Elo apart, with the perturbation going from 0 to 1000, and fitting the ordo results. With this fit, a conversion from UCI_Elo to the magnitude of the random perturbation is possible. All games are played at the lichess blitz TC (5m+3s), and playing strength differs at different TCs. Indeed, maia1 is a fixed 1-node leela 'search', independent of TC, whereas this scheme searches normally and improves with TC. There are a few caveats: it is unclear what the playing style of the engine is like; the old skill level was not really satisfactory, and it needs to be seen whether this approach fixes that. Furthermore, while in the engine-engine matches maia1 and SF@1500Elo are equivalent in strength (at blitz TC), it is not certain that its rating against humans will be the same (engine Elo and human Elo can be very different).

No functional change
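To make the scheme concrete, here is a minimal sketch of the idea as described in this thread: interpolate the real evaluation with noise drawn from N(0, RookValueEg), with the interpolation weight derived from UCI_Elo. The function names and the linear Elo-to-weight mapping are my assumptions for illustration (the actual branch fits this mapping from ordo results), and the rook value is approximate; this is not the patch itself.

```cpp
#include <algorithm>
#include <random>

using Value = int; // Stockfish's internal evaluation units

constexpr double kRookValueEg = 1380.0; // approximate Stockfish rook endgame value

// Hypothetical stand-in for the ordo-fitted conversion described above:
// map UCI_Elo in [750, 5200] to a perturbation weight in [0, 1000].
int perturbWeight(int uciElo) {
    return std::clamp((5200 - uciElo) * 1000 / (5200 - 750), 0, 1000);
}

// Interpolate the NNUE eval with noise drawn from N(0, RookValueEg).
// weight = 0 leaves the eval untouched; weight = 1000 makes it fully random.
Value perturbedEval(Value nnueEval, int weight, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, kRookValueEg);
    double blended = (nnueEval * (1000 - weight) + noise(rng) * weight) / 1000.0;
    return static_cast<Value>(blended);
}
```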
My experience is as follows:
Most of the game, the computer plays OK. Weird moves, but nothing terrible. I win playing the way humans should play against a computer: by stifling the opponent like a boa constrictor. Eventually, when I break through and the materialistic computer can't find a way out without losing material, it stops resisting and just throws away its pieces. That is lame. A key skill in chess is learning to win a winning position in the most irrefutable way, leaving no counterplay to the opponent (i.e. so that even full-strength SF will eventually be forced to bend the knee). And you don't learn that if the opponent goes hara-kiri on you.
@lucasart have you tried playing the version described in this issue to see whether there's a difference in style?
@Sopel97 I will try when my computer is fixed. One thing I notice is that random eval perturbation plus a depth (or nodes?) limit should suffice. Running a MultiPV search behind the scenes complicates things for no reason; I can't see the purpose of it once random eval is introduced. This could be an important simplification. *Nodes might be better than depth limits because of endgames, where branching factors shrink and even humans can easily out-calculate depth-limited engines.
There is a more up-to-date branch which gets rid of MultiPV and has a calibrated scheme to set the user's Elo here:
@vondele I tried it at 1500 Elo: 5+3 blitz, no takebacks, FRC (to reduce the opening-knowledge factor and focus purely on gameplay; SF is bookless here). As expected, SF played slightly odd moves (positionally) but destroyed me tactically. I think we still need depth or node limits in combination with eval pollution. Another problem is aspiration search. Because the search returns polluted scores going all over the place, there is a lot of search inconsistency and many re-searches going on. When using skill level, it probably makes sense to switch off aspiration and just search the +/-INF window directly. The user doesn't see the logs, so you might say it doesn't matter. But it changes time management, and that's the part visible to the user (i.e. lots of search instability makes SF use time more aggressively).
Strong chess players are probably laughing at this game ;-) For the sake of comparison, here is the last game I played against maia5 on lichess in blitz 5+3: https://lichess.org/9pNVarMAimZ6. Basically a draw, and the engine blundered the king+pawn endgame.
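In engine terms, lucasart's aspiration suggestion above amounts to skipping the narrow window when strength limiting is active. A hedged sketch of that idea (simplified pseudo-Stockfish; the margin value and function names are illustrative, not the actual search code):

```cpp
// With a polluted eval, aspiration windows centred on the previous score
// keep failing high/low and trigger re-searches, which distorts time
// management. With limited strength, search the full window instead.
constexpr int VALUE_INFINITE = 32001;

int search(int alpha, int beta, int depth); // assumed search entry point

int rootSearch(int prevScore, int depth, bool limitedStrength) {
    int alpha = -VALUE_INFINITE, beta = VALUE_INFINITE; // full window by default
    if (!limitedStrength) {
        alpha = prevScore - 17; // narrow aspiration window around the previous
        beta  = prevScore + 17; // score (real code widens on fail-high/low)
    }
    return search(alpha, beta, depth);
}
```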
@lucasart thanks for playing. Yes, I agree the engine is still pretty strong tactically, i.e. it will likely notice if a piece is hanging, and will definitely see if there is a mate in N available (with N much larger than what e.g. a 1500 Elo player would see). I've also seen the time management aspect. Maybe limiting depth is an option; one could also think about excessive pruning/LMR. The Elo calibration is really tricky: 1500 Elo matches maia5 in a direct match quite well; however, for humans it seems clearly stronger.
@vondele I played your version at 1500 with a fixed 1 sec/move time control. At this TC it blundered pieces very often. At 1700 UCI_Elo the engine played better.
@tillchess in this context we are referring to 1500 Elo on the lichess blitz rating scale. This translates to about 1000 Elo blitz on chess.com. Perhaps you are much stronger than that?
Here's my game against UCI_Elo=1900 https://lichess.org/IGGdlkJB |
25... Qg3?? is a seppuku move again. You were already winning, of course, but at least the computer was making you work for it, creating threats on your king and preventing exchanges to hinder your progress.
@lucasart I noticed that it uses more time than usual (it got really low on time past move 20). Also, the pawn blunders in the opening are kinda strange; I don't think a 1900 would miss that.
I've done some experimentation in Demolito, and got some fairly satisfying results with the following scheme:
Now, we must have realistic expectations about this. It will never be as good as Maia at being human-like. The point here is just to code something that is simple and somewhat reasonable. Applying a Maia-style strategy is certainly very interesting, but it is a separate project of its own.
By the way, playing against weak levels of Stockfish is extremely popular on lichess.org, especially with beginners. I think it's because facing an AI in non-competitive play adds much less pressure than facing a human opponent. Even level 0 is currently too strong for players that are just starting out. So we're using a patched version that extends the range (fairy-stockfish/Fairy-Stockfish@2329160 + fairy-stockfish/Fairy-Stockfish@f451358).
Yes.
@lucasart This thread started with a new scheme to weaken Stockfish and remove the current pick_best method (which chooses a suboptimal move).
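For readers unfamiliar with it, here is a simplified paraphrase of how a pick_best-style weakening works: choose among the MultiPV candidates under random noise whose magnitude grows as the skill level drops, so weaker moves are picked more often at lower levels. This illustrates the concept only; the constants and structure are not Stockfish's actual Skill::pick_best code.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

struct RootMove { int score; /* the real struct also carries the move */ };

// Conceptual pick_best: compare MultiPV candidates under random noise.
std::size_t pickBestIndex(const std::vector<RootMove>& rootMoves, int level,
                          std::mt19937& rng) {
    const int weakness = 120 - 2 * level; // more noise at lower levels
    std::uniform_int_distribution<int> noise(0, std::max(weakness, 1));
    std::size_t bestIndex = 0;
    int bestValue = std::numeric_limits<int>::min();
    for (std::size_t i = 0; i < rootMoves.size(); ++i) {
        const int value = rootMoves[i].score + noise(rng); // noisy comparison
        if (value > bestValue) { bestValue = value; bestIndex = i; }
    }
    return bestIndex;
}
```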
For reference, this is the setup used on Lichess: https://github.com/lichess-org/fishnet/blob/master/src/api.rs#L208
The Lichess website uses Fairy-Stockfish with an extended Skill Level range, from -20 to 20.
Skill levels 0 to 20 in Fairy-Stockfish are equivalent to Stockfish's; the negative values extend the range in a straightforward way. We didn't put too much thought into the parameters on Lichess: basically we were just trying to get to a point where complete beginners have a shot at beating the lowest level, and then building a mostly arbitrary progression from the lowest to the strongest level. The weakened play at -20 certainly doesn't feel particularly natural. Patches to improve it (like this one, and maybe #3777, cc @xefoci7612) would be awesome. We could deploy patched versions on Lichess for human testing, but I am not sure we can get actionable feedback from that.
@niklasf in case you decide to give #3777 a try, please note there is this little fix to apply.
It would be even more practical for people if Stockfish skill levels were offered in terms of Novice (1000 Elo), Experienced (1600 Elo), Master (2200 Elo), Grandmaster (2600 Elo), and World Champion (maximum Elo), or something like that. Does this idea sound interesting to you?
My rationale is that "Skill Level 0", for example, doesn't clearly express what kind of strength Stockfish will play at. Maybe "Skill Level: Novice (1000 Elo)" is more human-readable?
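As a sketch, such presets could live in a GUI or wrapper as a simple label-to-UCI_Elo table. The tier names and Elo values come from the suggestion above; nothing like this exists in Stockfish itself, and the top value is only a placeholder for "maximum":

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical presets pairing human-readable labels with UCI_Elo values.
// "World Champion" would simply leave UCI_LimitStrength off; 3200 here is
// a placeholder, since the actual cap is engine- and version-dependent.
const std::vector<std::pair<std::string, int>> kSkillPresets = {
    {"Novice",         1000},
    {"Experienced",    1600},
    {"Master",         2200},
    {"Grandmaster",    2600},
    {"World Champion", 3200},
};
```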
Skill 0 isn't 1000 Elo at all, and it makes more sense to leave skill values as numbers.
Okay, but then everyone will have to guess at what Elo Stockfish plays when in Skill Level XY mode. What Elo does, for example, Skill Level 0 represent? If not 1000, what then? This will quickly lead to confusion. Also, I have a question just to clear up a confusion of mine, not related to skill level. Anyway, what is this?
No, because UCI_LimitStrength is false by default, and it needs to be set to true for UCI_Elo to work.
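Concretely, both options must be set together for the limit to take effect (standard Stockfish UCI options; 1500 is just an example value):

```
setoption name UCI_LimitStrength value true
setoption name UCI_Elo value 1500
go movetime 1000
```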
Ah, thank you for all the information. This is very helpful.
I have, however, one additional question now. Since Skill Level 19 is 2886 Elo and Stockfish’s maximum Elo is 2850 — how can Skill Level 19 be rated so high that it surpasses Stockfish’s maximum?
I'm using the 'stockfish-15-1.el8.x86_64' rpm from my distro and the 'gnome-chess' front end with the skill set to "easy". I get a good game, sometimes winning, sometimes losing. BUT, when I start winning, Stockfish seems to "give up" and starts throwing away pieces. This is counterintuitive; it would be good if it could play at normal strength (or stronger) when it's losing. This would make it more realistic.
I know we are not only going for play "character" but more for strength, but a big difference I see between beginner play and very low-level engines is that the engine's moves seem almost random, while beginner play mostly follows plans, just ones not thought out long-term ("How can I attack the opponent's queen now? How can I give check?..."). So I do think that the suggested way of limiting depth might simulate something in that direction.
I think we can learn something from how other (open source) engines approach strength limitation. One striking example is Rodent IV, an engine designed very specifically to exhibit recognisable playing styles (defined by personalities) rather than aiming for maximum playing strength. Unfettered, CCRL puts it right at 3000 Elo, which ain't bad, considering. Rodent can also be configured through UCI options for an Elo rating as low as 800, which it implements by three primary mechanisms:
Rodent does not support tablebases, but instead has specific knowledge of certain simple endgames (KQK, KRK, KPK, KBNK, and KBBK, IIRC). Since this knowledge is implemented by way of specialised evaluation functions, it is affected to some extent by the second and third strength-limitation mechanisms noted above. The above mechanisms work fairly well, I think, for limiting strength to the club-player range. Club players have usually developed a definite sense of how to play chess, but are limited in how deeply they can analyse a position or recall their theory. So an engine that plays good, but not so deeply analysed, moves is a good match. Nobody paying attention would be fooled that the playing style is human-like, but it shouldn't feel outright wrong to play against. Novice players are I think a different matter, and could require a different approach entirely for satisfactory results. Novices might not even be familiar with all the basic rules of chess yet, such as en-passant capture, underpromotion of pawns, or even castling; the only opening theory they might know is one or two recommended first moves for White (and what then?). Perhaps they've heard some vague guidelines such as the basic material values, pushing pawns forward and arranging them diagonally for protection, moving pieces into the centre, and keeping the king safely in the corner. A whole lot of the nuances that are encoded into a strong engine's evaluation functions just wouldn't occur to them. So, for the weakest levels of play, I think you would get a much more realistic style of play with some of the following ideas:
Whether it even makes sense to graft such things into Stockfish, I have no idea.
Fix native build on linux. Eliminated Stockfish handicap mode and replaced it with another, better one, based on an idea discussed in official-stockfish/Stockfish#3635. Thanks to Tomasz Sobczyk and Michael Byrne. MichaelB7/Stockfish@18480ca
I originally tested this on Stockfish 15.1, then on 16, and now on 17.2.
There was some discussion on Discord recently about possible improvements to the implementation of "Skill Level" in Stockfish. @vondele suggested trying eval randomization, in particular interpolating the NNUE eval with
N(0, RookValueEg)
using some parameter. I've implemented it and tested it briefly, but it requires more work to assess the quality of the games played and to calibrate the Elo rating. The initial results suggest that it might be a good direction. The experiment I performed was to use pure NNUE evaluation at fixed nodes and vary the interpolation parameter. 46 configurations played a round-robin tournament with 50 games in each pair. The following c-chess-cli command was used:
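The command itself did not survive the copy; a representative invocation, assuming c-chess-cli's usual engine/option syntax and the branch's RandomEvalPerturb UCI option, might have looked roughly like this (with one -engine entry per configuration):

```
c-chess-cli -each cmd=./stockfish nodes=100000 \
    -engine name=stockfish_pure_20_100k option.RandomEvalPerturb=20 \
    -engine name=stockfish_pure_25_100k option.RandomEvalPerturb=25 \
    -games 50 -concurrency 8 -pgn out.pgn
```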
Naming:
stockfish_pure_{RandomEvalPerturb}_{nodes}
Code: Sopel97@56a8a4f
Experiment results: https://drive.google.com/drive/folders/14SZEV6TICedYtNZ2Ym2sFCQynoTBMeQ0?usp=sharing (includes a .pgn with moves saved).
Ordo ratings and plots: included in the results folder linked above.