[RFC] Considerations for "Skill Level" #3635

Open
Sopel97 opened this issue Jul 31, 2021 · 36 comments

Comments

@Sopel97
Member

Sopel97 commented Jul 31, 2021

There was some discussion recently on Discord about possible improvements to the implementation of "Skill Level" in Stockfish. @vondele suggested trying eval randomization, in particular interpolating the NNUE eval with N(0, RookValueEg) using some parameter. I've implemented it and tested it briefly, but it requires more work to assess the quality of the games played and to calibrate the Elo rating. The initial results suggest that it might be a good direction.
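
For readers who want to picture the idea, here is a minimal sketch of one way such an interpolation could look. It is not the code from the linked commit: the linear blend, the percent-valued parameter, reading N(0, RookValueEg) as "standard deviation equal to RookValueEg", the constant 1380 and the function name are all assumptions made for illustration.

```python
import random

# Assumed value of RookValueEg in internal units (illustrative only).
ROOK_VALUE_EG = 1380


def perturbed_eval(nnue_eval, perturb_percent, rng):
    """Blend the real eval with Gaussian noise.

    perturb_percent = 0   -> pure NNUE eval
    perturb_percent = 100 -> pure noise, the real eval is ignored
    """
    # Assumption: N(0, RookValueEg) means mean 0, std dev RookValueEg.
    noise = rng.gauss(0.0, ROOK_VALUE_EG)
    return ((100 - perturb_percent) * nnue_eval + perturb_percent * noise) / 100


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so every run differs
    for p in (0, 10, 50, 100):
        print(p, round(perturbed_eval(250, p, rng), 1))
```

With this form, a setting of 0 returns the original score unchanged and a setting of 100 returns only noise.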

The experiment used pure NNUE evaluation at fixed nodes while varying the interpolation parameter. 46 configurations played a round-robin tournament with 50 games per pair. The following c-chess-cli command was used:

#!/bin/bash
 
c-chess-cli \
    -concurrency 16 \
    -rounds 1 \
    -games 50 \
    -openings file=/home/sopel/nnue/c-chess-cli/noob_3moves.epd order=random -repeat -resign 3 700 -draw 8 10 \
    -pgn tournament.pgn 2 \
    -each tc=1000+1 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_100k option.RandomEvalPerturb=0 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_90k option.RandomEvalPerturb=0 nodes=90000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_80k option.RandomEvalPerturb=0 nodes=80000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_70k option.RandomEvalPerturb=0 nodes=70000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_60k option.RandomEvalPerturb=0 nodes=60000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_50k option.RandomEvalPerturb=0 nodes=50000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_40k option.RandomEvalPerturb=0 nodes=40000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_30k option.RandomEvalPerturb=0 nodes=30000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_20k option.RandomEvalPerturb=0 nodes=20000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_10k option.RandomEvalPerturb=0 nodes=10000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_5k option.RandomEvalPerturb=0 nodes=5000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_4k option.RandomEvalPerturb=0 nodes=4000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_3k option.RandomEvalPerturb=0 nodes=3000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_2k option.RandomEvalPerturb=0 nodes=2000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_1k option.RandomEvalPerturb=0 nodes=1000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_1_100k option.RandomEvalPerturb=1 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_2_100k option.RandomEvalPerturb=2 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_3_100k option.RandomEvalPerturb=3 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_4_100k option.RandomEvalPerturb=4 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_5_100k option.RandomEvalPerturb=5 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_6_100k option.RandomEvalPerturb=6 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_7_100k option.RandomEvalPerturb=7 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_8_100k option.RandomEvalPerturb=8 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_9_100k option.RandomEvalPerturb=9 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_10_100k option.RandomEvalPerturb=10 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_12_100k option.RandomEvalPerturb=12 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_14_100k option.RandomEvalPerturb=14 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_16_100k option.RandomEvalPerturb=16 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_18_100k option.RandomEvalPerturb=18 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_20_100k option.RandomEvalPerturb=20 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_25_100k option.RandomEvalPerturb=25 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_30_100k option.RandomEvalPerturb=30 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_35_100k option.RandomEvalPerturb=35 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_40_100k option.RandomEvalPerturb=40 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_45_100k option.RandomEvalPerturb=45 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_50_100k option.RandomEvalPerturb=50 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_55_100k option.RandomEvalPerturb=55 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_60_100k option.RandomEvalPerturb=60 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_65_100k option.RandomEvalPerturb=65 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_70_100k option.RandomEvalPerturb=70 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_75_100k option.RandomEvalPerturb=75 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_80_100k option.RandomEvalPerturb=80 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_85_100k option.RandomEvalPerturb=85 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_90_100k option.RandomEvalPerturb=90 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_95_100k option.RandomEvalPerturb=95 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_100_100k option.RandomEvalPerturb=100 nodes=100000
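
The 46 engine entries above follow a regular pattern, so the list could also be generated instead of written by hand. The sketch below reproduces the same (RandomEvalPerturb, nodes) pairs and naming; the function names are hypothetical, and only the `-engine` arguments that already appear in the script are emitted.

```python
def engine_configs():
    """Return the (perturb, nodes) pairs used in the tournament script."""
    # Unperturbed engines at 100k..10k nodes, then 5k..1k nodes.
    node_counts = list(range(100_000, 0, -10_000)) + [5000, 4000, 3000, 2000, 1000]
    configs = [(0, n) for n in node_counts]
    # Perturbed engines, all at 100k nodes: 1..10, 12..20 step 2, 25..100 step 5.
    perturbs = list(range(1, 11)) + list(range(12, 21, 2)) + list(range(25, 101, 5))
    configs += [(p, 100_000) for p in perturbs]
    return configs


def engine_args(perturb, nodes, cmd="./engines/stockfish/stockfish"):
    """Build the -engine arguments for one configuration."""
    name = f"stockfish_pure_{perturb}_{nodes // 1000}k"
    return [
        "-engine",
        f"cmd={cmd}",
        f"name={name}",
        f"option.RandomEvalPerturb={perturb}",
        f"nodes={nodes}",
    ]


if __name__ == "__main__":
    configs = engine_configs()
    print(len(configs), "engines,", (len(configs) - 1) * 50, "games per engine")
    for perturb, nodes in configs:
        print(" ".join(engine_args(perturb, nodes)))
```

Each engine meets 45 opponents for 50 games each, which gives the 2250 games per player reported by Ordo.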

Naming: stockfish_pure_{RandomEvalPerturb}_{nodes}
Code: Sopel97@56a8a4f
Experiment results: https://drive.google.com/drive/folders/14SZEV6TICedYtNZ2Ym2sFCQynoTBMeQ0?usp=sharing (includes a .pgn with moves saved).
Ordo:

   # PLAYER                     :   RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)
   1 stockfish_pure_1_100k      :      9.8   19.7  1826.0    2250    81      84
   2 stockfish_pure_0_100k      :      0.0   ----  1810.5    2250    80      79
   3 stockfish_pure_2_100k      :     -8.1   19.6  1797.5    2250    80      51
   4 stockfish_pure_4_100k      :     -8.4   20.0  1797.0    2250    80      68
   5 stockfish_pure_0_90k       :    -13.1   19.7  1789.5    2250    80      65
   6 stockfish_pure_3_100k      :    -16.8   19.0  1783.5    2250    79      75
   7 stockfish_pure_5_100k      :    -23.2   19.7  1773.0    2250    79      93
   8 stockfish_pure_6_100k      :    -37.8   19.1  1749.0    2250    78      72
   9 stockfish_pure_0_80k       :    -43.2   19.0  1740.0    2250    77      56
  10 stockfish_pure_8_100k      :    -44.7   19.2  1737.5    2250    77      86
  11 stockfish_pure_7_100k      :    -54.8   19.0  1720.5    2250    76      85
  12 stockfish_pure_0_70k       :    -65.2   19.4  1703.0    2250    76      67
  13 stockfish_pure_9_100k      :    -69.6   19.0  1695.5    2250    75      55
  14 stockfish_pure_10_100k     :    -70.8   20.1  1693.5    2250    75      98
  15 stockfish_pure_0_60k       :    -89.6   19.2  1661.5    2250    74      63
  16 stockfish_pure_12_100k     :    -92.8   19.9  1656.0    2250    74     100
  17 stockfish_pure_0_50k       :   -126.1   19.1  1599.0    2250    71      71
  18 stockfish_pure_14_100k     :   -131.6   19.2  1589.5    2250    71     100
  19 stockfish_pure_16_100k     :   -174.4   19.8  1517.0    2250    67      59
  20 stockfish_pure_0_40k       :   -176.5   19.4  1513.5    2250    67     100
  21 stockfish_pure_18_100k     :   -210.8   19.9  1457.0    2250    65     100
  22 stockfish_pure_0_30k       :   -246.6   19.5  1400.0    2250    62      64
  23 stockfish_pure_20_100k     :   -250.2   20.2  1394.5    2250    62     100
  24 stockfish_pure_0_20k       :   -352.1   21.2  1248.5    2250    55      87
  25 stockfish_pure_25_100k     :   -365.2   20.6  1231.5    2250    55     100
  26 stockfish_pure_30_100k     :   -513.1   24.1  1067.0    2250    47     100
  27 stockfish_pure_0_10k       :   -587.6   25.5   999.5    2250    44     100
  28 stockfish_pure_35_100k     :   -669.2   27.1   933.5    2250    41     100
  29 stockfish_pure_40_100k     :   -831.1   30.0   816.5    2250    36      63
  30 stockfish_pure_0_5k        :   -836.2   29.1   813.0    2250    36     100
  31 stockfish_pure_0_4k        :   -927.1   31.5   752.0    2250    33      99
  32 stockfish_pure_45_100k     :   -966.2   31.4   726.5    2250    32     100
  33 stockfish_pure_0_3k        :  -1031.4   33.1   685.0    2250    30     100
  34 stockfish_pure_50_100k     :  -1161.1   34.6   607.0    2250    27     100
  35 stockfish_pure_0_2k        :  -1209.6   36.0   579.5    2250    26     100
  36 stockfish_pure_55_100k     :  -1333.4   38.0   514.0    2250    23     100
  37 stockfish_pure_0_1k        :  -1426.8   41.5   469.5    2250    21      97
  38 stockfish_pure_60_100k     :  -1464.0   42.1   453.0    2250    20     100
  39 stockfish_pure_65_100k     :  -1690.6   49.2   368.5    2250    16     100
  40 stockfish_pure_70_100k     :  -1936.4   62.2   298.5    2250    13     100
  41 stockfish_pure_75_100k     :  -2076.9   68.5   263.5    2250    12     100
  42 stockfish_pure_80_100k     :  -2360.9   84.0   201.0    2250     9     100
  43 stockfish_pure_85_100k     :  -2581.0   93.3   158.5    2250     7     100
  44 stockfish_pure_90_100k     :  -2913.0  115.3   106.0    2250     5     100
  45 stockfish_pure_95_100k     :  -3417.6  175.1    44.0    2250     2     100
  46 stockfish_pure_100_100k    :  -3677.9  186.7    10.0    2250     0     ---
 
White advantage = 8.06 +/- 1.85
Draw rate (equal opponents) = 55.79 % +/- 0.45
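
As a rough reading aid for the rating column, the sketch below converts a rating gap into an expected score. It assumes the conventional logistic Elo model with a 400-point scale; whether this matches the exact model settings used in the Ordo run above is an assumption, and the function name is hypothetical.

```python
def expected_score(elo_diff):
    """Expected score (win = 1, draw = 0.5) for a player rated elo_diff above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    for diff in (0, 100, 250, 500, 1000):
        print(diff, round(expected_score(diff), 3))
```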

Plots:

@vondele
Member

vondele commented Jul 31, 2021

I've played a few games against it for RandomEvalPerturb in the upper quarter, and could at least win.

Instead of percent I would probably use per mille, so that it can be tuned a bit more finely.

If we see that as an alternative to Skill Level (which has clear deficiencies), it would be interesting to know whether other places rely on Skill Level (e.g. Lichess?).

It would be nice if some people try to play a few games against it to see if that would be an interesting opponent for a suitable value of RandomEvalPerturb.

@scchess

scchess commented Jul 31, 2021

Thanks! I'm going to test it for a few games and write back here. Also, we need to establish an anchor chess engine to measure the Elo. What about measuring it against one of the Elo-calibrated Maia versions?

@scchess

scchess commented Jul 31, 2021

I played a few games. I could easily beat level 100, but lost to level 50 after some close games. Level 50 looked to be somewhere around the 2000 to 2200 level. Overall it looks cool.

@Sopel97
Member Author

Sopel97 commented Jul 31, 2021

100 has effectively completely random evaluation, so it's about as expected. Did you play the games at some time control or at fixed nodes like in my tests? Was anything particularly visible about the "style"?

@scchess

scchess commented Jul 31, 2021

I tried it with fixed 5-minute games, not fixed nodes. As a human playing against an AI, I could only manage a few games. It was clearly still "computer style", meaning there was strong consistency throughout the games. I'm not sure there is anything you can do about that. However, its ability to assess the position also became a little worse, which is welcome. Next, we should calibrate against an engine with a well-defined human rating, like Maia.

@tillchess

tillchess commented Aug 1, 2021

If we see that as an alternative for Skill Level (which has clear deficiencies), would be interesting to know if other places rely on Skill Level (e.g. Lichess?).

Lichess play vs. computer (levels 1-8) relies on search depth, UCI_Elo and move time values.
Since UCI_Elo and UCI_LimitStrength are used, Lichess therefore relies on Skill Level.

MichaelB7 added a commit to MichaelB7/Stockfish that referenced this issue Aug 17, 2021
…s her to connect the RandomEvalPerturb value and loosely connect it to a desired Elo based on his quick test results. Through testing, I was able to determine that the random function produces more reliable results using 400K nodes/move (as opposed to 100K nodes/move). Adjusted accordingly. This will be used in my Android Beth Harmon Chess app (a clone of DroidFish). The beauty of this method is that it uses move randomization, so a book is not needed but can still be used.

Also, turned off pure mode, as it had some undesired weaknesses at weaker levels.

Calculated RandomEvalPerturb (REP) value based on Elo: column (J)

      (A)      (B)      (C)       (D)      (E)       (F)      (G)       (H)      (I)      (J)  Sopel Tests official-stockfish#3635
Factor -1  UCI_Elo  (A)-(B)  Factor 2  (C)/(D)  Factor 3  (E)/(F)  Factor 4  (B)/(H)  (G)+(I)    # PLAYER                   REP    Elo
     3200     3000      200       2.8       71        10        7       225       13       20   23 stockfish_pure_20_100k    20   3011
     3200     2900      300       2.8      107        10       10       225       12       22
     3200     2800      400       2.8      142        10       14       225       12       26   25 stockfish_pure_25_100k    25   2896
     3200     2700      500       2.8      178        10       17       225       12       29   26 stockfish_pure_30_100k    30   2748
     3200     2600      600       2.8      214        10       21       225       11       32
     3200     2500      700       2.8      250        10       25       225       11       36   28 stockfish_pure_35_100k    35   2592
     3200     2400      800       2.8      285        10       28       225       10       38
     3200     2300      900       2.8      321        10       32       225       10       42   29 stockfish_pure_40_100k    40   2430
     3200     2200     1000       2.8      357        10       35       225        9       44   32 stockfish_pure_45_100k    45   2295
     3200     2100     1100       2.8      392        10       39       225        9       48
     3200     2000     1200       2.8      428        10       42       225        8       50   34 stockfish_pure_50_100k    50   2100  * Elo 2000 anchored to REP 50
     3200     1900     1300       2.8      464        10       46       225        8       54   36 stockfish_pure_55_100k    55   1928
     3200     1800     1400       2.8      500        10       50       225        8       58
     3200     1700     1500       2.8      535        10       53       225        7       60   38 stockfish_pure_60_100k    60   1797
     3200     1600     1600       2.8      571        10       57       225        7       64
     3200     1500     1700       2.8      607        10       60       225        6       66   39 stockfish_pure_65_100k    65   1570
     3200     1400     1800       2.8      642        10       64       225        6       70   40 stockfish_pure_70_100k    70   1325
     3200     1300     1900       2.8      678        10       67       225        5       72
     3200     1200     2000       2.8      714        10       71       225        5       76   41 stockfish_pure_75_100k    75   1184
     3200     1100     2100       2.8      750        10       75       225        4       79
     3200     1000     2200       2.8      785        10       78       225        4       82
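
The column arithmetic (A) through (J) in the table above can be written as a small function. This is a sketch that reproduces the printed values under the assumption that every intermediate result is truncated to an integer, which is what the listed numbers suggest; the function name is hypothetical and the factor 2.8 is expressed as 28/10 to keep the arithmetic in integers.

```python
def rep_from_elo(uci_elo):
    """Reproduce column (J) of the table: calculated RandomEvalPerturb for a UCI_Elo."""
    factor_1 = 3200  # (A)
    factor_3 = 10    # (F)
    factor_4 = 225   # (H)
    c = factor_1 - uci_elo  # (C) = (A) - (B)
    e = c * 10 // 28        # (E) = (C) / 2.8, truncated
    g = e // factor_3       # (G) = (E) / (F), truncated
    i = uci_elo // factor_4  # (I) = (B) / (H), truncated
    return g + i            # (J) = (G) + (I)


if __name__ == "__main__":
    for elo in range(3000, 900, -100):
        print(elo, rep_from_elo(elo))
```

For example, `rep_from_elo(2000)` gives 50, matching the row that is anchored to REP 50.
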
vondele added a commit to vondele/Stockfish that referenced this issue Sep 11, 2021
based on Sopel's initial implementation discussed in official-stockfish#3635

In this new scheme, the strength of the engine is limited by replacing a (varying) part of the evaluation
with a random perturbation. This scheme is easier to implement than our current skill level implementation,
and has the advantage that it has a wider Elo range, being both weaker than skill level 1 and stronger than skill level 19.

The skill level option is removed, and instead UCI_Elo and UCI_LimitStrength are the only options available.

UCI_Elo is calibrated such that 1500 Elo is equivalent in strength to the engine maia1 (https://lichess.org/@/maia1),
which has a blitz rating on lichess of 1500 (based on nearly 600k human games). The full Elo range (750 - 5200) is obtained by playing
games between engines roughly 100-200 Elo apart with the perturbation going from 0 to 1000, and fitting the ordo results. With this fit,
a conversion from UCI_Elo to the magnitude of the random perturbation is possible.
All games are played at lichess blitz TC (5m+3s), and playing strength is different at different TC.
Indeed, maia1 is a fixed 1-node leela 'search', independent of TC, whereas this scheme searches normally and improves with TC.

There are a few caveats. It is unclear what the playing style of the engine is like: the old skill level was not really satisfactory, and it remains to be seen whether this approach fixes that. Furthermore, while maia1 and SF@1500Elo are equivalent in strength in engine-engine matches (at blitz TC), it is not certain that its rating against humans will be the same (engine Elo and human Elo can be very different).

No functional change
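
To illustrate what "a conversion from UCI_Elo to the magnitude of the random perturbation" could look like, here is a sketch that inverts a set of calibration points by piecewise-linear interpolation. It is not the fit from this commit: the (Elo, REP) pairs are the anchored values from the fixed-node table earlier in this thread (percent scale, not the 0 to 1000 scale mentioned here), and the function name and the clamping at the ends are assumptions.

```python
# (anchored Elo, RandomEvalPerturb) pairs copied from the earlier table in this thread.
# Illustrative only: these are NOT the calibration data behind this commit.
CALIBRATION = [
    (1184, 75), (1325, 70), (1570, 65), (1797, 60), (1928, 55), (2100, 50),
    (2295, 45), (2430, 40), (2592, 35), (2748, 30), (2896, 25), (3011, 20),
]


def perturb_for_elo(target_elo, points=CALIBRATION):
    """Linearly interpolate the perturbation for a target Elo, clamped to the data range."""
    pts = sorted(points)
    if target_elo <= pts[0][0]:
        return float(pts[0][1])
    if target_elo >= pts[-1][0]:
        return float(pts[-1][1])
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= target_elo <= x1:
            t = (target_elo - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise AssertionError("unreachable: target lies inside the data range")


if __name__ == "__main__":
    for elo in (1200, 1500, 2000, 2500, 3000):
        print(elo, round(perturb_for_elo(elo), 1))
```

A smooth fitted curve would likely behave better than straight segments outside the measured range, which is why this sketch simply clamps there.
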
@lucasart

lucasart commented Sep 24, 2021

My experience is as follows:

  • blitz rating 1793 on lichess.
  • I score about 50% against SF level 5 on lichess. That's playing in real blitz conditions 5'+3" no take backs, no restarting games when blundering (just counter attack until SF counter blunders, same as I would against a human).

Most of the game, the computer plays OK. Weird moves, but nothing terrible. I win playing the way humans should play against a computer, by stifling the opponent like a boa constrictor.

Eventually, when I break through, and the materialistic computer can't find a way out without losing material, it stops resisting, and just throws away its pieces.

That is lame. A key skill in chess is learning to win a winning position in the most irrefutable way, leaving no counterplay to the opponent (i.e. so that even full-strength SF will eventually be forced to bend the knee). And you don't learn that if the opponent goes hara-kiri on you.

@Sopel97
Member Author

Sopel97 commented Sep 24, 2021

@lucasart have you tried playing the version described in this issue to see whether there's a difference in style?

@lucasart

lucasart commented Sep 27, 2021

@Sopel97 I will try when my computer is fixed. One thing I notice is that random eval perturbation + a depth (or nodes?) limit should suffice. Running a multi-PV search behind the scenes complicates things for no reason. I can't see the purpose of it once random eval is introduced. This could be an important simplification.

*nodes might be better than depth limits, because of endgame, where branching factors reduce and even humans can out-calculate depth limits easily.

@vondele
Member

vondele commented Sep 27, 2021

There is a more up-to-date branch which gets rid of multipv and has a calibrated scheme to set the user Elo here:
https://github.com/vondele/Stockfish/commits/rep
It is calibrated at 1500 Elo against the maia engine at lichess.

@lucasart

lucasart commented Sep 27, 2021

@vondele I tried it at 1500 Elo: 5+3 blitz, no takebacks, FRC (to reduce the opening knowledge factor and focus purely on gameplay; SF is bookless here). As expected, SF played slightly odd moves (positionally), but destroyed me tactically. I think we still need limits on depth or nodes in combination with eval pollution.

Another problem is aspiration search. Because the search returns polluted scores going all over the place, there is a lot of search inconsistency and re-searching going on. When using skill level, it probably makes sense to switch off aspiration and just search a +/-INF window directly. The user doesn't see the logs, so you might say it doesn't matter. But it changes time management, and that's the part visible to the user (i.e. lots of search instability makes SF use time more aggressively).

[Date "2021.09.27"]
[White "Human"]
[Black "vondele_rep"]
[Result "0-1"]
[FEN "bbqnnrkr/pppppppp/8/8/8/8/PPPPPPPP/BBQNNRKR w KQkq - 0 1"]
[GameDuration "00:11:41"]
[GameEndTime "2021-09-27T20:51:29.068 HKT"]
[GameStartTime "2021-09-27T20:39:47.861 HKT"]
[PlyCount "30"]
[SetUp "1"]
[Termination "time forfeit"]
[TimeControl "300+3"]

1. e4 Ne6 {-0.35/21 31s} 2. Nf3 {17s} Nf4 {0.00/21 28s} 3. Re1 {20s}
f5 {+0.74/20 26s} 4. exf5 {19s} e6 {+1.09/20 32s} 5. fxe6 {8.3s}
dxe6 {+1.30/19 25s} 6. Ne3 {15s} b5 {+1.10/20 17s} 7. Qd1 {35s} h5 {0.00/19 39s}
8. d4 {73s} Nd6 {+1.31/20 37s} 9. b3 {6.4s} Rh6 {+0.77/19 35s} 10. O-O {65s}
Ne4 {+3.19/20 23s} 11. c4 {35s} c5 {+0.96/20 29s} 12. dxc5 {2.8s}
Nh3+ {+1.57/17 8.3s} 13. gxh3 {20s} Rg6+ {+4.78/14 0.45s} 14. Ng2 {8.9s}
Rxg2+ {+1.94/19 6.1s} 15. Kxg2 {3.4s} Ng5 {+2.44/15 0.43s, White loses on time}
0-1

Strong chess players are probably laughing at this game ;-)

For the sake of comparison, here is the last game I played against maia5 on lichess in blitz 5+3: https://lichess.org/9pNVarMAimZ6. Basically a draw, and the engine blundered the king+pawn endgame.

@vondele
Member

vondele commented Sep 27, 2021

@lucasart thanks for playing. Yes, I agree that tactically the engine is still pretty strong, i.e. it will likely notice if a piece is hanging, and will definitely see if there is a mate in N available (with N much larger than what e.g. a 1500 Elo player will see). I've also noticed the time management aspect.

Maybe limiting depth is an option, one could also think about excessive pruning/LMR.

The Elo calibration is really tricky. 1500 Elo matches maia5 quite well in a direct match; against humans, however, it seems clearly stronger.

@tillchess

@vondele I played your version at 1500 with a fixed 1 sec/move time control. At this TC it blundered pieces very often. At 1700 UCI_Elo the engine played better.

@lucasart

@tillchess in this context we are referring to 1500 Elo on the lichess blitz rating scale. This translates to 1000 Elo blitz on chess.com. Perhaps you are much stronger than that?

@kayn1208

Here's my game against UCI_Elo=1900 https://lichess.org/IGGdlkJB

@lucasart

lucasart commented Sep 29, 2021

Here's my game against UCI_Elo=1900 https://lichess.org/IGGdlkJB

25... Qg3?? is a seppuku move again. You were already winning, of course, but at least the computer was making you work for it, creating threats on your king and preventing exchanges to hinder your progress.

@kayn1208

@lucasart I noticed that it uses more time than usual (it got really low on time on move 20+). Also, the pawn blunders in the opening are kind of strange; I don't think a 1900 would miss that.

@lucasart

lucasart commented Oct 7, 2021

I've done some experimentation in Demolito, and got some fairly satisfying results with the following scheme:

  • Level L goes from 1 to 15. It does not use UCI_Elo, but a transformation function (or mapping table) could be calibrated.
  • depth limit: L <= 10 ? L : 2 * L - 10. This prevents low levels from finding inhuman tactics and deep mates. It also means that much less eval noise is needed to achieve the same Elo, resulting in a more "normal" playing style.
  • nodes limit: 2 ^ (L + 5). This is only meaningful for high levels, L >= 9 or so, because node limits are only checked every 5ms in Demolito (in a separate thread). Stockfish checks nodes differently, but I still don't think it can enforce small node limits, so both depth and nodes are needed. The point of the node limit is that between L=9 and L=15 we want the node count to only double per level, whereas the depth limit is incremented by 2. So the depth limit binds less and the node limit binds more, which means more adaptive depth (i.e. long sequences can be calculated in king+pawn endgames with a low effective branching factor, but not so long in complex middlegames, which emulates human tactical abilities a bit more closely).
  • noise added to each eval, drawn from a logistic distribution with mean zero and scale factor s = PawnEndgameValue * 0.8 ^ (L - 1) * phaseFactor, where phaseFactor = 1.0 when the side to move has all pieces and phaseFactor = 0.5 when stm has no pieces (excl. King and Pawns), linearly in between. The point of phaseFactor is to strengthen endgame play relative to the middlegame, to mitigate the fact that depth limitation makes the engine weaker in the endgame relative to the middlegame. The reason for using the logistic distribution is that it has fat tails, so you don't need to make the scale absurdly large to draw large tail values. That way, instead of polluting most evals, you only pollute a few, which is more human-like (sporadic large mistakes rather than pervasive medium mistakes). There are many distributions out there; this one is just the simplest one I know to formulate mathematically that does the job (fat tails).
  • another important thing is that it must be non-deterministic, so you want the seed of the PRNG to be unpredictable, and not reset to a fixed value on every game. That way you get an opponent which always plays differently, and you don't just end up repeating the same opening moves all the time until you can win by trial and error (restart game when you blunder, memorize, repeat). Of course, you could make the argument that non-determinism should come from the GUI+OpeningBook random selection. But I wanted it to work nicely without an opening book. After all, opening book knowledge is part of chess skill in human play. So you don't want level 1 to play state-of-the-art opening theory and then utter garbage when out of book.
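
The formulas in the list above can be sketched as follows. This is an illustration, not Demolito's code: the function names, the `piece_fraction` input (1.0 with all pieces, 0.0 with none) and the example pawn value are assumptions, and the logistic sample is drawn by inverting the CDF.

```python
import math
import random


def depth_limit(level):
    """L <= 10 ? L : 2 * L - 10"""
    return level if level <= 10 else 2 * level - 10


def node_limit(level):
    """2 ^ (L + 5), read as a power of two."""
    return 2 ** (level + 5)


def noise_scale(level, pawn_endgame_value, piece_fraction):
    """s = PawnEndgameValue * 0.8 ^ (L - 1) * phaseFactor.

    phaseFactor goes linearly from 0.5 (no pieces) to 1.0 (all pieces).
    """
    phase_factor = 0.5 + 0.5 * piece_fraction
    return pawn_endgame_value * 0.8 ** (level - 1) * phase_factor


def logistic_noise(scale, rng):
    """Draw from a zero-mean logistic distribution via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() never returns 1.0
        u = rng.random()
    return scale * math.log(u / (1.0 - u))


if __name__ == "__main__":
    rng = random.SystemRandom()  # OS entropy, so games are not repeatable
    for level in (1, 5, 10, 15):
        s = noise_scale(level, pawn_endgame_value=100, piece_fraction=1.0)
        print(level, depth_limit(level), node_limit(level),
              round(s, 1), round(logistic_noise(s, rng), 1))
```

Using `random.SystemRandom` is one way to get the unpredictable seeding asked for in the last bullet; any entropy-seeded generator would do.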

Now, we must have realistic expectations about this. It will never be as good as Maia at being human-like. The point here is just to code something that is simple, and somewhat reasonable. Applying a Maia strategy is certainly very interesting, but is a separate project on its own.

@niklasf
Contributor

niklasf commented Oct 7, 2021

By the way playing against weak levels of Stockfish is extremely popular on lichess.org, especially with beginners. I think it's because facing an AI in non-competitive play adds much less pressure than facing a human opponent.

Even level 0 is currently too strong for players that are just starting out. So we're using a patched version that extends the range (fairy-stockfish/Fairy-Stockfish@2329160 + fairy-stockfish/Fairy-Stockfish@f451358).

@lucasart

lucasart commented Oct 8, 2021

Yes, Skill Level=0 is too strong for beginners. It's hard to put ourselves in the shoes of beginners, once we have some experience in chess. So I've asked an actual beginner to play 2 games against Demolito Level=1, and she couldn't even beat it (although she came close). Stockfish Level=0 is >500 elo stronger than that...

Rank Name           Elo    +    - games score oppo. draws 
   1 Demolito_L4    789   30   30  1000   86%   396    5% 
   2 Stockfish_L1   618   30   30   800   68%   405    6% 
   3 Stockfish_L0   533   29   29   800   61%   405    5% 
   4 Demolito_L3    527   24   24  1000   56%   449    9% 
   5 Demolito_L2    303   27   27  1000   30%   493    7% 
   6 Demolito_L1      0   43   43  1000    5%   554    3% 

@tillchess

@lucasart This thread started with a new scheme to weaken Stockfish and remove the current pick_best method (which chooses a suboptimal move) .

@vondele
Member

vondele commented Apr 30, 2022

For reference, this is the setup used on LiChess https://github.com/lichess-org/fishnet/blob/master/src/api.rs#L208

@tillchess

For reference, this is the setup used on LiChess https://github.com/lichess-org/fishnet/blob/master/src/api.rs#L208

The Lichess website uses Fairy Stockfish with a different Skill Level range from -20 to 20.

@niklasf
Contributor

niklasf commented Apr 30, 2022

Skills 0 to 20 in Fairy-Stockfish are equivalent to Stockfish; the negative values extend the range in a straightforward way.

We didn't put too much thought into the parameters on Lichess, basically just trying to get to a point where complete beginners have a shot at beating the lowest level, and then building a mostly arbitrary progression from the lowest to the strongest level.

The weakened play at -20 certainly doesn't feel particularly natural. Patches to improve it (like this one and maybe #3777, cc @xefoci7612) would be awesome. We could deploy patched versions on Lichess, for human testing, but I am not sure if we can get actionable feedback from that.

@xefoci7612

@niklasf in case you decide to give #3777 a try, please note there is this little fix to apply.

@PedanticHacker
Contributor

PedanticHacker commented Jun 13, 2022

It would be even more practical for people to have Stockfish skill levels offered in terms of Novice (1000 Elo), Experienced (1600 Elo), Master (2200 Elo), Grandmaster (2600 Elo), World Champion (maximum Elo). Something like that.

Does this idea sound interesting to you?

@PedanticHacker
Contributor

My rationale is that “Skill Level 0”, for example, doesn’t clearly express what strength Stockfish will play at. Maybe “Skill Level: Novice (1000 Elo)” is more human-readable?

@tillchess

My rationale is that “Skill Level 0”, for example, doesn’t clearly express what strength Stockfish will play at. Maybe “Skill Level: Novice (1000 Elo)” is more human-readable?

Skill 0 isn't 1000 Elo at all and it makes more sense to leave skill values as numbers.

@PedanticHacker
Contributor

Okay, but then everyone will have to guess what Elo Stockfish plays at in Skill Level XY mode. What Elo does, for example, Skill Level 0 represent? If not 1000, what then? This will quickly lead to confusion.


Also, I have a question, unrelated to skill level, just to clear up a confusion of mine. What is this?
option name UCI_Elo type spin default 1350 min 1350 max 2850
Does this option mean that Stockfish plays at Elo 1350 by default if the option is not overridden to, say, 2850?

@dav1312
Contributor

dav1312 commented Jun 18, 2022

What Elo does, for example, Skill Level 0 represent? If not 1000, what then?

Skill Level   Elo
          0  1347
          1  1490
          2  1597
          3  1694
          4  1785
          5  1871
          6  1954
          7  2035
          8  2113
          9  2189
         10  2264
         11  2337
         12  2409
         13  2480
         14  2550
         15  2619
         16  2686
         17  2754
         18  2820
         19  2886
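
The values above follow a smooth curve that a simple power law reproduces. The sketch below is a fit under stated assumptions, not a quotation of the Stockfish source: the constants 1346.6, 143.4 and 0.806 are not given in this thread; they are believed to match the UCI_Elo-to-skill conversion used in Stockfish 15.x, and they reproduce every row of the table after rounding. The function names are illustrative.

```python
# Assumed constants (not quoted in this thread); they reproduce the table above.
BASE_ELO = 1346.6
SCALE = 143.4
EXPONENT = 0.806


def skill_to_elo(level: float) -> float:
    """Approximate Elo for a given Skill Level under the assumed power law."""
    return BASE_ELO + SCALE * level ** EXPONENT


def elo_to_skill(elo: float) -> float:
    """Inverse mapping, clamped to the 0..20 Skill Level range."""
    # Clamp before the fractional power so that ratings below BASE_ELO
    # do not produce a complex number.
    scaled = max(elo - BASE_ELO, 0.0) / SCALE
    return min(20.0, scaled ** (1.0 / EXPONENT))


for level in (0, 5, 10, 15, 19):
    print(level, round(skill_to_elo(level)))
```

Under this fit, levels 19 and 20 map to ratings above 2850, the upper bound of the UCI_Elo option shown earlier.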

Does this option mean that Stockfish plays at Elo 1350 by default

No, because UCI_LimitStrength is false by default, and it needs to be set to true for UCI_Elo to work.
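
As a minimal sketch of what a GUI or script would send to enable a target rating, the helper below builds the two `setoption` commands. The option names and the 1350 to 2850 range are taken from the option line quoted above; the helper name `limit_strength_commands` is hypothetical.

```python
def limit_strength_commands(elo: int) -> list[str]:
    """Build the UCI commands that switch on strength limiting at a given Elo."""
    # Range taken from the quoted option line: min 1350, max 2850.
    if not 1350 <= elo <= 2850:
        raise ValueError("UCI_Elo must be between 1350 and 2850")
    return [
        # Without this, UCI_Elo is ignored and the engine plays at full strength.
        "setoption name UCI_LimitStrength value true",
        f"setoption name UCI_Elo value {elo}",
    ]


for command in limit_strength_commands(1500):
    print(command)
```

The commands would be written to the engine's standard input after the initial `uci` handshake and before starting a search.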

@PedanticHacker
Contributor

Ah, thank you for all of the given information. This is very helpful.

@PedanticHacker
Contributor

I have, however, one additional question now. Since Skill Level 19 is 2886 Elo and Stockfish’s maximum Elo is 2850 — how can Skill Level 19 be at such a high Elo to surpass Stockfish’s maximum?

@GerryHickman

I'm using the 'stockfish-15-1.el8.x86_64' rpm from my distro and the 'gnome-chess' front end with skill set to "easy". I get a good game, sometimes winning, sometimes losing. BUT, when I start winning, Stockfish seems to "give up" and starts throwing away pieces. This is counterintuitive; it would be good if it could play at normal strength (or stronger) when it's losing. This would make it more realistic.

@SchulzKilian

I know we are going not only for play "character" but more for strength, but a big difference I see between beginner play and very low-level engines is that the engine's moves seem almost random, while beginner play mostly has plans, just not thought out long-term: "How can I attack the opponent's queen now? How do I give check?" So I do think that the suggested way of limiting depth might simulate something in that direction.

@chromi

chromi commented Apr 23, 2023

I think we can learn something from how other (open source) engines approach strength limitation. One striking example is Rodent IV, an engine designed very specifically to exhibit recognisable playing styles (defined by personalities) rather than aiming for maximum playing strength. Unfettered, CCRL puts it right at 3000 Elo, which ain't bad, considering.

Rodent can also be configured through UCI options for an Elo rating as low as 800, which it implements by three primary mechanisms:

  1. Limiting the depth of opening-book knowledge it may use. Basically the engine enforces dropping out of book at a game move dependent on the selected Elo. Lower-rated players tend to have less detailed opening knowledge, so this mechanism makes sense as part of strength reduction.
  2. Once out of book, a search speed limit is implemented by inserting millisecond sleeps whenever the actual search rate exceeds the appropriate value. This value is calculated as an exponential function of the selected Elo rating, and ranges from the dozens to the millions of nodes per second over the supported Elo range, and results in a reduced search depth which varies fairly appropriately with the time control. It's perhaps noteworthy that this mechanism is inherently more energy-efficient than Stockfish's "artificial centipawn loss" method.
  3. For very low ratings (below about 1500), "evaluation blurring" is also applied at the leaves of the search tree, rather than at the root as Stockfish does. This is accomplished by adding a function of the Zobrist hash to the normal evaluation, so it's consistent even on re-searches of the same node. Again, the scale of the blurring is a function of the selected Elo.
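
Mechanisms 2 and 3 can be sketched as follows. This is an illustration of the ideas as described, not Rodent's actual code: the Elo range, the node-rate endpoints, the 1500 threshold's blur scale, the hash-mixing constant, and all function names are assumptions chosen for the example.

```python
# Hypothetical tuning values, chosen only to illustrate the shape of the curves.
MIN_ELO, MAX_ELO = 800, 3000
MIN_NPS, MAX_NPS = 50.0, 2_000_000.0   # "dozens to millions" of nodes per second
BLUR_BELOW_ELO = 1500                   # blurring only applies below this rating
MAX_BLUR_CP = 200                       # blur amplitude in centipawns at MIN_ELO
MASK64 = 0xFFFFFFFFFFFFFFFF


def target_nps(elo: int) -> float:
    """Search speed limit as an exponential function of the selected Elo."""
    clamped = min(max(elo, MIN_ELO), MAX_ELO)
    t = (clamped - MIN_ELO) / (MAX_ELO - MIN_ELO)
    return MIN_NPS * (MAX_NPS / MIN_NPS) ** t


def throttle_delay(nodes_searched: int, elapsed_s: float, elo: int) -> float:
    """Seconds to sleep so the average search rate falls back to the target."""
    time_budget = nodes_searched / target_nps(elo)
    return max(0.0, time_budget - elapsed_s)


def blur_amplitude(elo: int) -> int:
    """Blur scale in centipawns: zero at 1500 and above, growing as Elo drops."""
    if elo >= BLUR_BELOW_ELO:
        return 0
    clamped = max(elo, MIN_ELO)
    return round(MAX_BLUR_CP * (BLUR_BELOW_ELO - clamped) / (BLUR_BELOW_ELO - MIN_ELO))


def blurred_eval(eval_cp: int, zobrist_key: int, elo: int) -> int:
    """Add noise derived from the position hash to a leaf evaluation.

    The noise depends only on the key and the Elo setting, so a re-search of
    the same node sees the same blurred value.
    """
    amplitude = blur_amplitude(elo)
    if amplitude == 0:
        return eval_cp
    mixed = (zobrist_key * 0x9E3779B97F4A7C15) & MASK64  # simple multiplicative mix
    noise = (mixed >> 32) % (2 * amplitude + 1) - amplitude
    return eval_cp + noise
```

In a search loop, the engine would call `throttle_delay` periodically and sleep for the returned duration, and would apply `blurred_eval` only where the static evaluation is computed at the leaves.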

Rodent does not support tablebases, but instead has specific knowledge of certain simple endgames (KQK, KRK, KPK, KBNK, and KBBK, IIRC). Since this knowledge is implemented by way of specialised evaluation functions, it is affected to some extent by the second and third strength-limitation mechanisms noted above.

The above mechanisms work fairly well, I think, for limiting strength to the club-player range. Club players have usually developed a definite sense of how to play chess, but are limited in how deeply they can analyse a position or recall their theory. So an engine that plays good, but not so deeply analysed, moves is a good match. Nobody paying attention would be fooled that the playing style is human-like, but it shouldn't feel outright wrong to play against.

Novice players are, I think, a different matter, and could require a different approach entirely for satisfactory results. Novices might not even be familiar with all the basic rules of chess yet, such as en-passant capture, underpromotion of pawns, or even castling; the only opening theory they might know is one or two recommended first moves for White (and what then?). Perhaps they've heard some vague guidelines such as the basic material values, pushing pawns forward and arranging them diagonally for protection, moving pieces into the centre, and keeping the king safely in the corner. A whole lot of the nuances that are encoded into a strong engine's evaluation functions just wouldn't occur to them.

So, for the weakest levels of play, I think you would get a much more realistic style of play with some of the following ideas:

  1. Exclude underpromotion, en-passant capture, and/or castling from the move generator during the search (unless they are somehow the only legal moves), but still accept them as legal moves when supplied or booked. En-passant and underpromotion occur fairly rarely in any case, so castling is probably the "special" move to cut out at the lowest Elo threshold relative to the others.
  2. Consider only "natural" moves: immediate checks, captures, threats of undefended pieces, pawn advances, defences of hanging pieces, or simply developing pieces into the centre or towards the opposing king. Don't bother looking for moves which control weak squares or increase mobility; those are advanced concepts. This mirrors the kinds of moves that a novice would focus on, before they learn about things like piece coordination or positional play.
  3. Assume the opponent plays some proportion of null-moves during the search; use a consistent PRNG such as derivation from the Zobrist hash. This allows the engine to exhibit a coherent "plan" several moves deep, while still neglecting threats and responses that an engine would otherwise find difficult to ignore.
  4. Do not consult tablebases at all. Rely on the search and the middlegame heuristics to resolve the endgame.
  5. Opening books should be shallow, and be tuned to lead to the kinds of openings taught in introductory texts, not to grandmaster lines.
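
Ideas 1 and 3 from the list above can be sketched as below. This is a toy illustration, not Stockfish code: the `Move` class, its flags, the function names, and the hash-mixing constant are all hypothetical, and a real engine would derive the flags from its own move encoding.

```python
from dataclasses import dataclass

MASK64 = 0xFFFFFFFFFFFFFFFF


@dataclass(frozen=True)
class Move:
    """Hypothetical move record carrying only the flags this sketch needs."""
    uci: str
    is_castling: bool = False
    is_en_passant: bool = False
    is_underpromotion: bool = False


def novice_moves(legal_moves: list[Move]) -> list[Move]:
    """Idea 1: drop 'special' moves during search unless nothing else is legal."""
    ordinary = [
        m for m in legal_moves
        if not (m.is_castling or m.is_en_passant or m.is_underpromotion)
    ]
    return ordinary if ordinary else list(legal_moves)


def opponent_passes(zobrist_key: int, pass_percent: int) -> bool:
    """Idea 3: decide from the position hash whether the opponent 'plays' a null move.

    Deriving the decision from the key keeps it consistent across re-searches
    of the same position, in place of a free-running random generator.
    """
    mixed = (zobrist_key * 0x9E3779B97F4A7C15) & MASK64  # simple multiplicative mix
    return (mixed >> 32) % 100 < pass_percent


moves = [Move("e1g1", is_castling=True), Move("g1f3"), Move("e5d6", is_en_passant=True)]
print([m.uci for m in novice_moves(moves)])
```

The fallback in `novice_moves` matters: without it, a position whose only legal move is, say, an en-passant capture would look like stalemate or checkmate to the search.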

Whether it even makes sense to graft such things into Stockfish, I have no idea.

amchess added a commit to amchess/ShashChess that referenced this issue Nov 9, 2023
Fix native build on linux
Eliminated Stockfish handicap mode and replaced with
another better one, based  on an idea of and Michael Byrne
Thanks to
Tomasz Sobczyk official-stockfish/Stockfish#3635
Michael Byrne MichaelB7/Stockfish@18480ca
@GerryHickman

GerryHickman commented Dec 7, 2024

I originally tested this on Stockfish 15.1, then on 16, and now on 17.2.
I still see the issue where Stockfish throws away the queen for "no reason".
I have three *.pgn files (attached) showing the exact moves that were made.
I had to rename the *.pgn files to *.pgn.txt to be able to upload them here.
lose_queen.pgn.txt
lose_queen2.pgn.txt
lose_queen3.pgn.txt
