Replies: 1 comment
-
It appears, from the OP's presentation above, that they might be asking about the discerning power of the engine for differences that the low-level chess logic of its core rules should determine (like symmetry operations). Well, I don't know about the "random" experiment; I should read more. But from reading the abstract, and setting aside the code word "metamorphic" (which is not about geology), I welcome where this kind of thinking might go. A key question, perhaps more general than the OP's, and one this paper may or may not touch directly, is that the only measure we have of confidence in SF's search of a single input position for the best next move is never a direct test of that; it is always inferred from whole-game outcome statistics derived from engine-specific game pools. Accuracy, failure with respect to what? All good questions, and rarely asked, I would claim, at my admittedly unclear level of knowledge. I am not an engine developer, but I like to think about the chess model that the engine actually implements for our human consumption, a difficult thing to grasp when one is not a developer oneself. I would need to read further to understand the full scope of the questions. I like that there is a clear Section 5 about their research questions. Thanks for proposing this; stimulating.
-
Hi,
I want to share and discuss this paper, available here:
https://www.sciencedirect.com/science/article/pii/S0950584923001179?dgcid=rss_sd_all
with supplementary material available:
https://github.com/MMH1997/MT_ChessEngines
The paper claims that "The experiments demonstrate the usefulness of our approach to identify issues in the latest version of the widely recognised to be the best chess engine: Stockfish (version 15, released in April 2022). Our tool is flexible and can be easily extended with metamorphic relations that can be defined in the future by either us or other users."
The overall idea is to create a variant of a given chess position that is (almost) equivalent to the original, and then verify that the two positions receive consistent evaluations. For instance, given a chess position, rotating all the pieces with respect to the central axis should lead to the same evaluation. Another example considered in the paper is to replace a queen with either a rook or a bishop of the same color, and then verify that the evaluation obtained from the original (source) position is better than the one corresponding to the follow-up input. Examples of positions that change the evaluation (to the point that the side with a clear advantage changes, or shorter mates are found) are reported in the paper.
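To make the idea of a metamorphic relation concrete, here is a minimal sketch (not the authors' tool) of two such checks using the python-chess library and a local Stockfish binary. The engine path, the use of `Board.mirror()` as the symmetry transformation, the example FEN, and the strict equality/ordering tests are my assumptions for illustration, not details taken from the paper.

```python
# Sketch of two metamorphic checks for a UCI engine, assuming python-chess
# and a Stockfish binary on PATH. The exact transformations and tolerances
# used in the paper may differ.
import chess
import chess.engine

STOCKFISH_PATH = "stockfish"  # assumed location of the engine binary
DEPTH = 10                    # search depth reportedly used in the experiments

def side_to_move_cp(engine, board):
    """Evaluation in centipawns from the side to move's point of view."""
    info = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
    return info["score"].relative.score(mate_score=100_000)

def check_mirror_relation(engine, fen):
    """Equality relation: a position and its colour-mirrored twin
    should receive the same side-to-move evaluation."""
    board = chess.Board(fen)
    mirrored = board.mirror()  # flip vertically, swap colours, turn, castling rights
    return side_to_move_cp(engine, board) == side_to_move_cp(engine, mirrored)

def check_queen_to_rook_relation(engine, fen, square):
    """Ordering relation: replacing a queen with a rook of the same colour
    should not improve the evaluation for that colour."""
    board = chess.Board(fen)
    queen = board.piece_at(square)
    assert queen is not None and queen.piece_type == chess.QUEEN
    weakened = board.copy()
    weakened.set_piece_at(square, chess.Piece(chess.ROOK, queen.color))

    def cp_for(b):
        info = engine.analyse(b, chess.engine.Limit(depth=DEPTH))
        return info["score"].pov(queen.color).score(mate_score=100_000)

    return cp_for(board) >= cp_for(weakened)

if __name__ == "__main__":
    engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    try:
        # Hypothetical test input: an ordinary legal middlegame position.
        fen = "r1bq1rk1/ppp2ppp/2np1n2/2b1p3/2B1P3/2NP1N2/PPP2PPP/R1BQ1RK1 w - - 0 7"
        print("mirror relation holds:", check_mirror_relation(engine, fen))
        print("queen->rook relation holds:",
              check_queen_to_rook_relation(engine, fen, chess.D1))
    finally:
        engine.quit()
```

In practice one would probably allow a small centipawn tolerance rather than strict equality, since even a nominally deterministic fixed-depth search can show minor score differences between two searches that explore different transpositions.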
As far as I understand, depth=10 was used for all the experiments.
The authors conclude with "Although we do not have concrete evidence, we can still speculate on the potential main cause for the detected failures. Chess engines explore moves in a specific order during the search process. However, if this order is different, for example, between a position and its symmetrical position, or if there is any randomness introduced within this order at any point, it could lead to differences in the evaluation due to variations in the lines of play considered."
Hence, a key question is: should instances of positions that change the SF evaluation really be considered failures?
PS: I'm not the author of this academic paper