Use Q=0 for self-play following AZ paper behavior while keeping FPU reduction tuning #350
Fix #344. As documented in the issue, #344 (comment), neither lc0 nor lczero has ever used Q=0 for unvisited nodes, diverging from the learning behavior of AZ.
Looking even further back in lczero history, into leela-zero, Q/FPU was originally set to 1.1 (a win rate on a [0, 1] scale) based on gcp's experience with Leela. So it appears that at no point in Leela-related history has Q=0 been used when generating self-play games for training.
https://github.com/gcp/leela-zero/blob/2f7463d2cfba1b4617b3bd73bbdf3e1f52382429/UCTNode.cpp#L291
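To make the distinction concrete, here is a minimal sketch of PUCT child selection showing where the first-play-urgency choice enters. The function name, the child-dict layout, and the parameter names (`fpu_reduction`, `use_q0_fpu`) are hypothetical, chosen for illustration; they are not the actual lc0 code, which this PR modifies only in which FPU value self-play uses.

```python
import math

def puct_select(children, parent_visits, parent_q,
                c_puct=1.5, fpu_reduction=0.25, use_q0_fpu=False):
    """Pick the child maximizing score = Q + U (PUCT).

    Unvisited children have no empirical Q, so a first-play-urgency
    (FPU) value is substituted:
      - use_q0_fpu=True: Q = 0, the behavior described in the AZ paper
        that this PR adopts for self-play.
      - use_q0_fpu=False: lc0-style FPU reduction, the parent's Q minus
        a tunable penalty (fpu_reduction remains tunable per the PR title).
    """
    best, best_score = None, -math.inf
    for child in children:
        if child['visits'] > 0:
            q = child['value_sum'] / child['visits']
        elif use_q0_fpu:
            q = 0.0  # AZ-paper behavior: unvisited nodes start at Q = 0
        else:
            q = parent_q - fpu_reduction  # FPU reduction
        u = c_puct * child['prior'] * math.sqrt(parent_visits) / (1 + child['visits'])
        if q + u > best_score:
            best, best_score = child, q + u
    return best
```

The effect of the choice is visible with one visited child (Q = 0.2) and one unvisited child of equal prior: with Q=0 FPU the unvisited child's exploration term dominates and it is selected, while a large enough FPU reduction penalizes the unvisited child and keeps search on the visited one.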