Use Q=0 for self-play following AZ paper behavior while keeping FPU reduction tuning #350
Fix #344. As documented in the issue, #344 (comment), neither lc0 nor lczero has ever used Q=0 for unvisited nodes, diverging from the learning behavior of AZ.
Looking even further back in lczero history, into leela-zero, Q/FPU was originally set to 1.1 (a win rate on a [0, 1] scale) based on gcp's experience with Leela. So it appears that at no point in Leela-related history has Q=0 been used when generating self-play games for training.
https://github.com/gcp/leela-zero/blob/2f7463d2cfba1b4617b3bd73bbdf3e1f52382429/UCTNode.cpp#L291
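To make the distinction concrete, here is a minimal sketch of PUCT child selection showing where the first-play-urgency choice enters. The function name, the child-dict layout, and the parameter names (`fpu_reduction`, `use_q0_fpu`) are hypothetical, chosen for illustration; they are not the actual lc0 code, which this PR modifies only in which FPU value self-play uses.

```python
import math

def puct_select(children, parent_visits, parent_q,
                c_puct=1.5, fpu_reduction=0.25, use_q0_fpu=False):
    """Pick the child maximizing score = Q + U (PUCT).

    Unvisited children have no empirical Q, so a first-play-urgency
    (FPU) value is substituted:
      - use_q0_fpu=True: Q = 0, the behavior described in the AZ paper
        that this PR adopts for self-play.
      - use_q0_fpu=False: lc0-style FPU reduction, the parent's Q minus
        a tunable penalty (fpu_reduction remains tunable per the PR title).
    """
    best, best_score = None, -math.inf
    for child in children:
        if child['visits'] > 0:
            q = child['value_sum'] / child['visits']
        elif use_q0_fpu:
            q = 0.0  # AZ-paper behavior: unvisited nodes start at Q = 0
        else:
            q = parent_q - fpu_reduction  # FPU reduction
        u = c_puct * child['prior'] * math.sqrt(parent_visits) / (1 + child['visits'])
        if q + u > best_score:
            best, best_score = child, q + u
    return best
```

The effect of the choice is visible with one visited child (Q = 0.2) and one unvisited child of equal prior: with Q=0 FPU the unvisited child's exploration term dominates and it is selected, while a large enough FPU reduction penalizes the unvisited child and keeps search on the visited one.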