
Separate exploration from training feedback (alternate method of lowering T) #342

Closed
killerducky opened this issue Sep 6, 2018 · 30 comments


@killerducky
Contributor

killerducky commented Sep 6, 2018

We have two competing goals:
A) Get an accurate value of positions.
B) Explore different positions so we learn about a variety of different positions.
For B we have T=1 for the entire game. But this reduces the accuracy of A.

Proposed method:

  1. Pick a random number of plies N
  2. For N plies, do one playout per move and pick the move to play proportionally to the policy output.
    2a) For these moves, do not record anything in the training data file.
  3. After N plies, run our normal self-play algorithm with T=0. Note that Dirichlet noise will still introduce some amount of randomness.

For picking N, we want a good number of games that start near the opening so we can learn the values of opening moves, but we also want to get into unusual middle/endgame positions.

Using this method, the time spent in step 2 is wasted in the sense that we generate no data to feed back to our NN. But that step is about 800 times faster, because we do only one playout per move there.
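A minimal sketch of the proposed loop, assuming hypothetical helpers (new_game, sample_proportional, mcts_move); this is not the actual lc0 self-play code, just an illustration of which positions get recorded:

```python
import random

def generate_game(net, max_explore_plies=60):
    """Sketch of the proposed two-phase self-play game (all helpers hypothetical)."""
    board = new_game()                        # fresh start position
    n = random.randint(0, max_explore_plies)  # step 1: random number of exploration plies

    # Step 2: N plies of cheap exploration, one NN eval per move,
    # move sampled proportionally to the policy head. Nothing is recorded.
    for _ in range(n):
        if board.is_game_over():
            return None                       # discard and retry if the game ends early
        policy, legal = net.eval_policy(board)
        board.push(sample_proportional(legal, policy))

    # Step 3: normal self-play with T=0 (Dirichlet noise still applied at the root).
    # Only these positions go into the training data.
    records = []
    while not board.is_game_over():
        move, visit_dist = mcts_move(net, board, visits=800, temperature=0)
        records.append((board.fen(), visit_dist))
        board.push(move)

    z = board.result()                        # final outcome, used as the value target
    return [(pos, visits, z) for pos, visits in records]
```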

@RedDenver

This is very similar to what I've mentioned before, although I suggest doing it like Experience Replay (a common RL technique): after a game is played, go back and explore an alternate move from that game, then send just the partial game starting after the alternate move back as training positions.

@oscardssmith
Contributor

@RedDenver I really like the idea of experience replay. It seems to do a good job of exploring without distorting game results. Do you have any idea whether this would be easy to implement?

@RedDenver

@oscardssmith Both my idea and KD's require some change to the training data being sent back to the server, and probably to the way the server selects positions to train on. I'm not familiar with those parts of the code, so I'm not sure how much work that would entail.

@RedDenver

RedDenver commented Sep 6, 2018

Based on the Discord discussion about how to train the opening moves, I propose using high temp in the first 6 plies (119 million positions possible) to produce semi-random games, since training is distributed, and then playing the remainder of the game without temp and submitting that full game back to training just as we do now.

Then additional partial games can be produced in a few ways:

  1. From the full game just played: select one or more random positions after ply 6 (or whatever ply we use as the temp cutoff), pick a random alternate move at each, and then play out the rest of that variation again without temp (see the sketch after this list).
  2. From the method KD suggests in the initial comment, but only allow N > 6 (or whatever ply we use as the temp cutoff) to prevent duplicate games.
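To make option 1 concrete, here is a rough sketch of the experience-replay-style branching, with hypothetical helpers (replay_to, pick_alternate_move, selfplay_from); only the variation after the branch point would be sent back:

```python
import random

def partial_game_from(net, game_moves, temp_cutoff=6):
    """Branch a finished game at a random ply after the temp cutoff and replay
    the variation without temp. All helpers here are hypothetical."""
    branch_ply = random.randrange(temp_cutoff, len(game_moves))
    board = replay_to(game_moves, branch_ply)                  # rebuild the position
    move = pick_alternate_move(net, board, exclude=game_moves[branch_ply])
    board.push(move)
    # Only positions from the branch point onward become training data.
    return selfplay_from(net, board, temperature=0)
```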

@Mardak
Contributor

Mardak commented Sep 6, 2018

At a high level, this is a bit similar to @dubslow's stochastic temperature, where the effect of temperature is reduced/removed after a random number of moves: #237 (comment)

The approach here seems like it would require more changes to self-play and training, with special handling of opening positions or empty training data. Even if we say "N plies" can be 0, the proportion of training data that includes opening moves would be significantly lower (though it's unclear whether that's actually a problem, as perhaps training with history planes won't confuse the start position with other positions).

One general issue with picking positions based on priors instead of including value from search is that, on the surface, it seems it would be harder for the network to correct an incorrect position value. But I suppose that could eventually be covered by a separate run with "N-1" exploration plies reaching the same "just before" position, where search generates prior training data that increases the likelihood of randomly picking the "N" position. In other words, it may take multiple network generations for value to propagate into the priors before we randomly play into those positions and update their value.

An even larger question about randomness for exploration is whether it generates training data with a sufficient proportion of reasonable and/or desired positions, e.g., piece sacrifices, fortress or perpetual positions. But then again, a proposed design doesn't need to address everything at once.

@bjbraams

bjbraams commented Sep 6, 2018

I have wondered why Alpha Zero (and Leela Zero and LC0 in its wake) does not use some form of temporal difference (TD) learning or Q-learning to obtain the training data for the value function; they use the value at the end of the simulated game. (It is different for the policy function; the data used to train the policy head is available as each move is generated, without playing out the game.) If some sort of TD learning is used, then one can completely divorce the generation of the training data from the generation of complete simulated games. One can obtain board positions in any way one likes and obtain the training data for just one board position at a time. It removes a lot of correlations in the data and it removes the conflict between high noise in the self-play games (to obtain a diverse set of training configurations) and low noise (to obtain an accurate value).
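To make the contrast concrete, here is a tiny sketch of a bootstrapped (TD-flavoured) value target that blends the per-position search value q with the final result z instead of using z alone; the blend weight and names are my own assumptions, not the method of the papers:

```python
def blended_value_targets(search_values, z, lam=0.5):
    """search_values[t]: root search value q at ply t, from the side to move's view.
    z[t]: final game result, also expressed from the side to move's view at ply t.
    Returns a per-position value target mixing the two; purely illustrative."""
    return [lam * z[t] + (1.0 - lam) * q for t, q in enumerate(search_values)]
```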

I discussed this from a somewhat remote perspective -- the generation of training data for fitting a molecular energy function -- in a post on the r/cbaduk subreddit that may be of interest for the present discussion.

"Generating neural network training data by self-play, sampling a state space in molecular simulations"; posted on r/cbaduk on 2018-03-22.
https://www.reddit.com/r/cbaduk/comments/86fjj9/

@RedDenver

@Mardak This isn't the same as #237, as these proposals explicitly remove from training any positions that occur before a temp/random move.

@Mardak
Contributor

Mardak commented Sep 6, 2018

Yes, and that "remove training data" part seemed to be an artifact of not doing search during those earlier moves (to quickly get to positions we do want to train on), so there would be no visit-probability data to train with; the removal then also requires special handling to actually train opening moves.

AGZ, with its 30 moves at T=1 and the remainder at T=0, shows that early position values are probably pretty good even though there are multiple temperature-induced moves that might distort the true value.

@killerducky was the intent of the proposal to "train value with zero temperature-induced plays", or just to reduce their likelihood? Either way is a reasonable proposal and would address "Get an accurate value of positions" -- sub-point 2a just made it seem that removing early position value training was an artifact rather than an explicit goal of the proposal.

@RedDenver

@Mardak I think exploration with alternative moves is still very useful in the middle and endgame, but I don't think temp is the right tool for it. Temp is very simple to implement and works well in the opening, though.

So my suggestions are based on removing temp after the opening, and experience replay is very similar to how humans analyze games: look at variations in games already played. It just makes sense to me.

@MaurizioDeLeo

Isn't this the same as #330 (comment)?

@bjbraams

bjbraams commented Sep 8, 2018

Indeed, the post and discussion by @amjshl under #330 look like a very good quantitative demonstration of the benefits of separating the sampling of the configurations from the evaluation of the training data for the value. This (from the comments section under #330) sums it up for me.

"Yes, my assumption is that the result of T=0 at 800 playouts gives the most accurate value but that is very computationally intensive. As a compromise, running just 1, 10, or 50 playouts with T=0 is cheaper to compute but gives a better prediction accuracy than T=1 and 800 playouts used during training games. The challenge is that we still need T=1 to generate greater variety of positions and explore new moves, but use T=0 only for determining the value of a position, but not generate positions for training."

@RedDenver

@MaurizioDeLeo @bjbraams This isn't the same as #330; that one compares using temp for the entire game vs. only in the early game. This issue discusses how to keep move exploration without using temp.

@DanielUranga
Member

Now that the "uncertainty" head is being considered, it could be used to guide exploration. A kind of curiosity-based learning: https://towardsdatascience.com/curiosity-driven-learning-made-easy-part-i-d3e5a2263359
The idea would be to play moves that reach positions with maximum uncertainty, and use T=0 from there, as proposed in this issue.
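A minimal sketch of that selection rule, assuming a hypothetical uncertainty head scored on each child position; everything here is illustrative:

```python
def pick_exploration_move(children):
    """children: list of (move, uncertainty), where uncertainty comes from a
    hypothetical uncertainty head evaluated on the resulting child position.
    Curiosity-style choice: walk toward the most uncertain position."""
    return max(children, key=lambda c: c[1])[0]
```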

@oscardssmith
Contributor

Oh, I hadn't thought of that use. That's a cool idea.

@jhellis3

jhellis3 commented Oct 23, 2018

One could train value on the result only until the eval drops below 1 pawn (for the winning side), or whatever that translates to in win %, and train value on the search result for all moves previous to that.
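A hedged sketch of that rule, assuming we already have the winner's win probability from search at each ply; the 0.75 threshold is just a stand-in for "about a pawn up", and all names are placeholders:

```python
WIN_PROB_THRESHOLD = 0.75   # rough stand-in for "about 1 pawn for the winning side"

def value_targets(winner_win_probs, search_values, z):
    """winner_win_probs[t]: search-derived win probability of the eventual winner at ply t.
    search_values[t]: search value from the side to move's view at ply t.
    z: final result. Train on z from the point where the winner's eval stays above the
    threshold to the end of the game, and on the search value before that. Illustrative only."""
    cutoff = 0
    for t in reversed(range(len(winner_win_probs))):
        if winner_win_probs[t] < WIN_PROB_THRESHOLD:
            cutoff = t + 1
            break
    return [z if t >= cutoff else search_values[t] for t in range(len(search_values))]
```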

@nicovdijk

I would like to work on something like this, but slightly different: involving a small amount of temp in the second phase, plus a value guard. Still, I think we can set it up flexibly so that the exact use remains free to choose. I am not 100% sure my understanding of the current code is correct, so please correct me if necessary.

I suggest we try to do the following:

  • Implement the ability for settings such as temp to change over the course of a game (or is this already possible?).
  • Implement a value guard in move selection (see the sketch at the end of this comment).
  • Include a training flag per position (boolean, or maybe better an int) in the training output indicating whether it should be used for training (or maybe whether to train toward Q or Z, so that it can differ per position; is Q already saved in the training output?).
  • Use this training flag in SGD.
  • Define this flag using the original idea here, or detect a potential blunder by comparing the Q value of the played move to the best Q value and change the temp settings accordingly.

What do you guys think?
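For reference, a minimal sketch of what I mean by a value guard in move selection: temperature sampling restricted to children whose Q is within some margin of the best Q (the margin and names are just placeholders):

```python
import random

def pick_move_with_value_guard(children, temperature=1.0, q_margin=0.05):
    """children: list of (move, visit_count, q) for the root's children.
    Sample with temperature over visit counts (temperature > 0 assumed), but only
    among moves whose Q is within q_margin of the best Q -- the 'value guard'."""
    best_q = max(q for _, _, q in children)
    allowed = [(move, visits) for move, visits, q in children if q >= best_q - q_margin]
    weights = [visits ** (1.0 / temperature) for _, visits in allowed]
    r = random.uniform(0.0, sum(weights))
    for (move, _), w in zip(allowed, weights):
        r -= w
        if r <= 0:
            return move
    return allowed[-1][0]
```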

@oscardssmith
Contributor

This could work, but it should not be in the initial version, as the value guard is orthogonal.

  • There is a PR that does this.
  • Delay until v2 if tests show it works better.
  • No. If you don't want to train on a position, don't send it.
  • See above.
  • Again, orthogonal.

@nicovdijk

nicovdijk commented Dec 31, 2018

Thanks for the reply.

OK, I can agree. That means starting from that PR; which one is it?

Just sending only the positions to train on is definitely better. It was probably a misunderstanding on my part of how the code works right now.

Then, in the first version, I still think it's easiest to set temp and Dirichlet noise to zero after detection of a 'blunder'. It will not reduce the time required for training games, but it does not require more extensive restructuring of the code to first play a complete game based on a 1-node search.

If we could start from a version of the code that includes Q training, then we could still use the positions from phase one.

Should be doable.

@oscardssmith
Contributor

We don't want to use the phase 1 positions. The desired behavior is:
phase 1: get to an initial board state by making x moves at 1 node per move, choosing proportionally to policy.
phase 2: evaluate that position with an 800-node-per-move playout tuned for maximum strength.

An easy way to do phase 1 might be to pick a random number between 1 and 450 and play that many plies (or restart if the game ends before then).

@nicovdijk

I'll stop talking about using the first phase, but I just want to mention that it would be useful information when you use 800 nodes.

I assume you mean that the first phase uses temperature and/or Dirichlet noise? Otherwise we would get nearly identical games until phase two starts.

I'm not really happy with just choosing a move number to start the second phase. Game length statistics will vary a lot for different nets...

What about a relatively flat exponential distribution for choosing the move number at which phase two starts? I expect games would be relatively short using 1 node and temp/noise, and it would be a waste to have to restart many times because of a (too high) move number.

@oscardssmith
Contributor

A flat exponential should work fine. You can't use temp as-is: with only 1 node, temp is meaningless. Instead you want to choose proportionally to policy (with noise added).
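A sketch of both pieces under those assumptions: an exponential draw for the phase-one length, and a single-eval move choice proportional to policy mixed with Dirichlet noise (eps = 0.25 and alpha = 0.3 mirror the AZ paper for chess, but everything here is just a placeholder sketch):

```python
import random
import numpy as np

def phase_one_length(mean_plies=40, max_plies=450):
    """Draw the number of phase-one plies from a (truncated) exponential distribution."""
    return min(int(random.expovariate(1.0 / mean_plies)), max_plies)

def pick_phase_one_move(legal_moves, policy, eps=0.25, alpha=0.3):
    """One NN eval at the root: mix the policy priors with Dirichlet noise and sample.
    policy: prior probabilities over legal_moves (already normalized)."""
    noise = np.random.dirichlet([alpha] * len(legal_moves))
    probs = (1.0 - eps) * np.asarray(policy, dtype=np.float64) + eps * noise
    probs /= probs.sum()
    idx = np.random.choice(len(legal_moves), p=probs)
    return legal_moves[idx]
```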

@nicovdijk

OK, that's what I meant. What is meant by 1 node, then?

@oscardssmith
Contributor

Current training uses 800 NN evals to decide where to move. Since with this issue we don't train on the first part, and only want to produce a variety of training positions, we only need to run the NN once on the root position for each move and pick proportionally to policy.

@nicovdijk

I understand that we only do one NN eval at the root and look at the policy, but I still don't understand the difference between what I call 1 node and what you call proportional to policy...

Anyway, as I understand it (I just reread the DeepMind papers to check), you can actually apply temp to the root policy (the distribution of move probabilities), basically sharpening (or softening, for temp > 1) the distribution. Adding Dirichlet noise really changes the distribution, so it can also change the move for temp = 0. I guess it makes most sense to use both in phase one, right?

@oscardssmith
Contributor

Temp is applied at the root, but the formula is in terms of the visit counts each of the children got. At 1 node, there are no child nodes.

@nicovdijk

OK, that makes sense. From the Go paper it appears that the policy is trained to resemble the visit distribution including the temp effect... I always thought this was only done just before move selection!?

Anyway, sampling from the pure policy should be equivalent to some positive value of temp, so that already gives some variation. I should look at the code to see whether it's possible to add additional softening to the distribution, if that would be desirable.

To really explore, we also add Dirichlet noise.

If you could point out the version that includes changing parameters during game generation I would be grateful (I looked for "Hanse" but did not find it yet).

@ghost

ghost commented Feb 10, 2019

This could work, but it should not be in the initial version, as the value guard is orthogonal.

My apologies, but could you please explain what a "value guard" is (or link to some research paper)? My Google searches for "mcts value guard" and similar did not turn up anything.

@oscardssmith
Contributor

When I say it's orthogonal, I mean it has nothing to do with search. It's mitigating the effects (both good and bad) of temp by selecting only moves that are close enough in quality.

@jhorthos
Contributor

jhorthos commented Feb 15, 2019

Temp is applied at the root, but the formula is in terms of the visit counts each of the children got. At 1 node, there are no child nodes.

The policy distribution can still be shaped in a temperature-like manner, just acting directly on the policy to sharpen or flatten it. This could be an important issue for long phase 1 playouts if you want the resulting positions to be reasonable. Making each move proportional to policy (without sharpening) might end up producing many odd moves in a long phase 1 playout. That might be fine, but it might not. It would be easy enough to implement a tau that defaults to 1 (does nothing). If tau is anything else, there would be an exponentiation and renormalization at every ply, so there would be some cost, but it is most likely trivial compared to the NN.
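A minimal sketch of that tau, applied directly to the root policy (assumed already normalized); tau = 1 is a no-op, tau < 1 sharpens, tau > 1 flattens:

```python
import numpy as np

def shape_policy(policy, tau=1.0):
    """Exponentiate and renormalize the root policy, temperature-style."""
    p = np.asarray(policy, dtype=np.float64) ** (1.0 / tau)
    return p / p.sum()
```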

@Mardak
Contributor

Mardak commented Jan 1, 2020

With #964, one could set a very high --minimum-allowed-visits to separate out any temperature move that isn't the most-visited move, and a high --discarded-start-chance would allow continuing to play from these discarded moves, with the game outcome no longer affecting the earlier positions. That would pretty much separate exploration from training feedback.

@Tilps is still adjusting those selfplay options, and practically it seems there will still be some "usual temperature" that combines exploration and training feedback, but maybe it's good enough to close off this issue.

Mardak closed this as completed on Jan 1, 2020.