Replies: 27 comments
-
That's really cool!! My only question is whether variance could achieve the same result, as it would be a lot easier to back up.
-
Very interesting idea. @oscardssmith As we don't really want to grow the node structure even more, I would propose to first try it out with on-the-fly entropy calculations during each playout and see how that goes. @kk2796 Do you want to try coding it yourself, in which case I or someone else can certainly give you pointers where to add/modify the code, or do you want someone else to do it?
-
@ddobbelaere variance backup only requires 4 bytes (1 float) per node. On-the-fly entropy (or variance) will probably be pretty slow.
-
I think it's almost self-evident that a naive "on-demand" approach (computing entropy dynamically each time it is needed) would entail an untenable performance overhead. If some form of local caching can't come to the rescue, it may be necessary to work with less expensive statistical approximations such as variance (which is more conducive to incremental aggregation than a wholesale refresh). That said, though I don't have the mathematical fortitude to prove this, I do feel the "root problem" is deeply grounded in information theory and that entropy is the right/best metric to use in solving it. Other ideas may work, but my intuition is that their effectiveness would be limited by how closely they approximate entropy.

@ddobbelaere I'd definitely prefer to leave the coding to others. I'm 99% data scientist and 1% developer. Downloading Leela's source (along with proper compilers and appropriate IDEs) and then personally trying to put this idea into code sounds like utter drudgery. I suspect I would need weeks, perhaps months, to produce something that was even functional. Even then, my resulting code would be deplorable, and, frankly, I would hate every minute of it.

Backing up a step, though: I think there's some meaningful work that can be done before any coding changes (much less discussions of tuning/optimizations) are even on the map. Is there a way to run lc0.exe so that it performs a small fixed-node search (say, 100 nodes) and spools out some representation of the search tree after 100 playouts? That is, something which would allow me to see the tree structure (nodes and edges) as well as the value-head and policy-head outputs for each node? I know there are log options which will display statistics for a position, but as far as I know, one only sees the summary stats for moves of the root node (not the entire tree). If I can start with that, I'd be able to run some simulations showing what the effect of entropy-bias would be on various positions. In theory, I can accomplish 3 things:
I think the first two of these would be most meaningful... however, the third would probably generate the most hype for coding up a solution that can be tested "for real", e.g. assessing the Elo impact of the entropy-enabled code vs. unmodified code in fixed-node competition. I should also note that during this initial assessment phase, I could play around with other variants of incorporating E(s,a) into the U(s,a) formula and/or add constants that allow the effect to be magnified or dampened.

TL;DR: If there's a way to run Leela in a debug or verbose mode that prints the entire search tree, including policy/value results of each node, please let me know!
-
Adding 4 bytes to each node (assuming it is only needed in nodes, not edges) is about a 2% increase in total RAM usage, by my estimate. Not a problem.
-
I want to help you with the coding part, but after thinking for a while I found there is still an issue with the proposed formula. For moves that have only one legal follow-up move, it holds that E(s,a) = 0. The issue is that those moves will completely usurp all the visits at some point, as U(s,a) grows without bound. This problem is so severe it needs to be sorted out first. My first naive thought would be to take E'(s,a) = E(s,a) + E0, with a fixed and yet-to-be-determined constant E0 > 0, but I am curious if there are other, maybe more elegant solutions.
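In code, the floor idea amounts to something like this (a toy sketch only; the value of E0 is arbitrary and not tuned):

```python
def exploration_denominator(n_visits: int, entropy: float, e0: float = 0.1) -> float:
    """Denominator of the exploration term with the floored entropy E'(s,a) = E(s,a) + E0.

    With e0 > 0, even a completely forced line (entropy == 0) sees its denominator
    grow with visits, so U(s,a) can no longer grow without bound for that move.
    """
    return 1.0 + n_visits * (entropy + e0)
```

With E0 = 0.1, for example, a forced line would still need about ten visits before its exploration bonus is halved, so the bias toward forcing lines remains but is bounded.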
-
Any chance of doing it with stdev? I think it will likely yield better results, but is also much easier to calculate and back up.
-
Sounds like a very interesting concept, though the mathematics of it is not obvious/intuitive to me at all. Shouldn't the entropy term be more coupled with Q(s,a)? Moves which are good and have low entropy we would want to explore more, while moves which are bad and have low entropy should be explored less and less. One way to look at it: if it is a forced sequence of winning moves we want to play it, while if it is a forced sequence of losing moves we would want to avoid it at all costs.
-
@oscardssmith I think that in the case of only one legal follow-up move the standard deviation is also zero (at least for every possible definition I can think of).

@kk2796 Please allow me to align the notation with the conventions in the literature: during each playout, we want to maximize the upper confidence bound Q(s,a) + U(s,a), with Q(s,a) the mean action value of the move (expected outcome averaged over the values of visited nodes).

As an alternative to the original intention of the opening post, I propose to calculate the entropy (or standard deviation, if you prefer) over the current visit distribution of the follow-up moves (and not over the upper confidence bound Q(s,a) + U(s,a) distribution). This is much easier to calculate on-the-fly and may be feasible (moreover, it doesn't contain the recursion that is present in the original proposition, i.e., that E(s,a) depends on the entropy of the follow-up moves after that, etc.).
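For concreteness, the "entropy over the visit distribution" would be something like the following (a plain recomputation just to illustrate the quantity, not actual lc0 code and not the incremental form):

```python
import math

def visit_distribution_entropy(child_visits):
    """Entropy of a node's children's normalized visit counts, scaled to [0, 1]."""
    total = sum(child_visits)
    if total == 0 or len(child_visits) < 2:
        return 0.0
    probs = [n / total for n in child_visits if n > 0]
    raw = -sum(p * math.log(p) for p in probs)
    return raw / math.log(len(child_visits))  # divide by the maximum possible entropy
```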
-
Yes, the most extreme effect of E(s,a) is when E(s,a) = 0. Two points I want to make:
Put together, 1 and 2 amount to: "The most extreme effect of E(s,a)=0 is only as drastic as the effect of N(s,a)=0; and while N(s,a)=0 is guaranteed to grow to N(s,a)>0 with the first visit, E(s,a)=0 is also guaranteed to grow to E(s,a)>0, it may just take 2 or even 3 visits to do so."

I've got spotty availability this morning, but I do intend to follow up on the other points raised.
-
@ddobbelaere
Thanks for pointing out the discrepancy; I'll aim to use the correct nomenclature going forward.
-
One confusing thing is that the A0 PUCT formula is different from every other one, so they really shouldn't have called it that.
-
@oscardssmith @kk2796 Thanks for the clarification, it makes more sense now. Now I also understand why nobody was eager to calculate the entropy on the fly :).

I am still annoyed by one thing: Q(s,a) + U(s,a) compared to the Q(s,a') + U(s,a') of all siblings doesn't really represent the probability that a move will be played, IMO (in that sense, MCTS is actually very deterministic: it will always pick the move that maximizes Q+U). Another thing to note (but this is maybe merely a technicality arising from our lc0 convention that -1 <= Q <= 1) is that (Q(s,a) + U(s,a)) / sum(Q(s,a') + U(s,a')) is not what you want it to be (as some terms can be negative), but this can maybe be solved by considering Q+U+1 in the equations instead of Q+U, I guess (although I am even less sure about the meaning in terms of probability in this case, but at least it's between 0 and 1 then).

I am still wondering if the entropy over the visit distribution doesn't also make sense. IMO it corresponds more closely to the probability that a move will be played. Moreover, briefly looking at the backup update equations lets me conclude they are doable (if we use the visit distribution) and can be incremental (calculations at each step do not depend on the branching factor)! If we ever do this, we have to check numerical stability of course. I wonder what you or other people think about these considerations.
-
After thinking about the issue for quite some time, here are my new ideas:
U(s,a) = cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a)) + c' ⋅ S(s,a), with c' a scaling constant (c' = 0 reduces to the old formula).

The variance S^2(s,a) can be efficiently calculated during the backup step (if we want to store both the variance and the mean in each node, for example both as float) with the numerically stable algorithm https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm, which @oscardssmith pointed me to in Discord chat.

Note the (IMO) important point that we calculate the variance of all the raw net eval Qs. On top of the easier calculation during the backup step (compared to changing the Qs of the in-between non-leaf nodes), it gives us a hint that the raw NN evals fluctuate wildly in the subtree (indicating possible tactical combinations that need to be explored more aggressively).
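A minimal sketch of that online update as it could run during backup (illustrative only; the class and field names do not correspond to the actual lc0 node structure):

```python
class NodeStats:
    """Running mean/variance of the raw NN evals backed up through a node (Welford)."""

    def __init__(self):
        self.n = 0       # number of raw evals seen so far
        self.mean = 0.0  # running mean of the raw evals
        self.m2 = 0.0    # sum of squared deviations from the running mean

    def update(self, q: float) -> None:
        # Welford's online algorithm: numerically stable, single pass, O(1) per backup.
        self.n += 1
        delta = q - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (q - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n > 0 else 0.0

    @property
    def stddev(self) -> float:
        return self.variance ** 0.5  # this would play the role of S(s,a) above
```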
-
So, the more I considered your annoyance at treating the relative values of Q+U as "probabilities of visits", the more annoyed I became with it as well. If nothing in the A0 PUCT algorithm treats Q+U as a probability, that really undermines the justification for insisting upon using "entropy of Q+U probabilities". Or, at least, it undermines my pretense that the approach represents some shining beacon of "the exact right metric to use."

On a positive note, this leaves me 100% on board with going back to the drawing board and looking at this issue and the proposed ideas with fresh eyes. In fact, part of the reason I've taken so long to respond is that I keep getting halfway through typing a response and finding my ideas evolving to the point that I completely disagree with what I'd just typed. :-)

While I wrestle to find some stable ground in my thinking, I didn't want to draw out the silence any longer. So for now, a couple of points I am firm on:
-
One thing to note is that there's no inherent reason that we have to pick the move with the most nodes. If that's the only problem with this, we could switch to Q-U or something similar as a move selection criterion (which would fix this problem).
-
@kk2796 (In response to 2.) I thought I had incremental update formulas for the entropy formulation, but after a second look they are flawed.

I propose to start off with integrating Q-variance tracking in the codebase, such that we have S(s,a) available. In the same "research/test" PR, we could investigate several formulations to include this variance term S(s,a) in the uncertainty term U(s,a). I'll see if I find some time to do this.

Of course we will need lots of testing once the code is ready. We could for example test against a suite of tactical problems at fixed nodes (maybe there are good tactics suites available in EPD format? Probably @jhortos knows?). There will probably be a lot of tools at our disposal to help us automate (e.g. @killerducky's lc0_analyzer) and we can always ask for help/advice in Discord chat. Hopefully it'll improve tactical capabilities, and if not, it was a fun ride :).
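As a very rough sketch of what the fixed-node EPD testing could look like (this assumes the python-chess package and an lc0 binary on the PATH; the suite filename below is only a placeholder):

```python
import chess
import chess.engine

def run_tactics_suite(epd_path: str, nodes: int = 10000) -> float:
    """Fraction of EPD tactics positions solved at a fixed node count."""
    engine = chess.engine.SimpleEngine.popen_uci("lc0")  # assumes lc0 is on the PATH
    solved, total = 0, 0
    try:
        with open(epd_path) as f:
            for line in f:
                if not line.strip():
                    continue
                board = chess.Board()
                ops = board.set_epd(line.strip())  # the "bm" operation lists the best move(s)
                if "bm" not in ops:
                    continue
                result = engine.play(board, chess.engine.Limit(nodes=nodes))
                total += 1
                solved += result.move in ops["bm"]
    finally:
        engine.quit()
    return solved / total if total else 0.0

# e.g. print(run_tactics_suite("some_tactics_suite.epd"))  # placeholder filename
```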
-
This is a really interesting idea and discussion! I have an idea for modifying the entropy-based approach above, although I keep finding problems with it. Basically the idea is to replace the 1/(1+N(s,a)) term with 1/(1+N(s,a)(1-R(s,a))), where R is the "remaining relevant exploration" in the subtree associated with a, which trends to 0 as more and more of the (estimated to be) probable parts of the subtree are expanded. Will hopefully get a chance to write this up properly tonight - just wanted to mention the idea before this discussion moved too far beyond the realm of equation modification.
-
OK, here's the idea in some detail. My aim was to capture the intuition (from the entropy approach above) that it's more efficient in some sense to calculate a position that is a forced follow-up to a position than to calculate some other unforced position. Basically, the idea is that these "follow-up positions" will tend to have a higher probability of occurring in the game, all else being equal. For example, a position in a forced line stemming from action a_1 will have a higher probability of occurring in the game than a position in one of three equally plausible forced lines stemming from action a_2, assuming that action a_1 and action a_2 are equally probable.

I assume we can estimate the probability that a node will be arrived at in the game (from some base node) by using a product of fractional U+Q values (as described in steps 2 and 3 in the original comment). I term this the "preference" for a node to distinguish it from probability proper. Instead of looking at the entropy of this "preference", we instead use it to define the concept of "relevant remaining exploration", abbreviated 'RRExp(s)', and propose using the decay in this metric to discourage exploration. We define RRExp(s) for a node s as:

We now aim to encourage maximizing the expected value of RRExp at the node where we eventually decide to expand (so our expansion is actually useful), modelling the decision of what node to expand at as a random variable. That is, we want to maximize:

I'm not sure how to estimate the probability probExpand(s) of expanding at a state s. As a first attempt one could try:

In summary, we get the following for the expected value of obtained relevant remaining exploration upon selecting action 'a':

The modification to the original equation could then involve changing 1/(1 + N(s,a)) to E[Obtained RRExp(s,a)] or maybe 1/(1 + N(s,a)*(1-E[Obtained RRExp(s,a)])). (Note that E[Obtained RRExp(s,a)] becomes large when we want to encourage exploration in the subtree associated with a, which is different from the entropy case.)

I'm not sure if this really revives the vision of prioritizing visiting states that are obvious follow-ups, and I'm unsure in particular about how well we can estimate probExpand(s). I do think this avoids falling into the stalemate trap described above, where a stalemating action becomes preferred due to its low subtree entropy. In particular, once the stalemating state is visited, its unexploredPref should become zero (as there are no states following this one that are unexplored), and so further exploring this stalemating state will be discouraged.
-
@dEgolf Perhaps another way of framing this is: what depth will the next visit to this node reach? Think of a scenario with two actions, a and b, with "all things being equal from the existing Q and U vantage point", but a has just 1 or 2 reasonable follow-ups and b has lots of follow-ups. In this case, visits to b will sort of get "hung up" at shallow depth while visits to a should get to deeper levels faster. Maybe a close-enough idea to what entropy is trying to capture, probably much easier to compute, and it definitely avoids the stalemate issues.
-
@kk2796 I think that is a way of viewing it: the aim is to try to get as much information as possible about how the game is likely to go with each node expansion. I would note, however, that the estimated information obtained is not really a function of depth (at least not the way I framed things above). Instead, it corresponds more broadly to how "relevant" we think the expanded node will be, which is basically how likely we think that expanded node is to occur in the game. This does encourage exploring more forcing lines to greater depth as a consequence, though, as the "relevancy" of the nodes in such a line decays more slowly with depth (due to lower branching).

The philosophical connection to entropy I had in mind is that we try to construct a measure of how much information will be obtained by exploring down the subtree associated with action 'a', and use this to weight exploration into this subtree. (Noting that entropy can be viewed as a measure of information.)
-
Isn't it just a preference for examining positions with a smaller number of "plausible" moves? Positions with many plausible moves are complex and need hard thinking (a wide tree of variants with similar priors), but that's not a reason to avoid them. Note that in the end we want to find just a single best move, not maximize the confidence of our eval across a variety of variations. If move A is promising but "hard to think about" and move B is less promising but almost forcing, I'm not convinced it makes sense to allocate more time to B just to become even more sure that it's less promising. Am I missing something?
-
@mooskagh The goal in the idea I expressed a few posts above was to maximize the 'relevant information gain' with each node expansion. This doesn't quite correspond to investigating moves that lead to more forcing lines. We instead aim to expand at nodes that are:
As a consequence, if two totally unexplored nodes - say node A and node B - have moves that look equally promising, but we estimate the probability of node B occurring on the board to be higher, then this approach will prefer to perform a node expansion at node B. Notice that this doesn't strictly encourage exploring forcing lines: once a very forcing line branches sufficiently so that its node probabilities become low enough, exploring new (less forcing) lines at shallower depths will be preferred.

It should also be noted that the "information gain" aspect discourages exploring a well-explored node further. So, if exploring move A would involve expanding a node with really promising unexplored moves, and move B would lead to expanding a node without really promising unexplored moves (with this node having the same estimated probability as the unexplored node following move A), then this will tend to result in further exploration of move A being preferred.

In summary: we're not trying to maximize the confidence of our eval for some subset of variations, but instead trying to obtain as much relevant information as possible about the whole set of variations with each node expansion. Hope this helps clarify things. Thoughts and criticism are very welcome!
-
Any update on this? I have some ideas I'd like to contribute.
-
What are they?
-
Size of subtree is |
-
This is definitely an interesting idea, and it should be compared to the general properties the tree search should have; see #1231. Also, did anyone ever try to implement this?
-
I originally posted this idea in the Google forum and was encouraged to echo it here. That thread lays out an intuitive case that using entropy to guide exploration can yield highly efficient search strategies that cannot be realized with "success likelihood" and "visit counts" alone. For the skeptics, please feel free to peruse the original thread (or, to see something a bit more concrete, skim down in this thread to COOLEST POINT OF ENTIRE POST IMO).
I'm not going to spend any more time arguing for this change; I think the strongest case for it arises from exploring how it could/should work. On that note - I've taken a bit of time to explore the math of this idea, and I'd like to lay out a plausible implementation. I do not claim this is an optimal implementation.
First, it's necessary to define entropy in the context of PUCT search. This is the context where the search is in state "s" and must choose an action "a" using probabilities that align with the various U(s,a) values of all actions. The action chosen will correspond to a playout to the node associated with that action. There is already a good amount of information that's calculated and pulled together to compute U(s,a) for action a:
To this mix, I introduce the term E(s,a). E(s,a) is the entropy associated with the subtree rooted at the node of action a. The entropy of a subtree is the entropy of the distribution of probabilities of a playout from the root of that subtree reaching any particular leaf node (either an unvisited node or terminal "mate" node).
An algorithm for computing E(s,a) for subtree T:
Entropy({p1, p2, p3, ..., pN}) = (-1) ⋅ sum[n in 1..N] { pn ⋅ log_N(pn) }
For those not familiar with the math here: because the log uses base N (the number of outcomes), this entropy calculation will always return a value between 0 and 1. The more uniform the probabilities are, the closer the entropy will be to 1; the more concentrated the probability mass is on a few outcomes, the closer it will be to 0.
To illustrate, here are a few sets of 4 probabilities ("log4" indicates log base 4)
Note - the final example is actually perfectly uniform and thus achieves the maximum possible entropy of exactly 1.
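A small sketch of the calculation itself (the probability sets below are just illustrative):

```python
import math

def entropy(probs):
    """Entropy with log base N (the number of outcomes), so the result lies in [0, 1]."""
    n = len(probs)
    return -sum(p * math.log(p, n) for p in probs if p > 0)

# Illustrative sets of 4 probabilities:
print(round(entropy([0.97, 0.01, 0.01, 0.01]), 2))  # 0.12 -- very sharp
print(round(entropy([0.70, 0.10, 0.10, 0.10]), 2))  # 0.68 -- moderately sharp
print(round(entropy([0.40, 0.30, 0.20, 0.10]), 2))  # 0.92 -- fairly flat
print(round(entropy([0.25, 0.25, 0.25, 0.25]), 2))  # 1.0  -- perfectly uniform
```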
Before I move on to the next topic (what to do with E(s,a)), I should mention that I'm not entirely clear on a low-cost method to track entropy according to this definition. In fact, this may be the largest barrier to implementation. However, there are two reasons to think it may not be a problem:
So - what do I propose we do with E(s,a)? No reason to draw it out:
Old formula, without E(s,a):
U(s,a) = Q(s,a) + cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a))
New formula:
U(s,a) = Q(s,a) + cPuct ⋅ P(s,a) ⋅ N(s,b)^0.5 ⋅ 1/(1+N(s,a)⋅E(s,a))
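To make the change concrete, here are both selection scores side by side (a toy sketch only; the cPuct value is arbitrary and this is not actual lc0 code):

```python
import math

def score_old(q, p, n_parent, n_child, c_puct=1.5):
    """Current formula: Q(s,a) + cPuct * P(s,a) * sqrt(N(s,b)) / (1 + N(s,a))."""
    return q + c_puct * p * math.sqrt(n_parent) / (1 + n_child)

def score_new(q, p, n_parent, n_child, entropy, c_puct=1.5):
    """Proposed formula: the visit count is scaled by the subtree entropy E(s,a)."""
    return q + c_puct * p * math.sqrt(n_parent) / (1 + n_child * entropy)

# A forced line (entropy = 0) keeps its full exploration bonus after being visited:
print(score_old(0.1, 0.3, 100, 1))       # bonus halved after the first visit
print(score_new(0.1, 0.3, 100, 1, 0.0))  # bonus unchanged, same as when unvisited
```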
Pausing for a quick FAQ:
Q: So, "voilà", you've come up with a better search formula than DeepMind?
A: As has been the case in many other regards, I suspect the DeepMind team's sole concern was to find a simple solution that was "good enough" to carry their thesis. I also suspect the benefits of entropy-bias depend heavily on the sharpness of the game... so while this might have made a huge difference if A0 had looked at entropy for chess, it might have been negligible for Go and Shogi.
Q: Did... did you just shove the new E(s,a) factor onto the end and hope it would work?
A: Maybe. But hear me out: this winds up being pretty slick!
Q: Are you seriously wasting GitHub developers' time with cheesy fake dialogue? Shouldn't you get on with this?
A: Ah - yeah, good point.
So, as I alluded to in the FAQ - while this may seem arbitrary at first blush, it truly is cool how well it works:
It's 100% consistent with current behavior for unexplored nodes: both the number of visits N(s,a) and the entropy E(s,a) are zero, so N(s,a)⋅E(s,a) = 0⋅0 = 0, exactly the N(s,a) = 0 that the old formula would use.
It's almost identical to current behavior if the distributions of probabilities are nearly uniform. In that case, E(s,a) will be very close to 1, and E(s,a)⋅N(s,a) =~= N(s,a). In other words, the math will only allow the behavior to deviate from typical PUCT search where there are areas of low entropy.
COOLEST POINT OF ENTIRE POST IMO (if you read nothing else, read this)
In one circumstance where E(s,a) significantly impacts U(s,a) and deviates from A0/Leela results, not only does the E(s,a)-based result make more sense than the current formula, but it's so obviously more correct that it seems the entire case for E(s,a) could stand on this example. The circumstance I'm alluding to is a chess move that corresponds to action a, for which there is just one legal response, a'.
Under the current/Leela/A0 approach to exploration, when action a is unvisited, N(s,a) = 0 and the exploration term for that action has a small denominator of (1+0), giving an overall exploration factor of N(s,b)^0.5 ⋅ (1/(1+0)) = N(s,b)^0.5 ⋅ 1 = N(s,b)^0.5. In other words, for as long as action a remains unvisited, as N(s,b) grows, the likelihood of this unexplored node being played will climb with the square root of N(s,b). After action a is visited, N(s,a) will be 1, and the exploration term will be abruptly cut in half, from a factor of N(s,b)^0.5 to a factor of (1+N(s,b))^0.5 ⋅ (1/(1+1)) =~= (N(s,b)^0.5) / 2.
Now consider what happens when E(s,a) is added to the equation. The U(s,a) value for the unvisited node is the same as without entropy: N(s,b)^0.5 ⋅ (1/(1+0⋅0)) = N(s,b)^0.5 ⋅ 1 = N(s,b)^0.5. However, after action a is visited, if we compute the entropy of the subtree obtained, we find that the playout from the root of that subtree was forced to follow a 100% probability path to the single node of a'. In other words, the subtree with a single forced move has the same E(s,a) = 0 after the visit as it did for the unvisited node.
The effect is clear: if there is an unvisited node that leads to a forced response, visiting that node once has no diminishing effect on how good a candidate it is for a follow-up visit. And IMO, nothing could be closer to what our human intuition tells us: it would be absurd to evaluate a position, see that it leaves just one legal response, and not bother to consider what that response would be.
This is more a continuation of point #3. Consider: what if, instead of there being just one legal response, there was only one move that doesn't lose instantly? For example, perhaps the initial action corresponds to a pawn attacking a queen, with only one escape square open to the queen. In this case, the response "move queen to safety" won't have a "pure" 100% policy. But it certainly may have a policy that looks like {0.92, 0.01, 0.02, 0.01, 0.03, 0.01}. For once-visited nodes, these policies become the playout probabilities for the next visit; and in this example, the entropy of those probabilities comes out to 0.22.
For those who just read through #3, it should already be evident that this is going to yield a weaker version of what we saw there. Instead of the denominator doubling (i.e., increasing from (1+0) = 1 to (1+1) = 2 and cutting the exploration term in half), this low entropy will subdue the rise of the denominator: it'll increase from 1.0 to only 1.22, and a follow-up visit to this queen-attack node will be only slightly less likely than the original visit.
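For concreteness, a quick check of those numbers using the base-N entropy definition from above:

```python
import math

policy = [0.92, 0.01, 0.02, 0.01, 0.03, 0.01]
e = -sum(p * math.log(p, len(policy)) for p in policy)
print(round(e, 2))          # 0.22

# Denominator of the exploration term after one visit (N(s,a) = 1):
print(1 + 1 * 1.0)          # 2.0  -- plain PUCT, exploration bonus halved
print(round(1 + 1 * e, 2))  # 1.22 -- entropy-scaled, bonus only slightly reduced
```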
There's actually quite a bit more I'd like to add... For example, I have dreamed up a much simpler metric than entropy that would capture a lot of the same perks - but I'll stop here for a breath and leave those for a follow-up.
Thanks to those who read this all the way through!