
# MCTS

Monte-Carlo Tree Search for use on gym environments.

This repository implements the popular UCT algorithm. MCTS can be thought of as a low-memory representation of a policy, usable whenever a simulator of the environment is available. At each state, MCTS runs simulations and returns an action that roughly maximizes an estimated Q-value. UCT is a highly selective best-first search, meaning it can look quite far ahead, which makes it useful in games like Chess and Go. In AlphaGo, MCTS acts as a policy-improvement step: it improves upon the policy encoded in a neural network, and also serves as the online player.
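The selectivity of UCT comes from scoring each action by its estimated Q-value plus an exploration bonus that shrinks as the action is tried more often (the UCB1 rule). Below is a minimal sketch of that score; the exploration constant `c = 1.41` and the function name are illustrative assumptions, not this repository's exact API.

```python
import math

def uct_score(q_value, child_visits, parent_visits, c=1.41):
    """UCB1-style score: exploit the estimated Q-value, plus an exploration
    bonus that shrinks as this child is visited more often.
    Note: `c` is an assumed exploration constant, not a value from this repo."""
    if child_visits == 0:
        return float("inf")  # untried actions are explored first
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return q_value + exploration
```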

Starting from the current state, all actions are simulated. If the resulting state-action pair has been encountered before, UCT picks the action with the highest upper-confidence bound on its Q-value; this is implemented in the `select_child` function. The first time a particular state-action pair is encountered (a leaf node), its Q-value is estimated by picking actions at random until the simulation ends. The `backpropagate` function then folds this leaf estimate, with appropriate discounting, into the Q-values of the transitions leading to the leaf. After `num_simulations` runs, the most frequently picked action is returned.
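A compact, self-contained sketch of that loop is below, written against the classic Gym step API `(obs, reward, done, info)`. The `select_child`, `backpropagate`, and `num_simulations` names come from the description above, but the node structure, the deep copy of the environment, and all default constants are assumptions rather than this repository's exact implementation.

```python
import copy
import math
import random

class Node:
    """One node per state reached in the search tree; the Q-value lives on the
    (parent, action) edge that leads into the node."""
    def __init__(self, parent=None, action=None):
        self.parent = parent    # node this one was expanded from
        self.action = action    # action taken in the parent to reach this node
        self.children = {}      # action -> Node
        self.visits = 0
        self.q_value = 0.0

def select_child(node, c=1.41):
    """Pick the child with the highest upper-confidence bound on its Q-value
    (the same UCB1 rule sketched above)."""
    def ucb(child):
        bonus = c * math.sqrt(math.log(node.visits) / child.visits)
        return child.q_value + bonus
    return max(node.children.values(), key=ucb)

def rollout(env, discount):
    """Estimate a leaf's value by picking random actions until the episode ends
    (assumes episodes terminate)."""
    value, scale, done = 0.0, 1.0, False
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        value += scale * reward
        scale *= discount
    return value

def backpropagate(node, rewards, leaf_value, discount):
    """Fold the leaf estimate into the Q-values of every transition on the path
    back to the root, adding each step's reward and discounting per level."""
    value = leaf_value
    for reward in reversed(rewards):
        value = reward + discount * value
        node.visits += 1
        node.q_value += (value - node.q_value) / node.visits  # running average
        node = node.parent
    node.visits += 1  # root only tracks how many simulations have been run

def mcts_action(env, num_simulations=1000, discount=0.99):
    """Run num_simulations simulations from the current state of `env`
    (assumed deep-copyable) and return the most frequently picked action."""
    root = Node()
    n_actions = env.action_space.n
    for _ in range(num_simulations):
        sim = copy.deepcopy(env)  # simulator positioned at the current state
        node, done, rewards = root, False, []

        # Selection: while every action from this node has been tried, follow UCB.
        while not done and len(node.children) == n_actions:
            node = select_child(node)
            _, reward, done, _ = sim.step(node.action)
            rewards.append(reward)

        # Expansion: the first time an action is tried from this node, add a leaf.
        if not done:
            action = random.choice([a for a in range(n_actions) if a not in node.children])
            node.children[action] = Node(parent=node, action=action)
            node = node.children[action]
            _, reward, done, _ = sim.step(action)
            rewards.append(reward)

        # Simulation: a random rollout from the leaf estimates its Q-value.
        leaf_value = 0.0 if done else rollout(sim, discount)

        # Backpropagation: push the estimate up the path with discounting.
        backpropagate(node, rewards, leaf_value, discount)

    # After num_simulations runs, return the action picked the most.
    return max(root.children, key=lambda a: root.children[a].visits)
```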

There are numerous ways to modify MCTS. In environments with a long or infinite horizon, a statistical estimate of the value function can be used to assess the value of leaf nodes.
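For example, the random rollout in the sketch above could be swapped for a truncated rollout that bootstraps from a value estimate once a depth limit is reached. In the sketch below, `value_fn` is an assumed, user-supplied model (e.g. a fitted regressor or neural network), not part of this repository.

```python
def evaluate_leaf(env, state, value_fn, max_depth=20, discount=0.99):
    """Truncated rollout: take random actions for at most max_depth steps, then
    bootstrap the remaining return from a value-function estimate. `value_fn`
    is an assumed user-supplied model mapping a state to its estimated value."""
    total, scale = 0.0, 1.0
    for _ in range(max_depth):
        state, reward, done, _ = env.step(env.action_space.sample())
        total += scale * reward
        scale *= discount
        if done:
            return total
    return total + scale * value_fn(state)
```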