.. autoclass:: numpy_ml.rl_models.agents.CrossEntropyAgent
    :members:
    :undoc-members:
    :inherited-members:

.. autoclass:: numpy_ml.rl_models.agents.DynaAgent
    :members:
    :undoc-members:
    :inherited-members:

Monte Carlo methods solve RL problems by averaging sample returns for each state-action pair. Because a sample return is only known once an episode terminates, parameters are updated only at the completion of an episode.
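
The core of this approach can be sketched in a few lines. The function below is purely illustrative (it is not part of numpy_ml's API) and assumes a tabular setting in which ``Q`` and ``returns`` are dictionaries keyed by ``(state, action)`` tuples:

.. code-block:: python

    from collections import defaultdict

    def first_visit_mc_update(Q, returns, episode, gamma=0.99):
        """Average sample returns over the first visit to each (state, action)
        pair in a single completed episode.

        ``episode`` is a list of ``(state, action, reward)`` tuples; ``Q`` maps
        ``(state, action)`` pairs to value estimates; ``returns`` is a
        ``defaultdict(list)`` accumulating the sample returns being averaged.
        """
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r  # discounted return-to-go from timestep t
            # First-visit MC: only record G if (s, a) does not occur earlier
            if all((s, a) != (x[0], x[1]) for x in episode[:t]):
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        return Q

    # Usage sketch: Q and returns persist across episodes
    Q, returns = defaultdict(float), defaultdict(list)
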
In on-policy learning, the agent maintains a single policy that it updates over the course of training. To ensure that this policy converges to a (near-) optimal policy, the agent must ensure it assigns non-zero probability to *all* state-action pairs throughout training, so that exploration never ceases.
- On-policy learning is thus a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
In off-policy learning, the agent maintains two separate policies:
- Target policy: The policy that is learned during training and that will eventually become the optimal policy.
- Behavior policy: A policy that is more exploratory and is used to generate behavior during training.
Off-policy methods often have greater variance and are slower to converge. On the other hand, they are more powerful and general than on-policy methods.
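
As an illustration of the two-policy setup (a sketch only; the function names and the tabular ``Q`` dictionary below are assumptions, not numpy_ml's API), the behavior policy might be :math:`\epsilon`-greedy with respect to the current action-value estimates, while the target policy is the greedy policy the agent is actually trying to learn:

.. code-block:: python

    import numpy as np

    def behavior_policy(Q, state, n_actions, epsilon=0.1):
        """Exploratory epsilon-greedy policy used to generate training behavior."""
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)  # explore: random action
        return int(np.argmax([Q[(state, a)] for a in range(n_actions)]))

    def target_policy(Q, state, n_actions):
        """Greedy policy that is learned during training."""
        return int(np.argmax([Q[(state, a)] for a in range(n_actions)]))

Because the behavior policy keeps :math:`\epsilon` probability mass on non-greedy actions, every state-action pair continues to be explored even while the (deterministic) target policy is being learned.
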
.. autoclass:: numpy_ml.rl_models.agents.MonteCarloAgent
    :members:
    :undoc-members:
    :inherited-members:

Temporal difference (TD) methods are examples of bootstrapping: they update their estimate of the value of state ``s`` on the basis of other learned estimates (e.g., the current estimate for the successor state), rather than waiting for a complete sample return.
Advantages of TD algorithms:
- They do not require a model of the environment, its reward, or its next-state probability distributions.
- They can be implemented in an online, fully incremental fashion. This allows them to be used in infinite-horizon settings or when episodes take prohibitively long to finish.
- TD algorithms learn from each transition regardless of what subsequent actions are taken.
- In practice, TD methods have usually been found to converge faster than constant-:math:`\alpha` Monte Carlo methods on stochastic tasks.
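
For concreteness, the core of a tabular TD(0) state-value update is sketched below (illustrative only, not numpy_ml's implementation; ``V`` is assumed to be a dictionary mapping states to value estimates, e.g. a ``defaultdict(float)``). The estimate for the current state is nudged toward the bootstrapped target ``r + gamma * V[s_next]`` after every transition, rather than toward a complete sample return:

.. code-block:: python

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
        """One-step temporal difference (TD(0)) update for a state-value table.

        The target bootstraps off the existing estimate ``V[s_next]``, so the
        update can be applied online after every transition.
        """
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])
        return V
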
.. autoclass:: numpy_ml.rl_models.agents.TemporalDifferenceAgent
    :members:
    :undoc-members:
    :inherited-members: