title	date	lastmod
Monte Carlo Policy	2022-11-08	2022-11-21

Monte Carlo

Estimate the value function by sampling complete episodes and averaging the observed returns:

First-visit MC: average the returns only for the first time (s,a) is visited in an episode/trial. Repeated visits of (s,a) in the same trial do not constitute a new learning condition.
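The first-visit rule can be sketched in Python as below. This is a minimal illustration, not the notes' own method. It assumes each episode is a list of `(key, reward)` pairs, where `key` is a state or `(s,a)` pair and `reward` is the reward received on leaving it. The name `first_visit_mc` is hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence

# Assumed episode format: (key, reward) per step, where key is a state or an
# (s, a) pair and reward is the reward received after that step.
Step = tuple[Hashable, float]


def first_visit_mc(episodes: Sequence[Sequence[Step]], gamma: float = 1.0) -> dict[Hashable, float]:
    """Average, over episodes, the return that follows the FIRST visit to each key."""
    returns = defaultdict(list)
    for episode in episodes:
        # Walk backwards so g accumulates G_t = r_{t+1} + gamma * G_{t+1}.
        g = 0.0
        first_return = {}
        for key, reward in reversed(episode):
            g = reward + gamma * g
            # Overwriting while moving backwards leaves the return from the
            # earliest (first) visit once the loop finishes.
            first_return[key] = g
        # Each key contributes at most one sample per episode.
        for key, g_first in first_return.items():
            returns[key].append(g_first)
    return {key: sum(gs) / len(gs) for key, gs in returns.items()}
```

For example, `first_visit_mc([[("A", 1), ("B", 0), ("A", 2)], [("B", 4)]])` gives `{"A": 3.0, "B": 3.0}`. The second visit to `"A"` in the first episode adds no extra sample.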

Grid World Scenario: Discount factor $\gamma = 1$

Trial	$G_t$ from first visit to (1,1)	$G_t$ from first visit to (2,2)
(1,1)->(1,2)->(1,3)	$G_t=0+0+1^2\times(-5)=-5$	NA (not visited)
(1,1)->(1,2)->(2,2)->(2,3)	$G_t=0+0+5=5$	$G_t=5$
(1,1)->(2,1)->(2,2)->(2,3)	$G_t=5$	$G_t=5$

Monte Carlo estimate of Q for (1,1): $\frac{5+5-5}{3}=\frac{5}{3}$
Monte Carlo estimate of Q for (2,2): $\frac{5+5}{2}=5$
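The arithmetic above can be checked with a short Python sketch. `discounted_return` is a hypothetical helper. The notes only write out reward sequences for the first two rows, so the third row's return is taken directly from the table. `Fraction` keeps the averages exact.

```python
from fractions import Fraction


def discounted_return(rewards, gamma=Fraction(1)):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


# Reward sequences as written in the first two rows of the table (gamma = 1).
g_trial1 = discounted_return([0, 0, -5])  # -5
g_trial2 = discounted_return([0, 0, 5])   # 5
g_trial3 = 5  # return for the third row, taken directly from the table

# First-visit returns collected per state; (2,2) is not visited in trial 1.
returns = {
    (1, 1): [g_trial1, g_trial2, g_trial3],
    (2, 2): [5, 5],
}
estimates = {s: Fraction(sum(gs), len(gs)) for s, gs in returns.items()}
print(estimates[(1, 1)], estimates[(2, 2)])  # 5/3 5
```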

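This only works when we have the entire path ending in a goal (terminal) state, because the return $G_t$ is only known once the episode ends. What if we do not have the whole path? Use Q-Learning, which updates its estimate after every step from a single transition.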
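A minimal sketch of why Q-Learning does not need the whole path: its tabular update uses only one transition (s, a, r, s') and bootstraps from the current estimate of the next state. The action names, the learning rate `alpha=0.5`, and the helper `q_learning_update` are hypothetical. The reward of 5 on reaching (2,3) follows the trials in the table.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=1.0, terminal=False):
    """Apply one tabular Q-learning step from a single transition (s, a, r, s')."""
    # Bootstrap from the current estimate of the best next action instead of
    # waiting for the full return of the episode.
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical action set and transitions; unseen (state, action) pairs start at 0.
ACTIONS = ["up", "down", "left", "right"]
q = defaultdict(float)
v1 = q_learning_update(q, (2, 2), "right", 5, (2, 3), ACTIONS, terminal=True)
v2 = q_learning_update(q, (1, 2), "down", 0, (2, 2), ACTIONS)
print(v1, v2)  # 2.5 1.25
```

The second update already moves the estimate for (1,2), even though no episode through (1,2) was completed.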
