title | date | lastmod |
---|---|---|
Monte Carlo Policy | 2022-11-08 | 2022-11-21 |
Estimate the value function from sampling:
First-visit MC: average the returns only for the first time (s,a) is visited in each episode (trial). Repeated visits to (s,a) within the same trial do not constitute a new learning sample.
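
The first-visit rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `first_visit_mc_q`, the `(state, action, reward)` episode layout, the toy states `"A"` and `"B"`, and the discount factor of 0.5 are all assumptions made for the example and do not come from the grid world in this note.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma):
    """Estimate Q(s, a) by averaging first-visit returns.

    episodes: list of episodes; each episode is a list of
              (state, action, reward) tuples, where reward is the reward
              received after taking `action` in `state` (assumed layout).
    gamma:    discount factor.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # Walk backwards to get the discounted return from every time step.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            reward = episode[t][2]
            g = reward + gamma * g
            returns[t] = g

        # Only the first occurrence of (s, a) in this episode contributes.
        seen = set()
        for t, (state, action, _) in enumerate(episode):
            key = (state, action)
            if key in seen:
                continue  # repeated visit in the same episode: ignored
            seen.add(key)
            returns_sum[key] += returns[t]
            returns_count[key] += 1

    return {key: returns_sum[key] / returns_count[key] for key in returns_sum}


# Hypothetical toy data: ("A", "right") is visited twice in the first episode,
# but only its first visit is averaged.
episode_1 = [("A", "right", 0.0), ("B", "right", 0.0),
             ("A", "right", 0.0), ("B", "up", 1.0)]
episode_2 = [("A", "right", 0.0), ("B", "up", 1.0)]

q = first_visit_mc_q([episode_1, episode_2], gamma=0.5)
print(q[("A", "right")])  # 0.3125
```

With these assumed numbers, the first visit to `("A", "right")` yields a return of 0.125 in the first episode and 0.5 in the second, so the estimate is their average, 0.3125.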
Grid World Scenario:
Discount factor
Trial | (1,1) | (2,2) |
---|---|---|
(1,1)->(1,2)->(1,3) | NA | |
(1,1)->(1,2)->(2,2)->(2,3) | ||
(1,1)->(2,1)->(2,2)->(2,3) | ||
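
Monte Carlo estimate of Q for (1,1): the average of the first-visit returns for (1,1) over the trials that visit it.

Monte Carlo estimate of Q for (2,2): the average of the first-visit returns for (2,2) over the trials that visit it.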
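
For reference, the per-trial returns and the resulting estimates can be written in standard notation. This is a sketch under assumed symbols, since the note does not fix its own: $\gamma$ is the discount factor, $r_{t+k+1}$ the reward received $k+1$ steps after time $t$, $T$ the final time step of the trial, $N(s,a)$ the number of trials that visit $(s,a)$, and $t_i$ the time of the first visit to $(s,a)$ in the $i$-th such trial.

$$
G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1},
\qquad
Q(s,a) \approx \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G^{(i)}_{t_i}
$$

A trial that never visits a state contributes nothing to that state's average, which is why such an entry is marked NA.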

This only works when we have the entire path ending in a goal state. When the whole path is not available, use Q-Learning instead.