| title | date | lastmod |
| --- | --- | --- |
| Monte Carlo Policy | 2022-11-08 | 2022-11-21 |

Monte Carlo

[Image: Pasted image 20220415190047]

Estimate the value function from sampling:

[Image: Pasted image 20220415190342]

First-visit MC: average the returns only from the first time $(s,a)$ is visited in an episode/trial. Repeated visits to $(s,a)$ within the same trial do not count as new learning samples.
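The first-visit rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `first_visit_returns` and the episode layout (a list of `(key, reward)` pairs, where `reward` is received on leaving `key`) are assumptions made for this example, and `key` can be a state or an $(s,a)$ pair.

```python
def first_visit_returns(episode, gamma=1.0):
    """Map each key to the discounted return that follows its FIRST visit.

    `episode` is a list of (key, reward) pairs in time order (assumed layout).
    """
    returns = {}
    g = 0.0
    # Walk backwards so the return accumulates as G_t = r_t + gamma * G_{t+1}.
    for key, reward in reversed(episode):
        g = reward + gamma * g
        # Earlier visits are processed later in this loop, so they overwrite
        # later visits; what remains is the return from the first visit.
        returns[key] = g
    return returns


# Hypothetical episode in which "A" is visited twice.
episode = [("A", 0), ("B", 0), ("A", 0), ("C", 5)]
result = first_visit_returns(episode, gamma=0.5)
print(result["A"])  # 0.625: return from the first visit to "A", not 2.5
```

Iterating in reverse avoids recomputing the discounted sum for every time step, and the overwrite makes the repeated visit to `"A"` contribute nothing extra.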

Grid World Scenario: Discount factor $\gamma = 1$

| Trial | (1,1) | (2,2) |
| --- | --- | --- |
| (1,1)->(1,2)->(1,3) | $G_t=0+0+1^2\times(-5)=-5$ | NA |
| (1,1)->(1,2)->(2,2)->(2,3) | $G_t=0+0+5=5$ | $G_t=5$ |
| (1,1)->(2,1)->(2,2)->(2,3) | $G_t=5$ | $G_t=5$ |

Monte Carlo estimate of Q for (1,1): $\frac{5+5-5}{3}=\frac{5}{3}$

Monte Carlo estimate of Q for (2,2): $\frac{5+5}{2}=5$
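The averaging step for this grid world can be checked with a short Python sketch. The per-trial returns are copied from the table above; the names `trial_returns` and `mc_estimates` are made up for this example, and `Fraction` is used only so the result prints as an exact ratio.

```python
from collections import defaultdict
from fractions import Fraction

# First-visit returns observed in each trial (gamma = 1), keyed by grid cell.
# A cell that a trial never visits is simply absent (the "NA" entry).
trial_returns = [
    {(1, 1): -5},
    {(1, 1): 5, (2, 2): 5},
    {(1, 1): 5, (2, 2): 5},
]


def mc_estimates(trial_returns):
    """Average the observed returns per cell across all trials."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for returns in trial_returns:
        for cell, g in returns.items():
            totals[cell] += g
            counts[cell] += 1
    return {cell: Fraction(totals[cell], counts[cell]) for cell in totals}


estimates = mc_estimates(trial_returns)
print(estimates[(1, 1)])  # 5/3
print(estimates[(2, 2)])  # 5
```

Note that (2,2) is averaged over two trials rather than three, because the first trial never visits it.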

This only works when we have the entire path ending in a goal state. What if we do not have the whole path? Use Q-Learning.