---
title: Markov Decision Process
date: 2022-11-08
lastmod: 2022-11-21
---

## Components of MDP

![Pasted image 20220415205145](Pics/Pasted%20image%2020220415205145.png)
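
A minimal sketch of how these components might be held in code, assuming the usual formulation (states $S$, actions $A$, transition probabilities $P(s'|s,a)$, rewards $r(s,a,s')$ and discount $\gamma$, matching the Bellman equation below). The two-state `toy` MDP is made up for illustration and is not the grid world from the images.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S
    actions: list       # A
    P: dict             # P[(s, a)] -> {s': probability}
    R: dict             # R[(s, a, s')] -> reward (treated as 0 if missing)
    gamma: float = 0.9  # discount factor

# Hypothetical toy MDP: two states, two actions, one rewarding transition.
toy = MDP(
    states=["s1", "s2"],
    actions=["left", "right"],
    P={("s1", "right"): {"s2": 0.8, "s1": 0.2},
       ("s1", "left"):  {"s1": 1.0},
       ("s2", "right"): {"s2": 1.0},
       ("s2", "left"):  {"s1": 0.8, "s2": 0.2}},
    R={("s1", "right", "s2"): 5.0},
)
```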

## Formulating an MDP

![Pasted image 20220415223757](Pics/Pasted%20image%2020220415223757.png)	
![Pasted image 20220415223941](Pics/Pasted%20image%2020220415223941.png)

## The Bellman Equation

$$ V_{i+1}(s) = \max_a \sum_{s'} P(s'|s,a)\big(r(s,a,s') + \gamma V_i(s')\big) $$
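
A direct translation of this backup into Python (a sketch; `MDP` and `toy` are the hypothetical structures from the snippet above):

```python
def bellman_backup(mdp, V, s):
    """V_{i+1}(s) = max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * V_i(s'))."""
    best = float("-inf")
    for a in mdp.actions:
        next_states = mdp.P.get((s, a), {})
        if not next_states:
            continue  # action not defined in this state
        q = sum(p * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                for s2, p in next_states.items())
        best = max(best, q)
    return best
```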

## Value iteration

![Pasted image 20220415160057](Pics/Pasted%20image%2020220415160057.png) ![Pasted image 20220415160224](Pics/Pasted%20image%2020220415160224.png)

Grid World Scenario:

| $V_i$ | (1,1) | (1,2) | (1,3) | (2,1) | (2,2) | (2,3) |
|-------|-------|-------|-------|-------|-------|-------|
| $V_0$ | 0 | 0 | 0 | 0 | 0 | 0 |
| $V_1$ | 0 | 0 | 0 | 0 | $Up = 0.8\times0+0.1\times5+0.1\times0+0.9\times0=0.5$ <br> $Down = 0.8\times0+0.1\times5+0.1\times0+0.9\times0=0.5$ <br> $Left = 0.8\times0+0.1\times5+0.1\times0=0.5$ <br> $Right = 0.8\times5+0.1\times0+0.1\times0=4$ <br> Max = 4 | 0 |
| $V_2$ | 0 | $Up=0.8\times(0+0.9\times4)+0.1\times0+0.1\times(-5)=2.38$ | 0 | $Right=0.8\times(0+0.9\times4)+0.1\times0+0.1\times0=2.88$ | $Right=0.8\times(5+0.9\times0)+0.1\times(0+0.9\times4)=4.36$ | 0 |
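
Value iteration just repeats the Bellman backup for every state until the values stop changing. A sketch using the toy MDP and `bellman_backup` from above (the actual grid-world layout is only in the images, so the numbers here won't match the table):

```python
def value_iteration(mdp, tol=1e-6, max_iters=10_000):
    V = {s: 0.0 for s in mdp.states}  # V_0 = 0 everywhere
    for _ in range(max_iters):
        # one synchronous sweep over all states
        new_V = {s: bellman_backup(mdp, V, s) for s in mdp.states}
        if max(abs(new_V[s] - V[s]) for s in mdp.states) < tol:
            return new_V
        V = new_V
    return V

V_star = value_iteration(toy)
```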

## Obtain the optimal policy

![Pasted image 20220415211535](Pics/Pasted%20image%2020220415211535.png)

> [!NOTE]
> Once we have $V^*$, we can plug $V^*$ into the Bellman equation for each state and action to obtain the corresponding $Q(s,a)$ value. $\pi^*$ picks, in each state, the action with the maximum $Q(s,a)$, i.e. the action whose Q-value equals $V^*(s)$.
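
A sketch of that extraction step (reusing the hypothetical `toy` MDP and the `V_star` computed in the earlier snippets):

```python
def q_value(mdp, V, s, a):
    """Q(s,a) from the Bellman equation, given value estimates V."""
    return sum(p * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
               for s2, p in mdp.P.get((s, a), {}).items())

def greedy_policy(mdp, V):
    """pi(s) = argmax_a Q(s,a)."""
    policy = {}
    for s in mdp.states:
        available = [a for a in mdp.actions if (s, a) in mdp.P]
        policy[s] = max(available, key=lambda a: q_value(mdp, V, s, a))
    return policy

pi_star = greedy_policy(toy, V_star)  # e.g. {'s1': 'right', 's2': 'left'} for the toy MDP
```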

## Policy iteration

> [!NOTE] Steps
> 1. Start with a random policy and $V(s)=0$ for the 1st iteration.
> 2. Policy evaluation (calculate $V$) until stable.
> 3. For each state $s$, calculate $V(s)$ using the action given by the policy (this is different from calculating $V(s)$ using $\max_a Q(s,a)$).
> 4. Policy improvement (calculate the new policy).
> 5. For each state $s$, calculate $Q(s,a)$ for all actions using the stabilized $V(s)$ values.
> 6. Update the policy with the action that maximizes $Q(s,a)$ in each state (see the code sketch below).
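
A sketch of these steps in code, reusing the hypothetical `toy`, `q_value` and `greedy_policy` from the snippets above:

```python
import random

def policy_evaluation(mdp, policy, tol=1e-6):
    V = {s: 0.0 for s in mdp.states}
    while True:
        # step 3: back up each state using the policy's action, not the max over actions
        new_V = {s: q_value(mdp, V, s, policy[s]) for s in mdp.states}
        if max(abs(new_V[s] - V[s]) for s in mdp.states) < tol:
            return new_V
        V = new_V

def policy_iteration(mdp):
    # step 1: random initial policy
    policy = {s: random.choice([a for a in mdp.actions if (s, a) in mdp.P])
              for s in mdp.states}
    while True:
        V = policy_evaluation(mdp, policy)   # step 2: evaluate until stable
        new_policy = greedy_policy(mdp, V)   # steps 4-6: greedy improvement
        if new_policy == policy:             # stop once the policy no longer changes
            return policy, V
        policy = new_policy

pi, V = policy_iteration(toy)
```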

![Pasted image 20220415210325](Pics/Pasted%20image%2020220415210325.png)

Asynchronous refers to in-series updates: values calculated in previous steps (within the same iteration) are used in the next state calculations.

Synchronous refers to in-parallel updates: all $V(s)$ values are calculated using the previous iteration's $V(s)$ estimates.
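
The difference in code (sketch, reusing `bellman_backup` from above):

```python
def sweep_synchronous(mdp, V):
    # every state is backed up from the previous iteration's estimates
    return {s: bellman_backup(mdp, V, s) for s in mdp.states}

def sweep_asynchronous(mdp, V):
    # in-place: later states in the sweep already see values updated earlier in it
    V = dict(V)
    for s in mdp.states:
        V[s] = bellman_backup(mdp, V, s)
    return V
```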

Without the transition function or transition probabilities, we need model-free methods such as Monte Carlo policy evaluation (reinforcement learning).