---
title: Markov Decision Process
date: 2022-11-08
lastmod: 2022-11-21
---

## Components of MDP

![Pasted image 20220415205145](Pics/Pasted%20image%2020220415205145.png)
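
A minimal sketch of how these components might be held in code, assuming the usual formulation (states $S$, actions $A$, transition probabilities $P(s'|s,a)$, rewards $r(s,a,s')$ and discount $\gamma$, matching the Bellman equation below). The two-state `toy` MDP is made up for illustration and is not the grid world from the images.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list        # S
    actions: list       # A
    P: dict             # P[(s, a)] -> {s': probability}
    R: dict             # R[(s, a, s')] -> reward (treated as 0 if missing)
    gamma: float = 0.9  # discount factor

# Hypothetical toy MDP: two states, two actions, one rewarding transition.
toy = MDP(
    states=["s1", "s2"],
    actions=["left", "right"],
    P={("s1", "right"): {"s2": 0.8, "s1": 0.2},
       ("s1", "left"):  {"s1": 1.0},
       ("s2", "right"): {"s2": 1.0},
       ("s2", "left"):  {"s1": 0.8, "s2": 0.2}},
    R={("s1", "right", "s2"): 5.0},
)
```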

## Formulating an MDP

![Pasted image 20220415223757](Pics/Pasted%20image%2020220415223757.png)	
![Pasted image 20220415223941](Pics/Pasted%20image%2020220415223941.png)

## The Bellman Equation

$$ V_{i+1}(s) = \max_a \sum_{s'} P(s'|s,a)\big(r(s,a,s') + \gamma V_i(s')\big) $$
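
A direct translation of this backup into Python (a sketch; `MDP` and `toy` are the hypothetical structures from the snippet above):

```python
def bellman_backup(mdp, V, s):
    """V_{i+1}(s) = max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * V_i(s'))."""
    best = float("-inf")
    for a in mdp.actions:
        next_states = mdp.P.get((s, a), {})
        if not next_states:
            continue  # action not defined in this state
        q = sum(p * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                for s2, p in next_states.items())
        best = max(best, q)
    return best
```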

## Value iteration

![Pasted image 20220415160057](Pics/Pasted%20image%2020220415160057.png) ![Pasted image 20220415160224](Pics/Pasted%20image%2020220415160224.png)

Grid World Scenario:

| $V_i$ | (1,1) | (1,2) | (1,3) | (2,1) | (2,2) | (2,3) |
|-------|-------|-------|-------|-------|-------|-------|
| $V_0$ | 0 | 0 | 0 | 0 | 0 | 0 |
| $V_1$ | 0 | 0 | 0 | 0 | $Up = 0.8\times0+0.1\times5+0.1\times0+0.9\times0=0.5$ <br> $Down = 0.8\times0+0.1\times5+0.1\times0+0.9\times0=0.5$ <br> $Left = 0.8\times0+0.1\times5+0.1\times0=0.5$ <br> $Right = 0.8\times5+0.1\times0+0.1\times0=4$ <br> Max = 4 | 0 |
| $V_2$ | 0 | $Up=0.8\times(0+0.9\times4)+0.1\times0+0.1\times(-5)=2.38$ | 0 | $Right=0.8\times(0+0.9\times4)+0.1\times0+0.1\times0=2.88$ | $Right=0.8\times(5+0.9\times0)+0.1\times(0+0.9\times4)=4.36$ | 0 |
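
Value iteration just repeats the Bellman backup for every state until the values stop changing. A sketch using the toy MDP and `bellman_backup` from above (the actual grid-world layout is only in the images, so the numbers here won't match the table):

```python
def value_iteration(mdp, tol=1e-6, max_iters=10_000):
    V = {s: 0.0 for s in mdp.states}  # V_0 = 0 everywhere
    for _ in range(max_iters):
        # one synchronous sweep over all states
        new_V = {s: bellman_backup(mdp, V, s) for s in mdp.states}
        if max(abs(new_V[s] - V[s]) for s in mdp.states) < tol:
            return new_V
        V = new_V
    return V

V_star = value_iteration(toy)
```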

## Obtain the optimal policy

![Pasted image 20220415211535](Pics/Pasted%20image%2020220415211535.png)

> [!NOTE]
> Once we have $V^*$, we can plug $V^*$ into the Bellman equation for each state and action to obtain the corresponding $Q(s,a)$ value. $\pi^*$ picks, in each state, the action with the maximum $Q(s,a)$, i.e. the action whose Q-value equals $V^*(s)$.
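
A sketch of that extraction step (reusing the hypothetical `toy` MDP and the `V_star` computed in the earlier snippets):

```python
def q_value(mdp, V, s, a):
    """Q(s,a) from the Bellman equation, given value estimates V."""
    return sum(p * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
               for s2, p in mdp.P.get((s, a), {}).items())

def greedy_policy(mdp, V):
    """pi(s) = argmax_a Q(s,a)."""
    policy = {}
    for s in mdp.states:
        available = [a for a in mdp.actions if (s, a) in mdp.P]
        policy[s] = max(available, key=lambda a: q_value(mdp, V, s, a))
    return policy

pi_star = greedy_policy(toy, V_star)  # e.g. {'s1': 'right', 's2': 'left'} for the toy MDP
```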

## Policy iteration

> [!NOTE] Steps
> 1. Start with a random policy and $V(s)=0$ for the 1st iteration.
> 2. Policy evaluation (calculate $V$) until stable.
> 3. For each state $s$, calculate $V(s)$ using the action given by the policy (this is different from calculating $V(s)$ using $\max_a Q(s,a)$).
> 4. Policy improvement (calculate the new policy).
> 5. For each state $s$, calculate $Q(s,a)$ for all actions using the stabilized $V(s)$ values.
> 6. Update the policy with the action that maximizes $Q(s,a)$ in each state (see the code sketch below).
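
A sketch of these steps in code, reusing the hypothetical `toy`, `q_value` and `greedy_policy` from the snippets above:

```python
import random

def policy_evaluation(mdp, policy, tol=1e-6):
    V = {s: 0.0 for s in mdp.states}
    while True:
        # step 3: back up each state using the policy's action, not the max over actions
        new_V = {s: q_value(mdp, V, s, policy[s]) for s in mdp.states}
        if max(abs(new_V[s] - V[s]) for s in mdp.states) < tol:
            return new_V
        V = new_V

def policy_iteration(mdp):
    # step 1: random initial policy
    policy = {s: random.choice([a for a in mdp.actions if (s, a) in mdp.P])
              for s in mdp.states}
    while True:
        V = policy_evaluation(mdp, policy)   # step 2: evaluate until stable
        new_policy = greedy_policy(mdp, V)   # steps 4-6: greedy improvement
        if new_policy == policy:             # stop once the policy no longer changes
            return policy, V
        policy = new_policy

pi, V = policy_iteration(toy)
```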

![Pasted image 20220415210325](Pics/Pasted%20image%2020220415210325.png)

Asynchronous refers to in-series updates: values calculated in previous steps (within the same iteration) are used in the next state calculations.

Synchronous refers to in-parallel updates: all $V(s)$ values are calculated using the previous iteration's $V(s)$ estimates.
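
The difference in code (sketch, reusing `bellman_backup` from above):

```python
def sweep_synchronous(mdp, V):
    # every state is backed up from the previous iteration's estimates
    return {s: bellman_backup(mdp, V, s) for s in mdp.states}

def sweep_asynchronous(mdp, V):
    # in-place: later states in the sweep already see values updated earlier in it
    V = dict(V)
    for s in mdp.states:
        V[s] = bellman_backup(mdp, V, s)
    return V
```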

Without the transition function or transition probabilities, we need model-free methods such as Monte Carlo policy evaluation (reinforcement learning).