title | date | lastmod
---|---|---
Markov Decision Process | 2022-11-08 | 2022-11-21


Value iteration table for the grid-world states, with all values initialized to 0:

(1,1) | (1,2) | (1,3) | (2,1) | (2,2) | (2,3)
---|---|---|---|---|---
0 | 0 | 0 | 0 | 0 | 0

Example backup for the action *Right*:

$Right = 0.8\times(5 + 0.9\times 0) + 0.1\times(0 + 0.9\times 4) = 4.36$
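The backup above can be reproduced in a few lines. This is a minimal sketch, assuming a reward of 5 for the move that reaches the goal and 0 otherwise, a discount of 0.9, an 0.8/0.1 split between the intended move and the slip, and successor values 0 and 4 read from the worked calculation; the variable names are illustrative only.

```python
gamma = 0.9  # discount factor

# (probability, immediate reward, current value of the resulting state)
outcomes = [
    (0.8, 5, 0.0),  # intended move reaches the goal
    (0.1, 0, 4.0),  # slip into a neighbour whose current value is 4
]

# One Bellman backup for the action "Right" in this state.
q_right = sum(p * (r + gamma * v) for p, r, v in outcomes)
print(q_right)  # prints roughly 4.36
```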
[!NOTE]
Once we have $V^*$, we can plug it into the Bellman equation for each state and action to obtain the corresponding $Q(s,a)$ value.
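In symbols, this extraction step is one backup over the optimal values (here $P$ is the transition probability, $R$ the reward, and $\gamma$ the discount; this generic form matches the worked calculation above but the notation is not defined elsewhere in this note):

$$Q(s,a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^*(s')\right]$$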
[!NOTE] Steps
- Start with a random policy and V(s) = 0 for the first iteration
- Policy evaluation (calculate V) until stable
    - For each state s, calculate V(s) using the action given by the current policy (this differs from calculating V(s) with $\max_a Q(s,a)$)
- Policy improvement (calculate a new policy)
    - For each state s, calculate $Q(s,a)$ for all actions using the stabilized V(s) values
    - Update the policy with the action that maximizes $Q(s,a)$ in each state (a code sketch of the full loop follows below)
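A minimal policy-iteration sketch of these steps, assuming the MDP is given as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (rewards) with discount `gamma`; none of these names come from the note itself.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # arbitrary starting policy
    V = np.zeros(n_states)                   # V(s) = 0 for the first iteration

    while True:
        # Policy evaluation: compute V(s) for the current policy until stable.
        while True:
            V_new = np.array([
                np.sum(P[s, policy[s]] * (R[s, policy[s]] + gamma * V))
                for s in range(n_states)
            ])
            stable = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if stable:
                break

        # Policy improvement: Q(s, a) for all actions from the stabilized V(s),
        # then act greedily in every state.
        Q = np.einsum("san,san->sa", P, R + gamma * V)
        new_policy = Q.argmax(axis=1)

        if np.array_equal(new_policy, policy):
            return policy, V                 # policy no longer changes
        policy = new_policy
```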
Asynchronous means in series: values calculated earlier in the same sweep (iteration) are used in the calculations for the remaining states.
Synchronous means in parallel: all V(s) values are calculated from the previous iteration's V(s) estimates.
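The difference is only the sweep order. A sketch under the same array assumptions as above (`P[s, a, s']`, `R[s, a, s']`, discount `gamma`):

```python
import numpy as np

def synchronous_sweep(P, R, V, gamma=0.9):
    # In parallel: every new V(s) is computed from the previous iteration's
    # estimates, so the old vector V stays untouched during the sweep.
    return np.einsum("san,san->sa", P, R + gamma * V).max(axis=1)

def asynchronous_sweep(P, R, V, gamma=0.9):
    # In series: V is updated state by state, so states later in the sweep
    # already see the values computed earlier in the same iteration.
    V = V.copy()
    for s in range(len(V)):
        V[s] = np.max(np.einsum("an,an->a", P[s], R[s] + gamma * V))
    return V
```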
Without the transition function or probabilities, we need Monte Carlo policy evaluation (model-free reinforcement learning).
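A first-visit Monte Carlo policy-evaluation sketch, where no transition model is used; `env` is assumed to expose `reset() -> state` and `step(action) -> (next_state, reward, done)`, and `policy` maps a state to an action. These names are assumptions for illustration, not from the original note.

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, episodes=1000, gamma=0.9):
    returns = defaultdict(list)

    for _ in range(episodes):
        # Roll out one episode by following the policy (no model needed).
        trajectory, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, reward))
            state = next_state

        # Walk the episode backwards, accumulating discounted returns,
        # and keep the return seen at the first visit of each state.
        G, first_visit = 0.0, {}
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            first_visit[state] = G   # earlier visits overwrite later ones

        for state, G_first in first_visit.items():
            returns[state].append(G_first)

    # V(s) is the average return observed from state s.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```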