title | date | lastmod |
---|---|---|
Q-Learning | 2022-11-08 | 2022-11-21 |
Q-learning uses temporal-difference (TD) learning to update the Q-values at each time step as the agent interacts with the environment.
New sample: this uses the maximum utility that can be achieved from the state we are entering, i.e. $\max_{a'} Q(s', a')$.
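
As a minimal sketch of how such a sample can drive a TD update, the snippet below assumes the standard tabular Q-learning rule, where the sample is the reward plus the discounted maximum Q-value of the next state. The function name `q_update`, the learning rate `alpha`, the discount `gamma`, the action names, and the rewards are all hypothetical choices for illustration and are not taken from this note.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=1.0):
    """Apply one temporal-difference update to Q[(state, action)]."""
    # Maximum utility achievable from the state we are entering.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    # New sample: observed reward plus the discounted best next value.
    sample = reward + gamma * best_next
    # Move the old estimate toward the sample by a fraction alpha.
    Q[(state, action)] += alpha * (sample - Q[(state, action)])
    return Q[(state, action)]


# All Q-values start at 0 (assumed initialisation).
Q = defaultdict(float)
actions = ["left", "right"]  # hypothetical action set

# Hypothetical transition: moving right from (1, 2) yields reward 10.
first = q_update(Q, (1, 2), "right", reward=10.0,
                 next_state=(1, 3), next_actions=actions)

# Hypothetical transition: moving right from (1, 1) yields reward 0,
# but the update still picks up the value learned for (1, 2).
second = q_update(Q, (1, 1), "right", reward=0.0,
                  next_state=(1, 2), next_actions=actions)

print(first, second)
```

Under these assumed numbers, the second call shows how value propagates backwards: the entry for `(1, 1)` becomes non-zero only because the entry for `(1, 2)` was updated first.
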

Grid World Scenario:

Trial 1

(1,1 right) | (1,2 right) |
---|---|
0 | |

Trial 2

(1,1 right) | (1,2 right) |
---|---|
0 | |