Polishing README for finite state, and progress reports

Updating README to document how to play with the finite state experiment, and report final results.
Status reports on everything else.
No code changes apart from deleting some whitespace, so merging.

---
All arguments are specified in `main.py`. In Experiments, below, we explain the core experiment and the relevant code.

- `--agent pess` - pessimistic agent (see Experiments, below)
- `--quantile 4` - use the 4th index of QUANTILES (as in `main.py`)
- `--n-steps 100000` - train for 100k steps. Default report period of 500.
- `--render 1` - rendering verbosity 1 of (0, 1, 2)
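
For example, a typical invocation combining the flags above might look like the following (a sketch only; additional arguments and defaults are assumed to be acceptable as-is):

```bash
python main.py --agent pess --quantile 4 --n-steps 100000 --render 1
```
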
## Environment details

We implement a simple cliffworld environment, in `env.py`.
We have a safe zone (`0`) surrounded by 'cliffs' that provide zero reward forever (`-1`).
The agent (`2`) moves in the safe zone.

```
-1 -1 -1 -1 -1 -1 -1
-1 0 0 0 0 0 -1
...
-1 0 0 0 0 0 -1
-1 -1 -1 -1 -1 -1 -1
```

Grid spaces:

- `-1`: a cliff state (0 reward, ends episode)
- `2`: the agent position
- `0`: a grid space that can have various effects, determined by `transition_defs.py`
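
For illustration only (this is not the repository's `env.py` code; the grid height and the agent's start position are assumptions), a grid like the one above can be represented as a small integer array:

```python
# Illustrative only: a cliff-bordered grid like the README's diagram,
# represented as an integer array. Values follow the legend above:
# -1 = cliff, 0 = safe zone, 2 = agent. Size and agent position are assumed.
import numpy as np

HEIGHT, WIDTH = 7, 7  # assumed size; the diagram shows 7 columns

grid = np.full((HEIGHT, WIDTH), -1, dtype=int)  # cliff everywhere...
grid[1:-1, 1:-1] = 0                            # ...except the inner safe zone
agent_pos = (HEIGHT // 2, WIDTH // 2)           # assumed start position
grid[agent_pos] = 2                             # mark the agent

print(grid)
```
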

### Configurations

There are a few environments available, with the `--trans n` argument:
- `0`) Test env, constant reward everywhere (default 0.7)
- `1`) Normally-distributed rewards, mean reward sloping up linearly, left to right.
- `2`) Constant reward, mean reward sloping up linearly, left to right.
- `3`) As 1, but stochastic transitions with a 60% probability of the deterministic next state.

The other transitions are randomly distributed. Note that the agent is never stochastically thrown over the cliff: it must actively take that action to fail.
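
For instance, to train the pessimistic agent on the stochastic-transition variant (a sketch using the flags documented above; defaults assumed otherwise):

```bash
python main.py --agent pess --quantile 4 --trans 3 --n-steps 100000
```
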
### Wrappers

With the `--wrapper WRAPPER` argument, interesting features can be added. For example, with either of the two wrappers below, every state has one action that, with low likelihood (1% by default), scores zero reward forever:
- `every_state` - rewards remain intact.
- `every_state_boost` - reward = 1. for the risky action, i.e. incentivising it.
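
For example, to enable the boosted risky action (an illustrative combination of flags documented in this README; other defaults assumed):

```bash
python main.py --agent pess --quantile 4 --wrapper every_state_boost --n-steps 100000
```
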

## Mentors

Different mentors are available with `--mentor MENTOR`. The most interesting are:

- `random_safe` - the mentor takes random actions that do not put the agent into a cliff state. This is useful for mimicking exploration, without implementing an informed mentor.
- `avoid_state_act` - when used with the wrappers (see above), the mentor is aware of the risky states and avoids them, though a small probability of taking them remains (1% by default). Otherwise as above.
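
For example, pairing the risky-state wrapper with the risk-aware mentor (an illustrative combination of the documented flags; defaults assumed otherwise):

```bash
python main.py --agent pess --quantile 4 --wrapper every_state --mentor avoid_state_act --n-steps 100000
```
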

## Experiments

The agent implementing the finite state case of the QuEUE algorithm is `agents.PessimisticAgent`. It uses the `estimators.QuantileQEstimator` and `estimators.ImmediateRewardEstimator` to make pessimistic updates and estimates of the Q value for a state-action pair.
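
As a rough sketch of the idea only (this is not the repository's QuEUE update rule; the quantile-regression-style step and every name below are illustrative assumptions), a low-quantile TD update keeps rarely demonstrated state-actions pessimistically valued:

```python
# Illustrative sketch, NOT the repository's QuEUE implementation.
# A tabular TD update toward a low quantile (TAU) of the return estimate:
# the value is nudged up by only TAU per step when it is below the target,
# so poorly explored state-actions stay pessimistically low until evidence accumulates.
import numpy as np

GAMMA = 0.99  # discount factor (assumed)
LR = 0.1      # learning rate (assumed)
TAU = 0.03    # e.g. the "1st quantile (3%)" referenced in this README

N_STATES, N_ACTIONS = 25, 4               # assumed small gridworld
q_pess = np.zeros((N_STATES, N_ACTIONS))  # pessimistic Q table


def pessimistic_update(s, a, reward, s_next, done):
    """Quantile-regression-style step toward the TD target."""
    target = reward if done else reward + GAMMA * q_pess[s_next].max()
    # Pinball-loss gradient: step up by TAU when under the target,
    # step down by (1 - TAU) when above it.
    step = TAU if target > q_pess[s, a] else TAU - 1.0
    q_pess[s, a] += LR * step
```

With `TAU` well below 0.5, an action's value rises only slowly under repeated positive evidence, which is the pessimism property demonstrated below.
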
## Demonstration of pessimistic properties
We demonstrate properties of a pessimistic agent in the finite case:

### Fully stochastic environment

We observe that distributional Q learning algorithms demonstrate the properties of a pessimistic agent.

Plotted are the 1st and 2nd quantiles (3rd and 6th percentiles). Higher percentiles also demonstrate the property, but are omitted for clarity. They are plotted against an unbiased Q learning agent, meaning one that simply learns the expectation value of Q as normal, and an agent that follows the mentor forever.

None of the agents ever step onto a cliff state, including the unbiased Q learning agent. To prove the safety result, we consider another environment.

## Proving epistemic uncertainty matters

Using the `--wrapper every_state` setup (see above), we introduce a risky state-action, which the mentor is aware of and takes less frequently (i.e. it does not completely avoid the risk, as it has imperfect knowledge).

We plot the proportion of repeat experiments - over 50k timesteps - in which the _agent_ took the risky action even once. The x-axis represents mentor risk-taking frequency; each label encodes the experiment, e.g. `quant_0_001_5_rep` means:

- `quant_0` = the 1st quantile (3%) of the pessimistic agent
- `001` = the mentor took the risky action with frequency 0.01 (1%) during demonstrations
- `5_rep` = 5 repeat experiments constituted this datapoint.

To run these experiments:

```bash
python experiments/event_experiment/exp_main.py
```

### Stochastic reward

We demonstrate that the unbiased Q learner is more eager to take the risky action after it has been demonstrated only rarely. A pessimistic agent needs more reassurance before it is willing to take risks on rarely demonstrated maneuvers.

When the state is continuous, a pessimistic agent should, for example, avoid regions that have not been favoured by the mentor. In future work, we will show that this result generalises to function approximators.

### Fully stochastic

In the fully stochastic environment, the pessimistic agent starts taking the risky action at a lower frequency of risky demonstrations. We observe that the stochastic agent explores the grid more fully, so perhaps when the agent is better informed about the whole environment, epistemic uncertainty reduces (due to the way we approximate the transition uncertainty).

---

# Function approximators - Deep Q learning

### Gated linear networks

Using the Deepmind implementation, in `gated_linear_networks`, as the git repo does not have pip install support yet.