
Commit eabce54

Author: jamie

Polishing README for finite state, and progress reports

Updating README to document how to play with the finite state experiment, and report final results. Status reports on everything else. No code changes apart from deleting some whitespace, so merging.
1 parent 22e6cd4 commit eabce54

7 files changed: +143 -42 lines

README.md

+48 -14
@@ -1,25 +1,59 @@
-# pessimistic-agents
-A repository for code for empirical investigations of pessimistic agents
+# Pessimistic Agents
 
-# Setup
+[Pessimistic Agents](https://arxiv.org/abs/2006.08753)
+are ask-for-help reinforcement learning agents that offer guarantees of:
 
-## Supported conda env
+1. Eventually outperforming the mentor
+2. Eventually stopping querying the mentor
+3. Never causing unprecedented events to happen, with arbitrary probability
 
-With anaconda
+In this repository, we investigate their behaviour in the faithful setting, and explore approximations that allow them
+to be used in real-world RL problems.
 
-```bash
-conda env create -f torch_env_cpu.yml
-```
+Overview - see the individual README.md files for more detail.
+
+---
+
+## Distributional Q Learning - dist_q_learning/
+
+We introduce a tractable implementation of Pessimistic Agents. We approximate the Bayesian world and mentor models
+as a distribution over epistemic uncertainty of Q values. By using a pessimistic (low) quantile, we demonstrate the
+expected behaviour and safety results for a pessimistic agent.
+
+| Work | Status |
+| ------------- | ------------- |
+| Finite state Q Table proof of concept | ![DONE](https://via.placeholder.com/100x40/008000/FFFFFF?text=DONE) |
+| Continuous deep Q learning implementation | ![WIP](https://via.placeholder.com/100x40/FF7B00/FFFFFFF?text=WIP) |
+
+---
+## Faithful implementation - cliffworld/
 
-# Experiments
+Implement and investigate a faithful representation of a Bayesian Pessimistic Agent.
 
-## `pessimistic_prior`
+| Work | Status |
+| ------------- | ------------- |
+| Environment | ![DONE](https://via.placeholder.com/100x40/008000/FFFFFF?text=DONE) |
+| Agent | ![HOLD](https://via.placeholder.com/100x40/A83500/FFFFFFF?text=On+Hold) |
 
-Apply pessimism approximation to current RL agents.
+On hold; some progress has been made in implementing the environment and mentor models.
 
-## `dist_q_learning`
+---
 
-Learn from distributions of the Q-value estimate (using a pessimistic quantile)
+## Pessimistic RL - pessimistic_prior/
 
-See `dist_q_learning/README.md`
+Apply the pessimism approximation to neural-network-based, deep Q learning RL agents.
 
+| Work | Status |
+| ------------- | ------------- |
+| DQN proof of concept | ![HOLD](https://via.placeholder.com/100x40/A83500/FFFFFFF?text=On+Hold) |
+
+-----
+# Setup
+
+## Supported conda env
+
+With anaconda:
+
+```bash
+conda env create -f torch_env_cpu.yml
+```
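To make the "pessimistic (low) quantile" idea in the updated README concrete, here is a minimal sketch of quantile-based action selection over sampled Q values. It is illustrative only, not code from this repository: the sample count, action values and the 3% quantile are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical epistemic samples of Q(s, a): 50 posterior samples for each of
# 4 actions in the current state. In the repository this role is played by
# learned estimators, not raw samples.
q_samples = rng.normal(loc=[0.2, 0.5, 0.4, 0.3],
                       scale=[0.05, 0.4, 0.1, 0.05],
                       size=(50, 4))

# A standard Q learner scores actions by their mean (expected) Q value.
greedy_action = int(np.argmax(q_samples.mean(axis=0)))

# A pessimistic agent scores each action by a low quantile of its epistemic
# distribution, penalising actions it is uncertain about.
pessimistic_q = np.quantile(q_samples, 0.03, axis=0)  # e.g. the 3rd percentile
pessimistic_action = int(np.argmax(pessimistic_q))

print(greedy_action, pessimistic_action)  # typically differ when the high-mean action is also highly uncertain
```

With these illustrative numbers, the pessimistic choice tends to avoid the high-mean but highly uncertain action in favour of one with better-known value.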

dist_q_learning/README.md

+95 -26
@@ -1,8 +1,19 @@
-# Distributed Q Learning - QuEUE
+# Distributional Q Learning
 
-Algorithm specification pending.
+We introduce an algorithm that, instead of learning the expectation value of Q,
+keeps a distribution over Q values. With a distribution over Q values we approximate an ordering
+over world models, and can select a low quantile to demonstrate the behaviour of pessimistic agents.
 
-## Setup
+The distribution is taken over epistemic uncertainty, induced by e.g. a lack of observations
+in a certain area of the environment.
+
+The Distributional Q Learning algorithm is defined in `agents.PessimisticAgent`,
+which utilises two estimators to make pessimistic updates to, and estimates of, the Q value
+for a history of state-action pairs:
+- `q_estimators.QuantileQEstimator`
+- `estimators.ImmediateRewardEstimator`
+
+# Setup
 
 ```bash
 # Setup conda envs
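The hunk above names the estimators but not the update they perform. As a hedged illustration of how an immediate-reward posterior and a low quantile could combine into a pessimistic tabular Q update, here is a toy sketch; the class, its Beta-posterior reward model and all names are hypothetical stand-ins rather than the repository's `QuantileQEstimator`/`ImmediateRewardEstimator` API.

```python
import numpy as np
from scipy.stats import beta as beta_dist


class ToyPessimisticQTable:
    """Illustrative only: a tabular Q estimate updated pessimistically.

    Assumes rewards lie in [0, 1] and models the immediate reward of each
    (state, action) with a Beta posterior, standing in for an
    immediate-reward estimator.
    """

    def __init__(self, n_states, n_actions, quantile=0.03, gamma=0.99, lr=0.5):
        self.q = np.zeros((n_states, n_actions))
        # Beta(1, 1) prior on the immediate reward of every (state, action).
        self.alpha = np.ones((n_states, n_actions))
        self.beta = np.ones((n_states, n_actions))
        self.quantile, self.gamma, self.lr = quantile, gamma, lr

    def update(self, s, a, r, s_next):
        # Fold the observed reward into the immediate-reward posterior.
        self.alpha[s, a] += r
        self.beta[s, a] += 1.0 - r
        # Pessimistic immediate reward: a low quantile of that posterior.
        r_pess = beta_dist.ppf(self.quantile, self.alpha[s, a], self.beta[s, a])
        # Bellman-style target built from the pessimistic reward estimate.
        target = r_pess + self.gamma * self.q[s_next].max()
        self.q[s, a] += self.lr * (target - self.q[s, a])

    def act(self, s):
        # Greedy with respect to the pessimistic Q estimates.
        return int(np.argmax(self.q[s]))
```

Under this toy scheme, a rarely observed state-action keeps a wide posterior, so its pessimistic reward, and hence its Q value, stays low until more evidence arrives.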
@@ -15,18 +26,22 @@ source set_path.sh
 python main.py -h
 ```
 
-## Q Table implementation
+# ![DONE](https://via.placeholder.com/100x40/008000/FFFFFF?text=DONE) Q Table implementation
 
 Implements a finite state experiment.
 
 Goal: implement QuEUE faithfully, demonstrate properties of a pessimistic agent in an intuitive case.
 
-### Example
+## Example
 
 ```bash
+# display the help message
+python main.py -h
+
+# Run a basic pessimistic agent in a stochastic-reward environment
 python main.py --agent pess --quantile 4 --mentor random_safe --trans 1 --n-steps 100000 --render 1
 ```
-All arguments are specified in `main.py`. In Experiment, we explain the core experiment and relevant code.
+All arguments are specified in `main.py`.
 
 - `--agent pess` - pessimistic agent (see Experiment, below)
 - `--quantile 4` - use the 4th index of QUANTILES (as in `main.py`)
@@ -35,9 +50,11 @@ All arguments are specified in `main.py`. In Experiment, we explain the core exp
 - `--n-steps 100000` - train for 100k steps. Default report period of 500.
 - `--render 1` - rendering verbosity 1 of (0, 1, 2)
 
-### Environment
+## Environment details
 
 We implement a simple cliffworld environment, in `env.py`.
+We have a safe zone (`0`) surrounded by 'cliffs' that provide zero reward forever (`-1`).
+The agent (`2`) moves in the safe zone.
 ```
 -1 -1 -1 -1 -1 -1 -1
 -1 0 0 0 0 0 -1
@@ -47,28 +64,38 @@ We implement a simple cliffworld environment, in `env.py`.
 -1 0 0 0 0 0 -1
 -1 -1 -1 -1 -1 -1 -1
 ```
-Grid spaces:
-- `-1`: a cliff state (0 reward, ends episode)
-- `2`: the agent position
-- `0`: a grid space that can has various effects, determined by `transition_defs.py`
 
-#### Environment configurations
+### Configurations
+
+There are a few environments available, with the `--trans n` argument:
 
-There are a few environments available:
+- `0`) Test env, constant reward everywhere (default 0.7)
+- `1`) Normally-distributed rewards, mean reward sloping up linearly, left to right.
+- `2`) Constant reward, mean reward sloping up linearly, left to right.
+- `3`) As 1, but stochastic transitions with 60% probability of the deterministic next-state.
+Other transitions are randomly distributed. Note: the agent is never stochastically thrown over the cliff;
+it must actively take that action to fail.
+
+### Wrappers
 
-- `0`: constant reward everywhere (default 0.7)
-- `1`: Each state has a normal-distribution over rewards, with mean reward sloping up linearly left to right
-- `2`: Each state has a constant reward, with reward sloping up linearly left to right
+With the `--wrapper WRAPPER` argument, interesting features can be added. When either of the two wrappers below is added,
+every state gains one action that, with low likelihood (1% by default), scores zero reward forever.
 
-When the `--mentor avoid_state_act` configuration is used, there is a state that the mentor aims to avoid.
-By default, it adds 1 square where the agent can be teleported to a bad state, with low likelihood.
-The feature is intended to be used with configs that run various experiments, see `experiments/teleporter/configs/`
+- `every_state` - rewards remain intact.
+- `every_state_boost` - reward = 1.0 for the risky action, i.e. incentivising it.
 
-### Experiment
+## Mentors
 
-The agent implementing the finite state case of the QuEUE algorithm is `agents.PessimisticAgent`. It uses the `estimators.QuantileQEstimator` and `estimators.ImmediateRewardEstimator` to make pessimistic updates and estimates of the Q value for a state-action pair.
+Different mentors are available with `--mentor MENTOR`. The most interesting are:
 
-### Demonstration of pessimistic properties
+- `random_safe` - the mentor takes random actions that do not put the agent into a cliff state. This is useful for
+mimicking exploration, without implementing an informed mentor.
+- `avoid_state_act` - when used with the wrappers (see above), the mentor is aware of the risky states and avoids them,
+though a small probability of taking them remains (default 1%). Otherwise as above.
+
+## Experiments
+
+## Demonstration of pessimistic properties
 
 We demonstrate properties of a pessimistic agent in the finite case:
 
@@ -80,15 +107,56 @@ We demonstrate properties of a pessimistic agent in the finite case:
 python experiments/core_experiment/finite_agent_0.py
 ```
 
-![Experimental results](experiments/core_experiment/saved_results/Bigger_agent_pess_trans_2_n_100_steps_200_mentor_random_safe_earlystop_0_init_zero_True.png "Experimental results for a pessimistic agent")
+### Fully stochastic environment
+
+![Experimental results](experiments/saved_results/trans_3_horizon_inf_agent_pess_mentor_avoid_state_act_wrapper_every_state_report_every_n_100_steps_100000_init_zero_True_state_len_7_sampling_strat_random_batch_size_20_update_freq_10_learning_rate_0.5.png "Performance result for pessimistic agent")
+
+We observe that Distributional Q learning algorithms demonstrate the properties of a Pessimistic Agent.
+
+Plotted are the 1st and 2nd quantiles (3rd and 6th percentile). Higher percentiles also demonstrate the property,
+but are omitted for clarity. They are plotted against an unbiased Q learning agent, which simply learns
+the expectation value of Q as normal, and an agent that follows the mentor forever.
 
-### Proving epistemic uncertainty matters
+None of the agents ever step onto a cliff state, including the unbiased Q learning agent. To prove the
+safety result, we consider another environment.
+
+## Proving epistemic uncertainty matters
+
+Using the `--wrapper every_state` setup (see above), we introduce a risky state-action, which the mentor is aware of
+and takes less frequently (i.e. it doesn't completely avoid risk, as it has imperfect knowledge).
+
+We plot the proportion of repeat experiments - over 50k timesteps - where the _agent_ took the risky action even once.
+The x-axis represents mentor risk-taking frequency, e.g.:
+
+`quant_0_001_5_rep` ->
+
+- `quant_0` = the 1st quantile (3%) of the pessimistic agent
+- `001` = the mentor took the risky action with frequency 0.01 (1%) during demonstrations
+- `5_rep` = 5 repeat experiments constituted this datapoint.
 
 ```bash
 python experiments/event_experiment/exp_main.py
 ```
 
-## Function approximators - Deep Q learning
+### Stochastic reward
+![Experimental results](experiments/saved_results/final_trans_1.png "Safety result for pessimistic agent - stochastic R")
+
+We demonstrate that the unbiased Q learner is more eager to take the risky action after it has been demonstrated only rarely.
+A pessimistic agent needs more reassurance before it is willing to take risks on rarely demonstrated manoeuvres.
+
+When the state is continuous, a pessimistic agent should, for example, avoid regions that have not been favoured by
+the mentor. In future work, we will show that this result generalises to function approximators.
+
+### Fully stochastic
+![Experimental results](experiments/saved_results/final_trans_3.png "Safety result for pessimistic agent - fully stochastic")
+
+In the fully stochastic environment, the pessimistic agent starts taking the risky action at a lower
+frequency of risky demonstrations. We observe that the stochastic agent explores the grid more fully, so perhaps when the
+agent is better informed about the whole environment, epistemic uncertainty reduces (due to the way we approximate the
+transition uncertainty).
+
+---
+# ![WIP](https://via.placeholder.com/100x40/FF7B00/FFFFFFF?text=WIP) Function approximators - Deep Q learning
 
 ### Gated linear networks
 
@@ -98,7 +166,8 @@ Using Deepmind implementation:
 
 In `gated_linear_networks`, as this git repo does not have pip install support, yet.
 
-## Testing
+---
+# Tests
 
 ```bash
 python -m unittest discover tests
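As a small aside on the environment section of this README: below is a minimal sketch of building a cliffworld-style grid with a `-1` cliff border, a `0` safe interior and `2` marking the agent. The 7×7 size, default agent position and helper name are assumptions for illustration, not the repository's `env.py`.

```python
import numpy as np


def make_cliffworld(height=7, width=7, agent_pos=(3, 3)):
    """Toy grid: -1 cliff border, 0 safe interior, 2 for the agent."""
    grid = np.full((height, width), -1, dtype=int)  # cliffs everywhere
    grid[1:-1, 1:-1] = 0                            # carve out the safe zone
    grid[agent_pos] = 2                             # place the agent
    return grid


print(make_cliffworld())
```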

dist_q_learning/estimators.py

-2
@@ -56,8 +56,6 @@ def decay_lr(self):
         self.lr *= (1. - self.lr_decay)
 
 
-
-
 class ImmediateRewardEstimator(Estimator):
     """Estimates the next reward given a current state and an action"""
 
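The only functional line visible in the `estimators.py` hunk, `self.lr *= (1. - self.lr_decay)`, is a multiplicative decay: after `n` calls the learning rate is `lr * (1 - lr_decay) ** n`. A self-contained toy illustration follows; the class below and its default values are stand-ins, not the repository's `Estimator`.

```python
class ToyEstimator:
    """Stand-in estimator with a multiplicatively decaying learning rate."""

    def __init__(self, lr=0.5, lr_decay=0.01):
        self.lr = lr
        self.lr_decay = lr_decay

    def decay_lr(self):
        # Same form as the line shown in the diff above: each call shrinks
        # the learning rate by a factor of (1 - lr_decay).
        self.lr *= (1. - self.lr_decay)


est = ToyEstimator()
for _ in range(100):
    est.decay_lr()
print(round(est.lr, 4))  # 0.5 * 0.99 ** 100 ≈ 0.183
```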