Commit
improve Reward Corruption description
tom4everitt committed Jul 3, 2017
1 parent a26eb05 commit 23e2c4e
Showing 1 changed file with 14 additions and 5 deletions.
19 changes: 14 additions & 5 deletions demo.html
@@ -303,14 +303,23 @@ <h3 id='setup_label'>Setup: </h3>
<span class="md" id="reward_corruption_exp" style="display:none">
# Reward Corruption

This demo shows the experiments for the paper <a href=https://arxiv.org/abs/1705.08417 target=_blank>Reinforcement Learning with a Corrupted Reward Channel</a>. We test tabular agents (Q-learning, SARSA, Softmax Q-learning, and Quantilising) in a standard gridworld with 4 dispensers, with one addition: a blue tile with a corrupt reward - its observed reward is high, but its true reward is low. Can the agents avoid getting stuck on the corrupt blue tile?
This demo shows the experiments for the paper <a href=https://arxiv.org/abs/1705.08417 target=_blank>Reinforcement Learning with a Corrupted Reward Channel</a>, summarised in a blog post [here](http://www.tomeveritt.se/paper/2017/05/29/reinforcement-learning-with-corrupted-reward-channel.html).

In this setup, the (observed) reward = true reward + corrupt reward. All the rewards are between 0 and 1. The observed rewards are as follows: blue tile 1, yellow dispenser tiles 0.9, empty tiles 0.1, wall 0. The true rewards are the same, except 0 for the blue tile.
The setup is a standard gridworld with 4 dispensers, but with one addition: a blue tile with a corrupt reward. The blue tile represents bad ways of getting reward, such as wireheading or abusing a misspecified reward function. It has a high observed reward but a low true reward.

We recommend running the tabular agents for at least 100,000 cycles. Besides the usual agent parameters, you can set the temperature \\(\beta\\) for the softmax agent and the cutoff \\(\delta\\) for the quantilising agent.
Which agents get stuck on the corrupt blue tile? In the [paper](https://arxiv.org/abs/1705.08417), we show that Quantilisers are better at avoiding the corrupt reward than Q-learning and SARSA.

## Optional: running experiments in the console
If you'd like to reproduce the experiments in the paper or play around with the results, you can run the experiments in the console as follows:
*Try running the different agents for 1000 time steps, and observe the difference in behaviour and true reward.*

### Details
In the plots, (observed) reward = true reward + corrupt reward.

The observed rewards are as follows: blue tile 1, yellow dispenser tiles 0.9, empty tiles 0.1, wall 0. The true rewards are the same as the observed rewards, except for the blue tile, which has a true reward of 0.
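
To make these numbers concrete, here is a minimal JavaScript sketch of the reward structure described above; the tile names and object layout are illustrative placeholders, not the demo's internal code:

```js
// Illustrative sketch of the reward structure described above;
// the tile names are placeholders, not the demo's internal identifiers.
const observedReward = {
  blue: 1.0,       // corrupt tile: looks best to the agent
  dispenser: 0.9,  // yellow dispenser tiles
  empty: 0.1,
  wall: 0.0
};

// True rewards agree with observed rewards everywhere except the blue tile.
const trueReward = Object.assign({}, observedReward, { blue: 0.0 });

// Since observed reward = true reward + corrupt reward, the corruption is the gap.
for (const tile of Object.keys(observedReward)) {
  const corrupt = observedReward[tile] - trueReward[tile];
  console.log(tile + ': observed=' + observedReward[tile] +
              ', true=' + trueReward[tile] + ', corrupt=' + corrupt);
}
```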

Besides the usual agent parameters, you can set the temperature \\(\beta\\) for the softmax agent and the cutoff \\(\delta\\) for the quantilising agent.
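
For intuition, here is a generic sketch of softmax (Boltzmann) action selection with a temperature \\(\beta\\). This is one common formulation, not necessarily the exact convention used by the demo's softmax agent, and the quantiliser's cutoff \\(\delta\\) enters differently (see the paper):

```js
// Generic softmax (Boltzmann) action selection with temperature beta.
// Higher beta -> choices closer to uniformly random; lower beta -> closer to greedy.
// Illustration only; the demo's agent may use a different convention.
function softmaxAction(qValues, beta) {
  const maxQ = Math.max(...qValues);  // subtract max for numerical stability
  const prefs = qValues.map(q => Math.exp((q - maxQ) / beta));
  const total = prefs.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let a = 0; a < prefs.length; a++) {
    r -= prefs[a];
    if (r <= 0) return a;
  }
  return prefs.length - 1;
}

// Example: with beta = 0.1 the highest-valued action dominates;
// with beta = 10 all actions are chosen almost equally often.
console.log(softmaxAction([0.1, 0.9, 0.5], 0.1));
```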

### Running experiments in the console
To reproduce the experiments in the paper, run them in the browser console as follows:

<code>for(let i=0; i<20; i++) { demo.experiment([configs.reward_corruption_experiments], {download: true, agent: {type:Quantiliser, cycles:1000000}}) }</code>
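
The same pattern should work for the other tabular agents by swapping the agent type; the class name below is an assumption about how the demo exposes its agents, so check the demo source for the exact identifiers:

```js
// Assumed pattern for the other agents (e.g. Q-learning); the class name
// QLearn is a guess -- check the demo source for the exact identifier.
for (let i = 0; i < 20; i++) {
  demo.experiment([configs.reward_corruption_experiments],
    {download: true, agent: {type: QLearn, cycles: 1000000}});
}
```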
