Classical RL algorithms can only use the reward signal as information about how to improve, which is often very inefficient. However, the environment often provides more informative feedback on what was done wrong, for example explicit natural language feedback from a human. Exploiting this richer feedback can greatly speed up training.
The experiment is run with experiments/train_single_step_dreamer.py. The number of runs with different random seeds, the log directory, and whether to use feedback observations can be changed in this file. The experiments/logs/make_sequence_guessing_figure.py script creates the training progress figures in PDF format, producing one PDF file per sequence length range: for example, one shows the success rate for guessing sequences of 1 to 15 bits, and so on.
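The relevant settings live directly in the training script. A purely hypothetical sketch of what that configuration block might look like (the actual variable names in the file may differ):

```python
# Hypothetical configuration values near the top of
# experiments/train_single_step_dreamer.py; the real names may differ.
NUM_SEEDS = 5                     # number of runs with different random seeds
LOG_DIR = "experiments/logs"      # directory where training progress is logged
USE_FEEDBACK_OBSERVATIONS = True  # whether observations include feedback
```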
experiments/train_open_loop_dreamer contains the script that trains the open-loop Dreamer algorithm on the Pendulum-v0 environment. The results can be visualized with the experiments/make_open_loop_dreamer_figure.py script.
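"Open-loop" here means that after the initial observation, the learned model predicts forward on its own without receiving further observations. A minimal illustrative sketch, where the function and argument names are assumptions rather than the repo's actual API:

```python
def open_loop_rollout(dynamics, decoder, initial_latent, actions):
    """Roll a learned dynamics model forward without new observations.

    dynamics: maps (latent, action) -> next latent state
    decoder:  maps latent -> predicted observation
    """
    latent = initial_latent
    predictions = []
    for action in actions:
        latent = dynamics(latent, action)    # imagine the next latent state
        predictions.append(decoder(latent))  # decode a predicted observation
    return predictions
```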
This part is not yet finished. The environment code is in learning_from_feedback/envs/simple_feedback_reacher. The script experiments/simple_feedback_reacher/train_single_step_dreamer.py trains a simplified Dreamer algorithm on this environment, while experiments/simple_feedback_reacher/train_state_dreamer.py trains a more complete Dreamer agent.
Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
(One of the first papers in Hierarchical RL)
Feudal Networks for Hierarchical Reinforcement Learning
Hierarchical Skills for Efficient Exploration
Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation
Dream to Control: Learning Behaviors by Latent Imagination
(Introduces the Dreamer algorithm)
Mastering Atari with Discrete World Models
(Improved Dreamer algorithm for discrete control)
Learning Continuous Control Policies by Stochastic Value Gradients
End-to-End Differentiable Physics for Learning and Control
Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments
(Early paper by Schmidhuber on using learned environment models to compute policy gradients)
The environment model still takes a very long time to train. It may be necessary to use a specialized network architecture instead of plain fully connected layers.
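One possible direction, stated purely as an assumption about what "specialized" could mean here, is to replace the fully connected layers with an architecture that exploits the sequence structure of the observations, e.g. 1D convolutions. A sketch in PyTorch:

```python
import torch
import torch.nn as nn

class ConvDynamicsModel(nn.Module):
    """Illustrative dynamics model using 1D convolutions instead of an MLP."""

    def __init__(self, obs_channels, action_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(obs_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden + action_dim, hidden)

    def forward(self, obs_seq, action):
        # obs_seq: (batch, channels, length); action: (batch, action_dim)
        features = self.encoder(obs_seq).mean(dim=-1)  # pool over the sequence
        return self.head(torch.cat([features, action], dim=-1))
```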
The instruction might be "pick up the bottle". In case of an unsuccessful attempt, the feedback could be "The bottle is the green object on the right". The idea is that it is easier for the agent to learn what a green object is than to learn what a bottle is.
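To make this concrete, here is a minimal sketch of how such feedback could be attached to observations; it is an illustration only, and the actual environment interface in this repo may differ:

```python
class FeedbackWrapper:
    """Attach natural language feedback to observations after failed attempts."""

    def __init__(self, env, feedback_fn):
        self.env = env
        self.feedback_fn = feedback_fn  # maps (obs, action) -> feedback string

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # On failure, include a hint such as
        # "The bottle is the green object on the right".
        feedback = "" if reward > 0 else self.feedback_fn(obs, action)
        return {"obs": obs, "feedback": feedback}, reward, done, info
```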
The Dreamer algorithm has to imagine at least T steps into the future when rewards are delayed by up to T steps. A hierarchical RL algorithm might be able to improve a policy with rollouts that cover only a fraction of those T steps.
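A toy calculation shows why the imagination horizon matters: if the reward arrives only after T steps, an imagined rollout shorter than T carries no learning signal at all.

```python
T = 10  # reward is delayed by T steps

def imagined_return(horizon, delay=T, reward=1.0, discount=0.99):
    """Discounted return visible to an imagined rollout of a given horizon."""
    # The reward appears only at step `delay`; shorter rollouts never see it.
    return discount ** delay * reward if horizon >= delay else 0.0

print(imagined_return(horizon=5))   # 0.0    -> no gradient for the policy
print(imagined_return(horizon=15))  # ~0.904 -> the reward reaches the rollout
```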