In this repository, I reproduce the results of *Prefrontal Cortex as a Meta-Reinforcement Learning System* [1], *Episodic Control as Meta-Reinforcement Learning* [2], and *Been There, Done That: Meta-Learning with Episodic Recall* [3] on variants of the sequential decision-making "two-step" task originally introduced in *Model-based Influences on Humans' Choices and Striatal Prediction Errors* [4]. Below you will find a description of the task along with results, a brief overview of meta-RL and its connection to neuroscience, and an outline of the structure of the code.
Episodic Two-Step Task | Episodic LSTM |
---|---|
*Note: I have a related repository on the "Harlow" visual fixation task in case you are interested :)
**I aim to write a blog post to accompany this repository, so stay tuned!
In recent years, deep reinforcement learning (deep-RL) has been at the forefront of artificial intelligence research, ever since DeepMind's seminal work on DQN [5] showed that a single agent could solve a wide range of Atari games just by looking at the raw pixels, as a human would. However, one major issue disqualified it as a plausible model of human learning: the sample-efficiency problem, which refers "to the amount of data required for a learning system to attain any chosen target level of performance" [6]. In other words, a task that takes a biological brain a matter of minutes to master requires many orders of magnitude more training data for a deep-RL agent. Botvinick et al. (2019) [6] identify two main sources of slowness in deep-RL: the need for incremental parameter adjustment and the reliance on a weak inductive bias. I will go into more detail on each in my blog post. They also note that subsequent research has shown it is possible to train artificial agents in a sample-efficient manner by (1) augmenting the agent with an episodic memory system, so it avoids redundant exploration and leverages prior experience more effectively, and (2) taking a meta-learning approach, training the agent on a series of structurally interrelated tasks to strengthen its inductive bias (narrowing the hypothesis set), which lets it home in on a valid solution much faster [6].
DeepMind have long been preaching the importance of neuroscience and artificial intelligence research working together in what they call a virtuous circle, where each field inspires and drives the other forward [7]. I must admit that they are the reason I joined an MSc program in Computational Cognitive Neuroscience after working in AI for a couple of years. In short, they were indeed able to show that meta-RL, which was drawn from the machine learning literature, can explain a wide range of neuroscientific findings and resolve many of the prevailing quandaries in reward-based learning [1]. They do so by conceptualizing the prefrontal cortex, along with its connected subcortical structures (the basal ganglia and thalamic nuclei), as its own free-standing meta-RL system. Concretely, they show that dopamine-driven synaptic plasticity, which is model-free, gives rise to a second, more efficient, model-based RL algorithm implemented in the activation dynamics of the prefrontal network [1]. In this repository, I reproduce one of their simulations (the two-step task), which showcases the emergence of model-based learning in accord with behavior observed in both humans and rodents [1].
The episodic meta-RL variant was proposed by Ritter et al. (2018) [2, 3] and is partly inspired by evidence that episodic memory retrieval in humans operates through reinstatement, recreating patterns of activity in the neural circuits that support working memory [2]. This yields a new theory in which human decision making can be seen as an interplay between working memory and episodic memory, an interplay that is itself learned through training to maximise rewards on a distribution of tasks. The episodic memory system is implemented as a differentiable neural dictionary [8] that stores task contexts as keys and LSTM cell states as values. This will also be expanded upon in the accompanying blog post.
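To give a rough idea of how this works, below is a minimal sketch of such a dictionary. It is not the repository's `dnd.py`: the class and method names, the fixed-size buffer, and the cosine-similarity/softmax kernel are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


class DND:
    """Minimal differentiable neural dictionary sketch (not the repo's dnd.py).

    Keys   : embeddings of the task context (e.g. the episode's context cue).
    Values : LSTM cell states saved on previous encounters with that context.
    """

    def __init__(self, max_size=1000):
        self.keys, self.values = [], []
        self.max_size = max_size

    def write(self, key, value):
        # Store a (context, cell state) pair; evict the oldest entry when full.
        if len(self.keys) >= self.max_size:
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(key.detach())
        self.values.append(value.detach())

    def read(self, query):
        # Kernel-weighted average of the stored cell states, weighted by the
        # similarity between the query context and the stored keys.
        if not self.keys:
            return torch.zeros_like(query)  # assumes key and value dims match, for brevity
        keys = torch.stack(self.keys)                      # (N, key_dim)
        sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=1)
        weights = F.softmax(sims, dim=0)                   # (N,)
        values = torch.stack(self.values)                  # (N, value_dim)
        return (weights.unsqueeze(1) * values).sum(dim=0)
```

When a familiar context cue is encountered, the dictionary is queried and the retrieved cell state is reinstated into the recurrent network through an extra gate (sketched further below).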
This task has been widely used in the neuroscience literature to tease apart the contributions of the different systems thought to support decision making. The variant I use here was developed to dissociate a model-free system, which caches the values of actions in states, from a model-based system, which learns an internal model of the environment and evaluates the values of actions through look-ahead planning [9]. The purpose here is to see whether the model-free algorithm used to train the network weights (A2C [10] in this case) gives rise to behavior that emulates a model-based strategy.
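To make the task dynamics concrete, here is a rough sketch of the environment. The transition and reward probabilities are illustrative placeholders, and the schedule on which the better second-stage state switches differs across the cited papers; the actual environments used here live in `tasks/two_step.py` and `tasks/ep_two_step.py`.

```python
import numpy as np


class TwoStepSketch:
    """Illustrative two-step task; probabilities are placeholders, not the
    exact values used in the cited papers or in tasks/two_step.py."""

    def __init__(self, common_prob=0.8, reward_probs=(0.9, 0.1), seed=None):
        self.common_prob = common_prob          # P(common transition | first-stage action)
        self.reward_probs = list(reward_probs)  # P(reward | second-stage state)
        self.rng = np.random.default_rng(seed)

    def trial(self, action):
        # Action 0 commonly leads to second-stage state 0 and action 1 to state 1;
        # with probability 1 - common_prob the uncommon transition occurs instead.
        common = self.rng.random() < self.common_prob
        state = action if common else 1 - action
        reward = float(self.rng.random() < self.reward_probs[state])
        return state, reward, common

    def switch(self):
        # Occasionally the better second-stage state flips, so an agent has to
        # keep exploiting the transition structure rather than cached action values.
        self.reward_probs.reverse()
```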
The results below, which this code reproduces, show the stay probability as a function of the previous trial's transition type (common or uncommon) and whether the previous trial was rewarded. The stay probability is simply how often the agent repeats the first-stage action it chose on the previous trial.
The signature of model-based behavior is an interaction between reward and transition type: a higher stay probability after a rewarded common transition or an unrewarded uncommon transition, and a lower one after a rewarded uncommon transition or an unrewarded common transition. This is indeed the pattern we see below. I strongly encourage you to check the referenced papers for a more detailed description of the task and the hyperparameters used in these simulations.
Published Result [1] | My Result |
---|---|
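For concreteness, the analysis behind these plots can be sketched as follows. This is a hypothetical helper, not necessarily what `plotting.py` does: it groups trials by the previous trial's transition type and reward, then measures how often the first-stage action is repeated.

```python
import numpy as np


def stay_probabilities(actions, commons, rewards):
    """Stay probability split by the previous trial's transition type and reward.

    actions : first-stage action on each trial (0 or 1)
    commons : whether each trial used the common transition (bool)
    rewards : whether each trial was rewarded (0 or 1)
    """
    actions = np.asarray(actions)
    commons = np.asarray(commons, dtype=bool)
    rewards = np.asarray(rewards, dtype=bool)

    stayed = actions[1:] == actions[:-1]          # repeated the previous first-stage action
    prev_common, prev_rewarded = commons[:-1], rewards[:-1]

    conditions = {
        "common / rewarded":     prev_common & prev_rewarded,
        "common / unrewarded":   prev_common & ~prev_rewarded,
        "uncommon / rewarded":   ~prev_common & prev_rewarded,
        "uncommon / unrewarded": ~prev_common & ~prev_rewarded,
    }
    return {label: stayed[mask].mean() if mask.any() else np.nan
            for label, mask in conditions.items()}
```

A model-based agent produces high values for the first and last conditions and low values for the middle two, which is the interaction pattern described above.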
The incremental version uses exactly the same setup, model, and hyperparameters as the episodic one, but with the reinstatement gate fixed at 0, so no memories are retrieved.
Published Result [2] | | | My Result | | |
---|---|---|---|---|---|
Incremental Uncued | Incremental Cued | Episodic | Incremental Uncued | Incremental Cued | Episodic |
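To make the role of the reinstatement gate concrete, the episodic LSTM's cell-state update can be sketched like this. It is a simplified reading of [3]; the variable names are mine, and the repository's `ep_lstm_cell.py` may differ in the details.

```python
def ep_lstm_cell_state(f, i, r, c_prev, c_candidate, c_retrieved):
    """Cell-state update of an episodic LSTM (illustrative sketch).

    f, i        : the usual forget and input gates (element-wise, in [0, 1])
    r           : the extra reinstatement gate applied to retrieved memories
    c_prev      : previous cell state
    c_candidate : candidate cell state computed from the current input
    c_retrieved : cell state reinstated from the episodic memory (the DND)

    With r fixed at 0, the retrieved memory is ignored and this reduces to the
    standard LSTM update -- which is exactly the incremental variant above.
    """
    return f * c_prev + i * c_candidate + r * c_retrieved
```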
These are the training trajectories of the episodic and incremental variants (each consisting of 10 runs with different random seeds). On average, the episodic version converges faster and accumulates more reward early on, as it makes better use of prior experience through its long-term memory.
Episodic Training Curve | Incremental Training Curve |
---|---|
Meta-RL-TwoStep-Task
├── LICENSE
├── README.md
├── episodic.py # reproduces results of [2] and [3]
├── vanilla.py # reproduces results of [1]
├── plotting.py # plots extra graphs
├── configs
│   ├── ep_two_step.yaml # configuration file with hyperparameters for episodic.py
│   └── two_step.yaml # configuration file with hyperparameters for vanilla.py
├── tasks
│   ├── ep_two_step.py # episodic two-step task
│   └── two_step.py # vanilla two-step task
└── models
    ├── a2c_lstm.py # advantage actor-critic (a2c) algorithm with working memory
    ├── a2c_dnd_lstm.py # a2c algorithm with working memory and long-term (episodic) memory
    ├── dnd.py # episodic memory as a differentiable neural dictionary
    ├── ep_lstm.py # episodic lstm module wrapper
    └── ep_lstm_cell.py # episodic lstm cell with extra reinstatement gate
1. Wang, J. X., Kurth-Nelson, Z., Kumaran, D., Tirumala, D., Soyer, H., Leibo, J. Z., Hassabis, D., & Botvinick, M. (2018). Prefrontal Cortex as a Meta-Reinforcement Learning System. Nature Neuroscience, 21, 860–868.
2. Ritter, S., Wang, J. X., Kurth-Nelson, Z., & Botvinick, M. (2018). Episodic Control as Meta-Reinforcement Learning. bioRxiv.
3. Ritter, S., Wang, J. X., Kurth-Nelson, Z., Jayakumar, S. M., Blundell, C., Pascanu, R., & Botvinick, M. (2018). Been There, Done That: Meta-Learning with Episodic Recall. ICML, 4351–4360.
4. Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P., & Dolan, R. J. (2011). Model-based Influences on Humans' Choices and Striatal Prediction Errors. Neuron, 69(6), 1204–1215. https://doi.org/10.1016/j.neuron.2011.02.027
5. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level Control through Deep Reinforcement Learning. Nature, 518, 529–533. https://doi.org/10.1038/nature14236
6. Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., & Hassabis, D. (2019). Reinforcement Learning, Fast and Slow. Trends in Cognitive Sciences, 23. https://doi.org/10.1016/j.tics.2019.02.006
7. Hassabis, D., Kumaran, D., Summerfield, C., & Botvinick, M. (2017). Neuroscience-Inspired Artificial Intelligence. Neuron, 95, 245–258. https://doi.org/10.1016/j.neuron.2017.06.011
8. Pritzel, A., Uria, B., Srinivasan, S., Puigdomènech, A., Vinyals, O., Hassabis, D., Wierstra, D., & Blundell, C. (2017). Neural Episodic Control. ICML.
9. Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., & Botvinick, M. (2016). Learning to Reinforcement Learn. CoRR, abs/1611.05763.
10. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, 1928–1937.
I would like to give a shout-out to these repositories and blog posts; they were of great help to me when implementing this project. Make sure to check them out!