A set of cognitive neuroscience inspired agents and learning algorithms.
These consist of implementations of the canonical Q-Learning, Actor-Critic, Value-Iteration, and Successor Representation algorithms, among others.
The algorithms included here are all tabular. Tabular algorithms work with observations that are integer representations of the agent's state (e.g., which cell the agent occupies in a grid world). This corresponds to the `index` observation type; see the sketch below for an illustration.
If you are interested in algorithms that use function approximation (deep RL), see here.
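As a rough illustration of what index observations mean in practice, here is a minimal sketch; the state/action sizes and variable names below are assumptions for the example, not this library's exact interface.

```python
import numpy as np

# Hypothetical tabular setup: states are plain integers (index observations),
# so value estimates can live in a simple NumPy array.
num_states = 25    # e.g., a 5x5 grid world
num_actions = 4    # e.g., up, down, left, right

Q = np.zeros((num_states, num_actions))   # Q(s, a) table indexed by integers

state = 12                                 # index observation: "the agent is in cell 12"
action = int(np.argmax(Q[state]))          # greedy lookup is just array indexing
```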
Algorithm | Function(s) | Update Rule(s) | Reference | Description | Code Link |
---|---|---|---|---|---|
TD-Q | Q(s, a) | one-step temporal difference | Watkins & Dayan, 1992 | A basic q-learning algorithm | Code |
TD-SR | ψ(s, a), ω(s) | one-step temporal difference | Dayan, 1993 | A basic successor representation algorithm | Code |
TD-AC | V(s), π(a \| s) | one-step temporal difference | Sutton & Barto, 2018 | A basic actor-critic algorithm | Code |
Dyna-Q | Q(s, a) | one-step temporal difference, replay-based dyna | Sutton, 1990 | A dyna q-learning algorithm | Code |
Dyna-SR | ψ(s, a), ω(s) | one-step temporal difference, replay-based dyna | Russek et al., 2017 | A dyna successor representation algorithm | Code |
Dyna-AC | V(s), π(a \| s) | one-step temporal difference, replay-based dyna | Sutton, 1990 | A dyna actor-critic algorithm | Code |
MBV | Q(s, a), T(s' \| s, a) | value-iteration | Sutton & Barto, 2018 | A basic value iteration algorithm | Code |
SRMB | Q(s, a), T(s' \| s, a), ψ(s, a), ω(s) | value-iteration, one-step temporal difference | Momennejad et al., 2017 | A hybrid of value iteration and temporal-difference successor algorithms | Code |
QET | Q(s, a), e(s, a) | eligibility trace | Sutton & Barto, 2018 | A q-learning algorithm using online eligibility traces | Code |
DistQ | Q(s, a, c) | one-step temporal difference | Dabney et al., 2020 | A distributional q-learning algorithm which uses separate learning rates for optimistic and pessimistic units | Code |
QEC | Q(s, a) | episodic control | Lengyel & Dayan, 2007 | An episodic control algorithm that uses return targets from monte-carlo rollouts | Code |
QMC | Q(s, a) | monte-carlo | Sutton & Barto, 2018 | A q-learning algorithm that uses return targets from monte-carlo rollouts | Code |
SARSA | Q(s, a) | one-step temporal difference | Rummery & Niranjan, 1994 | An on-policy sarsa algorithm | Code |
MoodQ | Q(s, a), M | one-step temporal difference | Eldar et al., 2016 | A q-learning algorithm that uses a mood parameter to modulate learning rates | Code |
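To make the one-step temporal-difference update rules in the table concrete, here is a hedged sketch of a TD-Q update (Watkins & Dayan, 1992) and a TD-SR update (Dayan, 1993). The function names and array layouts are assumptions for illustration, not the library's exact implementation.

```python
import numpy as np

def td_q_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """One-step TD update for a tabular Q(s, a)."""
    target = r + gamma * np.max(Q[s_next])       # bootstrapped return estimate
    Q[s, a] += lr * (target - Q[s, a])           # move Q(s, a) toward the target
    return Q

def td_sr_update(psi, w, s, a, r, s_next, a_next, lr=0.1, gamma=0.99):
    """One-step TD update for the successor representation psi(s, a)
    and the state reward weights w(s)."""
    onehot = np.zeros(psi.shape[-1])
    onehot[s] = 1.0
    # Successor features accumulate discounted expected future state occupancies.
    psi[s, a] += lr * (onehot + gamma * psi[s_next, a_next] - psi[s, a])
    # Reward weights map states to immediate rewards.
    w[s] += lr * (r - w[s])
    return psi, w
```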
Below is a list of the common hyperparameters shared by all algorithms and agent types. The typical value ranges provided are rough guidelines for generally appropriate learning behavior; depending on the specific algorithm or task, other values may be preferable.
- `lr` - Learning rate of the algorithm. Typical value range: `0` - `0.1`.
- `gamma` - Discount factor for bootstrapping. Typical value range: `0.5` - `0.99`.
- `poltype` - Policy type. Can be either `softmax`, to sample actions proportionally to action value estimates, or `egreedy`, to sample either the most valuable action or a random action stochastically.
- `beta` - The temperature parameter used with the `softmax` poltype. Typical value range: `1` - `1000`.
- `epsilon` - The probability of acting randomly when using the `egreedy` poltype. Typical value range: `0.1` - `0.5`.
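To make `poltype`, `beta`, and `epsilon` concrete, here is a minimal sketch of how the two policy types might sample actions from a vector of action-value estimates. This is an illustrative assumption, not necessarily the library's exact implementation.

```python
import numpy as np

rng = np.random.default_rng()

def sample_action(q_values, poltype="softmax", beta=10.0, epsilon=0.1):
    """Sample an action index from a 1D array of action-value estimates."""
    if poltype == "softmax":
        # Higher beta -> sharper (more greedy) distribution over actions.
        logits = beta * (q_values - q_values.max())   # subtract max for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(rng.choice(len(q_values), p=probs))
    elif poltype == "egreedy":
        # With probability epsilon act randomly, otherwise act greedily.
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))
    raise ValueError(f"Unknown poltype: {poltype}")
```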