# Algorithm Toolkit

A set of cognitive-neuroscience-inspired agents and learning algorithms.

These include implementations of the canonical Q-Learning, Actor-Critic, Value Iteration, and Successor Representation algorithms, among others.

The algorithms included here are all tabular. Tabular algorithms work with observations that are integer representations of the agent's state (e.g., which cell the agent occupies in a grid world). This corresponds to the `index` observation type.
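
For instance, a tabular value table can be stored as an array indexed directly by that integer observation. The snippet below is a minimal sketch under that assumption, not the toolkit's actual data structures:

```python
import numpy as np

# Hypothetical sketch: a tabular Q-table for a 5x5 grid world.
# States are the integers 0..24 (the "index" observation type); actions are 0..3.
num_states, num_actions = 25, 4
Q = np.zeros((num_states, num_actions))

obs = 12       # integer index of the cell the agent currently occupies
print(Q[obs])  # action values for that state, looked up directly by the observation
```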

If you are interested in algorithms that use function approximation (deep RL), see here.

## Included algorithms

| Algorithm | Function(s) | Update Rule(s) | Reference | Description | Code Link |
| --- | --- | --- | --- | --- | --- |
| TD-Q | Q(s, a) | one-step temporal difference | Watkins & Dayan, 1992 | A basic Q-learning algorithm | Code |
| TD-SR | ψ(s, a), ω(s) | one-step temporal difference | Dayan, 1993 | A basic successor representation algorithm | Code |
| TD-AC | V(s), π(a \| s) | one-step temporal difference | Sutton & Barto, 2018 | A basic actor-critic algorithm | Code |
| Dyna-Q | Q(s, a) | one-step temporal difference, replay-based dyna | Sutton, 1990 | A Dyna Q-learning algorithm | Code |
| Dyna-SR | ψ(s, a), ω(s) | one-step temporal difference, replay-based dyna | Russek et al., 2017 | A Dyna successor representation algorithm | Code |
| Dyna-AC | V(s), π(a \| s) | one-step temporal difference, replay-based dyna | Sutton, 1990 | A Dyna actor-critic algorithm | Code |
| MBV | Q(s, a), T(s' \| s, a) | value iteration | Sutton & Barto, 2018 | A basic value iteration algorithm | Code |
| SRMB | Q(s, a), T(s' \| s, a), ψ(s, a), ω(s) | value iteration, one-step temporal difference | Momennejad et al., 2017 | A hybrid of value iteration and temporal-difference successor algorithms | Code |
| QET | Q(s, a), e(s, a) | eligibility trace | Sutton & Barto, 2018 | A Q-learning algorithm using online eligibility traces | Code |
| DistQ | Q(s, a, c) | one-step temporal difference | Dabney et al., 2020 | A distributional Q-learning algorithm that uses separate learning rates for optimistic and pessimistic units | Code |
| QEC | Q(s, a) | episodic control | Lengyel & Dayan, 2007 | An episodic control algorithm that uses return targets from Monte Carlo rollouts | Code |
| QMC | Q(s, a) | Monte Carlo | Sutton & Barto, 2018 | A Q-learning algorithm that uses return targets from Monte Carlo rollouts | Code |
| SARSA | Q(s, a) | one-step temporal difference | Rummery & Niranjan, 1994 | An on-policy SARSA algorithm | Code |
| MoodQ | Q(s, a), M | one-step temporal difference | Eldar et al., 2016 | A Q-learning algorithm that uses a mood parameter to modulate learning rates | Code |
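
To make the update rules above concrete, here is a minimal sketch of the one-step temporal-difference rule used by TD-Q (Watkins & Dayan, 1992). The function name and signature are illustrative assumptions, not the toolkit's API:

```python
import numpy as np

def td_q_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """Illustrative one-step TD Q-learning update:
    Q(s, a) <- Q(s, a) + lr * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += lr * td_error
    return td_error
```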

## Algorithm hyperparameters

Below is a list of the common hyperparameters shared by all algorithms and agent types. The typical value ranges are rough guidelines for generally appropriate learning behavior; depending on the specific algorithm or task, other values may be more desirable.

- `lr` - Learning rate of the algorithm. Typical value range: 0 - 0.1.
- `gamma` - Discount factor used for bootstrapping. Typical value range: 0.5 - 0.99.
- `poltype` - Policy type. Either `softmax`, which samples actions with probabilities given by a softmax over action-value estimates, or `egreedy`, which takes the most valuable action by default and a random action with some probability (see the sketch after this list).
- `beta` - The temperature parameter used with the `softmax` poltype. Typical value range: 1 - 1000.
- `epsilon` - The probability of acting randomly with the `egreedy` poltype. Typical value range: 0.1 - 0.5.
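
As a rough illustration of how `poltype`, `beta`, and `epsilon` interact, here is a minimal sketch of softmax and epsilon-greedy action selection. The function and variable names are hypothetical; the toolkit's own implementation may differ in details:

```python
import numpy as np

rng = np.random.default_rng()

def sample_action(q_values, poltype="softmax", beta=10.0, epsilon=0.1):
    """Sample an action index from a vector of tabular action-value estimates."""
    q_values = np.asarray(q_values, dtype=float)
    if poltype == "softmax":
        # Larger beta -> sharper (more greedy) distribution over actions.
        logits = beta * (q_values - q_values.max())  # subtract max for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(rng.choice(len(q_values), p=probs))
    else:  # "egreedy"
        # With probability epsilon act randomly; otherwise take the greedy action.
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))
```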