# Algorithm Toolkit

A set of cognitive-neuroscience-inspired agents and learning algorithms.

These include implementations of the canonical Q-Learning, Actor-Critic, Value Iteration, and Successor Representation algorithms, among others.

The algorithms included here are all tabular. Tabular algorithms work with observations that are integer representations of the agent's state (e.g., which cell the agent occupies in a grid world). This corresponds to the `index` observation type.
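
For instance, a tabular value table can be stored as an array indexed directly by that integer observation. The snippet below is a minimal sketch under that assumption, not the toolkit's actual data structures:

```python
import numpy as np

# Hypothetical sketch: a tabular Q-table for a 5x5 grid world.
# States are the integers 0..24 (the "index" observation type); actions are 0..3.
num_states, num_actions = 25, 4
Q = np.zeros((num_states, num_actions))

obs = 12       # integer index of the cell the agent currently occupies
print(Q[obs])  # action values for that state, looked up directly by the observation
```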

If you are interested in algorithms that use function approximation (deep RL), see here.

## Included algorithms

| Algorithm | Function(s) | Update Rule(s) | Reference | Description | Code Link |
| --- | --- | --- | --- | --- | --- |
| TD-Q | Q(s, a) | one-step temporal difference | Watkins & Dayan, 1992 | A basic Q-learning algorithm | Code |
| TD-SR | ψ(s, a), ω(s) | one-step temporal difference | Dayan, 1993 | A basic successor representation algorithm | Code |
| TD-AC | V(s), π(a \| s) | one-step temporal difference | Sutton & Barto, 2018 | A basic actor-critic algorithm | Code |
| Dyna-Q | Q(s, a) | one-step temporal difference, replay-based dyna | Sutton, 1990 | A Dyna Q-learning algorithm | Code |
| Dyna-SR | ψ(s, a), ω(s) | one-step temporal difference, replay-based dyna | Russek et al., 2017 | A Dyna successor representation algorithm | Code |
| Dyna-AC | V(s), π(a \| s) | one-step temporal difference, replay-based dyna | Sutton, 1990 | A Dyna actor-critic algorithm | Code |
| MBV | Q(s, a), T(s' \| s, a) | value iteration | Sutton & Barto, 2018 | A basic value iteration algorithm | Code |
| SRMB | Q(s, a), T(s' \| s, a), ψ(s, a), ω(s) | value iteration, one-step temporal difference | Momennejad et al., 2017 | A hybrid of value iteration and temporal-difference successor algorithms | Code |
| QET | Q(s, a), e(s, a) | eligibility trace | Sutton & Barto, 2018 | A Q-learning algorithm using online eligibility traces | Code |
| DistQ | Q(s, a, c) | one-step temporal difference | Dabney et al., 2020 | A distributional Q-learning algorithm that uses separate learning rates for optimistic and pessimistic units | Code |
| QEC | Q(s, a) | episodic control | Lengyel & Dayan, 2007 | An episodic control algorithm that uses return targets from Monte Carlo rollouts | Code |
| QMC | Q(s, a) | Monte Carlo | Sutton & Barto, 2018 | A Q-learning algorithm that uses return targets from Monte Carlo rollouts | Code |
| SARSA | Q(s, a) | one-step temporal difference | Rummery & Niranjan, 1994 | An on-policy SARSA algorithm | Code |
| MoodQ | Q(s, a), M | one-step temporal difference | Eldar et al., 2016 | A Q-learning algorithm that uses a mood parameter to modulate learning rates | Code |
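
To make the update rules above concrete, here is a minimal sketch of the one-step temporal-difference rule used by TD-Q (Watkins & Dayan, 1992). The function name and signature are illustrative assumptions, not the toolkit's API:

```python
import numpy as np

def td_q_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """Illustrative one-step TD Q-learning update:
    Q(s, a) <- Q(s, a) + lr * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += lr * td_error
    return td_error
```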

## Algorithm hyperparameters

Below is a list of the common hyperparameters shared by all algorithms and agent types. The typical value ranges are rough guidelines for generally appropriate learning behavior; depending on the specific algorithm or task, other values may be more desirable.

- `lr` - Learning rate of the algorithm. Typical value range: 0 - 0.1.
- `gamma` - Discount factor used for bootstrapping. Typical value range: 0.5 - 0.99.
- `poltype` - Policy type. Either `softmax`, which samples actions with probabilities given by a softmax over action-value estimates, or `egreedy`, which takes the most valuable action by default and a random action with some probability (see the sketch after this list).
- `beta` - The temperature parameter used with the `softmax` poltype. Typical value range: 1 - 1000.
- `epsilon` - The probability of acting randomly with the `egreedy` poltype. Typical value range: 0.1 - 0.5.
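
As a rough illustration of how `poltype`, `beta`, and `epsilon` interact, here is a minimal sketch of softmax and epsilon-greedy action selection. The function and variable names are hypothetical; the toolkit's own implementation may differ in details:

```python
import numpy as np

rng = np.random.default_rng()

def sample_action(q_values, poltype="softmax", beta=10.0, epsilon=0.1):
    """Sample an action index from a vector of tabular action-value estimates."""
    q_values = np.asarray(q_values, dtype=float)
    if poltype == "softmax":
        # Larger beta -> sharper (more greedy) distribution over actions.
        logits = beta * (q_values - q_values.max())  # subtract max for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(rng.choice(len(q_values), p=probs))
    else:  # "egreedy"
        # With probability epsilon act randomly; otherwise take the greedy action.
        if rng.random() < epsilon:
            return int(rng.integers(len(q_values)))
        return int(np.argmax(q_values))
```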