Minimal Policy Search is a Matlab toolbox providing implementations of RL algorithms.
The repository originally focused on policy search (hence the name), especially REPS and policy gradient methods, but it now contains a wide variety of algorithms (PPO, TRPO, DQN, DPG, FQI, ...).
It also includes multi-objective RL algorithms, benchmark MDPs and optimization problems, and common policy classes.
Some algorithms require the Optimization Toolbox.
Some utility functions are imported from File Exchange (original authors are always acknowledged).
Launch INSTALL to add all the folders to the Matlab path.
Algs
All the algorithms and solvers are located in this folder, as well as some scripts to run them. By using scripts, it is possible to interrupt and resume the learning process without losing any data.
The only parameters that you might want to change are the learning rates and the number of rollouts per iteration.
Also, a history of the results is usually kept. For example, J_history stores the expected return at each iteration.
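For instance, assuming the MDP and the policy have already been initialized by a setup script (see the Experiments section below), a typical interrupt-and-resume workflow might look like this:
RUN_PG % start learning, then interrupt with CTRL+C
plot(J_history) % inspect the expected return collected so far
RUN_PG % resume learning without losing the previous iterations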
BenchmarkOpt
Here are some test functions for optimization.
Experiments
This folder contains some scripts to set up experiments. Each script initializes the MDP, the policies, and the number of samples and episodes per learning iteration. After running a setup script, just run an algorithm script to start the learning.
SettingMC % mountain car setup
RUN_PG % run policy gradient (terminate by CTRL+C)
plot(J_history) % plot expected return
show_simulation(mdp,policy.makeDeterministic,1000,0.1) % visualize learned policy (see below)
Notice that, in the case of episodic (black box) RL, these scripts define both the low level policy (the one used by the agent) and the high level policy (the sampling distribution used to draw the low level policy parameters). In this setting, the exploration noise is given by the high level policy, while the low level policy is deterministic (e.g., the covariance of a Gaussian is zeroed and the high level policy only draws its mean).
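For illustration (using only calls that appear later in this README, and assuming the variable names policy and policy_high from the setup scripts), drawing a concrete low level policy from the high level distribution looks roughly like this:
theta = policy_high.drawAction(1) % exploration happens here, at the parameter level
low_policy = policy.update(theta) % deterministic low level policy with the sampled parameters
show_simulation(mdp,low_policy,100,0.001) % roll out and render one episode with these parameters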
Library
The folder contains some policies, generic basis functions, and functions for sampling and evaluation. The most important functions are the following (a short usage sketch follows the list):
- collect_samples : stores low level tuples (s,a,r,s') into a struct,
- collect_episodes : collects high level data, i.e., pairs (return, policy),
- evaluate_policies : evaluates low level policies on several episodes,
- evaluate_policies_high : evaluates high level policies on several episodes.
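A minimal usage sketch, assuming the evaluation functions follow the same (mdp, episodes, steps, policy) argument pattern as collect_samples (check the function headers for the exact signatures):
data = collect_samples(mdp,episodes,steps,policy) % low level tuples (s,a,r,s') for each episode
J = evaluate_policies(mdp,episodes,steps,policy) % assumed signature: average return of the low level policy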
Policies are modeled as objects. Their most important method is drawAction, but depending on the class some additional properties might be mandatory.
IMPORTANT! All data is stored in COLUMNS, e.g., states are matrices S x N, where S is the size of one state and N is the number of states. Similarly, actions are matrices A x N and features are matrices F x N.
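For instance, under this convention a batch of 100 two-dimensional states is a 2 x 100 matrix. For a state-dependent policy class, drawAction should then accept such a batch and return the actions in columns as well (this is an assumption, check the specific policy class):
S = 2; N = 100;
states = rand(S,N) % 100 states of size 2, stored as columns
actions = policy.drawAction(states) % assumed to return an A x N matrix of actions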
MDPs
Each MDP is modeled as an object (MDP.m) and requires some properties (dimension of state and action spaces, bounds, etc.) and methods (for state transitions and plots).
Each MDP also has a default discount factor gamma, which usually works well with most algorithms, but feel free to change it if necessary.
The most important function is [s',r,d] = simulator(s,a), which defines the transition function. The function returns d = true if the next state s' is terminal (episode ended). Usually, the reward r(s,a,s') depends on s and a, and additionally on s' if the next state is terminal. For example, the cart-pole swing-up returns a reward depending on the current position of the pole, plus a penalty if the cart hits the walls (terminal next state).
For MDPs sharing the same environment (e.g., mountain car with continuous or discrete actions, cart-pole with or without swing-up, ...), there are common Environment (Env) classes. These classes define common variables and functions (transition, plotting, ...), while each subclass defines task-specific functions (reward, action parsing, terminal conditions, ...). Finally, there are also subclasses for some special extensions to MDPs, i.e., Contextual MDPs (CMDP.m), Multi-objective MDPs (MOMDP.m), and Average-reward MDPs (MDP_avg.m).
IMPORTANT! To allow parallel execution of multiple episodes, all MDP functions (except the ones for plotting) need to support vectorized operations, i.e., they need to deal with states and actions represented as S x N and A x N matrices, respectively.
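As a standalone, illustrative sketch (not an actual class of the toolbox), a vectorized transition function for a toy 1-D random walk could look like the following, with states, actions, rewards and terminal flags all handled column-wise:
function [nexts, r, d] = toy_simulator(s, a)
% s : S x N matrix of states (here S = 1)
% a : A x N matrix of actions (here A = 1)
nexts = s + a; % deterministic 1-D walk
r = -abs(nexts); % reward: negative distance from the origin
d = abs(nexts) > 10; % 1 x N logical flags: the episode ends outside [-10,10]
end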
MO_Library
This folder contains functions used in the multi-objective framework, e.g., hypervolume estimators and Pareto-optimality filters.
IMPORTANT! All frontiers are stored in ROWS, i.e., they are matrices N x R, where N is the number of points and R is the number of objectives.
Utilities
Utility functions used for matrix operations, plotting and sampling are stored in this folder.
Below is a list, with examples, of all the ways to visualize your data or render an environment. Please note that not all MDPs support rendering.
Real-time data plotting
During the learning, it is possible to plot a desired variable in real time (e.g., the expected return J) by using updateplot.
updateplot('Return',iter,J,1) % plot the value J of the variable 'Return' at iteration iter
Confidence interval plots from multiple trials
If you are interested in evaluating an algorithm over several trials, you can use the function shadedErrorBar. For a complete example, please refer to make_stdplot.m.
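A minimal sketch, assuming J_trials is a (trials x iterations) matrix collecting the expected return of each independent run (see make_stdplot.m for the complete version used by the toolbox):
iters = 1:size(J_trials,2);
shadedErrorBar(iters, mean(J_trials,1), std(J_trials,[],1)) % mean curve with a one-standard-deviation shaded band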
Real-time rendering
Launch mdp.showplot to initialize the rendering: the agent-environment interaction will then be shown during the learning. To stop plotting, use mdp.closeplot.
IMPORTANT! This is possible only if you are learning using one episode per iteration.
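For example, assuming learning is performed with one episode per iteration:
mdp.showplot % initialize the rendering
RUN_PG % the agent-environment interaction is rendered while learning
mdp.closeplot % stop the rendering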
Offline rendering
- For step-based algorithms, you can directly use the built-in plotting function of the MDPs.
As collect_samples returns a low level dataset of the episodes, you just have to call mdp.plotepisode:
data = collect_samples(mdp,episodes,steps,policy) % collect the low level dataset
mdp.plotepisode(data(1),0.001) % render the first episode of the dataset
- For episode-based algorithms, the low level dataset is not returned. In this case, you can call show_simulation, which executes only one episode and renders it. This visualization can also be used in step-based algorithms.
show_simulation(mdp,policy,100,0.001) % step-based case: render one episode with the current policy
show_simulation(mdp,policy.update(policy_high.drawAction(1)),100,0.001) % episode-based case: first draw the low level parameters from the high level policy
If the MDP provides pixel rendering, you can enable it by adding an additional argument to the function call:
show_simulation(mdp,policy,100,0.001,1)
Plot policies
If the state space is 2-dimensional, you can plot the value functions learned by policies and the action distribution over the states.
SettingDeep % deep sea treasure setup
RUN_PG % run policy gradient (terminate by CTRL+C)
policy.plotQ(mdp.stateLB,mdp.stateUB) % plot Q-function
policy.plotV(mdp.stateLB,mdp.stateUB) % plot V-function
policy.plotGreedy(mdp.stateLB,mdp.stateUB) % plot the action taken by zeroing the exploration
MOMDPs Pareto frontier
To plot a set of points as a Pareto frontier of a MOMDP, use MOMDP.plotfront. You can pass additional arguments, as in the built-in plot, to customize the plot. Please note that the points have to be passed as rows and that the function does not filter out dominated points.
MOMDP.plotfront([0.5 0.5; 1 0; 0 1], '--or', 'LineWidth', 2)