Policy gradient is an on-policy method that seeks to directly optimize the policy using sampled trajectories as weights. Those weights indicate how well the policy performed. Based on that signal, the algorithm updates the parameters of its policy to make actions leading to similar good trajectories more likely and actions leading to similar bad trajectories less likely. In the case of Deep Reinforcement Learning, the policy is parameterized by a neural network. For this essay, I studied and implemented the basic version of policy gradient, also known as REINFORCE (a minimal sketch of the update loop is shown after the resource list below). I also complemented my reading with the following resources:
- CS 294-112 Deep Reinforcement Learning: lectures 4, 5 and 9, by Sergey Levine, UC Berkeley;
- OpenAI Spinning Up: Intro to Policy Optimization, by Josh Achiam;
- and the Lil'Log blog: Policy Gradient Algorithms, by Lilian Weng, research intern at OpenAI.
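To make the weighting idea concrete, here is a minimal REINFORCE sketch. It is an illustration only, not the BasicPolicyGradient implementation: the environment choice (CartPole-v1), network size, learning rate, and episode count are assumptions, and it relies on PyTorch plus the classic Gym API (pre-0.26, where `reset()` returns an observation and `step()` returns a 4-tuple).

```python
# Minimal REINFORCE sketch (illustrative only, not the BasicPolicyGradient package).
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")  # example environment choice

# The policy parameters are a small neural net mapping states to action logits.
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 32),
    nn.Tanh(),
    nn.Linear(32, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # assumed hyperparameters

for episode in range(200):
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    # Sample one trajectory with the current policy (on-policy).
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # REINFORCE update: weight the summed log-probabilities by the trajectory
    # return R(tau), so actions from good trajectories become more likely and
    # actions from bad trajectories less likely.
    trajectory_return = sum(rewards)
    loss = -torch.stack(log_probs).sum() * trajectory_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Practical implementations batch several trajectories per update and subtract a baseline to reduce the variance of the gradient estimate; this stripped-down loop only shows the core idea of weighting log-probabilities by the trajectory return.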
Watch a recorded agent
Note: You can see an explanation of how to use the package by passing the --help flag.
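For example (assuming the package follows standard argparse behavior):

```bash
python -m BasicPolicyGradient --help
```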
```bash
cd DRLimplementation
python -m BasicPolicyGradient [--record] [--play_for=MAX_TRAJECTORIES]
# --play_for sets the maximum number of trajectories to play (default: 10)
```
Train the agent

```bash
cd DRLimplementation
python -m BasicPolicyGradient --train
```
Watch the training in TensorBoard

```bash
tensorboard --logdir=DRLimplementation/BasicPolicyGradient/graph/runs
```