Policy gradient is an on-policy method that seeks to directly optimize the policy using sampled trajectories as weights. Those weights indicate how well the policy performed. Based on that signal, the algorithm updates the parameters of its policy to make actions leading to similar good trajectories more likely and actions leading to similar bad trajectories less likely. In the case of Deep Reinforcement Learning, the policy is parameterized by a neural network. For this essay, I studied and implemented the basic version of policy gradient, also known as REINFORCE (a minimal sketch of the update loop is shown after the resource list below). I also complemented my reading with the following resources:
- CS 294-112 Deep Reinforcement Learning: lectures 4, 5 and 9, by Sergey Levine, UC Berkeley;
- OpenAI Spinning Up: Intro to Policy Optimization, by Josh Achiam;
- and the Lil'Log blog: Policy Gradient Algorithms, by Lilian Weng, research intern at OpenAI.
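To make the weighting idea concrete, here is a minimal REINFORCE sketch. It is an illustration only, not the BasicPolicyGradient implementation: the environment choice (CartPole-v1), network size, learning rate, and episode count are assumptions, and it relies on PyTorch plus the classic Gym API (pre-0.26, where `reset()` returns an observation and `step()` returns a 4-tuple).

```python
# Minimal REINFORCE sketch (illustrative only, not the BasicPolicyGradient package).
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")  # example environment choice

# The policy parameters are a small neural net mapping states to action logits.
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 32),
    nn.Tanh(),
    nn.Linear(32, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # assumed hyperparameters

for episode in range(200):
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    # Sample one trajectory with the current policy (on-policy).
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # REINFORCE update: weight the summed log-probabilities by the trajectory
    # return R(tau), so actions from good trajectories become more likely and
    # actions from bad trajectories less likely.
    trajectory_return = sum(rewards)
    loss = -torch.stack(log_probs).sum() * trajectory_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Practical implementations batch several trajectories per update and subtract a baseline to reduce the variance of the gradient estimate; this stripped-down loop only shows the core idea of weighting log-probabilities by the trajectory return.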
Watch a recorded agent
Note: You can see an explanation of how to use the package by passing the --help flag.
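For example (assuming the package follows standard argparse behavior):

```bash
python -m BasicPolicyGradient --help
```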
```bash
cd DRLimplementation
python -m BasicPolicyGradient [--record] [--play_for=MAX_TRAJECTORIES]
# --play_for sets the maximum number of trajectories to play (default: 10)
```
Train the agent

```bash
cd DRLimplementation
python -m BasicPolicyGradient --train
```
Watch the training in TensorBoard

```bash
tensorboard --logdir=DRLimplementation/BasicPolicyGradient/graph/runs
```