Deep Reinforcement Learning


:: Soft Actor-Critic

Soft Actor-Critic (SAC) is an off-policy algorithm based on the Maximum Entropy Reinforcement Learning framework. The main idea behind Maximum Entropy RL is to frame the decision-making problem as a graphical model from top to bottom and then solve it with tools borrowed from the field of Probabilistic Graphical Models. Under this framework, a learning agent seeks to maximize both the return and the entropy of its policy simultaneously. This approach benefits Deep Reinforcement Learning algorithms by giving them the capacity to consider and learn many alternate paths leading to an optimal goal, and the capacity to learn how to act optimally despite adverse circumstances.
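
As a rough illustration of the idea (not code from this repository), the entropy-regularized objective augments each reward with an entropy bonus -α log π(a|s). The following minimal sketch assumes illustrative values for α and γ and a sampled estimate of the entropy:

def soft_return(rewards, log_probs, alpha=0.2, gamma=0.99):
    """Discounted return where each reward is augmented with the entropy
    bonus -alpha * log pi(a|s), a sampled estimate of the policy entropy."""
    g = 0.0
    for r, log_p in zip(reversed(rewards), reversed(log_probs)):
        g = (r - alpha * log_p) + gamma * g
    return g

# Example: two-step trajectory with sampled action log-probabilities
print(soft_return(rewards=[1.0, 0.5], log_probs=[-1.2, -0.8]))

A higher α puts more weight on keeping the policy stochastic, which is what lets the agent keep several alternate solution paths alive during training.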

Since SAC is an off-policy algorithm, it can train on samples collected by a different policy. What is particular, though, is that contrary to other off-policy algorithms, it is stable. This means the algorithm is much less picky in terms of hyperparameter tuning.
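
In practice, off-policy training means past transitions are stored in a replay buffer and sampled for each gradient step, regardless of which behaviour policy produced them. Here is a minimal sketch of such a buffer (class and method names are illustrative assumptions, not the ones used in this repository):

import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions from any behaviour policy."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def push(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size=256):
        # Uniform sampling: the update does not care which policy generated
        # the transitions, which is what makes the algorithm off-policy.
        return random.sample(self.storage, batch_size)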

SAC is currently a state-of-the-art Deep Reinforcement Learning algorithm, together with Twin Delayed Deep Deterministic policy gradient (TD3).

The learning curve of the Maximum Entropy RL framework is quite steep due to its depth and to how much it rethinks the RL problem, but climbing it was definitely required in order to understand how SAC works. Tackling the applied part was arguably the most difficult project I have done to date, both in terms of components to implement and silent-bug difficulties. Nevertheless, I'm particularly proud of the result.

See my blog post Soft Actor-Critic part 1: intuition and theoretical aspect for more details on SAC and MaxEnt-RL.

Reading material:

I've also complemented my reading with the following resources:


Download the essay pdf: Deep Reinforcement Learning – Soft Actor-Critic

Watch the mp4 video - Soft Actor-Critic post-training test run on a 2X harder LunarLanderContinuous-v2 environment


The Soft Actor-Critic implementation:

Note: You can get an explanation of how to use the package with the --help flag

To watch the trained algorithm

cd DRLimplementation
python -m SoftActorCritic [--playLunar, --playHardLunar, --playPendulum] [--record]
                          [--play_for=<max trajectories>] (default=10) [--harderEnvCoeficient=<coefficient>] (default=1.6)

To execute the training loop

cd DRLimplementation
python -m SoftActorCritic < trainExperimentSpecification > [--rerun] [--renderTraining] 

Choose < trainExperimentSpecification > between the following:

  • For the BipedalWalker-v2 environment: [--trainBipedalWalker]: Train a Soft Actor-Critic agent on the BipedalWalker gym environment
  • For the Pendulum-v0 environment: [--trainPendulum]: Train a Soft Actor-Critic agent on the Pendulum gym environment
  • For the LunarLanderContinuous-v2 environment: [--trainLunarLander]: Train a Soft Actor-Critic agent on the LunarLander gym environment
  • Experimentation utility: [--trainExperimentBuffer]: Run a batch of experiment specifications

To navigate through the computation graph in TensorBoard

cd DRLimplementation
tensorboard --logdir=SoftActorCritic/graph

Trained agent in action