👉 Setup Python environment for the repo

Course general Notes (Google Docs)
Notes for env setup (GitHub Gist)
Notes for issues (GitHub Gist)

👉 Unity enviroment `Tennis` vector game (P3 Project Submission)

The model achieved an average score of 0.50 between episodes 4336 and 4435 (step 157,506), and peaked at 2.6 around step 200,000.
Check the training logs on W&B.

✅ Project description

Check the project information (multi-agent reinforcement learning (MARL))
Check the course notes

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,
- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single score for each episode.
The environment is considered solved, when the average (over 100 episodes) of those scores is at least +0.5.

✅ Multi-Agent Deep Deterministic Policy Gradient (MADDPG) solution

MADDPG, or Multi-agent DDPG, extends DDPG into a multi-agent policy gradient algorithm where decentralized agents learn a centralized critic based on the observations and actions of all agents. It leads to learned policies that only use local information (i.e. their own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents, and is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior. The critic is augmented with extra information about the policies of other agents, while the actor only has access to local information. After training is completed, only the local actors are used at execution phase, acting in a decentralized manner.

DDPG relies on a single Q-network to estimate action values, which can lead to overestimation and make it harder to converge, especially in multi-agent environments. In contrast, TD3, or Twin Delayed DDPG uses two Q-networks (typically taking the smaller Q-value) to minimize overestimation. It also adds clipped noise to the target actor’s outputs when calculating target Q-values, which helps smooth out the critic’s losses, thereby improving the overall stability during training. For each agent, 6 networks (1 local actor, 2 local critics, 1 target actor, and 2 target critics) and 1 replay buffer will be created. During training, an agent can access the observations and actions of other agents. During execution, however, each agent relies on its own observations and receives actions from its own local actor.
Prioritized replay buffers are used to improve the speed of convergence, since the rewards are very sparse.
Multi-environments, implemented using the multiprocessing library (check the file ..\python\baselines\baselines\common\vec_env\subproc_vec_env.py), can be used for parallel training here, which can add diversity to the experiences. Additionally, asynchronous stepping may help speed up the training and evalation processes.

✅ entry points

working directory: $ cd python
python/experiments/deeprl_maddpg_continuous.py: train
$ python -m experiments.deeprl_maddpg_continuous --is_training True (training)
$ python -m experiments.deeprl_maddpg_continuous (evaluation)
python/experiments/deeprl_maddpg_plot.py: plot train and eval scores
$ python -m experiments.deeprl_maddpg_plot
launch tensorboard in VSCode: $ tensorboard --logdir=./data/tf_log

✅ setup Python environment

notes for env setup
notes for issues

✅ Implementation

Reuse the DDPG framework from P2-Unity Reacher (multi-envs, many resuable functions and components, etc.)
- All the code is integrated with ShangtongZhang's deeprl framework which uses some OpenAI Baselines functionalities.
- One task can step multiple envs, either with a single process, or with multiple processes. multiple tasks can be stepped synchronously.
Instantiate the DeterministicActorCriticNet class to create 4-6 Networks (2 objects) per agent:
actor-critic(s), target actor-critic(s)
Soft updates for target networks, AdamW optimizer on actor-critic(s) networks
A central Prioritized Experience Replay (PER) buffer:
- Storing new memories, priority sampling, updating priorities using critic Q-values
The MADDPGAgent class to choose actions, do soft updates, save models
The Task class to handle list of agents and train/eval functions
Utility functions to reshape the observations and actions, etc.
Human readable logs and tensorboard logs
- Train and eval tasks create both readable and tensorboard logs
- The plot functionality uses tensorboard log data
Weights & Biases
- training logs and sweeping

✅ Coding

An env has 2 agents playing with each other. Refer to this notebook.
Create class MADDPGAgent(BaseAgent) in the file ..\python\deeprl\agent\MADDPG_agent.py.
Create train and eval functions in the file ..\python\experiments\deeprl_maddpg_continuous.py.
Add the brain name 'TennisBrain' and the episodic return logic in the function get_return_from_brain_info() in the file ..\python\deeprl\component\envs.py'.

In the get_env_fn() function, for Gym games, the environment class is wrapped using OriginalReturnWrapper(). Inside the wrapper class's step() and reset() method, info['episodic_return'] = self.total_rewards is defined. However, for Unity games, the environment is already instantiated at the same location, so it can't be wrapped with an wrapper class. Instead, we define info['episodic_return'] within classes UnityVecEnv and UnitySubprocVecEnv, which call the get_return_from_brain_info() function where info is actually populated.

For the Tennis game, we sum the rewards each agent receives (without discounting) to get individual scores for both agents, resulting in two potentially different scores. We then take the higher score as the episodic return. However, since the two agents are always competing, one score eventually becomes consistently higher by about 0.11. To simplify, we can use the average of the scores as the episodic return without changing any related code. This means if the target score is 0.5, an average score above 0.445 would indicate the environment is solved.
The class PrioritizedReplay implementation is in the file ..\deeprl\component\replay.py
Local and target actor-critic netowrks architecture (It can be found in each human readable log file.)
- DDPG
- TD3

✅ Training

DDPG + uniform replay

Some of the hyperparameters

config.min_memory_size = int(1e6)
config.mini_batch_size = 256
config.replay_fn = lambda: PrioritizedReplay(memory_size=config.min_memory_size, 
                                             batch_size=config.mini_batch_size)
config.discount = 0.99  ## λ lambda, Q-value discount rate
config.random_process_fn = lambda: GaussianProcess(
    size=(config.action_dim,), std=LinearSchedule(0.3))  ## noise to add
config.noise_decay_rate = 0.3  ## config.random_process.sample() * (1/(self.total_episodes+1)**config.noise_decay_rate)
## before it is warmed up, use random actions, do not sample from buffer or update neural networks
config.warm_up = int(1e4) ## can't be 0 steps, or it will create a deadloop in buffer
config.replay_interval = 1  ## replay-policy update every n steps
config.actor_update_freq = 2  ## update the actor once for every n updates to the critic
config.target_network_mix = int(5e-3)  ## τ: soft update rate = 0.5%, trg = trg*(1-τ) + src*τ

With this setting, the model successfully solved the environment, achieving an average score above 0.5 after 60,000 training steps (around 1,200 episodes). The progress slowed afterward (slower to reach higher scores), likely due to excessive noise and an insufficient decay rate. However if add too little noise at the start, the model easily gets stuck and hard to move on, if add too much moise latter, the training score climbs slowly.

TD3 + prioritized replay
- Some of the hyperparameters
```
config.min_memory_size = int(1e5)
torch.optim.AdamW(params, lr=1e-4) for all the networks
config.random_process_fn = lambda: GaussianProcess(
    size=(config.action_dim,), std=LinearSchedule(1))
config.action_noise_factor = 0.1
config.policy_noise_factor = 0.2
config.noise_clip = (-0.5, 0.5)
```
- With this setup, the model achieved an average score of 0.50 between episodes 4336 and 4435 (step 157,506), and peaked at 2.6 around step 200,000.
  Check the training logs on W&B.
- One of the main challenges in this environment is the sparse reward structure, which can easily cause training to stall, particularly during the early stages. For example, check some of the training metrics in the early stages. The training score might level off around 0.06 for approximately 400,000 steps before it begins to improve. Additionally, the training process is highly sensitive to hyperparameter settings, making it difficult to find a configuration that leads to convergence. I've come across research areas like 'sparse rewards' and 'reward reshaping,' which could potentially help improve performance.
- If one or both of the actor losses didn’t decrease (oscillating or even increasing) during the early stages of training, it means the model wasn’t learning and would typically end up with a score close to 0.

✅ reference

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (2020)
https://arxiv.org/pdf/1706.02275
Prioritized Experience Replay (2015)
http://arxiv.org/abs/1511.05952
Competitive Multi-Agent Reinforcement Learning (DDPG) with TorchRL Tutorial (2022)
https://pytorch.org/rl/0.4/tutorials/multiagent_competitive_ddpg.html
OpenAI Spinning Up: Twin Delayed DDPG (TD3)
https://spinningup.openai.com/en/latest/algorithms/td3.html
https://github.com/sfujim/TD3/blob/master/TD3.py
https://spinningup.openai.com/en/latest/_modules/spinup/algos/pytorch/td3/td3.html

👉 AlphaZero

Tic-Tac-Toe notebook
Tic-Tac-Toe-advanced notebook

👉 Unity enviroment `Reacher-v2` vector game (Project Submission)

✅ setup Python environment

notes for env setup
notes for issues

✅ entry points

working directory: $ cd python
python/experiments/deeprl_ddpg_continuous.py: train
$ python -m experiments.deeprl_ddpg_continuous --is_training True (training)
$ python -m experiments.deeprl_ddpg_continuous (evaluation)
python/experiments/deeprl_ddpg_plot.py: plot train and eval scores
$ python -m experiments.deeprl_ddpg_plot
launch tensorboard in VSCode: $ tensorboard --logdir=./data/tf_log

✅ Result: A DDPG model was trained in one Unity-Reacher-v2 environment with 1 agent (1 robot arm) for 155 episodes, then evaluated in 3 environments (each with 1 agent) parallelly for 150 consecutive episodes and got an average score of 33.92(0.26) (0.26 is the standard deviation of scores in different envs). also the trained model is tested to control 20 agents in 4 envs parallelly and got a score of 34.24(0.10).

evaluation with graphics

Notes:
- the 4 envs and each its own 1 (or 20) agents above were controlled by one single DDPG model at the same time.
- observation dimension [num_envs, num_agents (per env), state_size] will be converted to [num_envs*num_agents, state_size] to pass through the neural networks.
- during training, action dimension will be [mini_batch_size (replay batch), action_size];
  during evaluation, the local network will ouput actions with dimension [num_envs*num_agents, action_size], and it will be converted to [num_envs, num_agents, action_size] to step the envs.
train and eval scores
monitor train-eval scores with tensorboard
DDPG neural networks architecture
evaluation result (in 3 envs for 150 consecutive episodes)
saved files (check the folder)
- trained model
- train log (human readable):
  you can find all the configuration including training hyperparameters, network architecture, train and eval scores, here.
- tf_log (tensorflow log, will be read by the plot modules)
- eval log (human readable)

✅ major efforts in coding

all the code is integrated with ShangtongZhang's deeprl framework which uses some OpenAI Baselines functionalities.
one task can step multiple envs, either with a single process, or with multiple processes. multiple tasks can be stepped synchronously.

to enable multiprocessing of Unity environments, the following code has had to be modified.
in python/unityagents/rpc_communicator.py

class UnityToExternalServicerImplementation(UnityToExternalServicer):
    # parent_conn, child_conn = Pipe() ## removed by nov05
...
class RpcCommunicator(Communicator):
    def initialize(self, inputs: UnityInput) -> UnityOutput: # type: ignore
        try:
            self.unity_to_external = UnityToExternalServicerImplementation()
            self.unity_to_external.parent_conn, self.unity_to_external.child_conn = Pipe() ## added by nov05

Task UML diagram

Agent UML diagram
launch multiple Unity environments parallelly (not used in the project) from an executable file (using Python Subprocess and Multiprocess, without MLAgents)
- the major code file python\unityagents\environment2.py
- check the video of how to run the code ($python -m tests2.test_unity_multiprocessing)

✅ reference

https://arxiv.org/abs/1509.02971

👉 OpenAI Gym's Atari `Pong` pixel game

notebooks
- PPO without clipping: Colab, GitHub
- PPO with clipping, Colab, GitHub

👉 Unity ML-Agents `Banana Collectors` (Project Submission)

For this toy game, two Deep Q-network methods are tried out. Since the observations (states) are simple (not in pixels), convolutional layers are not in use. And the evaluation results confirm that linear layers are sufficient for solving the problem.
- Double DQN, with 3 linear layers (hidden dims: 256*64, later tried with 64*64)
- Dueling DQN, with 2 linear layers + 2 split linear layers (hidden dims: 64*64)

▪️ The Dueling DQN architecture is displayed as below.

Dueling Architecture	The green module

▪️ Since both the advantage and the value stream propagate gradients to the last convolutional layer in the backward pass, we rescale the combined gradient entering the last convolutional layer by 1/√2. This simple heuristic mildly increases stability.

        self.layer1 = nn.Linear(state_size, 64)
        self.layer2 = nn.Linear(64, 64)
        self.layer3_adv = nn.Linear(in_features=64, out_features=action_size) ## advantage
        self.layer3_val = nn.Linear(in_features=64, out_features=1) ## state value

    def forward(self, state):
        x = F.relu(self.layer1(state))
        x = F.relu(self.layer2(x))
        adv, val = self.layer3_adv(x), self.layer3_val(x)
        return (val + adv - adv.mean(1).unsqueeze(1).expand(x.size(0), action_size)) / (2**0.5)

▪️ In addition, we clip the gradients to have their norm less than or equal to 10. This clipping is not standard practice in deep RL, but common in recurrent network training (Bengio et al., 2013).

        ## clip the gradients
        nn.utils.clip_grad_norm_(self.qnetwork_local.parameters(), 10.)
        nn.utils.clip_grad_norm_(self.qnetwork_target.parameters(), 10.)

The following picture shows the train and eval scores (rewards) for both architectures. Since it is a toy project, trained models are not formally evaluated. We can roughly see that Dueling DQN slightly performs better with an average score of 17 vs. Double DQN 13 in 10 episodes.

Project artifacts:
- All the notebooks (trained in Google Colab, evaluated on local machine)
- The project folder p1_navigation (which contains checkpoints dqn_checkpoint_2000.pth and dueling_dqn_checkpoint_2000.pth)
- Video recording (which demonstrates how trained models are run on the local machine)
- Project submission repo

👉 C++ Crash Course

Some of the course content
code examples

👉 Logs

2024-11-17 p3 Unity Tennis submission
2024-04-10 p2 Unity Reacher v2 submission
2024-03-07 Python code to launch multiple Unity environments parallelly from an executable file
...
2024-02-14 Banana game project submission
2024-02-11 Unity MLAgent Banana env set up
2024-02-10 repo cloned

Deep Reinforcement Learning Nanodegree

This repository contains material related to Udacity's Deep Reinforcement Learning Nanodegree program.

Dynamic Programming: Implement Dynamic Programming algorithms such as Policy Evaluation, Policy Improvement, Policy Iteration, and Value Iteration.
Monte Carlo: Implement Monte Carlo methods for prediction and control.
Temporal-Difference: Implement Temporal-Difference methods such as Sarsa, Q-Learning, and Expected Sarsa.
Discretization: Learn how to discretize continuous state spaces, and solve the Mountain Car environment.
Tile Coding: Implement a method for discretizing continuous state spaces that enables better generalization.
Deep Q-Network: Explore how to use a Deep Q-Network (DQN) to navigate a space vehicle without crashing.
Robotics: Use a C++ API to train reinforcement learning agents from virtual robotic simulation in 3D. (External link)
Hill Climbing: Use hill climbing with adaptive noise scaling to balance a pole on a moving cart.
Cross-Entropy Method: Use the cross-entropy method to train a car to navigate a steep hill.
REINFORCE: Learn how to use Monte Carlo Policy Gradients to solve a classic control task.
Proximal Policy Optimization: Explore how to use Proximal Policy Optimization (PPO) to solve a classic reinforcement learning task. (Coming soon!)
Deep Deterministic Policy Gradients: Explore how to use Deep Deterministic Policy Gradients (DDPG) with OpenAI Gym environments.
- Pendulum: Use OpenAI Gym's Pendulum environment.
- BipedalWalker: Use OpenAI Gym's BipedalWalker environment.
Finance: Train an agent to discover optimal trading strategies.

Labs / Projects

The labs and projects can be found below. All of the projects use rich simulation environments from Unity ML-Agents. In the Deep Reinforcement Learning Nanodegree program, you will receive a review of your project. These reviews are meant to give you personalized feedback and to tell you what can be improved in your code.

The Taxi Problem: In this lab, you will train a taxi to pick up and drop off passengers.
Navigation: In the first project, you will train an agent to collect yellow bananas while avoiding blue bananas.
Continuous Control: In the second project, you will train an robotic arm to reach target locations.
Collaboration and Competition: In the third project, you will train a pair of agents to play tennis!

Resources

Cheatsheet: You are encouraged to use this PDF file to guide your study of reinforcement learning.

OpenAI Gym Benchmarks

Classic Control

Acrobot-v1 with Tile Coding and Q-Learning
Cartpole-v0 with Hill Climbing | solved in 13 episodes
Cartpole-v0 with REINFORCE | solved in 691 episodes
MountainCarContinuous-v0 with Cross-Entropy Method | solved in 47 iterations
MountainCar-v0 with Uniform-Grid Discretization and Q-Learning | solved in <50000 episodes
Pendulum-v0 with Deep Deterministic Policy Gradients (DDPG)

Box2d

BipedalWalker-v2 with Deep Deterministic Policy Gradients (DDPG)
CarRacing-v0 with Deep Q-Networks (DQN) | Coming soon!
LunarLander-v2 with Deep Q-Networks (DQN) | solved in 1504 episodes

Toy Text

FrozenLake-v0 with Dynamic Programming
Blackjack-v0 with Monte Carlo Methods
CliffWalking-v0 with Temporal-Difference Methods

Dependencies

To set up your python environment to run the code in this repository, follow the instructions below.

Create (and activate) a new environment with Python 3.6.

Linux or Mac:

conda create --name drlnd python=3.6
source activate drlnd

Windows:

conda create --name drlnd python=3.6 
activate drlnd

If running in Windows, ensure you have the "Build Tools for Visual Studio 2019" installed from this site. This article may also be very helpful. This was confirmed to work in Windows 10 Home.
Follow the instructions in this repository to perform a minimal install of OpenAI gym.
- Next, install the classic control environment group by following the instructions here.
- Then, install the box2d environment group by following the instructions here.

Clone the repository (if you haven't already!), and navigate to the python/ folder. Then, install several dependencies.

git clone https://github.com/udacity/deep-reinforcement-learning.git
cd deep-reinforcement-learning/python
pip install .

Create an IPython kernel for the drlnd environment.

python -m ipykernel install --user --name drlnd --display-name "drlnd"

Before running code in a notebook, change the kernel to match the drlnd environment by using the drop-down Kernel menu.

Want to learn more?

Come learn with us in the Deep Reinforcement Learning Nanodegree program at Udacity!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!