<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Deep Reinforcement Learning Navigation Project</title>
<link rel="stylesheet" href="styles.css">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@100;300&display=swap" rel="stylesheet">
</head>
<body>
<div class="main">
<h1>Deep Reinforcement Learning Nanodegree</h1>
<p>
This repository contains my <b>Deep Reinforcement Learning</b> solution to one of
the problems posed within the scope of the following Nanodegree:
</p>
<a href="https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893">
Deep Reinforcement Learning Nanodegree Program - Become a Deep Reinforcement Learning Expert
</a>
<p>
Deep Reinforcement Learning is a thriving field in AI with many practical
applications. It uses Deep Learning techniques to solve a given task by taking
actions in an environment in pursuit of a certain goal. This is useful not only for
training an AI agent to play videogames, but also for setting up and solving
environments from virtually any domain. In particular, as a Civil Engineer, my main
area of interest is the <b>AEC field</b> (Architecture, Engineering &amp; Construction).
</p>
<h2>Navigation Project</h2>
<p>
This project consists of a Navigation problem, which belongs to the Value-Based
Methods chapter and is solved by means of Deep Q Learning algorithms.
</p>
<h3>Environment description</h3>
<p>
The environment used for this project is based on the 2018 <i>Unity ML-Agents</i> Banana Collector
environment. Its updated equivalent as of 2022 is the
<a href="https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#food-collector">
Food Collector Environment</a>.
</p>
<p>
It consists of a square playground containing both yellow and blue bananas. The
agent must collect the yellow ones while avoiding the blue ones.
</p>
<img src="https://video.udacity-data.com/topher/2018/June/5b1ab4b0_banana/banana.gif" alt="Banana environment"
width="600em" class="center">
<br/>
<p>
Its characteristics are as follows:
</p>
<ul>
<li>The state space has 37 dimensions and contains the agent's velocity, along with
ray-based perception of objects in front of it.</li>
<li>The action space consists of 4 discrete actions. Namely: move forward, move
backward, turn left and turn right.</li>
<li>A reward of +1 or -1 is provided when a yellow or blue banana is collected,
respectively.</li>
</ul>
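<p>
For reference, the interaction loop with this environment can be sketched as follows.
This is a minimal sketch assuming the legacy <i>unityagents</i> Python API from the
original course setup and a local Banana executable (the file name below is illustrative);
it drives the agent with random actions just to show the state, action and reward
interfaces listed above:
</p>
<pre><code>from unityagents import UnityEnvironment
import numpy as np

# The path to the local build is an assumption; adjust it to your platform
env = UnityEnvironment(file_name="Banana.exe")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=True)[brain_name]
state = env_info.vector_observations[0]      # 37-dimensional state vector
score = 0
while True:
    action = np.random.randint(4)            # random pick among the 4 discrete actions
    env_info = env.step(action)[brain_name]
    state = env_info.vector_observations[0]
    reward = env_info.rewards[0]             # +1 for yellow, -1 for blue bananas
    done = env_info.local_done[0]
    score += reward
    if done:
        break
env.close()
</code></pre>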
<h3>Solution</h3>
<p>
The code is organized with the <i>PyTorch</i> Deep Neural Networks in <i>model.py</i>, the
agent logic in <i>agent.py</i> and the environment setup in <i>environment.py</i>. All of it
is written pythonically, in an OOP fashion, and is largely self-explanatory.
Lastly, the Jupyter Notebook <i>Navigation.ipynb</i> provides an interface to train agents
with several variations of the Deep Q Learning algorithm, as well as to
visualize the corresponding results with <i>pandas DataFrames</i> and <i>Seaborn</i> plots.
</p>
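<p>
As an orientation, training an agent from the notebook follows a pattern roughly like the
one below. The class and method names here are hypothetical placeholders used only to
illustrate how the pieces fit together; the actual names are defined in <i>environment.py</i>,
<i>agent.py</i> and <i>Navigation.ipynb</i>:
</p>
<pre><code># Hypothetical names, for illustration only
from environment import BananaEnvironment   # wrapper around the Unity executable
from agent import Agent                     # DQN agent built on the networks in model.py

env = BananaEnvironment("Banana.exe")
agent = Agent(state_size=37, action_size=4)

scores = []
for episode in range(1000):
    state = env.reset()
    score, done = 0, False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.step(state, action, reward, next_state, done)
        state = next_state
        score += reward
    scores.append(score)
</code></pre>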
<h4>Deep Q Learning</h4>
<p>
This is the vanilla Deep Q Learning algorithm, as first
<a href="https://doi.org/10.48550/arXiv.1312.5602">introduced by the DeepMind team in 2013</a>,
which was able to learn to play Atari games.
</p>
<img src="./img/dqn.png" alt="DQN" width="700em" class="center">
<h4>Double Deep Q Learning</h4>
<p>
Deep Q Learning is known to be prone to overestimating its action values, especially
at early learning stages. This variant tackles that issue by selecting the best
action with one Deep Neural Network, but evaluating it with a different one. This
makes the evolution of the action values more stable and, hence, learning more robust.
</p>
<img src="./img/ddqn.png" alt="DDQN" width="700em" class="center">
<h4>Prioritized Experience Replay</h4>
<p>
This algorithm changes the sampling of the Memory Replay buffer from uniform to one
that favours the state-action pairs that have triggered the highest Temporal
Difference error. This way, the agent learns more frequently from the state-actions
it is still unsure about. One of the drawbacks is that, depending on the
implementation details, this algorithm can be much slower than uniform sampling.
Moreover, if the hyperparameters are not tuned correctly, it can suffer from
instability issues.
</p>
<img src="./img/per.png" alt="PER" width="700em" class="center">
<h4>Dueling Deep Q Learning</h4>
<p>
In this variant of the algorithm, the neural networks are divided into two streams.
One of them accounts for the state value, whereas the other accounts for the
advantage of each possible action. Both streams are combined before the last layer
outputs the Q value of each action. Due to this architecture, the algorithm helps
to better generalize learning across actions.
</p>
<img src="./img/dueling.png" alt="DQN" width="700em" class="center">
<h4>All of the above</h4>
<p>
Lastly, an agent is trained in the environment using Double Deep Q Learning,
Prioritized Experience Replay and Dueling Deep Q Learning, all at once.
</p>
<img src="./img/all.png" alt="DQN" width="700em" class="center">
<h4>Comparison of the different algorithms</h4>
<p>
The following chart shows, for each of the implemented algorithms, the number of
episodes needed to solve the environment:
</p>
<img src="./img/comparison.png" alt="DQN" width="700em" class="center">
<p>
Needless to say, this is not a thorough comparison, since only one run per algorithm
is included, and only up to 1000 episodes. Even so, this has been a fascinating
exercise that has let me further understand the dynamics of the Deep Q Learning
algorithm, along with three of its variants.
</p>
<p>
Both the chart and the previous training plots show that, for this particular
environment and for the goal of reaching an average of 13 points per episode over
100 episodes, there has not been much of a difference between the algorithms.
However, it must be noted that the slope of the rolling average score is gentle
and, the problem being stochastic in nature, its intersection with the horizontal
line at 13 points can vary greatly across runs. Perhaps if the objective were set
higher than 13 points, requiring longer training, the differences between the
algorithms would emerge more clearly.
</p>
<p>
In any case, one of the clearest differences has been the amount of time required
by Prioritized Experience Replay. The forward pass performed at every step to
attach the Temporal Difference error to each experience added to the Memory Replay
buffer has undoubtedly slowed down training. However, this could be implemented
more efficiently, for instance by using multiprocessing, or by accumulating a
buffer of experiences that is then passed to the GPU in a single batch. The latter
would be worse per step, but probably better per unit of time spent.
</p>
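<p>
As an illustration of the second idea, the initial priorities of freshly collected
experiences could be computed in a single batched forward pass instead of one pass per
environment step. A rough sketch, reusing the illustrative names and assumptions from
the earlier snippets:
</p>
<pre><code>import torch

def batched_initial_priorities(qnetwork_local, qnetwork_target,
                               states, actions, rewards, next_states, dones,
                               gamma=0.99):
    """Absolute TD errors for a whole batch of new experiences, computed at once."""
    with torch.no_grad():
        q_expected = qnetwork_local(states).gather(1, actions)
        q_next = qnetwork_target(next_states).max(dim=1, keepdim=True)[0]
        q_targets = rewards + gamma * q_next * (1 - dones)
        return (q_targets - q_expected).abs().squeeze(1)
</code></pre>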
<h3>Ideas for next steps</h3>
<p>
In order to improve future performance, the following elements could be explored:
</p>
<ul>
<li>Longer training runs (more than 1000 episodes)</li>
<li>Hyperparameter tweaking (learning rate, epsilon decay, etc.)</li>
<li>Other variants of the Deep Q Learning algorithm, such as multi-step bootstrap
targets, distributional DQN or noisy DQN.</li>
</ul>
</div>
</body>
</html>