<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Deep Reinforcement Learning Navigation Project</title>
<link rel="stylesheet" href="styles.css">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@100;300&display=swap" rel="stylesheet">
</head>
<body>
<div class="main">
<h1>Deep Reinforcement Learning Nanodegree</h1>
<p>
This repository contains my <b>Deep Reinforcement Learning</b> solution to one of
the problems posed within the scope of the following Nanodegree:
</p>
<a href="https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893">
Deep Reinforcement Learning Nanodegree Program - Become a Deep Reinforcement Learning Expert
</a>
<p>
Deep Reinforcement Learning is a thriving field in AI with many practical
applications. It uses Deep Learning techniques to solve a given task by taking
actions in an environment in pursuit of a certain goal. This is useful not only for
training an AI agent to play videogames, but also for setting up and solving
environments from virtually any domain. In particular, as a Civil Engineer, my main
area of interest is the <b>AEC field</b> (Architecture, Engineering &amp; Construction).
</p>
<h2>Navigation Project</h2>
<p>
This project consists of a Navigation problem, which belongs to the Value-Based
Methods chapter and is solved by means of Deep Q Learning algorithms.
</p>
<h3>Environment description</h3>
<p>
The environment used for this project is based on the 2018 <i>Unity ML-Agents</i> Banana Collector
environment. Its updated equivalent as of 2022 is the
<a href="https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Learning-Environment-Examples.md#food-collector">
Food Collector Environment</a>.
</p>
<p>
It consists of a square playground containing both yellow and blue bananas. The
agent must collect the yellow ones while avoiding the blue ones.
</p>
<img src="https://video.udacity-data.com/topher/2018/June/5b1ab4b0_banana/banana.gif" alt="Banana environment"
width="600em" class="center">
<br/>
<p>
Its characteristics are as follows:
</p>
<ul>
<li>The state space has 37 dimensions and contains the agent's velocity, along with
ray-based perception of objects in front of it.</li>
<li>The action space consists of 4 discrete actions. Namely: move forward, move
backward, turn left and turn right.</li>
<li>A reward of +1 or -1 is provided when a yellow or blue banana is collected,
respectively.</li>
</ul>
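<p>
For reference, the interaction loop with this environment can be sketched as follows.
This is a minimal sketch assuming the legacy <i>unityagents</i> Python API from the
original course setup and a local Banana executable (the file name below is illustrative);
it drives the agent with random actions just to show the state, action and reward
interfaces listed above:
</p>
<pre><code>from unityagents import UnityEnvironment
import numpy as np

# The path to the local build is an assumption; adjust it to your platform
env = UnityEnvironment(file_name="Banana.exe")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=True)[brain_name]
state = env_info.vector_observations[0]      # 37-dimensional state vector
score = 0
while True:
    action = np.random.randint(4)            # random pick among the 4 discrete actions
    env_info = env.step(action)[brain_name]
    state = env_info.vector_observations[0]
    reward = env_info.rewards[0]             # +1 for yellow, -1 for blue bananas
    done = env_info.local_done[0]
    score += reward
    if done:
        break
env.close()
</code></pre>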
<h3>Solution</h3>
<p>
The code is organized with the <i>PyTorch</i> Deep Neural Networks in <i>model.py</i>, the
agent logic in <i>agent.py</i> and the environment setup in <i>environment.py</i>. All of it
is written pythonically, in an OOP fashion, and is largely self-explanatory.
Lastly, the Jupyter Notebook <i>Navigation.ipynb</i> provides an interface to train agents
with several variations of the Deep Q Learning algorithm, as well as to
visualize the corresponding results with <i>pandas DataFrames</i> and <i>Seaborn</i> plots.
</p>
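<p>
As an orientation, training an agent from the notebook follows a pattern roughly like the
one below. The class and method names here are hypothetical placeholders used only to
illustrate how the pieces fit together; the actual names are defined in <i>environment.py</i>,
<i>agent.py</i> and <i>Navigation.ipynb</i>:
</p>
<pre><code># Hypothetical names, for illustration only
from environment import BananaEnvironment   # wrapper around the Unity executable
from agent import Agent                     # DQN agent built on the networks in model.py

env = BananaEnvironment("Banana.exe")
agent = Agent(state_size=37, action_size=4)

scores = []
for episode in range(1000):
    state = env.reset()
    score, done = 0, False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.step(state, action, reward, next_state, done)
        state = next_state
        score += reward
    scores.append(score)
</code></pre>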
<h4>Deep Q Learning</h4>
<p>
This is the vanilla Deep Q Learning algorithm, as first
<a href="https://doi.org/10.48550/arXiv.1312.5602">introduced by the DeepMind team in 2013</a>,
which was able to learn to play Atari games.
</p>
<img src="./img/dqn.png" alt="DQN" width="700em" class="center">
<h4>Double Deep Q Learning</h4>
<p>
Deep Q Learning is known to be prone to overestimating its action values, especially
at early learning stages. This variant tackles that issue by selecting the best
action with one Deep Neural Network, but evaluating it with a different one. This
makes the evolution of the action values more stable and, hence, learning more robust.
</p>
<img src="./img/ddqn.png" alt="DDQN" width="700em" class="center">
<h4>Prioritized Experience Replay</h4>
<p>
This algorithm changes the sampling of the Memory Replay buffer from uniform to one
that favours the state-action pairs that have triggered the highest Temporal
Difference error. This way, the agent learns more frequently from the state-actions
it is still unsure about. One of the drawbacks is that, depending on the
implementation details, this algorithm can be much slower than uniform sampling.
Moreover, if the hyperparameters are not tuned correctly, it can suffer from
instability issues.
</p>
<img src="./img/per.png" alt="PER" width="700em" class="center">
<h4>Dueling Deep Q Learning</h4>
<p>
In this variant of the algorithm, the neural networks are divided into two streams.
One of them accounts for the state value, whereas the other accounts for the
advantage of each possible action. Both streams are combined before the last layer
outputs the Q value of each action. Due to this architecture, the algorithm helps
to better generalize learning across actions.
</p>
<img src="./img/dueling.png" alt="DQN" width="700em" class="center">
<h4>All of the above</h4>
<p>
Lastly, an agent is trained in the environment using Double Deep Q Learning,
Prioritized Experience Replay and Dueling Deep Q Learning, all at once.
</p>
<img src="./img/all.png" alt="DQN" width="700em" class="center">
<h4>Comparison of the different algorithms</h4>
<p>
The following chart shows, for each of the implemented algorithms, the number of
episodes needed to solve the environment:
</p>
<img src="./img/comparison.png" alt="DQN" width="700em" class="center">
<p>
Needless to say, this is not a thorough comparison, since only one run per algorithm
is included, and only up to 1000 episodes. Even so, this has been a fascinating
exercise that has let me further understand the dynamics of the Deep Q Learning
algorithm, along with three of its variants.
</p>
<p>
Both the chart and the previous training plots show that, for this particular
environment and for the goal of reaching an average of 13 points per episode over
100 episodes, there has not been much of a difference between the algorithms.
However, it must be noted that the slope of the rolling average score is gentle
and, the problem being stochastic in nature, its intersection with the horizontal
line at 13 points can vary greatly across runs. Perhaps if the objective were set
higher than 13 points, requiring longer training, the differences between the
algorithms would emerge more clearly.
</p>
<p>
In any case, one of the clearest differences has been the amount of time required
by Prioritized Experience Replay. The forward pass performed at every step to
attach the Temporal Difference error to each experience added to the Memory Replay
buffer has undoubtedly slowed down training. However, this could be implemented
more efficiently, for instance by using multiprocessing, or by accumulating a
buffer of experiences that is then passed to the GPU in a single batch. The latter
would be worse per step, but probably better per unit of time spent.
</p>
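<p>
As an illustration of the second idea, the initial priorities of freshly collected
experiences could be computed in a single batched forward pass instead of one pass per
environment step. A rough sketch, reusing the illustrative names and assumptions from
the earlier snippets:
</p>
<pre><code>import torch

def batched_initial_priorities(qnetwork_local, qnetwork_target,
                               states, actions, rewards, next_states, dones,
                               gamma=0.99):
    """Absolute TD errors for a whole batch of new experiences, computed at once."""
    with torch.no_grad():
        q_expected = qnetwork_local(states).gather(1, actions)
        q_next = qnetwork_target(next_states).max(dim=1, keepdim=True)[0]
        q_targets = rewards + gamma * q_next * (1 - dones)
        return (q_targets - q_expected).abs().squeeze(1)
</code></pre>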
<h3>Ideas for next steps</h3>
<p>
In order to improve future performance, the following elements could be explored:
</p>
<ul>
<li>Longer training runs (more than 1000 episodes)</li>
<li>Hyperparameter tweaking (learning rate, epsilon decay, etc.)</li>
<li>Other variants of the Deep Q Learning algorithm, such as multi-step bootstrap
targets, distributional DQN or noisy DQN.</li>
</ul>
</div>
</body>
</html>