diff --git a/README.md b/README.md
index 0cd26d3..a26a5d7 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,14 @@
 # reinforcement-learning-skiing
 
-## Documentation
-For the detailed explanation of the solution please read the [report](./documentation.pdf).
+## Description
+
+The goal of the project was to implement and compare two algorithms for the Atari game Skiing: DQN and DDQN. The project was implemented in Python 3.11.5.
+You can run it by following the instructions described below; the documentation is available [here](#documentation).
 
-## Required libraries
+## How to run the project
+
+### 1. Install the required libraries
 
 ```bash
 conda create -n rl-skiing python=3.11.5
@@ -30,9 +35,7 @@ pip3 install torch
 pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
 ```
 
-## How to run the project
-
-### 1. Set the PYTHONPATH
+### 2. Set the PYTHONPATH
 
 - Windows
     - Powershell:
@@ -52,7 +55,7 @@ set PYTHONPATH=.
 export PYTHONPATH=.
 ```
 
-### 2. Run the project
+### 3. Run the project
 
 - Run DQN or DDQN:
@@ -64,4 +67,256 @@ python src/main.py [--dqn | --ddqn]
 
 ```bash
 python src/test.py [--dqn | --ddqn]
-```
\ No newline at end of file
+```
+
+## Documentation
+
+### Environment
+
+For the environment we used the Atari Skiing game from Gymnasium (the maintained fork of OpenAI Gym). The environment is described in depth at this [link](https://gymnasium.farama.org/environments/atari/skiing/#skiing), but we summarize it here too.
+
+![Environment](./docs/images/image1.gif)
+
+#### Observations
+
+We use the ALE/Skiing-v5 version. The observations are frames returned with the shape Box(0, 255, (210, 160, 3), uint8), where 3 is the number of RGB channels, and 210 and 160 are the height and the width, respectively.
+
+#### Actions
+
+The skier has 3 possible actions: 0 (NOOP / DOWN), 1 (RIGHT) and 2 (LEFT).
+
+#### Rewards
+
+The reward alternates between -6 and -7 during the run, and in the final state, when we pass the red flags, we get a large negative reward that depends on the number of flag gates we went through and on the time it took to finish the run (e.g. -8500).
+
+#### Details
+
+- If the skier hits any obstacle, tree or flag, the run does not end, nor does the skier get any negative reward at that moment, but the time lost will impact the total reward at the end of the run.
+- We use the environment variant with a frameskip of 4 (see the snippet after this list), which helps a lot with training speed. The rewards are scaled correctly by the game for the frameskip, therefore the baselines we present apply to our environment even though some of them may have been measured on an environment with no frame skipping.
+- While researching improvements to our implementation, we found out, after reading the [Agent57 Paper](https://arxiv.org/pdf/2003.13350.pdf), that our environment is one of the hardest to train on in the Atari suite. By its nature, with the big negative reward arriving only at the end of the game, it is a challenging game for the agent to learn.
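+
+To make the observation, action and frameskip details above concrete, here is a minimal sketch that creates the environment and runs one random episode. It assumes `gymnasium` and `ale-py` are installed (older versions may also need the Atari ROMs installed separately, e.g. via AutoROM).
+
+```python
+import gymnasium as gym
+import ale_py  # importing registers the ALE/... environments (newer versions also expose gym.register_envs(ale_py))
+
+env = gym.make("ALE/Skiing-v5")  # the v5 variant defaults to a frameskip of 4
+
+obs, info = env.reset(seed=0)
+print(env.observation_space)                # Box(0, 255, (210, 160, 3), uint8)
+print(env.action_space)                     # Discrete(3)
+print(env.unwrapped.get_action_meanings())  # ['NOOP', 'RIGHT', 'LEFT']
+
+total_reward = 0.0
+done = False
+while not done:
+    action = env.action_space.sample()      # random agent
+    obs, reward, terminated, truncated, info = env.step(action)
+    total_reward += reward
+    done = terminated or truncated
+
+print("Random episode reward:", total_reward)
+env.close()
+```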
+
+### Baselines
+
+All the baselines come from this [link](https://paperswithcode.com/sota/atari-games-on-atari-2600-skiing). Some of the better agents presented in these baselines use additional data, such as human player data, and are trained for very long periods of time on really powerful computers. Taking all of this into consideration, our aim is to beat the baseline for random actions.
+
+![Baselines](./docs/images/image2.png)
+
+### DQN & DDQN
+
+- Given the complexity and high-dimensional state space of the Skiing game, deep reinforcement learning techniques were a suitable choice. Therefore, we decided to implement two Q-Learning algorithms: DQN and DDQN. One advantage of these methods is that they are suitable for games where the state of the environment can change rapidly. Q-Learning is an off-policy algorithm: as the model continues to learn from experience, it gradually refines its policy, making better decisions over time based on accumulated knowledge.
+- The expected reward an agent could receive by choosing an action in a particular state is given by the Q-value function. In deep reinforcement learning, neural networks are used to approximate the Q-value function, with the network taking the state as input and outputting Q-values for all possible actions.
+
+#### Performance Comparison
+
+Since both of these algorithms rely heavily on the amount of training experience, and our hardware resources were limited, there were no major differences in performance between them. Despite this, we saw a noticeable difference in stability while training.
+
+DQN: ![DQN](./docs/images/image11.png)
+DDQN: ![DDQN](./docs/images/image3.png)
+
+### Preprocessing
+
+- Due to the limited hardware resources, the first step was converting the image to grayscale. For comparison, we ran 100 episodes using RGB and the results looked similar, but the training time was much higher.
+- We wanted to focus on the most relevant visual features, and the optimal portion for learning was in the center of the image. We decided to crop out irrelevant portions, such as the timer at the top and the "Activision" footer, to improve the efficiency of the reinforcement learning model.
+- As our final preprocessing steps we resize the image to 80x80 and normalize it by dividing each pixel by 255. In the papers evaluated on Atari games, the common approach is to resize frames to 84x84. We chose not to, because when we visually compared the different sizes, 80x80 was the smallest size that still prevented loss of information, so this choice improves performance a little bit (a sketch of the full pipeline is shown after the images below).
+  ![Original State](./docs/images/image12.png)
+  ![Grayscale](./docs/images/image5.png)
+  ![Crop](./docs/images/image7.png)
+  ![Resize](./docs/images/image13.png)
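+
+Below is a minimal sketch of this preprocessing pipeline. It assumes OpenCV and NumPy are available; the crop boundaries are illustrative placeholders, not the exact values used in our code.
+
+```python
+import cv2
+import numpy as np
+
+def preprocess(frame: np.ndarray) -> np.ndarray:
+    """frame: raw RGB observation of shape (210, 160, 3), dtype uint8."""
+    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # (210, 160)
+    cropped = gray[30:180, :]                        # drop the timer (top) and footer (bottom); illustrative bounds
+    resized = cv2.resize(cropped, (80, 80), interpolation=cv2.INTER_AREA)
+    return resized.astype(np.float32) / 255.0        # scale pixels to [0, 1]
+```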
+
+### Incremental steps
+
+We started with the DDQN algorithm, because it required more work. To obtain the DQN variant, we simply stopped using the target network.
+
+#### Initial algorithm
+
+- Our first iteration was a really simple one; we just wanted to make sure that the DDQN algorithm works properly. The network was primitive: 2 convolutional layers followed by 2 linear layers, with the final layer of size 3 (the number of possible actions).
+- At this step there was no extensive training; we just made sure that our algorithm was implemented correctly.
+- At this time, our loss function was MSELoss and the optimizer was Adam.
+
+#### Optimizing the network
+
+- As it is the main actor in our implementation, the network had to be able to extract information from the observations. We tried several variants. In the end, we used the architecture showcased in the [DeepMind Paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). We chose this extensively tested and robust architecture because, after we got stuck in our improvements, we wanted to make sure that the network was not the problem.
+- For our loss function, we only tried MSELoss and HuberLoss, our final choice being the latter, as the training seemed more stable, even though the performance of the agent did not improve.
+- For our optimizer we tried Adam and RMSProp. The results were similar, but not having the time and hardware resources to tune the hyperparameters for both of them, we chose to stick with the former.
+
+#### Using stacked frames
+
+- At this point we already had some results, close to the random baseline of around -17000.
+- There was one main problem that we did not see in our implementation: we could not capture the velocity of the skier using only one frame at a time, so we opted for stacks of 4 frames.
+- This implementation remained unchanged for a few days until we realized another major flaw in our algorithm. At the start of a new episode, we took the last 3 frames from our replay memory and stacked them with the current frame, but that was completely wrong, because those frames belong to a different episode. We solved this by hardcoding the first 3 steps of each episode to take action 0 (DOWN), as sketched below. This is a good approach, because our skier needs high speed to achieve a good score, and it also solves our problem.
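+
+A sketch of this frame-stacking logic is shown here, reusing the `preprocess` function from the earlier snippet; the helper names are illustrative, not taken from our code.
+
+```python
+from collections import deque
+
+import numpy as np
+
+STACK_SIZE = 4
+
+def reset_stack(env):
+    """Start a new episode and build a frame stack that belongs entirely to it."""
+    obs, _ = env.reset()
+    frames = deque([preprocess(obs)], maxlen=STACK_SIZE)
+    for _ in range(STACK_SIZE - 1):          # 3 hardcoded DOWN (action 0) steps
+        obs, _, _, _, _ = env.step(0)        # reward bookkeeping omitted for brevity
+        frames.append(preprocess(obs))
+    return frames
+
+def stacked_state(frames) -> np.ndarray:
+    # Shape (4, 80, 80): the 4 frames become the input channels of the network.
+    return np.stack(frames, axis=0)
+```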
+
+#### Optimization - Biggest challenge
+
+- By far the biggest challenge was optimizing our implementation, from hyperparameter tuning to data visualization and analysis. We tried so many variants that some of them were lost 🙂.
+- Some hyperparameters remained unchanged, as they seemed suitable for our context. For example, epsilon, the exploration rate: we set it to 1 at the beginning of training and decreased it by a factor until it reached the minimum of 0.1. Gamma (the discount factor) was set to 0.99 and stayed there, because in our environment the actions taken matter little at the moment they are taken, but a lot in the long term.
+
+##### Replay Memory
+
+In theory, increasing the memory size results in better performance. This especially applies to our game, where the actions are not important at the moment they are taken, but at the end. In our case, once again, the hardware limitations forced us to reduce the memory size. Firstly, we are limited to 16 GB of RAM, which already fills up at a memory size of 500_000 transitions; beyond that the system starts swapping to the solid state drive, which is slow and reduces the training speed. The second limitation we faced was the number of episodes trained. The longest training sessions we did were 5000 episodes, which took around 24 hours depending on the batch size and the network used. When we start the training process, we fill the replay memory with 50_000 initial transitions. With a memory size of 500_000, we can store about 500 episodes at a time (~1000 steps per episode * 500 episodes). We believe this value is good, considering our limitations, because when we run 5000 episodes, there will not be any redundant data in the replay memory after the first half of the training process.
+
+##### Loss function
+
+As previously stated, we tried MSELoss and HuberLoss. For HuberLoss we tried to tune the delta, but the default value was the best option.
+
+##### Optimizer
+
+Both Adam and RMSProp did a similar job, but we stuck with Adam, since we did not have the time or computational resources for a grid search over both.
+
+##### Update frequency
+
+In the literature, the best interval for updating the target network with the weights of the policy network is between 5000 and 10000 learning steps. Since we have around 1000 steps per episode, and therefore around 1000 learning steps per episode, we tried updating the target network every 5, 7 and 10 episodes. We chose to stick with the middle value, 7, since it provided the most stability.
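+
+Putting the last few subsections together, here is a sketch of what a single DDQN learning step looks like with these choices (HuberLoss with the default delta, Adam, gamma = 0.99, target sync every 7 episodes). The function and variable names are illustrative, not taken from our code.
+
+```python
+import torch
+import torch.nn.functional as F
+
+GAMMA = 0.99
+TARGET_UPDATE_EPISODES = 7
+
+def learn_step(policy_net, target_net, optimizer, batch):
+    states, actions, rewards, next_states, dones = batch   # tensors sampled from the replay memory
+
+    # Q(s, a) from the policy network for the actions that were actually taken.
+    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
+
+    with torch.no_grad():
+        # DDQN: the policy network selects the next action,
+        # the target network evaluates it.
+        next_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
+        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
+        targets = rewards + GAMMA * next_q * (1.0 - dones)
+
+    loss = F.huber_loss(q_values, targets)   # default delta = 1.0
+    optimizer.zero_grad()
+    loss.backward()
+    optimizer.step()
+    return loss.item()
+
+# At the end of every TARGET_UPDATE_EPISODES-th episode:
+#     target_net.load_state_dict(policy_net.state_dict())
+```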
+
+##### Epsilon decay
+
+To balance exploration and exploitation, we used a decaying epsilon value. We start at 1 (100% chance of exploration) and decay it by a factor of 0.9975 (epsilon * 0.9975) per episode, down to a minimum of 0.1. We tried different decay factors (0.90, 0.95, 0.99, 0.9975) and different minimums (0.3, 0.1, 0.01), but the values described above performed best. With the selected values, epsilon reaches the minimum of 0.1 after about 920 episodes (0.9975^920 ≈ 0.1), so the agent sees almost 1 million frames before doing mostly exploitation.
+
+##### Batch size
+
+We tried various batch sizes: 16, 32, 64, 256. The best values were 32, 64 and 256. We chose 32, since the training time was roughly proportional to the batch size.
+
+##### Decaying learning rate
+
+Seeing that the average loss reached a plateau pretty soon, around 50 episodes, we implemented a decaying learning rate, starting from 0.1 or 0.01 and going down to 0.00025. The decay was triggered if more than 10 episodes went by without improving the current minimum average loss. The idea worked, in that the loss decreased constantly, but there was a catch. In a relatively short training session of 5k episodes, our agent does not have time to find the optimal policy, so we end up learning bad policy networks and then reinforcing them through the decayed learning rate. As a result, we decided to abandon this idea.
+
+##### Rewards
+
+We tried three different reward transformations: clipping to [-1, 0], normalizing to [-1, 0], and flipping the sign to positive. None of these methods proved useful.
+
+##### Final results
+
+The next plot is the result of our best training session, which took 2000 episodes and around 12 hours to finish. The model is far from being good at the game, but the plot indicates a really slow but steady improvement, hitting higher peaks more often in its final 500 episodes. This model gets a median reward of -14895.0 when tested on 100 episodes.
+![Final Results](./docs/images/image8.png)
+
+Here are other runs that we did while trying to tune the hyperparameters:
+![Other Runs](./docs/images/image4.png)
+![Other Runs](./docs/images/image10.png)
+
+### Agent is trolling us?
+
+In our runs, we observed a really good score, almost too good to be true. After finishing the training, we tested the agent: it was constantly receiving the same score, a really good one of -9013.
+
+```txt
+Episode 1 finished with a reward of -9013.0!
+Episode 2 finished with a reward of -9013.0!
+Episode 3 finished with a reward of -9013.0!
+…
+The same for 100 episodes!
+```
+
+But when we tested it visually, we observed that our agent had found a little trick. If it goes straight down (only choosing action 0), it ends the episode pretty fast, since there is only a small chance of colliding with the flags or trees, and a pretty good chance of passing through some of the flag gates.
+![Agent is trolling us](./docs/images/image9.gif)
+
+### Future work / ➡️➡️ Next steps
+
+#### Reward shaping
+
+We can try to shape the rewards in other ways. Some of the ideas that come to mind are to detect when we are between flags using computer vision techniques and give a positive reward for those frames. We can also force the agent to always keep its velocity, taxing it with a big negative reward for staying in place; this happens when LEFT and RIGHT are chosen many times in a row. We can also encourage the agent to pick action 0 (DOWN) once every x frames by giving it a reward, but this must be done carefully so that the agent does not troll us again by choosing only that action, as it did before.
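+
+As a starting point, the "keep your velocity" idea could be prototyped with a small environment wrapper like the sketch below; the window and penalty values are hypothetical and untuned.
+
+```python
+import gymnasium as gym
+
+class KeepSpeedPenalty(gym.Wrapper):
+    """Add a small extra penalty when DOWN (action 0) has not been chosen for a while."""
+
+    def __init__(self, env, window: int = 20, penalty: float = -5.0):
+        super().__init__(env)
+        self.window = window
+        self.penalty = penalty
+        self.steps_since_down = 0
+
+    def reset(self, **kwargs):
+        self.steps_since_down = 0
+        return self.env.reset(**kwargs)
+
+    def step(self, action):
+        obs, reward, terminated, truncated, info = self.env.step(action)
+        self.steps_since_down = 0 if action == 0 else self.steps_since_down + 1
+        if self.steps_since_down >= self.window:
+            reward += self.penalty      # tax the agent for zig-zagging in place
+        return obs, reward, terminated, truncated, info
+```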
+
+#### Grid search
+
+We can try doing a grid search over the hyperparameters to find the best possible values for our game.
+
+#### Run, Run and Run
+
+Try larger memory sizes, try more episodes.
+
+### Conclusions - Random vs DDQN Agent
+
+#### Random Agent
+
+![Random Agent](./docs/images/image14.gif)
+
+#### DDQN Agent
+
+![DDQN Agent](./docs/images/image6.gif)
+
+From this video and other runs we observed, it seems that our agent only learnt that it needs good speed in order to get good scores; it does not seem to care too much about the flags or trees. For a bigger improvement we could continue by creating custom rewards for hitting the trees, getting stuck in one place or hitting the walls.
+
+While it may not be obvious at first glance, our agent did a little bit of learning. Our hardware limitations may or may not have been our biggest problem, forcing us to take small learning and optimization steps. Nevertheless, the game we chose is a difficult one; we did not achieve a really smart agent, but we believe that this was the beauty of the project. Its complexity led us to write, test, experiment, read, research and try all sorts of new techniques.
+
diff --git a/docs/images/image1.gif b/docs/images/image1.gif
new file mode 100644
index 0000000..5e8e1a1
Binary files /dev/null and b/docs/images/image1.gif differ
diff --git a/docs/images/image10.png b/docs/images/image10.png
new file mode 100644
index 0000000..3df8bfe
Binary files /dev/null and b/docs/images/image10.png differ
diff --git a/docs/images/image11.png b/docs/images/image11.png
new file mode 100644
index 0000000..07f7cf9
Binary files /dev/null and b/docs/images/image11.png differ
diff --git a/docs/images/image12.png b/docs/images/image12.png
new file mode 100644
index 0000000..f2b21a3
Binary files /dev/null and b/docs/images/image12.png differ
diff --git a/docs/images/image13.png b/docs/images/image13.png
new file mode 100644
index 0000000..0505102
Binary files /dev/null and b/docs/images/image13.png differ
diff --git a/docs/images/image14.gif b/docs/images/image14.gif
new file mode 100644
index 0000000..75ca40b
Binary files /dev/null and b/docs/images/image14.gif differ
diff --git a/docs/images/image2.png b/docs/images/image2.png
new file mode 100644
index 0000000..397ba59
Binary files /dev/null and b/docs/images/image2.png differ
diff --git a/docs/images/image3.png b/docs/images/image3.png
new file mode 100644
index 0000000..1b823ee
Binary files /dev/null and b/docs/images/image3.png differ
diff --git a/docs/images/image4.png b/docs/images/image4.png
new file mode 100644
index 0000000..4c2cf17
Binary files /dev/null and b/docs/images/image4.png differ
diff --git a/docs/images/image5.png b/docs/images/image5.png
new file mode 100644
index 0000000..ede3932
Binary files /dev/null and b/docs/images/image5.png differ
diff --git a/docs/images/image6.gif b/docs/images/image6.gif
new file mode 100644
index 0000000..c9c37e0
Binary files /dev/null and b/docs/images/image6.gif differ
diff --git a/docs/images/image7.png b/docs/images/image7.png
new file mode 100644
index 0000000..584d9b6
Binary files /dev/null and b/docs/images/image7.png differ
diff --git a/docs/images/image8.png b/docs/images/image8.png
new file mode 100644
index 0000000..9d406e6
Binary files /dev/null and b/docs/images/image8.png differ
diff --git a/docs/images/image9.gif b/docs/images/image9.gif
new file mode 100644
index 0000000..242f183
Binary files /dev/null and b/docs/images/image9.gif differ