Commit 30075d5

github-actions[bot] committed Jul 16, 2024
1 parent bad65b9 commit 30075d5

Showing 10 changed files with 27 additions and 39 deletions.

Week 6 into GSoC 2024: Stuck with the Variational AutoEncoder, problems with Keras
==================================================================================


.. post:: July 06 2024
:author: Iñigo Tellaetxe
:tags: google
:category: gsoc



What I did this week
~~~~~~~~~~~~~~~~~~~~

This week was all about the Variational AutoEncoder. My mentors advised me to drop the TensorFlow implementation of the regression VAE I found last week and to instead integrate the variational and conditional characteristics directly into my own AE implementation, following a more modular approach. This was a good decision, as adapting third-party code to one's needs is often a bit of a mess (it was already starting to become one, so yeah). Also, once the variational part is done, implementing the conditional part should not be that hard.

To provide a bit of intuition behind the VAE, let me first illustrate the "vanilla" AE. This is a neural network that compresses the input data into a reduced representation and then tries to reconstruct the original data from it. The place where the compressed data representations live is called the "latent space". So, once the AE is good at compressing data into the latent space, we could take a sample from it and generate new data.

The objective of the AE is to minimize the difference between the input data and the generated data, also called the "reconstruction loss".
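
To make this concrete, here is a minimal sketch of a vanilla AE in Keras (the layer sizes and the flat 1D input are hypothetical; my actual model works on streamline data and uses convolutional layers):

.. code-block:: python

    from tensorflow import keras
    from tensorflow.keras import layers

    input_dim, latent_dim = 256, 32  # hypothetical sizes

    # Encoder: compress the input into the latent space
    inputs = keras.Input(shape=(input_dim,))
    encoded = layers.Dense(64, activation="relu")(inputs)
    latent = layers.Dense(latent_dim, name="latent_code")(encoded)

    # Decoder: reconstruct the input from the latent representation
    decoded = layers.Dense(64, activation="relu")(latent)
    outputs = layers.Dense(input_dim)(decoded)

    autoencoder = keras.Model(inputs, outputs, name="vanilla_ae")

    # The reconstruction loss compares the input with its reconstruction
    autoencoder.compile(optimizer="adam", loss="mse")
    # autoencoder.fit(x_train, x_train, ...)  # the input is also the target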

.. image:: /_static/images/inigo_vanilla_autoencoder.png


Usually the problem with AEs is that the latent space is full of "holes", i.e. the encoded representations form a discrete set of points rather than a continuous region. This means that if a sample is taken between two already encoded samples, the generated data will not be an interpolation between the two, but will look like a random sample. To solve this, the VAE seeks to "regularize" the latent space by encoding/compressing the input data into distributions that live in it, instead of single points. This way, the latent space is continuous and the interpolation between two samples is meaningful and more likely to follow the data distribution.

The VAE does this by adding a "regularization loss" to the AE objective, which is the Kullback-Leibler divergence between the latent space distribution and a prior distribution (usually a normal distribution). If you are curious, you can find a nice explanation of the differences between AEs and VAEs `here <https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73>`_.
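
When the prior is a standard normal distribution and the Encoder outputs a mean and a log variance per latent dimension, this KL term has a simple closed form. Below is a small sketch of how it can be computed (the function and tensor names are illustrative, not the code in my repository):

.. code-block:: python

    import tensorflow as tf

    def kl_loss(z_mean, z_log_var):
        # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over the latent
        # dimensions and averaged over the batch
        kl = -0.5 * (1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
        return tf.reduce_mean(tf.reduce_sum(kl, axis=1))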

.. image:: /_static/images/inigo_variational_autoencoder.png
:alt: Variational AutoEncoder diagram.
:width: 600


Thus, I started by implementing the VAE in Keras, and I must say that it was not as easy as I thought it would be. I used `this VAE example from Keras <https://keras.io/examples/generative/vae/>`_, adding a ``ReparametrizationTrickSampling`` layer between the ``Encoder`` and the ``Decoder``.
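
For reference, this is roughly what such a layer does, along the lines of the ``Sampling`` layer in the Keras example (a sketch, not the exact code of my ``ReparametrizationTrickSampling`` layer):

.. code-block:: python

    import tensorflow as tf
    from tensorflow.keras import layers

    class Sampling(layers.Layer):
        """Reparametrization trick: z = mean + std * epsilon, epsilon ~ N(0, I)."""

        def call(self, inputs):
            z_mean, z_log_var = inputs
            epsilon = tf.random.normal(shape=tf.shape(z_mean))
            # The Encoder outputs the log variance, so exp(0.5 * log_var) is the std
            return z_mean + tf.exp(0.5 * z_log_var) * epsilon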

The main problem was that the Keras implementation was not behaving as expected: it was giving me model initialization problems. So I encapsulated the ``Encoder`` and the ``Decoder`` parts into individual ``keras.Model`` class instances and then put them together under a wrapping Model. Doing this was a bit painful, but it worked.
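
Roughly, the wrapping structure I ended up with looks like the following (a simplified sketch with dense layers and hypothetical sizes, reusing the ``Sampling`` layer sketched above; not the exact code in my repository):

.. code-block:: python

    from tensorflow import keras
    from tensorflow.keras import layers

    input_dim, latent_dim = 256, 32  # hypothetical sizes

    # Encoder as its own keras.Model, outputting the latent mean and log variance
    enc_in = keras.Input(shape=(input_dim,))
    h = layers.Dense(64, activation="relu")(enc_in)
    z_mean = layers.Dense(latent_dim, name="z_mean")(h)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(h)
    encoder = keras.Model(enc_in, [z_mean, z_log_var], name="Encoder")

    # Decoder as its own keras.Model
    dec_in = keras.Input(shape=(latent_dim,))
    h = layers.Dense(64, activation="relu")(dec_in)
    dec_out = layers.Dense(input_dim)(h)
    decoder = keras.Model(dec_in, dec_out, name="Decoder")

    # Wrapping Model: Encoder -> sampling -> Decoder
    vae_in = keras.Input(shape=(input_dim,))
    z_mean, z_log_var = encoder(vae_in)
    z = Sampling()([z_mean, z_log_var])  # the sampling layer sketched above
    vae_out = decoder(z)
    vae = keras.Model(vae_in, vae_out, name="VAE")
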
The next problem was that I was not able to train the model, because the loss was constantly ``nan``.

What is coming up next week
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Next week I must fix the ``nan`` problem that appears during training; I am not yet sure what is causing it. I am using exponential operations in both the KL loss computation and the ``ReparametrizationTrickSampling`` layer, which could return excessively large values if the exponent gets too big, leading to exploding gradients. I will explore this.
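
The suspicion is easy to reproduce in isolation: once the exponent is large enough, the exponential overflows to ``inf``, and any ``inf`` that enters the loss quickly turns into ``nan`` (for example through ``inf - inf``):

.. code-block:: python

    import numpy as np

    print(np.exp(10.0))                     # ~22026.5, fine
    print(np.exp(1000.0))                   # overflows to inf (RuntimeWarning)
    print(np.exp(1000.0) - np.exp(1000.0))  # inf - inf -> nan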

Did I get stuck anywhere
~~~~~~~~~~~~~~~~~~~~~~~~
Week 7 into GSoC 2024: Starting to see the light at the end of the VAE
======================================================================


.. post:: July 12 2024
:author: Iñigo Tellaetxe
:tags: google
:category: gsoc



What I did this week
~~~~~~~~~~~~~~~~~~~~

Finally, I figured out how to solve the ``nan`` value problem in the VAE training. As I suspected, the values reaching the ``ReparametrizationTrickSampling`` layer were too big for the exponential operations. I use the exponential operation because I treat the Encoder output as the log variance of the latent space distribution, and for sampling we need the standard deviation. We use the log variance instead of the standard deviation to avoid computing logarithms.

The solution consisted of two measures: first, clipping the values of the exponential operations to the ``[-1e10, 1e10]`` range; second, adding batch normalization layers after each convolution in the Encoder, which enforces a 0 mean and unit variance and prevents large values at the output. I did not use batch normalization in the fully connected layers that output the mean and the log variance of the latent distribution, because I did not want to constrain the latent space too much by forcing those distributions to have zero mean and unit variance. These layers should capture other characteristics of the data distribution.
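
In code, the two measures look roughly like this (a sketch: the exact placement of the clipping in my code may differ, and ``Conv1D`` with these parameters simply stands in for the convolutions of my Encoder):

.. code-block:: python

    import tensorflow as tf
    from tensorflow.keras import layers

    def clipped_exp(x):
        # Measure 1: clip the result of the exponential operation to the
        # [-1e10, 1e10] range so it can never reach inf
        return tf.clip_by_value(tf.exp(x), -1e10, 1e10)

    def conv_block(x, filters):
        # Measure 2: batch normalization after each convolution in the Encoder,
        # keeping the activations close to mean 0 and variance 1
        x = layers.Conv1D(filters, kernel_size=3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        return layers.ReLU()(x)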

You can see a preliminary result of training the Variational AutoEncoder below, with the exact same training parameters as the AutoEncoder, but for only 50 epochs, as a proof of concept.

.. image:: /_static/images/inigo_preliminary_vae_result_fibercup.png
:alt: Preliminary reconstruction result after training the VAE for 50 epochs with the FiberCup dataset.
:width: 600


I should mention that the variance of the 0-mean distribution from which ``epsilon`` is sampled in the ``ReparametrizationTrickSampling`` layer can be adjusted as a hyperparameter. I set it to 1 for now; I could explore this hyperparameter in the future, but first I need to learn about the theoretical implications of modifying it.
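
A sketch of how that hyperparameter could enter the layer (``epsilon_std`` is a hypothetical argument name; this is not the exact code in my repository):

.. code-block:: python

    import tensorflow as tf
    from tensorflow.keras import layers

    class ReparametrizationTrickSampling(layers.Layer):
        def __init__(self, epsilon_std=1.0, **kwargs):
            super().__init__(**kwargs)
            # Standard deviation of the 0-mean noise distribution (1.0 for now)
            self.epsilon_std = epsilon_std

        def call(self, inputs):
            z_mean, z_log_var = inputs
            epsilon = tf.random.normal(shape=tf.shape(z_mean),
                                       stddev=self.epsilon_std)
            return z_mean + tf.exp(0.5 * z_log_var) * epsilon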

In addition, I started a discussion in a `PR <https://github.com/itellaetxe/tractoencoder_gsoc/pull/1>`_ that includes all these changes, to make it easier for my mentors and my GSoC colleagues to review my code. My mentors suggested this to make my progress more public and open to feedback.


What is coming up next week
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Next week I will train the VAE for longer (~120 epochs) and I will also explore...
Did I get stuck anywhere
~~~~~~~~~~~~~~~~~~~~~~~~

I got a bit stuck thinking about how to solve the exploding loss problem, and although the described advances may seem small, they required a lot of thought and debugging. Thankfully, my lab colleague `Jorge <https://github.com/JGarciaCondado>`_ helped me with his machine learning expertise and gave me the idea of batch normalization.

Until next week!
4 changes: 1 addition & 3 deletions dipy.org/pull/47/blog.html
4 changes: 1 addition & 3 deletions dipy.org/pull/47/blog/2024.html
4 changes: 1 addition & 3 deletions dipy.org/pull/47/blog/author/inigo-tellaetxe.html
4 changes: 1 addition & 3 deletions dipy.org/pull/47/blog/category/gsoc.html
4 changes: 1 addition & 3 deletions dipy.org/pull/47/blog/tag/google.html
