insert results / conclusion sections
olivier-bernard-creatis committed Jan 9, 2024
1 parent f13d6ef commit 876d5e5
Showing 12 changed files with 88 additions and 21 deletions.
109 changes: 88 additions & 21 deletions collections/_posts/2023-12-19-latent-diffusion-models.md
The figure below shows the rate-distortion trade-off of a trained model.
* A perceptual compression model based on previous work [1] is used to efficiently encode images
* It consists of an auto-encoder trained with a combination of a perceptual loss and a patch-based adversarial objective
* The overall objective of the cited paper goes beyond computing an efficient latent space (it targets high-resolution image synthesis based on transformers), but as far as I understand, the pre-trained encoder/decoder parts are available and directly reused in the latent DM formalism. This paper should be the subject of a future post!
* Two different kinds of regularization are tested to avoid high-variance latent spaces: *KL-reg*, which imposes a slight KL penalty towards a standard normal on the learned latent, and *VQ-reg*, which uses a vector quantization layer [2] within the decoder (a minimal sketch of both is given after the figure below)

<div style="text-align:center">
<img src="/collections/images/latent-DM/perceptual-image-compression.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 2. Illustration of the perceptual compression model detailed in [1] and used to compute the encoder/decoder module.</p>
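To make these two regularizations more concrete, here is a minimal sketch (purely illustrative, not the authors' implementation) of a KL penalty on the encoder output and of a nearest-neighbour vector-quantization step; the names and shapes (`z_mean`, `z_logvar`, `codebook`) are assumptions.

```python
import torch

# KL-reg: the encoder predicts a mean and log-variance for the latent,
# and a small penalty pulls the latent distribution towards N(0, I).
def kl_regularization(z_mean, z_logvar, weight=1e-6):
    kl = -0.5 * torch.sum(1 + z_logvar - z_mean.pow(2) - z_logvar.exp(), dim=[1, 2, 3])
    return weight * kl.mean()

# VQ-reg: the latent is snapped to the nearest entry of a learned codebook.
def vector_quantize(z, codebook):
    # z: (B, C, H, W), codebook: (K, C)
    b, c, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, c)          # (B*H*W, C)
    idx = torch.cdist(flat, codebook).argmin(dim=1)      # nearest code index
    z_q = codebook[idx].reshape(b, h, w, c).permute(0, 3, 1, 2)
    # straight-through estimator so gradients still reach the encoder
    return z + (z_q - z).detach()
```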

&nbsp;

$$\mathcal{L}_{LDM} := \mathbb{E}_{z \sim E(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_{\theta}(z_t, t, \tau_{\theta}(y)) \right\|^2_2 \right]$$
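
As a rough, hypothetical sketch of one training step consistent with this objective (the `encoder`, `unet`, `tau_theta` and `alphas_cumprod` arguments are assumed placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

def ldm_training_step(x, y, encoder, unet, tau_theta, alphas_cumprod, num_timesteps=1000):
    """One illustrative training step for a conditional latent diffusion model."""
    with torch.no_grad():
        z = encoder(x)                                   # latent representation z = E(x), frozen encoder
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)                            # target noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)          # cumulative noise schedule at step t
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps  # noisy latent
    cond = tau_theta(y)                                  # conditioning embedding
    eps_pred = unet(z_t, t, cond)                        # predict the added noise
    return F.mse_loss(eps_pred, eps)                     # L_LDM
```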

# Results

* Six different kinds of image generation tasks are evaluated: text-to-image, layout-to-image, class-label-to-image, super-resolution, inpainting, and semantic-map-to-image
* Latent space with 2 different regularization strategies: *KL-reg* and *VQ-reg*
* Latent space with different degrees of downsampling
* LDM-KL-8 means a latent diffusion model with KL-reg and a downsampling factor of 8 to generate the latent space
* DDIM is used during inference (with different numbers of sampling steps) as an efficient deterministic sampling procedure (a minimal sketch of the DDIM update is given after this list)
* FID (Fréchet Inception Distance): captures the similarity of generated images to real ones better than the more conventional Inception Score
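
The sketch below shows a generic deterministic DDIM update (η = 0) in latent space; it is not code from the paper, and the `eps_model` and `alphas_cumprod` arguments are assumptions.

```python
import torch

@torch.no_grad()
def ddim_step(z_t, t, t_prev, eps_model, alphas_cumprod, cond=None):
    """Deterministic DDIM update (eta = 0) from step t to an earlier step t_prev."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    eps = eps_model(z_t, t, cond)                              # predicted noise
    z0_pred = (z_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
    return a_prev.sqrt() * z0_pred + (1 - a_prev).sqrt() * eps # jump to step t_prev
```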

&nbsp;

## Perceptual compression tradeoffs

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-perceptual-compression.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 3. Analyzing the training of class-conditional LDMs with
different downsampling factors f over 2M train steps on the ImageNet dataset.</p>

* LDM-1 corresponds to a DM working directly in pixel space, without any latent representation.
* LDM-4, LDM-8 and LDM-16 appear to be the most efficient
* LDM-32 shows limitations due to an overly aggressive downsampling (see the illustration of latent sizes below)
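
To give an idea of what these downsampling factors mean in practice, the small helper below (purely illustrative) computes the latent grid for a 256×256 input:

```python
def latent_resolution(image_size=256, f=8):
    """Spatial resolution of the latent for a given downsampling factor f (illustrative)."""
    return (image_size // f, image_size // f)

for f in (1, 4, 8, 16, 32):
    print(f"LDM-{f}: latent grid of {latent_resolution(f=f)}")
# LDM-1  -> (256, 256): no spatial compression (pixel-space DM)
# LDM-32 -> (8, 8):     very aggressive compression, fine details are lost
```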

&nbsp;

## Hyperparameters overview


<div style="text-align:center">
<img src="/collections/images/latent-DM/results-hyperparameters-unconditioned-cases.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Table 1. Hyperparameters for the unconditional LDMs producing the numbers shown in Tab. 3. All models trained on a single NVIDIA A100.</p>

&nbsp;

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-hyperparameters-conditioned-cases.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Table 2. Hyperparameters for the conditional LDMs trained on the ImageNet dataset. All models trained on a single NVIDIA A100.</p>

&nbsp;

## Unconditional image synthesis

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-image-generation-uncondition.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Table 3. Evaluation metrics for unconditional image synthesis. N-s refers to N sampling steps with the DDIM sampler. ∗: trained in KL-regularized latent space</p>

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-image-generation-uncondition-CelebA-HQ.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 4. Random samples of the best performing model LDM-4 on the CelebA-HQ dataset. Sampled with 500 DDIM steps (FID = 5.15)</p>

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-image-generation-uncondition-bedrooms.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 5. Random samples of the best performing model LDM-4 on the LSUN-Bedrooms dataset. Sampled with 200 DDIM steps (FID = 2.95)</p>

&nbsp;

## Class-conditional image synthesis

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-image-generation-condition-ImageNet.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Table 4. Comparison of a class-conditional ImageNet LDM with
recent state-of-the-art methods for class-conditional image generation on ImageNet. c.f.g. denotes classifier-free guidance with a scale s</p>

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-class-conditional-image-synthesis.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 6. Random samples from LDM-4 trained on the ImageNet dataset. Sampled with classifier-free guidance scale s = 5.0 and 200 DDIM steps</p>
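
Classifier-free guidance, used here with a scale s = 5.0, combines a conditional and an unconditional noise prediction at each sampling step. The snippet below is a generic sketch of that combination; the `eps_model` and `null_cond` names are assumptions, not the paper's code.

```python
import torch

@torch.no_grad()
def guided_eps(z_t, t, cond, null_cond, eps_model, s=5.0):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    eps_cond = eps_model(z_t, t, cond)         # noise prediction with the class / prompt
    eps_uncond = eps_model(z_t, t, null_cond)  # noise prediction with an empty conditioning
    return eps_uncond + s * (eps_cond - eps_uncond)
```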

&nbsp;

## Text-conditional image synthesis

* An LDM with 1.45B parameters, trained in a KL-regularized latent space and conditioned on language prompts, is learned on LAION-400M
* The BERT tokenizer is used to encode the text prompts
* $$\tau_{\theta}$$ is implemented as a transformer that infers a latent code which is mapped into the UNet via (multi-head) cross-attention (a minimal sketch of this mechanism is given after this list)
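
The cross-attention that injects $$\tau_{\theta}(y)$$ into the UNet can be sketched as follows: queries come from the (flattened) UNet feature map, keys and values from the conditioning tokens. This is a generic single-head sketch with assumed tensor shapes, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross-attention: UNet features attend to conditioning tokens."""
    def __init__(self, feat_dim, cond_dim, inner_dim=256):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(cond_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(cond_dim, inner_dim, bias=False)
        self.proj = nn.Linear(inner_dim, feat_dim)

    def forward(self, feats, cond):
        # feats: (B, N, feat_dim) flattened UNet feature map
        # cond:  (B, M, cond_dim) tokens produced by tau_theta(y)
        q, k, v = self.to_q(feats), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, N, M)
        return self.proj(attn @ v)
```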

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-text-conditional-image-synthesis.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Table 5. Evaluation of text-conditional image synthesis on the 256×256-sized MS-COCO dataset, with 250 DDIM steps.</p>

<div style="text-align:center">
<img src="/collections/images/latent-DM/results-text-conditional-image-synthesis-2.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 7. Illustration of the text-conditional image synthesis. Sampled with 250 DDIM steps</p>

&nbsp;

## Semantic-map-to-image synthesis

* Images of landscapes paired with semantic maps are used as training data
* Downsampled versions of the semantic maps are simply concatenated with the latent image representation of an LDM-4 model with VQ-reg (see the sketch after this list)
* No cross-attention scheme is used here
* The model is trained on an input resolution of 256×256, but the authors find that it generalizes to larger resolutions and can generate images up to the megapixel regime
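
Since no cross-attention is used for this task, the conditioning reduces to a channel-wise concatenation: the semantic map is downsampled to the latent resolution and stacked with the noisy latent before entering the UNet. The snippet below is an illustrative sketch (assuming a float/one-hot encoded map), not the authors' code.

```python
import torch
import torch.nn.functional as F

def concat_conditioning(z_t, semantic_map):
    """Downsample the (one-hot, float) semantic map to the latent grid and concatenate along channels."""
    sem_small = F.interpolate(semantic_map, size=z_t.shape[-2:], mode="nearest")
    return torch.cat([z_t, sem_small], dim=1)  # UNet input: noisy latent + conditioning channels
```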


<div style="text-align:center">
<img src="/collections/images/latent-DM/results-semantic-synthesis.jpg" width=400></div>
<p style="text-align: center;font-style:italic">Figure 8. When provided with a semantic map as conditioning, LDMs generalize to substantially larger resolutions than those seen during training. Although this model was trained on inputs of size 256×256, it can be used to create high-resolution samples such as the ones shown here, which are of resolution 1024×384.</p>



&nbsp;

# Conclusions

* Latent diffusion models allow synthesizing high-quality images with efficient computation times.
* The key lies in the use of an efficient latent representation of images which is perceptually equivalent to the pixel space but has a reduced computational complexity

&nbsp;

# References
\[1\] P. Esser, R. Rombach, B. Ommer, *Taming transformers for high-resolution image synthesis*, CoRR 2022, [\[link to paper\]](https://arxiv.org/pdf/2012.09841.pdf)

\[2\] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, *Neural discrete representation learning*, In NIPS, 2017 [\[link to paper\]](https://arxiv.org/pdf/1711.00937.pdf)


