"Understanding Diffusion Models: A Unified Perspective" Arxiv, 2022 Aug 25 paper code pdf note Authors: Calvin Luo
This document organizes the fundamentals of diffusion models for later review! :construction_worker:
- Task: survey
- Problems
- 🏷️ Label:
- "Diffusion Models: A Comprehensive Survey of Methods and Applications" ACM-Computing-Surveys, 2023 paper
Plato’s Allegory of the Cave illustrates the idea behind latent variables: potentially unobservable representations that determine our observations. Analogously, the objects that we encounter in the actual world may also be generated as a function of some higher-level representations.
The goal is to learn such higher-level representations, and to maximize the likelihood of the data distribution.
Motivation: the goal of generative models is to model the true data distribution $p(x)$, so we wish to maximize the likelihood $p(x)$. ⭐
We learn a latent representation $z$ to describe $x$; therefore:
- marginalize out the latent variable, relating $p(x)$ to the joint $p(x, z)$
- or rewrite it with the chain rule of probability
In this case, we marginalize out the latent variable z and appeal to the chain rule of probability.
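Written out, the two ways of relating the evidence $p(x)$ to the latent variable $z$ are:
$$
p(x) = \int p(x, z)\, dz, \qquad p(x) = \frac{p(x, z)}{p(z \mid x)}
$$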
- Q: What does it mean to "marginalize out" a variable? It means integrating (or summing) the joint distribution over that variable, e.g. $p(x) = \int p(x, z)\, dz$.
This leads to the ELBO.
- Q: Directly maximizing $p(x) = \int p(x, z)\, dz$ requires evaluating an intractable integral; what do we do?
We therefore rearrange Equation (3) to define the ELBO, which serves as the optimization objective for the model.
However, using these two equations (3), we can derive a term called the Evidence Lower Bound (ELBO), which as its name suggests, is a lower bound of the evidence. The evidence is quantified in this case as the log likelihood of the observed data
Here, the evidence is quantified as $\log{p(x)}$. ⭐
For now, let us try to dive deeper into why the ELBO is an objective we would like to maximize.
- Q: Why is the ELBO the objective we optimize?
From this derivation, we clearly observe from Equation 15 that the evidence is equal to the ELBO plus the KL Divergence between the approximate posterior qφ(z|x) and the true posterior p(z|x).
This connects the ELBO to the evidence of the original data distribution; the VAE setting is used as the running example here.
- Q: Why is it called a lower bound?
Because the gap is exactly the KL divergence between the approximate and the true posterior:
Firstly, we now know why the ELBO is indeed a lower bound: the difference between the evidence and the ELBO is a strictly non-negative KL term, thus the value of the ELBO can never exceed the evidence.
- Q: Another way to see why we optimize the ELBO
The goal is to train a model whose approximate posterior matches the true posterior:
Secondly, we explore why we seek to maximize the ELBO. Having introduced latent variables z that we would like to model, our goal is to learn this underlying latent structure that describes our observed data. In other words, we want to optimize the parameters of our variational posterior qφ(z|x) to exactly match the true posterior distribution p(z|x), which is achieved by minimizing their KL Divergence (ideally to zero).
But this cannot be done directly, because:
Unfortunately, it is intractable to minimize this KL Divergence term directly, as we do not have access to the ground truth p(z|x) distribution.
The ELBO therefore serves as a proxy objective for optimizing the model: since the evidence $\log{p(x)}$ is a constant with respect to $\phi$, maximizing the ELBO $\mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right]$ is equivalent to minimizing the KL divergence. ⭐
$$
\log{p(x)} = \mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right] + D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z|x)\big)
$$
the more we optimize the ELBO, the closer our approximate posterior gets to the true posterior. Additionally, once trained, the ELBO can be used to estimate the likelihood of observed or generated data as well, since it is learned to approximate the model evidence log p(x).
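A minimal numerical sanity check of this identity, using a toy discrete latent so every term can be computed exactly (the probabilities below are made-up values):

```python
import numpy as np

# Toy joint p(x, z) for a single observed x and a discrete latent z in {0, 1, 2}.
p_xz = np.array([0.10, 0.05, 0.15])      # p(x, z) for each z (made-up values)
p_x = p_xz.sum()                          # evidence p(x) = sum_z p(x, z)
p_z_given_x = p_xz / p_x                  # true posterior p(z | x)

q = np.array([0.5, 0.2, 0.3])             # an arbitrary variational posterior q(z | x)

elbo = np.sum(q * np.log(p_xz / q))       # E_q[log p(x, z) / q(z | x)]
kl = np.sum(q * np.log(q / p_z_given_x))  # D_KL(q(z | x) || p(z | x))

print(np.log(p_x), elbo + kl)             # equal up to floating point: log p(x) = ELBO + KL
print(elbo <= np.log(p_x))                # True: the ELBO is a lower bound on the evidence
```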
In the default formulation of the Variational Autoencoder (VAE) [1], we directly maximize the ELBO
- Q: Why is it called "variational"?
Because we optimize over a whole family of candidate distributions:
This approach is variational, because we optimize for the best qφ(z|x) amongst a family of potential posterior distributions parameterized by φ.
- Q: Why is it called an "autoencoder"?
It is called an autoencoder because it is reminiscent of a traditional autoencoder model, where input data is trained to predict itself after undergoing an intermediate bottlenecking representation step.
Decomposing the ELBO reveals more structure:
$$
\log{p(x)} = \mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right] + D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z|x)\big)
$$
Maximizing the ELBO is equivalent to optimizing two terms: the VAE encoder term (minimize the gap between the latent distribution and the prior) and the VAE decoder term (maximize the quality of reconstruction from $z$).
![eq17.png](docs/2022_08_Arxiv_Understanding Diffusion Models-A Unified Perspective_Note/eq17.png)
$$
\mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right] = \underbrace{\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right]}_\text{reconstruction term (Decoder)} - \underbrace{D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z)\big)}_\text{prior matching term (Encoder)}
$$
The two terms in Equation 19 each have intuitive descriptions: the first term measures the reconstruction likelihood of the decoder from our variational distribution; this ensures that the learned distribution is modeling effective latents that the original data can be regenerated from.
This prevents the latent distribution learned by the encoder from collapsing into a Dirac delta function ⭐
The second term measures how similar the learned variational distribution is to a prior belief held over latent variables. Minimizing this term encourages the encoder to actually learn a distribution rather than collapse into a Dirac delta function
- Q: What is the Dirac delta function?
A distribution with all of its mass at a single point and zero everywhere else.
https://www.sciencedirect.com/topics/engineering/dirac-delta-function
- Q: How do we use the ELBO to optimize the VAE encoder $\phi$ and decoder $\theta$?
The VAE encoder's distribution over the latent $z$ is usually chosen to be a multivariate Gaussian:
The encoder of the VAE is commonly chosen to model a multivariate Gaussian with diagonal covariance, and the prior is often selected to be a standard multivariate Gaussian
Therefore the KL term in the ELBO can be computed analytically, while the decoder's reconstruction term is approximated by Monte Carlo sampling.
Then, the KL divergence term of the ELBO can be computed analytically, and the reconstruction term can be approximated using a Monte Carlo estimate. Our objective can then be rewritten as:
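For reference (writing out the standard estimate rather than quoting the paper's exact equation), with $L$ Monte Carlo samples $z^{(l)} \sim q_{\phi}(z \mid x)$ the objective becomes
$$
\arg\max_{\phi,\theta}\; \frac{1}{L}\sum_{l=1}^{L} \log p_{\theta}\big(x \mid z^{(l)}\big) \;-\; D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z)\big),
$$
and the KL between the diagonal-Gaussian encoder and the standard-Gaussian prior has the well-known closed form
$$
D_{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{i} \left(\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2\right).
$$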
- Q: The $z$ in the decoder's reconstruction term requires random sampling, which is not differentiable. What do we do?
Use the reparameterization trick: the randomness is expressed through a noise term, while the deterministic terms remain explicit in the formula.
Fortunately, this can be addressed via the reparameterization trick when qφ(z|x) is designed to model certain distributions, including the multivariate Gaussian
The reparameterization trick rewrites a random variable as a deterministic function of a noise variable; this allows for the optimization of the non-stochastic terms through gradient descent.
From another angle, the reparameterization trick relates an arbitrary Gaussian to the standard Gaussian, which has more convenient properties.
In other words, arbitrary Gaussian distributions can be interpreted as standard Gaussians (of which $\epsilon$ is a sample) that have their mean shifted from zero to the target mean $\mu$ by addition, and their variance stretched by the target variance $\sigma^2$. Therefore, by the reparameterization trick, sampling from an arbitrary Gaussian distribution can be performed by sampling from a standard Gaussian, scaling the result by the target standard deviation, and shifting it by the target mean.
Back to the VAE's latent extraction: the reparameterization formula implements the Monte Carlo sampling, the randomness is pushed into the noise term, and the objective can now be optimized with gradients.
isotropic Gaussian distribution
isotropic (having uniform physical properties in all directions): an isotropic distribution looks the same in every direction, which is the intuitive reading of "isotropy".
isotropic random vector
This is a random vector whose variance is 1 in every direction and whose covariance across different directions is 0. The reparameterization layer of a VAE produces exactly such an isotropic random vector.
$$
\text{Definition: a random vector } X \text{ in } \mathbb{R}^n \text{ is called isotropic if } \operatorname{Cov}(X) = E(XX^T) = I_n
$$
- To check whether $X$ is isotropic: $X$ is isotropic iff $E\langle X, x\rangle^2 = \|x\|_2^2$ for every fixed vector $x$, i.e. the expected squared inner product equals that vector's squared L2 norm.
- In high dimensions, independent isotropic random vectors are almost orthogonal (their normalized inner product is close to 0).
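A quick numerical illustration of both properties (a standalone sketch, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_samples = 500, 20000

# Standard Gaussian vectors are isotropic: E[X X^T] = I_n.
X = rng.standard_normal((num_samples, n))
cov = X.T @ X / num_samples
print(np.abs(cov - np.eye(n)).max())     # small compared to the 1s on the identity's diagonal

# Two independent high-dimensional isotropic vectors are nearly orthogonal.
u, v = rng.standard_normal(n), rng.standard_normal(n)
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos)                                # close to 0, on the order of 1/sqrt(n)
```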
Sampling is a stochastic process and therefore we cannot backpropagate the gradient. To make it trainable, the reparameterization trick is introduced.
$$
z \sim q_{\phi}(z \mid x) = \mathcal{N}\big(z;\, \mu^{(i)},\, \sigma^{2(i)} I\big) \\
z = \mu + \sigma \odot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I)
$$
In the multivariate Gaussian case, we make the model trainable by learning the mean and variance of the distribution,
The random sampling is thus rewritten in the form $z = \mu + \sigma \odot \epsilon$, and $\mu, \sigma$ can be learned.
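A minimal PyTorch-style sketch of this reparameterized sampling (the function and tensor names are illustrative, not from any particular codebase):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, sigma^2 I) as a differentiable function of (mu, log_var)."""
    sigma = torch.exp(0.5 * log_var)   # parameterize the std via log-variance for stability
    eps = torch.randn_like(sigma)      # all randomness lives in eps ~ N(0, I)
    return mu + sigma * eps            # gradients flow through mu and sigma

# Usage: mu, log_var would come from the encoder; z feeds the decoder.
mu = torch.zeros(4, 8, requires_grad=True)
log_var = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                     # gradients reach mu and log_var
print(mu.grad.shape, log_var.grad.shape)
```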
The latents $z$ are higher-level representations of $x$; here we want a richer representation than the single-latent VAE (not just one $z$, but $z_1, z_2, \ldots$ to represent $x$).
A Hierarchical Variational Autoencoder (HVAE) [2, 3] is a generalization of a VAE that extends to multiple hierarchies over latent variables
In a general HVAE each latent may condition on all previous latents, but here we only study the special case of the Markovian HVAE.
Whereas in the general HVAE with T hierarchical levels, each latent is allowed to condition on all previous latents, in this work we focus on a special case which we call a Markovian HVAE (MHVAE)
Diagram of the MHVAE; a diffusion model can be viewed as an MHVAE with some extra restrictions ⭐
In a MHVAE, the generative process is a Markov chain; that is, each transition down the hierarchy is Markovian, where decoding each latent $z_t$ only conditions on the previous latent $z_{t+1}$.
The probability factorizations of the MHVAE encoder and decoder (they will be used in the ELBO later!):
Mathematically, we represent the joint distribution and the posterior of a Markovian HVAE as
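Concretely, the joint and the approximate posterior of the MHVAE factorize as (cf. the paper):
$$
p(x, z_{1:T}) = p(z_T)\, p_{\theta}(x \mid z_1) \prod_{t=2}^{T} p_{\theta}(z_{t-1} \mid z_t)
$$
$$
q_{\phi}(z_{1:T} \mid x) = q_{\phi}(z_1 \mid x) \prod_{t=2}^{T} q_{\phi}(z_t \mid z_{t-1})
$$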
Rewriting the ELBO for the MHVAE:
$$
\mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right] \;\to\; \mathbb{E}_{q_{\phi}(z_{1:T}|x)}\left[\log\frac{p(x, z_{1:T})}{q_{\phi}(z_{1:T}|x)}\right]
$$
- In the single-step VAE ELBO above, the numerator is the joint distribution and the denominator is the encoder distribution (the approximate posterior)
- Extended ELBO
- Why the ELBO is introduced: we want to maximize $p(x)$, but that involves an intractable integral, so the ELBO (with variational distribution $q_{\phi}(z|x)$) is used as a surrogate objective:
  $$
  \log{p(x)} = \mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right] + D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z|x)\big) \geq \mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right]
  $$
- Why is the ELBO a lower bound? The ELBO differs from the evidence $\log{p(x)}$ by exactly the KL divergence $D_{KL}\big(q_{\phi}(z|x) \,\|\, p(z|x)\big)$. ⭐ Since this KL is non-negative, the ELBO can never exceed $\log p(x)$, hence "lower bound". Maximizing the ELBO is equivalent to shrinking this KL, i.e. the gap between the predicted distribution of $z$ and its true posterior.
- Decomposing the ELBO gives two terms: the VAE encoder term (minimize the gap between the latent distribution and the prior) and the VAE decoder term (maximize the quality of reconstruction from $z$).
  ![eq17.png](docs/2022_08_Arxiv_Understanding Diffusion Models-A Unified Perspective_Note/eq17.png)
- The reparameterization trick expresses the randomness with a noise term, while deterministic parameters such as $\mu$ stay in the expression and can be optimized by backpropagation; this fixes the problem that the random sampling in the ELBO's decoder term is not differentiable.
- The ELBO rewritten for the Markovian HVAE:
  $$
  \mathbb{E}_{q_{\phi}(z|x)}\left[\log\frac{p(x, z)}{q_{\phi}(z|x)}\right] \;\to\; \mathbb{E}_{q_{\phi}(z_{1:T}|x)}\left[\log\frac{p(x, z_{1:T})}{q_{\phi}(z_{1:T}|x)}\right]
  $$
- "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" ICML, 2015 Mar 12 paper code pdf note Authors: Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
Drawing on non-equilibrium thermodynamics, this paper proposes the diffusion forward process for modeling the data distribution, with image experiments on CIFAR 🌞
- "Denoising Diffusion Probabilistic Models" NIPS, 2020 Jun 19, DDPM paper code pdf note Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
- "Denoising Diffusion Implicit Models" ICLR, 2020 Oct 6, DDIM paper code pdf note Authors: Jiaming Song, Chenlin Meng, Stefano Ermon
- "Variational Diffusion Models" NIPS, 2021 Jul 1, VDM paper code
Score-based
A VDM can be viewed as an MHVAE with three restrictions:
- The latent dimension is exactly equal to the data dimension
- The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model. In other words, it is a Gaussian distribution centered around the output of the previous timestep
- The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep T is a standard Gaussian
- Q: How does each timestep differ from an HVAE? In a VDM, the encoder at each timestep is not a separately learned model but a linear Gaussian model.
In DDPM the mean and variance are pre-set hyperparameters (the noise schedule); in VDM these two parameters can be learned.
Unlike a Markovian HVAE, the structure of the encoder at each timestep t is not learned; it is fixed as a linear Gaussian model, where the mean and standard deviation can be set beforehand as hyperparameters [5], or learned as parameters [6].
So a single encoder (noising) step can be written as $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, (1-\alpha_t) I\big)$.
- Q: What does restriction 3 mean, that $x_T$ is standard Gaussian noise?
Taken as a whole, Gaussian noise is added to the image step by step, so that $x_T$ eventually becomes pure Gaussian noise ⭐
Collectively, what this set of assumptions describes is a steady noisification of an image input over time; we progressively corrupt an image by adding Gaussian noise until eventually it becomes completely identical to pure Gaussian noise. Visually, this process is depicted in Figure 3.
The closed form for sampling $x_t$ directly from $x_0$ (the mean and variance come from composing all the Gaussian noising steps), where $\epsilon \sim \mathcal{N}(0, I)$: ⭐
$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\; \epsilon, \qquad \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i
$$
More generally, the same composition relates $x_t$ to any earlier step $x_s$:
$$
x_t = \sqrt{\textstyle\prod_{i=s+1}^{t} \alpha_i}\; x_s + \sqrt{1 - \textstyle\prod_{i=s+1}^{t} \alpha_i}\; \epsilon, \qquad 0 \le s < t
$$
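A minimal sketch of this closed-form forward sampling, assuming a linear $\beta$ schedule (schedule values and tensor shapes are illustrative, not taken from the paper):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule (illustrative values)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{\alpha}_t = prod_i alpha_i

def q_sample(x0: torch.Tensor, t: int):
    """Sample x_t ~ q(x_t | x_0) in one shot via the closed form above."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps                          # eps is what the network will be trained to predict

x0 = torch.randn(4, 3, 32, 32)               # stand-in for images scaled to [-1, 1]
x_t, eps = q_sample(x0, t=500)
```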
Reverse denoising: we want to reverse the above process and sample from $q(x_{t-1} \mid x_t)$, so that we can recover the original image from Gaussian noise.
Thus we need to learn a neural network to approximate the conditional probability distributions in the reverse diffusion process.
- We would like to train the model to predict $\epsilon_{t}$ from the input $x_t$ at timestep $t$:
  $$
  \mu_{\theta}(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_{\theta}(x_t, t)\right)
  $$
  The target is obtained by inverting the forward diffusion formula:
  $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_t\right)$
  Training then uses a (simplified) MSE loss between the true noise $\epsilon_t$ and the predicted noise $\epsilon_{\theta}(x_t, t)$.
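A hedged sketch of one reverse (denoising) step assembled from these quantities, reusing the `betas`/`alphas`/`alpha_bars` schedule from the earlier sketch; `eps_model` is a placeholder for any trained noise-prediction network, and the variance choice $\sigma_t^2 = \beta_t$ is one common DDPM option (an assumption here, not taken from this note):

```python
import torch

@torch.no_grad()
def p_sample_step(eps_model, x_t: torch.Tensor, t: int,
                  betas: torch.Tensor, alphas: torch.Tensor, alpha_bars: torch.Tensor):
    """One reverse step x_t -> x_{t-1} using the predicted noise epsilon_theta(x_t, t)."""
    eps_hat = eps_model(x_t, t)                                     # predicted noise
    mean = (x_t - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    if t == 0:
        return mean                                                 # no noise added at the final step
    z = torch.randn_like(x_t)
    return mean + betas[t].sqrt() * z                               # variance choice sigma_t^2 = beta_t
```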
- Summary
  - Forward diffusion process: Gaussian noise is added to the original image $x_0$ step by step (over timesteps $t$), a small amount at a time (controlled by $\beta_t$), until the image gradually loses its structure and becomes a noise map.
    $$
    q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1- \beta_t}\, x_{t-1},\, \beta_t I\big) \\
    x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}, \quad \text{where } \epsilon_{t-1} \sim \mathcal{N}(0, I)
    $$
  - Reverse diffusion process: we want to undo the noising process by sampling from $p_{\theta}(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_{\theta}(x_t, t),\, \Sigma_{\theta}(x_t, t)\big)$ to recover the original image; the model predicts $\mu_{\theta}$.
  - Accelerating the diffusion process
    - DDIM samples only a subset of S diffusion steps (see the sketch after this list)
    - LDM (latent diffusion model) runs the diffusion process in the latent space instead of pixel space, making training cheaper and inference faster.
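A tiny illustration of the DDIM-style sub-sequence selection mentioned above (the even spacing is just one common choice, an assumption rather than a prescription from the paper):

```python
import numpy as np

T, S = 1000, 50                                      # full schedule length, number of sampling steps
tau = np.linspace(0, T - 1, S).round().astype(int)   # evenly spaced sub-sequence of timesteps
print(tau[:5], tau[-5:])                             # the sampler visits only these S steps, in reverse
```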
- 2020_NeurIPS_DDPM_Denoising-diffusion-probabilistic-models.pdf · blog reference: Markov Chain · blog reference: DDPM · Bilibili video: mathematical foundations of DDPM
- "Denoising Diffusion Probabilistic Models" NIPS, 2020 Jun 19 paper code pdf note Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
- "Denoising Diffusion Implicit Models" ICLR, 2020 Oct 6,
DDIM
paper code pdf note Authors: Jiaming Song, Chenlin Meng, Stefano Ermon
- The training loss is shown to be the same as DDPM's.
- The noising/denoising process is changed to a non-Markovian process: each noising or denoising step uses both the previous step and $x_0$. For example, producing $x_t$ uses $x_0, x_{t-1}$, so each step is no longer Markovian.
- The formula for one denoising step is given below.
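For reference, the DDIM denoising step with predicted noise $\epsilon_{\theta}$ (as given in the DDIM paper; setting $\sigma_t = 0$ yields the deterministic sampler) is:
$$
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_{\theta}(x_t, t)}{\sqrt{\bar{\alpha}_t}}}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2}\;\epsilon_{\theta}(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)
$$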
- "Generative Modeling by Estimating Gradients of the Data Distribution" NIPS, 2019 Jul 12,
NCSN
paper code pdf note Authors: Yang Song, Stefano Ermon
Langevin dynamics
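For context, the Langevin dynamics update that NCSN relies on for sampling, driven only by the learned score $\nabla_x \log p(x)$ (standard form, not quoted from this note), is:
$$
x_{i+1} = x_i + \frac{\epsilon}{2}\,\nabla_x \log p(x_i) + \sqrt{\epsilon}\; z_i, \qquad z_i \sim \mathcal{N}(0, I)
$$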
- "Score-Based Generative Modeling through Stochastic Differential Equations" ICLR, 2020 Nov 26 paper code pdf note Authors: Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole
- "Improved Techniques for Training Score-Based Generative Models" NIPS, 2020 Jun 16 paper code pdf note Authors: Yang Song, Stefano Ermon
- "A tutorial on energy-based learning."
- "How to train your energy-based models"