"LDMVFI: Video Frame Interpolation with Latent Diffusion Models" Arxiv, 2023 Mar 👍 paper code paper local pdf
The first approach to use a latent diffusion model (Stable Diffusion) for the VFI task (video frame interpolation). It contains two major components: a VFI-specific autoencoder and a denoising U-Net.
- VQ-FIGAN
  replaces the LDM encoder and decoder with VQ-FIGAN
  - The LDM encoder was not designed for VFI; MaxViT attention & deformable convolutions are used to boost VFI performance
  - The decoder additionally takes the GT (neighboring-frame) codes >> improves reconstruction consistency
- Denoising U-Net
  Tailored to VFI: the preceding and following frames are available at inference time, so their latent codes are used as the condition
Contributions
- First method to use an LDM for VFI: the latent codes of the preceding and following frames are used as the condition
- A new encoder-decoder, VQ-FIGAN, replaces the VQGAN used in Stable Diffusion
  - The LDM encoder was not designed for VFI; MaxViT attention & deformable convolutions boost VFI performance
  - The decoder additionally receives the latent codes of the GT neighboring frames I0, I1 >> improves reconstruction consistency
  - Vanilla attention is replaced with MaxViT attention >> improves efficiency
- Background
  Video frame interpolation (VFI) generates intermediate frames between two existing consecutive frames in a video sequence.
  Existing PSNR-oriented VFI methods optimize L1/L2-based distortion losses; the PSNR scores are high, but the results look unconvincing to the eye.
  Existing VFI methods fall into two categories: flow-based & kernel-based
  - flow-based: predict optical flow
  - kernel-based: predict locally adaptive convolution kernels to synthesize output pixels
- Prior work applying diffusion models to VFI (video frame interpolation)
- "MaxViT: Multi-Axis Vision Transformer" ECCV, 2022 Apr paper code
  MaxViT is a family of hybrid (CNN + ViT) image classification models that achieves better parameter and FLOPs efficiency across the board than both SoTA ConvNets and Transformers
- Writing trick: briefly recapping LDM pads out the paper
- ❓ How are the $\phi$ features produced? See the encoder structure below
The proposed LDMVFI contains two main components:
- VQ-FIGAN: a VFI-specific autoencoding model that projects frames into a latent space and reconstructs the target frame
- a denoising U-Net that performs the reverse diffusion process in the latent space for conditional image generation
Overall framework
Why use a new encoder?
The authors find that the pretrained LDM encoder loses high-frequency details during compression, so they design their own encoder to map frames into the latent space.
The LDM encoder is designed to project images into efficient latent representations in which high-frequency details are removed; the lossy information it introduces would degrade interpolation quality.
How are I0, I1 fused in the encoder and decoder design?
- MaxCA block improvements (see the cross-attention sketch after this list)
  MaxViT cross-attention with Q = the latent code and K/V = the codes of I0, I1 >> the reverse diffusion predicts the difference between the two frames; correlating with I0 gives a weight matrix that is applied to I1
  - VQGAN uses ViT attention, which is computationally heavy
  - For more efficient inference on HR video >> a lighter-weight ViT speeds things up
  - A more memory-efficient implementation is used, following the paper "Video Frame Interpolation via Adaptive Separable Convolution"
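A minimal sketch of the conditioning idea above: queries come from the target-frame latent, keys/values from the I0/I1 features. The class name, shapes, and the use of `nn.MultiheadAttention` are illustrative assumptions, and the MaxViT block/grid window partitioning is omitted; this is not the paper's MaxCA implementation.

```python
import torch
import torch.nn as nn

class NeighborCrossAttention(nn.Module):
    # Illustrative stand-in for a MaxCA-style block: Q from the target-frame
    # latent, K/V from the neighboring-frame (I0, I1) features.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, z_t, feat01):
        # z_t:    (B, N, C) tokens of the latent being decoded/denoised
        # feat01: (B, M, C) tokens from the I0 and I1 feature pyramids
        q = self.norm_q(z_t)
        kv = self.norm_kv(feat01)
        out, _ = self.attn(q, kv, kv)  # cross-attention: Q from z_t, K/V from I0/I1
        return z_t + out               # residual connection
```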
- Decoder outputs
  $\{\Omega^\tau, \alpha^\tau, \beta^\tau\}$, where $\Omega^\tau$ are the kernel parameters, $\tau$ indexes frames 0 and 1, and $\alpha^\tau, \beta^\tau$ are the pixel offsets in the horizontal and vertical directions
  locally adaptive deformable convolutions
  TODO: Eqs. 4-6; the decoder output is finally fed into this deformable convolution
  Output
  $I^{n0}, I^{n1}$ are then blended by a weighted fusion (sketch below)
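A minimal sketch of the final fusion step; the per-pixel mask `m` and the function name are assumptions for illustration, while the exact formulation follows Eqs. 4-6 in the paper.

```python
import torch

def blend(i_n0: torch.Tensor, i_n1: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    # i_n0, i_n1: (B, 3, H, W) frames synthesized from I0 and I1
    # m:          (B, 1, H, W) blending weights in [0, 1] (assumed, e.g. a sigmoid output)
    return m * i_n0 + (1.0 - m) * i_n1
```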
class FIEncoder(nn.Module)
Three stages: start with a Conv that extracts features without changing the channel count; mid: several consecutive ResBlocks (optionally with MaxViT attention) + downsample blocks that reduce the resolution; end: Res + Attention (MaxViT or vanilla) + Res. A structural sketch follows.
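A structural sketch of the three stages, assuming a generic `ResBlock` and channel choices rather than the repository's exact classes; the attention blocks are omitted for brevity.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    # generic residual block used only for this sketch
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.GroupNorm(8, c), nn.SiLU(), nn.Conv2d(c, c, 3, padding=1),
            nn.GroupNorm(8, c), nn.SiLU(), nn.Conv2d(c, c, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class SketchFIEncoder(nn.Module):
    def __init__(self, c_in=3, c=128, n_down=3):
        super().__init__()
        self.stem = nn.Conv2d(c_in, c, 3, padding=1)        # start: conv, feature extraction
        mid = []
        for _ in range(n_down):                             # mid: ResBlocks (+ optional MaxViT attention)
            mid += [ResBlock(c), nn.Conv2d(c, c, 3, stride=2, padding=1)]  # + downsample
        self.mid = nn.Sequential(*mid)
        self.end = nn.Sequential(ResBlock(c), ResBlock(c))  # end: Res + (attention) + Res

    def forward(self, x):
        return self.end(self.mid(self.stem(x)))
```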
- MaxAttentionBlock (window-partition sketch below)
  - Partition W, H into 8x8 patches
  - Apply MaxViT attention to the embedding x with a residual: MaxViT_atten(x) + x
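A hedged sketch of the 8x8 window (block) attention step: the feature map is split into non-overlapping 8x8 windows, self-attention runs inside each window, and a residual is added. The class name and the use of `nn.MultiheadAttention` are assumptions; the grid-attention and MBConv parts of full MaxViT are omitted.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, C, H, W), H and W divisible by 8
        b, c, h, w = x.shape
        p = self.window
        # partition into (B * num_windows, p*p, C) token sequences
        t = x.view(b, c, h // p, p, w // p, p).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, p * p, c)
        t_norm = self.norm(t)
        out, _ = self.attn(t_norm, t_norm, t_norm)
        t = t + out                             # residual: MaxViT_atten(x) + x
        # merge windows back to (B, C, H, W)
        t = t.view(b, h // p, w // p, p, p, c).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(b, c, h, w)
```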
- max_cross_attn
  MaxViT attention first, then cross-attention
- vanilla attention
  >> standard self-attention

```python
def forward(self, x):
    h_ = x
    h_ = self.norm(h_)          # normalize
    q = self.q(h_)              # Conv2d
    k = self.k(h_)
    v = self.v(h_)

    # compute attention
    b, c, h, w = q.shape
    q = q.reshape(b, c, h * w)
    q = q.permute(0, 2, 1)      # b, hw, c
    k = k.reshape(b, c, h * w)  # b, c, hw
    w_ = torch.bmm(q, k)        # b, hw, hw; w_[b,i,j] = sum_c q[b,i,c] * k[b,c,j]
    w_ = w_ * (int(c) ** (-0.5))
    w_ = torch.nn.functional.softmax(w_, dim=2)

    # attend to values
    v = v.reshape(b, c, h * w)
    w_ = w_.permute(0, 2, 1)    # b, hw, hw (first hw of k, second of q)
    h_ = torch.bmm(v, w_)       # b, c, hw (hw of q); h_[b,c,j] = sum_i v[b,c,i] * w_[b,i,j]
    h_ = h_.reshape(b, c, h, w)

    h_ = self.proj_out(h_)      # Conv2d
    return x + h_
```
- Downsample: halves the spatial dimensions
  - with conv: pad H and W on one side only, then conv2d(c, c, kernel_size=3, stride=2, padding=0)
    torch.nn.functional.pad(input, pad=(padding_left, padding_right, padding_top, padding_bottom)): the padding is given in pairs per dimension, counted from dim=-1 backwards
  - without conv: torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
    2D average-pooling output-size rule:
$$ H_{out} = \left\lfloor \frac{H_{in} + 2 \cdot \text{padding}[0] - \text{kernel}[0]}{\text{stride}[0]} \right\rfloor + 1 $$
A minimal downsample sketch follows.
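A minimal sketch of the two downsampling branches described above (assumed to mirror the VQGAN-style block); the class name and flag are illustrative.

```python
import torch.nn as nn

class SketchDownsample(nn.Module):
    # halves H and W: either a stride-2 3x3 conv with one-sided padding,
    # or a 2x2 average pooling (illustrative stand-in for the repo's class)
    def __init__(self, in_channels, with_conv=True):
        super().__init__()
        self.with_conv = with_conv
        if self.with_conv:
            self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=2, padding=0)

    def forward(self, x):
        if self.with_conv:
            # pad = (left, right, top, bottom): pad only the right and bottom by 1
            x = nn.functional.pad(x, pad=(0, 1, 0, 1), mode="constant", value=0)
            return self.conv(x)  # for even H: floor((H + 1 - 3) / 2) + 1 = H / 2
        return nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
```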
- class VectorQuantizer2(nn.Module)
  From VQGAN >> this is the VQ layer inside the decoder in Figure 3 of the paper
  Given a latent code, it retrieves the corresponding embedding from the VQGAN codebook (minimal lookup sketch below)
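A simplified sketch of the codebook lookup; the class name, codebook size, and the omission of the commitment loss and straight-through estimator are assumptions, not the actual VectorQuantizer2.

```python
import torch
import torch.nn as nn

class SimpleVectorQuantizer(nn.Module):
    # nearest-neighbour codebook lookup only; the losses and the
    # straight-through estimator of the real VectorQuantizer2 are omitted
    def __init__(self, n_codes=8192, code_dim=3):
        super().__init__()
        self.embedding = nn.Embedding(n_codes, code_dim)

    def forward(self, z):
        # z: (B, C, H, W) continuous latents from the encoder
        b, c, h, w = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (B*H*W, C)
        d = torch.cdist(z_flat, self.embedding.weight)     # distance to every codebook entry
        idx = d.argmin(dim=1)                              # index of the nearest code
        z_q = self.embedding(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        return z_q, idx                                    # quantized latents + code indices
```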
- mid part
  Res + MaxViT (self-attn) + Res
- Upsample part
  Res + MaxViT (cross-attn) + Upsample (nearest-neighbor interpolation + a Conv2d that keeps the spatial size)

```python
class Upsample(nn.Module):
    def __init__(self, in_channels, with_conv):
        super().__init__()
        self.with_conv = with_conv
        if self.with_conv:
            self.conv = torch.nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # double the spatial resolution with nearest-neighbor interpolation
        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
        if self.with_conv:
            x = self.conv(x)  # 3x3 conv, spatial size unchanged
        return x
```
- Decoder outputs
  $\{\Omega^\tau, \alpha^\tau, \beta^\tau\}$: the parameters of the deformable convolution-based interpolation kernel module. This kernel follows the paper below; the formula is implemented & accelerated with the cupy library ⭐
  - $\Omega^\tau$ are the kernel parameters, with $\tau$ indexing frames 0 and 1
  - $\alpha^\tau, \beta^\tau$ are the pixel offsets in the horizontal and vertical directions
  def OffsetHead(c_in)
  Four Conv+ReLU layers; a module specific to the VFI task, since the generated frame is viewed as the neighboring frames' information plus offsets (sketch below)
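A hedged sketch of the head structure described above (four Conv+ReLU layers followed by a projection); the channel widths and output dimensionality are assumptions, not the authors' code.

```python
import torch.nn as nn

def offset_head_sketch(c_in, c_out, c_mid=64):
    # c_out would be e.g. kernel_size**2 for the kernel weights Omega,
    # or the per-pixel offsets alpha/beta (assumed, for illustration)
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_out, 3, padding=1),
    )
```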
- Training VQ-FIGAN
  Uses the original VQGAN training recipe directly
- How is the reverse diffusion implemented with I0, I1 as the condition? ❓
  Reverse diffusion with y denoting the GT neighboring frames: $$ p_\theta(x_{t-1} \mid x_t, y) $$
  Described in the last subsection of the Experiments section:
  conditioning the denoising U-Net on the latents $z_0, z_1$ (the codes of the GT neighboring frames) is done by concatenation (sketch below)
  3090 GPUs; VQ-FIGAN trained for 70 epochs, the U-Net for 60 epochs
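A minimal sketch of the concatenation-based conditioning; `denoising_unet`, the argument order, and the channel assumption (U-Net input channels = 3x the latent channels) are illustrative, not the released interface.

```python
import torch

def denoise_step(denoising_unet, zt, z0, z1, t):
    # zt:     noisy latent of the target frame at diffusion step t
    # z0, z1: latents of the preceding / following frames (the condition y)
    unet_in = torch.cat([zt, z0, z1], dim=1)  # concatenate along the channel dim
    return denoising_unet(unet_in, t)         # predicts the noise for zt
```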
Appendix J provides the URLs of the datasets
- Training data: Vimeo90K septuplets + BVI-DVC quintuplets
  The final training set consists of 64,612 frame septuplets from Vimeo90K and 17,600 frame quintuplets from BVI-DVC, as provided by [11]
- Evaluation on commonly used VFI benchmarks
  UCF101, DAVIS, SNU-FILM
- Full HD evaluation
  BVI-HFR [43] dataset
- 10 baselines
  including BMBC [50], AdaCoF [37], CDFI [15], XVFI [60], ABME [51], IFRNet [35], VFIformer [41], ST-MFNet [11], FLAVR [30], and MCVD [70]
  All baselines were re-trained on the same training set for a fair comparison
- Effectiveness of VQ-FIGAN (ablation variants)
  - V2: the decoder directly outputs $I^n$ instead of the kernel parameters
  - V1: on top of V2, additionally removes the pyramid features from I0, I1 & replaces the MaxCA blocks with ResNet blocks
- Downsampling factor f >> sets the dimension of the latent space
  As f increases from 8 to 32 there is generally an increasing trend in model performance; at f=64 performance drops considerably, and f=8 is chosen
  The compression rate adjusts the weight of $I_t$ relative to $I_0, I_1$: the larger f is, the less $I_t$ information survives, so the decoder afterwards relies more heavily on $I_0, I_1$! :star:
- U-Net parameter count >> controlled by adjusting the channel width; c=256 gives 450M parameters
  Table 3 reflects a decreasing trend in model performance as c is decreased
- Conditioning mechanism: the two options perform similarly, so the lower-complexity concat is chosen
  - z0, z1, zt concatenated as the U-Net input
  - using MaxCA blocks at different layers of the U-Net
  See Appendix H.
- Much slower inference speed compared with other SOTA methods
- The parameter count of LDMVFI is also larger: 450M parameters...
  knowledge distillation [21] and model compression [4] can be used
- The two-stage training strategy, large model size and slow inference speed of LDMVFI make large-scale training and evaluation costly
  i.e., training and inference are slow...
- LPIPS: copy the code from PerceptualSimilarity 👍 the original version in this paper's code is missing the weights
- Dataset
- What to learn & how to apply it to our task
  The idea of applying an LDM to VFI can be imitated
- If the LDM encoder underperforms, try replacing the LDM encoder and decoder with VQ-FIGAN && consider swapping in MaxViT
  The LDM encoder was not designed for VFI; MaxViT attention & deformable convolutions boost VFI performance
- Add GT information in the decoder >> improves reconstruction consistency
- Concat-style conditioning performs about as well as replacing K/V in attention