Stable Diffusion models can generate images from input prompts, which makes a pretrained diffusion model a viable tool for editing images under given conditions. This pipeline is adapted from Tune-A-Video (github, paper), which is built on the open-source Hugging Face Diffusers library and its pretrained checkpoints.
| Input Video | Output Video | | |
| --- | --- | --- | --- |
| "A woman is talking" | "A woman, wearing Superman clothes, is talking" | "A woman, wearing Batman's mask, is talking" | "A Wonder Woman is talking, cartoon style" |
Conda
In a conda environment, I installed Python (conda install python==3.11) and the following:
pip install -r requirements.txt
Docker Container
A Docker container can also be built to use GPUs for training and inference (postprocessing not included). Build the Docker image (from inside the current folder):
docker build -t image_name -f ./docker/Dockerfile .
Launch a container from the image with the following command (add --network your_docker_network_name if you need to specify a network). You will get an interactive shell with access to the NVIDIA GPUs. Then follow the instructions in the next sections.
docker run --gpus all -it image_name
The input video is decomposed into frame images. The prompt and the frames (in batches) are embedded into latent vectors, and the model is trained to semantically match these latent vectors through the cross-attention UNet architecture.
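For illustration, here is a minimal sketch of this encoding step with Diffusers, assuming the standard Stable Diffusion v1 components; the checkpoint name, the placeholder frame tensor, and the 0.18215 scaling factor are assumptions, not this repo's exact code:

```python
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed checkpoint; this repo fetches its own weights via download_models.sh.
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)

# A batch of video frames resized to 512x512 and scaled to [-1, 1] (placeholder tensor).
frames = torch.randn(8, 3, 512, 512, device=device)

with torch.no_grad():
    # Encode the frames into image latents (0.18215 is the usual SD v1 latent scaling factor).
    latents = vae.encode(frames).latent_dist.sample() * 0.18215

    # Encode the prompt into text embeddings consumed by the UNet's cross-attention layers.
    ids = tokenizer(
        "A woman is talking",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    ).input_ids.to(device)
    text_embeddings = text_encoder(ids)[0]
```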
- Download the Stable Diffusion model and the pretrained weights:
./download_models.sh
- It is strongly suggested to launch accelerate jobs in a terminal. First, configure Accelerate for (non-)distributed training:
accelerate config
- Launch a training job:
accelerate launch train_tuneavideo.py --config='./configs/woman-talking.yaml'
Some notes: I have tried different image aspect ratios and resolutions; (512, 512), the default image size of the pretrained model, works best. GPU memory is a bottleneck for training (even with an A100 40GB) since the model itself is quite large. Due to resource limitations, I was only able to train on videos with up to 16 frames in total.
Once the training is done (modify inference.py if needed), run:
python inference.py
In this process:
- New prompts are embedded. The new latent vectors are initialized through DDIM inversion, which provides structure guidance for sampling.
- The new latent vectors are used to reconstruct frames (with the same dimensions as the input video) through the VAE decoder, as sketched after this list.
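For reference, the upstream Tune-A-Video repository exposes this step as a pipeline call along the following lines; the checkpoint paths, the inverted-latent file name, and the sampler settings are assumptions and may differ from inference.py in this repo:

```python
import torch
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.util import save_videos_grid

# Assumed paths; adjust to the locations used by download_models.sh and the training config.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/woman-talking"

# Load the fine-tuned UNet and plug it into the pretrained Stable Diffusion components.
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")

# DDIM-inverted latents of the input video give the structure guidance for sampling.
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)

prompt = "A woman, wearing Superman clothes, is talking"
video = pipe(
    prompt,
    latents=ddim_inv_latent,
    video_length=16,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=12.5,
).videos  # frames reconstructed by the VAE decoder

save_videos_grid(video, f"./{prompt}.gif")
```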
Postprocessing provides a few functionalities using the moviepy module; see postprocess.ipynb and the sketch after the list below.
- The audio track is extracted from the original video.
- A new video is made by combining that audio with the generated video of the same duration.
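A minimal sketch of these two steps, assuming the moviepy 1.x API; the file names are placeholders, and the actual code is in postprocess.ipynb:

```python
from moviepy.editor import VideoFileClip

# Placeholder file names; the real paths are set in postprocess.ipynb.
original = VideoFileClip("data/woman-talking.mp4")
generated = VideoFileClip("outputs/woman-talking-edited.mp4")

# Extract the audio track from the original video.
audio = original.audio

# Attach that audio to the generated video of the same duration and write the result.
final = generated.set_audio(audio)
final.write_videofile("outputs/woman-talking-with-audio.mp4")
```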