Stable Diffusion models can generate images from input prompts, which makes a pretrained diffusion model a viable tool for editing images under given conditions. This pipeline is adapted from Tune-A-Video (github, paper), which is built on the open-source Hugging Face Diffusers library and its pretrained checkpoints.
| Input Video | Output Video | | |
| --- | --- | --- | --- |
| "A woman is talking" | "A woman, wearing Superman clothes, is talking" | "A woman, wearing Batman's mask, is talking" | "A Wonder Woman is talking, cartoon style" |
Conda
In a conda environment, I installed Python (conda install python==3.11) and the following:
pip install -r requirements.txt
Docker Container
A Docker container can also be built to use GPUs for training and inference (postprocessing not included). Build the Docker image (from inside the current folder):
docker build -t image_name -f ./docker/Dockerfile .
Launch a container from the image with the following command (add --network your_docker_network_name if you need to specify a network). You will get an interactive shell with access to the NVIDIA GPUs. Then follow the instructions in the next sections.
docker run --gpus all -it image_name
The input video is decomposed into frame images. The prompt and the frames (in batches) are embedded into latent vectors, and the model is trained to semantically match these latent vectors through the cross-attention UNet architecture.
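For illustration, here is a minimal sketch of this encoding step with Diffusers, assuming the standard Stable Diffusion v1 components; the checkpoint name, the placeholder frame tensor, and the 0.18215 scaling factor are assumptions, not this repo's exact code:

```python
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed checkpoint; this repo fetches its own weights via download_models.sh.
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)

# A batch of video frames resized to 512x512 and scaled to [-1, 1] (placeholder tensor).
frames = torch.randn(8, 3, 512, 512, device=device)

with torch.no_grad():
    # Encode the frames into image latents (0.18215 is the usual SD v1 latent scaling factor).
    latents = vae.encode(frames).latent_dist.sample() * 0.18215

    # Encode the prompt into text embeddings consumed by the UNet's cross-attention layers.
    ids = tokenizer(
        "A woman is talking",
        padding="max_length",
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    ).input_ids.to(device)
    text_embeddings = text_encoder(ids)[0]
```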
- Download the Stable Diffusion model and the pretrained weights:
./download_models.sh
- It is strongly suggested to launch accelerate jobs in a terminal. First, configure Accelerate for (non-)distributed training:
accelerate config
- Launch a training job:
accelerate launch train_tuneavideo.py --config='./configs/woman-talking.yaml'
Some notes: I have tried different image aspect ratios and resolutions; (512, 512), the default image size of the pretrained model, works best. GPU memory is a bottleneck for training (even with an A100 40GB) since the model itself is quite large. Due to resource limitations, I was only able to train on videos with up to 16 frames in total.
Once the training is done (modify inference.py if needed), run:
python inference.py
In this process:
- New prompts are embedded. The new latent vectors are initialized through DDIM inversion, which provides structure guidance for sampling.
- The new latent vectors are used to reconstruct frames (with the same dimensions as the input video) through the VAE decoder, as sketched after this list.
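For reference, the upstream Tune-A-Video repository exposes this step as a pipeline call along the following lines; the checkpoint paths, the inverted-latent file name, and the sampler settings are assumptions and may differ from inference.py in this repo:

```python
import torch
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.util import save_videos_grid

# Assumed paths; adjust to the locations used by download_models.sh and the training config.
pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/woman-talking"

# Load the fine-tuned UNet and plug it into the pretrained Stable Diffusion components.
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")

# DDIM-inverted latents of the input video give the structure guidance for sampling.
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)

prompt = "A woman, wearing Superman clothes, is talking"
video = pipe(
    prompt,
    latents=ddim_inv_latent,
    video_length=16,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=12.5,
).videos  # frames reconstructed by the VAE decoder

save_videos_grid(video, f"./{prompt}.gif")
```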
Postprocessing provides a few functionalities using the moviepy module; see postprocess.ipynb and the sketch after the list below.
- The audio track is extracted from the original video.
- A new video is made by combining that audio with the generated video of the same duration.
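A minimal sketch of these two steps, assuming the moviepy 1.x API; the file names are placeholders, and the actual code is in postprocess.ipynb:

```python
from moviepy.editor import VideoFileClip

# Placeholder file names; the real paths are set in postprocess.ipynb.
original = VideoFileClip("data/woman-talking.mp4")
generated = VideoFileClip("outputs/woman-talking-edited.mp4")

# Extract the audio track from the original video.
audio = original.audio

# Attach that audio to the generated video of the same duration and write the result.
final = generated.set_audio(audio)
final.write_videofile("outputs/woman-talking-with-audio.mp4")
```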