This repository is the official implementation of "SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation"
- Paper: arxiv
- Demo page: Audio Samples
- Checkpoints: Hugging Face (currently, only the checkpoints are available.)
Contact:
- Koichi SAITO: [email protected]
- [2024/12/04] We're planning to open-source the codebase and checkpoints of the DiT backbone with the full-band text-to-sound setting, as well as downstream tasks.
- Download the teacher model's checkpoints and the AudioLDM-s-full checkpoints (for the VAE+Vocoder part) and put them in soundctm/ckpt
- SoundCTM checkpoint on AudioCaps (ema=0.999, 30K training iterations)
Inference uses both the AudioLDM-s-full checkpoint (for the VAE decoder and vocoder) and the SoundCTM checkpoint.
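Before launching training or inference, it can help to confirm that the downloaded files actually sit in soundctm/ckpt. The following is a minimal sketch, not part of the repository: the helper `missing_checkpoints` and the three file names are hypothetical, so substitute the names of the files you downloaded.

```python
from pathlib import Path


def missing_checkpoints(ckpt_dir, expected_names):
    """Return the names in expected_names that are not files inside ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in expected_names if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical file names: replace them with the files you actually downloaded.
    expected = ["teacher.pt", "audioldm-s-full.ckpt", "soundctm.pt"]
    missing = missing_checkpoints("soundctm/ckpt", expected)
    if missing:
        print("Missing from soundctm/ckpt:", ", ".join(missing))
    else:
        print("All expected checkpoint files are present.")
```

The check only looks at file presence; it does not verify that a checkpoint is complete or loadable.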
Install Docker on your server and build the Docker image:
docker build -t soundctm .
Then run the scripts inside the container.
Please see ctm_train.sh and ctm_train.py, and modify the folder paths depending on your environment.
Then run bash ctm_train.sh
Please see ctm_inference.sh and ctm_inference.py, and modify the folder paths depending on your environment.
Then run bash ctm_inference.sh
Please see numerical_evaluation.sh and numerical_evaluation.py, and modify the folder paths depending on your environment.
Then run bash numerical_evaluation.sh
Follow the instructions given in the AudioCaps repository for downloading the data.
Data locations need to be specified in ctm_train.sh.
You can also see some examples at data/train.csv.
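To see which columns the example file uses before preparing your own data, you can print its header and first few rows. This is a small sketch that makes no assumption about the column names; the helper `preview_csv` is hypothetical and not part of the repository.

```python
import csv
import os
from itertools import islice


def preview_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no lines
        rows = list(islice(reader, n_rows))
    return header, rows


if __name__ == "__main__":
    path = "data/train.csv"
    if os.path.exists(path):
        header, rows = preview_csv(path)
        print("Columns:", header)
        for row in rows:
            print(row)
    else:
        print(f"{path} not found; run this from the repository root.")
```

Matching your own CSV files to the printed columns is the simplest way to keep them compatible with the paths set in ctm_train.sh.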
The training code also requires a Weights & Biases account to log the training outputs and demos. Create an account and log in with:
$ wandb login
Alternatively, you can pass an API key through the WANDB_API_KEY environment variable.
(You can obtain the API key from https://wandb.ai/authorize after logging in to your account.)
$ export WANDB_API_KEY="12345x6789y..."
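If you rely on the environment variable, a quick check before starting a long training run can save a failed launch. The sketch below is an illustration only: `wandb_key_is_set` is a hypothetical helper, and it inspects only WANDB_API_KEY, so it will not detect credentials saved by wandb login.

```python
import os


def wandb_key_is_set(environ=None):
    """Return True if WANDB_API_KEY is present and non-empty in the given mapping."""
    env = os.environ if environ is None else environ
    return bool(env.get("WANDB_API_KEY", "").strip())


if __name__ == "__main__":
    if wandb_key_is_set():
        print("WANDB_API_KEY is set.")
    else:
        print("WANDB_API_KEY is not set; run `wandb login` or export the key first.")
```

Note that a variable assigned in the shell without export is not passed to child processes such as bash ctm_train.sh, which is the case this check is meant to catch.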
@article{saito2024soundctm,
title={SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation},
author={Koichi Saito and Dongjun Kim and Takashi Shibuya and Chieh-Hsin Lai and Zhi Zhong and Yuhta Takida and Yuki Mitsufuji},
journal={arXiv preprint arXiv:2405.18503},
year={2024}
}
Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contribution.