The official repo for Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation. Our paper has been selected as a Spotlight at ICLR 2025.
[2025.02.11] Our inference code for T2A is released with instructions. Our paper has been selected as a Spotlight at ICLR!
[2025.01.24] Our preview version of BEWO-1M is released with instructions.
[2025.01.23] Our paper is accepted by ICLR 2025! See you in Singapore.
[2024.10.14] Our initial paper is now accessible at .
- Dataset: Data instruction for BEWO
- Inference: Inference code for BEWO
- ITD Evaluation: Evaluation code for BEWO (Coming soon.)
To facilitate the advancement of multimodal-guided spatial audio generation models, we have developed a dual-channel audio dataset named Both Ears Wide Open 1M (BEWO-1M) through rigorous simulations and GPT-assisted caption transformation.
In total, we constructed 2.8k hours of training audio with more than 1M audio-text pairs, and approximately 17 hours of validation data with 6.2k pairs.
The full dataset of BEWO-1M can be found here.
Requires PyTorch 2.0 or later for Flash Attention support.
Development for this repo is done with Python 3.9 or 3.8.10.
This code base is adapted from stable-audio-tools. Sincere thanks to the engineers for their great work.
# for inference
cd models
conda create -n "bewo" python=3.9
conda activate bewo
pip install -r requirements.txt --no-deps
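Since Flash Attention requires PyTorch 2.0 or later, a quick sanity check of the environment can save debugging time. This is only an illustrative snippet, not part of the repo:

```python
# Illustrative environment check (not part of this repo): verify PyTorch >= 2.0,
# CUDA availability, and whether the Flash Attention SDP backend is enabled.
import torch

print("PyTorch version:", torch.__version__)        # expect 2.0 or later
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # PyTorch 2.x exposes scaled-dot-product-attention backend toggles.
    print("Flash SDP enabled:", torch.backends.cuda.flash_sdp_enabled())
```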
| Model | GDrive | Detail |
|---|---|---|
| BEWO_nl.ckpt | link | Trained with natural language only (direction described in the prompt) |
| BEWO_attri.ckpt | link | Trained with induction attributes only (no direction description in the prompt) |
| BEWO_mix.ckpt | link | Trained with both natural language and attributes (either form in the prompt) |
To generate audio from a text prompt using our pretrained model:
- Download the pretrained model and config files from [MODEL_LINK]
- Place the model checkpoint at
./bewo_config/BEWO_nl.ckpt
- Place the model config at
./bewo_config/model_config_sim.json
- Run the following command:
cd models
# feel free to reset seed and cfg_scale before inference
python simple_generation.py --prompt "A dog is barking on the left." --device cuda:0
python simple_generation.py --prompt "a car is moving from left to right." --device cuda:0
To generate audio from a text prompt with induction attributes using our pretrained model:
- Download the pretrained model and config files from [MODEL_LINK]
- Place the model checkpoint at
./bewo_config/BEWO_mix.ckpt
or ./bewo_config/BEWO_attri.ckpt
- Place the model config at
./bewo_config/model_config_sim_mix.json
- Run the following command:
GPT induction is used to generate the spatial attributes. We offer two models to choose from: GPT-4o and DeepSeek-V3. Since the DeepSeek model is much cheaper and open-source, it can be considered a cost-effective option. The default setting is "coarse", which balances quality and control.
Using GPT induction:
cd models
# feel free to reset seed and cfg_scale before inference
# better with BEWO_mix.ckpt
python gpt_induction.py --prompt "A dog is barking on the left." --device cuda:0
python gpt_induction.py --prompt "a dog is barking and running from left to right." --device cuda:0
We also provide a manual setting that lets you directly specify the initial direction, final direction, and moving state. Direction ranges from 1 (left) to 5 (right); the moving state ranges from 0 (not moving) to 3 (fast moving).
Using manual setting:
cd models
# feel free to reset seed and cfg_scale before inference
# better with BEWO_mix.ckpt
python gpt_induction.py --prompt "a dog is barking." --device cuda:0 --manual True --init_direction 1 --final_direction 1 --moving 0
python gpt_induction.py --prompt "a dog is barking." --device cuda:0 --manual True --init_direction 1 --final_direction 5 --moving 1
If you find this repo useful, please cite our paper:
@article{sun2024both,
title={Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation},
author={Sun, Peiwen and Cheng, Sitong and Li, Xiangtai and Ye, Zhen and Liu, Huadai and Zhang, Honggang and Xue, Wei and Guo, Yike},
journal={arXiv preprint arXiv:2410.10676},
year={2024}
}
Please also cite the stable-audio-tools paper if you use the code in this repo. Thanks again for their great work.
@article{evans2024stable,
title={Stable audio open},
author={Evans, Zach and Parker, Julian D and Carr, CJ and Zukowski, Zack and Taylor, Josiah and Pons, Jordi},
journal={arXiv preprint arXiv:2407.14358},
year={2024}
}