The official repo for Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation. Our paper has been selected as a Spotlight at ICLR 2025.
[2025.02.11] Our inference code for T2A is released with instructions. Our paper has been selected as a Spotlight at ICLR!
[2025.01.24] Our preview version of BEWO-1M is released with instructions.
[2025.01.23] Our paper is accepted by ICLR 2025! See you in Singapore.
[2024.10.14] Our initial paper is now accessible at .
- Dataset: Data instruction for BEWO
- Inference: Inference code for BEWO
- ITD Evaluation: Evaluation code for BEWO (Coming soon.)
To facilitate the advancement of multimodal-guided spatial audio generation models, we have developed a dual-channel audio dataset named Both Ears Wide Open 1M (BEWO-1M) through rigorous simulations and GPT-assisted caption transformation.
In total, we constructed 2.8k hours of training audio with more than 1M audio-text pairs, and approximately 17 hours of validation data with 6.2k pairs.
The full dataset of BEWO-1M can be found here.
Requires PyTorch 2.0 or later for Flash Attention support.
Development for this repo is done with Python 3.9 or 3.8.10.
This code base is adapted from stable-audio-tools. Sincere thanks to the engineers for their great work.
# for inference
cd models
conda create -n "bewo" python=3.9
conda activate bewo
pip install -r requirements.txt --no-deps
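Since Flash Attention requires PyTorch 2.0 or later, a quick sanity check of the environment can save debugging time. This is only an illustrative snippet, not part of the repo:

```python
# Illustrative environment check (not part of this repo): verify PyTorch >= 2.0,
# CUDA availability, and whether the Flash Attention SDP backend is enabled.
import torch

print("PyTorch version:", torch.__version__)        # expect 2.0 or later
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # PyTorch 2.x exposes scaled-dot-product-attention backend toggles.
    print("Flash SDP enabled:", torch.backends.cuda.flash_sdp_enabled())
```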
| Model | GDrive | Detail |
|---|---|---|
| BEWO_nl.ckpt | link | Trained with natural language only (direction described in the prompt) |
| BEWO_attri.ckpt | link | Trained with induction attributes only (no direction description in the prompt) |
| BEWO_mix.ckpt | link | Trained with both natural language and attributes (either form in the prompt) |
To generate audio from a text prompt using our pretrained model:
- Download the pretrained model and config files from [MODEL_LINK]
- Place the model checkpoint at
./bewo_config/BEWO_nl.ckpt
- Place the model config at
./bewo_config/model_config_sim.json
- Run the following command:
cd models
# feel free to reset seed and cfg_scale before inference
python simple_generation.py --prompt "A dog is barking on the left." --device cuda:0
python simple_generation.py --prompt "a car is moving from left to right." --device cuda:0
To generate audio from a text prompt with induction attributes using our pretrained model:
- Download the pretrained model and config files from [MODEL_LINK]
- Place the model checkpoint at
./bewo_config/BEWO_mix.ckpt
or ./bewo_config/BEWO_attri.ckpt
- Place the model config at
./bewo_config/model_config_sim_mix.json
- Run the following command:
GPT induction is used to generate the spatial attributes. We offer two models to choose from: GPT-4o and DeepSeek-V3. Since the DeepSeek model is much cheaper and open-source, it can be considered a cost-effective option. The default setting is "coarse", which balances quality and control.
Using GPT induction:
cd models
# feel free to reset seed and cfg_scale before inference
# better with BEWO_mix.ckpt
python gpt_induction.py --prompt "A dog is barking on the left." --device cuda:0
python gpt_induction.py --prompt "a dog is barking and running from left to right." --device cuda:0
We also provide a manual setting that lets you directly specify the initial direction, final direction, and moving state. Direction ranges from 1 (left) to 5 (right); the moving state ranges from 0 (not moving) to 3 (fast moving).
Using manual setting:
cd models
# feel free to reset seed and cfg_scale before inference
# better with BEWO_mix.ckpt
python gpt_induction.py --prompt "a dog is barking." --device cuda:0 --manual True --init_direction 1 --final_direction 1 --moving 0
python gpt_induction.py --prompt "a dog is barking." --device cuda:0 --manual True --init_direction 1 --final_direction 5 --moving 1
If you find this repo useful, please cite our paper:
@article{sun2024both,
title={Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation},
author={Sun, Peiwen and Cheng, Sitong and Li, Xiangtai and Ye, Zhen and Liu, Huadai and Zhang, Honggang and Xue, Wei and Guo, Yike},
journal={arXiv preprint arXiv:2410.10676},
year={2024}
}
Please also cite the stable-audio-tools paper if you use the code in this repo. Thanks again for their great work.
@article{evans2024stable,
title={Stable audio open},
author={Evans, Zach and Parker, Julian D and Carr, CJ and Zukowski, Zack and Taylor, Josiah and Pons, Jordi},
journal={arXiv preprint arXiv:2407.14358},
year={2024}
}