This repository contains a tutorial on building audio understanding systems with large language models (LLMs). The audio understanding tasks include automatic speech recognition (ASR), audio captioning, audio question answering, music transcription, etc. The repository is written in PyTorch. All tasks are cast into the same format: tuples of (audio, question, answer). An audio understanding system consists of an audio encoder and an LLM decoder. By loading pretrained audio encoders and training the LLM decoders from scratch, users can train an audio understanding system in less than 10 hours on a single RTX 4090 GPU.
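To make the (audio, question, answer) format and the encoder + LLM-decoder design concrete, here is a minimal, hypothetical sketch. The class names, shapes, and tokenization are illustrative only and do not correspond to the actual modules in this repository.

```python
# Toy sketch of an audio encoder + LLM decoder (illustrative only; not the
# repository's actual code). Causal masking and loss computation are omitted.
import torch
import torch.nn as nn


class ToyAudioEncoder(nn.Module):
    """Maps a mono waveform to a sequence of audio embeddings."""
    def __init__(self, emb_dim=256, hop=1600):
        super().__init__()
        self.hop = hop
        self.proj = nn.Linear(hop, emb_dim)

    def forward(self, waveform):                      # (batch, samples)
        b, n = waveform.shape
        n = n - n % self.hop
        frames = waveform[:, :n].reshape(b, -1, self.hop)
        return self.proj(frames)                      # (batch, frames, emb_dim)


class ToyLLMDecoder(nn.Module):
    """Predicts answer tokens conditioned on audio embeddings and question tokens."""
    def __init__(self, vocab_size=1000, emb_dim=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        layer = nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(emb_dim, vocab_size)

    def forward(self, audio_emb, question_ids, answer_ids):
        # Concatenate audio embeddings with question and answer token embeddings.
        x = torch.cat([audio_emb,
                       self.token_emb(question_ids),
                       self.token_emb(answer_ids)], dim=1)
        return self.head(self.backbone(x))            # (batch, seq, vocab)


audio = torch.randn(2, 32000)                  # 2 s of fake 16 kHz audio
question = torch.randint(0, 1000, (2, 8))      # fake tokenized question
answer = torch.randint(0, 1000, (2, 4))        # fake tokenized answer

logits = ToyLLMDecoder()(ToyAudioEncoder()(audio), question, answer)
print(logits.shape)                            # (2, 32, 1000)
```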
```bash
# Clone the repo
git clone https://github.com/qiuqiangkong/audio_understanding
cd audio_understanding

# Install Python environment
conda create --name audio_understanding python=3.10

# Activate environment
conda activate audio_understanding

# Install Python package dependencies
bash env.sh
```
Music tagging is the task of predicting tags for an audio clip, such as "classical", "country", and "blues".
Users need to download the GTZAN dataset (1.3 GB, 8 hours):

```bash
bash ./scripts/download_gtzan.sh
```
The downloaded dataset after decompression looks like:

```
gtzan (1.3 GB)
└── genres
    ├── blues (100 files)
    ├── classical (100 files)
    ├── country (100 files)
    ├── disco (100 files)
    ├── hiphop (100 files)
    ├── jazz (100 files)
    ├── metal (100 files)
    ├── pop (100 files)
    ├── reggae (100 files)
    └── rock (100 files)
```
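As a quick sanity check of the download (separate from the repository code), one clip can be loaded with librosa. The dataset root below is an assumption; adjust it to wherever download_gtzan.sh placed the files.

```python
# Standalone sketch: load one GTZAN clip to verify the download.
import librosa

path = "./datasets/gtzan/genres/blues/blues.00000.au"  # hypothetical path
audio, sr = librosa.load(path, sr=None, mono=True)
print(f"{len(audio) / sr:.1f} s at {sr} Hz")  # GTZAN clips are ~30 s at 22,050 Hz
```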
Training for 100,000 steps takes ~3 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/music_tagging_gtzan.yaml"
```
Train a music tagging system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/music_tagging_gtzan
wget -O ./checkpoints/train/music_tagging_gtzan/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/music_tagging_gtzan_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/music_tagging_gtzan.yaml" \
    --ckpt_path="./checkpoints/train/music_tagging_gtzan/step=100000.pth" \
    --audio_path="./assets/audios/gtzan_blues.00002.au"
```
| Task | Training Dataset | Loss | Test audio | Output |
|---|---|---|---|---|
| Music Tagging | GTZAN (size: 8 h) | (loss curve) | gtzan.mp4 | blues |
ASR is the task of transcribing spoken-language audio into text.
Users need to download the LibriSpeech dataset (60 GB, 1,000 hours):

```bash
bash ./scripts/download_librispeech.sh
```
The downloaded dataset after decompression looks like:

```
librispeech (60 GB)
├── dev-clean (40 folders)
│   ├── 1272 (3 folders)
│   │   ├── 128104
│   │   │   ├── 1272-128104-0000.flac
│   │   │   ├── ...
│   │   │   ├── 1272-128104-0014.flac
│   │   │   └── 1272-128104.trans.txt
│   │   ...
│   ...
├── dev-other (33 folders)
├── test-clean (40 folders)
├── test-other (33 folders)
├── train-clean-100 (251 folders)
├── train-clean-360 (921 folders)
├── train-other-500 (1166 folders)
├── BOOKS.TXT
├── CHAPTERS.TXT
├── LICENSE.TXT
├── README.TXT
└── SPEAKERS.TXT
```
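LibriSpeech pairs audio with text through the *.trans.txt files: each line is "&lt;utterance-id&gt; &lt;transcript&gt;", and the corresponding &lt;utterance-id&gt;.flac sits in the same folder. The following standalone sketch (not part of the repository) illustrates this pairing; the dataset root is an assumption.

```python
# Standalone sketch: pair LibriSpeech flac files with their transcripts.
from pathlib import Path
import soundfile as sf

chapter_dir = Path("./datasets/librispeech/dev-clean/1272/128104")  # hypothetical path
trans_file = chapter_dir / "1272-128104.trans.txt"

for line in trans_file.read_text().splitlines()[:3]:
    utt_id, transcript = line.split(" ", 1)
    audio, sr = sf.read(chapter_dir / f"{utt_id}.flac")
    print(utt_id, f"{len(audio) / sr:.1f} s:", transcript[:50])
```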
Training for 100,000 steps takes ~8 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/asr_librispeech.yaml"
```
Train an ASR system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/asr_librispeech
wget -O ./checkpoints/train/asr_librispeech/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/asr_librispeech_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/asr_librispeech.yaml" \
    --ckpt_path="./checkpoints/train/asr_librispeech/step=100000.pth" \
    --audio_path="./assets/audios/librispeech_1688-142285-0000.flac"
```
Audio captioning is the task of predicting a text caption that describes an audio clip.
Users need to download the Clotho dataset (7.3 GB, 24 hours):

```bash
bash ./scripts/download_clotho.sh
```
The downloaded dataset after decompression looks like:

```
clotho (7.3 GB)
├── clotho_audio_development (2894 wavs)
├── clotho_audio_evaluation (1046 wavs)
├── clotho_captions_development.csv
├── clotho_captions_evaluation.csv
├── clotho_metadata_development.csv
├── clotho_metadata_evaluation.csv
└── LICENSE
```
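Each row of the Clotho captions CSVs pairs one wav file with five human-written captions. The standalone sketch below (with an assumed dataset root and the standard Clotho column names) shows how to browse the annotations.

```python
# Standalone sketch: inspect one Clotho audio/caption pair.
import pandas as pd

df = pd.read_csv("./datasets/clotho/clotho_captions_development.csv")  # hypothetical path
row = df.iloc[0]
print(row["file_name"])    # wav file in clotho_audio_development
print(row["caption_1"])    # one of the five captions (caption_1 ... caption_5)
```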
Training for 100,000 steps takes ~8 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/audio_caption_clotho.yaml"
```
Train an audio caption system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/audio_caption_clotho
wget -O ./checkpoints/train/audio_caption_clotho/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/audio_caption_clotho_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/audio_caption_clotho.yaml" \
    --ckpt_path="./checkpoints/train/audio_caption_clotho/step=100000.pth" \
    --audio_path="./assets/audios/clotho_birds_long.wav"
```
| Task | Training Dataset | Loss | Test audio | Output |
|---|---|---|---|---|
| Audio Caption | Clotho (size: 24 h) | (loss curve) | clotho.mp4 | birds chirping and a passing of a car outdoors |
Piano transcription is the task of transcribing a piano recording into a MIDI file.
Users need to download the MAESTRO dataset (131 GB, 199 hours):

```bash
bash ./scripts/download_maestro.sh
```
The downloaded dataset after decompression looks like:

```
maestro-v3.0.0 (131 GB)
├── 2004 (132 songs, wav + flac + midi + tsv)
├── 2006 (115 songs, wav + flac + midi + tsv)
├── 2008 (147 songs, wav + flac + midi + tsv)
├── 2009 (125 songs, wav + flac + midi + tsv)
├── 2011 (163 songs, wav + flac + midi + tsv)
├── 2013 (127 songs, wav + flac + midi + tsv)
├── 2014 (105 songs, wav + flac + midi + tsv)
├── 2015 (129 songs, wav + flac + midi + tsv)
├── 2017 (140 songs, wav + flac + midi + tsv)
├── 2018 (93 songs, wav + flac + midi + tsv)
├── LICENSE
├── maestro-v3.0.0.csv
├── maestro-v3.0.0.json
└── README
```
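The maestro-v3.0.0.csv metadata file lists, for every performance, its split, audio filename, MIDI filename, and duration, which is how audio/MIDI pairs are located. A standalone sketch for browsing it (the dataset root is an assumption):

```python
# Standalone sketch: browse the MAESTRO v3.0.0 metadata.
import pandas as pd

df = pd.read_csv("./datasets/maestro-v3.0.0/maestro-v3.0.0.csv")  # hypothetical path
print(df["split"].value_counts())                                  # train / validation / test sizes
print(df[["audio_filename", "midi_filename", "duration"]].head(3))
```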
Training for 100,000 steps takes ~8 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/piano_transcription_maestro.yaml"
```
Train a piano transcription system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/piano_transcription_maestro
wget -O ./checkpoints/train/piano_transcription_maestro/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/piano_transcription_maestro_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/piano_transcription_maestro.yaml" \
    --ckpt_path="./checkpoints/train/piano_transcription_maestro/step=100000.pth" \
    --audio_path="./assets/audios/cut_liszt_5s.mp3"
```
| Task | Training Dataset | Loss | Test audio | Output |
|---|---|---|---|---|
| Piano Transcription | MAESTRO (199 h) | (loss curve) | piano.mp4 | output.mp4 |
We use the Hugging Face accelerate library to train the systems on multiple GPUs. train_accelerate.py adds only a few lines to train.py. Here is an example of running with 4 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes 4 train_accelerate.py --config="./configs/asr_librispeech.yaml"
```
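For reference, the kind of change accelerate requires on top of a plain PyTorch training loop looks roughly like the sketch below (a toy example with stand-in model, optimizer, and dataloader; see train_accelerate.py for the actual code).

```python
# Toy sketch of the accelerate pattern: wrap objects with Accelerator.prepare()
# and replace loss.backward() with accelerator.backward(loss).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = nn.Linear(16, 1)                                  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    accelerator.backward(loss)    # replaces loss.backward()
    optimizer.step()
```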
The table below compares the loss when training with 1 GPU and with 4 GPUs. Training with 4 GPUs speeds up training by roughly 4 times.
| Task | Training Dataset | Train loss | Test loss |
|---|---|---|---|
| ASR | LibriSpeech (size: 1,000 h) | (loss curve) | (loss curve) |
The Llama model code is from: https://github.com/qiuqiangkong/mini_llm/blob/main/models/llama.py
This repository is released under the MIT license.
```bibtex
@article{kong2020panns,
  title={Panns: Large-scale pretrained audio neural networks for audio pattern recognition},
  author={Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Yuxuan and Wang, Wenwu and Plumbley, Mark D},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={28},
  pages={2880--2894},
  year={2020},
  publisher={IEEE}
}
```