This repository contains a tutorial on building audio understanding systems with large language models (LLMs). The audio understanding tasks include automatic speech recognition (ASR), audio captioning, audio question answering, music transcription, etc. The repository is written in PyTorch. All tasks are cast into the same format: tuples of (audio, question, answer). An audio understanding system consists of an audio encoder and an LLM decoder. By loading pretrained audio encoders and training the LLM decoders from scratch, users can train an audio understanding system in less than 10 hours on a single RTX 4090 GPU.
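To make the (audio, question, answer) format and the encoder + LLM-decoder design concrete, here is a minimal, hypothetical sketch. The class names, shapes, and tokenization are illustrative only and do not correspond to the actual modules in this repository.

```python
# Toy sketch of an audio encoder + LLM decoder (illustrative only; not the
# repository's actual code). Causal masking and loss computation are omitted.
import torch
import torch.nn as nn


class ToyAudioEncoder(nn.Module):
    """Maps a mono waveform to a sequence of audio embeddings."""
    def __init__(self, emb_dim=256, hop=1600):
        super().__init__()
        self.hop = hop
        self.proj = nn.Linear(hop, emb_dim)

    def forward(self, waveform):                      # (batch, samples)
        b, n = waveform.shape
        n = n - n % self.hop
        frames = waveform[:, :n].reshape(b, -1, self.hop)
        return self.proj(frames)                      # (batch, frames, emb_dim)


class ToyLLMDecoder(nn.Module):
    """Predicts answer tokens conditioned on audio embeddings and question tokens."""
    def __init__(self, vocab_size=1000, emb_dim=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        layer = nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(emb_dim, vocab_size)

    def forward(self, audio_emb, question_ids, answer_ids):
        # Concatenate audio embeddings with question and answer token embeddings.
        x = torch.cat([audio_emb,
                       self.token_emb(question_ids),
                       self.token_emb(answer_ids)], dim=1)
        return self.head(self.backbone(x))            # (batch, seq, vocab)


audio = torch.randn(2, 32000)                  # 2 s of fake 16 kHz audio
question = torch.randint(0, 1000, (2, 8))      # fake tokenized question
answer = torch.randint(0, 1000, (2, 4))        # fake tokenized answer

logits = ToyLLMDecoder()(ToyAudioEncoder()(audio), question, answer)
print(logits.shape)                            # (2, 32, 1000)
```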
```bash
# Clone the repo
git clone https://github.com/qiuqiangkong/audio_understanding
cd audio_understanding

# Install Python environment
conda create --name audio_understanding python=3.10

# Activate environment
conda activate audio_understanding

# Install Python package dependencies
bash env.sh
```
Music tagging is the task of predicting tags for an audio clip, such as "classical", "country", and "blues".
Users need to download the GTZAN dataset (1.3 GB, 8 hours):

```bash
bash ./scripts/download_gtzan.sh
```
The downloaded dataset after decompression looks like:

```
gtzan (1.3 GB)
└── genres
    ├── blues (100 files)
    ├── classical (100 files)
    ├── country (100 files)
    ├── disco (100 files)
    ├── hiphop (100 files)
    ├── jazz (100 files)
    ├── metal (100 files)
    ├── pop (100 files)
    ├── reggae (100 files)
    └── rock (100 files)
```
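As a quick sanity check of the download (separate from the repository code), one clip can be loaded with librosa. The dataset root below is an assumption; adjust it to wherever download_gtzan.sh placed the files.

```python
# Standalone sketch: load one GTZAN clip to verify the download.
import librosa

path = "./datasets/gtzan/genres/blues/blues.00000.au"  # hypothetical path
audio, sr = librosa.load(path, sr=None, mono=True)
print(f"{len(audio) / sr:.1f} s at {sr} Hz")  # GTZAN clips are ~30 s at 22,050 Hz
```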
Training for 100,000 steps takes ~3 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/music_tagging_gtzan.yaml"
```
Train a music tagging system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/music_tagging_gtzan
wget -O ./checkpoints/train/music_tagging_gtzan/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/music_tagging_gtzan_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/music_tagging_gtzan.yaml" \
    --ckpt_path="./checkpoints/train/music_tagging_gtzan/step=100000.pth" \
    --audio_path="./assets/audios/gtzan_blues.00002.au"
```
| Task | Training Dataset | Loss | Test audio | Output |
|---|---|---|---|---|
| Music Tagging | GTZAN (size: 8 h) | (loss curve) | gtzan.mp4 | blues |
ASR is the task of transcribing spoken-language audio into text.
Users need to download the LibriSpeech dataset (60 GB, 1,000 hours):

```bash
bash ./scripts/download_librispeech.sh
```
The downloaded dataset after decompression looks like:

```
librispeech (60 GB)
├── dev-clean (40 folders)
│   ├── 1272 (3 folders)
│   │   ├── 128104
│   │   │   ├── 1272-128104-0000.flac
│   │   │   ├── ...
│   │   │   ├── 1272-128104-0014.flac
│   │   │   └── 1272-128104.trans.txt
│   │   ...
│   ...
├── dev-other (33 folders)
├── test-clean (40 folders)
├── test-other (33 folders)
├── train-clean-100 (251 folders)
├── train-clean-360 (921 folders)
├── train-other-500 (1166 folders)
├── BOOKS.TXT
├── CHAPTERS.TXT
├── LICENSE.TXT
├── README.TXT
└── SPEAKERS.TXT
```
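LibriSpeech pairs audio with text through the *.trans.txt files: each line is "&lt;utterance-id&gt; &lt;transcript&gt;", and the corresponding &lt;utterance-id&gt;.flac sits in the same folder. The following standalone sketch (not part of the repository) illustrates this pairing; the dataset root is an assumption.

```python
# Standalone sketch: pair LibriSpeech flac files with their transcripts.
from pathlib import Path
import soundfile as sf

chapter_dir = Path("./datasets/librispeech/dev-clean/1272/128104")  # hypothetical path
trans_file = chapter_dir / "1272-128104.trans.txt"

for line in trans_file.read_text().splitlines()[:3]:
    utt_id, transcript = line.split(" ", 1)
    audio, sr = sf.read(chapter_dir / f"{utt_id}.flac")
    print(utt_id, f"{len(audio) / sr:.1f} s:", transcript[:50])
```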
Training for 100,000 steps takes ~8 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/asr_librispeech.yaml"
```
Train an ASR system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/asr_librispeech
wget -O ./checkpoints/train/asr_librispeech/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/asr_librispeech_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/asr_librispeech.yaml" \
    --ckpt_path="./checkpoints/train/asr_librispeech/step=100000.pth" \
    --audio_path="./assets/audios/librispeech_1688-142285-0000.flac"
```
Audio captioning is the task of predicting a text caption that describes an audio clip.
Users need to download the Clotho dataset (7.3 GB, 24 hours):

```bash
bash ./scripts/download_clotho.sh
```
The downloaded dataset after decompression looks like:

```
clotho (7.3 GB)
├── clotho_audio_development (2894 wavs)
├── clotho_audio_evaluation (1046 wavs)
├── clotho_captions_development.csv
├── clotho_captions_evaluation.csv
├── clotho_metadata_development.csv
├── clotho_metadata_evaluation.csv
└── LICENSE
```
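Each row of the Clotho captions CSVs pairs one wav file with five human-written captions. The standalone sketch below (with an assumed dataset root and the standard Clotho column names) shows how to browse the annotations.

```python
# Standalone sketch: inspect one Clotho audio/caption pair.
import pandas as pd

df = pd.read_csv("./datasets/clotho/clotho_captions_development.csv")  # hypothetical path
row = df.iloc[0]
print(row["file_name"])    # wav file in clotho_audio_development
print(row["caption_1"])    # one of the five captions (caption_1 ... caption_5)
```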
Training for 100,000 steps takes ~8 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/audio_caption_clotho.yaml"
```
Train an audio caption system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/audio_caption_clotho
wget -O ./checkpoints/train/audio_caption_clotho/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/audio_caption_clotho_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/audio_caption_clotho.yaml" \
    --ckpt_path="./checkpoints/train/audio_caption_clotho/step=100000.pth" \
    --audio_path="./assets/audios/clotho_birds_long.wav"
```
| Task | Training Dataset | Loss | Test audio | Output |
|---|---|---|---|---|
| Audio Caption | Clotho (size: 24 h) | (loss curve) | clotho.mp4 | birds chirping and a passing of a car outdoors |
Piano transcription is the task of transcribing a piano recording into a MIDI file.
Users need to download the MAESTRO dataset (131 GB, 199 hours):

```bash
bash ./scripts/download_maestro.sh
```
The downloaded dataset after decompression looks like:

```
maestro-v3.0.0 (131 GB)
├── 2004 (132 songs, wav + flac + midi + tsv)
├── 2006 (115 songs, wav + flac + midi + tsv)
├── 2008 (147 songs, wav + flac + midi + tsv)
├── 2009 (125 songs, wav + flac + midi + tsv)
├── 2011 (163 songs, wav + flac + midi + tsv)
├── 2013 (127 songs, wav + flac + midi + tsv)
├── 2014 (105 songs, wav + flac + midi + tsv)
├── 2015 (129 songs, wav + flac + midi + tsv)
├── 2017 (140 songs, wav + flac + midi + tsv)
├── 2018 (93 songs, wav + flac + midi + tsv)
├── LICENSE
├── maestro-v3.0.0.csv
├── maestro-v3.0.0.json
└── README
```
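The maestro-v3.0.0.csv metadata file lists, for every performance, its split, audio filename, MIDI filename, and duration, which is how audio/MIDI pairs are located. A standalone sketch for browsing it (the dataset root is an assumption):

```python
# Standalone sketch: browse the MAESTRO v3.0.0 metadata.
import pandas as pd

df = pd.read_csv("./datasets/maestro-v3.0.0/maestro-v3.0.0.csv")  # hypothetical path
print(df["split"].value_counts())                                  # train / validation / test sizes
print(df[["audio_filename", "midi_filename", "duration"]].head(3))
```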
Training for 100,000 steps takes ~8 hours on a single RTX 4090:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config="./configs/piano_transcription_maestro.yaml"
```
Train a piano transcription system by yourself or download a pretrained checkpoint:
```bash
mkdir -p ./checkpoints/train/piano_transcription_maestro
wget -O ./checkpoints/train/piano_transcription_maestro/step=100000.pth https://huggingface.co/qiuqiangkong/audio_understanding/resolve/main/piano_transcription_maestro_step%3D100000.pth?download=true
```
Then, run inference on a test audio clip:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --config="./configs/piano_transcription_maestro.yaml" \
    --ckpt_path="./checkpoints/train/piano_transcription_maestro/step=100000.pth" \
    --audio_path="./assets/audios/cut_liszt_5s.mp3"
```
| Task | Training Dataset | Loss | Test audio | Output |
|---|---|---|---|---|
| Piano Transcription | MAESTRO (199 h) | (loss curve) | piano.mp4 | output.mp4 |
We use the Hugging Face accelerate library to train the systems on multiple GPUs. train_accelerate.py adds only a few lines to train.py. Here is an example of running with 4 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --multi_gpu --num_processes 4 train_accelerate.py --config="./configs/asr_librispeech.yaml"
```
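For reference, the kind of change accelerate requires on top of a plain PyTorch training loop looks roughly like the sketch below (a toy example with stand-in model, optimizer, and dataloader; see train_accelerate.py for the actual code).

```python
# Toy sketch of the accelerate pattern: wrap objects with Accelerator.prepare()
# and replace loss.backward() with accelerator.backward(loss).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = nn.Linear(16, 1)                                  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=8)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    accelerator.backward(loss)    # replaces loss.backward()
    optimizer.step()
```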
The table below compares the loss when training with 1 GPU and with 4 GPUs. Training with 4 GPUs speeds up training by roughly 4 times.
| Task | Training Dataset | Train loss | Test loss |
|---|---|---|---|
| ASR | LibriSpeech (size: 1,000 h) | (loss curve) | (loss curve) |
The Llama model code is from: https://github.com/qiuqiangkong/mini_llm/blob/main/models/llama.py
This repository is released under the MIT license.
```bibtex
@article{kong2020panns,
  title={Panns: Large-scale pretrained audio neural networks for audio pattern recognition},
  author={Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Yuxuan and Wang, Wenwu and Plumbley, Mark D},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={28},
  pages={2880--2894},
  year={2020},
  publisher={IEEE}
}
```