This sub-repository provides the code for our Interspeech 2024 paper, including the setup procedure for the training caption data and the pre-training steps. If you use this code, please cite:
```bibtex
@article{niizumi2024m2d-clap,
    title   = {{M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Masahiro Yasuda and Shunsuke Tsubaki and Keisuke Imoto},
    journal = {to appear at Interspeech},
    year    = {2024},
    url     = {https://arxiv.org/abs/2406.02032}
}
```
Our implementation does not convert texts into sentence (semantic) embeddings on the fly. Instead, we convert them into embeddings in advance (offline) in steps 2 and 3 below (a minimal sketch of this offline step follows the list).
1. Prepare for the M2D pre-training on AudioSet by following "3. Pre-training From Scratch".
    - Especially, configure `data/audioset_lms` according to the "Example preprocessing steps (AudioSet)".
2. Run `Note-AutoACD-GTEbase.ipynb` to create `data/capemb_GTEbase_Audo_A_C_D.npy` for Auto-ACD captions.
3. Run `Note-ACalt4_GTEbase.ipynb` to create `data/capemb_GTEbase_AC_alt_4.npy` for AudioCaps Alternative 4 Captions (ACalt4).
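As a rough illustration of what the notebooks in steps 2 and 3 do, here is a minimal sketch of the offline caption-embedding step. It assumes a hypothetical `captions.csv` with `file_name` and `caption` columns and the `thenlper/gte-base` sentence encoder; the notebooks define the exact input format and output structure, so treat this as a sketch, not the repository's implementation.

```python
# Minimal sketch of offline caption embedding (illustrative, not the notebooks' exact code).
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# GTE-base sentence encoder (768-d embeddings).
model = SentenceTransformer('thenlper/gte-base')

# Hypothetical caption table: one row per AudioSet clip.
df = pd.read_csv('captions.csv')  # assumed columns: file_name, caption

# Encode all captions offline.
embs = model.encode(df['caption'].tolist(), batch_size=256,
                    normalize_embeddings=True, show_progress_bar=True)

# Store as {file_name: embedding} so the trainer can look up captions by sample.
np.save('data/capemb_example.npy',
        {name: emb for name, emb in zip(df['file_name'], embs)})
```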
In summary, the following data should be ready:

- `data/audioset_lms` -- The log-mel spectrogram audio samples (many `.npy` files).
- `data/files_audioset.csv` -- The list of the samples in `data/audioset_lms`.
- `data/capemb_GTEbase_Audo_A_C_D.npy` -- The caption embeddings of Auto-ACD.
- `data/capemb_GTEbase_AC_alt_4.npy` -- The caption embeddings of ACalt4.
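Before launching pre-training, a quick sanity check like the following can confirm the files are in place. It assumes the caption embeddings are stored as a pickled dict, as in the sketch above; adjust to whatever structure the notebooks actually produce.

```python
# Sanity-check the prepared data (paths from the list above; the dict layout
# of the .npy files is an assumption -- adjust to your notebooks' output).
from pathlib import Path
import numpy as np
import pandas as pd

assert Path('data/audioset_lms').is_dir(), 'log-mel spectrograms missing'
files = pd.read_csv('data/files_audioset.csv')
print(f'{len(files)} samples listed')

for f in ['data/capemb_GTEbase_Audo_A_C_D.npy', 'data/capemb_GTEbase_AC_alt_4.npy']:
    caps = np.load(f, allow_pickle=True).item()
    print(f'{f}: {len(caps)} caption embeddings')
```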
The exact pre-training command line we used is as follows:
```sh
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m semantics.train_clap --input_size 80x608 --patch_size 16x16 --epochs 300 --batch_size 512 --accum_iter 1 --save_freq 50 --seed 3 --model m2d_clap_vit_base --file_caption data/capemb_GTEbase_Audo_A_C_D.npy,data/capemb_GTEbase_AC_alt_4.npy --loss_off .01
```
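Note that if the trainer follows the usual MAE-style convention of a per-GPU `--batch_size`, this command corresponds to an effective batch size of 4 GPUs × 512 × 1 accumulation step = 2048.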
Quick example: `examples/Example_4_CLAP.ipynb`.
The evaluation steps follow those of the original M2D.
For the zero-shot evaluation, refer to `../all_eval.sh`, which contains all the command lines exactly as used for the paper.
Refer to the ACalt4 repository for details.
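For reference, zero-shot inference with a CLAP-style model boils down to comparing L2-normalized audio and caption embeddings in the shared space. The sketch below is illustrative only; `encode_audio` and `encode_text` are hypothetical names, not this repository's API.

```python
# Hedged sketch of CLAP-style zero-shot classification: both branches map into
# a shared embedding space, and the most similar class caption wins.
import torch
import torch.nn.functional as F

def zero_shot_classify(model, audio, class_captions):
    """Return the index of the caption most similar to the audio clip."""
    with torch.no_grad():
        a = F.normalize(model.encode_audio(audio), dim=-1)          # (1, D)
        t = F.normalize(model.encode_text(class_captions), dim=-1)  # (C, D)
    return (a @ t.T).argmax(dim=-1)  # cosine similarity -> best class index
```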
| Description | Notebook |
|---|---|
| Zero-shot ESC-50 classification with M2D-CLAP | |
| Audio feature visualization example with M2D-CLAP | |
Figure: The t-SNE visualization of the audio embeddings encoded by M2D-CLAP. The conventional audio embeddings are the output of the audio encoder for transfer learning; the CLAP audio embeddings are the output of the audio projector for zero-shot (ZS) inference.
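A minimal sketch of how such a plot can be produced with scikit-learn, using random stand-in data in place of real M2D-CLAP embeddings and labels:

```python
# t-SNE sketch; replace the stand-in arrays with embeddings from the audio
# encoder (transfer learning) or the audio projector (zero-shot inference).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embs = np.random.randn(500, 768).astype(np.float32)  # stand-in (N, D) embeddings
labels = np.random.randint(0, 10, 500)               # stand-in class labels

points = TSNE(n_components=2, random_state=0).fit_transform(embs)
plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap='tab10')
plt.title('t-SNE of audio embeddings')
plt.show()
```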