July 2024: We are working hard on the leaderboard and the speech translation dataset. Stay tuned!
July 2024: Support for all 26 datasets listed in the AudioBench manuscript.
🔧 Installation
Installation with pip:
pip install -r requirements.txt
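If you are starting from a clean environment, a minimal setup sketch looks like the following; the repository URL and the use of a virtual environment are illustrative assumptions rather than requirements stated here.
# minimal sketch of a fresh setup (repo URL assumed to be the official AudioBench repository)
git clone https://github.com/AudioLLMs/AudioBench.git
cd AudioBench
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt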
For model-as-judge evaluation, we serve the judge model as a service via vLLM on port 5000.
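The provided hosting scripts (see Quick Start below) take care of this, but as a rough sketch, serving an OpenAI-compatible judge on port 5000 with vLLM and checking that it is reachable could look like the following; the model name and GPU count here are assumptions for illustration, not values taken from the scripts.
# illustrative only -- prefer the host_model_judge_*.sh scripts shipped with the repo
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 2 \
    --port 5000
# once the server is up, the OpenAI-compatible endpoint should answer on port 5000
curl http://localhost:5000/v1/models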
⏩ Quick Start
This example hosts a Llama-3-70B-Instruct model as the judge and runs the cascade Whisper + Llama-3 model as the AudioLLM under evaluation.
# Step 1:
# Serve the model as judge.
# It will auto-download the model and may require verification from Hugging Face.
# In the demo, we use 2 H100 80G GPUs to host the model.
# For smaller VRAM, you may need to reduce the model size.
# bash host_model_judge_llama_3_70b_instruct.sh
# Another option (recommended) is to use the quantized model, which can be hosted on 2*40G GPUs.
bash host_model_judge_llama_3_70b_instruct_awq.sh
# Step 2:
# The example is done with 3 H100 80G GPUs.
# The AudioLLM inference runs on GPU 2, since GPUs 0 and 1 are used to host the model-as-judge service.
# This setting evaluates on just 50 samples.
MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
GPU=2
BATCH_SIZE=1
METRICS=llama3_70b_judge_binary
OVERWRITE=True
NUMBER_OF_SAMPLES=50
DATASET=cn_college_listen_mcq_test
bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES

# Step 3:
# The results will look like:
# {
#     "llama3_70b_judge_binary": {
#         "judge_score": 90.0,
#         "success_rate": 1.0
#     }
# }
# This indicates that the cascade model achieves 90% accuracy on the MCQ task for the English listening test.
The example above shows how to get started. To evaluate on the full datasets, please refer to Examples.
# After downloading the model weights, run the evaluation script for all datasets
bash examples/eval_salmonn_7b.sh
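If you prefer to drive eval.sh directly rather than using the per-model example scripts, a simple loop over datasets works; note that "another_dataset_placeholder" below is a hypothetical name for illustration, so substitute real dataset names from the manuscript's dataset list.
# hypothetical sketch: loop eval.sh over several datasets for one model
MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
GPU=2
BATCH_SIZE=1
METRICS=llama3_70b_judge_binary
OVERWRITE=True
NUMBER_OF_SAMPLES=50   # raise this (or follow the Examples scripts) to cover the full dataset
for DATASET in cn_college_listen_mcq_test another_dataset_placeholder; do
    bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
done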
More models are available in this survey.
To add a new model, please refer to Adding a New Model.
📖 Citation
If you find our work useful, please consider citing our paper!
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={arXiv preprint arXiv:2406.16020},
year={2024}
}