Welcome to the repo of Crab! If our project helps you, please give us a star ⭐ on GitHub to support us.
- [2025.06.04] Upload the YouTube presentation video.
- [2025.03.20] Release pretrain weights, finetune weights and the AV-UIE dataset.
- [2025.03.17] Release the project demo video.
- [2025.03.15] Release training, evaluation and inference code of Crab.
- [2025.02.27] Crab has been accepted to CVPR 2025.
-en7GAdXAQk.mp4
Predict: A man is giving a speech from a podium in a classroom. The man speaks from the beginning of the video until the 8th second. So the audible and visible event in the video is Male speech, man speaking, and the time range is 0,8.
6aV6c8mY0lg.mp4
Predict: The video shows a man using a chainsaw to cut a tree. The chainsaw is visible and running from the beginning of the video (0 seconds) to the end (10 seconds). Therefore, visual and audio events are: Chainsaw,(0 10) and Chainsaw,(0 10)
00006835.mp4
Answer: saxophone
Pred: In the video, four men are playing musical instruments. The man on the left is playing the saxophone, the man in the middle is playing the double bass, the man on the right is playing the guitar, and the man in the back is playing the drums. At the beginning of the video, the man in the middle starts playing the double bass first, and then the other instruments start playing. So the instrument on the left of the first sounding instrument is the saxophone. So the answer is saxophone.
Basic Dependencies:
- Python == 3.9
- PyTorch == 2.1.0
- transformers == 4.37.2
- deepspeed == 0.12.6
Install required packages:
```bash
git clone git@github.com:GeWu-Lab/Crab.git
cd Crab
pip install -r requirements.txt
```
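After installation, a quick way to confirm that the pinned versions listed above are actually in the environment is a short check like this (a minimal sketch; the distribution names are the usual PyPI ones, e.g. `torch` for PyTorch):

```python
# Sanity-check installed package versions against the pins listed above.
import importlib.metadata as md

pins = {"torch": "2.1.0", "transformers": "4.37.2", "deepspeed": "0.12.6"}
for pkg, want in pins.items():
    try:
        have = md.version(pkg)
    except md.PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED (expected {want})")
        continue
    note = "" if have == want else f" (expected {want})"
    print(f"{pkg}: {have}{note}")
```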
Multi-modal Encoder Weights:
- Download the visual encoder openai-clip-vit-large-patch14
- Download the audio encoder Fine-tuned BEATs_iter3+ (AS2M)
LLM Weights:
- Download LLaMA-2-Chat-HF
- Download finetune weights into `ckpt_dir`, and AVS_finetune_weights and AVSS_finetune_weights into `avs_ckpt_dir`;
- Prepare your test samples in `data/example.json` like this (a small validation sketch follows the example):
```json
[
    {
        "task": "avqa",
        "audio_path": "assets/example/avqa/00006835.mp3",
        "video_path": "assets/example/avqa/00006835.mp4",
        "question": "What is the left instrument of the first sounding instrument?"
    },
    {
        "task": "ave",
        "audio_path": "assets/example/ave/-67UNKFmRLk.mp3",
        "video_path": "assets/example/ave/-67UNKFmRLk.mp4"
    },
    {
        "task": "avvp",
        "audio_path": "assets/example/avvp/6aV6c8mY0lg.mp3",
        "video_path": "assets/example/avvp/6aV6c8mY0lg.mp4"
    },
    {
        "task": "arig",
        "audio_path": "assets/example/arig/audio.wav",
        "image_path": "assets/example/arig/1.jpg"
    },
    {
        "task": "s4",
        "audio_path": "assets/example/s4/audio.wav",
        "image_path": "assets/example/s4/0.jpg",
        "mask_path": "assets/example/s4/0.png"
    },
    {
        "task": "ms3",
        "audio_path": "assets/example/ms3/audio.wav",
        "image_path": "assets/example/ms3/1.jpg",
        "mask_path": "assets/example/ms3/1.png"
    },
    {
        "task": "ref-avs",
        "audio_path": "assets/example/ref-avs/audio.wav",
        "image_path": "assets/example/ref-avs/7.jpg",
        "mask_path": "assets/example/ref-avs/00007.png",
        "exp": "making the loudest sound"
    },
    {
        "task": "avss",
        "audio_path": "assets/example/avss/audio.wav",
        "image_path": "assets/example/avss/0.jpg",
        "mask_path": "assets/example/avss/0.png"
    }
]
```
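Before running inference on your own samples, it can help to sanity-check `data/example.json` with a small loader like the one below. This is a minimal sketch; the required keys per task are inferred from the example entries above.

```python
# Validate data/example.json entries against the fields shown in the example.
import json

# Keys each task is expected to carry, inferred from the example above.
REQUIRED = {
    "avqa": {"audio_path", "video_path", "question"},
    "ave": {"audio_path", "video_path"},
    "avvp": {"audio_path", "video_path"},
    "arig": {"audio_path", "image_path"},
    "s4": {"audio_path", "image_path", "mask_path"},
    "ms3": {"audio_path", "image_path", "mask_path"},
    "ref-avs": {"audio_path", "image_path", "mask_path", "exp"},
    "avss": {"audio_path", "image_path", "mask_path"},
}

with open("data/example.json") as f:
    samples = json.load(f)

for i, sample in enumerate(samples):
    task = sample.get("task")
    if task not in REQUIRED:
        print(f"sample {i}: unknown task {task!r}")
        continue
    missing = REQUIRED[task] - sample.keys()
    if missing:
        print(f"sample {i} ({task}): missing keys {sorted(missing)}")
    else:
        print(f"sample {i} ({task}): OK")
```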
- Infer.
- For the MUSIC-AVQA task, set `avqa_task=True` and `ckpt_dir=<your ckpt_dir>` in `scripts/quick_start.sh`, then run:
```bash
bash scripts/quick_start.sh
```
- For the S4, MS3 and Ref-AVS tasks, set `<your task>=True` and `avs_ckpt_dir=<your avs_ckpt_dir>`.
- For the AVSS task, set `avss_task=True` and `avs_ckpt_dir=<your avss_ckpt_dir>`. (The flag-to-checkpoint pairing is summarized in the sketch below.)
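For reference, the pairing between task flags and checkpoint variables described above can be summarized as follows. This is a sketch for orientation only: the `avqa_task` and `avss_task` names come from the steps above, while `s4_task`, `ms3_task` and `refavs_task` are assumed spellings of `<your task>`.

```python
# Which checkpoint variable each task flag in scripts/quick_start.sh pairs with,
# per the steps above; the s4/ms3/ref-avs flag spellings are assumptions.
CKPT_VAR_FOR_FLAG = {
    "avqa_task": "ckpt_dir",        # MUSIC-AVQA finetune weights
    "s4_task": "avs_ckpt_dir",      # AVS_finetune_weights
    "ms3_task": "avs_ckpt_dir",     # AVS_finetune_weights
    "refavs_task": "avs_ckpt_dir",  # AVS_finetune_weights
    "avss_task": "avs_ckpt_dir",    # AVSS_finetune_weights
}

for flag, ckpt_var in CKPT_VAR_FOR_FLAG.items():
    print(f"{flag} -> {ckpt_var}")
```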
- Pretrain.
- Use our pretrained checkpoints: download the audio pretrain checkpoint, visual pretrain checkpoint and segmentation pretrain checkpoint into `pretrain_ckpt_dir`;
- Or pretrain based on the LLaMA-7b-Chat-hf model: download the image and video pretrain dataset from Video-LLaVA, the audio pretrain dataset from AudioCaps, and the segmentation pretrain dataset from LVIS.
For visual pretrain, run:
```bash
bash scripts/pretrain/pretrain_visual.sh
```
For audio pretrain, run:
```bash
bash scripts/pretrain/pretrain_audio.sh
```
For segmentation pretrain, run:
```bash
bash scripts/pretrain/pretrain_seg.sh
```
- Finetune.
- Download the AV-UIE dataset annotations and raw data from AVE, AVVP, AVS, Ref-AVS, MUSIC-AVQA and VALOR, then modify the `data_root` in `dataset/unified_dataset.py` (see the sketch after these steps);
- Jointly train on all tasks:
```bash
bash scripts/finetune/finetun_hyper_lora.sh
```
- Jointly train on the AVS tasks: set `finetune_ckpt_dir=<your finetune ckpt dir>` in step 3, then run:
```bash
bash scripts/finetune/finetune_hyper_lora_avs.sh
```
- To run inference with the finetuned model:
```bash
bash scripts/finetune/inference_hyper_lora.sh
```
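The `data_root` edit above might look like the following. This is a minimal sketch: the actual variable layout in `dataset/unified_dataset.py` and the per-benchmark subdirectory names are assumptions.

```python
# dataset/unified_dataset.py (sketch): point data_root at the directory holding
# the AV-UIE annotations and the raw data downloaded from each benchmark.
import os

data_root = "/path/to/AV-UIE"  # change this to your local download location

# Hypothetical per-benchmark layout, for illustration only:
ave_root = os.path.join(data_root, "AVE")
avvp_root = os.path.join(data_root, "AVVP")
avs_root = os.path.join(data_root, "AVS")
ref_avs_root = os.path.join(data_root, "REFAVS")
music_avqa_root = os.path.join(data_root, "MUSIC-AVQA")
```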
If you find Crab useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{du2025crab,
  title={Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation},
  author={Du, Henghui and Li, Guangyao and Zhou, Chang and Zhang, Chunjie and Zhao, Alan and Hu, Di},
  journal={arXiv preprint arXiv:2503.13068},
  year={2025}
}
```
This project is released under the Apache 2.0 license as found in the LICENSE file. Please get in touch with us if you find any potential violations.