
ClearerVoice-Studio: Target Speaker Extraction Algorithms

Table of Contents

1. Introduction
2. Usage
3. Audio-only speaker extraction conditioned on reference speech
4. Audio-visual speaker extraction conditioned on face or lip recordings
5. Audio-visual speaker extraction conditioned on body gestures
6. Neuro-steered speaker extraction conditioned on EEG signals

1. Introduction

This repository provides training scripts for various target speaker extraction algorithms, including audio-only, audio-visual, and neuro-steered speaker extraction.

2. Usage

Step-by-Step Guide

1. Clone the Repository

```bash
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```

2. Create Conda Environment

```bash
cd ClearerVoice-Studio/train/target_speaker_extraction/
conda create -n clear_voice_tse python=3.9
conda activate clear_voice_tse
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
3. Download Dataset

Follow the download links or preprocessing scripts provided under each task section.

4. Modify Dataset Paths

Update the dataset paths in the configuration files. For example, set "audio_direc" and "ref_direc" in "config/config_YGD_gesture_seg_2spk.yaml" to the locations of your data (see the sketch below).
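A minimal sketch of this step, assuming you edit the file by hand: the key names "audio_direc" and "ref_direc" come from the config file mentioned above, while the example path values are placeholders for your own dataset locations, and the remaining keys in the YAML file may differ.

```bash
# Show the current dataset-path entries in the gesture-extraction config
grep -n "direc" config/config_YGD_gesture_seg_2spk.yaml

# After editing, the entries are expected to look roughly like:
#   audio_direc: /data/YGD/audio/      # placeholder path - point this at your audio data
#   ref_direc:   /data/YGD/gesture/    # placeholder path - point this at your reference data
```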

5. Modify Train Configuration

Adjust the settings in the "train.sh" file. For example, set "n_gpu=1" for single-GPU training, or "n_gpu=2" for two-GPU distributed training (see the sketch below).
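For illustration, the variables you would typically touch in "train.sh" might look like the following. Only "n_gpu" is confirmed by this guide; the other variable names and values are assumptions and may differ in your copy of the script.

```bash
# Sketch of typical train.sh settings (names other than n_gpu are assumed)
n_gpu=2                                           # 1 = single-GPU training, 2 = two-GPU distributed training
config=config/config_YGD_gesture_seg_2spk.yaml    # assumed: training configuration to use
checkpoint_dir=./checkpoints/                     # assumed: where checkpoints and TensorBoard logs are written
```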

6. Start Training

```bash
bash train.sh
```

7. Visualize Training Progress Using TensorBoard

```bash
tensorboard --logdir ./checkpoints/
```

8. Optionally Evaluate Checkpoints

```bash
bash evaluate_only.sh
```

3. Audio-only speaker extraction conditioned on reference speech

Supported datasets for training:

Supported models for training:

Non-causal (Offline) WSJ0-2mix benchmark:

| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| WSJ0-2mix | 2-mix | SpEx+ | Paper | - | 16.9 | 17.2 |
| WSJ0-2mix | 2-mix | SpEx+ | This repo | This repo | 17.1 | 17.5 |

4. Audio-visual speaker extraction conditioned on face or lip recordings

Supported datasets for training:

Supported models for training:

Non-causal (Offline) VoxCeleb2-mix benchmark:

| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| VoxCeleb2 | 2-mix | AV-ConvTasNet | Paper | - | 10.6 | 10.9 |
| VoxCeleb2 | 2-mix | MuSE | Paper | - | 11.7 | 12.0 |
| VoxCeleb2 | 2-mix | reentry | Paper | - | 12.6 | 12.9 |
| VoxCeleb2 | 2-mix | AV-DPRNN | This repo | This repo | 11.5 | 11.8 |
| VoxCeleb2 | 2-mix | AV-TFGridNet | This repo | This repo | 13.7 | 14.1 |
| VoxCeleb2 | 2-mix | AV-Mossformer2 | This repo | This repo | 14.6 | 14.9 |
| VoxCeleb2 | 3-mix | AV-ConvTasNet | Paper | - | 9.8 | 10.2 |
| VoxCeleb2 | 3-mix | MuSE | Paper | - | 11.6 | 12.2 |
| VoxCeleb2 | 3-mix | reentry | Paper | - | 12.6 | 13.1 |
| VoxCeleb2 | 3-mix | AV-DPRNN | This repo | This repo | 10.5 | 11.0 |
| VoxCeleb2 | 3-mix | AV-TFGridNet | This repo | This repo | 14.2 | 14.6 |
| VoxCeleb2 | 3-mix | AV-Mossformer2 | This repo | This repo | 15.5 | 16.0 |

Non-causal (Offline) LRS2-mix benchmark:

| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| LRS2 | 2-mix | AV-ConvTasNet | This repo | This repo | 11.6 | 11.9 |
| LRS2 | 2-mix | AV-DPRNN | This repo | This repo | 12.0 | 12.4 |
| LRS2 | 2-mix | AV-TFGridNet | This repo | This repo | 15.1 | 15.4 |
| LRS2 | 2-mix | AV-Mossformer2 | This repo | This repo | 15.5 | 15.8 |
| LRS2 | 3-mix | AV-ConvTasNet | This repo | This repo | 10.8 | 11.3 |
| LRS2 | 3-mix | AV-DPRNN | This repo | This repo | 10.6 | 11.1 |
| LRS2 | 3-mix | AV-TFGridNet | This repo | This repo | 15.0 | 15.4 |
| LRS2 | 3-mix | AV-Mossformer2 | This repo | This repo | 16.2 | 16.6 |

5. Audio-visual speaker extraction conditioned on body gestures

Supported datasets for training:

Supported models for training:

Non-causal (Offline) YGD-mix benchmark:

| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| YGD | 2-mix | DPRNN-GSR | Paper | - | 6.2 | 8.1 |
| YGD | 2-mix | SEG | Paper | - | 9.1 | 10.0 |
| YGD | 2-mix | SEG | This repo | This repo | 9.5 | 10.4 |
| YGD | 3-mix | DPRNN-GSR | Paper | - | 1.8 | 3.5 |
| YGD | 3-mix | SEG | Paper | - | 5.0 | 5.3 |
| YGD | 3-mix | SEG | This repo | This repo | 4.9 | 5.6 |

6. Neuro-steered speaker extraction conditioned on EEG signals

Supported datasets for training:

Supported models for training:

Non-causal (Offline) KUL-mix benchmark:

| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| KUL | 2-mix | NeuroHeed | Paper | - | 14.3 | 15.5 |
| KUL | 2-mix | NeuroHeed | This repo | This repo | 13.4 | 15.0 |

Causal (online) KUL-mix benchmark:

| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| KUL | 2-mix | NeuroHeed | Paper | - | 11.2 | 11.8 |