- 1. Introduction
- 2. Usage
- 3. Task: Audio-only Speaker Extraction Conditioned on a Reference Speech
- 4. Task: Audio-visual Speaker Extraction Conditioned on Face (Lip) Recording
- 5. Task: Audio-visual Speaker Extraction Conditioned on Body Gestures
- 6. Task: Neuro-steered Speaker Extraction Conditioned on EEG Signals
## 1. Introduction

This repository provides training scripts for various target speaker extraction algorithms, including audio-only, audio-visual, and neuro-steered speaker extraction.
## 2. Usage

- Clone the Repository

```bash
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```

- Create Conda Environment

```bash
cd ClearerVoice-Studio/train/target_speaker_extraction/
conda create -n clear_voice_tse python=3.9
conda activate clear_voice_tse
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
- Download Dataset
Follow the download links or preprocessing scripts provided under each task section.
- Modify Dataset Paths
Update the dataset paths in the configuration files. For example, set "audio_direc" and "ref_direc" in "config/config_YGD_gesture_seg_2spk.yaml".
- Modify Train Configuration
Adjust the settings in "train.sh". For example, set "n_gpu=1" for single-GPU training or "n_gpu=2" for two-GPU distributed training.
- Start Training

```bash
bash train.sh
```

- Visualize Training Progress using TensorBoard

```bash
tensorboard --logdir ./checkpoints/
```

- Optionally Evaluate Checkpoints

```bash
bash evaluate_only.sh
```
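The "Modify Dataset Paths" step above can also be scripted. The sketch below (an illustration, not part of the repo; it assumes the config is plain YAML with top-level `audio_direc:` and `ref_direc:` keys as named above, and the dataset paths shown are placeholders) rewrites those keys in a config string:

```python
import re


def set_config_paths(text: str, updates: dict) -> str:
    """Rewrite top-level `key: value` lines in a YAML config string."""
    for key, value in updates.items():
        pattern = rf"(?m)^{re.escape(key)}:.*$"
        line = f"{key}: {value}"
        if re.search(pattern, text):
            text = re.sub(pattern, line, text)
        else:
            # Append the key if the config does not define it yet.
            text = text.rstrip("\n") + f"\n{line}\n"
    return text


# Demonstration on an inline sample (placeholder keys and paths):
sample = "audio_direc: /old/audio\nref_direc: /old/ref\nn_spk: 2\n"
updated = set_config_paths(
    sample,
    {"audio_direc": "/data/mixtures/audio/", "ref_direc": "/data/mixtures/ref/"},
)
```

Applied to the repo, you would read e.g. "config/config_YGD_gesture_seg_2spk.yaml", transform its text the same way, and write it back before running `bash train.sh`.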
## 3. Task: Audio-only Speaker Extraction Conditioned on a Reference Speech

- WSJ0-2mix [Download]
- SpEx+ (Non-causal) [Paper: SpEx+: A Complete Time Domain Speaker Extraction Network]
| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| WSJ0-2mix | 2-mix | SpEx+ | Paper | - | 16.9 | 17.2 |
| WSJ0-2mix | 2-mix | SpEx+ | This repo | This repo | 17.1 | 17.5 |
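SI-SDRi and SDRi report the improvement of the extracted signal over the unprocessed mixture. For reference, here is a sketch of the standard scale-invariant SDR computation in NumPy (an illustration of the metric, not the repo's own evaluation code):

```python
import numpy as np


def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then compare target energy against residual energy."""
    est = est - est.mean()
    ref = ref - ref.mean()
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scaling factor
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))


def si_sdri(est: np.ndarray, ref: np.ndarray, mix: np.ndarray) -> float:
    """Improvement of the estimate over the raw mixture."""
    return si_sdr(est, ref) - si_sdr(mix, ref)


# Synthetic example: a clean target, an interferer, and a hypothetical output.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mix = speech + interferer
est = speech + 0.1 * interferer  # stand-in for an extractor's output
improvement = si_sdri(est, speech, mix)
```

Because the metric is scale-invariant, rescaling the estimate by any nonzero gain leaves its SI-SDR unchanged.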
## 4. Task: Audio-visual Speaker Extraction Conditioned on Face (Lip) Recording

- AV-ConvTasNet (Causal/Non-causal) [Paper: Time Domain Audio Visual Speech Separation]
- AV-DPRNN (aka USEV) (Non-causal) [Paper: Universal Speaker Extraction With Visual Cue]
- AV-TFGridNet (Non-causal) [Paper: Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction]
- AV-Mossformer2 (Non-causal) [Paper: ClearVoice]
| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| VoxCeleb2 | 2-mix | AV-ConvTasNet | Paper | - | 10.6 | 10.9 |
| VoxCeleb2 | 2-mix | MuSE | Paper | - | 11.7 | 12.0 |
| VoxCeleb2 | 2-mix | reentry | Paper | - | 12.6 | 12.9 |
| VoxCeleb2 | 2-mix | AV-DPRNN | This repo | This repo | 11.5 | 11.8 |
| VoxCeleb2 | 2-mix | AV-TFGridNet | This repo | This repo | 13.7 | 14.1 |
| VoxCeleb2 | 2-mix | AV-Mossformer2 | This repo | This repo | 14.6 | 14.9 |
| VoxCeleb2 | 3-mix | AV-ConvTasNet | Paper | - | 9.8 | 10.2 |
| VoxCeleb2 | 3-mix | MuSE | Paper | - | 11.6 | 12.2 |
| VoxCeleb2 | 3-mix | reentry | Paper | - | 12.6 | 13.1 |
| VoxCeleb2 | 3-mix | AV-DPRNN | This repo | This repo | 10.5 | 11.0 |
| VoxCeleb2 | 3-mix | AV-TFGridNet | This repo | This repo | 14.2 | 14.6 |
| VoxCeleb2 | 3-mix | AV-Mossformer2 | This repo | This repo | 15.5 | 16.0 |
| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| LRS2 | 2-mix | AV-ConvTasNet | This repo | This repo | 11.6 | 11.9 |
| LRS2 | 2-mix | AV-DPRNN | This repo | This repo | 12.0 | 12.4 |
| LRS2 | 2-mix | AV-TFGridNet | This repo | This repo | 15.1 | 15.4 |
| LRS2 | 2-mix | AV-Mossformer2 | This repo | This repo | 15.5 | 15.8 |
| LRS2 | 3-mix | AV-ConvTasNet | This repo | This repo | 10.8 | 11.3 |
| LRS2 | 3-mix | AV-DPRNN | This repo | This repo | 10.6 | 11.1 |
| LRS2 | 3-mix | AV-TFGridNet | This repo | This repo | 15.0 | 15.4 |
| LRS2 | 3-mix | AV-Mossformer2 | This repo | This repo | 16.2 | 16.6 |
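The "2-mix" and "3-mix" conditions refer to simulated mixtures of two or three overlapping speakers. A common way such training mixtures are built (a generic sketch, not the repo's data-preparation script; the SNR range here is an illustrative assumption) is to rescale one utterance relative to another at a randomly drawn SNR and sum:

```python
import numpy as np


def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `interferer` so the target-to-interferer ratio equals `snr_db`, then sum."""
    target_power = np.mean(target ** 2)
    interf_power = np.mean(interferer ** 2)
    # Gain that places the interferer snr_db below (or above) the target.
    gain = np.sqrt(target_power / (interf_power * 10.0 ** (snr_db / 10.0)))
    return target + gain * interferer


rng = np.random.default_rng(0)
spk1 = rng.standard_normal(32000)  # stand-ins for two 2 s utterances at 16 kHz
spk2 = rng.standard_normal(32000)
snr_db = rng.uniform(-5.0, 5.0)    # illustrative SNR range
mixture = mix_at_snr(spk1, spk2, snr_db)
```

A 3-mix can be produced the same way by adding a second interferer, each at its own drawn SNR relative to the target.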
## 5. Task: Audio-visual Speaker Extraction Conditioned on Body Gestures

- YGD [Download] [Paper: Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots]
- SEG (Non-causal) [Paper: Speaker Extraction with Co-Speech Gestures Cue]
| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| YGD | 2-mix | DPRNN-GSR | Paper | - | 6.2 | 8.1 |
| YGD | 2-mix | SEG | Paper | - | 9.1 | 10.0 |
| YGD | 2-mix | SEG | This repo | This repo | 9.5 | 10.4 |
| YGD | 3-mix | DPRNN-GSR | Paper | - | 1.8 | 3.5 |
| YGD | 3-mix | SEG | Paper | - | 5.0 | 5.3 |
| YGD | 3-mix | SEG | This repo | This repo | 4.9 | 5.6 |
## 6. Task: Neuro-steered Speaker Extraction Conditioned on EEG Signals

- KUL [Download] [Paper: Auditory-Inspired Speech Envelope Extraction Methods for Improved EEG-Based Auditory Attention Detection in a Cocktail Party Scenario]
- NeuroHeed (Non-causal) [Paper: Neuro-Steered Speaker Extraction Using EEG Signals]
| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| KUL | 2-mix | NeuroHeed | Paper | - | 14.3 | 15.5 |
| KUL | 2-mix | NeuroHeed | This repo | This repo | 13.4 | 15.0 |
| Dataset | Speakers | Model | Config | Checkpoint | SI-SDRi (dB) | SDRi (dB) |
|---|---|---|---|---|---|---|
| KUL | 2-mix | NeuroHeed | Paper | - | 11.2 | 11.8 |