ClearerVoice-Studio: Train Speech Separation Models

1. Introduction

This repository provides flexible training and fine-tuning scripts for speech separation models. Currently, it supports both 8 kHz and 16 kHz sampling rates:

| Model name | Sampling rate (Hz) | Paper link |
| --- | --- | --- |
| MossFormer2_SS_8K | 8000 | MossFormer2 (Paper, ICASSP 2024) |
| MossFormer2_SS_16K | 16000 | MossFormer2 (Paper, ICASSP 2024) |

MossFormer2 has achieved state-of-the-art speech separation performance since the paper was published at ICASSP 2024. It is a hybrid model that integrates a recurrent module into our previous MossFormer framework. MossFormer2 can model not only long-range, coarse-scale dependencies but also fine-scale recurrent patterns. For efficient self-attention over long sequences, MossFormer2 adopts the joint local-global self-attention strategy proposed for MossFormer, and it introduces a dedicated recurrent module to model intricate temporal dependencies within speech signals.

*Figure 1: Overview of the MossFormer2 hybrid architecture.*

Instead of applying recurrent neural networks (RNNs) with traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered an "RNN-free" recurrent network because it captures recurrent patterns without using recurrent connections. Our recurrent module mainly comprises an enhanced dilated FSMN block that uses gated convolutional units (GCU) and dense connections. In addition, a bottleneck layer and an output layer are added to control information flow. The recurrent module relies on linear projections and convolutions for seamless, parallel processing of the entire sequence.

*Figure 2: The FSMN-based recurrent module.*

MossFormer2 demonstrates remarkable performance on the WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks. Please refer to our Paper, or evaluate the individual models using the standalone script (link).

We will provide performance comparisons of our released models with publicly available models on the ClearVoice page.

2. Usage

Step-by-Step Guide

If you haven't created a Conda environment for ClearerVoice-Studio yet, follow steps 1 and 2. Otherwise, skip directly to step 3.

1. Clone the Repository

```bash
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```

2. Create Conda Environment

```bash
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
```
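Optionally, verify the environment before continuing (a quick sanity check; it assumes PyTorch is among the installed requirements):

```bash
# Print the installed PyTorch version to confirm the environment is ready
python -c "import torch; print(torch.__version__)"
```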
3. Prepare Dataset

If you want to try an experimental training run of the speech separation model, we suggest you prepare the training and testing data as follows:

- Step 1: Download the WSJ0 speech dataset from here (Link)

- Step 2: Use the mixture generation scripts in python format or matlab format to generate the mixture datasets, at a sampling rate of either 8000 Hz or 16000 Hz.

- Step 3: Create .scp files for the train, validation, and test sets, formatted as in data/tr_wsj0_2mix_16k.scp (a sketch follows this list).

- Step 4: Replace the tr_list and cv_list paths in config/train/MossFormer2_SS_16K.yaml with the paths to your .scp files.
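As a rough sketch of Step 3, an .scp file is a plain-text list with one utterance path per line. The paths below are hypothetical, and the exact line format should be checked against data/tr_wsj0_2mix_16k.scp in this repository:

```bash
# Hypothetical sketch: build a training list, one mixture file per line.
# Verify the exact line format against data/tr_wsj0_2mix_16k.scp.
ls /path/to/wsj0-2mix/tr/mix/*.wav > data/tr_wsj0_2mix_16k.scp

# Inspect the first entries
head -n 2 data/tr_wsj0_2mix_16k.scp
```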

4. Start Training

```bash
bash train.sh
```

You may need to set the correct network in train.sh and choose either a fresh training run or a fine-tuning run using:

```bash
network=MossFormer2_SS_16K    # train the MossFormer2_SS_16K model
train_from_last_checkpoint=1  # set to 1 to start training from the last checkpoint, if one exists
init_checkpoint_path=./       # path to your initial model when fine-tuning; otherwise set it to 'None'
```
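For example, to fine-tune from a previously downloaded checkpoint rather than resuming a run, the settings would look like this (a sketch; the checkpoint path is hypothetical, and the flag semantics follow the comments above):

```bash
network=MossFormer2_SS_16K                             # model to fine-tune
train_from_last_checkpoint=0                           # do not resume from a previous run
init_checkpoint_path=./checkpoints/MossFormer2_SS_16K  # hypothetical path to your initial checkpoint
```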