This repository provides flexible training and fine-tuning scripts for speech separation models. Currently, it supports both 8 kHz and 16 kHz sampling rates:
| Model name | Sampling rate | Paper link |
|---|---|---|
| MossFormer2_SS_8K | 8000 Hz | MossFormer2 (Paper, ICASSP 2024) |
| MossFormer2_SS_16K | 16000 Hz | MossFormer2 (Paper, ICASSP 2024) |
MossFormer2 achieved state-of-the-art speech separation performance as of its publication at ICASSP 2024. It is a hybrid model that integrates a recurrent module into our previous MossFormer framework. MossFormer2 is capable of modeling not only long-range, coarse-scale dependencies but also fine-scale recurrent patterns. For efficient self-attention over long sequences, MossFormer2 adopts the joint local-global self-attention strategy proposed for MossFormer. In addition, MossFormer2 introduces a dedicated recurrent module to model intricate temporal dependencies within speech signals.
Instead of applying recurrent neural networks (RNNs) with traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered an "RNN-free" recurrent network because it captures recurrent patterns without using recurrent connections. Our recurrent module mainly comprises an enhanced dilated FSMN block that uses gated convolutional units (GCU) and dense connections. In addition, a bottleneck layer and an output layer are added to control information flow. The recurrent module relies on linear projections and convolutions, enabling seamless, parallel processing of the entire sequence.
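To make the idea concrete, below is a minimal PyTorch sketch of such a gated, dilated FSMN-style block. This is an illustration under our own assumptions (layer names, sizes, and wiring are ours, not the released MossFormer2 code): the temporal "memory" is a depthwise dilated 1-D convolution, so the block captures temporal context fully in parallel, without any recurrent connection.

```python
import torch
import torch.nn as nn

class GatedDilatedFSMNBlock(nn.Module):
    """Illustrative sketch of an FSMN-style memory block with a gated
    convolutional unit (GCU). Not the MossFormer2 implementation."""

    def __init__(self, dim: int, memory_size: int = 7, dilation: int = 1):
        super().__init__()
        assert memory_size % 2 == 1, "odd memory size keeps sequence length unchanged"
        self.bottleneck = nn.Linear(dim, dim)  # bottleneck layer controlling information flow
        # FSMN "memory": a depthwise dilated 1-D convolution over time
        self.memory = nn.Conv1d(
            dim, dim,
            kernel_size=memory_size,
            dilation=dilation,
            groups=dim,  # depthwise: one memory filter per channel
            padding=dilation * (memory_size - 1) // 2,
        )
        self.gate = nn.Linear(dim, dim)  # gating branch of the GCU
        self.out = nn.Linear(dim, dim)   # output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); every step below runs in parallel over time
        h = self.bottleneck(x)
        m = self.memory(h.transpose(1, 2)).transpose(1, 2)  # temporal memory
        g = torch.sigmoid(self.gate(x))                      # GCU gate
        return self.out(g * (h + m)) + x                     # residual connection

# Example: a batch of 2 sequences, 100 frames, 64-dim features
block = GatedDilatedFSMNBlock(dim=64, memory_size=7, dilation=2)
y = block(torch.randn(2, 100, 64))
print(y.shape)  # torch.Size([2, 100, 64])
```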
MossFormer2 demonstrates remarkable performance on the WSJ0-2mix/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks. Please refer to our Paper, or evaluate the individual models using the standalone script (link).
We will provide performance comparisons of our released models with publicly available models on the ClearVoice page.
If you haven't created a Conda environment for ClearerVoice-Studio yet, follow steps 1 and 2. Otherwise, skip directly to step 3.
- Clone the Repository

```bash
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```

- Create Conda Environment

```bash
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
```
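To confirm the environment is ready, you can run a quick check (our suggestion, not part of the repo):

```python
# Quick environment check (optional): verify that PyTorch installed via
# requirements.txt imports correctly and report whether a GPU is visible.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```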
- Prepare Dataset
If you want to try an experimental training of the speech separation model, we suggest you prepare the training and testing data as follows:
- Step 1: Download the WSJ0 speech dataset from here (Link)
- Step 2: Use the mixture generation scripts in python format or matlab format to generate the mixture datasets, at a sampling rate of either 8000 Hz or 16000 Hz.
- Step 3: Create scp files for the train, validation, and test sets, following the format of `data/tr_wsj0_2mix_16k.scp`.
- Step 4: Replace the `tr_list` and `cv_list` scp paths in `config/train/MossFormer2_SS_16K.yaml` (a quick way to verify them is sketched below).
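Before launching training, you can sanity-check the configured paths with a few lines of Python (our suggestion, assuming `tr_list` and `cv_list` are top-level keys in the YAML, as this README implies):

```python
# Minimal sanity check (not part of the repo): confirm the scp paths set in
# the training config point to existing files before starting training.
import os
import yaml  # PyYAML

with open("config/train/MossFormer2_SS_16K.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("tr_list", "cv_list"):
    path = cfg.get(key)
    print(f"{key}: {path}")
    assert path and os.path.exists(path), f"{key} does not point to an existing scp file"
```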
- Start Training
```bash
bash train.sh
```

You may need to set the correct network in `train.sh` and choose either a fresh training run or a fine-tuning run using:

```bash
network=MossFormer2_SS_16K   # Train the MossFormer2_SS_16K model
train_from_last_checkpoint=1 # Set to 1 to resume from the last checkpoint, if one exists
init_checkpoint_path=./      # Path to your initial model when fine-tuning; otherwise, set it to 'None'
```
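As a rule of thumb (our reading of these variables, not an official statement): set `train_from_last_checkpoint=1` to resume an interrupted run; to fine-tune a released model, point `init_checkpoint_path` at its checkpoint; and for a fresh run from random initialization, leave `train_from_last_checkpoint` at 0 and set `init_checkpoint_path` to 'None'.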