This repository contains the code release for our paper Learning Speaker Embedding with Momentum Contrast.
The code has been developed with Kaldi and PyTorch. Kaldi is used only for feature extraction and post-processing; all neural networks are trained with PyTorch.
The purpose of this project is to make research on neural-network-based speaker verification easier. We also try to reproduce the results reported in our paper.
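Features are extracted with Kaldi and then loaded into Python for network training. A minimal sketch of that handoff, using the kaldi-io package listed below (the feats.scp path is illustrative, not fixed by the recipe):

```python
# Minimal sketch: load Kaldi-extracted features for PyTorch training
# via kaldi-io. The feats.scp path is illustrative.
import kaldi_io
import torch

for utt_id, mat in kaldi_io.read_mat_scp('data/train/feats.scp'):
    feats = torch.from_numpy(mat)  # (num_frames, feat_dim)
    # ... feed `feats` to the network ...
    break  # just showing the first utterance
```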
The code is tested with the following dependencies:
- Python: 3.6.8
- Kaldi: 5.5
- kaldi-io: 0.9.4
- numpy: 1.16.4
- Pillow: 6.2.1
- scikit-learn: 0.22.2.post1
- six: 1.13.0
- tensorboardX: 1.1
- torch: 1.2.0
In addition to Kaldi, you can install the requirements via pip with:
pip install -r requirements.txt
Set KALDI_ROOT properly in path.sh, then link the Kaldi utility scripts into the repository root:

ln -s $KALDI_ROOT/egs/wsj/s5/utils utils
To train the MoCo model with SpecAugment, run:
sh train_moco.sh \
--voxceleb1_root [voxceleb1 dir] \
--voxceleb2_root [voxceleb2 dir] \
--rirs_noises_root [rirs_noises dir] \
--musan_root [musan dir] \
--data [train data dir] \
--exp [exp dir]
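The core MoCo mechanism is a momentum-updated key encoder plus a queue of negative embeddings, trained with an InfoNCE loss. Below is a minimal sketch of that mechanism, not the repository's actual implementation; the queue size K, momentum m, and temperature T are assumed values:

```python
# Illustrative MoCo sketch for speaker-embedding pretraining; K, m,
# and T are assumptions, not this recipe's settings.
import torch
import torch.nn.functional as F

K = 65536   # negative-queue size (assumption)
m = 0.999   # key-encoder momentum (assumption)
T = 0.07    # softmax temperature (assumption)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k):
    # The key encoder is a slowly moving average of the query encoder.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data = p_k.data * m + p_q.data * (1.0 - m)

def moco_loss(q, k, queue):
    # q, k: L2-normalized embeddings of two augmented views (e.g., two
    # SpecAugment-masked chunks) of the same utterance, shape (N, C).
    # queue: (C, K) embeddings from earlier batches, used as negatives.
    l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)  # (N, 1)
    l_neg = torch.einsum('nc,ck->nk', q, queue)           # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / T
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)  # InfoNCE loss
```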
To train an AAM-Softmax model initialized from the pretrained MoCo model, run:
sh train_xvector.sh \
--voxceleb1_root [voxceleb1 dir] \
--voxceleb2_root [voxceleb2 dir] \
--rirs_noises_root [rirs_noises dir] \
--musan_root [musan dir] \
--moco_model [pretrained MoCo model] \
--data [train data dir] \
--exp [exp dir]
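AAM-Softmax adds an angular margin to the target-class logit before the softmax, which pulls same-speaker embeddings together on the unit hypersphere. A minimal sketch follows; the scale s and margin m are common defaults, not necessarily this recipe's values:

```python
# Illustrative AAM-Softmax (additive angular margin) layer; s and m
# are common defaults, not necessarily this recipe's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, embed_dim, num_classes, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Cosine similarity between normalized embeddings and weights.
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to each sample's target class.
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)
```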
Note: if the pretrained MoCo model does not exist, train_xvector.sh falls back to training a standard Xvector recipe from scratch.
After training, you can evaluate performance on the test set by running:
sh test.sh \
--data [test data dir] \
--exp [exp] \
--dir [trained model dir] \
--mdl [model name, default: final.pkl] \
--plda_score [score with PLDA if true, otherwise with cosine similarity; default: true]
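For intuition, cosine scoring and the EER metric reported below can be sketched as follows; the actual scoring in test.sh is driven through Kaldi and may differ:

```python
# Illustrative cosine scoring and EER computation for trial pairs;
# test.sh drives the real scoring through Kaldi.
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(emb1, emb2):
    # Cosine similarity between two speaker embeddings.
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

def compute_eer(scores, labels):
    # labels: 1 for target (same-speaker) trials, 0 for non-target.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2
```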
- Training data: All of Voxceleb2, plus the training portion of Voxceleb1.
- Test data: The test portion of Voxceleb1.
- Learning rate: for the standard Xvector and MoCo models, the learning rate starts at 1e-4 and is gradually reduced to 1e-5 over the course of training; for AAM-Softmax, it starts at 1e-5 and is reduced to 1e-6.
- Chunk size: sampled randomly from 200 to 400 frames (see the sketch after this list).
- Batch size: 1024.
- Backend classifier: PLDA and cosine scoring.
- Evaluated model: we select the epoch-900 checkpoint (checkpoint_e900.pkl) to evaluate performance, although AAM-Softmax trained from the pretrained MoCo model converges faster.
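As referenced above, here is a sketch of the random chunking and of an exponential learning-rate decay from 1e-4 to 1e-5; the optimizer choice and exact schedule shape are assumptions, not the scripts' settings:

```python
# Illustrative chunking and LR schedule; the optimizer and schedule
# shape are assumptions, not the scripts' exact settings.
import random
import torch

def sample_chunk(feats, min_len=200, max_len=400):
    # feats: (num_frames, feat_dim) tensor for one utterance.
    length = random.randint(min_len, max_len)
    start = random.randint(0, max(0, feats.size(0) - length))
    return feats[start:start + length]

model = torch.nn.Linear(30, 512)  # placeholder for the x-vector network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay by a constant per-epoch factor so the LR reaches 1e-5 at the end.
num_epochs = 900
gamma = (1e-5 / 1e-4) ** (1.0 / num_epochs)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
for epoch in range(num_epochs):
    # ... run one training epoch here ...
    scheduler.step()
```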
PLDA scoring:

Method | EER(%) | minDCF(0.01) | minDCF(0.001) |
---|---|---|---|
Ivector | 5.467 | 0.4859 | 0.6213 |
Xvector | 3.34 | 0.3795 | 0.6138 |
Xvector-AAM | 2.55 | 0.3464 | 0.5848 |
Xvector-AAM-MoCo | 2.423 | 0.2856 | 0.3850 |

Cosine scoring:

Method | EER(%) | minDCF(0.01) | minDCF(0.001) |
---|---|---|---|
Ivector | 14.65 | 0.7195 | 0.8661 |
Xvector | 7.349 | 0.5799 | 0.7418 |
Xvector-AAM | 2.306 | 0.2647 | 0.3372 |
Xvector-AAM-MoCo | 2.402 | 0.2232 | 0.3573 |
If you use this code, please cite the following paper:
Ke Ding, Xuanji He, Guanglu Wan. Learning Speaker Embedding with Momentum Contrast. arXiv preprint arXiv:2001.01986 (2020)
If you have any questions, please feel free to contact us:

Author | Email |
---|---|
Ke Ding | [email protected] |
Xuanji He | [email protected] |
Guanglu Wan | [email protected] |
The code is BSD-style licensed, as found in the LICENSE file.