Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement

Abstract

Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker’s speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives to impose speaker embedding consistency and waveform encoder invertibility, and we jointly train the speaker encoder and the speech separator to better capture the speaker conditional embedding. Furthermore, we leverage a multiscale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that using a dual-path transformer in the separator backbone, together with the proposed training paradigm, improves on the CNN baseline by 3.12 dB. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing methods by 4.1 dB on average without creating additional data dependency.

Project Page: https://tatban.github.io/spec-res/
Paper: https://arxiv.org/abs/2409.01352

Dataset

We assume the same dataset as the VoiceFilter paper. Before training, update the data paths in the corresponding .csv files in the data folder.
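After editing the paths, it can help to confirm that they resolve before starting a long training run. The sketch below is only an illustration: the column layout of the .csv files is not documented here, so it makes the assumption that any cell ending in an audio extension is a file path. The helper name find_missing_audio and the extension list are hypothetical and not part of this repository.

```python
import csv
from pathlib import Path

# Assumption: audio files are referenced with one of these extensions.
AUDIO_SUFFIXES = (".wav", ".flac")


def find_missing_audio(csv_file):
    """Return (row_number, path) pairs for audio paths that do not exist.

    Every cell is inspected because the real column layout is not known;
    cells that do not look like audio paths are ignored.
    """
    missing = []
    with open(csv_file, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            for cell in row:
                cell = cell.strip()
                looks_like_audio = cell.lower().endswith(AUDIO_SUFFIXES)
                if looks_like_audio and not Path(cell).is_file():
                    missing.append((row_number, cell))
    return missing


if __name__ == "__main__":
    # Check every .csv file in the data folder and report unresolved paths.
    for csv_file in sorted(Path("data").glob("*.csv")):
        for row_number, path in find_missing_audio(csv_file):
            print(f"{csv_file}: row {row_number}: missing {path}")
```

Running the script from the repository root prints one line per unresolved path and prints nothing when every listed file is found.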

Training

Train the full Spectron model: python train_spectron_msd.py
Train Spectron without adversarial refinement: python train.py
Train Spectron with a pretrained transformer (without adversarial refinement): python train_with_pretrained_DPT.py

Inference

Set OUT_DIR and weights_path to match the training option chosen above.
Run: python test.py

Results

Model	SDRi (dB)	SI-SNRi (dB)
VoiceFilter	7.8	-
AtssNet	9.3	-
X-TasNet	13.8	12.7
Spectron without MSD (ours)	13.9	12.8
Spectron (ours)	14.4	13.3
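
For readers unfamiliar with the SI-SNRi column, the sketch below shows the commonly used definition of scale-invariant SNR and its improvement over the unprocessed mixture. It is an illustration only: the function names are hypothetical, and the evaluation code in this repository may compute the metric differently (for example on tensors rather than plain Python lists).

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so that a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target; the remainder counts as noise.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    projection = [scale * b for b in t]
    noise = [a - p for a, p in zip(e, projection)]

    projection_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10 * math.log10((projection_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the extracted signal over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is what "scale-invariant" refers to.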

Citation

If you use this piece of code, please cite:

@misc{bandyopadhyay2024spectrontargetspeakerextraction,
      title={Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement}, 
      author={Tathagata Bandyopadhyay},
      year={2024},
      eprint={2409.01352},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2409.01352}, 
}
