About The Project

A voice-changing dictaphone

Voice-Swapper is a dictaphone that converts the user’s (source) voice to a target voice without any loss of linguistic information. Voice conversion (VC) is useful in many applications, such as customizing audiobook and avatar voices, dubbing, voice modification, voice restoration after surgery, and cloning the voices of historical persons. VC models are primarily implemented with Generative Adversarial Networks (GANs), which give promising results by rendering the user’s spoken statements in the target’s voice. We aim to build these models from scratch and deploy them on an NVIDIA Jetson, a powerful and commonly used device for AI applications. This is an inter-SIG project between Diode and CompSoc.

Use the README.md to get started.

Objectives

- Build the generative adversarial network models from scratch.
- Deploy these models on an NVIDIA Jetson.
- Perform voice swapping (conversion) in real time.

Scope

If time permits, we aim to propose a novel model informed by a survey of model performance in the Voice Conversion Challenge 2020 (VCC2020), and to write a research paper comparing its performance with that of existing models.

Click here for the complete proposal.

Model Architecture

CycleGAN

One important characteristic of speech is that it has sequential and hierarchical structures, e.g., voiced or unvoiced segments and phonemes or morphemes. An effective way to represent such structures would be a recurrent neural network (RNN), but RNNs are computationally demanding because they are difficult to parallelize.

Instead, we configure a CycleGAN using gated CNNs, which not only allow parallelization over sequential data but also achieve state-of-the-art results in speech modeling. In a gated CNN, gated linear units (GLUs) are used as the activation function. A GLU is a data-driven activation function, and its gating mechanism allows information to be selectively propagated depending on the previous layer’s states.
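A minimal sketch of the gating idea, assuming the layer output is split into two equal halves: one half carries the content and the other, passed through a sigmoid, decides how much of it gets through. In an actual gated CNN the two halves would typically come from separate convolutions over the same input; the `glu` helper and the sample values here are illustrative only, not this project's implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def glu(values):
    """Gated linear unit over a flat list with an even number of entries.

    The first half is the linear "content" path; the second half is the
    gate. Each content value is scaled by sigmoid(gate), so the gate
    decides how much of it is propagated to the next layer.
    """
    if len(values) % 2 != 0:
        raise ValueError("GLU input must have an even length")
    half = len(values) // 2
    content, gate = values[:half], values[half:]
    return [c * sigmoid(g) for c, g in zip(content, gate)]


# Hypothetical layer output: content = [2.0, -1.0], gate = [0.0, 10.0]
features = [2.0, -1.0, 0.0, 10.0]
gated = glu(features)
```

A gate value near zero passes roughly half of the content, while a large positive gate passes almost all of it, which is what makes the activation data-driven.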

MelGAN

We propose MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice. We first compute spectrograms from waveform data and then perform a domain translation using a Generative Adversarial Network (GAN) architecture. An additional siamese network helps preserve speech information in the translation process, without sacrificing the ability to flexibly model the style of the target speaker.
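One way to picture the siamese constraint is as a penalty that keeps the offset between two embeddings the same before and after conversion. The sketch below assumes the siamese network has already produced embedding vectors for two source clips and their converted versions. The function names, the toy vectors, and the particular mix of cosine and squared-distance terms are assumptions made for illustration, not the loss this project uses.

```python
import math


def subtract(u, v):
    """Element-wise difference of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def preservation_loss(src_a, src_b, conv_a, conv_b):
    """Penalize changes in the offset between two embeddings.

    src_a, src_b: hypothetical siamese embeddings of two source clips.
    conv_a, conv_b: embeddings of the same clips after conversion.
    The loss is about zero when the offset keeps its direction and length.
    """
    t_src = subtract(src_a, src_b)
    t_conv = subtract(conv_a, conv_b)
    squared_distance = sum((a - b) ** 2 for a, b in zip(t_src, t_conv))
    return (1.0 - cosine_similarity(t_src, t_conv)) + squared_distance


# Conversion shifted both clips by the same amount: offset preserved.
preserved = preservation_loss([1.0, 0.0], [0.0, 1.0], [3.0, 2.0], [2.0, 3.0])
# Conversion swapped the two clips: offset reversed.
reversed_offset = preservation_loss([1.0, 0.0], [0.0, 1.0], [2.0, 3.0], [3.0, 2.0])
```

Because only the offset between pairs is constrained, the generator remains free to move every clip toward the target speaker's style, as long as the clips keep their positions relative to each other.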

Roadmap

Scrape audio files for the target speakers
Clean up the audio files
Compute mel spectrograms
Build models of different architectures
- Train the model
- Test the model
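
The mel-spectrogram step in the roadmap rests on the mel scale, which spaces frequency bands the way pitch is perceived rather than linearly in hertz. The sketch below uses one common (HTK-style) conversion formula and shows how band centres spread out at higher frequencies. The band count and frequency range are illustrative assumptions, and a full spectrogram would also need a short-time Fourier transform, which is omitted here.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in hertz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands equally spaced in mels."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_mels)]


# Assumed settings for illustration: 8 bands between 0 Hz and 8 kHz.
centers = mel_band_centers(n_mels=8, f_min=0.0, f_max=8000.0)
for index, hz in enumerate(centers):
    print(f"band {index}: {hz:.1f} Hz")
```

The centres are evenly spaced in mels but increasingly far apart in hertz, so low frequencies get finer resolution than high ones.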