Voice-Swapper

Real-time voice conversion using GANs implemented on RPi4.
Explore the dataset »

Table of Contents
  1. About The Project
  2. Objectives
  3. Scope
  4. Model Architecture
  5. Roadmap
  6. Contact
  7. Acknowledgments

About The Project

A voice-changing dictaphone

Voice-Swapper is a dictaphone that converts the user’s voice (the source) to a target voice without losing any linguistic information. Voice conversion (VC) is useful in many applications, such as customizing audiobook and avatar voices, dubbing, voice modification, voice restoration after surgery, and cloning the voices of historical figures. VC models are primarily implemented with Generative Adversarial Networks (GANs), which give promising results by reproducing the user’s input utterances in the target’s voice. We aim to build these models from scratch and deploy them on an NVIDIA Jetson, a powerful and commonly used device for AI applications. This is an inter-SIG project between Diode and CompSoc.

Use the README.md to get started.

(back to top)

Objectives

  • Build the generative adversarial network models from scratch.
  • Deploy these models on an NVIDIA Jetson.
  • Perform voice swapping (conversion) in real time.

(back to top)

Scope

If time permits, we aim to propose a novel model based on a survey of model performance in the Voice Conversion Challenge 2020 (VCC2020) and to write a research paper comparing its performance with that of existing models.

Click here for the complete proposal.

(back to top)

Model Architecture

CycleGAN

One of the important characteristics of speech is that it has sequential and hierarchical structures, e.g., voiced or unvoiced segments and phonemes or morphemes. An effective way to represent such structures would be to use an RNN, but it is computationally demanding due to the difficulty of parallel implementations.

Instead, we configure a CycleGAN using gated CNNs, which not only allow parallelization over sequential data but also achieve state-of-the-art results in speech modeling. In a gated CNN, gated linear units (GLUs) serve as the activation function. A GLU is a data-driven activation function whose gating mechanism lets information propagate selectively depending on the previous layer’s states.
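To make the gating idea concrete, here is a minimal sketch of a single-channel gated 1D convolution in plain Python. It is an illustration only, not the project's model code: the function names (`conv1d`, `gated_conv1d`) and the toy kernels are assumptions chosen for this example. Each output step depends only on a fixed window of the input, which is why, unlike an RNN, all steps can be computed in parallel.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the gate."""
    return 1.0 / (1.0 + math.exp(-x))


def conv1d(signal, kernel, bias=0.0):
    """'Valid' 1D cross-correlation of a sequence with a kernel."""
    k = len(kernel)
    return [
        sum(signal[t + i] * kernel[i] for i in range(k)) + bias
        for t in range(len(signal) - k + 1)
    ]


def gated_conv1d(signal, w_linear, w_gate, b_linear=0.0, b_gate=0.0):
    """GLU: linear path multiplied element-wise by a sigmoid gate.

    Both paths are convolutions over the same input; the gate decides,
    per time step, how much of the linear path is passed on.
    """
    linear = conv1d(signal, w_linear, b_linear)
    gate = conv1d(signal, w_gate, b_gate)
    return [a * sigmoid(g) for a, g in zip(linear, gate)]


# Hypothetical toy input: five feature values along the time axis.
frames = [0.0, 1.0, 2.0, 3.0, 4.0]
# A zero gate kernel gives sigmoid(0) = 0.5, so half of the linear path passes.
out = gated_conv1d(frames, w_linear=[1.0, 0.0, -1.0], w_gate=[0.0, 0.0, 0.0])
print(out)
```

In a trained network both kernels are learned, so the gate opens or closes depending on the input itself rather than being fixed as in this toy call.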

MelGAN

MelGAN-VC is a voice conversion method that relies on non-parallel speech data and can convert audio signals of arbitrary length from a source voice to a target voice. It first computes spectrograms from waveform data and then performs a domain translation using a Generative Adversarial Network (GAN) architecture. An additional siamese network helps preserve speech information during translation without sacrificing the ability to flexibly model the style of the target speaker.
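The description above does not spell out how arbitrary-length audio is fed to a generator. One plausible approach, sketched here purely as an assumption, is to cut the spectrogram into fixed-width chunks along the time axis, translate each chunk, and stitch the results back together. The names `convert_arbitrary_length` and `toy_generator`, and the chunk width, are hypothetical; `toy_generator` is a stand-in for a trained model.

```python
def convert_arbitrary_length(spectrogram, generator, chunk_frames=4, pad_value=0.0):
    """Translate a spectrogram of any length with a fixed-width generator.

    `spectrogram` is a list of time frames, each a list of frequency bins.
    The last chunk is padded to `chunk_frames` and the padding is trimmed
    from the output so the result has as many frames as the input.
    """
    n_frames = len(spectrogram)
    if n_frames == 0:
        return []
    n_bins = len(spectrogram[0])
    converted = []
    for start in range(0, n_frames, chunk_frames):
        chunk = [list(frame) for frame in spectrogram[start:start + chunk_frames]]
        while len(chunk) < chunk_frames:
            chunk.append([pad_value] * n_bins)  # pad the final, shorter chunk
        converted.extend(generator(chunk))
    return converted[:n_frames]


def toy_generator(chunk):
    """Hypothetical stand-in for a trained generator: doubles every value."""
    return [[2.0 * value for value in frame] for frame in chunk]


# Ten frames with two bins each; ten is not a multiple of the chunk width.
spec = [[float(t), float(t) + 0.5] for t in range(10)]
result = convert_arbitrary_length(spec, toy_generator)
print(len(result), result[-1])
```

Independent chunks can produce audible seams at their boundaries, so a real pipeline may need overlapping chunks or cross-fading; that is left out of this sketch.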

(back to top)

Roadmap

  • Scrape audio files for the target speakers
  • Clean up the audio files
  • Compute mel spectrograms
  • Build models with the different architectures
    • Train the model
    • Test the model
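
As an illustration of the mel spectrogram step in the roadmap, the following is a small, dependency-free sketch. It assumes the HTK-style mel formula, a Hann window, and a naive DFT so that it runs with the standard library alone; all function names and parameter values are choices made for this example, not the project's preprocessing code. In practice a library routine such as `librosa.feature.melspectrogram` would normally be used instead.

```python
import cmath
import math


def hz_to_mel(hz):
    """HTK-style mel scale (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum via a naive DFT (slow but self-contained)."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[i] * window[i] * cmath.exp(-2j * math.pi * k * i / n)
            for i in range(n)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(samples, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each holding n_mels band energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        frames.append([sum(w * p for w, p in zip(row, power)) for row in bank])
    return frames


# Synthetic test signal: a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(512)]
mel = mel_spectrogram(tone, sample_rate)
loudest_band = max(range(len(mel[0])), key=lambda m: mel[0][m])
print(len(mel), len(mel[0]), loudest_band)
```

The tiny FFT size and band count keep the naive DFT fast; real speech pipelines typically use much larger values, and a log or decibel scaling is usually applied to the band energies before they are passed to a model.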

(back to top)

Contact

Palgun N P - [email protected]

Harish Gumnur - [email protected]

Nikhil P Reddy - [email protected]

Project Link: https://github.com/IEEE-NITK/Voice-Swapper

(back to top)