Release Vevo pre-trained models (#374)
RMSnow authored Dec 23, 2024
1 parent 1eb43a0 commit 58d8b48
Showing 26 changed files with 3,486 additions and 12 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -6,6 +6,7 @@
.vs
.vscode
.cache
pyrightconfig.json

# GitHub files
.github
42 changes: 30 additions & 12 deletions README.md
@@ -22,7 +22,8 @@

- **TTS**: Text to Speech (⛳ supported)
- **SVS**: Singing Voice Synthesis (👨‍💻 developing)
- **VC**: Voice Conversion (👨‍💻 developing)
- **VC**: Voice Conversion (⛳ supported)
- **AC**: Accent Conversion (⛳ supported)
- **SVC**: Singing Voice Conversion (⛳ supported)
- **TTA**: Text to Audio (⛳ supported)
- **TTM**: Text to Music (👨‍💻 developing)
@@ -31,6 +32,7 @@
In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for consistent and reliable measurement of generation quality. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.

## 🚀 News
- **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a range of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)
- **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-space-purple)](https://modelscope.cn/studios/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-model-cyan)](https://modelscope.cn/models/amphion/MaskGCT) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
- **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361), and [DSFF-SVC](https://arxiv.org/abs/2310.11160) were accepted by IEEE SLT 2024! 🤗
- **2024/08/28**: Welcome to join Amphion's [Discord channel](https://discord.com/invite/drhW7ajqAG) to stay connected and engage with our community!
@@ -48,21 +50,33 @@ In addition to the specific generation tasks, Amphion includes several **vocoder
### TTS: Text to Speech

- Amphion achieves state-of-the-art performance among existing open-source text-to-speech (TTS) repositories. It supports the following models and architectures:
- [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
- [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
- [VALL-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
- [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
- [Jets](Jets): An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
- [MaskGCT](https://arxiv.org/abs/2409.00750): A fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision.
- [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/FastSpeech2/README.md)
- [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/VITS/README.md)
- [VALL-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/VALLE_V2/README.md)
- [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/NaturalSpeech2/README.md)
- [Jets](Jets): An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module. [![code](https://img.shields.io/badge/README-Code-blue)](egs/tts/Jets/README.md)
- [MaskGCT](https://arxiv.org/abs/2409.00750): A fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision. [![code](https://img.shields.io/badge/README-Code-blue)](models/tts/maskgct/README.md)

### VC: Voice Conversion

Amphion supports the following voice conversion models:

- [Vevo](https://openreview.net/pdf?id=anQDiQZhDP): A zero-shot voice imitation framework with controllable timbre and style. **Vevo-Timbre** performs style-preserving voice conversion, while **Vevo-Voice** performs style-converting voice conversion. [![code](https://img.shields.io/badge/README-Code-blue)](models/vc/vevo/README.md)
- [FACodec](https://arxiv.org/abs/2403.03100): FACodec decomposes speech into subspaces representing different attributes like content, prosody, and timbre. It can achieve zero-shot voice conversion. [![code](https://img.shields.io/badge/README-Code-blue)](https://huggingface.co/amphion/naturalspeech3_facodec)
- [Noro](https://arxiv.org/abs/2411.19770): A **noise-robust** zero-shot voice conversion system. Noro introduces innovative components tailored for VC with noisy reference speech, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. [![code](https://img.shields.io/badge/README-Code-blue)](egs/vc/Noro/README.md)

### AC: Accent Conversion

- Amphion supports AC with [Vevo-Style](models/vc/vevo/README.md). In particular, it can perform accent conversion in a zero-shot manner. [![code](https://img.shields.io/badge/README-Code-blue)](models/vc/vevo/README.md)

### SVC: Singing Voice Conversion

- Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our SLT 2024 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160) [![code](https://img.shields.io/badge/README-Code-red)](egs/svc/MultipleContentsSVC)
- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses [Bidirectional dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) as a backend and supports several sampling algorithms such as [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv).
- Amphion supports multiple content-based features from various pretrained models, including [WeNet](https://github.com/wenet-e2e/wenet), [Whisper](https://github.com/openai/whisper), and [ContentVec](https://github.com/auspicious3000/contentvec). Their specific roles in SVC have been investigated in our SLT 2024 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2310.11160) [![code](https://img.shields.io/badge/README-Code-blue)](egs/svc/MultipleContentsSVC)
- Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses [Bidirectional dilated CNN](https://openreview.net/pdf?id=a-xFK8Ymz5J) as a backend and supports several sampling algorithms such as [DDPM](https://arxiv.org/pdf/2006.11239.pdf), [DDIM](https://arxiv.org/pdf/2010.02502.pdf), and [PNDM](https://arxiv.org/pdf/2202.09778.pdf). Additionally, it supports single-step inference based on the [Consistency Model](https://openreview.net/pdf?id=FmqFfMTNnv). [![code](https://img.shields.io/badge/README-Code-blue)](egs/svc/DiffComoSVC/README.md)
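
For intuition, the snippet below sketches a single deterministic DDIM update step as used by such diffusion-based pipelines. It is a generic, simplified sketch: the noise schedule, tensor shapes, and the stand-in denoiser are illustrative assumptions, not the code of Amphion's SVC recipes.

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Estimate the clean sample from the noisy input and the predicted noise.
    x0_pred = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (less noisy) timestep.
    return torch.sqrt(alpha_bar_prev) * x0_pred + torch.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Toy loop with a dummy denoiser on a mel-spectrogram-shaped tensor.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)
denoiser = lambda x, t: torch.zeros_like(x)   # stand-in for the trained acoustic model
x = torch.randn(1, 100, 80)                   # assumed shape: (batch, frames, mel bins)
timesteps = list(range(999, -1, -100))        # coarse 10-step schedule
for i, t in enumerate(timesteps):
    eps = denoiser(x, t)
    alpha_bar_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else torch.tensor(1.0)
    x = ddim_step(x, eps, alpha_bars[t], alpha_bar_prev)
```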

### TTA: Text to Audio

- Amphion supports TTA with a latent diffusion model. Its design follows [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830). It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2304.00830) [![code](https://img.shields.io/badge/README-Code-red)](egs/tta/RECIPE.md)
- Amphion supports TTA with a latent diffusion model. Its design follows [AudioLDM](https://arxiv.org/abs/2301.12503), [Make-an-Audio](https://arxiv.org/abs/2301.12661), and [AUDIT](https://arxiv.org/abs/2304.00830). It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2304.00830) [![code](https://img.shields.io/badge/README-Code-blue)](egs/tta/RECIPE.md)

### Vocoder

@@ -71,11 +85,13 @@ In addition to the specific generation tasks, Amphion includes several **vocoder
- Flow-based vocoders: [WaveGlow](https://arxiv.org/abs/1811.00002).
- Diffusion-based vocoders: [Diffwave](https://arxiv.org/abs/2009.09761).
- Auto-regressive based vocoders: [WaveNet](https://arxiv.org/abs/1609.03499), [WaveRNN](https://arxiv.org/abs/1802.08435v1).
- Amphion provides the official implementation of [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957) (our ICASSP 2024 paper). It can be used during training to enhance GAN-based vocoders of any architecture, while keeping the inference stage (e.g., memory footprint and speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-red)](egs/vocoder/gan/tfr_enhanced_hifigan)
- Amphion provides the official implementation of [Multi-Scale Constant-Q Transform Discriminator](https://arxiv.org/abs/2311.14957) (our ICASSP 2024 paper). It can be used during training to enhance GAN-based vocoders of any architecture, while keeping the inference stage (e.g., memory footprint and speed) unchanged. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2311.14957) [![code](https://img.shields.io/badge/README-Code-blue)](egs/vocoder/gan/tfr_enhanced_hifigan)
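
To make the idea concrete, here is a hypothetical, simplified sketch of scoring a waveform with small CNN discriminators on constant-Q spectrograms computed at several resolutions. It is not Amphion's implementation; the CQT settings, layer sizes, and input length are arbitrary assumptions for demonstration.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

sr = 22050
wav = np.random.randn(4 * sr).astype(np.float32)   # stand-in for a generated waveform

scores = []
for bins_per_octave in (24, 36, 48):               # "multi-scale": several frequency resolutions
    cqt = np.abs(
        librosa.cqt(wav, sr=sr, hop_length=256,
                    n_bins=7 * bins_per_octave, bins_per_octave=bins_per_octave)
    )
    x = torch.from_numpy(cqt).float().unsqueeze(0).unsqueeze(0)   # (1, 1, freq, time)
    disc = nn.Sequential(                          # tiny per-scale discriminator
        nn.Conv2d(1, 32, kernel_size=3, padding=1),
        nn.LeakyReLU(0.1),
        nn.Conv2d(32, 1, kernel_size=3, padding=1),
    )
    scores.append(disc(x).mean())                  # real/fake logit for this scale
```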

### Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics contain:
Amphion provides a comprehensive objective evaluation of the generated audio. [![code](https://img.shields.io/badge/README-Code-blue)](egs/metrics/README.md)

The supported evaluation metrics include:

- **F0 Modeling**: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
- **Energy Modeling**: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
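
As a concrete illustration of the F0 metrics above, here is a simplified sketch using pyworld for F0 extraction. The exact extraction settings and voiced-frame handling in Amphion's metric scripts may differ.

```python
import numpy as np
import pyworld as pw

def f0_pearson_and_rmse(ref_wav, gen_wav, sr=16000):
    """F0 Pearson Coefficient and F0 RMSE over frames voiced in both signals."""
    f0_ref, _ = pw.harvest(ref_wav.astype(np.float64), sr)
    f0_gen, _ = pw.harvest(gen_wav.astype(np.float64), sr)
    n = min(len(f0_ref), len(f0_gen))
    f0_ref, f0_gen = f0_ref[:n], f0_gen[:n]
    voiced = (f0_ref > 0) & (f0_gen > 0)           # unvoiced frames have f0 == 0
    fpc = np.corrcoef(f0_ref[voiced], f0_gen[voiced])[0, 1]
    rmse = np.sqrt(np.mean((f0_ref[voiced] - f0_gen[voiced]) ** 2))
    return fpc, rmse
```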
@@ -133,6 +149,8 @@ Mount dataset by argument `-v` is necessary when using Docker. Please refer to [
We detail the instructions for different tasks in the following recipes:

- [Text to Speech (TTS)](egs/tts/README.md)
- [Voice Conversion (VC)](models/vc/vevo/README.md)
- [Accent Conversion (AC)](models/vc/vevo/README.md)
- [Singing Voice Conversion (SVC)](egs/svc/README.md)
- [Text to Audio (TTA)](egs/tta/README.md)
- [Vocoder](egs/vocoder/README.md)
Binary file added imgs/vc/vevo.png
108 changes: 108 additions & 0 deletions models/codec/melvqgan/melspec.py
@@ -0,0 +1,108 @@
# Copyright (c) 2023 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import torch
import pyworld as pw
import numpy as np
import soundfile as sf
import os
from torchaudio.functional import pitch_shift
import librosa
from librosa.filters import mel as librosa_mel_fn
import torch.nn as nn
import torch.nn.functional as F
import tqdm


def dynamic_range_compression(x, C=1, clip_val=1e-5):
    # Log-compress magnitudes (NumPy), clamping small values for numerical stability.
    return np.log(np.clip(x, a_min=clip_val, a_max=None) * C)


def dynamic_range_decompression(x, C=1):
    return np.exp(x) / C


def dynamic_range_compression_torch(x, C=1, clip_val=1e-5):
    # Log-compress magnitudes (PyTorch), clamping small values for numerical stability.
    return torch.log(torch.clamp(x, min=clip_val) * C)


def dynamic_range_decompression_torch(x, C=1):
    return torch.exp(x) / C


def spectral_normalize_torch(magnitudes):
    output = dynamic_range_compression_torch(magnitudes)
    return output


def spectral_de_normalize_torch(magnitudes):
    output = dynamic_range_decompression_torch(magnitudes)
    return output


class MelSpectrogram(nn.Module):
    """Log-compressed mel spectrogram extractor with explicit reflect padding."""

    def __init__(
        self,
        n_fft,
        num_mels,
        sampling_rate,
        hop_size,
        win_size,
        fmin,
        fmax,
        center=False,
    ):
        super(MelSpectrogram, self).__init__()
        self.n_fft = n_fft
        self.hop_size = hop_size
        self.win_size = win_size
        self.sampling_rate = sampling_rate
        self.num_mels = num_mels
        self.fmin = fmin
        self.fmax = fmax
        self.center = center

        mel_basis = {}
        hann_window = {}

        # Mel filterbank and analysis window, stored as non-trainable buffers.
        mel = librosa_mel_fn(
            sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
        )
        mel_basis = torch.from_numpy(mel).float()
        hann_window = torch.hann_window(win_size)

        self.register_buffer("mel_basis", mel_basis)
        self.register_buffer("hann_window", hann_window)

    def forward(self, y):
        # Reflect-pad by (n_fft - hop_size) / 2 on each side so that, with center=False,
        # the number of frames equals len(y) // hop_size.
        y = torch.nn.functional.pad(
            y.unsqueeze(1),
            (
                int((self.n_fft - self.hop_size) / 2),
                int((self.n_fft - self.hop_size) / 2),
            ),
            mode="reflect",
        )
        y = y.squeeze(1)
        spec = torch.stft(
            y,
            self.n_fft,
            hop_length=self.hop_size,
            win_length=self.win_size,
            window=self.hann_window,
            center=self.center,
            pad_mode="reflect",
            normalized=False,
            onesided=True,
            return_complex=True,
        )
        spec = torch.view_as_real(spec)

        # Magnitude spectrogram with a small epsilon for numerical stability.
        spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))

        # Project onto the mel filterbank and apply log compression.
        spec = torch.matmul(self.mel_basis, spec)
        spec = spectral_normalize_torch(spec)

        return spec
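
For reference, a minimal usage sketch of this module is shown below. The hyper-parameter values are illustrative placeholders, not necessarily those used by the released Vevo checkpoints.

```python
import torch

# Illustrative settings only; the released checkpoints may use different values.
mel_extractor = MelSpectrogram(
    n_fft=1024,
    num_mels=80,
    sampling_rate=24000,
    hop_size=256,
    win_size=1024,
    fmin=0,
    fmax=12000,
)

wav = torch.randn(2, 24000)   # (batch, samples): two items of 1 s of 24 kHz audio
mel = mel_extractor(wav)      # (batch, num_mels, frames) log-compressed mel spectrogram
print(mel.shape)              # torch.Size([2, 80, 93])
```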