Marioando/vits3_pytorch
Homebrewed VITS-3 with an extra flow to improve the text encoder's projected normalizing-flow distribution and the prior loss (WIP 🚧)

Inspired by VITS-2 and GradTTS.
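
For context, the prior loss in question is the standard VITS KL term between the posterior encoder and the flow-transformed, text-conditioned prior. Below is a sketch of that term in LaTeX, with notation following the VITS paper; presumably the extra flow added in this fork reshapes the prior side of this expression.

% VITS prior-matching (KL) term; the prior density is evaluated through
% the normalizing flow f_theta via change of variables, with
% (mu_theta, sigma_theta) the text encoder's projected prior statistics
L_{kl} = \log q_\phi(z \mid x_{\mathrm{lin}}) - \log p_\theta(z \mid c_{\mathrm{text}})
\log p_\theta(z \mid c_{\mathrm{text}}) = \log \mathcal{N}\big(f_\theta(z);\, \mu_\theta(c), \sigma_\theta(c)\big) + \log \left| \det \frac{\partial f_\theta(z)}{\partial z} \right|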

Prerequisites

  1. Python >= 3.10
  2. Clone this repository
  3. Install Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
  5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
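
After building the extension, a quick dry-run confirms the Cython module imports and produces an alignment path. This is a minimal sketch; the shapes are illustrative and follow the upstream VITS convention of (batch, frames, tokens).

# Sanity check for the compiled monotonic_align extension
# (illustrative shapes; calling convention follows upstream VITS)
import torch
from monotonic_align import maximum_path

neg_cent = torch.randn(2, 100, 12)  # (batch, frames, tokens) alignment log-likelihood scores
attn_mask = torch.ones(2, 100, 12)  # valid-position mask
attn = maximum_path(neg_cent, attn_mask)  # hard monotonic path; same shape, 0/1 entries
print(attn.shape)  # torch.Size([2, 100, 12])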

How to run (dry-run)

  • model forward pass (dry-run)
import torch
from models import SynthesizerTrn

net_g = SynthesizerTrn(
    n_vocab=256,
    spec_channels=80, 
    segment_size=8192,
    inter_channels=192,
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1,
    resblock="1", 
    resblock_kernel_sizes=[3, 7, 11],
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    upsample_rates=[8, 8, 2, 2],
    upsample_initial_channel=512,
    upsample_kernel_sizes=[16, 16, 4, 4],
    n_speakers=0,
    gin_channels=0,
    use_sdp=True,  # stochastic duration predictor
    use_transformer_flows=True,
    # choose from "pre_conv", "fft", "mono_layer_inter_residual", "mono_layer_post_residual"
    transformer_flow_type="fft",
    use_spk_conditioned_encoder=True,
    use_noise_scaled_mas=True,  # noise-scaled monotonic alignment search (from VITS-2)
    use_duration_discriminator=True,
)

x = torch.LongTensor([[1, 2, 3], [4, 5, 0]])  # token ids (second row zero-padded)
x_lengths = torch.LongTensor([3, 2])  # token lengths
y = torch.randn(2, 80, 100)  # spectrograms (batch, spec_channels, frames)
y_lengths = torch.LongTensor([100, 80])  # spectrogram frame lengths

net_g(
    x=x,
    x_lengths=x_lengths,
    y=y,
    y_lengths=y_lengths,
)

# From here, compute the training losses on the returned tensors and backpropagate, as done in train.py.
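
Synthesis can be dry-run the same way. The sketch below assumes this fork keeps the upstream VITS infer signature (noise_scale, noise_scale_w, length_scale) and that the waveform is the first returned value; this fork may return additional values.

# Inference dry-run (assumes the upstream VITS `infer` signature)
net_g.eval()
with torch.no_grad():
    audio = net_g.infer(
        x, x_lengths,
        noise_scale=0.667,  # prior noise temperature
        noise_scale_w=0.8,  # duration predictor noise (used when use_sdp=True)
        length_scale=1.0,   # speech rate; >1 slows speech down
    )[0]
print(audio.shape)  # (batch, 1, num_samples) raw waveform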

Training Example

# LJ Speech
python train.py -c configs/vits3_ljs_nosdp.json -m ljs_base  # without SDP (recommended)
python train.py -c configs/vits3_ljs_base.json -m ljs_base   # with SDP

# VCTK
python train_ms.py -c configs/vits3_vctk_base.json -m vctk_base
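
Once checkpoints exist, a generator can be restored for evaluation following the upstream VITS inference recipe. This is a sketch: the checkpoint path and step number are illustrative, and the positional arguments map onto n_vocab, spec_channels, and segment_size from the dry-run example above.

# Restore a trained generator for evaluation
# (path/step illustrative; helpers follow upstream VITS utils.py)
import utils
from models import SynthesizerTrn
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/vits3_ljs_nosdp.json")
net_g = SynthesizerTrn(
    len(symbols),                                   # n_vocab
    hps.data.filter_length // 2 + 1,                # spec_channels
    hps.train.segment_size // hps.data.hop_length,  # segment_size in frames
    **hps.model,
)
net_g.eval()
utils.load_checkpoint("logs/ljs_base/G_100000.pth", net_g, None)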

TODOs, features and notes

  • [ ] Train on LJ Speech and publish sample audio
