Marioando/vits3_pytorch
Homebrewed VITS-3 with an extra flow to improve the text encoder's projected normalizing-flow distribution and the prior loss (WIP 🚧)

Inspired by VITS-2 and GradTTS.
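
For context, the prior loss in question is the standard VITS KL term between the posterior encoder and the flow-transformed, text-conditioned prior. Below is a sketch of that term in LaTeX, with notation following the VITS paper; presumably the extra flow added in this fork reshapes the prior side of this expression.

% VITS prior-matching (KL) term; the prior density is evaluated through
% the normalizing flow f_theta via change of variables, with
% (mu_theta, sigma_theta) the text encoder's projected prior statistics
L_{kl} = \log q_\phi(z \mid x_{\mathrm{lin}}) - \log p_\theta(z \mid c_{\mathrm{text}})
\log p_\theta(z \mid c_{\mathrm{text}}) = \log \mathcal{N}\big(f_\theta(z);\, \mu_\theta(c), \sigma_\theta(c)\big) + \log \left| \det \frac{\partial f_\theta(z)}{\partial z} \right|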

Prerequisites

  1. Python >= 3.10
  2. Clone this repository
  3. Install Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
  5. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
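
After building the extension, a quick dry-run confirms the Cython module imports and produces an alignment path. This is a minimal sketch; the shapes are illustrative and follow the upstream VITS convention of (batch, frames, tokens).

# Sanity check for the compiled monotonic_align extension
# (illustrative shapes; calling convention follows upstream VITS)
import torch
from monotonic_align import maximum_path

neg_cent = torch.randn(2, 100, 12)  # (batch, frames, tokens) alignment log-likelihood scores
attn_mask = torch.ones(2, 100, 12)  # valid-position mask
attn = maximum_path(neg_cent, attn_mask)  # hard monotonic path; same shape, 0/1 entries
print(attn.shape)  # torch.Size([2, 100, 12])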

How to run (dry-run)

  • model forward pass (dry-run)
import torch
from models import SynthesizerTrn

net_g = SynthesizerTrn(
    n_vocab=256,
    spec_channels=80, 
    segment_size=8192,
    inter_channels=192,
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1,
    resblock="1", 
    resblock_kernel_sizes=[3, 7, 11],
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    upsample_rates=[8, 8, 2, 2],
    upsample_initial_channel=512,
    upsample_kernel_sizes=[16, 16, 4, 4],
    n_speakers=0,
    gin_channels=0,
    use_sdp=True,  # stochastic duration predictor
    use_transformer_flows=True,
    # choose from "pre_conv", "fft", "mono_layer_inter_residual", "mono_layer_post_residual"
    transformer_flow_type="fft",
    use_spk_conditioned_encoder=True,
    use_noise_scaled_mas=True,  # noise-scaled monotonic alignment search (from VITS-2)
    use_duration_discriminator=True,
)

x = torch.LongTensor([[1, 2, 3], [4, 5, 0]])  # token ids (second row zero-padded)
x_lengths = torch.LongTensor([3, 2])  # token lengths
y = torch.randn(2, 80, 100)  # spectrograms (batch, spec_channels, frames)
y_lengths = torch.LongTensor([100, 80])  # spectrogram frame lengths

net_g(
    x=x,
    x_lengths=x_lengths,
    y=y,
    y_lengths=y_lengths,
)

# From here, compute the training losses on the returned tensors and backpropagate, as done in train.py.
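
Synthesis can be dry-run the same way. The sketch below assumes this fork keeps the upstream VITS infer signature (noise_scale, noise_scale_w, length_scale) and that the waveform is the first returned value; this fork may return additional values.

# Inference dry-run (assumes the upstream VITS `infer` signature)
net_g.eval()
with torch.no_grad():
    audio = net_g.infer(
        x, x_lengths,
        noise_scale=0.667,  # prior noise temperature
        noise_scale_w=0.8,  # duration predictor noise (used when use_sdp=True)
        length_scale=1.0,   # speech rate; >1 slows speech down
    )[0]
print(audio.shape)  # (batch, 1, num_samples) raw waveform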

Training Example

# LJ Speech
python train.py -c configs/vits3_ljs_nosdp.json -m ljs_base  # without SDP (recommended)
python train.py -c configs/vits3_ljs_base.json -m ljs_base   # with SDP

# VCTK
python train_ms.py -c configs/vits3_vctk_base.json -m vctk_base
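
Once checkpoints exist, a generator can be restored for evaluation following the upstream VITS inference recipe. This is a sketch: the checkpoint path and step number are illustrative, and the positional arguments map onto n_vocab, spec_channels, and segment_size from the dry-run example above.

# Restore a trained generator for evaluation
# (path/step illustrative; helpers follow upstream VITS utils.py)
import utils
from models import SynthesizerTrn
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/vits3_ljs_nosdp.json")
net_g = SynthesizerTrn(
    len(symbols),                                   # n_vocab
    hps.data.filter_length // 2 + 1,                # spec_channels
    hps.train.segment_size // hps.data.hop_length,  # segment_size in frames
    **hps.model,
)
net_g.eval()
utils.load_checkpoint("logs/ljs_base/G_100000.pth", net_g, None)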

TODOs, features and notes

  • [ ] Train on LJ Speech and publish sample audio
