Paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
Codec: xcodec2 (Please install new version xcodec2==0.1.3)
LLaMA-based TTS, 3B version: Llasa-3B
LLaMA-based TTS, 1B version: Llasa-1B
LLaMA-based TTS, 8B version: Llasa-8B
Single Vector Quantization
- Codebook size of 65536 using Finite Scalar Quantization, achieving 99% codebook usage (comparable in scale to text tokenizers, e.g. LLaMA3 at 128256).
- 50x1 tokens per second (50 frames per second, one token per frame)
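The numbers above can be made concrete with a small sketch. It assumes the 65536-entry codebook is factored as eight dimensions with four levels each (4**8 = 65536); that factorization, the simplified tanh-and-round quantizer, and the names `LEVELS`, `BASIS`, and `fsq_index` are illustrative assumptions, not the codec's actual implementation. The 16 kHz sample rate is taken from the reconstruction notes further down.

```python
import numpy as np

# Assumed factorization of the codebook: 8 dimensions x 4 levels = 65536 codes.
LEVELS = np.array([4] * 8)
# Mixed-radix place values (1, 4, 16, ...) used to fold 8 digits into one index.
BASIS = np.concatenate(([1], np.cumprod(LEVELS[:-1])))


def fsq_index(z):
    """Map latent vectors of shape (..., 8) to a single integer code index.

    Simplified FSQ: squash each dimension with tanh, scale it to
    [0, L - 1], round to the nearest level, then combine the digits.
    """
    z = np.asarray(z, dtype=np.float64)
    unit = (np.tanh(z) + 1.0) / 2.0                          # values in (0, 1)
    digits = np.rint(unit * (LEVELS - 1)).astype(np.int64)   # 0..L-1 per dim
    return (digits * BASIS).sum(axis=-1)


CODEBOOK_SIZE = int(np.prod(LEVELS))                         # 65536
TOKENS_PER_SECOND = 50
SAMPLE_RATE = 16_000
SAMPLES_PER_TOKEN = SAMPLE_RATE // TOKENS_PER_SECOND         # 320 samples
BITS_PER_TOKEN = (CODEBOOK_SIZE - 1).bit_length()            # 16 bits
BITS_PER_SECOND = TOKENS_PER_SECOND * BITS_PER_TOKEN         # 800 bit/s
```

Under these assumptions, each token covers 320 audio samples and the token stream carries 800 bits per second.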
Multilingual Speech Semantic Support
- Uses Wav2Vec2-BERT, a semantic encoder pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages.
- Codec trained on 150k hours of multilingual speech data, including Emilia (En/Zh/De/Fr/Ja/Ko) and MLS (En/Fr/De/Nl/Es/It/Pt/Pl).
High-Quality Speech Reconstruction
- Transformer + Vocos Decoder
- BigCodec encoder
- Spec discriminator with FFT sizes {78, 126, 206, 334, 542, 876, 1418, 2296} tailored for transformer decoder. Details here
- Achieves UTMOS 4.13, WER 2.47 (hubert-large-ls960-ft), SIM 0.82 (wavlm_large_finetune), STOI 0.92, PESQ-NB 3.05, and PESQ-WB 2.44 on librispeech-test-clean reconstruction (ground truth: WER 2.09, UTMOS 4.09)
- Supports 16 kHz speech only
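To illustrate what a multi-resolution spectral view with the listed FFT sizes looks like, here is a minimal NumPy sketch. Only the FFT sizes come from the list above; the Hann window, the hop length of `n_fft // 4`, and the function names are assumptions for illustration, and the actual discriminator may use different settings.

```python
import numpy as np

# FFT sizes quoted in the feature list above.
FFT_SIZES = [78, 126, 206, 334, 542, 876, 1418, 2296]


def stft_magnitude(wave, n_fft, hop):
    """Magnitude spectrogram of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)  # assumed window choice
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames * window, axis=-1))


def multi_resolution_spectrograms(wave):
    """One magnitude spectrogram per FFT size (hop = n_fft // 4 is assumed)."""
    return {n: stft_magnitude(wave, n, max(1, n // 4)) for n in FFT_SIZES}


# One second of noise at 16 kHz stands in for a real utterance.
wave = np.random.default_rng(0).standard_normal(16_000)
specs = multi_resolution_spectrograms(wave)
```

Each resolution trades time detail against frequency detail, which is the usual motivation for feeding several of them to a spectral discriminator.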
Code is tested on Python 3.9.
Follow these steps to set up your environment:
- Clone this repo
- conda create --name xcodec2 python=3.9
- conda activate xcodec2
- pip install -r requirements.txt
- Download the pretrained checkpoint here
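After the setup steps, a quick sanity check of the environment can save time. The sketch below is a hypothetical helper (the name `check_environment` is not part of this repo); it only reports the interpreter version and whether the `xcodec2` package is installed, using the standard library.

```python
import sys
from importlib import metadata


def check_environment(expected_python=(3, 9), package="xcodec2"):
    """Return human-readable notes about the current environment."""
    notes = []
    current = sys.version_info[:2]
    if current != expected_python:
        notes.append(
            f"Python {current[0]}.{current[1]} in use; "
            f"the code is tested on {expected_python[0]}.{expected_python[1]}"
        )
    try:
        notes.append(f"{package} {metadata.version(package)} is installed")
    except metadata.PackageNotFoundError:
        notes.append(f"{package} is not installed")
    return notes


for note in check_environment():
    print(note)
```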
To train XCodec2, first prepare your data:
- Make a file list by:
- Train X-Codec-2.0 with the default settings by:
python log_dir=/path/to/log_dir
Batch inference
Code extracting
Codes are saved in the output folder, mirroring the subfolder structure of the input audio files.
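The mirrored layout can be sketched with `pathlib`. This is an illustration of the path mapping only; the function name `mirrored_output_path` and the `.npy` suffix are assumptions, not necessarily what the extraction script uses.

```python
from pathlib import Path


def mirrored_output_path(audio_path, input_root, output_root, suffix=".npy"):
    """Map an input audio path to its output path under output_root.

    The path relative to input_root is preserved; only the file suffix
    changes (".npy" is an assumed suffix for the saved codes).
    """
    relative = Path(audio_path).relative_to(input_root)
    return Path(output_root) / relative.with_suffix(suffix)


# Hypothetical example paths.
example = mirrored_output_path("data/spk1/utt001.wav", "data", "codes")
print(example)
```

Before writing a file, the parent folder would typically be created with `example.parent.mkdir(parents=True, exist_ok=True)`.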
Special thanks to the authors of BigCodec, since our codebase is largely borrowed from BigCodec.