update readme
yuekaizhang committed Nov 20, 2024
1 parent d55a534 commit e66f133
Showing 8 changed files with 195 additions and 163 deletions.
65 changes: 58 additions & 7 deletions egs/libritts/TTS/README.md
@@ -1,7 +1,7 @@
# Introduction

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members.
The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
The main differences from the LibriSpeech corpus are listed below:
1. The audio files are at 24kHz sampling rate.
2. The speech is split at sentence breaks.
@@ -11,16 +11,16 @@ The main differences from the LibriSpeech corpus are listed below:
For more information, refer to the paper "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech", Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, arXiv, 2019. If you use the LibriTTS corpus in your work, please cite this paper where it was introduced.

> [!CAUTION]
> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
> While these recipes have the potential to advance various fields such as accessibility, language education, and AI-driven solutions, they also carry certain ethical and legal responsibilities.
>
> By using this framework, you agree to the following:
> 1. Legal and Ethical Use: You shall not use this framework, or any models derived from it, for any unlawful or unethical purposes. This includes, but is not limited to: Creating voice clones without the explicit, informed consent of the individual whose voice is being cloned. Engaging in any form of identity theft, impersonation, or fraud using cloned voices. Violating any local, national, or international laws regarding privacy, intellectual property, or personal data.
>
> 2. Responsibility of Use: The users of this framework are solely responsible for ensuring that their use of voice cloning technologies complies with all applicable laws and ethical guidelines. We explicitly disclaim any liability for misuse of the technology.
>
> 3. Attribution and Use of Open-Source Components: This project is provided under the Apache 2.0 license. Users must adhere to the terms of this license and provide appropriate attribution when required.
>
> 4. No Warranty: This framework is provided “as-is,” without warranty of any kind, either express or implied. We do not guarantee that the use of this software will comply with legal requirements or that it will not infringe the rights of third parties.

@@ -49,3 +49,54 @@ To inference, use:
--epoch 400 \
--tokens data/tokens.txt
```

# [VALL-E](https://arxiv.org/abs/2301.02111)

`./valle` contains the code for training the VALL-E TTS model.

Checkpoints and training logs can be found [here](https://huggingface.co/yuekai/vall-e_libritts). The demo of the model trained with libritts and [libritts-r](https://www.openslr.org/141/) is available [here](https://huggingface.co/spaces/yuekai/valle-libritts-demo).

Preparation:

```
bash prepare.sh --start-stage 4
```

The training command is given below:

```
world_size=8
exp_dir=exp/valle
## Train AR model
python3 valle/train.py --max-duration 320 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
--num-buckets 6 --dtype "bfloat16" --save-every-n 1000 --valid-interval 2000 \
--share-embedding true --norm-first true --add-prenet false \
--decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
--base-lr 0.03 --warmup-steps 200 --average-period 0 \
--num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 1 \
--exp-dir ${exp_dir} --world-size ${world_size}
## Train NAR model
# cd ${exp_dir}
# ln -s best-valid-loss.pt epoch-99.pt # --start-epoch 100=99+1
# cd -
python3 valle/train.py --max-duration 160 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
--num-buckets 6 --dtype "float32" --save-every-n 1000 --valid-interval 2000 \
--share-embedding true --norm-first true --add-prenet false \
--decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
--base-lr 0.03 --warmup-steps 200 --average-period 0 \
--num-epochs 40 --start-epoch 100 --start-batch 0 --accumulate-grad-steps 2 \
--exp-dir ${exp_dir} --world-size ${world_size}
```
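
As a rough check on the two commands above: assuming `--max-duration` is the per-GPU cap on seconds of audio in one batch (the usual lhotse sampler convention, which this README does not state), both stages consume the same amount of audio per optimizer step. The helper name below is hypothetical.

```python
def audio_seconds_per_step(max_duration, world_size, accumulate_grad_steps):
    """Upper bound on seconds of audio consumed per optimizer step.

    Assumes max_duration is a per-GPU limit and that gradients are
    accumulated over `accumulate_grad_steps` batches before each update.
    """
    return max_duration * world_size * accumulate_grad_steps


# Values taken from the AR (stage 1) and NAR (stage 2) commands.
ar = audio_seconds_per_step(max_duration=320, world_size=8, accumulate_grad_steps=1)
nar = audio_seconds_per_step(max_duration=160, world_size=8, accumulate_grad_steps=2)
print(ar, nar)  # 2560 2560
```

Under that assumption, halving `--max-duration` for the NAR stage is offset by doubling `--accumulate-grad-steps`.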

To run inference, use:
```
huggingface-cli login
huggingface-cli download --local-dir ${exp_dir} yuekai/vall-e_libritts
top_p=1.0
python3 valle/infer.py --output-dir demos_epoch_${epoch}_avg_${avg}_top_p_${top_p} \
--top-k -1 --temperature 1.0 \
--text ./libritts.txt \
--checkpoint ${exp_dir}/epoch-${epoch}-avg-${avg}.pt --top-p ${top_p}
```
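
The sampling flags in the command above (`--top-k -1`, `--top-p`, `--temperature`) are not explained in this README. The sketch below shows the conventional meaning of temperature scaling followed by nucleus (top-p) filtering on a toy logit vector. It assumes `valle/infer.py` follows that convention and treats `--top-k -1` as "no top-k cutoff"; the function name is hypothetical and this is not the recipe's code.

```python
import math
import random


def sample_top_p(logits, top_p=1.0, temperature=1.0, rng=random):
    """Sample an index from `logits` after temperature scaling and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one token from the kept set, weighted by its probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# The top token alone has probability of about 0.66, so top_p=0.5 keeps only it.
print(sample_top_p([2.0, 1.0, 0.1], top_p=0.5))  # 0
```

With `top_p=1.0`, as in the command above, no token is filtered out and the draw is plain temperature sampling.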
69 changes: 45 additions & 24 deletions egs/wenetspeech4tts/TTS/README.md
@@ -1,14 +1,6 @@
# Introduction

-LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members.
-The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.
-The main differences from the LibriSpeech corpus are listed below:
-1. The audio files are at 24kHz sampling rate.
-2. The speech is split at sentence breaks.
-3. Both original and normalized texts are included.
-4. Contextual information (e.g., neighbouring sentences) can be extracted.
-5. Utterances with significant background noise are excluded.
-For more information, refer to the paper "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech", Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, arXiv, 2019. If you use the LibriTTS corpus in your work, please cite this paper where it was introduced.
+[**WenetSpeech4TTS**](https://huggingface.co/datasets/Wenetspeech4TTS/WenetSpeech4TTS) is a multi-domain **Mandarin** corpus derived from the open-sourced [WenetSpeech](https://arxiv.org/abs/2110.03370) dataset.

> [!CAUTION]
> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
@@ -24,28 +16,57 @@ For more information, refer to the paper "LibriTTS: A Corpus Derived from LibriS
> 4. No Warranty: This framework is provided “as-is,” without warranty of any kind, either express or implied. We do not guarantee that the use of this software will comply with legal requirements or that it will not infringe the rights of third parties.

# VITS
# [VALL-E](https://arxiv.org/abs/2301.02111)

This recipe provides a VITS model trained on the LibriTTS dataset.
`./valle` contains the code for training the VALL-E TTS model.

Pretrained model can be found [here](https://huggingface.co/zrjin/icefall-tts-libritts-vits-2024-10-30).
Checkpoints and training logs can be found [here](https://huggingface.co/yuekai/vall-e_wenetspeech4tts). The demo of the model trained with Wenetspeech4TTS Premium (945 hours) is available [here](https://huggingface.co/spaces/yuekai/valle_wenetspeech4tts_demo).

Preparation:

```
bash prepare.sh
```

The training command is given below:

```
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
./vits/train.py \
--world-size 4 \
--num-epochs 400 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir vits/exp \
--max-duration 500
world_size=8
exp_dir=exp/valle
## Train AR model
python3 valle/train.py --max-duration 320 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
--num-buckets 6 --dtype "bfloat16" --save-every-n 1000 --valid-interval 2000 \
--share-embedding true --norm-first true --add-prenet false \
--decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
--base-lr 0.03 --warmup-steps 200 --average-period 0 \
--num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 1 \
--exp-dir ${exp_dir} --world-size ${world_size}
## Train NAR model
# cd ${exp_dir}
# ln -s best-valid-loss.pt epoch-99.pt # --start-epoch 100=99+1
# cd -
python3 valle/train.py --max-duration 160 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
--num-buckets 6 --dtype "float32" --save-every-n 1000 --valid-interval 2000 \
--share-embedding true --norm-first true --add-prenet false \
--decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
--base-lr 0.03 --warmup-steps 200 --average-period 0 \
--num-epochs 40 --start-epoch 100 --start-batch 0 --accumulate-grad-steps 2 \
--exp-dir ${exp_dir} --world-size ${world_size}
```

To run inference, use:
```
./vits/infer.py \
--exp-dir vits/exp \
--epoch 400 \
--tokens data/tokens.txt
huggingface-cli login
huggingface-cli download --local-dir ${exp_dir} yuekai/vall-e_wenetspeech4tts
top_p=1.0
python3 valle/infer.py --output-dir demos_epoch_${epoch}_avg_${avg}_top_p_${top_p} \
--top-k -1 --temperature 1.0 \
--text ./aishell3.txt \
--checkpoint ${exp_dir}/epoch-${epoch}-avg-${avg}.pt \
--text-extractor pypinyin_initials_finals --top-p ${top_p}
```
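
The flag `--text-extractor pypinyin_initials_finals` suggests that Mandarin text is converted to pinyin and each syllable is split into an initial and a final. As a minimal illustration of that split only, here is a hand-written toy; it is not the pypinyin library or the recipe's tokenizer, and the function name and the trailing tone-digit format are assumptions.

```python
from typing import Tuple

# Standard Mandarin initials; two-letter ones come first so "zh" wins over "z".
INITIALS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
]


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split a pinyin syllable such as 'zhong1' into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    # Syllables such as 'an4' have no initial.
    return "", syllable


print([split_syllable(s) for s in ["zhong1", "hao3", "an4"]])
# [('zh', 'ong1'), ('h', 'ao3'), ('', 'an4')]
```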

# Credits
- [vall-e](https://github.com/lifeiteng/vall-e)
@@ -16,8 +16,12 @@
Phonemize Text and EnCodec Audio.
Usage example:
-    python3 bin/tokenizer.py \
-        --src_dir ./data/manifests --output_dir ./data/tokenized
+    python3 ./local/compute_neural_codec_and_prepare_text_tokens.py --dataset-parts "${dataset_parts}" \
+        --text-extractor ${text_extractor} \
+        --audio-extractor ${audio_extractor} \
+        --batch-duration 2500 --prefix "wenetspeech4tts" \
+        --src-dir "data/manifests" --split 100 \
+        --output-dir "${audio_feats_dir}/wenetspeech4tts_${dataset_parts}_split_100"
"""
import argparse
@@ -523,7 +527,7 @@ def main():
"wenetspeech4tts",
]:
part = part.resample(24000)
-assert args.prefix_lower() in [
+assert args.prefix.lower() in [
"ljspeech",
"aishell",
"baker",
@@ -557,36 +561,26 @@ def main():
                 # TextTokenizer
                 if args.text_extractor:
                     for c in tqdm(part):
-                        if (
-                            args.prefix == "baker"
-                            and args.text_extractor == "labeled_pinyin"
-                        ):
-                            phonemes = c.supervisions[0].custom["tokens"]["text"]
-                            unique_symbols.update(phonemes)
+                        if args.prefix == "ljspeech":
+                            text = c.supervisions[0].custom["normalized_text"]
+                            text = text.replace("”", '"').replace("“", '"')
+                            phonemes = tokenize_text(text_tokenizer, text=text)
+                        elif args.prefix in [
+                            "aishell",
+                            "aishell2",
+                            "wenetspeech4tts",
+                            "libritts",
+                            "libritts-r",
+                        ]:
+                            phonemes = tokenize_text(
+                                text_tokenizer, text=c.supervisions[0].text
+                            )
+                            if c.supervisions[0].custom is None:
+                                c.supervisions[0].custom = {}
+                            c.supervisions[0].normalized_text = c.supervisions[0].text
                         else:
-                            if args.prefix == "ljspeech":
-                                text = c.supervisions[0].custom["normalized_text"]
-                                text = text.replace("”", '"').replace("“", '"')
-                                phonemes = tokenize_text(text_tokenizer, text=text)
-                            elif args.prefix in [
-                                "aishell",
-                                "aishell2",
-                                "wenetspeech4tts",
-                                "libritts",
-                                "libritts-r",
-                            ]:
-                                phonemes = tokenize_text(
-                                    text_tokenizer, text=c.supervisions[0].text
-                                )
-                                if c.supervisions[0].custom is None:
-                                    c.supervisions[0].custom = {}
-                                c.supervisions[0].normalized_text = c.supervisions[
-                                    0
-                                ].text
-                            else:
-                                raise NotImplementedError(f"{args.prefix}")
-                            c.supervisions[0].custom["tokens"] = {"text": phonemes}
-                            unique_symbols.update(phonemes)
+                            raise NotImplementedError(f"{args.prefix}")
+                        unique_symbols.update(phonemes)
                         c.tokens = phonemes
                         assert c.supervisions[
                             0
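
Read in isolation, the rewritten branch in this hunk reduces to choosing which text field is tokenized for each dataset prefix. Below is a standalone sketch of that choice, using a plain dict in place of a lhotse supervision. The function name and dict layout are hypothetical, and the quote handling assumes the two `replace` calls target curly double quotes.

```python
def text_to_tokenize(prefix: str, supervision: dict) -> str:
    """Pick the text that would be passed to the tokenizer for a given prefix."""
    if prefix == "ljspeech":
        text = supervision["custom"]["normalized_text"]
        # Map curly double quotes to straight ones before tokenizing.
        return text.replace("\u201d", '"').replace("\u201c", '"')
    if prefix in ("aishell", "aishell2", "wenetspeech4tts", "libritts", "libritts-r"):
        return supervision["text"]
    # Any other prefix, including "baker", is rejected in this branch.
    raise NotImplementedError(prefix)


print(text_to_tokenize("wenetspeech4tts", {"text": "你好"}))  # 你好
```

One consequence visible in the diff: with the `baker` / `labeled_pinyin` special case removed, a `baker` prefix now reaches the final `else` and raises `NotImplementedError` in this code path.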
29 changes: 14 additions & 15 deletions egs/wenetspeech4tts/TTS/prepare.sh
@@ -5,13 +5,12 @@ set -eou pipefail
# fix segmentation fault reported in https://github.com/k2-fsa/icefall/issues/674
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

-j=16
-stage=2
-stop_stage=2
+stage=1
+stop_stage=4

dl_dir=$PWD/download

-dataset_parts="-p Basic" # -p Premium for Premium dataset only
+dataset_parts="Premium" # Basic for all 10k hours data, Premium for about 10% of the data

text_extractor="pypinyin_initials_finals" # default is espeak for English
audio_extractor="Encodec" # or Fbank
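
Each block of the script is guarded by a test of the form `[ $stage -le N ] && [ $stop_stage -ge N ]`, so the new defaults change which stages run when no arguments are given. A small sketch of that selection follows; the function name is hypothetical, and the 0 to 4 stage range is an assumption, since only stages 2 to 4 appear in this diff.

```python
def stages_to_run(stage, stop_stage, all_stages=range(0, 5)):
    """Stages N with stage <= N <= stop_stage, mirroring the shell guard."""
    return [n for n in all_stages if stage <= n and stop_stage >= n]


print(stages_to_run(2, 2))  # old defaults: [2]
print(stages_to_run(1, 4))  # new defaults: [1, 2, 3, 4]
```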
@@ -62,37 +61,37 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
python3 ./local/compute_neural_codec_and_prepare_text_tokens.py --dataset-parts "${dataset_parts}" \
--text-extractor ${text_extractor} \
--audio-extractor ${audio_extractor} \
-      --batch-duration 2500 \
-      --prefix "wenetspeech4tts" \
+      --batch-duration 2500 --prefix "wenetspeech4tts" \
       --src-dir "data/manifests" \
       --split 100 \
-      --output-dir "${audio_feats_dir}/${prefix}_baisc_split_100"
+      --output-dir "${audio_feats_dir}/wenetspeech4tts_${dataset_parts}_split_100"
+    cp ${audio_feats_dir}/wenetspeech4tts_${dataset_parts}_split_100/unique_text_tokens.k2symbols ${audio_feats_dir}
fi
touch ${audio_feats_dir}/.wenetspeech4tts.tokenize.done
fi

if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
-  log "Stage 13: Combine features for basic"
-  if [ ! -f ${audio_feats_dir}/wenetspeech4tts_cuts_Baisc.jsonl.gz ]; then
-    pieces=$(find ${audio_feats_dir}/wenetspeech4tts_baisc_split_100 -name "*.jsonl.gz")
-    lhotse combine $pieces ${audio_feats_dir}/wenetspeech4tts_cuts_Baisc.jsonl.gz
+  log "Stage 3: Combine features"
+  if [ ! -f ${audio_feats_dir}/wenetspeech4tts_cuts_${dataset_parts}.jsonl.gz ]; then
+    pieces=$(find ${audio_feats_dir}/wenetspeech4tts_${dataset_parts}_split_100 -name "*.jsonl.gz")
+    lhotse combine $pieces ${audio_feats_dir}/wenetspeech4tts_cuts_${dataset_parts}.jsonl.gz
fi
fi

if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
-  log "Stage 3: Prepare wenetspeech4tts train/dev/test"
+  log "Stage 4: Prepare wenetspeech4tts train/dev/test"
if [ ! -e ${audio_feats_dir}/.wenetspeech4tts.train.done ]; then

lhotse subset --first 400 \
-      ${audio_feats_dir}/wenetspeech4tts_cuts_Baisc.jsonl.gz \
+      ${audio_feats_dir}/wenetspeech4tts_cuts_${dataset_parts}.jsonl.gz \
${audio_feats_dir}/cuts_dev.jsonl.gz

lhotse subset --last 400 \
-      ${audio_feats_dir}/wenetspeech4tts_cuts_Baisc.jsonl.gz \
+      ${audio_feats_dir}/wenetspeech4tts_cuts_${dataset_parts}.jsonl.gz \
${audio_feats_dir}/cuts_test.jsonl.gz

lhotse copy \
-      ${audio_feats_dir}/wenetspeech4tts_cuts_Baisc.jsonl.gz \
+      ${audio_feats_dir}/wenetspeech4tts_cuts_${dataset_parts}.jsonl.gz \
${audio_feats_dir}/cuts_train.jsonl.gz

touch ${audio_feats_dir}/.wenetspeech4tts.train.done
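
Stage 4 takes the first 400 cuts as dev, the last 400 as test, and copies the whole manifest as train, so as written in this stage the train manifest still contains the dev and test cuts. A toy sketch with integers standing in for cuts (plain Python lists, not lhotse code; the manifest size of 1000 is arbitrary) makes the overlap explicit:

```python
cuts = list(range(1000))  # stand-in for the combined cut manifest

dev = cuts[:400]          # lhotse subset --first 400
test = cuts[-400:]        # lhotse subset --last 400
train = list(cuts)        # lhotse copy, with no filtering

# Every dev and test cut is also present in train.
overlap = (set(dev) | set(test)) & set(train)
print(len(overlap))  # 800
```

Whether that overlap matters depends on how the dev and test manifests are used downstream, which this diff does not show.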