
Training result #158

Open · yiwei0730 opened this issue Aug 10, 2023 · 2 comments
yiwei0730 commented Aug 10, 2023

I'd like to ask about my training results. I combined the AISHELL3, aidata, and one other Chinese dataset, for a total of 600 hours of training data. Although the three corpora are not sampled at 24000 Hz, I set `cut_set = cut_set.resample(24000)` at line 184 of `bin/tokenizer.py`, so they should have been converted to 24000 Hz.
I followed the documentation and trained with prefix mode 1.
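As an aside on the resampling step: lhotse's `CutSet.resample` uses a proper signal resampler under the hood, but the sample-count bookkeeping it implies can be sketched with plain linear interpolation. This is a simplified, hypothetical stand-in for illustration, not the actual lhotse implementation:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustration only).

    Shows the length relationship a resample to 24 kHz produces:
    n_out = n_in * dst_rate / src_rate.
    """
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        # Map output index i to a fractional position in the input.
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 800 samples at a nominal 16 kHz -> 1200 samples at 24 kHz.
audio_16k = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5] * 100
audio_24k = resample_linear(audio_16k, 16000, 24000)
```

The key point is only that every cut ends up with the sample rate EnCodec expects (24 kHz), regardless of the corpus's original rate.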

```shell
python3 bin/trainer.py --world-size 2 --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
  --num-buckets 6 --dtype "bfloat16" --save-every-n 10000 --valid-interval 20000 \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
  --base-lr 0.05 --warmup-steps 200 --average-period 0 \
  --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \
  --exp-dir ${exp_dir}
```

Train NAR model:

```shell
cp ${exp_dir}/best-valid-loss.pt ${exp_dir}/epoch-2.pt  # --start-epoch 3=2+1
```

```shell
python3 bin/trainer.py --world-size 2 --max-duration 40 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
  --num-buckets 6 --dtype "float32" --save-every-n 10000 --valid-interval 20000 \
  --model-name valle --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
  --base-lr 0.05 --warmup-steps 200 --average-period 0 \
  --num-epochs 40 --start-epoch 3 --start-batch 0 --accumulate-grad-steps 4 \
  --exp-dir ${exp_dir}
```
But when I synthesize with the trained model on unseen data, the following issues occur:

1. The latter part of the prompt often appears at the beginning of the synthesized speech.
2. Synthesizing long sentences leads to repeated or skipped segments in the latter part of the output.

Is there any way to improve these situations?
@lifeiteng (Owner) commented:
  1. Take a look at `--prefix-mode 2` or `4`; `--prefix-mode 1` is known to have exactly this problem.
  2. The second one is more or less a common failure mode of AR models.
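Concretely, the suggestion amounts to rerunning AR (stage 1) training with the prefix mode changed and every other flag kept as in the question; a hedged sketch of the rerun (flag values copied from the commands above, only `--prefix-mode` differing):

```shell
# Re-run AR training with --prefix-mode 4 (or 2) instead of 1:
python3 bin/trainer.py --world-size 2 --max-duration 80 --filter-min-duration 0.5 \
  --filter-max-duration 14 --train-stage 1 --num-buckets 6 --dtype "bfloat16" \
  --save-every-n 10000 --valid-interval 20000 --model-name valle \
  --share-embedding true --norm-first true --add-prenet false \
  --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 4 \
  --base-lr 0.05 --warmup-steps 200 --average-period 0 \
  --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \
  --exp-dir ${exp_dir}
```

Note that changing the prefix mode changes how the acoustic prompt is constructed during training, so the AR model (and the NAR stage after it) would need to be retrained from scratch.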

decajcd commented Jun 21, 2024

How should other datasets be preprocessed?
