
Can you provide speaker embedding samples and inference samples? #6

Open
ljh0412 opened this issue Aug 24, 2022 · 2 comments
ljh0412 commented Aug 24, 2022

First of all, I really appreciate this repo. It has helped me a lot in learning about TTS.

However, I think I ran into some problems at the inference stage.

I trained the model on LibriTTS with configs adapted from the FastSpeech2 repo, only removing the language options.
(If you wish, I can open a pull request with them. It would be helpful for others training the model.)

While the training loss behaved as you showed, I cannot get proper duration predictions when I run inference.

I checked the training stage, where the synth_one_sample function runs, by saving wavs, and the predicted and reconstructed speech were of fairly good quality (with a bit of error in the mel prediction, though).

So I suspect there may be some issue with the mel embedding used by the conditional normalization layer, or with the speaker embedding.

Maybe there is some conflict between them?
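
For reference, a conditional normalization layer of this kind typically predicts the LayerNorm scale and shift from the speaker (or utterance-level mel) embedding. Here is a minimal PyTorch sketch of that idea, with illustrative names rather than this repo's actual modules:

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from a conditioning
    embedding (e.g., a speaker embedding). Illustrative sketch only."""

    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        # Plain normalization; the affine parameters come from the condition.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(cond_dim, hidden_dim)
        self.to_beta = nn.Linear(cond_dim, hidden_dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.to_beta(cond).unsqueeze(1)
        return gamma * self.norm(x) + beta
```

Because gamma and beta are linear in the conditioning vector, an embedding with very large values directly yields very large activations, which turns out to be relevant to the resolution further down this thread.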

Either way, it would be helpful for me and other people to see some inference examples, such as speaker embedding samples and synthesized outputs.

I attach some samples, configs, and commands here:
tested_data.zip


cantabile-kwok commented Nov 15, 2022

I'm getting a very similar problem, and it has been troubling me for days. Have you solved it?

Here are my TensorBoard logs. They look pretty strange: the losses stop decreasing after a very short time (several thousand steps) and then start to blow up. This happens even before phone-level embedding prediction (which could also be part of the trouble!).
[screenshot: TensorBoard loss curves flattening and then diverging]

@cantabile-kwok

Luckily, I found that my problem originated not from the model or the code itself, but from the values of the x-vectors I was using. I used x-vectors extracted with the SpeechBrain library instead of the speaker embedding table, and the values in those x-vectors can range from -100 to +100. This caused numerical instability in the conditional layer norms, so the loss could not decrease. After normalizing the embeddings, training proceeded correctly.
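
For anyone hitting the same issue, here is a minimal sketch of the fix, assuming the SpeechBrain x-vector model `speechbrain/spkrec-xvect-voxceleb` (the exact model and the normalization choice are assumptions; the comment above does not specify them):

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained x-vector extractor (model name assumed; any SpeechBrain
# speaker-embedding model exposing encode_batch works the same way).
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_xvect",
)

# Load one utterance; torchaudio returns (channels, time), which
# encode_batch accepts as a batch of mono waveforms.
wav, sr = torchaudio.load("sample.wav")

xvec = classifier.encode_batch(wav).squeeze()  # raw values can be large, e.g. -100..+100

# Normalize before conditioning the layer norms on it; L2 normalization
# is one simple choice (per-dimension standardization also works).
xvec = xvec / xvec.norm(p=2)
```

With unit-norm (or standardized) embeddings, the predicted LayerNorm scale and shift stay in a reasonable range and the loss can decrease normally.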
