
Multispeaker dataset #44

HobisPL opened this issue Feb 27, 2023 · 5 comments

HobisPL commented Feb 27, 2023

What should a multispeaker dataset look like?
Should each line have a speaker identifier at the end, for example:

wavs/1.wav|transcription.
or
wavs/1.wav|transcription.|1
or
wavs/1.wav|transcription.|speaker_name

152334H (Owner) commented Feb 27, 2023

It's all learned implicitly. There's no fundamental difference between a single-speaker and a multi-speaker dataset apart from the variance of the distribution of conditioning latents and predicted audio.

It is perhaps better to model each wav file as an individual speaker: each one is a point in latent space, and there are general clusters corresponding to individual characters. You could circle each cluster and label it as the broad space of a single speaker's voice, but in practice a sufficiently diverse multispeaker dataset ought to have overlaps between clusters.
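
To make that picture concrete, here is a minimal sketch (not part of this repo) that treats every clip as its own speaker and projects the per-clip conditioning latents to 2D. It assumes the upstream tortoise-tts API (`TextToSpeech.get_conditioning_latents`, `load_audio`) plus scikit-learn; the wav paths are placeholders:

```python
# Sketch: treat every wav as its own "speaker" and inspect the latent clusters.
# Assumes the upstream tortoise-tts package; paths are placeholders.
import torch
from sklearn.decomposition import PCA
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()

paths = ["wavs/1.wav", "wavs/2.wav", "wavs/3.wav"]
latents = []
for path in paths:
    clip = load_audio(path, 22050)  # tortoise conditioning clips are 22.05 kHz
    auto_latent, _ = tts.get_conditioning_latents([clip])
    latents.append(auto_latent.flatten())

# Project the per-clip latents to 2D; clips from the same character should
# land near each other, with overlap on a sufficiently diverse dataset.
points = PCA(n_components=2).fit_transform(
    torch.stack(latents).detach().cpu().numpy()
)
for path, (x, y) in zip(paths, points):
    print(f"{path}: ({x:.3f}, {y:.3f})")
```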

152334H (Owner) commented Feb 27, 2023

I think there is a potential idea to be applied here, actually: you could try to apply the exact same conditioning latent for EVERY line said by a specific character (a sketch follows below). But that would require additional code.

"Multispeaker" in the current case just means exposing the model to more kinds of speakers during training. Ideally the model would learn to clone all of them, conditionally on the input zero-shot latent; in practice it underfits severely with the short number of epoches available in fine-tuning. I suspect a much much longer training run might teach the model to correctly remember all speakers, but it might also just lead to terrible overfitting on the existing lines

@HobisPL changed the title from "multiple speakers dataset" to "Multispeaker speakers dataset" Feb 28, 2023
@HobisPL changed the title from "Multispeaker speakers dataset" to "Multispeaker dataset" Feb 28, 2023
@LorenzoBrugioni

Hi, and thanks for your work.
As of now, if I fine-tune on a single-speaker dataset, it becomes a single-speaker model, or at least that's how it seems to me.
Even when I use the conditioning latents from another speaker during zero-shot inference, I always get the voice of the speaker I fine-tuned on.
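
For reference, a minimal sketch of the procedure being described, assuming the upstream tortoise-tts API; how the fine-tuned weights get loaded is fork-specific, and the paths are placeholders:

```python
# Sketch: zero-shot inference with another speaker's clips on a fine-tuned
# model, to reproduce the behaviour described above. Assumes the upstream
# tortoise-tts API; loading fine-tuned weights is fork-specific.
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()  # point this at your fine-tuned checkpoint (fork-specific)
other_voice = [load_audio("other_speaker/1.wav", 22050)]  # placeholder path
gen = tts.tts_with_preset("Hello world.", voice_samples=other_voice, preset="fast")
torchaudio.save("out.wav", gen.squeeze(0).cpu(), 24000)  # tortoise outputs 24 kHz
```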

@SiddantaK

Hi, is it possible to train this model on a multispeaker dataset? If so, can you give detailed information on how? Thank you in advance.

@GuenKainto

I tried to train with a multispeaker dataset, but something went wrong.
I am training another language. With male-voice data, when I try to clone, it is hard to clone a woman's or a baby's voice, but it can clone another man's voice (~80%). The same goes for female-voice data: it can somewhat clone a man's voice but is only good at women's voices (some high voices are hard, or come out with annoying noise).
When I train with a mix of male and female voices, the cloned output sounds like a random voice: sometimes male, sometimes female, not the voice I want to clone.
