
adding a new language #128

Open
tculjaga opened this issue Sep 1, 2024 · 10 comments

tculjaga commented Sep 1, 2024

Hi, is it possible to add support for a new language such as Slovenian, Croatian, or Serbian?

Do you have a procedure we can follow to train the model for those languages?

ylacombe (Collaborator) commented Sep 2, 2024

Hey @tculjaga, it's possible but there's no guide yet! I don't have the bandwidth for the next few weeks though


Oyemade commented Sep 3, 2024

Would you mind sharing some pointers? I don't mind taking a stab at it. I just successfully fine-tuned MMS for some West African languages and I'm hoping to build off of that.

tculjaga (Author) commented Sep 3, 2024

Hi @ylacombe, thanks for the quick response. I'd like to give it a try. All we need is a small, short guide (a bullet list is enough for a start) on how you do it for a specific language; then we can do our best to battle through it :)

@SherryS997

> Would you mind sharing some pointers? I don't mind taking a stab at it. I just successfully fine-tuned MMS for some Western African languages and hoping to build off of that.

Hello, we're actually trying to train Parler-TTS for Indian languages, and our pilot test gave great results. We followed exactly the same training process and only changed the text encoder to mT5. It worked well!
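The change @SherryS997 describes can be sketched as a config diff. This is a hypothetical illustration, not the repo's actual schema: the field names (`text_encoder_name`, `description_tokenizer_name`) and checkpoint ids are placeholders; the point is that only the text-encoder and tokenizer entries change while the rest of the training setup stays the same.

```python
# Hedged sketch (illustrative field names, not Parler-TTS's exact schema):
# per this thread, the only architectural change for multilingual training
# was swapping the English-centric T5 text encoder for multilingual mT5.
base_config = {
    "text_encoder_name": "google/flan-t5-base",          # English-centric
    "description_tokenizer_name": "google/flan-t5-base",
}

multilingual_config = {
    **base_config,
    # mT5 was pretrained on ~101 languages; note SherryS997's later caveat
    # that it still covers only 12 of the 23 languages they target.
    "text_encoder_name": "google/mt5-base",
    "description_tokenizer_name": "google/mt5-base",
}

print(multilingual_config["text_encoder_name"])
```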

@ylacombe (Collaborator)

Hey @SherryS997,
thanks for giving this feedback. I actually trained a version of Mini v1 which uses mT5 instead of T5, but it's only in English.
Do you think you'd be able to either open-source your model, or to share some snippet?
Thanks!


SherryS997 commented Sep 18, 2024

Hi @ylacombe,
Yes, we plan to open-source our model, along with the dataset, code, and captions, in the next month or two once we’re fully satisfied with the results. This work is part of the TTS research at AI4Bharat, and the model will be designed to support all 23 official languages of India, including English. We're also experimenting with various English accents and are optimistic that the model will handle those effectively as well.
In our pilot training, we used mT5, which delivered excellent results. However, we are now experimenting with other tokenizers, as mT5 supports only 12 of the 23 languages we’re targeting. These alternatives will help us achieve broader language coverage for the project.

@ylacombe (Collaborator)

Hey @SherryS997, let's speak over mail if that's okay with you: yoach [at] huggingface.co


showgan commented Sep 20, 2024

Hi @ylacombe and @SherryS997,
I'm also very interested in training for a low-resource language that isn't supported by any existing tokenizer. I'd really appreciate it if you could share some advice with me as well.
Thanks!

@Strive-for-excellence

> Yes, we plan to open-source our model, along with the dataset, code, and captions, in the next month or two once we're fully satisfied with the results. […]

Great job. Looking forward to your open-source release.

@SherryS997

> I'm also very interested in training for a low resource language which is not supported by any tokenizer. I'd really appreciate it if you could share with me some advice as well.

You may need to build a tokenizer, or better yet, extend the FlanT5 tokenizer. Then all you need to do is point to this tokenizer in the JSON config you start training with.
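The "build a tokenizer" route can be sketched with the Hugging Face `tokenizers` library. This is a minimal, hypothetical example: the tiny in-memory corpus and the vocab size are placeholders, not a real training setup. The "extend FlanT5" route mentioned above would instead load the FlanT5 tokenizer, call `add_tokens(...)` with the new language's pieces, and resize the text encoder's embeddings to match the new vocab size.

```python
# Hedged sketch: training a small BPE tokenizer from scratch for a language
# the stock tokenizer doesn't cover. Corpus and vocab_size are placeholders.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny illustrative corpus (here Croatian, one of the languages from the
# original question).
corpus = [
    "dobar dan, kako ste?",
    "hvala lijepa, dobro sam",
    "vidimo se sutra",
]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=300,
    # Special tokens would need to match what the model's config expects.
    special_tokens=["<unk>", "<pad>", "</s>"],
)
tokenizer.train_from_iterator(corpus, trainer)

# In-corpus text should tokenize into known subword pieces, not <unk>.
encoding = tokenizer.encode("dobar dan")
print(encoding.tokens)
```

After `tokenizer.save_pretrained(...)` (via the `transformers` wrapper), the saved directory is what the JSON config would point at.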
