
planning: fast open source tts for ichigo #94

Open
PodsAreAllYouNeed opened this issue Oct 18, 2024 · 8 comments

PodsAreAllYouNeed commented Oct 18, 2024

We need to replace the current FishSpeech with a better TTS model.

WIP shortlist of possible candidates:

Test sentence:
I'm Ichigo, a local AI created by Homebrew Research. I'm here to help answer your questions and make your life easier.

Samples
https://drive.google.com/drive/folders/1FbR5H7rqirHDgxbjxO8Zwhxsj5y4t_mq?usp=sharing

| Name | License | Code | Paper | Demo | Comments |
|---|---|---|---|---|---|
| Tacotron2 | BSD 3-Clause | https://github.com/NVIDIA/tacotron2 | https://arxiv.org/pdf/1712.05884 | https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_tacotron2.ipynb | Faster than real time and uses mel-spectrograms, which can be very fast, but it sounds really terrible compared to recent models. Probably not usable. |
| Hifi-GAN | MIT | https://github.com/jik876/hifi-gan | https://arxiv.org/abs/2010.05646 | https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_hifigan.ipynb#scrollTo=b3b54df5 | Old, but it has been used by many papers. Sounds better than Tacotron2, though not by much. |
| FastSpeech2 | | | | | |
| VITS | | | | | |
| VALLE | | | | | |
| NaturalSpeech2 | | | | | |
| Jets | | | | | |
| MELLE | | | | | |
| VALLE-2 | | | | | Tried this using Amphion. It is not able to pronounce "Ichigo" and "AI" properly; probably something went wrong with the phoneme conversion. E2/F5-TTS is a bit better at this. |
| Voicebox | | | | | |
| E2/F5-TTS | MIT | https://github.com/SWivid/F5-TTS | https://arxiv.org/abs/2410.06885 | https://huggingface.co/spaces/mrfakename/E2-F5-TTS | Generation seems pretty good, but not sure if it will be fast enough. Needs a transcript of the reference audio; F5-TTS needs speed set to 0.8 for better generation (see the sketch after this table). |
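For context on the E2/F5-TTS row: generation is reference-conditioned, so each call takes a reference clip, the transcript of that clip, the text to speak, and a speed multiplier. A minimal sketch of that call shape, assuming a hypothetical `synthesize` wrapper (this is not the actual SWivid/F5-TTS API, whose entry point is the repo's inference script, and the file names are made up):

```python
# Hypothetical wrapper illustrating the F5-TTS call shape noted in the
# table above; NOT the real SWivid/F5-TTS API.
def synthesize(ref_audio: str, ref_text: str, gen_text: str, speed: float = 1.0):
    """Clone the voice in `ref_audio` (transcribed by `ref_text`) and
    speak `gen_text`; `speed` scales the speaking rate."""
    ...  # stands in for the actual model call

# Usage mirroring the table notes: a reference clip plus its transcript
# are required, and speed=0.8 reportedly generates better output.
wav = synthesize(
    ref_audio="ichigo_ref.wav",  # hypothetical reference clip
    ref_text="I'm Ichigo, a local AI created by Homebrew Research.",
    gen_text="I'm here to help answer your questions and make your life easier.",
    speed=0.8,
)
```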
tikikun (Collaborator) commented Oct 20, 2024

Why does only the F5-TTS sample work well? Everything else seems pretty bad.

PodsAreAllYouNeed self-assigned this Oct 20, 2024
hahuyhoang411 (Contributor) commented Oct 21, 2024

[Screenshot 2024-10-21 at 08:37]

With F5 we can change the system prompt of Ichigo a bit and make it sound more natural.

PodsAreAllYouNeed (Author) commented Oct 21, 2024

Tested on TTS Arena and added to Drive:

Commercial
ElevenLabs
FishSpeech v1.4
PlayHT2.0
PlayHT3.0mini
XTTSv2

Non-Commercial
GPT-SoVITS (MIT License)
MeloTTS (MIT License) (Multi-lingual, multi-accent)
OpenVoicev2 (MIT License)
Parler-TTS and Parler-TTS Large (Apache-2.0)
StyleTTS2 (MIT License)

Unknown license
VoiceCraftV2

PodsAreAllYouNeed (Author) commented
After testing these models, it seems F5-TTS is the only open-source TTS that gets both the pronunciation of "Ichigo" and the read-out of the acronym "AI" correct. The commercial ones have no problem with this, of course. The next question is whether F5-TTS inference will be fast enough. Will update after some testing.

tikikun (Collaborator) commented Oct 23, 2024

F5-TTS VRAM usage is quite concerning. Can you make a direct comparison with FishSpeech?

PodsAreAllYouNeed (Author) commented Oct 24, 2024

I used nvtop to monitor F5-TTS on my machine. Based on their provided inference script, it requires 2.3GB of GPU memory during inference. I tested this on a 214-word generation, which the inference script splits into 8 batches; the maximum memory stays constant at 2.3GB across the batches.
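As a cross-check on nvtop (which reports whole-process usage), the allocator's high-water mark can also be read directly from PyTorch. A minimal sketch, where `run_inference` is a hypothetical stand-in for the F5-TTS inference call being profiled:

```python
import torch

def run_inference():
    ...  # hypothetical stand-in for the F5-TTS inference call

torch.cuda.reset_peak_memory_stats()  # clear the allocator's high-water mark
run_inference()

# Peak memory held by tensors. nvtop reads somewhat higher because it also
# counts the CUDA context and the allocator's cached blocks for the process.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```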

For the 214-word generation:

- Total time taken: 17.9s
- Main process (inference): 14.9s
- Model loading: 3.0s
- Audio generated: 87s
- RTF: 0.21 (incl. model loading), 0.17 (inference only)

This is very close to the reported 0.15 RTF. Either way, generation is faster than real time. Not sure if we will get the same result on a consumer-grade GPU, though. The F5-TTS authors used 3090s for small-scale experiments and A100s for large-scale experiments; not sure which GPU they used to calculate RTF.
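For reference, RTF here is just wall-clock generation time divided by the duration of audio produced; plugging in the timings above:

```python
total_s   = 17.9   # total time, including model loading
loading_s = 3.0    # model loading
audio_s   = 87.0   # duration of generated audio

rtf_total     = total_s / audio_s                # 17.9 / 87 ≈ 0.21
rtf_inference = (total_s - loading_s) / audio_s  # 14.9 / 87 ≈ 0.17
print(f"RTF: {rtf_total:.2f} incl. loading, {rtf_inference:.2f} inference only")
```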

tikikun (Collaborator) commented Oct 24, 2024

15GB is nuts

PodsAreAllYouNeed (Author) commented

> 15GB is nuts

Sorry, before I edited I said 15GB, but that was an error on my part. Someone else's job hadn't cleared the GPU and it was sitting at 15GB.

Labels: none yet
Projects: Status: In Progress
Development: no branches or pull requests
3 participants