
planning: fast open source tts for ichigo #94

Open
PodsAreAllYouNeed opened this issue Oct 18, 2024 · 8 comments

PodsAreAllYouNeed commented Oct 18, 2024

We need to replace the current FishSpeech with a better TTS model.

WIP shortlist of possible candidates:

Test sentence:
I'm Ichigo, a local AI created by Homebrew Research. I'm here to help answer your questions and make your life easier.

Samples
https://drive.google.com/drive/folders/1FbR5H7rqirHDgxbjxO8Zwhxsj5y4t_mq?usp=sharing

| Name | License | Code | Paper | Demo | Comments |
|---|---|---|---|---|---|
| Tacotron2 | BSD 3-Clause | https://github.com/NVIDIA/tacotron2 | https://arxiv.org/pdf/1712.05884 | https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_tacotron2.ipynb | Faster than real time and uses mel-spectrograms, which can be very fast, but it sounds really terrible compared to recent models. Probably not usable. |
| Hifi-GAN | MIT | https://github.com/jik876/hifi-gan | https://arxiv.org/abs/2010.05646 | https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/nvidia_deeplearningexamples_hifigan.ipynb#scrollTo=b3b54df5 | Old, but it has been used by many papers. Sounds better than Tacotron2, though not by much. |
| FastSpeech2 | | | | | |
| VITS | | | | | |
| VALLE | | | | | |
| NaturalSpeech2 | | | | | |
| Jets | | | | | |
| MELLE | | | | | |
| VALLE-2 | | | | | Tried this using Amphion. It is not able to pronounce "Ichigo" and "AI" properly; probably something went wrong with the phoneme conversion. E2/F5-TTS is a bit better at this. |
| Voicebox | | | | | |
| E2/F5-TTS | MIT | https://github.com/SWivid/F5-TTS | https://arxiv.org/abs/2410.06885 | https://huggingface.co/spaces/mrfakename/E2-F5-TTS | Generation seems pretty good, but not sure if it will be fast enough. Needs a transcript of the reference audio; F5-TTS needs speed set to 0.8 for better generation (see the sketch after this table). |
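For context on the E2/F5-TTS row: generation is reference-conditioned, so each call takes a reference clip, the transcript of that clip, the text to speak, and a speed multiplier. A minimal sketch of that call shape, assuming a hypothetical `synthesize` wrapper (this is not the actual SWivid/F5-TTS API, whose entry point is the repo's inference script, and the file names are made up):

```python
# Hypothetical wrapper illustrating the F5-TTS call shape noted in the
# table above; NOT the real SWivid/F5-TTS API.
def synthesize(ref_audio: str, ref_text: str, gen_text: str, speed: float = 1.0):
    """Clone the voice in `ref_audio` (transcribed by `ref_text`) and
    speak `gen_text`; `speed` scales the speaking rate."""
    ...  # stands in for the actual model call

# Usage mirroring the table notes: a reference clip plus its transcript
# are required, and speed=0.8 reportedly generates better output.
wav = synthesize(
    ref_audio="ichigo_ref.wav",  # hypothetical reference clip
    ref_text="I'm Ichigo, a local AI created by Homebrew Research.",
    gen_text="I'm here to help answer your questions and make your life easier.",
    speed=0.8,
)
```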
tikikun (Collaborator) commented Oct 20, 2024

Why does only the F5-TTS sample work well? Everything else seems pretty bad.

PodsAreAllYouNeed self-assigned this Oct 20, 2024
hahuyhoang411 (Contributor) commented Oct 21, 2024

[Screenshot 2024-10-21 at 08:37]

With F5 we can change the system prompt of Ichigo a bit and make it sound more natural.

PodsAreAllYouNeed (Author) commented Oct 21, 2024

Tested on TTS Arena and added to Drive:

Commercial
ElevenLabs
FishSpeech v1.4
PlayHT2.0
PlayHT3.0mini
XTTSv2

Non-Commercial
GPT-SoVITS (MIT License)
MeloTTS (MIT License) (Multi-lingual, multi-accent)
OpenVoicev2 (MIT License)
Parler-TTS and Parler-TTS Large (Apache-2.0)
StyleTTS2 (MIT License)

Unknown license
VoiceCraftV2

PodsAreAllYouNeed (Author) commented
After testing these models, it seems F5-TTS is the only open-source TTS that gets both the pronunciation of "Ichigo" and the read-out of the acronym "AI" correct. The commercial ones have no problem with this, of course. The next question is whether F5-TTS inference will be fast enough. Will update after some testing.

tikikun (Collaborator) commented Oct 23, 2024

F5-TTS VRAM usage is quite concerning. Can you make a direct comparison with FishSpeech?

PodsAreAllYouNeed (Author) commented Oct 24, 2024

I used nvtop to monitor F5-TTS on my machine. Based on their provided inference script, it requires 2.3GB of GPU memory during inference. I tested this on a 214-word generation, which the inference script splits into 8 batches; the maximum memory stays constant at 2.3GB across the batches.
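As a cross-check on nvtop (which reports whole-process usage), the allocator's high-water mark can also be read directly from PyTorch. A minimal sketch, where `run_inference` is a hypothetical stand-in for the F5-TTS inference call being profiled:

```python
import torch

def run_inference():
    ...  # hypothetical stand-in for the F5-TTS inference call

torch.cuda.reset_peak_memory_stats()  # clear the allocator's high-water mark
run_inference()

# Peak memory held by tensors. nvtop reads somewhat higher because it also
# counts the CUDA context and the allocator's cached blocks for the process.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")
```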

For the 214-word generation:

- Total time taken: 17.9s
- Main process (inference): 14.9s
- Model loading: 3.0s
- Audio generated: 87s
- RTF: 0.21 (incl. model loading), 0.17 (inference only)

This is very close to the reported 0.15 RTF. Either way, generation is faster than real time. Not sure if we will get the same result on a consumer-grade GPU, though. The F5-TTS authors used 3090s for small-scale experiments and A100s for large-scale experiments; not sure which GPU they used to calculate RTF.
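For reference, RTF here is just wall-clock generation time divided by the duration of audio produced; plugging in the timings above:

```python
total_s   = 17.9   # total time, including model loading
loading_s = 3.0    # model loading
audio_s   = 87.0   # duration of generated audio

rtf_total     = total_s / audio_s                # 17.9 / 87 ≈ 0.21
rtf_inference = (total_s - loading_s) / audio_s  # 14.9 / 87 ≈ 0.17
print(f"RTF: {rtf_total:.2f} incl. loading, {rtf_inference:.2f} inference only")
```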

tikikun (Collaborator) commented Oct 24, 2024

15GB is nuts

PodsAreAllYouNeed (Author) commented

> 15GB is nuts

Sorry, before I edited I said 15GB, but that was an error on my part. Someone else's job hadn't cleared the GPU and it was sitting at 15GB.

Labels: none yet
Projects: Status: In Progress
Development: no branches or pull requests
3 participants