Chiron - Part 4: Local Text to Speech
slug: chiron-tts
description: Giving Chiron a voice
published: 2024-11-20
Again, since I was envisioning a Siri-like assistant, text-to-speech was a problem I knew I had to solve sooner rather than later. Given the hardware limitations I've noted before (my only GPU is a 2070 SUPER, which means extremely limited VRAM), I quickly realized that running a GPU-based model for TTS alongside a GPU-based model for chat would not be possible. With this in mind, I began searching for options.
ONNX
During my research into lightweight alternatives, I found ONNX (Open Neural Network Exchange), which provided an elegant solution to my resource constraints. ONNX enables efficient execution of machine learning models on CPU, allowing me to reserve GPU resources for the chat functionality. This framework bridges the gap between different ML platforms while maintaining performance, making it particularly suitable for my text-to-speech requirements.
sherpa-onnx
From there, I discovered sherpa-onnx, a comprehensive speech processing toolkit. Its integration of the ONNX runtime addressed my hardware limitations, and its implementation of the VITS text-to-speech model offered impressive quality without excessive resource demands. The inclusion of pre-trained models and a straightforward integration path made it a compelling choice for my project.
sherpa-rs
Finding Rust bindings for sherpa-onnx sealed the deal. While I initially planned to use the library for both text-to-speech and speech-to-text, I later found an alternative that performed slightly better on my PC for STT. The TTS, however, was satisfactory for my use case. In evaluating the available pre-trained models, I landed on the GLaDOS voice model (as a nerd, I loved both Portal 1 and 2).
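For a sense of scale, generating a clip takes only a few lines through the bindings. The sketch below is illustrative rather than authoritative: the type and method names (`VitsTtsConfig`, `VitsTts::new`, `create`, `write_audio_file`) and the model file names are my assumptions from reading the sherpa-rs examples, and may differ between crate versions.

```rust
// Minimal sherpa-rs TTS sketch. NOTE: the type/method names and config
// fields here are assumptions based on the sherpa-rs examples; check
// them against the version of the crate you depend on.
use sherpa_rs::tts::{VitsTts, VitsTtsConfig};

fn main() {
    // Point the config at the downloaded GLaDOS VITS model files
    // (file names are illustrative).
    let config = VitsTtsConfig {
        model: "glados/model.onnx".into(),
        tokens: "glados/tokens.txt".into(),
        ..Default::default()
    };
    let mut tts = VitsTts::new(config);

    // Speaker id 0 at normal speed (1.0); yields raw samples plus the
    // sample rate.
    let audio = tts
        .create("Oh, it's you. It's been a long time.", 0, 1.0)
        .expect("synthesis failed");

    // Write the generated audio out as a wav file for playback.
    sherpa_rs::write_audio_file("glados.wav", &audio.samples, audio.sample_rate)
        .expect("failed to write wav");
}
```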
Pulling it all together
The integration of sherpa-rs with Tauri presented several interesting challenges. The system needed to handle text-to-speech processing without impacting UI responsiveness or creating audio playback delays.
The architecture evolved into a queue-based system using multiple threads: one manages incoming text-to-speech requests and another handles audio playback (sketched below). This design proved essential for maintaining natural speech patterns and fluid conversation flow. On the frontend, I added text segmentation at punctuation marks for natural pauses, along with special handling for code blocks: mimicking ChatGPT's pattern, Chiron skips reading code aloud in favor of saying "You can see the code in our conversation history".
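To illustrate the shape of that pipeline (a simplified sketch, not the actual Chiron code), here is a self-contained version using standard-library channels: one thread synthesizes queued text segments, another plays the resulting files in order, and the producer splits text at sentence-final punctuation before enqueueing. `synthesize_to_wav` and `play_wav` are stand-ins for the real sherpa-rs synthesis and audio playback calls.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for sherpa-rs synthesis: in the real app this writes a wav
// file to disk and returns its path.
fn synthesize_to_wav(text: &str) -> String {
    println!("[synth] {text}");
    format!("segment-{}.wav", text.len()) // illustrative path
}

// Stand-in for audio playback (e.g. via an audio output crate).
fn play_wav(path: &str) {
    println!("[play ] {path}");
}

fn main() {
    // Queue 1: text segments awaiting synthesis.
    let (text_tx, text_rx) = mpsc::channel::<String>();
    // Queue 2: finished wav paths awaiting playback.
    let (audio_tx, audio_rx) = mpsc::channel::<String>();

    // Synthesis thread: keeps heavy work off the UI thread.
    let synth = thread::spawn(move || {
        for segment in text_rx {
            let path = synthesize_to_wav(&segment);
            if audio_tx.send(path).is_err() {
                break; // playback side has shut down
            }
        }
    });

    // Playback thread: plays files strictly in order, so the speech
    // comes out as one fluid utterance.
    let playback = thread::spawn(move || {
        for path in audio_rx {
            play_wav(&path);
        }
    });

    // Naive segmentation at sentence-final punctuation so pauses land
    // in natural places.
    let response = "Hello. I am Chiron! How can I help you today?";
    let mut segment = String::new();
    for ch in response.chars() {
        segment.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            text_tx.send(segment.trim().to_string()).unwrap();
            segment.clear();
        }
    }
    if !segment.trim().is_empty() {
        text_tx.send(segment.trim().to_string()).unwrap();
    }

    drop(text_tx); // close the queue so both threads drain and exit
    synth.join().unwrap();
    playback.join().unwrap();
}
```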
Given the resource constraints, I also wanted to prevent the audio files generated by the text-to-speech process from accumulating over time. To that end, the code actively deletes these wav files after playback, and again on application exit in case the user quits the app before audio finishes playing, ensuring stable long-term operation.
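One way to structure that cleanup looks roughly like this (the names are illustrative, not lifted from the Chiron codebase): keep a registry of generated files, remove each one right after playback, and drain whatever remains from the app's exit handler.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::Mutex;

// Registry of wav files generated but not yet played.
// (Illustrative structure, not the actual Chiron code.)
static PENDING: Mutex<Vec<PathBuf>> = Mutex::new(Vec::new());

// Track a freshly synthesized file.
fn register(path: PathBuf) {
    PENDING.lock().unwrap().push(path);
}

// Delete a file as soon as it has finished playing.
fn remove_after_playback(path: &PathBuf) {
    let _ = fs::remove_file(path); // best-effort cleanup
    PENDING.lock().unwrap().retain(|p| p != path);
}

// Called from the application's exit handler (in Tauri, when the run
// loop reports an exit event) so files left over from interrupted
// playback never accumulate.
fn cleanup_on_exit() {
    for path in PENDING.lock().unwrap().drain(..) {
        let _ = fs::remove_file(&path);
    }
}

fn main() {
    let wav = PathBuf::from("utterance-0001.wav");
    register(wav.clone());
    // ...playback would happen here...
    remove_after_playback(&wav);
    cleanup_on_exit(); // no-op here, everything was already removed
}
```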
While more resource-intensive solutions exist, this implementation achieves a practical balance of performance and efficiency, resulting in a reliable text-to-speech system that could still run on my hardware.