Chiron - Part 4: Local Text to Speech
slug: chiron-tts
description: Giving Chiron a voice
published: 2024-11-20
Again, since I was envisioning a Siri-like assistant, text-to-speech was a problem I knew I had to solve sooner rather than later. Given the hardware limitations I've noted before (my only GPU is a 2070 SUPER, which means extremely limited VRAM), I quickly realized that running a GPU-based model for TTS alongside a GPU-based model for chat would not be possible. With this in mind, I began searching for options.
ONNX
During my research into lightweight alternatives, I found ONNX (Open Neural Network Exchange), which provided an elegant solution to my resource constraints. ONNX enables efficient execution of machine learning models on CPU, allowing me to reserve GPU resources for the chat functionality. This framework bridges the gap between different ML platforms while maintaining performance, making it particularly suitable for my text-to-speech requirements.
sherpa-onnx
From there, I discovered sherpa-onnx, a comprehensive speech processing toolkit. Its integration of the ONNX runtime addressed my hardware limitations, and its implementation of the VITS text-to-speech model offered impressive quality without excessive resource demands. The inclusion of pre-trained models and a straightforward integration path made it a compelling choice for my project.
sherpa-rs
Finding Rust bindings for sherpa-onnx sealed the deal. While I initially planned to use the library for both text-to-speech and speech-to-text, I later found an alternative that performed slightly better on my PC for STT. The TTS, however, was satisfactory for my use case. In evaluating the available pre-trained models, I landed on the GLaDOS voice model (as a nerd, I loved both Portal 1 and 2).
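For a sense of scale, generating a clip takes only a few lines through the bindings. The sketch below is illustrative rather than authoritative: the type and method names (`VitsTtsConfig`, `VitsTts::new`, `create`, `write_audio_file`) and the model file names are my assumptions from reading the sherpa-rs examples, and may differ between crate versions.

```rust
// Minimal sherpa-rs TTS sketch. NOTE: the type/method names and config
// fields here are assumptions based on the sherpa-rs examples; check
// them against the version of the crate you depend on.
use sherpa_rs::tts::{VitsTts, VitsTtsConfig};

fn main() {
    // Point the config at the downloaded GLaDOS VITS model files
    // (file names are illustrative).
    let config = VitsTtsConfig {
        model: "glados/model.onnx".into(),
        tokens: "glados/tokens.txt".into(),
        ..Default::default()
    };
    let mut tts = VitsTts::new(config);

    // Speaker id 0 at normal speed (1.0); yields raw samples plus the
    // sample rate.
    let audio = tts
        .create("Oh, it's you. It's been a long time.", 0, 1.0)
        .expect("synthesis failed");

    // Write the generated audio out as a wav file for playback.
    sherpa_rs::write_audio_file("glados.wav", &audio.samples, audio.sample_rate)
        .expect("failed to write wav");
}
```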
Pulling it all together
The integration of sherpa-rs with Tauri presented several interesting challenges. The system needed to handle text-to-speech processing without impacting UI responsiveness or creating audio playback delays.
The architecture evolved into a queue-based system using multiple threads: one manages incoming text-to-speech requests and another handles audio playback (sketched below). This design proved essential for maintaining natural speech patterns and fluid conversation flow. On the frontend, I added text segmentation at punctuation marks for natural pauses, along with special handling for code blocks: mimicking ChatGPT's pattern, Chiron skips reading code aloud in favor of saying "You can see the code in our conversation history".
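To illustrate the shape of that pipeline (a simplified sketch, not the actual Chiron code), here is a self-contained version using standard-library channels: one thread synthesizes queued text segments, another plays the resulting files in order, and the producer splits text at sentence-final punctuation before enqueueing. `synthesize_to_wav` and `play_wav` are stand-ins for the real sherpa-rs synthesis and audio playback calls.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for sherpa-rs synthesis: in the real app this writes a wav
// file to disk and returns its path.
fn synthesize_to_wav(text: &str) -> String {
    println!("[synth] {text}");
    format!("segment-{}.wav", text.len()) // illustrative path
}

// Stand-in for audio playback (e.g. via an audio output crate).
fn play_wav(path: &str) {
    println!("[play ] {path}");
}

fn main() {
    // Queue 1: text segments awaiting synthesis.
    let (text_tx, text_rx) = mpsc::channel::<String>();
    // Queue 2: finished wav paths awaiting playback.
    let (audio_tx, audio_rx) = mpsc::channel::<String>();

    // Synthesis thread: keeps heavy work off the UI thread.
    let synth = thread::spawn(move || {
        for segment in text_rx {
            let path = synthesize_to_wav(&segment);
            if audio_tx.send(path).is_err() {
                break; // playback side has shut down
            }
        }
    });

    // Playback thread: plays files strictly in order, so the speech
    // comes out as one fluid utterance.
    let playback = thread::spawn(move || {
        for path in audio_rx {
            play_wav(&path);
        }
    });

    // Naive segmentation at sentence-final punctuation so pauses land
    // in natural places.
    let response = "Hello. I am Chiron! How can I help you today?";
    let mut segment = String::new();
    for ch in response.chars() {
        segment.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            text_tx.send(segment.trim().to_string()).unwrap();
            segment.clear();
        }
    }
    if !segment.trim().is_empty() {
        text_tx.send(segment.trim().to_string()).unwrap();
    }

    drop(text_tx); // close the queue so both threads drain and exit
    synth.join().unwrap();
    playback.join().unwrap();
}
```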
Given the resource constraints, I also wanted to prevent the audio files generated by the text-to-speech process from accumulating over time. To that end, the code actively deletes these wav files after playback, and again on application exit in case the user quits the app before audio finishes playing, ensuring stable long-term operation.
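One way to structure that cleanup looks roughly like this (the names are illustrative, not lifted from the Chiron codebase): keep a registry of generated files, remove each one right after playback, and drain whatever remains from the app's exit handler.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::Mutex;

// Registry of wav files generated but not yet played.
// (Illustrative structure, not the actual Chiron code.)
static PENDING: Mutex<Vec<PathBuf>> = Mutex::new(Vec::new());

// Track a freshly synthesized file.
fn register(path: PathBuf) {
    PENDING.lock().unwrap().push(path);
}

// Delete a file as soon as it has finished playing.
fn remove_after_playback(path: &PathBuf) {
    let _ = fs::remove_file(path); // best-effort cleanup
    PENDING.lock().unwrap().retain(|p| p != path);
}

// Called from the application's exit handler (in Tauri, when the run
// loop reports an exit event) so files left over from interrupted
// playback never accumulate.
fn cleanup_on_exit() {
    for path in PENDING.lock().unwrap().drain(..) {
        let _ = fs::remove_file(&path);
    }
}

fn main() {
    let wav = PathBuf::from("utterance-0001.wav");
    register(wav.clone());
    // ...playback would happen here...
    remove_after_playback(&wav);
    cleanup_on_exit(); // no-op here, everything was already removed
}
```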
While more resource-intensive solutions exist, this implementation achieves a practical balance of performance and efficiency, resulting in a reliable text-to-speech system that could still run on my hardware.