A fast and local speech-to-text system that is personalized with your Home Assistant device and area names.
Speech-to-phrase is not a general-purpose speech recognition system. Instead of answering the question "what did the user say?", it answers "which of the phrases I know did the user say?". This is accomplished by combining pre-defined sentence templates with the names of your Home Assistant entities, areas, and floors that have been exposed to Assist.
You can add your own sentences and list values with `--custom-sentences-dir <DIR>`, where `<DIR>` contains directories of YAML files per language. For example:

```sh
python3 -m speech_to_phrase ... --custom-sentences-dir /path/to/custom_sentences
```

For an English model, you could have `/path/to/custom_sentences/en/sentences.yaml` with:

```yaml
language: "en"
lists:
  todo_item:
    values:
      - "apples"  # make sure to use quotes!
      - "bananas"
```
This would allow you to say "add apples to my shopping list" if you have a todo entity in Home Assistant exposed with the name "shopping list".
You can also create lists with the same names as your sentence trigger wildcards to make them usable in speech-to-phrase.
A Docker container is available that can be connected to Home Assistant via the Wyoming integration:

```sh
docker run -it -p 10300:10300 \
    -v /path/to/download/models:/models \
    -v /path/to/train:/train \
    rhasspy/wyoming-speech-to-phrase \
    --hass-websocket-uri 'ws://homeassistant.local:8123/api/websocket' \
    --hass-token '<LONG_LIVED_ACCESS_TOKEN>' \
    --retrain-on-start
```
Speech models and tools are downloaded automatically from HuggingFace.
Speech-to-phrase combines pre-defined sentence templates with the names of things from your Home Assistant to produce a hassil template file. This file compactly represents all of the possible sentences that can be recognized, which may be hundreds, thousands, or even millions.
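To see why the template file stays compact, consider counting the sentences a template can produce without ever expanding it. The sketch below is not hassil's actual parser; it handles only flat `(a|b)` alternatives, `[optional]` parts, and `{list}` references, which is enough to show the multiplicative growth:

```python
import re

def count_expansions(template: str, lists: dict[str, list[str]]) -> int:
    """Count the sentences a flat template can produce, without expanding it."""
    total = 1
    for alt, opt, lst in re.findall(r"\(([^)]*)\)|\[([^\]]*)\]|\{(\w+)\}", template):
        if alt:
            total *= len(alt.split("|"))  # (on|off) -> 2 choices
        elif opt:
            total *= 2                    # [the] -> present or absent
        elif lst:
            total *= len(lists[lst])      # {name} -> one sentence per list value
    return total

names = {"name": [f"light {i}" for i in range(100)]}
print(count_expansions("turn (on|off) [the] {name}", names))  # 2 * 2 * 100 = 400
```

A handful of templates multiplied over a few hundred exposed names grows quickly, which is how the expanded sentence count can reach the millions while the template file itself stays small.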
Using techniques developed in the Rhasspy project, speech-to-phrase converts the compact sentence templates into a finite state transducer (FST) which is then used to train a language model for Kaldi. The opengrm tooling is crucial for efficiency during this step, as it avoids unpacking the sentence templates into every possible combination.
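One way to picture why enumeration is unnecessary: an FST is a graph whose arcs are labeled with words, so alternatives become parallel arcs and the sentences it accepts correspond to start-to-final paths, which can be counted and processed without listing them. A toy illustration (the real pipeline uses opengrm and Kaldi tooling, not code like this):

```python
from collections import defaultdict

# Toy word-labeled graph for "turn (on|off) [the] light".
arcs = defaultdict(list)              # state -> [(word, next_state)]
arcs[0] = [("turn", 1)]
arcs[1] = [("on", 2), ("off", 2)]     # alternatives are parallel arcs
arcs[2] = [("the", 3), ("light", 4)]  # the arc that skips "the" models the optional word
arcs[3] = [("light", 4)]

def count_paths(state: int, final: int = 4) -> int:
    """Count accepted sentences by counting paths, never enumerating them."""
    if state == final:
        return 1
    return sum(count_paths(nxt, final) for _, nxt in arcs[state])

print(count_paths(0))  # 4: (on|off) x (with/without "the")
```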
Each speech-to-phrase model contains a pre-built dictionary of word pronunciations as well as a phonetisaurus model that will guess pronunciations for unknown words.
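In code form, the lookup amounts to a dictionary check with a guessing fallback. This is a hypothetical sketch: `g2p_guess` stands in for the bundled phonetisaurus model, whose real invocation differs.

```python
def pronounce(word: str, lexicon: dict[str, str], g2p_guess) -> str:
    """Return a phoneme string: dictionary lookup first, G2P guess for unknowns."""
    known = lexicon.get(word.lower())
    return known if known is not None else g2p_guess(word)

lexicon = {"light": "L AY T", "kitchen": "K IH CH AH N"}
print(pronounce("kitchen", lexicon, g2p_guess=lambda w: "<guessed>"))  # from dictionary
print(pronounce("zigbee", lexicon, g2p_guess=lambda w: "<guessed>"))   # guessed
```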
During training, a lot of "magic" happens to ensure that your entity, area, and floor names can be recognized automatically (a rough code sketch of these steps follows the list):
- Words with numbers are split apart ("PM2.5" becomes "PM 2.5")
- Initialisms are further split ("PM" or "P.M." becomes "P M")
- Digits are replaced with their spoken word forms ("123" becomes "one hundred twenty three")
- Unknown words have their pronunciations guessed
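The list above maps to straightforward text processing. Here is a minimal sketch of the first three steps; the regexes and number words are simplified, and the real implementation handles more languages and edge cases:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spoken form for 0-999; the real system covers a much wider range."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")

def normalize(name: str) -> str:
    # Split letters from digits: "PM2.5" -> "PM 2.5"
    name = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", name)
    name = re.sub(r"(?<=\d)(?=[A-Za-z])", " ", name)
    words = []
    for token in name.split():
        if re.fullmatch(r"(?:[A-Z]\.?){2,}", token):
            # Split initialisms: "PM" or "P.M." -> "P M"
            words.extend(token.replace(".", ""))
        elif token.isdigit():
            # Replace digits with spoken words: "123" -> "one hundred twenty three"
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)

print(normalize("PM2.5"))     # -> "P M 2.5"
print(normalize("Room 123"))  # -> "Room one hundred twenty three"
```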
To make phrase recognition more robust, a "fuzzy" layer is added on top of Kaldi's transcription output. This layer can correct small errors, such as duplicate or missing words, and also ensures that output names are exactly what you have in Home Assistant.
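As a rough analogy, snapping a noisy transcript to the nearest known phrase can be done with standard fuzzy matching. The sketch below uses Python's difflib for character-level similarity; the actual fuzzy layer works differently, operating over words against the trained phrase set:

```python
import difflib

def correct_transcript(transcript: str, known_phrases: list[str]) -> str | None:
    """Snap a transcript to the closest known phrase, or None if nothing is close."""
    matches = difflib.get_close_matches(transcript, known_phrases, n=1, cutoff=0.8)
    return matches[0] if matches else None

phrases = ["turn on the kitchen light", "turn off the kitchen light"]
# A duplicated word is absorbed by the fuzzy match:
print(correct_transcript("turn on the the kitchen light", phrases))
# -> "turn on the kitchen light"
```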