
Speech-to-Phrase

A fast and local speech-to-text system that is personalized with your Home Assistant device and area names.

Speech-to-phrase is not a general purpose speech recognition system. Instead of answering the question "what did the user say?", it answers "which of the phrases I know did the user say?". This is accomplished by combining pre-defined sentence templates with the names of your Home Assistant entities, areas, and floors that have been exposed to Assist.


Supported languages

Custom sentences

You can add your own sentences and list values with --custom-sentences-dir <DIR>, where <DIR> contains one directory of YAML files per language. For example:

python3 -m speech_to_phrase ... --custom-sentences-dir /path/to/custom_sentences

For an English model, you could have /path/to/custom_sentences/en/sentences.yaml with:

language: "en"
lists:
  todo_item:
    values:
      - "apples"  # make sure to use quotes!
      - "bananas"

This would allow you to say "add apples to my shopping list" if you have a todo entity in Home Assistant exposed with the name "shopping list".

You can also create lists with the same names as your sentence trigger wildcards to make them usable in speech-to-phrase.
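
For instance, if one of your sentence triggers uses a wildcard named album (a placeholder name for this example), a list with the same name tells speech-to-phrase which values to listen for:

language: "en"
lists:
  album:
    values:
      - "abbey road"
      - "the wall"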

Docker container

A Docker container is available that can be connected to Home Assistant via the Wyoming integration:

docker run -it -p 10300:10300 \
  -v /path/to/download/models:/models \
  -v /path/to/train:/train rhasspy/wyoming-speech-to-phrase \
  --hass-websocket-uri 'ws://homeassistant.local:8123/api/websocket' \
  --hass-token '<LONG_LIVED_ACCESS_TOKEN>' \
  --retrain-on-start

Models and tools

Speech models and tools are downloaded automatically from HuggingFace.

How it works

Speech-to-phrase combines pre-defined sentence templates with the names of things from your Home Assistant to produce a hassil template file. This file compactly represents all of the possible sentences that can be recognized, which may number in the hundreds, thousands, or even millions.
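
As a simplified illustration (not the exact templates that ship with speech-to-phrase), a hassil template file packs many sentences into a few lines, with {name} filled in from your exposed entity names:

language: "en"
intents:
  HassTurnOn:
    data:
      - sentences:
          - "turn on [the] {name}"
          - "switch on [the] {name}"

Here [the] marks an optional word and {name} refers to a list, so two template lines cover every combination of phrasing and entity name.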

Using techniques developed in the Rhasspy project, speech-to-phrase converts the compact sentence templates into a finite state transducer (FST) which is then used to train a language model for Kaldi. The opengrm tooling is crucial for efficiency during this step, as it avoids unpacking the sentence templates into every possible combination.

Each speech-to-phrase model contains a pre-built dictionary of word pronunciations as well as a phonetisaurus model that will guess pronunciations for unknown words.
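
Conceptually, the lookup works like the Python sketch below; guess_pronunciation is a hypothetical stand-in for the phonetisaurus model, not the project's actual API:

def load_lexicon(path):
    # Pronunciation dictionaries are usually one "word PH ON EM ES" entry per line.
    lexicon = {}
    with open(path, encoding="utf-8") as dict_file:
        for line in dict_file:
            word, *phonemes = line.split()
            lexicon.setdefault(word.lower(), phonemes)
    return lexicon

def phonemes_for(word, lexicon, guess_pronunciation):
    # Known words come straight from the pre-built dictionary;
    # unknown words fall back to a grapheme-to-phoneme guess.
    return lexicon.get(word.lower()) or guess_pronunciation(word)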

During training, a lot of "magic" happens to ensure that your entity, area, and floor names can be recognized automatically (a rough code sketch follows this list):

  • Words with numbers are split apart ("PM2.5" becomes "PM 2.5")
  • Initialisms are further split ("PM" or "P.M." becomes "P M")
  • Digits are replaced with their spoken word forms ("123" becomes "one hundred twenty three")
  • Unknown words have their pronunciations guessed
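
A rough Python sketch of these steps, using the num2words package as a stand-in for the project's own number-to-words handling (the exact rules and wording in speech-to-phrase may differ):

import re

from num2words import num2words  # stand-in for speech-to-phrase's number handling

def normalize_name(name: str) -> str:
    words = []
    for word in name.split():
        # Split letters and numbers apart: "PM2.5" -> ["PM", "2.5"]
        for part in re.findall(r"[^\W\d_]+|\d+(?:\.\d+)?", word):
            if part[0].isdigit():
                # Replace digits with spoken words: "2.5" -> "two point five"
                words.append(num2words(float(part) if "." in part else int(part)))
            elif part.isupper() and len(part) > 1:
                # Split initialisms into letters: "PM" -> "P M"
                words.extend(part)
            else:
                words.append(part)
    return " ".join(words)

print(normalize_name("PM2.5 Sensor"))  # P M two point five Sensor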

To make phrase recognition more robust, a "fuzzy" layer is added on top of Kaldi's transcription output. This layer can correct small errors, such as duplicate or missing words, and also ensures that output names are exactly what you have in Home Assistant.
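
A simplified illustration of that idea, snapping a raw transcript fragment onto the closest known name with Python's difflib (the real fuzzy layer in speech-to-phrase is more involved):

import difflib

def snap_to_known_name(fragment: str, known_names: list[str]) -> str | None:
    # Pick the exposed Home Assistant name closest to what Kaldi produced,
    # tolerating small errors such as a duplicated or dropped word.
    matches = difflib.get_close_matches(fragment, known_names, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(snap_to_known_name("living room lamp lamp", ["living room lamp", "kitchen light"]))
# living room lamp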
