- For kaldi-active-grammar
- kaldi_model_daanzu_20211030-biglm (1.05 GB)
- kaldi_model_daanzu_20211030-mediumlm (651 MB)
- kaldi_model_daanzu_20211030-smalllm (400 MB)
- kaldi_model_daanzu_20200905_1ep-biglm (1.05 GB)
- kaldi_model_daanzu_20200905_1ep-mediumlm (651 MB)
- kaldi_model_daanzu_20200905_1ep-smalllm (400 MB)
- kaldi_model_daanzu_20200328_1ep-mediumlm (322 MB)
- For generic kaldi, or vosk
- If you have trouble downloading, try using
wget --continue
- Latency: I have yet to do formal latency testing, but for command grammars, the latency between the end of the utterance (as determined by the Voice Activity Detector) and receiving the final recognition results is in the range of 10-20ms.
- Metric: Word Error Rate (WER)
- Data sets:
- LibriSpeech Test Clean
- Mozilla Common Voice English Test
- TED-LIUM Release 3 Legacy Test
- TestSet1: my test set combining multiple sources
- Speech Comm: test set from Google's Speech Commands Dataset, consisting of short single-word commands
Note: The tests on commands are not necessarily fair, because they were performed using a full dictation grammar, rather than a reduced command-specific grammar. This is a worst case scenario for accuracy; in practice, speaking commands would perform much more accurately.
Engine | LS Test Clean | CV4 Test | Ted3 Test | TestSet1 | Speech Comm |
---|---|---|---|---|---|
KaldiAG dgesr2-f-1ep LibriSpeech LM | 4.77 | 30.91 | 12.98 | 10.16 | 11.67 |
vosk-model-en-us-aspire-0.2 [carpa] | 17.90 | 69.76 | 55.90 | ||
vosk-model-small-en-us-0.3 | 19.30 | 45.57 | |||
Zamia LibriSpeech LM | 4.56 | 34.28 | 10.34 | 30.16 | |
Amazon Transcribe ** | 8.21% | ||||
CMU PocketSphinx (0.1.15) ** | 31.82% | ||||
Google Speech-to-Text ** | 12.23% | ||||
Mozilla DeepSpeech (0.6.1) ** | 7.55% | ||||
Picovoice Cheetah (v1.2.0) ** | 10.49% | ||||
Picovoice Cheetah LibriSpeech LM (v1.2.0) ** | 8.25% | ||||
Picovoice Leopard (v1.0.0) ** | 8.34% | ||||
Picovoice Leopard LibriSpeech LM (v1.0.0) ** | 6.58% |
**: not tested by me; from Picovoice speech-to-text-benchmark
Fine tuning a generic model for an individual speaker can greatly increase accuracy, at the small cost of recording some training data from the speaker themself. This training data can be recorded specifically for training purposes, or it can be retained from normal use while using another model (or even another engine).
- Very difficult speech.
Model | David Commands (test set) | David Dictation (test set) |
---|---|---|
KaldiAG dgesr-f-1ep generic | 84.94 | 70.59 |
KaldiAG dgesr-f-1ep fine tuned on ~34hr of mixed commands + dictation | 7.11 | 14.46 |
Custom model trained only on ~34hr of mixed commands + dictation | 10.04 | 10.29 |
- Accented speech.
- Shervin Commands: ~1 hour, ~4000 utterances.
- Shervin Dictation: ~20 minutes, ~250 utterances.
Model | Shervin Commands | Shervin Dictation |
---|---|---|
KaldiAG dgesr2-f-1ep generic | 46.98 | 9.21 |
KaldiAG dgesr2-f-1ep fine tuned on Shervin Commands + Dictation | 9.76 | 2.40 |