The mozilla-deepspeech-0.6.1 model is a speech recognition neural network pre-trained by Mozilla. It is based on the DeepSpeech architecture (CTC decoder with beam search and an n-gram language model) with a changed neural network topology.
For details on the original DeepSpeech, see the paper.
For details on this model, see the repository.
Metric | Value |
---|---|
Type | Speech recognition |
GFlops per audio frame | 0.0472 |
GFlops per second of audio | 2.36 |
MParams | 47.2 |
Source framework | TensorFlow* |
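
The two complexity figures in the table are linked by the feature frame rate. As a quick sanity check, the sketch below derives the implied frame rate and frame step from those two numbers. The resulting values are an inference from the table, not figures stated in this document.

```python
# Figures copied from the table above.
gflops_per_frame = 0.0472
gflops_per_second_of_audio = 2.36

# Implied number of feature frames per second of audio.
frames_per_second = gflops_per_second_of_audio / gflops_per_frame

# Implied step between consecutive frames, in milliseconds.
frame_step_ms = 1000.0 / frames_per_second

print(round(frames_per_second), round(frame_step_ms))  # prints: 50 20
```
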
Metric | Value | Parameters |
---|---|---|
WER @ Librispeech test-clean | 8.93% | with LM, beam_width = 32, Python CTC decoder |
WER @ Librispeech test-clean | 7.55% | with LM, beam_width = 500, C++ CTC decoder |
NB: beam_width=32 is a low value for a CTC decoder. It was used to achieve a reasonable evaluation time with the Python CTC decoder in Accuracy Checker. Increasing beam_width improves the WER metric but slows down decoding. The Speech Recognition DeepSpeech Demo has a faster C++ CTC decoder module.
Use `accuracy_check [...] --model_attributes <path_to_folder_with_downloaded_model>` to specify the path to additional model attributes. `path_to_folder_with_downloaded_model` is the path to the folder where the current model is downloaded by the Model Downloader tool.
- Audio MFCC coefficients, name: `input_node`, shape: `1, 16, 19, 26`, format: `B, N, T, C`, where:
  - `B` - batch size, fixed to 1
  - `N` - `input_lengths`, number of audio frames in this section of audio
  - `T` - context frames: along with the current frame, the network expects 9 preceding frames and 9 succeeding frames. The absent context frames are filled with zeros.
  - `C` - 26 MFCC coefficients per each frame

  See `<omz_dir>/models/public/mozilla-deepspeech-0.6.1/accuracy-check.yml` for all audio preprocessing and feature extraction parameters.
- Number of audio frames, INT32 value, name: `input_lengths`, shape: `1`.
- LSTM in-state (c) and input (h, a.k.a. hidden state) vectors. Names: `previous_state_c` and `previous_state_h`, shapes: `1, 2048`, format: `B, C`.

  When splitting a long audio into chunks, these inputs must be fed with the corresponding outputs from the previous chunk. Chunk processing order must be from early to late audio positions.
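
The `input_node` layout described above can be assembled from a plain `(num_frames, 26)` MFCC matrix. The following is a minimal sketch, assuming NumPy. The helper names `make_context_windows` and `split_into_chunks` are hypothetical. Zero-padding the last partial chunk and reporting the real frame count through `input_lengths` is an assumption, not something this document specifies. MFCC extraction itself is out of scope here; see the `accuracy-check.yml` file mentioned above for those parameters.

```python
import numpy as np

NUM_CONTEXT = 9                 # context frames on each side of the current frame
WINDOW = 2 * NUM_CONTEXT + 1    # T = 19
NUM_MFCC = 26                   # C
CHUNK_FRAMES = 16               # N


def make_context_windows(features):
    """Turn (num_frames, 26) features into (num_frames, 19, 26) windows.

    Context frames that fall outside the audio are filled with zeros.
    """
    num_frames = features.shape[0]
    if num_frames == 0:
        return np.zeros((0, WINDOW, NUM_MFCC), dtype=features.dtype)
    padded = np.pad(features, ((NUM_CONTEXT, NUM_CONTEXT), (0, 0)), mode="constant")
    return np.stack([padded[i:i + WINDOW] for i in range(num_frames)])


def split_into_chunks(windows):
    """Yield (input_node, input_lengths) pairs in time order.

    input_node has shape (1, 16, 19, 26); input_lengths has shape (1,), INT32.
    Assumption: the final partial chunk is zero-padded to 16 frames.
    """
    for start in range(0, windows.shape[0], CHUNK_FRAMES):
        chunk = windows[start:start + CHUNK_FRAMES]
        valid = chunk.shape[0]
        if valid < CHUNK_FRAMES:
            pad = np.zeros((CHUNK_FRAMES - valid, WINDOW, NUM_MFCC), dtype=chunk.dtype)
            chunk = np.concatenate([chunk, pad], axis=0)
        yield chunk[np.newaxis], np.array([valid], dtype=np.int32)
```

For example, 40 feature frames would produce three chunks with `input_lengths` of 16, 16 and 8.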
- Audio MFCC coefficients, name: `input_node`, shape: `1, 16, 19, 26`, format: `B, N, T, C`, where:
  - `B` - batch size, fixed to 1
  - `N` - number of audio frames in this section of audio, fixed to 16
  - `T` - context frames: along with the current frame, the network expects 9 preceding frames and 9 succeeding frames. The absent context frames are filled with zeros.
  - `C` - 26 MFCC coefficients in each frame

  See `<omz_dir>/models/public/mozilla-deepspeech-0.6.1/accuracy-check.yml` for all audio preprocessing and feature extraction parameters.
- LSTM in-state and input vectors. Names: `previous_state_c` and `previous_state_h`, shapes: `1, 2048`, format: `B, C`.

  When splitting a long audio into chunks, these inputs must be fed with the corresponding outputs from the previous chunk. Chunk processing order must be from early to late audio positions.
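
The state hand-off between chunks can be sketched as a simple loop. This is an illustration only: `run_chunks` is a hypothetical helper, and `infer` is a hypothetical callable standing in for whatever inference API is used. It is assumed to accept a dict of named inputs and return a dict keyed by `logits`, `new_state_c` and `new_state_h`. Starting the first chunk from all-zero state vectors is also an assumption.

```python
import numpy as np

STATE_SIZE = 2048   # size of each LSTM state vector
ALPHABET_SIZE = 29  # 28 symbols plus the CTC blank


def run_chunks(chunks, infer):
    """Run chunks in time order, feeding each chunk the previous chunk's state.

    chunks: iterable of float arrays with shape (1, 16, 19, 26), earliest first.
    infer:  hypothetical callable, dict of inputs -> dict of outputs.
    Returns the per-frame probabilities of all chunks, concatenated along N.
    """
    # Assumption: the first chunk starts from zero state.
    state_c = np.zeros((1, STATE_SIZE), dtype=np.float32)
    state_h = np.zeros((1, STATE_SIZE), dtype=np.float32)
    all_probs = []
    for chunk in chunks:
        outputs = infer({
            "input_node": chunk,
            "previous_state_c": state_c,
            "previous_state_h": state_h,
        })
        all_probs.append(outputs["logits"])
        # Carry the state over to the next chunk.
        state_c = outputs["new_state_c"]
        state_h = outputs["new_state_h"]
    if not all_probs:
        return np.zeros((0, 1, ALPHABET_SIZE), dtype=np.float32)
    return np.concatenate(all_probs, axis=0)
```

Keeping the loop independent of a specific runtime makes the ordering requirement explicit: each chunk depends on the state produced by the one before it, so the chunks cannot be processed out of order or in parallel.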
- Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: `16, 1, 29`, format: `N, B, C`, where:
  - `N` - number of audio frames in this section of audio
  - `B` - batch size, fixed to 1
  - `C` - alphabet size, including the CTC blank symbol

  The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

  NB: `logits` is probabilities after softmax, despite its name.
- LSTM out-state and output vectors. Names: `new_state_c` and `new_state_h`, shapes: `1, 2048`, format: `B, C`. See Inputs.
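
To make the alphabet mapping concrete, here is a minimal sketch of greedy (best-path) CTC decoding over the `logits` output, assuming NumPy. This is not the decoder used for the accuracy figures above, which rely on beam search with a language model. It only illustrates how symbol indices and the blank symbol are interpreted. The function name `ctc_greedy_decode` is hypothetical.

```python
import numpy as np

# Index 0 = space, 1..26 = "a".."z", 27 = apostrophe.
ALPHABET = " abcdefghijklmnopqrstuvwxyz'"
BLANK = 28  # CTC blank symbol


def ctc_greedy_decode(probs):
    """Decode per-frame probabilities of shape (N, 1, 29) into a string.

    Takes the most likely symbol per frame, collapses consecutive repeats,
    then drops blanks.
    """
    best = np.argmax(probs[:, 0, :], axis=-1)
    chars = []
    prev = BLANK
    for idx in best:
        idx = int(idx)
        if idx != prev and idx != BLANK:
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)
```

A blank between two identical symbols keeps them as separate characters, which is how CTC represents doubled letters.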
- Per-frame probabilities (after softmax) for every symbol in the alphabet, name: `logits`, shape: `16, 1, 29`, format: `N, B, C`, where:
  - `N` - number of audio frames in this section of audio, fixed to 16
  - `B` - batch size, fixed to 1
  - `C` - alphabet size, including the CTC blank symbol

  The per-frame probabilities are to be decoded with a CTC decoder. The alphabet is: 0 = space, 1...26 = "a" to "z", 27 = apostrophe, 28 = CTC blank symbol.

  NB: `logits` is probabilities after softmax, despite its name.
- LSTM out-state and output vectors. Names:
  - `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.2` for `new_state_c`
  - `cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.1` for `new_state_h`

  Shapes: `1, 2048`, format: `B, C`. See the corresponding Inputs.
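
Because the state outputs carry long generated names here, it can help to map them back to logical names before feeding them into the next chunk. The sketch below is one way to do that with plain dictionaries. The helpers `rename_outputs` and `next_state_inputs` are hypothetical, and the inference runtime is assumed to return outputs as a dict keyed by output name.

```python
# Output names as listed above, mapped to their logical names.
STATE_OUTPUT_NAMES = {
    "cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.2": "new_state_c",
    "cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/BlockLSTM/TensorIterator.1": "new_state_h",
}


def rename_outputs(raw_outputs):
    """Return a copy of raw_outputs with the state outputs renamed.

    Other outputs, such as logits, keep their original names.
    """
    return {STATE_OUTPUT_NAMES.get(name, name): value
            for name, value in raw_outputs.items()}


def next_state_inputs(raw_outputs):
    """Build the state inputs for the next chunk from this chunk's outputs."""
    named = rename_outputs(raw_outputs)
    return {
        "previous_state_c": named["new_state_c"],
        "previous_state_h": named["new_state_h"],
    }
```
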
You can download models and, if necessary, convert them into OpenVINO™ IR format using the Model Downloader and other automation tools, as shown in the examples below.
An example of using the Model Downloader:
omz_downloader --name <model_name>
An example of using the Model Converter:
omz_converter --name <model_name>
The model can be used in the following demos provided by the Open Model Zoo to show its capabilities:
The original model is distributed under the Mozilla Public License, Version 2.0.
A copy of the license is provided in `<omz_dir>/models/public/licenses/MPL-2.0-Mozilla-Deepspeech.txt`.