Deep Speech 0.5.0
General
This is the 0.5.0 release of Deep Speech, an open speech-to-text engine. The release includes source code and a trained model, `deepspeech-0.5.0-models.tar.gz`, trained on American English, which achieves an 8.22% word error rate on the LibriSpeech clean test corpus. Models with a `.pbmm` extension are memory-mapped, making them much more memory efficient and faster to load. Models with a `.tflite` extension have been converted for use with TFLite, have post-training quantization enabled, and are better suited to resource-constrained environments.
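For reference, word error rate is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. A minimal, self-contained sketch of the metric (not the engine's own scoring code):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # 1 substitution over 6 words
```

An 8.22% WER thus means roughly 8 word-level errors (substitutions, insertions, or deletions) per 100 reference words.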
We also include example audio files, which can be used to test the engine, and checkpoint files, `deepspeech-0.5.0-checkpoint.tar.gz`, which can be used as the basis for further fine-tuning.
Notable changes from the previous release
- Python 2.7 is no longer supported for training
- Update decoder parameter names in native client
- Proper DeepSpeech error codes
- Produce Maven Bundle and upload to bintray
- Add TFLite accuracy estimation tool
- Fix invalid characters in (Windows) speech to text result
- Added importer for Common Voice v2 corpora
- Added alphabet generation utility
- Enabled TFLite post-training quantization
- Enabled official Windows support
- Added simple nodejs example
- Improved Nodejs streaming inference with VAD and FFmpeg
- Implement input pipeline with tf.data API
- Use tf.lite.TFLiteConverter to create tflite model
- Add Windows NuGet upload of Deep Speech
- Update to TensorFlow v1.13
- Add NET Framework targets to NuGet package
- Added Windows support for npm package
- Added output of word timings
- Removed distributed training support
- Add Mozilla Code of Conduct
- Exposed letter and word timing information
- Windows Python bindings
- Embed/read more metadata in exported model
- Exposed transcription probability information
- Add Windows Python packages to PyPI upload tasks
- Expose extended metadata information to bindings
- Build for ElectronJS
- Perform separate validation and test epochs per dataset when multiple files are specified
- Added LinguaLibre importer
- Add NodeJS v12 support
- Added AISHELL dataset importer
- Unmangled exported symbols
- Moved to SWIG 4.0.0
- Enhanced CTC decoder to stream along with the RNN
- Use separate execution plans for acoustic model and feature computation
Hyperparameters for fine-tuning
The hyperparameters used to train the model are useful for fine-tuning, so we document them here along with the hardware used: a server with 8 Titan X Pascal GPUs (12GB of VRAM).
- `train_files`: Fisher, LibriSpeech, and Switchboard training corpora
- `dev_files`: LibriSpeech clean dev corpora
- `test_files`: LibriSpeech clean test corpus
- `train_batch_size`: 24
- `dev_batch_size`: 48
- `test_batch_size`: 48
- `n_hidden`: 2048
- `learning_rate`: 0.0001
- `dropout_rate`: 0.15
- `epoch`: 75
- `lm_alpha`: 0.75
- `lm_beta`: 1.85
The weights with the best validation loss were selected at the end of the 75 epochs using `--noearly_stop`. The selected model was trained for 467356 steps.
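Assembled as command-line flags for the training script, the settings above would look roughly like the sketch below. The `DeepSpeech.py` script name and the exact `--name value` flag syntax are assumptions to verify against the repository; the parameter names and values are the ones documented above.

```python
# Hyperparameters as documented above; flag names mirror the parameter names.
hyperparameters = {
    "train_batch_size": 24,
    "dev_batch_size": 48,
    "test_batch_size": 48,
    "n_hidden": 2048,
    "learning_rate": 0.0001,
    "dropout_rate": 0.15,
    "epoch": 75,
    "lm_alpha": 0.75,
    "lm_beta": 1.85,
}

def to_flags(params):
    """Render a parameter dict as --name value command-line flags."""
    return " ".join(f"--{name} {value}" for name, value in params.items())

# --noearly_stop is the flag mentioned above; the script name is an assumption.
command = "python DeepSpeech.py --noearly_stop " + to_flags(hyperparameters)
print(command)
```

The corpus flags (`train_files`, `dev_files`, `test_files`) are omitted here since their paths are installation-specific.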
Bindings
This release also includes a Python-based command-line tool, `deepspeech`, installed through

pip install deepspeech

Alternatively, quicker inference can be performed on Linux using a supported NVIDIA GPU (see below for which GPUs are supported) by instead installing the GPU-specific package:

pip install deepspeech-gpu
It also exposes bindings for the following languages:
- Python (versions 3.4, 3.5, 3.6 and 3.7), installed via `pip install deepspeech`. Alternatively, quicker inference can be performed on Linux using a supported NVIDIA GPU (see below for which GPUs are supported) by instead installing the GPU-specific package: `pip install deepspeech-gpu`
- NodeJS (versions 4.x, 5.x, 6.x, 7.x, 8.x, 9.x, 10.x, 11.x and 12.x), installed via `npm install deepspeech`. Alternatively, quicker inference can be performed on Linux using a supported NVIDIA GPU (see below for which GPUs are supported) by instead installing the GPU-specific package: `npm install deepspeech-gpu`
- ElectronJS versions 3.1, 4.0, 4.1 and 5.0 are also supported
- C++, which requires that the appropriate shared objects are installed from `native_client.tar.xz` (see the section in the main README which describes `native_client.tar.xz` installation)
- .NET, which is installed by following the instructions on the NuGet package page
In addition, there are third-party bindings supported by external developers, for example:
- Rust which is installed by following the instructions on the external Rust repo.
- Go which is installed by following the instructions on the external Go repo.
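Whichever binding is used, the engine consumes buffers of 16-bit PCM samples; the mono/16 kHz assumption below should be checked against the main README. A minimal sketch of preparing audio with only the Python standard library (the actual speech-to-text call is omitted, since the binding API is documented elsewhere):

```python
import array
import wave

def load_pcm16(path):
    """Load a mono 16-bit WAV file into a buffer of 16-bit samples."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() != 1:
            raise ValueError("expected mono 16-bit PCM audio")
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    return samples, rate
```

The resulting sample buffer is what you would hand to the binding's speech-to-text call, after resampling to the model's expected rate if necessary.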
Supported Platforms
- OS X 10.10, 10.11, 10.12, 10.13 and 10.14
- Linux x86 64-bit with a modern CPU (needs at least AVX/FMA)
- Linux x86 64-bit with a modern CPU + NVIDIA GPU (Compute Capability at least 3.0, see NVIDIA docs)
- Raspbian Stretch on Raspberry Pi 3
- ARM64 built against Debian/ARMbian Stretch and tested on LePotato boards
- Java Android bindings / demo app. Early preview, tested only on Pixel 2 device, TF Lite model only
Known Issues
- Feature caching speeds training but increases memory usage
- Current v2 trie handling still triggers ~600MB of memory usage
- Code is not yet thread-safe; having multiple concurrent streams tied to the same model leads to bad transcriptions
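Until the thread-safety issue is resolved, one workaround is to serialize every call that touches a shared model instance behind a single lock. A generic sketch of the pattern, where `run_inference` is a stand-in for the real binding call:

```python
import threading

model_lock = threading.Lock()  # guards the single shared model instance
results = []

def run_inference(chunk):
    # Stand-in for the real (not thread-safe) speech-to-text call.
    return f"transcript-{chunk}"

def safe_transcribe(chunk):
    # Only one thread at a time may drive the shared model.
    with model_lock:
        results.append(run_inference(chunk))

threads = [threading.Thread(target=safe_transcribe, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Alternatively, give each thread its own model instance, at the cost of the extra memory per instance.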
Contact/Getting Help
- FAQ - We have a list of common questions, and their answers, in our FAQ. When just getting started, it's best to first check the FAQ to see if your question is addressed.
- Discourse Forums - If your question is not addressed in the FAQ, the Discourse Forums is the next place to look. They contain conversations on General Topics, Using Deep Speech, Alternative Platforms, and Deep Speech Development.
- IRC - If your question is not addressed by either the FAQ or Discourse Forums, you can contact us on the `#machinelearning` channel on Mozilla IRC; people there can try to answer/help
- Issues - Finally, if all else fails, you can open an issue in our repo if there is a bug with the current code base.