Releases: mozilla/DeepSpeech

v0.8.0-alpha.2

27 May 19:16
Pre-release
Bump VERSION to 0.8.0-alpha.2

v0.8.0-alpha.1

26 May 20:30
Pre-release
Retry tag group due to too many reruns

v0.8.0-alpha.0

26 May 13:03
d141943
Pre-release
Merge pull request #3015 from mozilla/alpha-0.8a0

Bump VERSION to 0.8.0-alpha.0

DeepSpeech 0.7.1

12 May 15:31
2e9c281

General

This is the 0.7.1 release of DeepSpeech, an open speech-to-text engine. In accordance with semantic versioning, this version is not backwards compatible with version 0.6.1 or earlier. This is a bugfix release that retains compatibility with the 0.7.0 models; all model files included here are identical to the ones in the 0.7.0 release. As with previous releases, this release includes the source code:

v0.7.1.tar.gz

and the acoustic models:

deepspeech-0.7.1-models.pbmm
deepspeech-0.7.1-models.tflite

The model with the ".pbmm" extension is memory-mapped and thus memory-efficient and fast to load. The model with the ".tflite" extension is converted to use TFLite, has post-training quantization enabled, and is more suitable for resource-constrained environments.

The acoustic models were trained on American English, and the ".pbmm" model achieves a 5.97% word error rate on the LibriSpeech clean test corpus.
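Word error rate is the standard ASR metric: the word-level edit distance between the recognized transcript and the reference, divided by the number of reference words. A minimal sketch of the computation (plain Python, not DeepSpeech's own evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)
```

So a 5.97% WER means roughly 6 word-level errors (substitutions, insertions, or deletions) per 100 reference words.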

In addition we release the scorer:

deepspeech-0.7.1-models.scorer

which takes the place of the language model and trie in older releases.

We also include example audio files:

audio-0.7.1.tar.gz

which can be used to test the engine, and checkpoint files:

deepspeech-0.7.1-checkpoint.tar.gz

which can be used as the basis for further fine-tuning.
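The checkpoint can be used to continue training on your own data. A minimal sketch of such a fine-tuning run, assuming the training code from the v0.7.1 source tarball; the CSV paths are placeholders for your own corpus, and the flag values here are illustrative, not a recommendation:

```shell
# Unpack the released checkpoint, then resume training from it.
tar xvf deepspeech-0.7.1-checkpoint.tar.gz
python3 DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir deepspeech-0.7.1-checkpoint \
  --train_files my-train.csv \
  --dev_files my-dev.csv \
  --test_files my-test.csv \
  --learning_rate 0.00001 \
  --dropout_rate 0.40 \
  --epochs 3
```

Note that --n_hidden must match the released model (2048), or the checkpoint will not load.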

Notable changes from the previous release

  • Moved all usage documentation to deepspeech.readthedocs.io, where it is properly versioned and, by default, redirects readers to the latest stable release. (#2949)
  • Statically link libsox on macOS native client so users don't have to install sox. (#2951)
  • Fix a bug where JavaScript binding was not returning the Stream wrapper in Model.createStream. (#2957)
  • Fix a bug where DS_EnableExternalScorer (and equivalent bindings) did not properly handle errors and left a partially initialized Scorer in place. (#2970)
  • Fix a bug in the Python client when overriding the default beam width with the --beam_width flag. (#2976)
  • Fix a bug in the JavaScript binding where strict mode was violated in Stream.finishStreamWithMetadata. (#2980 / #2981).

Training Regimen + Hyperparameters for fine-tuning

The hyperparameters used to train the model are useful for fine-tuning. Thus, we document them here along with the training regimen, the hardware used (a server with 8 Quadro RTX 6000 GPUs, each with 24GB of VRAM), and our use of cuDNN RNN.

In contrast to previous releases, training for this release occurred in several phases, each with a lower learning rate than the one before it.

The initial phase used the hyperparameters:

  • train_files Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora.
  • dev_files LibriSpeech clean dev corpus.
  • test_files LibriSpeech clean test corpus.
  • train_batch_size 128
  • dev_batch_size 128
  • test_batch_size 128
  • n_hidden 2048
  • learning_rate 0.0001
  • dropout_rate 0.40
  • epochs 125

The weights with the best validation loss were selected at the end of 125 epochs using --noearly_stop.

The second phase was started using the weights with the best validation loss from the previous phase. This second phase used the same hyperparameters as the first but with the following changes:

  • learning_rate 0.00001
  • epochs 100

The weights with the best validation loss were selected at the end of 100 epochs using --noearly_stop.

Like the second, the third phase was started using the weights with the best validation loss from the previous phase. This third phase used the same hyperparameters as the second but with the following changes:

  • learning_rate 0.000005

The weights with the best validation loss were selected at the end of 100 epochs using --noearly_stop. The model selected under this process was trained for a total of 732522 steps over all phases.
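The phased regimen amounts to a simple annealing schedule. The sketch below (plain Python, not DeepSpeech code) just encodes the numbers from the text and checks the schedule's defining property, that each phase lowers the learning rate:

```python
# Each phase resumes from the previous phase's best-validation weights.
phases = [
    {"learning_rate": 0.0001,   "epochs": 125},
    {"learning_rate": 0.00001,  "epochs": 100},
    {"learning_rate": 0.000005, "epochs": 100},
]

def is_annealed(schedule):
    """True if every phase uses a strictly lower learning rate than the last."""
    rates = [p["learning_rate"] for p in schedule]
    return all(a > b for a, b in zip(rates, rates[1:]))
```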

Subsequent to this, lm_optimizer.py was used with the following parameters:

  • lm_alpha_max 5
  • lm_beta_max 5
  • n_trials 2400
  • test_files LibriSpeech clean dev corpus.

to determine the optimal lm_alpha and lm_beta with respect to the LibriSpeech clean dev corpus. This resulted in:

  • lm_alpha 0.931289039105002
  • lm_beta 1.1834137581510284
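lm_optimizer.py performs a hyperparameter search over lm_alpha and lm_beta. As a generic illustration of the idea (not the actual tool), a random search over the same ranges might look like the following, with a mock objective standing in for decoding the dev set and measuring WER:

```python
import random

def search_scorer_weights(evaluate_wer, alpha_max=5.0, beta_max=5.0,
                          n_trials=100, seed=0):
    """Random search for the (lm_alpha, lm_beta) pair minimizing WER.

    evaluate_wer(alpha, beta) stands in for decoding the dev corpus with
    the given scorer weights and returning the resulting word error rate.
    """
    rng = random.Random(seed)
    best = (float("inf"), None, None)
    for _ in range(n_trials):
        alpha = rng.uniform(0.0, alpha_max)
        beta = rng.uniform(0.0, beta_max)
        wer = evaluate_wer(alpha, beta)
        if wer < best[0]:
            best = (wer, alpha, beta)
    return best

# Mock objective with a known optimum near (0.93, 1.18).
mock = lambda a, b: (a - 0.93) ** 2 + (b - 1.18) ** 2
```

In the real run, each trial requires a full decode of the dev corpus, which is why the 2400-trial search is done once per release rather than by end users.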

Bindings

This release also includes a Python based command line tool deepspeech, installed through

pip install deepspeech

Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

pip install deepspeech-gpu

On Linux, macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:

pip install deepspeech-tflite

The release also exposes bindings for the following languages:

  • Python (Versions 3.5, 3.6, 3.7 and 3.8) installed via

    pip install deepspeech

    Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

    pip install deepspeech-gpu

    On Linux (AMD64), macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:

    pip install deepspeech-tflite
  • NodeJS (Versions 10.x, 11.x, 12.x, and 13.x) installed via

    npm install deepspeech
    

    Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

    npm install deepspeech-gpu
    

    On Linux (AMD64), macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:

    npm install deepspeech-tflite
  • ElectronJS versions 5.0, 6.0, 6.1, 7.0, 7.1, and 8.0 are also supported

  • C, which requires that the appropriate shared objects be installed from native_client.tar.xz (see the section in the main README that describes native_client.tar.xz installation).

  • .NET, which is installed by following the instructions on the NuGet package page.

In addition, there are third-party bindings supported by external developers, for example:

  • Rust, which is installed by following the instructions on the external Rust repo.
  • Go, which is installed by following the instructions on the external Go repo.
  • V, which is installed by following the instructions on the external Vlang repo.
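With any of the packages above installed, the simplest end-to-end check is the deepspeech command-line tool. A sketch using the 0.7.1 release artifacts, with your-audio.wav as a placeholder for a 16 kHz mono WAV file:

```shell
# Transcribe a 16 kHz mono WAV file with the released model and scorer.
deepspeech --model deepspeech-0.7.1-models.pbmm \
           --scorer deepspeech-0.7.1-models.scorer \
           --audio your-audio.wav
```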

Supported Platforms

  • Windows 8.1, 10, and Server 2012 R2, 64-bit (needs at least AVX support; requires the Visual C++ 2015 Update 3 Redistributable (64-bit) at runtime).
  • OS X 10.10, 10.11, 10.12, 10.13, 10.14, and 10.15
  • Linux x86 64-bit with a modern CPU (needs at least AVX/FMA)
  • Linux x86 64-bit with a modern CPU + NVIDIA GPU (Compute Capability at least 3.0; see NVIDIA docs)
  • Raspbian Buster on Raspberry Pi 3 and Raspberry Pi 4
  • ARM64 built against Debian/Armbian Buster and tested on LePotato boards
  • Java Android bindings / demo app. Early preview, tested only on Pixel 2 device, TF Lite model only.

Documentation

Documentation is available on deepspeech.readthedocs.io.

Contact/Getting Help

  1. FAQ - We have a list of common questions, and their answers, in our FAQ. When just getting started, it's best to first c...

v0.7.1-alpha.2

07 May 13:09
33cba89
Pre-release
Merge pull request #2983 from mozilla/alpha-071a2

Bump VERSION to 0.7.1-alpha.2

v0.7.1-alpha.1

04 May 12:05
f848bf4
Pre-release
Merge pull request #2971 from mozilla/bump-0.7.1a1

Bump VERSION to 0.7.1-alpha.1

v0.7.1-alpha.0

01 May 21:17
fdee05f
Pre-release
Merge pull request #2960 from mozilla/new-alpha-071

Bump VERSION to 0.7.1-alpha.0

DeepSpeech 0.7.0

24 Apr 16:17
3fbbca2

General

This is the 0.7.0 release of DeepSpeech, an open speech-to-text engine. In accordance with semantic versioning, this version is not backwards compatible with version 0.6.1 or earlier, so when updating you will have to update both your code and your models. As with previous releases, this release includes the source code:

v0.7.0.tar.gz

and the acoustic models:

deepspeech-0.7.0-models.pbmm
deepspeech-0.7.0-models.tflite

The model with the ".pbmm" extension is memory-mapped and thus memory-efficient and fast to load. The model with the ".tflite" extension is converted to use TFLite, has post-training quantization enabled, and is more suitable for resource-constrained environments.

The acoustic models were trained on American English, and the ".pbmm" model achieves a 5.97% word error rate on the LibriSpeech clean test corpus.

In addition we release the scorer:

deepspeech-0.7.0-models.scorer

which takes the place of the language model and trie in older releases.

We also include example audio files:

audio-0.7.0.tar.gz

which can be used to test the engine, and checkpoint files:

deepspeech-0.7.0-checkpoint.tar.gz

which can be used as the basis for further fine-tuning.

Notable changes from the previous release

  • Added Multi-stream .NET support[1].
  • Fixed the upper frequency limit when computing MFCCs[2].
  • Removed benchmark_nc as it was not used[3].
  • Added TFLite-specific NPM package[4].
  • Added TFLite NuGet package[5].
  • Added Sample DBs, a new format for training data that allows for much improved training speeds[6].
  • Re-worked the reporting of WER during model evaluation[7].
  • Fixed incorrect decoding format in .NET[8].
  • Embedded beam width in model and made the parameter optional in API[9].
  • Added support for transfer learning as described in Chapter 8 of Josh Meyer's PhD thesis[10][11][12].
  • Added support for ElectronJS v8.0[13].
  • Added optimizer to select the optimal lm_alpha + lm_beta[14][19].
  • Exposed multiple transcriptions in "WithMetadata" API[16].
  • New packaging format for external scorer (previously lm.binary and trie files)[26].
  • Exposed error codes in a human readable form[17][18].
  • Bumped dependency to TensorFlow 1.15.2[20].
  • Re-packaged training code to be installable, simplifying training setup[21][22].
  • Added recursive transcription of directories to transcribe.py[23].
  • Added support for TypeScript[24].
  • Fixed bug in computation of initial timestamp[25].
  • Moved Stream-relative functions to be methods in the Stream object in Python and JavaScript bindings[27].

Training Regimen + Hyperparameters for fine-tuning

The hyperparameters used to train the model are useful for fine-tuning. Thus, we document them here along with the training regimen, the hardware used (a server with 8 Quadro RTX 6000 GPUs, each with 24GB of VRAM), and our use of cuDNN.

In contrast to previous releases, training for this release occurred in several phases, each with a lower learning rate than the one before it.

The initial phase used the hyperparameters:

  • train_files Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora.
  • dev_files LibriSpeech clean dev corpus.
  • test_files LibriSpeech clean test corpus.
  • train_batch_size 128
  • dev_batch_size 128
  • test_batch_size 128
  • n_hidden 2048
  • learning_rate 0.0001
  • dropout_rate 0.40
  • epochs 125

The weights with the best validation loss were selected at the end of 125 epochs using --noearly_stop.

The second phase was started using the weights with the best validation loss from the previous phase. This second phase used the same hyperparameters as the first but with the following changes:

  • learning_rate 0.00001
  • epochs 100

The weights with the best validation loss were selected at the end of 100 epochs using --noearly_stop.

Like the second, the third phase was started using the weights with the best validation loss from the previous phase. This third phase used the same hyperparameters as the second but with the following changes:

  • learning_rate 0.000005

The weights with the best validation loss were selected at the end of 100 epochs using --noearly_stop. The model selected under this process was trained for a total of 732522 steps over all phases.

Subsequent to this, lm_optimizer.py was used with the following parameters:

  • lm_alpha_max 5
  • lm_beta_max 5
  • n_trials 2400
  • test_files LibriSpeech clean dev corpus.

to determine the optimal lm_alpha and lm_beta with respect to the LibriSpeech clean dev corpus. This resulted in:

  • lm_alpha 0.931289039105002
  • lm_beta 1.1834137581510284

Bindings

This release also includes a Python based command line tool deepspeech, installed through

pip install deepspeech

Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

pip install deepspeech-gpu

On Linux, macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:

pip install deepspeech-tflite

The release also exposes bindings for the following languages:

  • Python (Versions 3.5, 3.6, 3.7 and 3.8) installed via

    pip install deepspeech

    Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

    pip install deepspeech-gpu

    On Linux (AMD64), macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:

    pip install deepspeech-tflite
  • NodeJS (Versions 10.x, 11.x, 12.x, and 13.x) installed via

    npm install deepspeech
    

    Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPUs are supported.) This is done by instead installing the GPU-specific package:

    npm install deepspeech-gpu
    

    On Linux (AMD64), macOS and Windows, the DeepSpeech package does not use TFLite by default. A TFLite version of the package on those platforms is available as:

    npm install deepspeech-tflite
  • ElectronJS versions 5.0, 6.0, 6.1, 7.0, 7.1, and 8.0 are also supported

  • C, which requires that the appropriate shared objects be installed from native_client.tar.xz (see the section in the main README that describes native_client.tar.xz installation).

  • .NET, which is installed by following the instructions on the NuGet package page.

In addition there are third party bindings tha...


v0.7.0-alpha.4

24 Apr 13:10
Pre-release
Merge branch 'new-alpha' (Fixes #2938)

v0.7.0-alpha.3

25 Mar 14:27
58bc2f2
Pre-release
Merge pull request #2853 from lissyx/bump-v0.7.0-alpha.3

Bump VERSION to 0.7.0-alpha.3