Skip to content

Files

This branch is 2 commits ahead of, 239 commits behind facebookresearch/fairseq:main.

speech_to_text

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
Nov 2, 2021
Apr 2, 2021
Feb 19, 2021
Dec 22, 2021
Sep 13, 2021
Sep 13, 2021
Sep 13, 2021
Dec 22, 2021
Apr 2, 2021

Speech-to-Text (S2T) Modeling

https://www.aclweb.org/anthology/2020.aacl-demo.6

Speech recognition (ASR) and speech-to-text translation (ST) with fairseq.

Data Preparation

S2T modeling data consists of source speech features, target text and other optional information (source text, speaker id, etc.). Fairseq S2T uses per-dataset-split TSV manifest files to store these information. Each data field is represented by a column in the TSV file.

Unlike text token embeddings, speech features (e.g. log mel-scale filter banks) are usually fixed during model training and can be pre-computed. The manifest file contains the path to either the feature file in NumPy format or the WAV/FLAC audio file. For the latter, features will be extracted on-the-fly by fairseq S2T. Optionally, feature/audio files can be packed into uncompressed ZIP files (then accessed via byte offset and length) to improve I/O performance.

Fairseq S2T also employs a YAML file for data related configurations: tokenizer type and dictionary path for the target text, feature transforms such as CMVN (cepstral mean and variance normalization) and SpecAugment, temperature-based resampling, etc.

Model Training

Fairseq S2T uses the unified fairseq-train interface for model training. It requires arguments --task speech_to_text, --arch <model architecture in fairseq.models.speech_to_text.*> and --config-yaml <config YAML filename>.

Inference & Evaluation

Fairseq S2T uses the unified fairseq-generate/fairseq-interactive interface for inference and evaluation. It requires arguments --task speech_to_text and --config-yaml <config YAML filename>. The interactive console takes audio paths (one per line) as inputs.

Examples

Updates

  • 02/04/2021: Added interactive decoding (fairseq-interactive) support. Examples: ASR (LibriSpeech) and ST (CoVoST 2).
  • 01/08/2021: Several fixes for S2T Transformer model, inference-time de-tokenization, scorer configuration and data preparation scripts. We also add pre-trained models to the examples and revise the instructions. Breaking changes: the data preparation scripts now extract filterbank features without CMVN. CMVN is instead applied on-the-fly (defined in the config YAML).

What's Next

Citation

Please cite as:

@inproceedings{wang2020fairseqs2t,
  title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
  author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
  booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
  year = {2020},
}

@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}