
# Detection of fillers in conversational speech

## About fillers

This work is an NLP master's degree project that aims to improve ASR systems, in particular on the task of detecting fillers in spontaneous speech.

Fillers are a specific type of disfluency. Disfluencies can be defined as interruptions in the normal flow of speech that make utterances longer without adding semantic information. Fillers, in particular, are vocalizations such as “uhm”, “eh”, “ah”, etc., which fill the time that would otherwise be occupied by a word (typically during hesitations, moments of uncertainty, or attempts to hold the floor).

## Improving Automatic Speech Recognition systems

Automatic Speech Recognition (ASR) systems are generally trained on read speech data, which makes them weak at recognizing disfluencies, since read speech is generally free of them. Training ASR systems on spontaneous speech data would be the best way to overcome this problem, but such data is expensive and laborious to obtain. Our approach is to implement an event-detection sequence tagger, a type of neural network (NN) that requires less data to train. The NN tags each frame of a sequence with one of the labels silence, speech, or filler. By combining this NN with a classical ASR architecture, a stronger ASR system able to recognize natural spontaneous speech becomes possible.
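
To make the idea of a frame-level sequence tagger more concrete, below is a minimal sketch in PyTorch. The architecture (a small bidirectional LSTM), the layer sizes, and the class name `FrameTagger` are illustrative assumptions for the example, not the exact model used in this project.

```python
import torch
import torch.nn as nn

# Illustrative frame-level sequence tagger: a small BiLSTM that maps each
# frame's acoustic features to one of three labels (silence, speech, filler).
# Dimensions and layer choices are placeholders, not the project's actual model.
class FrameTagger(nn.Module):
    def __init__(self, n_features=13, hidden_size=64, n_labels=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, n_labels)

    def forward(self, x):
        # x: (batch, n_frames, n_features) -> (batch, n_frames, n_labels)
        out, _ = self.lstm(x)
        return self.classifier(out)

# Example: tag a batch of 2 utterances, 100 frames each, 13 features per frame.
model = FrameTagger()
frames = torch.randn(2, 100, 13)
logits = model(frames)                  # (2, 100, 3)
predictions = logits.argmax(dim=-1)     # 0 = silence, 1 = speech, 2 = filler
```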

## Approach

Using the CallHome corpus, we performed forced alignment between transcriptions and audio files with the Montreal Forced Aligner. This alignment allowed us to automatically assign _silence_, _speech_, or _filler_ to each frame of the audio files. Finally, we fed the sequence-tagging neural network with the sequences of Mel-Frequency Cepstral Coefficients (MFCCs) extracted over these frames.
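
As an illustration of the feature-extraction and frame-labelling step, the sketch below uses `torchaudio` to compute MFCCs and assigns a label to each frame from alignment intervals. The hop length, number of coefficients, the `(start, end, label)` interval format, and the file name are assumptions made for the example, not the exact settings used in the project.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000
HOP_LENGTH = 160  # 10 ms hop at 16 kHz (assumed values for the sketch)

# MFCC extractor over short frames of the signal.
mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": HOP_LENGTH},
)

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical mono file
mfcc = mfcc_transform(waveform)          # (1, 13, n_frames)
mfcc = mfcc.squeeze(0).transpose(0, 1)   # (n_frames, 13)

# Intervals obtained from the forced alignment (e.g. parsed from a TextGrid),
# expressed here as (start_seconds, end_seconds, label) tuples.
intervals = [(0.0, 0.4, "silence"), (0.4, 1.2, "speech"), (1.2, 1.5, "filler")]
label_ids = {"silence": 0, "speech": 1, "filler": 2}

# Assign a label id to every MFCC frame covered by an interval.
labels = torch.zeros(mfcc.shape[0], dtype=torch.long)
for start, end, label in intervals:
    first = int(start * SAMPLE_RATE / HOP_LENGTH)
    last = int(end * SAMPLE_RATE / HOP_LENGTH)
    labels[first:last] = label_ids[label]
```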

## Notebook

To run the notebook, you will need to install the following Python libraries beforehand: Pydub, soundfile, textgrids, pandas, torch, torchaudio, and sklearn. In addition, the following software is needed: Montreal Forced Aligner (for the forced alignment process) and SoX (for the stereo-to-mono conversion).
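
The project uses SoX for the stereo-to-mono conversion; purely as an illustration of that preprocessing step, the same conversion could be sketched with Pydub (already among the required libraries). The file names below are placeholders.

```python
from pydub import AudioSegment

# Load a stereo recording, downmix it to a single channel, and save it.
stereo = AudioSegment.from_wav("recording_stereo.wav")
mono = stereo.set_channels(1)
mono.export("recording_mono.wav", format="wav")
```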