Generalized Speech Classifier with ASR backend #3862
Replies: 1 comment 2 replies
-
You are right that MatchboxNet needs to be retrained again and again for new classes; that is how all deep learning classification models work. They cannot add new classes without retraining.

ASR is a distinctly more challenging task: alignment of one sequence to another (in practice, monotonic alignment learning with alignment-free losses such as CTC or RNNT). The alignment learning task is significantly harder than classification, and therefore requires significantly more compute and data to achieve good, robust results. On the flip side, it does not have the limitations of classical classification models: once trained to recognize a vocabulary, as long as the model correctly predicts the entire keyword, you can use it for essentially any keyword without retraining.

Of course, there is the issue of Out-of-Vocabulary words. Words that the model has never heard during ASR training will almost never be transcribed perfectly (try "Hey Siri", "Alexa", or "OK Google" on any model trained only on academic datasets). Since such keywords never occur in the training data, the model will often fail to predict them exactly. The roughly 70% accuracy you observe for such keywords is a combination of these words being rarely seen in the training set and the ASR task requiring an entirely different architecture with much more representational power than simple speech classifiers (compare MatchboxNet with 70,000 parameters vs. QuartzNet with 18M parameters).

For efficiency's sake alone (compute resources, dataset size, and model size during training), retraining your classifier each time by freezing the encoder and fine-tuning just the decoder is extremely cheap for MatchboxNet compared to QuartzNet or more powerful models.
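For reference, a minimal sketch of that encoder-freeze recipe in NeMo might look like the following; the checkpoint name and label list are illustrative assumptions, and the data configs and trainer setup are omitted:

```python
# Sketch: add new commands to MatchboxNet by freezing the encoder and
# fine-tuning only the small classification decoder.
# The checkpoint name and label list are illustrative assumptions.
import nemo.collections.asr as nemo_asr

# Load a pretrained speech-commands classifier (name is illustrative)
model = nemo_asr.models.EncDecClassificationModel.from_pretrained(
    model_name="commandrecognition_en_matchboxnet3x1x64_v1"
)

# Re-size the decoder head for the new label set (old + new commands)
new_labels = ["yes", "no", "stop", "go", "my_new_command"]
model.change_labels(new_labels)

# Freeze the convolutional encoder so only the decoder's few
# parameters are updated during fine-tuning
model.encoder.freeze()

# From here, attach train/validation data configs and fit with a
# PyTorch Lightning trainer as in the usual NeMo fine-tuning flow.
```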
-
Hi All,
So, I want to build a speech commands classifier for a certain set of words (think of the Speech Commands dataset).
However, I want to keep the flexibility of adding new commands/words without having to retrain the model.
Please correct me if I am wrong in stating that a MatchboxNet classifier would need to be retrained every time I add a new word to my "vocabulary".
I was wondering whether it would be possible to use pretrained phoneme/character-based ASR models like QuartzNet, but restrict their predictions to single-word commands during beam search and language model decoding at inference time. This would perhaps need very little tweaking of the pipeline at the decoding level to adjust to new words. I tried the pretrained QuartzNet as-is on the Speech Commands dataset and found the accuracy to be 69%.
Is there a convenient way to tweak the decoding so that it only predicts words from a small predefined list (returning out-of-vocab if the final prediction does not match any word in the list)?
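A minimal sketch of a post-hoc version of this idea, assuming NeMo's stock QuartzNet15x5Base-En checkpoint: transcribe normally, then snap the hypothesis to the nearest in-list command. The command list, the fuzzy-match cutoff, and the `classify` helper are illustrative assumptions, not an existing API:

```python
# Sketch: transcribe with a pretrained ASR model, then snap the hypothesis
# to the nearest command in a small list, falling back to "out-of-vocab".
# COMMANDS, cutoff, and classify() are illustrative assumptions.
import difflib

import nemo.collections.asr as nemo_asr

COMMANDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

# QuartzNet15x5Base-En is a standard pretrained NeMo CTC checkpoint
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

def classify(wav_path: str, cutoff: float = 0.7) -> str:
    # transcribe() takes a list of audio paths; depending on the NeMo
    # version it returns plain strings or hypothesis objects with .text
    hyp = asr_model.transcribe([wav_path])[0]
    text = (hyp if isinstance(hyp, str) else hyp.text).strip().lower()
    # Snap to the closest in-list command by string similarity, else OOV
    match = difflib.get_close_matches(text, COMMANDS, n=1, cutoff=cutoff)
    return match[0] if match else "out-of-vocab"

print(classify("sample.wav"))  # e.g. "stop" or "out-of-vocab"
```

With this scheme, adding a new command only means appending to `COMMANDS`, with no retraining, though accuracy on keywords the ASR model never saw in training will be limited for the reasons described in the reply above.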