Generalized Speech Classifier with ASR backend #3862
Replies: 1 comment 2 replies
-
You are right that MatchboxNet needs to be retrained again and again for new classes; that is how all deep learning classification models work. They cannot add new classes without retraining.

ASR is a distinctly more challenging task: alignment of one sequence to another (in practice, monotonic alignment learning with alignment-free losses such as CTC or RNNT). The alignment learning task is significantly harder than classification, and therefore requires significantly more compute and data to achieve good, robust results. On the flip side, it does not have the limitations of classical classification models: once trained to recognize a vocabulary, as long as the model correctly predicts the entire keyword, you can use it for essentially any keyword without retraining.

Of course, there is the issue of Out-of-Vocabulary words. Words that the model has never heard during ASR training will almost never be transcribed perfectly (try "Hey Siri", "Alexa", or "OK Google" on any model trained only on academic datasets). Since such keywords never occur in the training data, the model will often fail to predict them exactly. The roughly 70% accuracy you observe for such keywords is a combination of these words being rarely seen in the training set and the ASR task requiring an entirely different architecture with much more representational power than simple speech classifiers (compare MatchboxNet with 70,000 parameters vs. QuartzNet with 18M parameters).

For efficiency's sake alone (compute resources, dataset size, and model size during training), retraining your classifier each time by freezing the encoder and fine-tuning just the decoder is extremely cheap for MatchboxNet compared to QuartzNet or more powerful models.
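For reference, a minimal sketch of that encoder-freeze recipe in NeMo might look like the following; the checkpoint name and label list are illustrative assumptions, and the data configs and trainer setup are omitted:

```python
# Sketch: add new commands to MatchboxNet by freezing the encoder and
# fine-tuning only the small classification decoder.
# The checkpoint name and label list are illustrative assumptions.
import nemo.collections.asr as nemo_asr

# Load a pretrained speech-commands classifier (name is illustrative)
model = nemo_asr.models.EncDecClassificationModel.from_pretrained(
    model_name="commandrecognition_en_matchboxnet3x1x64_v1"
)

# Re-size the decoder head for the new label set (old + new commands)
new_labels = ["yes", "no", "stop", "go", "my_new_command"]
model.change_labels(new_labels)

# Freeze the convolutional encoder so only the decoder's few
# parameters are updated during fine-tuning
model.encoder.freeze()

# From here, attach train/validation data configs and fit with a
# PyTorch Lightning trainer as in the usual NeMo fine-tuning flow.
```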
-
Hi All,
So, I want to build a speech commands classifier for a certain set of words (think of the Speech Commands dataset).
However, I want to keep the flexibility of adding new commands/words without having to retrain the model.
Please correct me if I am wrong in stating that a MatchboxNet classifier would need to be retrained every time I add a new word to my "vocabulary".
I was wondering whether it would be possible to use pretrained phoneme/character-based ASR models like QuartzNet, but restrict their predictions to single-word commands during beam search and language model decoding at inference time. This would perhaps need very little tweaking of the pipeline at the decoding level to adjust to new words. I tried the pretrained QuartzNet as-is on the Speech Commands dataset and found the accuracy to be 69%.
Is there a convenient way to tweak the decoding so that it only predicts words from a small predefined list (returning out-of-vocab if the final prediction does not match any word in the list)?
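A minimal sketch of a post-hoc version of this idea, assuming NeMo's stock QuartzNet15x5Base-En checkpoint: transcribe normally, then snap the hypothesis to the nearest in-list command. The command list, the fuzzy-match cutoff, and the `classify` helper are illustrative assumptions, not an existing API:

```python
# Sketch: transcribe with a pretrained ASR model, then snap the hypothesis
# to the nearest command in a small list, falling back to "out-of-vocab".
# COMMANDS, cutoff, and classify() are illustrative assumptions.
import difflib

import nemo.collections.asr as nemo_asr

COMMANDS = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

# QuartzNet15x5Base-En is a standard pretrained NeMo CTC checkpoint
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

def classify(wav_path: str, cutoff: float = 0.7) -> str:
    # transcribe() takes a list of audio paths; depending on the NeMo
    # version it returns plain strings or hypothesis objects with .text
    hyp = asr_model.transcribe([wav_path])[0]
    text = (hyp if isinstance(hyp, str) else hyp.text).strip().lower()
    # Snap to the closest in-list command by string similarity, else OOV
    match = difflib.get_close_matches(text, COMMANDS, n=1, cutoff=cutoff)
    return match[0] if match else "out-of-vocab"

print(classify("sample.wav"))  # e.g. "stop" or "out-of-vocab"
```

With this scheme, adding a new command only means appending to `COMMANDS`, with no retraining, though accuracy on keywords the ASR model never saw in training will be limited for the reasons described in the reply above.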