-
A follow-up on this proposal: AssemblyAI has announced that they have adversarially benchmarked their new speech-to-text model, Universal-1, to measure hallucination.
-
This is a good metric. Personally, I have seen AssemblyAI produce more accurate transcriptions than the Whisper and Deepgram models.
-
Judging by the tone of early comments about the large-v3 model, it has an improved error rate on the standard benchmarks (Common Voice and FLEURS), but for many practical purposes it may be less useful than large-v2 because of worse hallucinations and missing punctuation. That is based on early anecdotal usage, at any rate.
Various workarounds have been necessary going back to previous models as well, so this problem isn't entirely new; a sketch of the typical decoding-time mitigations follows the linked issues below.
#928
#1783
#1606 (comment)
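For context, the workarounds discussed in those threads are mostly decoding-time settings. Here is a minimal sketch using the openai-whisper package; the input filename is hypothetical and the threshold values shown are just the library defaults, not tuned recommendations:

```python
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(
    "meeting.wav",                      # hypothetical input file
    condition_on_previous_text=False,   # reduces hallucinations carried over between windows
    no_speech_threshold=0.6,            # treat segments above this no-speech probability as silence
    logprob_threshold=-1.0,             # retry/temperature-fallback on low-confidence decodes
    compression_ratio_threshold=2.4,    # flag degenerate, repetitive output
)

# Segments can also be filtered after the fact using no_speech_prob.
kept = [seg for seg in result["segments"] if seg["no_speech_prob"] < 0.6]
print("".join(seg["text"] for seg in kept))
```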
Setting aside the question of whether large-v3 is a practical improvement or not, would it make sense to create an adversarial benchmark for Whisper to evaluate models against? The idea would be to use a dataset of non-targeted adversarial sounds, i.e. non-speech audio including silence. For a perfect score, these various sounds should transcribe to
<|nospeech|>
using the standard hyperparameters, across different languages. Possibly some benchmark entries would consist of prefix speech in a selected language to condition the model, followed by non-speech, the combination of which may induce hallucination (a sketch of such a benchmark follows below). In parallel, of course, it would be nice to sanitize the training data (as suggested here, for example):
#1783 (reply in thread)
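To make the proposal concrete, here is a minimal sketch of how such a benchmark harness could look, assuming the openai-whisper package and a local directory of non-speech WAV clips (silence, noise, music); the directory name and the scoring rule "any non-empty transcript counts as a hallucination" are assumptions for illustration:

```python
from pathlib import Path

import whisper

model = whisper.load_model("large-v3")
clips = sorted(Path("adversarial_nonspeech").glob("*.wav"))  # hypothetical dataset directory

hallucinated = 0
for clip in clips:
    # Standard hyperparameters, no VAD and no tuned thresholds, so the
    # benchmark measures the model's own behaviour on non-speech audio.
    result = model.transcribe(str(clip), language="en")
    text = result["text"].strip()
    if text:  # a perfect score transcribes every clip to nothing
        hallucinated += 1
        print(f"{clip.name}: hallucinated {text!r}")

print(f"hallucination rate: {hallucinated}/{len(clips)}")
```

The prefix-speech variant could be approximated either by concatenating a short speech clip in the target language ahead of the non-speech audio, or by passing an `initial_prompt` to `transcribe()` to condition the decoder before it sees the non-speech segment.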
A few models down the road, is it possible to minimize the need for a Whisper + voice activity detection + hyperparameter tuning + post-processing approach?
AI Audio Datasets List
https://github.com/Yuan-ManX/ai-audio-datasets-list#se