-
A follow-up on this proposal: AssemblyAI has announced that they have adversarially benchmarked their new speech-to-text model, Universal-1, to measure hallucination.
-
This is a good metric. Personally, I have seen AssemblyAI produce more accurate transcriptions than the Whisper and Deepgram models.
-
Judging by the tone of early comments about the large-v3 model, it has an improved error rate on the standard benchmarks (Common Voice and FLEURS), but for many practical purposes it may be less useful than large-v2 because of worse hallucinations and missing punctuation. That is based on early anecdotal usage, at any rate.
Various workarounds have been necessary going back to previous models as well, so this problem isn't entirely new; a sketch of the typical decoding-time mitigations follows the linked issues below.
#928
#1783
#1606 (comment)
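For context, the workarounds discussed in those threads are mostly decoding-time settings. Here is a minimal sketch using the openai-whisper package; the input filename is hypothetical and the threshold values shown are just the library defaults, not tuned recommendations:

```python
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe(
    "meeting.wav",                      # hypothetical input file
    condition_on_previous_text=False,   # reduces hallucinations carried over between windows
    no_speech_threshold=0.6,            # treat segments above this no-speech probability as silence
    logprob_threshold=-1.0,             # retry/temperature-fallback on low-confidence decodes
    compression_ratio_threshold=2.4,    # flag degenerate, repetitive output
)

# Segments can also be filtered after the fact using no_speech_prob.
kept = [seg for seg in result["segments"] if seg["no_speech_prob"] < 0.6]
print("".join(seg["text"] for seg in kept))
```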
Setting aside the question of whether large-v3 is a practical improvement or not, would it make sense to create an adversarial benchmark for Whisper to evaluate models against? The idea would be to use a dataset of non-targeted adversarial sounds, i.e. non-speech audio including silence. For a perfect score, these various sounds should transcribe to
<|nospeech|>
using the standard hyperparameters, across different languages. Possibly some benchmark entries would consist of prefix speech in a selected language to condition the model, followed by non-speech, the combination of which may induce hallucination (a sketch of such a benchmark follows below). In parallel, of course, it would be nice to sanitize the training data (as suggested here, for example):
#1783 (reply in thread)
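To make the proposal concrete, here is a minimal sketch of how such a benchmark harness could look, assuming the openai-whisper package and a local directory of non-speech WAV clips (silence, noise, music); the directory name and the scoring rule "any non-empty transcript counts as a hallucination" are assumptions for illustration:

```python
from pathlib import Path

import whisper

model = whisper.load_model("large-v3")
clips = sorted(Path("adversarial_nonspeech").glob("*.wav"))  # hypothetical dataset directory

hallucinated = 0
for clip in clips:
    # Standard hyperparameters, no VAD and no tuned thresholds, so the
    # benchmark measures the model's own behaviour on non-speech audio.
    result = model.transcribe(str(clip), language="en")
    text = result["text"].strip()
    if text:  # a perfect score transcribes every clip to nothing
        hallucinated += 1
        print(f"{clip.name}: hallucinated {text!r}")

print(f"hallucination rate: {hallucinated}/{len(clips)}")
```

The prefix-speech variant could be approximated either by concatenating a short speech clip in the target language ahead of the non-speech audio, or by passing an `initial_prompt` to `transcribe()` to condition the decoder before it sees the non-speech segment.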
A few models down the road, is it possible to minimize the need for a Whisper + voice activity detection + hyperparameter tuning + post-processing approach?
AI Audio Datasets List
https://github.com/Yuan-ManX/ai-audio-datasets-list#se