Pre-processings to reduce hallucinations from noisy audio #2378

jhj0517 · 2024-10-07T10:52:51Z

Hi everyone. Thanks for all the great work on this really cool open source project.
Here's my experience with reducing hallucinations from noisy audio.

This is a sample that reproduces the hallucinations:

https://www.youtube.com/watch?v=Eek0cOjLrV0

The sample has a long stretch of noise from 0:03 to 1:05.
My main goal was to keep this noisy part away from Whisper as much as possible, so that it would produce fewer hallucinations.

This is a human-made transcription with no errors:

First Flame of Love Eviction Ceremony.

Love.

It's why we're all here.

But tonight,

one of you will have to let go of love's warm bosom 

and cleave to rejection's cold shoulders.

Welcome to the first 

Flame-

Coming up next on Joe Schmoe 2,

the most shocking eviction yet.

Welcome to the-

Below is the transcription from Whisper large-v2. I used large-v2 because I got much worse results with large-v3:

first Flame of Love Eviction Ceremony.

♪♪

♪♪

♪♪

♪♪

♪♪

♪♪

♪♪

♪♪

Coming up next on Joe Schmoe II,

the most shocking eviction yet.

Welcome to the...

For the hyperparameters, beam_size is 5 and everything else is left at its default. Each line break marks a separate segment detected by large-v2.

Whisper transcribed the long noisy part (0:03 ~ 1:05) as ♪♪.
The problem is that large-v2 also transcribed some of the actual speech as ♪♪.

To reduce such hallucinations I used two things: Silero VAD to detect voice, and the MDX-Net model from UVR (Ultimate Vocal Remover) to remove the noise itself from the audio.

This is the result with VAD-only:

Welcome to the first Flame of Love Eviction Ceremony.

But tonight, one of you will have to let go of love's warm bosom

and cleave to rejection's cold shoulder.

Welcome to the first Flame...

Coming up next on Joe Schmoe II,

the most shocking eviction yet.

VAD successfully skipped the long noisy part (0:03 ~ 1:05), but it also skipped some speech (the "Love. It's why we're all here" part). This is because Silero VAD was itself thrown off by the long noise in the audio. I tried several other VAD parameter settings but couldn't get a better result. In my experience, tweaking hyperparameters often leads to unexpected hallucinations, so I prefer pre-processing where possible.
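
For readers who want to try the VAD-only step outside the WebUI, the cutting part can be sketched roughly as below. It assumes the speech timestamps arrive as a list of dicts with 'start' and 'end' in sample indices, which is the shape Silero VAD's get_speech_timestamps returns by default. The names keep_speech and pad are hypothetical, and this is not the code Whisper-WebUI uses.

```python
from typing import Dict, List, Sequence


def keep_speech(samples: Sequence[float],
                timestamps: List[Dict[str, int]],
                pad: int = 0) -> List[float]:
    """Concatenate only the regions a VAD marked as speech.

    `timestamps` is assumed to hold sample indices, e.g. {"start": 1, "end": 3}.
    `pad` widens each region by that many samples on both sides (hypothetical knob).
    """
    kept: List[float] = []
    last_end = 0  # end of the previous kept region, so padded regions never overlap
    for ts in sorted(timestamps, key=lambda t: t["start"]):
        start = max(ts["start"] - pad, last_end)
        end = min(ts["end"] + pad, len(samples))
        if end > start:
            kept.extend(samples[start:end])
            last_end = end
    return kept


audio = list(range(10))  # stand-in for real audio samples
speech = [{"start": 1, "end": 3}, {"start": 6, "end": 8}]
print(keep_speech(audio, speech))  # [1, 2, 6, 7]
```

Timestamps from a transcription of the trimmed audio refer to the trimmed timeline, so they need mapping back if the original times matter.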

In my opinion, simply removing the noise with the MDX-Net model (or any UVR model that can separate the noise from the audio; I haven't tested them all) is the best way to reduce hallucinations in such cases.

Here's the result with MDX-Net + VAD:

Welcome to the first Flame of Love Eviction Ceremony.

Love.

It's why we're all here.

But tonight,

one of you will have to let go of love's warm bosom

and cleave to rejection's cold shoulders.

Welcome to the first

Flame-

Coming up next on Joe Schmoe 2,

the most shocking eviction yet.

Welcome to the-

It skipped the long noisy part (0:03 ~ 1:05) and didn't miss those few lines of speech either.
The only hallucination in this version is the added "Welcome to the" at the start of the very first line, which makes it the most accurate result so far.
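
To put a number on "most accurate", each output can be scored against the human transcription with word error rate (WER). A minimal sketch follows. The normalisation (lowercasing, dropping punctuation and symbols such as ♪) and the function names are illustrative choices, not something used in the tests above.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and keep only runs of word characters and apostrophes."""
    return re.findall(r"[\w']+", text.lower())


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# The extra "Welcome to the" counts as three inserted words against six reference words.
print(wer("First Flame of Love Eviction Ceremony.",
          "Welcome to the first Flame of Love Eviction Ceremony."))  # 0.5
```

With this normalisation, "Schmoe 2" versus "Schmoe II" counts as one substitution, so a stricter comparison would need to map numerals first.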

UVR models need a GPU to run at a reasonable speed (about 8 GB of VRAM in my test) and are not as lightweight as Silero VAD, which is very fast on CPU (according to its README, one audio chunk of 30+ ms takes less than 1 ms to process on a single CPU thread), so adding this pre-processing pipeline might feel like a hassle. But it has given me the best result so far.

If you want to try these opt-in pre-processing steps with Whisper, they are available in Whisper-WebUI.

eirikraha · 2024-10-28T11:42:58Z

Did you use the GUI for MDX-Net or were you able to just refer to that specific model in your code?

jhj0517 Oct 28, 2024
Author

I used ultimatevocalremover_api for that.

You can see how I used it in Whisper-WebUI here: https://github.com/jhj0517/Whisper-WebUI/blob/master/modules/uvr/music_separator.py

eirikraha Oct 28, 2024

Thank you!

Ko4ka · 2024-11-01T16:31:17Z

I have noticed that compression_ratio_threshold doesn't work as it should (it should filter out segments with unreasonably high values):

result_speaker_0 = model.transcribe(
    channel_0_path,
    language="ru",
    # Prompt, translated: "A call to the company; this is a property developer's
    # call center; the conversation is handled by an employee named Olga"
    initial_prompt='Звонок в компанию, это колл центр застройщика, разговор ведет сотрудник Ольга',
    temperature=(0.0, 0.1),
    logprob_threshold=-0.6,
    no_speech_threshold=0.0,
    compression_ratio_threshold=2.1,  # LOL Doesn't work
    condition_on_previous_text=True,
    word_timestamps=True,
    hallucination_silence_threshold=1,
)

This transcribe call should have filtered out everything above 2.1, but when I run

print("Transcription for Speaker 0:")
for segment in result_speaker_0["segments"]:
    print(f"{segment['start']}s - {segment['end']}s: {segment['text']} {segment['compression_ratio']}")

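I get some crazy values (the last number on each line is the compression ratio; "Двадцать минут." means "Twenty minutes."):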

379.44s - 379.94s:  Двадцать минут. 14.170212765957446
379.94s - 379.96s:  Двадцать минут. 14.170212765957446
379.96s - 379.96s:  14.170212765957446
379.96s - 379.96s:  14.170212765957446
379.96s - 380.04s:  Двадцать минут. 14.170212765957446
398.18s - 399.02s:  Двадцать минут. 11.577777777777778
399.42s - 399.42s:  11.577777777777778
403.88s - 404.92s:  Двадцать минут. 11.577777777777778
404.92s - 406.26s:  Двадцать минут. 11.577777777777778
406.26s - 406.42s:  Двадцать минут. 11.577777777777778
406.42s - 406.9s:  Двадцать минут. 11.577777777777778
406.9s - 407.64s:  Двадцать минут. 11.577777777777778
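
Some context on why looped text scores so high: as far as I can tell from openai-whisper's utils.py, the ratio is the UTF-8 byte length of the decoded text divided by its zlib-compressed length, so the more repetitive the text, the larger the value. The sketch below mirrors that formula; the sample strings are made up for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """UTF-8 byte length divided by zlib-compressed byte length."""
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))


looped = "Двадцать минут. " * 50          # a phrase stuck in a loop (made-up sample)
normal = "Love. It's why we're all here."  # an ordinary short sentence

print(round(compression_ratio(looped), 2))  # far above the 2.1 threshold
print(round(compression_ratio(normal), 2))  # well below it
```

The value appears to be computed once per decoded window and copied onto every segment cut from that window, which would explain why the same number repeats across neighbouring lines in the listing.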

I recommend filtering your segments manually after transcribing; anything above ~2-2.2 is likely a hallucination.
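
A minimal sketch of such a manual filter, assuming the segments are dicts carrying text, start, end and compression_ratio as in the printout earlier in this comment. The helper name and the extra check for empty or zero-length segments are illustrative additions. One likely reason the high values get through: according to the openai-whisper docstrings, compression_ratio_threshold marks a decode as failed and triggers a retry at the next temperature rather than dropping segments, so once the temperatures run out the last result seems to be kept as-is.

```python
from typing import Any, Dict, List


def drop_suspect_segments(segments: List[Dict[str, Any]],
                          max_ratio: float = 2.2) -> List[Dict[str, Any]]:
    """Keep only segments that look like real speech (hypothetical helper)."""
    kept = []
    for seg in segments:
        if seg["compression_ratio"] > max_ratio:
            continue  # highly repetitive text, likely a hallucination
        if not seg["text"].strip() or seg["end"] <= seg["start"]:
            continue  # empty or zero-length segment
        kept.append(seg)
    return kept


# Hand-written sample shaped like the printed segments; not real output.
sample = [
    {"start": 10.0, "end": 12.5, "text": " Добрый день.", "compression_ratio": 1.4},
    {"start": 379.44, "end": 379.94, "text": " Двадцать минут.", "compression_ratio": 14.17},
    {"start": 379.96, "end": 379.96, "text": "", "compression_ratio": 1.4},
]
clean = drop_suspect_segments(sample)
print(len(clean))  # 1
```

On a real result this would be called as drop_suspect_segments(result_speaker_0["segments"]).
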
I hope this helps :)

0 replies

montvid · 2024-11-11T10:54:16Z

For me, OpenAI's Whisper on its own is simply unusable: even on a clean 50-minute monologue I get missing text. All the smaller models, including turbo (but not large-v2/v3), make English word recognition errors, so they are unusable too. Thanks for https://github.com/jhj0517/Whisper-WebUI: using ONLY Silero VAD, I get perfect speech-to-text on my 50-minute monologue with no cutouts.
