Whisper large producing differing outputs when .wav file is chunked #440

i-am-neo · 2022-10-30T23:53:10Z

@jongwook Thank you for open-sourcing Whisper. I'm getting odd results from Whisper when I transcribe a .wav file as is versus when it's chunked.

While I expected some "wording differences" at the end of a chunked .wav file, what is surprising and worrisome is that Whisper drops whole "blocks of text" inside a .wav chunk.

Have you a workaround for this? Notebook here.
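To locate dropped blocks automatically rather than by eye, the two transcripts can be diffed sentence by sentence. This is a minimal sketch using Python's standard `difflib`; the function name `find_dropped_blocks` is hypothetical, and the sample sentences are a shortened excerpt of the example in this post.

```python
import difflib

def find_dropped_blocks(full_sentences, chunk_sentences):
    """Return runs of sentences present in the full-file transcript
    but absent from the chunked transcript."""
    matcher = difflib.SequenceMatcher(a=full_sentences, b=chunk_sentences, autojunk=False)
    dropped = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete" marks a run that exists only in the full-file transcript.
        if tag == "delete":
            dropped.append(full_sentences[i1:i2])
    return dropped

full = [
    "Okay.",
    "Thank you.",
    "And with announcements, it's time to move into release updates.",
    "And to give us release updates, we should have one of our release leads.",
    "Hi, everyone.",
]
chunked = ["Okay.", "Thank you.", "Hi, everyone."]

dropped = find_dropped_blocks(full, chunked)
for block in dropped:
    print("missing:", block)
```

Runs tagged "replace" (wording differences rather than outright omissions) are deliberately ignored here; they would need a fuzzier comparison.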

Example:
### Comparing the excerpt from record.wav and the transcription of record_1.wav

#### The first block is equivalent

We are currently aiming at kicking off the organization on Monday. There is roughly a plan to make it unconference style, single track plus hallway track.
And volunteers are still welcome.
If you want to volunteer, just join the meeting on Monday.
The thing is in the notes document that I'm going to post in the chat in a bit and or just join the 
summit staff channel, but only join the summit staff channel.
If you want to volunteer, you just want to join the contributor summit.
There's also a contributor summit channel where you're going to post registration information and other kind of things.
Okay.
Thank you.

#### This second block appeared in record.wav but went missing in the record_1.wav transcription

< And with announcements, it's time to move into release updates.
< And to give us release updates, we should have one of our release leads.
< Yeah, Sissy here.
< Yep.

#### This third block got a wonky transcription in record_1.wav. Compare it with the excerpt from record.wav

< Hey, Sissy, go ahead.
< Hi.
< Yeah.
< Hi, everyone.
< This is Sissy from the release team.
< I'm going to be the release lead shadow for this cycle and I'm going to give updates for the release team.
< And currently we are at week 10 and the release is currently scheduled for April, which is in wait 15th.
< And I will just give some major milestones for everyone to be aware of.
< First, we are going to start to take exceptions starting March 21st.
< And we are going to have like one more meeting added starting next week, also 21st of March.
< And we will have daily breakout meetings starting March 28th.
< And also we will have our first release retrospective meeting scheduled also next week.
< And currently we have 16.
< 6 enhancement tracked and we are like roughly two weeks to code freeze, which is Wednesday 30th of March.
< So for all the tracking enhancement, please be aware you have to get all the PR merged before the code freeze to be able to consider that it's completed.
< And for all the enhancement which needs docs, please be aware you have to open the placeholder PRs before 31st of March.
< And the doc deadline for PR ready to review is 5th of April.
< And our docs team will make sure that all the docs emerge one week before the monthly release.
< And we are starting to collect the major SIEMs candidate.
< And if you have any suggestions, please don't hesitate to reach out, also any potential release block context.
< And we are reached in the SIG release channel.
< So please reach out there if you have any questions or concerns.
< For the CI signal update, we currently still have one filling test.
< We will be syncing with the CI signal guys to make sure it gets addressed before the release.
< And for the patch release, backward relating for the alpha 4 release card and all the previous release card, which is mostly, yeah, that's basically updates from us.

#### Wonky transcription from record_1.wav

> So, go ahead.
> I am young.
> Hi everyone.
> This is to see from the red list.
> I'm going to be starting in late 15th, and I will just give some major milestones for everyone to be aware of.
> First we are going to start to take exceptions starting March 1, and we are going to have like one more meeting at a study next week.
> Also, 21st of March, and we will be have daily meetings starting on the 8th of April.
> Sorry, March.
> And we will have our first release retrospective meeting scheduled also next week.
> Currently we have 16 six enhancement tracked, and we are like, roughly two weeks to code phrase, which is Wednesday, 20th or 30th of March.
> So, for all the tracking has meant, let's be aware you have to get all the PR merged before the code phrase to be able to consider that is completed.
> And for all the, which needs stocks, please be aware you have to open up this folder PRS before before 31st of March, and the top deadline for care ready to review is this April, and our docs team will make sure that all the docs in virtual one week before the release.
> Next, and we are reached up to release channel so please reach out to Barry, you have any questions or concerns.
> And for the patch release backward to the meeting for the offer for release card, and all the previous with this card which mostly.
> Yeah, that's basically up this from us.

MrEdwards007 · 2022-11-02T02:05:56Z

What I read and saw first-hand is that Whisper is non-deterministic. You may not get the exact same transcription from execution to execution on the same input file. I also chunked my audio and sometimes Whisper would begin to confabulate, putting in text that was clearly not in the audio. I experienced this in tiny, base, small and medium to varying degrees. This was more pronounced, happening occasionally in the "small.en" and more often in "medium.en". The input files were the same but Whisper would sometimes give slightly different output at some points and in other times would go completely rogue for a moment.

The other thing that is going to be different about chunked files is that when you cross boundaries, normalization of text will change because it has lost context, so punctuation at the end of the file and capitalization at the beginning of the new file will almost certainly be incorrect. Also, for much the same reason, the last word at the end of a chunk and the first word at the beginning of the next may be fragmented, making Whisper "guess," sometimes incorrectly. I made a note of this in my quest to accelerate transcription without changing Whisper.

https://github.com/MrEdwards007/WhisperTaskAcceleration

For testing purposes, I put a marker to indicate where the transcription changed files and that's when I realized the reason for some of the issues experienced.
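A boundary marker of this kind can be sketched as follows. The marker text and the helper name `join_with_markers` are assumptions for illustration; the linked repository may do this differently.

```python
def join_with_markers(chunk_texts, marker="[[CHUNK BOUNDARY {n}]]"):
    """Concatenate per-chunk transcripts, inserting a visible marker
    wherever the text crosses from one file to the next."""
    parts = []
    for index, text in enumerate(chunk_texts):
        if index > 0:
            # The marker number is the index of the chunk that starts here.
            parts.append(marker.format(n=index))
        parts.append(text.strip())
    return "\n".join(parts)

merged = join_with_markers(["the meeting starts on", "Monday. volunteers are welcome"])
print(merged)
```

With the marker in place, odd punctuation, capitalization, or split words can be checked against the exact position where the audio was cut.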

So, my belief is that this is a mixture of two issues. The first is that Whisper is non-deterministic, and the second is the set of problems associated with transcribing across file boundaries.

4 replies

i-am-neo Nov 17, 2022
Author

@MrEdwards007 One can produce the same transcription across multiple runs by setting beam_size=None and temperature=0.
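As a minimal sketch of those settings, assuming `model` is the object returned by `whisper.load_model(...)` from the openai-whisper package (the wrapper name `transcribe_repeatably` is hypothetical):

```python
def transcribe_repeatably(model, audio_path):
    """Call Whisper with greedy decoding so that repeated runs agree."""
    return model.transcribe(
        audio_path,
        temperature=0.0,  # a single temperature, so no sampling at higher temperatures
        beam_size=None,   # None selects greedy decoding rather than beam search
    )

# Usage, assuming the openai-whisper package and a local file:
#   import whisper
#   model = whisper.load_model("large")
#   result = transcribe_repeatably(model, "record_1.wav")
#   print(result["text"])
```

Small numerical differences between hardware or library versions may still change an occasional token, so this should be read as "repeatable on the same setup" rather than a hard guarantee.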

MrEdwards007 Nov 17, 2022

I am familiar with setting the temperature to zero in NLP to ensure a predictable (deterministic) outcome. I recall that a beam search is used to explore different paths to a conclusion, but again, I do not understand its use in this context.

Setting aside my lack of understanding of beam and temperature in this context:
if I generalize this to NLP, then we have settings for the beam and temperature that would ensure repeatability.

Where does the prompt come into play and how can it be utilized?
Wouldn't I lose accuracy because the beam and temperature (guessing) in this context would allow for searching for the most probabilistic outcome based on context. Changing them to zero makes them repeatable but likely less accurate.

I am really reaching here because I don't understand a prompt in this context, but could I essentially provide a prompt and have Whisper give a GPT-3 type of response, such as summarization, sentiment analysis, question answering, or text generation? These are the contexts where I know that prompts and prompt engineering exist.

FYI, when I ran Whisper repeatedly against the same target files, I thought something was really wrong in the audio, until I realized it just made up things that were completely absent from the audio, often around the same time. What I would (now) be inclined to do is to see if there is something about that segment of audio that is "special\unique." I would transplant it somewhere else or, better still, make a copy of the same segment and duplicate it elsewhere to see if Whisper reacts in the same way. That way I know whether it's something about that fragment of audio or about where it is located in the file or across file boundaries.
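The duplication experiment could be set up along these lines. This is only a sketch: `duplicate_segment` is a hypothetical helper, `samples` is assumed to be a plain Python list of PCM samples, and 16000 Hz is used as the default because that is the rate Whisper resamples audio to.

```python
def duplicate_segment(samples, start_s, end_s, insert_at_s, sample_rate=16000):
    """Return a copy of `samples` with the [start_s, end_s) span also
    inserted at `insert_at_s` (all times in seconds)."""
    start = int(start_s * sample_rate)
    end = int(end_s * sample_rate)
    insert_at = int(insert_at_s * sample_rate)
    segment = samples[start:end]
    # The original span stays where it was; a second copy is spliced in.
    return samples[:insert_at] + segment + samples[insert_at:]

# Toy data at 1 sample per second, so the indices are easy to follow.
toy = list(range(10))
moved = duplicate_segment(toy, start_s=2, end_s=4, insert_at_s=8, sample_rate=1)
print(moved)
```

If the confabulation follows the copied segment to its new position, the audio content is the likely trigger; if it stays at the original timestamp, position or boundaries are more suspect.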

What I am baffled about is that Whisper was trained on 30 second clips and utilizes 30 second windows. Since this is the case, then no matter how long the target audio is (e.g. 4 hours), a 30 second window is still 30 seconds of transcription\translation. The only difference should be how it treats the normalization of the text.

This is a valuable conversation.

Thank you.

MrEdwards007 Nov 17, 2022

I did some searching and found this nugget which could be particularly valuable.
#117 (comment)

OK, I see the use of the prompt, which provides context for transcription.
This could be a bit complicated, but given that the files are fragmented, taking the ending portion of the transcript from one file and feeding it in at the beginning of the next file in the queue keeps the context for text normalization. It's complicated for my use case, since all my fragments are being processed in parallel. It's something I will experiment with when I get some time.
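A sequential version of that idea might look like the sketch below. `initial_prompt` is a parameter of Whisper's `transcribe`; the wrapper name `transcribe_with_carryover` and the 200-character tail are arbitrary assumptions, and `model` is assumed to come from `whisper.load_model(...)`.

```python
def transcribe_with_carryover(model, chunk_paths, prompt_chars=200):
    """Transcribe chunks in order, passing the tail of each transcript
    as `initial_prompt` for the next chunk."""
    texts = []
    prompt = None  # the first chunk has no preceding context
    for path in chunk_paths:
        result = model.transcribe(path, initial_prompt=prompt)
        text = result["text"].strip()
        texts.append(text)
        # Keep only the end of the transcript; an empty tail becomes None.
        prompt = text[-prompt_chars:] or None
    return texts
```

Because each chunk waits for the previous transcript, this gives up the parallelism mentioned above; it trades speed for continuity at the boundaries.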

i-am-neo Dec 3, 2022
Author

> It's complicated for my use case, since all my fragments are being processed in parallel.

Adding to that, during "failure mode" when Whisper drops first words of an audio snippet and/or drops words from a preceding audio snippet, you may end up in a state where you can't be sure what should have been transcribed.
That said, depending on your application, you could engineer remedial solutions for transcription outputs you've detected to be less than desired.

> Where does the prompt come into play and how can it be utilized?

I think you've found some information. :) If you look at the Whisper implementation, I think you'll find that all tokens produced thus far for a segment are used as input/context to predict the next token, unless you reset the context for that segment.

> Wouldn't I lose accuracy because the beam and temperature (guessing) in this context would allow for searching for the most probabilistic outcome based on context. Changing them to zero makes them repeatable but likely less accurate.

Setting temperature = 0 (and beam search to None) for this implementation makes the model default to greedy decoding (vs beam search). Greedy decoding, however, makes Whisper, which is a seq2seq model, susceptible to repetitive output loops at times. According to their paper, the authors introduced beam search in the implementation specifically to address certain problems they encountered in long-form transcription, which presumably your application is (based on your desire to parallelize audio chunks). Quoting from the paper:

> We observed that it is crucial to have beam search and temperature scheduling based on the repetitiveness and the log probability of the model predictions in order to reliably transcribe long audio.
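The temperature scheduling described in that quote can be illustrated with a toy fallback loop. This is a simplified sketch, not the library's code: `decode_fn` is a hypothetical callable returning `(text, avg_logprob)`, and the default thresholds are what I believe the reference `transcribe` uses, so treat them as assumptions.

```python
import zlib

def compression_ratio(text):
    """Repetitive text compresses well, so a large ratio hints at a loop."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def decode_with_fallback(decode_fn, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         max_compression_ratio=2.4, min_avg_logprob=-1.0):
    """Retry `decode_fn(temperature)` at rising temperatures until the
    output looks neither repetitive nor improbable."""
    result = None
    for temperature in temperatures:
        result = decode_fn(temperature)
        text, avg_logprob = result
        too_repetitive = compression_ratio(text) > max_compression_ratio
        too_unlikely = avg_logprob < min_avg_logprob
        if not (too_repetitive or too_unlikely):
            break  # accept the first output that passes both checks
    return result  # if nothing passes, the last attempt is returned
```

Passing a single temperature of 0 removes the retries, which is why the repeatable setting is also the one most exposed to repetition loops.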

Hope helpful.

jianfch · 2022-11-02T17:01:27Z

This is expected behavior. Long-form transcription works by using tokens of previous segments to prompt the new segment (i.e. context). This logic is all handled within transcribe. Chunking the audio and calling transcribe on each chunk isolates the context of the chunk. It comes down to multiplying different numbers: different inputs are expected to produce different results.
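The effect of that context can be shown with a toy model of the loop. Everything here is a simplification under stated assumptions: `decode_fn(window, prompt)` is a hypothetical decoder, and the real implementation advances windows by predicted timestamps and passes only a limited tail of earlier tokens.

```python
def transcribe_long_form(windows, decode_fn, condition_on_previous=True):
    """Decode windows in order, prompting each with the tokens produced so far."""
    all_tokens = []
    for window in windows:
        prompt = list(all_tokens) if condition_on_previous else []
        all_tokens.extend(decode_fn(window, prompt))
    return all_tokens

def toy_decoder(window, prompt):
    # Reuse a spelling already in the context; otherwise take the window's own guess.
    return [prompt[-1]] if prompt else [window]

# One call over the whole file: the second window sees the first as context.
one_pass = transcribe_long_form(["Brian", "Bryan"], toy_decoder)

# One call per chunk: each chunk starts with an empty context.
chunked = []
for chunk in (["Brian"], ["Bryan"]):
    chunked.extend(transcribe_long_form(chunk, toy_decoder))

print(one_pass, chunked)
```

The two results differ only because the second window's prompt differs, which is the isolation effect described above.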

5 replies

dgoryeo Nov 3, 2022

Thanks @jianfch. This was a very helpful clarification. Does this mean that if we use a VAD before a long-form transcribe, we will similarly lose context?

jianfch Nov 3, 2022

Depends on how VAD is used. If the audio is fed into transcribe as one piece, then it will likely be fine.

i-am-neo Nov 17, 2022
Author

@jianfch Thanks. The issue here wasn't so much the tokens produced by previous segments (the missing segment was preceded by some segments with tokens). Rather I tracked it down to output differences between the large and medium models.
The missing segment is no longer "missing" when the medium model is used (vs large).

Side note for @dgoryeo - Whisper does its own VAD check in this implementation, upon which a segment is skipped over.
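A rough sketch of that skip decision, based on my reading of the reference `transcribe` and not copied from it: the function name is hypothetical, and the default thresholds (0.6 and -1.0) are what I believe the library uses, so verify them against your installed version.

```python
def should_skip_segment(no_speech_prob, avg_logprob,
                        no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Return True when a window is treated as silence and skipped."""
    skip = no_speech_prob > no_speech_threshold
    # A sufficiently confident transcription overrides the no-speech signal.
    if avg_logprob > logprob_threshold:
        skip = False
    return skip

print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-1.5))
print(should_skip_segment(no_speech_prob=0.9, avg_logprob=-0.2))
```

Under this logic a window is dropped only when the model both suspects silence and is unsure of its own text, which is one way a stretch of real speech could go missing.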

jianfch Nov 17, 2022

That's likely due to the medium model performing better without a prompt than the large model in this case. But it remains the case that the model will generally perform better with a prompt (of previous tokens) than without one. This is the method transcribe uses to produce consistent long-form transcription/translation despite only working in chunks. Transcriptions largely remain consistent even without context, but occasionally you will get "Brian" in one chunk and "Bryan" in another. The effects are much more noticeable with translations.

i-am-neo Nov 17, 2022
Author

I hear you, and experimented by feeding the large model a long text as initial prompt. Basically all the text preceding the missing segment in question. The model still didn't produce the missing segment as the medium model did.

FurkanGozukara · 2022-11-09T00:05:20Z

I also noticed that mp3 performs better than the native audio of the mkv recordings.

0 replies
