-
From what I have read and seen first-hand, Whisper is non-deterministic: you may not get exactly the same transcription from run to run on the same input file. I also chunked my audio, and sometimes Whisper would begin to confabulate, inserting text that was clearly not in the audio. I experienced this in tiny, base, small and medium to varying degrees. It happened occasionally with "small.en" and more often with "medium.en". The input files were the same, but Whisper would sometimes give slightly different output at some points and at other times would go completely rogue for a moment.

The other difference with chunked files is that text normalization changes when you cross a boundary, because the context has been lost. Punctuation at the end of one file and capitalization at the beginning of the next will almost certainly be incorrect. For much the same reason, the last word of a chunk and the first word of the next may be fragmented, making Whisper "guess," sometimes incorrectly. I made a note of this in my quest to accelerate transcription without changing Whisper: https://github.com/MrEdwards007/WhisperTaskAcceleration

For testing purposes, I inserted a marker to indicate where the transcription changed files, and that is when I realized the reason for some of the issues I was seeing. So my belief is that this is a mixture of two issues: the first is that Whisper is non-deterministic, and the second is the set of problems associated with transcribing across file boundaries.
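One common way to soften the boundary problem described above is to cut chunks that overlap slightly, so a word split at the end of one chunk appears whole at the start of the next. The sketch below only computes the cut points. It is not taken from the linked repository, and the function name `chunk_spans` and its parameters are hypothetical names chosen for illustration.

```python
def chunk_spans(total_seconds, chunk_seconds, overlap_seconds):
    """Return (start, end) spans covering the audio, each overlapping the previous one.

    Hypothetical helper for illustration; the overlap is an assumed mitigation,
    not something Whisper does for you.
    """
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than the chunk")

    step = chunk_seconds - overlap_seconds  # how far each new chunk advances
    spans = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:  # the last span reached the end of the audio
            break
        start += step
    return spans


if __name__ == "__main__":
    # 100 s of audio, 30 s chunks, 5 s of overlap between neighbours
    for span in chunk_spans(100, 30, 5):
        print(span)
```

The overlapping region is transcribed twice, so the duplicated words still have to be reconciled when the chunk transcripts are joined; that step is not shown here.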
-
This is expected behavior. Long-form transcription works by using tokens of previous segments to prompt the new segment (i.e. context). This logic is all handled within
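When audio is split into separate files, that carried-over context is lost at every file boundary. A rough approximation, sketched below, is to pass the tail of the previous chunk's transcript as the prompt for the next chunk. The helper `transcribe_chunks` and its `transcribe_fn` callback are hypothetical names; the commented wiring assumes the `initial_prompt` and `temperature` arguments of `model.transcribe` in the `openai-whisper` package, and this is not a reproduction of Whisper's internal logic.

```python
def transcribe_chunks(chunk_paths, transcribe_fn, prompt_chars=200):
    """Transcribe chunks in order, feeding each one the tail of the previous text.

    transcribe_fn(path, prompt) -> str is supplied by the caller (hypothetical
    interface), which keeps this sketch independent of any particular library.
    """
    texts = []
    prompt = ""  # the first chunk has no previous context
    for path in chunk_paths:
        text = transcribe_fn(path, prompt).strip()
        texts.append(text)
        prompt = text[-prompt_chars:]  # keep only the most recent context
    return texts


# Possible wiring to Whisper (assumption, requires the openai-whisper package):
#
#   import whisper
#   model = whisper.load_model("small.en")
#   def whisper_fn(path, prompt):
#       result = model.transcribe(path, initial_prompt=prompt or None, temperature=0)
#       return result["text"]
#   texts = transcribe_chunks(["record_1.wav", "record_2.wav"], whisper_fn)
```

Setting `temperature=0` requests greedy decoding without the sampling fallback, which may reduce run-to-run variation, though it is not guaranteed to remove it.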
-
I also noticed that MP3 input performs better than the native audio of the MKV recordings.
-
@jongwook Thank you for open-sourcing Whisper. I'm getting odd results from Whisper when I transcribe a .wav file as is versus when it's chunked.
While I expected some "wording differences" at the end of a chunked .wav file, what is surprising, and worrying, is that Whisper drops "blocks of text" inside a .wav chunk.
Do you have a workaround for this? Notebook here.
Example:
### Comparing the excerpt from record.wav and the transcription of record_1.wav
#### The first block is equivalent
#### This second block appeared in record.wav but went missing in the record_1.wav transcription
#### This third block got a wonky transcription in record_1.wav. Compare it with the excerpt from record.wav
wonky transcription from record_1.wav
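A comparison like the one above can be automated with a word-level diff. The following is a minimal sketch using Python's standard `difflib`; the function name `transcript_diff` is hypothetical, and the sample strings are made up for illustration rather than taken from the notebook.

```python
import difflib


def transcript_diff(full_text, chunk_text):
    """Return (tag, words_in_full, words_in_chunk) for every non-matching run.

    'delete' marks words present in the full transcription but missing from the
    chunk transcription; 'replace' marks runs that were transcribed differently;
    'insert' marks words that only appear in the chunk transcription.
    """
    full_words = full_text.split()
    chunk_words = chunk_text.split()
    matcher = difflib.SequenceMatcher(a=full_words, b=chunk_words, autojunk=False)

    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            continue
        differences.append(
            (tag,
             " ".join(full_words[a_start:a_end]),
             " ".join(chunk_words[b_start:b_end]))
        )
    return differences


if __name__ == "__main__":
    # Made-up example strings, not real Whisper output
    full = "the quick brown fox jumps over the lazy dog"
    chunk = "the quick brown over the lazy dog"
    for difference in transcript_diff(full, chunk):
        print(difference)
```

The comparison is case- and punctuation-sensitive, so normalizing both texts first (lowercasing, stripping punctuation) would likely give cleaner results on real transcripts.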