"initial_prompt" appears to progressively override audio for longer streams #278

AdolfVonKleist · 2024-09-26T12:15:16Z

I've been using WhisperLive with great success recently in multiple languages. Seriously amazing. I recently noticed the support for initial_prompt which was added in January, and tried applying it to my use case.

I have noticed that while the initial_prompt value works amazingly well during the first 10-20s of a conversation, when we get beyond this point it suddenly starts to completely override the input audio.

For example I'll specify a 'corrected' spelling for a company name: SupaSqrrl DIE-namics instead of Super Squirrel Dynamics. In the first 20s any utterances of this phrase will be perfectly transcribed according to the initial_prompt value I've added: SupaSqrrl DIE-namics. However as the conversation progresses this boosted phrase will start to override all other input speech and the recognizer will just end up outputting the initial_prompt over and over again.

I thought maybe the prompt was being provided repeatedly somewhere in the code, but after a cursory review of the source I didn't see anything like that.

I'm wondering if anyone else has experienced something similar?

edit: I also can confirm I don't see this behavior in longer files when I transcribe in batch mode with whisperx or faster-whisper.


AdolfVonKleist · 2024-09-27T09:42:11Z

So I dug into this a bit more and was able to confirm that basically two things are happening when I use the websocket connection with the faster-whisper version (I assume it's the same for TensorRT but cannot verify):

This loop is entered again for each new set of samples sent to .transcribe(). However, the call segments = self.generate_segments(features, tokenizer, options, encoder_output), reached via transcribe_audio, never results in any internal iteration, no matter how long I stream audio to the transcriber. This conditional is also evaluated once for every input_sample:

        if options.initial_prompt is not None:
            if isinstance(options.initial_prompt, str):
                initial_prompt = " " + options.initial_prompt.strip()
                initial_prompt_tokens = tokenizer.encode(initial_prompt)
                all_tokens.extend(initial_prompt_tokens)
            else:
                all_tokens.extend(options.initial_prompt)

The result is that, even when it is 'turned on', the context is never extended with earlier content; the prompt is simply rebuilt once per clip from the initial_prompt value. I looked instead at tracking the 'last_segment' returned to transcribe_audio in the client, and at sending the global current timestamp_offset, to see how I might influence the results.
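To make the effect concrete, here is a toy model of the behaviour described above. It is not the actual WhisperLive code: `toy_encode` and `prompt_tokens_for_clip` are hypothetical stand-ins for the tokenizer and for the prompt-building step, and the sketch assumes each streamed clip starts with an empty token history.

```python
def toy_encode(text):
    # Hypothetical stand-in for tokenizer.encode: one "token" per word.
    return text.split()


def prompt_tokens_for_clip(initial_prompt, carried_tokens=()):
    # Mirrors the conditional quoted above: start from whatever history
    # was carried over, then append the initial prompt.
    all_tokens = list(carried_tokens)
    if initial_prompt is not None:
        if isinstance(initial_prompt, str):
            all_tokens.extend(toy_encode(" " + initial_prompt.strip()))
        else:
            all_tokens.extend(initial_prompt)
    return all_tokens


# Assumption: in streaming mode each clip is a fresh call with no history,
# so the decoder context is the prompt alone, clip after clip.
streaming_contexts = [
    prompt_tokens_for_clip("SupaSqrrl DIE-namics") for _ in range(3)
]

# Contrast (also an assumption): within one long call the same list keeps
# growing as text is decoded, so the prompt is diluted by real content.
single_call_context = prompt_tokens_for_clip("SupaSqrrl DIE-namics")
single_call_context.extend(toy_encode("thanks for calling"))
```

In the streaming case every context is identical and contains nothing but the prompt, which would be consistent with the prompt gradually dominating the output.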

If I send the initial_prompt only during the first 10-20s of the stream it works well. Otherwise it starts to override the content of the audio. I also tried sharing the 'last_segment' by extending transcribe:

        result, info = self.transcriber.transcribe(
            input_sample,
            timestamp_offset=self.timestamp_offset,  # added to track global state in transcribe
            last_segment=self.last_segment,  # added to track 'latest' text segment in transcribe
            initial_prompt=self.initial_prompt,
            language=self.language,
            task=self.task,
            vad_filter=self.use_vad,
            vad_parameters=self.vad_parameters if self.use_vad else None)
        self.last_segment=result

This worked a little better, but unfortunately seemed to introduce a lot of new 'gaps' in the STT results, presumably because the last_segment I'm providing here is not necessarily aligned with the previous clip. In any case, for the moment it appears to be a bust. It's a shame, because in the non-streaming/live version this feature is amazingly robust. Here it seems there is at least no 'quick fix' as I had hoped.

It may just need the 'most recent' partial output to be time-aligned more carefully with the current clip, as the infrastructure in transcribe implies, but as far as I can tell from my tests over the last day or so that path is currently never activated.

Maybe there's something else I'm missing here as well.
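For reference, the workaround of sending the prompt only near the start of the stream could be sketched roughly as below. `prompt_for_clip` and `cutoff_seconds` are hypothetical names, and the 15 s default is just a guess inside the 10-20 s window mentioned above, not a tuned value.

```python
def prompt_for_clip(initial_prompt, timestamp_offset, cutoff_seconds=15.0):
    """Return the prompt only while the stream is younger than the cutoff."""
    if initial_prompt is None:
        return None
    if timestamp_offset < cutoff_seconds:
        return initial_prompt
    # Past the cutoff: stop biasing the decoder with the prompt.
    return None


# Example: clips starting 0 s, 10 s and 30 s into the stream.
prompts = [prompt_for_clip("SupaSqrrl DIE-namics", t) for t in (0.0, 10.0, 30.0)]
```

The returned value would then be passed as initial_prompt= in the transcribe call in place of self.initial_prompt. The obvious drawback is that the corrected spelling is no longer boosted later in the conversation.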

tripled-yang · 2024-10-13T14:03:50Z

Good day, sir. Do you have any more observations on this issue? (I do not see it in real-time transcription from the microphone.)

By the way, just one question: where is the code below located?
        if options.initial_prompt is not None:
            if isinstance(options.initial_prompt, str):
                initial_prompt = " " + options.initial_prompt.strip()
                initial_prompt_tokens = tokenizer.encode(initial_prompt)
                all_tokens.extend(initial_prompt_tokens)
            else:
                all_tokens.extend(options.initial_prompt)

AdolfVonKleist · 2024-10-15T13:31:38Z

@zeliang3 it is here:

WhisperLive/whisper_live/transcriber.py

Line 464 in be71657

if options.initial_prompt is not None:

I haven't had a chance to look at it closely again. I see it constantly over the websocket: I'm using streaming mode from a ReactJS web application. Can you provide a minimal usage example for your microphone-based approach? I have not tried that myself. Maybe I'll have better luck comparing it against a working alternative.

I'll be happy to invest another day or so in this and provide a pull request if I can suss it out; but I either need a bit more free time, or some kind of hint.

tripled-yang · 2024-10-17T06:11:19Z

Just call client() and it will pick up the current microphone, bro @AdolfVonKleist

from whisper_live.client import TranscriptionClient

client = TranscriptionClient(
  "192.168.1.100",
  9090,
  translate=False,
  model="large",
  save_output_recording=True,
  output_recording_filename="./output_recording.wav"
)

client()

makaveli10 · 2024-12-18T06:53:21Z

Thanks for opening the issue. Instead of last_segment=self.last_segment, can you try taking the last_segment from the results of the faster_whisper transcribe call, i.e. the segments here:

WhisperLive/whisper_live/server.py

Line 1018 in be71657

def update_segments(self, segments, duration):
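One possible reading of this suggestion, as a rough sketch only: build the context text from the segments the transcribe call returned for the previous clip. `ToySegment` and `context_from_segments` are hypothetical names, `ToySegment` only mimics the start, end and text fields of a faster_whisper segment, and treating the final segment as still partial is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToySegment:
    # Hypothetical stand-in with the same fields a segment exposes.
    start: float
    end: float
    text: str


def context_from_segments(segments, max_chars=200):
    """Join the text of completed segments into a short context string."""
    # Assumption: the last segment of a clip may still be cut off mid-word,
    # so only the segments before it are treated as final.
    completed = list(segments)[:-1]
    recent = " ".join(s.text.strip() for s in completed if s.text.strip())
    # Keep only the tail so the context stays short (may cut mid-word).
    recent = recent[-max_chars:]
    return recent or None


segments = [
    ToySegment(0.0, 2.0, " Hello there."),
    ToySegment(2.0, 4.0, " Welcome to SupaSqrrl"),
    ToySegment(4.0, 5.0, " DIE-na"),
]
context = context_from_segments(segments)
```

Dropping the trailing partial segment is meant to avoid feeding half-words back into the next clip, which may be one source of the 'gaps' reported earlier in the thread.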
