"initial_prompt" appears to progressively override audio for longer streams #278

AdolfVonKleist opened this issue Sep 26, 2024 · 5 comments


AdolfVonKleist commented Sep 26, 2024

I've been using WhisperLive with great success recently in multiple languages. Seriously amazing. I recently noticed the support for initial_prompt which was added in January, and tried applying it to my use case.

I have noticed that while the initial_prompt value works amazingly well during the first 10-20 s of a conversation, beyond that point it suddenly starts to completely override the input audio.

For example, I'll specify a 'corrected' spelling for a company name: SupaSqrrl DIE-namics instead of Super Squirrel Dynamics. In the first 20 s, any utterances of this phrase are perfectly transcribed according to the initial_prompt value I've added: SupaSqrrl DIE-namics. However, as the conversation progresses, this boosted phrase starts to override all other input speech, and the recognizer just ends up outputting the initial_prompt over and over again.

I thought maybe the prompt was being provided repeatedly somewhere in the code, but after a cursory review of the source I didn't see anything like that.

I'm wondering if anyone else has experienced something similar?

edit: I can also confirm I don't see this behavior in longer files when I transcribe in batch mode with whisperx or faster-whisper.
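
For reference, a minimal sketch of the kind of batch-mode faster-whisper call I'm comparing against (the model size, device settings, and file name here are illustrative, not my exact setup):

    from faster_whisper import WhisperModel

    # Batch-mode comparison: here initial_prompt biases spelling but never
    # drowns out the audio, even for long files.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, info = model.transcribe(
        "long_meeting.wav",                      # illustrative file name
        initial_prompt="SupaSqrrl DIE-namics",   # corrected spelling to bias towards
        vad_filter=True,
    )
    for segment in segments:
        print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")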


AdolfVonKleist commented Sep 27, 2024

So I dug into this a bit more and was able to confirm that basically two things are happening when I use the websocket connection and the faster-whisper backend (I assume it's the same for TensorRT but cannot verify it):

  1. The main loop calls .transcribe() repeatedly, once for each new set of samples. Each of those calls reaches segments = self.generate_segments(features, tokenizer, options, encoder_output) via transcribe_audio, but it never results in any internal iteration over multiple windows, no matter how long I stream audio to the transcriber. This conditional is therefore hit once for every input_sample as well:
        if options.initial_prompt is not None:
            if isinstance(options.initial_prompt, str):
                initial_prompt = " " + options.initial_prompt.strip()
                initial_prompt_tokens = tokenizer.encode(initial_prompt)
                all_tokens.extend(initial_prompt_tokens)
            else:
                all_tokens.extend(options.initial_prompt)

  2. The result is that, even when 'turned on', the context is never extended with earlier content; the initial_prompt is simply injected once for each new clip.

I looked at instead tracking the 'last_segment' returned to transcribe_audio in the client, as well as sending the global current timestamp_offset, to see how I might change or influence the results.
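
To make the effect concrete, here is a toy, self-contained illustration (plain Python, no Whisper involved; the function and variable names are mine) of how the decoder context differs between batch and streaming use:

    # Toy model of the decoder context: generate_segments conditions each
    # 30 s window on the tokens accumulated so far within a single
    # transcribe() call. WhisperLive makes a fresh call per clip, so
    # nothing ever accumulates beyond the initial_prompt.

    def build_context(initial_prompt_tokens, decoded_so_far):
        """Mimics all_tokens[prompt_reset_since:] inside one transcribe() call."""
        return list(initial_prompt_tokens) + [t for seg in decoded_so_far for t in seg]

    prompt = ["Supa", "Sqrrl", "DIE", "-", "namics"]

    # Batch mode: one long call, the context keeps growing with real speech.
    print(build_context(prompt, [["we"], ["met"], ["with"], ["Supa", "Sqrrl"]]))

    # Streaming mode: a new call per clip, so decoded_so_far is always empty
    # and the only context the decoder ever sees is the prompt, every time.
    for _ in range(3):
        print(build_context(prompt, []))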

If I send the initial_prompt only during the first 10-20s of the stream it works well. Otherwise it starts to override the content of the audio. I also tried sharing the 'last_segment' by extending transcribe:

        result, info = self.transcriber.transcribe(
            input_sample,
            timestamp_offset=self.timestamp_offset,  # added to track global state in transcribe
            last_segment=self.last_segment,  # added to track 'latest' text segment in transcribe
            initial_prompt=self.initial_prompt,
            language=self.language,
            task=self.task,
            vad_filter=self.use_vad,
            vad_parameters=self.vad_parameters if self.use_vad else None)
        self.last_segment = result

This worked a little better, but unfortunately seemed to introduce a lot of new 'gaps' in the STT results, presumably because the last_segment I'm providing here is not necessarily aligned with the previous clip. In any case, for the moment it appears to be a bust. It's a shame, because in the non-streaming version this feature is amazingly robust; here there seems to be no 'quick fix' as I had hoped.
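
The one stopgap that does behave, as noted above, is only sending the prompt for roughly the first 20 s of the stream. A minimal sketch of that workaround (the cutoff value and the way I gate it are my own local choices, not anything built into WhisperLive):

    # Workaround sketch: only bias the decoder with initial_prompt early in
    # the stream, before it starts to dominate the output. The 20 s cutoff
    # is arbitrary; I'm assuming self.timestamp_offset is in seconds.
    PROMPT_WINDOW_SECONDS = 20.0

    active_prompt = (
        self.initial_prompt if self.timestamp_offset < PROMPT_WINDOW_SECONDS else None
    )

    result, info = self.transcriber.transcribe(
        input_sample,
        initial_prompt=active_prompt,
        language=self.language,
        task=self.task,
        vad_filter=self.use_vad,
        vad_parameters=self.vad_parameters if self.use_vad else None)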

It may just be a matter of more carefully time-aligning the 'most recent' partial output with the current clip, as the infrastructure in transcribe implies, but as far as I can tell from my tests over the last day or so, that path is currently never actually activated.

Maybe there's something else I'm missing here as well.

@tripled-yang

Good day, sir. Could you share any more observations on this issue? (I do not see this issue when transcribing in real time from the microphone.)

By the way, just a question: where is the code below located?
        if options.initial_prompt is not None:
            if isinstance(options.initial_prompt, str):
                initial_prompt = " " + options.initial_prompt.strip()
                initial_prompt_tokens = tokenizer.encode(initial_prompt)
                all_tokens.extend(initial_prompt_tokens)
            else:
                all_tokens.extend(options.initial_prompt)


AdolfVonKleist commented Oct 15, 2024

@zeliang3 it is here:

if options.initial_prompt is not None:

I haven't had a chance to look at it closely again. I see the behavior constantly over the websocket; I'm using it in streaming mode from a ReactJS web application. Can you provide a minimal usage example for your microphone-based approach? I have not tried that myself; maybe I'll have better luck comparing against a working alternative.

I'll be happy to invest another day or so in this and provide a pull request if I can suss it out, but I either need a bit more free time or some kind of hint.


tripled-yang commented Oct 17, 2024

Just call client() and it will pick up the current microphone, @AdolfVonKleist:

from whisper_live.client import TranscriptionClient

client = TranscriptionClient(
    "192.168.1.100",
    9090,
    translate=False,
    model="large",
    save_output_recording=True,
    output_recording_filename="./output_recording.wav"
)

client()


makaveli10 (Collaborator) commented Dec 18, 2024

Thanks for opening the issue. Instead of last_segment=self.last_segment, can you try getting the last_segment from the results of the faster_whisper transcribe call, i.e., the segments here:

def update_segments(self, segments, duration):
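
Something along these lines (just a sketch; apart from update_segments and the existing transcribe call, the attribute names and the 16 kHz assumption are illustrative):

    # Sketch: keep the text of the last segment that faster-whisper returned
    # (the same `segments` passed into update_segments) and feed it forward
    # as the prompt for the next clip, instead of re-sending the static
    # initial_prompt every time.
    result, info = self.transcriber.transcribe(
        input_sample,
        initial_prompt=getattr(self, "last_segment_text", None) or self.initial_prompt,
        language=self.language,
        task=self.task,
        vad_filter=self.use_vad,
        vad_parameters=self.vad_parameters if self.use_vad else None)

    segments = list(result)
    duration = len(input_sample) / 16000  # assuming 16 kHz mono samples
    self.update_segments(segments, duration)

    if segments:
        # Remember the tail of what was just decoded for the next call.
        self.last_segment_text = segments[-1].text.strip()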
