Ultravox audio streaming #278

Open
FelixNeutatzMainWebSolutions opened this issue Feb 4, 2025 · 1 comment
@FelixNeutatzMainWebSolutions

Hi everyone,

I am currently experimenting with streaming audio into the model. The idea is to improve latency by ingesting parts of the audio before the user's utterance is finished. Since computation already runs while the user is speaking, the model can answer faster.

This is what I came up with:

# Import paths assume the ultravox repo layout.
from ultravox.data import datasets
from ultravox.inference import ultravox_infer

inference = ultravox_infer.UltravoxInference(
    "fixie-ai/ultravox-v0_4",
    device=None,
    data_type=None,
    conversation_mode=True,
)

# Feed the early audio chunks with max_tokens=1 so the model ingests them
# without generating a full reply, then drop the placeholder assistant turn.
user_audio_prompt = datasets.VoiceSample.from_prompt_and_file("<|audio|>", "part0.mp3")
inference.infer(user_audio_prompt, max_tokens=1)
del inference.past_messages[-1]

user_audio_prompt = datasets.VoiceSample.from_prompt_and_file("<|audio|>", "part1.mp3")
inference.infer(user_audio_prompt, max_tokens=1)
del inference.past_messages[-1]

# Generate the full answer only on the final chunk.
user_audio_prompt = datasets.VoiceSample.from_prompt_and_file("<|audio|>", "part2.mp3")
output = inference.infer(user_audio_prompt)
print(output)

Is there any more efficient approach to this?

Thank you for your help.

Best regards,
Felix

@zqhuang211 (Contributor)

That’s one way to do it, but it’s not quite right, since the audio segments are encoded separately. Additionally, they are treated as separate speaker turns, and you waste compute on the inference of the intermediate segments. The latency incurred by these additional inference steps would be significantly higher than that of speech encoding.

We experimented with block-wise unidirectional encoding for the speech encoder. You can find a config here:

exp_name: "ultravox-streaming-experiments-1s"

We haven’t done much work on this feature yet, so it could break training/inference or hurt model performance. But it’s a more viable solution.
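To make the block-wise unidirectional idea concrete, here is a minimal sketch of the attention-mask pattern it implies: frames attend bidirectionally within their own block and to all earlier blocks, but never to future blocks. This is an illustrative NumPy sketch, not the actual Ultravox encoder implementation; the function name and block size are assumptions.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Build a block-wise unidirectional (block-causal) attention mask.

    Entry [i, j] is True iff position i may attend to position j:
    allowed within the same block and for any earlier block,
    disallowed for future blocks.
    """
    blocks = np.arange(seq_len) // block_size
    # Position i may attend to position j iff j's block is not later than i's.
    return blocks[None, :] <= blocks[:, None]

# Example: 6 frames split into 3 blocks of 2 frames each.
mask = block_causal_mask(seq_len=6, block_size=2)
```

With such a mask, encoding a new 1-second block reuses the already-encoded past blocks instead of re-encoding the whole utterance from scratch, which is what makes it a better fit for streaming than repeated full inference calls.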
