[Model] Extend Ultravox to accept audio longer than 30s #13631
Signed-off-by: Farzad Abdolhosseini <[email protected]>
FYI @NickLucche for the usage of Whisper
Thanks for the contrib!
re @NickLucche: Here's the processor link: https://huggingface.co/fixie-ai/ultravox-v0_3-llama-3_2-1b/blob/main/ultravox_processing.py#L209 The logic: split each audio into 30-second chunks, but do not pad the last chunk to 30s (same as before). There are other ways we could have done this, but it matches what we do on the Ultravox side for both some of our fine-tuning and our evals. If we end up updating those, I'll update vLLM as well. Also note that since we don't pad the last chunk, and since in most cases the audio is shorter than 30s, the number of frames does not match across samples.
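For readers following along, the chunking rule described above can be sketched roughly as follows. This is a minimal illustration, not the linked processor's code: it works on a plain list of raw samples, assumes a 16 kHz sample rate, and the names `split_into_chunks`, `SAMPLE_RATE`, and `CHUNK_SECONDS` are hypothetical.

```python
from __future__ import annotations

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono input
CHUNK_SECONDS = 30     # chunk length discussed in the comment above


def split_into_chunks(audio: list[float],
                      sample_rate: int = SAMPLE_RATE,
                      chunk_seconds: int = CHUNK_SECONDS) -> list[list[float]]:
    """Split a waveform into consecutive chunks of at most `chunk_seconds`.

    There is no overlap between chunks, and the last chunk is left
    unpadded, so it may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_seconds
    return [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]


audio = [0.0] * (70 * SAMPLE_RATE)          # 70 seconds of silence
chunks = split_into_chunks(audio)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```

Because the final chunk keeps its natural length, two inputs of different durations produce different frame counts, which is the mismatch across samples mentioned above.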
9f00316 to 0c5363e
OK, I see. So that's naive chunking: it doesn't account for splitting mid-word, and there is no overlap and/or prompt from the previous chunk. This case seems much easier to handle vLLM-side, given that the changes are already in HF. Let's just make sure the batched Whisper forward pass is accounted for by the initial profiler run to avoid OOM.
Hello! @farzadab I left a few comments regarding the CI failure. Thanks for your contribution!
The tests are finally passing, yay!
Nice, let's merge this then.
Hey guys, looks like this PR broke the Ultravox LoRA tests, on both V0 and V1. See vllm/vllm/v1/worker/gpu_model_runner.py, line 1380 in 53be4a8.
cc @jeejeelee @robertgshaw2-redhat
Currently, the Ultravox model input is capped at 30 seconds and any extra audio is truncated (AFAIK). Also, each sample is fed to Whisper individually (without batching).
This PR allows longer audio by first splitting it into chunks, running the Whisper encoder on the chunks in batch mode, and then concatenating the results.
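The chunk, batch, and concatenate flow described here might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `chunk_batch_concat` and `toy_encoder` are hypothetical names, zero-padding to a common length within the batch is an assumption, and the stand-in encoder emits one output value per input sample, whereas a real Whisper encoder changes the frame rate and would need the trim lengths scaled accordingly.

```python
from __future__ import annotations

from typing import Callable, Sequence


def chunk_batch_concat(
    audios: Sequence[Sequence[float]],
    chunk_len: int,
    encode_batch: Callable[[list[list[float]]], list[list[float]]],
) -> list[list[float]]:
    # 1) Chunk every audio, remembering which audio each chunk came from.
    chunks: list[list[float]] = []
    owners: list[int] = []
    for idx, audio in enumerate(audios):
        for start in range(0, len(audio), chunk_len):
            chunks.append(list(audio[start:start + chunk_len]))
            owners.append(idx)

    # 2) Zero-pad to a common length so the chunks form one rectangular
    #    batch (assumption: padding happens only at batching time).
    lengths = [len(c) for c in chunks]
    max_len = max(lengths, default=0)
    padded = [c + [0.0] * (max_len - len(c)) for c in chunks]

    # 3) A single batched encoder call instead of one call per sample.
    encoded = encode_batch(padded)

    # 4) Drop the padded positions and concatenate chunks per audio, in order.
    outputs: list[list[float]] = [[] for _ in audios]
    for owner, length, enc in zip(owners, lengths, encoded):
        outputs[owner].extend(enc[:length])
    return outputs


def toy_encoder(batch: list[list[float]]) -> list[list[float]]:
    # Hypothetical stand-in for the audio encoder: doubles every sample.
    return [[2.0 * x for x in row] for row in batch]


audios = [[1.0] * 7, [3.0] * 2]
result = chunk_batch_concat(audios, chunk_len=3, encode_batch=toy_encoder)
print([len(r) for r in result])  # [7, 2]
```

Tracking the owner index of each chunk is what lets chunks from several requests share one encoder call while still being reassembled per audio afterwards.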
TODO: