[Model] Extend Ultravox to accept audio longer than 30s #13631
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
FYI @NickLucche for the usage of Whisper.
Thanks for the contrib!
re @NickLucche: Here's the processor link: https://huggingface.co/fixie-ai/ultravox-v0_3-llama-3_2-1b/blob/main/ultravox_processing.py#L209
The logic: for each audio, split it into 30-second chunks (but do not pad the last chunk to 30s, which is the same as before). There are other ways we could've done this, but it matches what we do on the Ultravox side for both some fine-tuning that we do and evals. If we end up updating those, I'll update vLLM as well.
Also, note that since we don't pad the last chunk, and since in most cases the audio is shorter than 30s, the number of frames does not match across samples.
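For illustration only, a minimal sketch of that naive chunking (the actual logic lives in the HF processor linked above; the function name, the 16 kHz sampling-rate default, and the parameters are assumptions for this example):

```python
import numpy as np

def chunk_audio(waveform: np.ndarray,
                sampling_rate: int = 16_000,
                max_chunk_secs: float = 30.0) -> list[np.ndarray]:
    """Split a waveform into consecutive 30-second chunks.

    The last chunk keeps its natural length (no padding to 30s), so the
    number of frames can differ across samples, as noted above.
    """
    chunk_len = int(max_chunk_secs * sampling_rate)
    return [waveform[i:i + chunk_len]
            for i in range(0, len(waveform), chunk_len)]
```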
Ok, I see. So that's a naive chunking where you don't account for splitting mid-word, nor do you have any overlap or prompt carried over from the previous chunk. This case seems much easier to handle vLLM-side, given the changes are already in HF. Let's just make sure the batched Whisper forward is accounted for by the initial profiler run to avoid OOM.
Thanks for the comments. This PR is now ready for review. For reference, I can confirm that the evals have improved.
Before (8B model):
After (8B model):
Rows 0, 2, and 3 are there as a sanity check. A difference of less than 1 point is usually not significant (especially on model-as-judge evals). A similar trend is seen on the 70B model, which reaches 90.30 on
Great work here. Surprised to see how big of a leap a simple chunking strategy can achieve!
Thanks!
Just to clarify, the difference in metrics is not because of a "better" chunking strategy. It's just that, before this, we used to throw away any audio past 30 seconds. Any chunking strategy is probably better than no strategy 😅
Can you update
Thanks for the update! Let's see if the tests pass.
Hello @farzadab! I left a few comments regarding the CI failure. Thanks for your contribution!
curr = curr.get(key, {})
curr.pop(keys[-1], None)

return _convert_tensors_to_list(result)
Without doing _convert_tensors_to_list, the equals comparison kept failing even on the same inputs:
E Differing items:
E {'mm_kwargs': {'audio_num_chunks': tensor([1, 1, 1]), 'audio_lens': tensor([4, 5, 6]), 'audio_token_len': tensor([1, 1, 1], dtype=torch.int32)}} !=
E {'mm_kwargs': {'audio_num_chunks': tensor([1, 1, 1]), 'audio_lens': tensor([4, 5, 6]), 'audio_token_len': tensor([1, 1, 1], dtype=torch.int32)}}
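For context, a rough sketch of what such a helper could look like (the actual _convert_tensors_to_list in the test may differ); it just recursively turns tensors into plain Python lists so that dictionary equality in the test behaves as expected:

```python
import torch

def _convert_tensors_to_list(obj):
    # Recursively replace torch.Tensor values with nested Python lists so
    # that plain `==` comparisons on the surrounding dicts work in tests.
    if isinstance(obj, torch.Tensor):
        return obj.tolist()
    if isinstance(obj, dict):
        return {k: _convert_tensors_to_list(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_convert_tensors_to_list(v) for v in obj)
    return obj
```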
MultiModalKwargs needs to be handled separately because it has a custom equality check.
Oh gawd, that's why. I kept banging my head against the wall for so long on this issue 😢
The issue here is that _items_by_modality still keeps a version of audio_features.
Currently the Ultravox model input is capped to 30 seconds and any extra audio is truncated (AFAIK). Also, each sample is fed to Whisper individually (without being batched).
This PR allows using longer audio by chunking it first, running the Whisper encoder in batch mode, and then concatenating the results.
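As a rough sketch of that flow (assuming a generic Whisper-style encoder; the names, padding, and shapes below are illustrative and not the actual vLLM implementation):

```python
import torch
import torch.nn.functional as F

def encode_long_audio(chunks: list[torch.Tensor],
                      encoder: torch.nn.Module) -> torch.Tensor:
    """Encode all chunks of one audio in a single batched forward pass."""
    # Pad feature chunks to a common length so they can be stacked into one batch.
    max_len = max(c.shape[-1] for c in chunks)
    batch = torch.stack([F.pad(c, (0, max_len - c.shape[-1])) for c in chunks])
    # One batched encoder call instead of one call per chunk.
    features = encoder(batch)          # (num_chunks, frames, hidden_size)
    # Concatenate the per-chunk features back into a single sequence.
    return features.flatten(0, 1)      # (num_chunks * frames, hidden_size)
```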
TODO: