[Model] New model support for Phi-4-multimodal-instruct #14119
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Please update the tests as described here: https://docs.vllm.ai/en/latest/contributing/model/tests.html
examples/offline_inference_phi3o.py
@@ -0,0 +1,531 @@
# Implements a simple offline inference script for the Phi 3.5 Speech model.
Will delete this test file once the review is done.
@@ -38,3 +38,4 @@ depyf==0.18.0 # required for profiling and debugging with compilation config
cloudpickle # allows pickling lambda functions in model_executor/models/registry.py
watchfiles # required for http server to monitor the updates of TLS files
python-json-logger # Used by logging as per examples/other/logging_configuration.md
scipy # Required for phi-4-multimodal-instruct
We should try to remove the scipy dependency if possible. It is okay if we do it in a follow-up PR. It seems to be used only for a single function, scipy.signal.resample_poly, in phi4mm.py:
# Resample to 16000 or 8000 if needed
if fs > 16000:
    wav = scipy.signal.resample_poly(wav, 1, fs // 16000)
    fs = 16000
elif 8000 < fs < 16000:
    wav = scipy.signal.resample_poly(wav, 1, fs // 8000)
    fs = 8000
elif fs < 8000:
    raise RuntimeError(f"Unsupported sample rate {fs}")

if fs == 8000:
    if self._eightk_method == "resample":
        # Input audio is 8 kHz. Convert to 16 kHz before feature
        # extraction
        wav = scipy.signal.resample_poly(wav, 2, 1)
        fs = 16000
    # Do nothing here for fillzero method
elif fs != 16000:
    # Input audio is not a supported sample rate.
    raise RuntimeError(
        f"Input data using an unsupported sample rate: {fs}"
    )
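For reference, a minimal sketch of how that call could be dropped, assuming torchaudio is acceptable as a replacement dependency (this collapses the 8 kHz handling for brevity and is not what this PR implements):

import torch
import torchaudio.functional as AF

def resample_to_16k(wav, fs: int):
    """Resample a 1-D waveform (numpy array or torch tensor) to 16 kHz."""
    if fs < 8000:
        raise RuntimeError(f"Unsupported sample rate {fs}")
    wav_t = torch.as_tensor(wav, dtype=torch.float32)
    if fs != 16000:
        # torchaudio's resampler replaces scipy.signal.resample_poly here;
        # the exact filtering behavior may differ slightly.
        wav_t = AF.resample(wav_t, orig_freq=fs, new_freq=16000)
    return wav_t.numpy(), 16000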
Sure, will do that in the follow-up PR.
While we're here, can you update the supported models page to also include the text-only version of Phi-4?
Hi @DarkLight1337, I have addressed all the new comments. Could you please approve this PR if it looks good to you? I would like to ship this as soon as possible since people are waiting to try out Phi-4-multimodal-instruct on vLLM.
@congcongchen123 I left two final comments to help us get this PR shipped. Thanks for the effort!
BTW @ywang96, should I merge the latest main into my branch one more time to make it ready to ship?
Yep, that would be great!
Hi @ywang96, the buildkite/ci/pr/basic-models-test test failed because of the following check in vllm/model_executor/models/vision_siglip_navit.py:
Looks like the testing container does not have the flash_attn package installed, and flash_attn is required by vllm/model_executor/models/vision_siglip_navit.py. What is the suggested way to fix that? Thanks. To answer myself: as a workaround, maybe I could log a warning instead of throwing a runtime error. We will come back to this in follow-up PRs.
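A minimal sketch of that workaround, assuming the original check simply raises when flash_attn is missing (hypothetical code, not the exact lines in vision_siglip_navit.py):

import importlib.util
import logging

logger = logging.getLogger(__name__)

if importlib.util.find_spec("flash_attn") is None:
    # Hypothetical workaround: log instead of raising, so environments
    # without flash_attn (e.g. the CI test container) can still import the
    # module; a non-flash attention path would then be used instead.
    logger.warning("flash_attn is not installed; falling back to the "
                   "non-flash attention implementation.")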
was added to the wrong place, and I just corrected it.
The kernel test failure looks unrelated to this PR. Is there a way to bypass it and merge this PR?
It is a known failure on main; I have force-merged this PR.
Got it! Thanks @DarkLight1337!
New model for https://huggingface.co/microsoft/Phi-4-multimodal-instruct/tree/main
Co-authors: Jacob Platin and Vadim Mazalov for the speech encoder, and Yen-Chun Chen for the vision encoder.
FIX #13936
Add instructions on how to load the model with its LoRA checkpoints (a rough sketch is shown below).
Run the OpenAI-compatible server:
Note that --max-seq-len-to-capture=131072 is very important; otherwise it defaults to 8k, and vLLM will silently disable CUDA graphs for sequences longer than 8k. With CUDA graphs disabled unexpectedly, we have seen significant time-between-tokens degradation (twice as slow).
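As a rough sketch of loading the model together with its LoRA adapters (not the exact command from this PR; adapter paths are placeholders, max_lora_rank is an assumption, and the engine arguments mirror the CLI flags mentioned above), the same setup can be expressed through the offline Python API:

from PIL import Image
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# max_lora_rank below is assumed; check the adapter config in the checkpoint.
llm = LLM(
    model="microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    max_model_len=131072,
    max_seq_len_to_capture=131072,  # keep CUDA graphs enabled for long sequences
    enable_lora=True,
    max_lora_rank=320,
    limit_mm_per_prompt={"image": 1},
)

# The vision LoRA ships inside the HF checkpoint; this path is a placeholder
# pointing at the local snapshot directory.
vision_lora = LoRARequest("vision", 1, "/path/to/Phi-4-multimodal-instruct/vision-lora")

image = Image.open("/path/to/example.jpg")  # placeholder image
# Prompt format as described in the model card.
prompt = "<|user|><|image_1|>Describe this image.<|end|><|assistant|>"
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
    lora_request=vision_lora,
)
print(outputs[0].outputs[0].text)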