[Model] New model support for Phi-4-multimodal-instruct #14119
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Please update the tests as described here: https://docs.vllm.ai/en/latest/contributing/model/tests.html
examples/offline_inference_phi3o.py
@@ -0,0 +1,531 @@
# Implements a simple offline inference script for the Phi 3.5 Speech model.
Will delete this test file once the review is done.
@@ -38,3 +38,4 @@ depyf==0.18.0 # required for profiling and debugging with compilation config
cloudpickle # allows pickling lambda functions in model_executor/models/registry.py
watchfiles # required for http server to monitor the updates of TLS files
python-json-logger # Used by logging as per examples/other/logging_configuration.md
scipy # Required for phi-4-multimodal-instruct
We should try to remove the scipy dependency if possible. It is okay if we do it in a follow-up PR. It seems to be used only for a single function, scipy.signal.resample_poly, in phi4mm.py:
# Resample to 16000 or 8000 if needed
if fs > 16000:
    wav = scipy.signal.resample_poly(wav, 1, fs // 16000)
    fs = 16000
elif 8000 < fs < 16000:
    wav = scipy.signal.resample_poly(wav, 1, fs // 8000)
    fs = 8000
elif fs < 8000:
    raise RuntimeError(f"Unsupported sample rate {fs}")

if fs == 8000:
    if self._eightk_method == "resample":
        # Input audio is 8 kHz. Convert to 16 kHz before feature
        # extraction
        wav = scipy.signal.resample_poly(wav, 2, 1)
        fs = 16000
    # Do nothing here for fillzero method
elif fs != 16000:
    # Input audio is not a supported sample rate.
    raise RuntimeError(
        f"Input data using an unsupported sample rate: {fs}"
    )
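For reference, a minimal sketch of how that call could be dropped, assuming torchaudio is acceptable as a replacement dependency (this collapses the 8 kHz handling for brevity and is not what this PR implements):

import torch
import torchaudio.functional as AF

def resample_to_16k(wav, fs: int):
    """Resample a 1-D waveform (numpy array or torch tensor) to 16 kHz."""
    if fs < 8000:
        raise RuntimeError(f"Unsupported sample rate {fs}")
    wav_t = torch.as_tensor(wav, dtype=torch.float32)
    if fs != 16000:
        # torchaudio's resampler replaces scipy.signal.resample_poly here;
        # the exact filtering behavior may differ slightly.
        wav_t = AF.resample(wav_t, orig_freq=fs, new_freq=16000)
    return wav_t.numpy(), 16000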
Sure, will do that in the follow-up PR.
While we're here, can you update the supported models page to also include the text-only version of Phi-4?
Hi @DarkLight1337, I have addressed all the new comments. Could you please approve this PR if it looks good to you? I would like to ship this as soon as possible since people are waiting to try out Phi-4-multimodal-instruct on vLLM.
@congcongchen123 I left two final comments to help us get this PR shipped. Thanks for the effort!
BTW @ywang96, should I merge the latest main into my branch one more time to make it ready to ship?
Yep, that would be great!
Hi @ywang96, the buildkite/ci/pr/basic-models-test test failed because of the following check in vllm/model_executor/models/vision_siglip_navit.py:
Looks like the testing container does not have the flash_attn package installed, and flash_attn is required by vllm/model_executor/models/vision_siglip_navit.py. What is the suggested way to fix that? Thanks. To answer myself: as a workaround, maybe I could log a warning instead of throwing a runtime error. We will come back to this in follow-up PRs.
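A minimal sketch of that workaround, assuming the original check simply raises when flash_attn is missing (hypothetical code, not the exact lines in vision_siglip_navit.py):

import importlib.util
import logging

logger = logging.getLogger(__name__)

if importlib.util.find_spec("flash_attn") is None:
    # Hypothetical workaround: log instead of raising, so environments
    # without flash_attn (e.g. the CI test container) can still import the
    # module; a non-flash attention path would then be used instead.
    logger.warning("flash_attn is not installed; falling back to the "
                   "non-flash attention implementation.")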
was added to the wrong place, and I just corrected it.
The kernel test failure looks unrelated to this PR. Is there a way to bypass it and merge this PR?
It is a known failure on main; I have force-merged this PR.
Got it! Thanks @DarkLight1337!
New model for https://huggingface.co/microsoft/Phi-4-multimodal-instruct/tree/main
Co-authors: Jacob Platin and Vadim Mazalov for the speech encoder, and Yen-Chun Chen for the vision encoder.
FIX #13936
Add instructions on how to load the model with its LoRA checkpoints (a rough sketch is shown below).
Run the OpenAI-compatible server:
Note that --max-seq-len-to-capture=131072 is very important; otherwise it defaults to 8k, and vLLM will silently disable CUDA graphs for sequences longer than 8k. With CUDA graphs disabled unexpectedly, we have seen significant time-between-tokens degradation (twice as slow).
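As a rough sketch of loading the model together with its LoRA adapters (not the exact command from this PR; adapter paths are placeholders, max_lora_rank is an assumption, and the engine arguments mirror the CLI flags mentioned above), the same setup can be expressed through the offline Python API:

from PIL import Image
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# max_lora_rank below is assumed; check the adapter config in the checkpoint.
llm = LLM(
    model="microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    max_model_len=131072,
    max_seq_len_to_capture=131072,  # keep CUDA graphs enabled for long sequences
    enable_lora=True,
    max_lora_rank=320,
    limit_mm_per_prompt={"image": 1},
)

# The vision LoRA ships inside the HF checkpoint; this path is a placeholder
# pointing at the local snapshot directory.
vision_lora = LoRARequest("vision", 1, "/path/to/Phi-4-multimodal-instruct/vision-lora")

image = Image.open("/path/to/example.jpg")  # placeholder image
# Prompt format as described in the model card.
prompt = "<|user|><|image_1|>Describe this image.<|end|><|assistant|>"
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
    lora_request=vision_lora,
)
print(outputs[0].outputs[0].text)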