Comparison of multiple images with the Phi-3 Vision model #563
-
onnxruntime-genai does not currently support loading multiple images and running the Phi-3 vision model over them in a single prompt. Depending on our internal prioritization, we may add support for the multiple-images-per-prompt scenario in the near future. I'll convert this issue into a discussion now.
-
You can try feeding the images in separately; by carrying the previous messages forward as context, you can get output that covers multiple images.
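Something along these lines might work as a rough sketch. It reuses the onnxruntime-genai calls from the code in the question below; `run_prompt` is a hypothetical helper, the model path is a placeholder, and passing `images=None` for the text-only follow-up turn is an assumption:

```python
import onnxruntime_genai as og

# Placeholder model path; point this at your local Phi-3 vision ONNX model.
model = og.Model(r'phi-3-vision-128k\cpu-int4-rtn-block-32-acc-level-4')
processor = model.create_multimodal_processor()
tokenizer_stream = og.Tokenizer(model).create_stream()

def run_prompt(prompt, image=None):
    # Hypothetical helper: one generation turn, with or without an image.
    # images=None for a text-only turn is an assumption about the API.
    inputs = processor(prompt, images=image)
    params = og.GeneratorParams(model)
    params.set_inputs(inputs)
    params.set_search_options(max_length=3072)
    generator = og.Generator(model, params)
    answer = ""
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        answer += tokenizer_stream.decode(generator.get_next_tokens()[0])
    del generator
    return answer

# Describe each image in its own turn...
describe = "<|user|>\n<|image_1|>\nDescribe this image in detail.<|end|>\n<|assistant|>\n"
desc1 = run_prompt(describe, og.Images.open('Pic1.jpg'))
desc2 = run_prompt(describe, og.Images.open('Pic2.jpg'))

# ...then carry both answers forward into a text-only comparison turn.
comparison = run_prompt(
    f"<|user|>\nImage 1: {desc1}\nImage 2: {desc2}\n"
    "Explain the similarities and differences between the two images.<|end|>\n<|assistant|>\n")
print(comparison)
```

The model never sees both pictures at once, so fine-grained visual comparison will be weaker than true multi-image support, but it stays within the current single-image API.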
-
@baijumeswani since Phi-3.5 vision is designed to handle multiple images, unlike Phi-3 vision, will this be prioritized now?
-
What about just combining the two images into one image? Would that work?
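If you want to try that, here is a quick sketch with Pillow (file names are placeholders) that pastes the two pictures side by side so they can be sent as a single `<|image_1|>` input:

```python
from PIL import Image

img1 = Image.open('Pic1.jpg')
img2 = Image.open('Pic2.jpg')

# Paste the two images side by side on a shared white canvas.
canvas = Image.new('RGB',
                   (img1.width + img2.width, max(img1.height, img2.height)),
                   'white')
canvas.paste(img1, (0, 0))
canvas.paste(img2, (img1.width, 0))
canvas.save('combined.jpg')
```

Whether the model then treats the two halves as distinct images is another question, but it's cheap to test.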
-
Support for multiple images will be added in this PR for Phi-3.5 vision. More work is needed to support Phi-3.5 vision, however, and that work is in progress.
-
Does the Phi-3 Vision ONNX model support the comparison of multiple images? In the Phi-3 Cookbook they show an example with two images. Based on that example, I would expect the code to look something like:
```python
import onnxruntime_genai as og
import os

def processImage(image_path):
    image = None
    if len(image_path) == 0:
        print("No image provided")
    else:
        print("Loading image...")
        if not os.path.exists(image_path):
            raise FileNotFoundError(f"Image file not found: {image_path}")
        image = og.Images.open(image_path)
    return image

modelName = r'phi-3-vision-128k\cpu-int4-rtn-block-32-acc-level-4'
image1_path = 'Pic1.jpg'
image2_path = 'Pic2.jpg'
output_tokens = 3072
text = 'Please explain the similarities and differences between these two images'

model = og.Model(modelName)
tokenizer = og.Tokenizer(model)
processor = model.create_multimodal_processor()
tokenizer_stream = tokenizer.create_stream()

prompt = "<|user|>\n"
image1 = processImage(image1_path)
prompt += "<|image_1|>\n"
image2 = processImage(image2_path)
prompt += "<|image_2|>\n"
prompt += f"{text}<|end|>\n<|assistant|>\n"

inputs = processor(prompt, images=[image1, image2])

params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=output_tokens)

generator = og.Generator(model, params)
output_str = ""
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    output_str += tokenizer_stream.decode(new_token)

print(output_str)
del generator
```
This code throws the error "RuntimeError: Unable to cast Python instance to C++ type (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)". I can't find documentation for the MultiModalProcessor class to see whether there is a different way to construct the inputs. Any help or suggestions would be appreciated. Thanks!