
onnxruntime_genai.onnxruntime_genai.OrtException when running Phi-3-Vision ONNX model #849

Open
JehanJaye opened this issue Aug 28, 2024 · 5 comments

Comments


JehanJaye commented Aug 28, 2024

Describe the bug
python3 phi3v.py -m cuda-int4-rtn-block-32 fails with the following error:

Loading model...
Traceback (most recent call last):
  File "phi3v.py", line 66, in <module>
    run(args)
  File "phi3v.py", line 16, in run
    model = og.Model(args.model_path)
onnxruntime_genai.onnxruntime_genai.OrtException: Load model from cuda-int4-rtn-block-32/phi-3-v-128k-instruct-text.onnx failed: This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/MatMul/output_0": tensor(float16), "", "", "past_key_values.0.key": tensor(float16), "past_key_values.0.value": tensor(float16), "/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32), "/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32), "cos_cache": tensor(float16), "sin_cache": tensor(float16)) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16), "present.0.key": tensor(float16), "present.0.value": tensor(float16)), Error Node (/model/layers.0/attn/GroupQueryAttention) has input size 9 not in range [min=7, max=7].
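
The failure happens at the very first model load. A minimal sketch of the failing call, taken from the traceback above (the model path is the directory downloaded in the steps below), is:

    import onnxruntime_genai as og

    # Load the quantized Phi-3-vision model directory; this is the call that
    # raises the OrtException when the underlying ONNX Runtime is 1.16.3.
    model = og.Model("cuda-int4-rtn-block-32")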

To Reproduce

Running the quantized Phi-3-vision model in ONNX format on the Jetson Orin:

  1. Get ONNX Runtime 1.16.3 for JetPack 5.1.1 with CUDA 11.4 (prebuilt tarball):
    wget http://jetson.webredirect.org:8000/jp5/cu114/onnxruntime-gpu-1.16.3.tar.gz
    mkdir ort
    tar -xvf onnxruntime-gpu-1.16.3.tar.gz -C ort
    mv ort/include/onnxruntime/onnxruntime_c_api.h ort/include/
    rm -rf ort/include/onnxruntime/
  2. Compile the onnxruntime-genai repository: switch to commit 940bc
  3. Build:
    python3 build.py --use_cuda --cuda_home /usr/local/cuda-11.4 --skip_tests --skip_csharp --parallel
  4. Install the generated wheel:
    pip3 install *.whl
  5. Install the Hugging Face CLI:
    pip3 install huggingface-hub[cli]
  6. Download the Phi-3-vision ONNX model:
    huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir .
  7. Download the example script:
    wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
  8. Run inference:
    python3 phi3v.py -m cuda-int4-rtn-block-32

JETSON-ORIN

  • NVIDIA Jetson AGX Orin 64GB
  • Ubuntu 20.04 Focal Fossa
  • CUDA: 11.4.315
  • cuDNN: 8.6.0.166
  • TensorRT: 8.5.2.2
  • VPI: 2.2.7
  • Vulkan: 1.3.204
  • Jetpack 5.1.1

Additional context
onnxruntime-genai built from source without any CUDA-related problems. However, when loading the model I get the error above. I would appreciate any assistance in diagnosing and fixing this problem.

@kunal-vaishnavi
Contributor

Can you upgrade your version of ONNX Runtime? The GroupQueryAttention op was updated to support more inputs and ONNX Runtime v1.16.3 does not have that change.
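
A quick way to confirm which ONNX Runtime version (and execution providers) your environment is actually using, as a small sketch assuming the onnxruntime-gpu package is installed:

    import onnxruntime as ort

    # The "input size 9 not in range [min=7, max=7]" error comes from the extra
    # cos_cache/sin_cache inputs in the model's GroupQueryAttention nodes, which
    # only newer ONNX Runtime releases accept.
    print(ort.__version__)                # 1.16.3 is too old for this model
    print(ort.get_available_providers())  # should include 'CUDAExecutionProvider'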

@JehanJaye
Author

Thanks! Since I am using JetPack 5.1.1 with CUDA 11.4, I couldn't find a newer pre-compiled onnxruntime-gpu tarball that supports gpu-linux-aarch64.

Is there any alternative other than compiling a supported onnxruntime-gpu tarball from source? I use the extracted tarball as ort_home when building onnxruntime-genai.

That said, I did find a newer GPU aarch64 build of onnxruntime published as a .whl. When building onnxruntime-genai from source (build.py), is there any option other than pointing ort_home at an onnxruntime-gpu directory?

@kunal-vaishnavi
Contributor

ONNX Runtime GenAI requires the shared libraries and the C API header file from ONNX Runtime. To get the shared libraries, you can install the .whl and copy the libraries from onnxruntime/capi/ that match the libonnxruntime*.so* pattern. To get the header file, you can download include/onnxruntime/core/session/onnxruntime_c_api.h from an official ONNX Runtime release branch. Official release branches are named rel-{ORT_VERSION}.

For example:

1) Download and install the ONNX Runtime .whl file

For example, wheels for Jetson appear to be published here.

$ wget https://nvidia.box.com/shared/static/qnm7xtdemybuyog3yzz4qio3ly8fvi6r.whl -O onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl
$ pip install onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl

2) Clone ONNX Runtime GenAI and prepare folders

$ git clone https://github.com/microsoft/onnxruntime-genai
$ cd onnxruntime-genai
$ mkdir -p ort/include/
$ mkdir -p ort/lib/

3) Find where the .whl is installed

This example is using onnxruntime-gpu as the package name to search. Please change this to the package name you installed.

$ pip show onnxruntime-gpu
Name: onnxruntime-gpu
Version: 1.18.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Home-page: https://onnxruntime.ai
Author: Microsoft Corporation
Author-email: [email protected]
License: MIT License
Location: /path/to/.local/lib/python3.9/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
Required-by:

4) Copy shared libraries to ort/lib/

This is using /path/to/.local/lib/python3.9/site-packages as the example location. Please change this to the location you see in the previous step.

$ cp /path/to/.local/lib/python3.9/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/

5) Download C API header file to ort/include/

This is using rel-1.18.0 as the example since the pip package example version is 1.18.0. Please replace 1.18.0 with the version you want to use.

$ cd ort/include/
$ wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.18.0/include/onnxruntime/core/session/onnxruntime_c_api.h

6) Build ONNX Runtime GenAI from source

Please modify the python build.py command as needed for your build. For more details, please visit here.

$ cd ../../
$ python build.py
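
Once the build finishes, install the generated wheel and re-run the example, as in steps 4 and 8 of the original report (the wheel location depends on your build output):

$ pip3 install *.whl
$ python3 phi3v.py -m cuda-int4-rtn-block-32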

@JehanJaye
Author

Thanks for the very detailed response. Will try this out and update here.

@kunal-vaishnavi Any estimated release date for the Phi-3.5-vision ONNX models?

@kunal-vaishnavi
Contributor

@kunal-vaishnavi Any estimated release date for the Phi-3.5-vision ONNX models?

The work is in progress and we are working to complete it soon, but there's no estimated release date because the Phi-3.5 vision ONNX models will need to undergo Microsoft's Responsible AI evaluations before they can be published officially. If the evaluations take a while, I can publish a tutorial once all of the work is merged into ONNX Runtime GenAI so that you can generate your own ONNX models locally and run them.
