
onnxruntime_genai.onnxruntime_genai.OrtException when running Phi-3-Vision ONNX model #849

Open
JehanJaye opened this issue Aug 28, 2024 · 5 comments

Comments


JehanJaye commented Aug 28, 2024

Describe the bug
python3 phi3v.py -m cuda-int4-rtn-block-32 fails with the following error:

Loading model...
Traceback (most recent call last):
  File "phi3v.py", line 66, in <module>
    run(args)
  File "phi3v.py", line 16, in run
    model = og.Model(args.model_path)
onnxruntime_genai.onnxruntime_genai.OrtException: Load model from cuda-int4-rtn-block-32/phi-3-v-128k-instruct-text.onnx failed: This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/MatMul/output_0": tensor(float16), "", "", "past_key_values.0.key": tensor(float16), "past_key_values.0.value": tensor(float16), "/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32), "/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32), "cos_cache": tensor(float16), "sin_cache": tensor(float16)) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16), "present.0.key": tensor(float16), "present.0.value": tensor(float16)), Error Node (/model/layers.0/attn/GroupQueryAttention) has input size 9 not in range [min=7, max=7].
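
The failure happens at the very first model load. A minimal sketch of the failing call, taken from the traceback above (the model path is the directory downloaded in the steps below), is:

    import onnxruntime_genai as og

    # Load the quantized Phi-3-vision model directory; this is the call that
    # raises the OrtException when the underlying ONNX Runtime is 1.16.3.
    model = og.Model("cuda-int4-rtn-block-32")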

To Reproduce

Running the quantized Phi-3-vision model in ONNX format on the Jetson Orin:

  1. Get ONNX Runtime 1.16.3 for JetPack 5.1.1 with CUDA 11.4 (prebuilt tarball):
    wget http://jetson.webredirect.org:8000/jp5/cu114/onnxruntime-gpu-1.16.3.tar.gz
    mkdir ort
    tar -xvf onnxruntime-gpu-1.16.3.tar.gz -C ort
    mv ort/include/onnxruntime/onnxruntime_c_api.h ort/include/
    rm -rf ort/include/onnxruntime/
  2. Compile the onnxruntime-genai repository: switch to commit 940bc
  3. Build:
    python3 build.py --use_cuda --cuda_home /usr/local/cuda-11.4 --skip_tests --skip_csharp --parallel
  4. Install the generated wheel:
    pip3 install *.whl
  5. Install the Hugging Face CLI:
    pip3 install huggingface-hub[cli]
  6. Download the Phi-3-vision ONNX model:
    huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir .
  7. Download the example script:
    wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
  8. Run inference:
    python3 phi3v.py -m cuda-int4-rtn-block-32

JETSON-ORIN

  • NVIDIA Jetson AGX Orin 64GB
  • Ubuntu 20.04 Focal Fossa
  • CUDA: 11.4.315
  • cuDNN: 8.6.0.166
  • TensorRT: 8.5.2.2
  • VPI: 2.2.7
  • Vulkan: 1.3.204
  • Jetpack 5.1.1

Additional context
onnxruntime-genai built from source without any CUDA-related problems. However, when loading the model I get the error above. I would appreciate any assistance in diagnosing and fixing this problem.

@kunal-vaishnavi
Contributor

Can you upgrade your version of ONNX Runtime? The GroupQueryAttention op was updated to support more inputs and ONNX Runtime v1.16.3 does not have that change.
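
A quick way to confirm which ONNX Runtime version (and execution providers) your environment is actually using, as a small sketch assuming the onnxruntime-gpu package is installed:

    import onnxruntime as ort

    # The "input size 9 not in range [min=7, max=7]" error comes from the extra
    # cos_cache/sin_cache inputs in the model's GroupQueryAttention nodes, which
    # only newer ONNX Runtime releases accept.
    print(ort.__version__)                # 1.16.3 is too old for this model
    print(ort.get_available_providers())  # should include 'CUDAExecutionProvider'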

@JehanJaye
Author

Thanks! Since I am using JetPack 5.1.1 with CUDA 11.4, I couldn't find a newer pre-compiled onnxruntime-gpu tarball that supports gpu-linux-aarch64.

Is there any alternative other than compiling a supported onnxruntime-gpu tarball from source? I use the extracted tarball as ort_home when building onnxruntime-genai.

That said, I did find a newer GPU aarch64 build of onnxruntime published as a .whl. When building onnxruntime-genai from source (build.py), is there any option other than pointing ort_home at an onnxruntime-gpu directory?

@kunal-vaishnavi
Contributor

ONNX Runtime GenAI requires the shared libraries and the C API header file from ONNX Runtime. To get the shared libraries, you can install the .whl and copy the libraries from onnxruntime/capi/ that match the libonnxruntime*.so* pattern. To get the header file, you can download include/onnxruntime/core/session/onnxruntime_c_api.h from an official ONNX Runtime release branch. Official release branches are named rel-{ORT_VERSION}.

For example:

1) Download and install the ONNX Runtime .whl file

For example, wheels for Jetson appear to be published here.

$ wget https://nvidia.box.com/shared/static/qnm7xtdemybuyog3yzz4qio3ly8fvi6r.whl -O onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl
$ pip install onnxruntime_gpu-1.18.0-cp39-cp39-linux_aarch64.whl

2) Clone ONNX Runtime GenAI and prepare folders

$ git clone https://github.com/microsoft/onnxruntime-genai
$ cd onnxruntime-genai
$ mkdir -p ort/include/
$ mkdir -p ort/lib/

3) Find where the .whl is installed

This example is using onnxruntime-gpu as the package name to search. Please change this to the package name you installed.

$ pip show onnxruntime-gpu
Name: onnxruntime-gpu
Version: 1.18.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Home-page: https://onnxruntime.ai
Author: Microsoft Corporation
Author-email: [email protected]
License: MIT License
Location: /path/to/.local/lib/python3.9/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
Required-by:

4) Copy shared libraries to ort/lib/

This is using /path/to/.local/lib/python3.9/site-packages as the example location. Please change this to the location you see in the previous step.

$ cp /path/to/.local/lib/python3.9/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/

5) Download C API header file to ort/include/

This is using rel-1.18.0 as the example since the pip package example version is 1.18.0. Please replace 1.18.0 with the version you want to use.

$ cd ort/include/
$ wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.18.0/include/onnxruntime/core/session/onnxruntime_c_api.h

6) Build ONNX Runtime GenAI from source

Please modify the python build.py command as needed for your build. For more details, please visit here.

$ cd ../../
$ python build.py
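
Once the build finishes, install the generated wheel and re-run the example, as in steps 4 and 8 of the original report (the wheel location depends on your build output):

$ pip3 install *.whl
$ python3 phi3v.py -m cuda-int4-rtn-block-32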

@JehanJaye
Author

Thanks for the very detailed response. Will try this out and update here.

@kunal-vaishnavi Any estimated release date for the Phi-3.5-vision ONNX models?

@kunal-vaishnavi
Contributor

@kunal-vaishnavi Any estimated release date for the Phi-3.5-vision ONNX models?

The work is in progress and we are working to complete it soon, but there's no estimated release date because the Phi-3.5 vision ONNX models will need to undergo Microsoft's Responsible AI evaluations before they can be published officially. If the evaluations take a while, I can publish a tutorial once all of the work is merged into ONNX Runtime GenAI so that you can generate your own ONNX models locally and run them.
