
Does llama2 support int8 quantization? #16

Open
shaonianyr opened this issue Aug 3, 2023 · 3 comments

Comments

@shaonianyr

I used this script to build an int8 model, but it failed: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama
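For reference, a minimal sketch of plain dynamic int8 quantization with `onnxruntime.quantization` (not the full script linked above; the file names are hypothetical, and `use_external_data_format` is assumed here because Llama2 exceeds protobuf's 2 GB limit):

```python
# Minimal sketch: dynamic int8 quantization of an exported ONNX decoder.
# Assumes a Llama2 decoder already exported to FP32 ONNX; paths are hypothetical.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="decoder_model.onnx",
    model_output="decoder_model_quantized.onnx",
    weight_type=QuantType.QInt8,   # int8 weights; activations quantized dynamically at runtime
    use_external_data_format=True, # required for models over the 2 GB protobuf limit
)
```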

@shaonianyr shaonianyr changed the title Does llama2 support int8 quantiztaion? Does llama2 support int8 quantization? Aug 3, 2023
@JoshuaElsdon
Contributor

Hello there,
We are still exploring the most robust quantization option for this model. Out of personal interest, I would like to know the specific error you ran into. Could you copy/paste it here?

@shaonianyr
Author

Thanks for the reply.
The error occurs when loading the int8 ONNX model: "onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /data/int8-onnx/decoder_model_quantized.onnx failed:Protobuf parsing failed."
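An `INVALID_PROTOBUF` error at load time usually means the `.onnx` file itself cannot be parsed, often because it was truncated, or because the model exceeds protobuf's 2 GB limit and was saved without (or separated from) its external-data files. A quick diagnostic sketch, assuming the path from the error above:

```python
# Diagnostic sketch: try parsing the model with onnx directly, outside onnxruntime.
# Assumption: the failure is a corrupt/truncated protobuf or a >2 GB model
# saved without external data.
import onnx

model_path = "/data/int8-onnx/decoder_model_quantized.onnx"
model = onnx.load(model_path)          # raises if the protobuf cannot be parsed
onnx.checker.check_model(model_path)   # pass the path so external data, if any, is resolved
print("protobuf parsed and model checked OK")
```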

@loretoparisi
Collaborator

In theory, int8 and int4 should work properly with Llama2; you can find Q4, Q8, and even Q2 quantizations on the HF Model Hub, though not in the ONNX format (GGUF / GGML for the Q2).
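For example, one of those pre-quantized GGUF files can be run directly with llama-cpp-python, bypassing ONNX entirely (the file name below is hypothetical; any Q4/Q8/Q2 GGUF from the Hub loads the same way):

```python
# Minimal sketch: running a pre-quantized GGUF Llama2 with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf")  # 4-bit quantized weights
out = llm("Q: What is int8 quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```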
