
Does llama2 support int8 quantization? #16

Open
shaonianyr opened this issue Aug 3, 2023 · 3 comments

Comments

@shaonianyr

I used this script to build an int8 model, but it failed: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama
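For reference, a minimal sketch of plain dynamic int8 quantization with `onnxruntime.quantization` (not the full script linked above; the file names are hypothetical, and `use_external_data_format` is assumed here because Llama2 exceeds protobuf's 2 GB limit):

```python
# Minimal sketch: dynamic int8 quantization of an exported ONNX decoder.
# Assumes a Llama2 decoder already exported to FP32 ONNX; paths are hypothetical.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="decoder_model.onnx",
    model_output="decoder_model_quantized.onnx",
    weight_type=QuantType.QInt8,   # int8 weights; activations quantized dynamically at runtime
    use_external_data_format=True, # required for models over the 2 GB protobuf limit
)
```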

@shaonianyr shaonianyr changed the title Does llama2 support int8 quantiztaion? Does llama2 support int8 quantization? Aug 3, 2023
@JoshuaElsdon
Contributor

Hello there,
We are still exploring the most robust quantization option for this model. Out of personal interest, I would like to know the specific error you ran into. Could you copy/paste it here?

@shaonianyr
Author

Thanks for the reply.
The error occurs when loading the int8 ONNX model: "onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /data/int8-onnx/decoder_model_quantized.onnx failed:Protobuf parsing failed."
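An `INVALID_PROTOBUF` error at load time usually means the `.onnx` file itself cannot be parsed, often because it was truncated, or because the model exceeds protobuf's 2 GB limit and was saved without (or separated from) its external-data files. A quick diagnostic sketch, assuming the path from the error above:

```python
# Diagnostic sketch: try parsing the model with onnx directly, outside onnxruntime.
# Assumption: the failure is a corrupt/truncated protobuf or a >2 GB model
# saved without external data.
import onnx

model_path = "/data/int8-onnx/decoder_model_quantized.onnx"
model = onnx.load(model_path)          # raises if the protobuf cannot be parsed
onnx.checker.check_model(model_path)   # pass the path so external data, if any, is resolved
print("protobuf parsed and model checked OK")
```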

@loretoparisi
Collaborator

In theory, int8 and int4 should work properly with Llama2; you can find Q4, Q8, and even Q2 quantizations on the HF Model Hub, though not in the ONNX format (GGUF / GGML for the Q2).
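For example, one of those pre-quantized GGUF files can be run directly with llama-cpp-python, bypassing ONNX entirely (the file name below is hypothetical; any Q4/Q8/Q2 GGUF from the Hub loads the same way):

```python
# Minimal sketch: running a pre-quantized GGUF Llama2 with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf")  # 4-bit quantized weights
out = llm("Q: What is int8 quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```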
