Hello there,
We are still exploring the most robust quantization option for this model. Out of personal interest, I would like to know the specific error you ran into. Could you copy/paste your error here?
Thanks for the reply.
The error occurs when loading the int8 ONNX model: "onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /data/int8-onnx/decoder_model_quantized.onnx failed:Protobuf parsing failed."
In theory, int8 and int4 should work for Llama 2; you can find Q4, Q8 and even Q2 quantizations on the HF Model Hub, though not in ONNX format (the Q2 ones are GGUF/GGML).
I used this script to build the int8 model, but it failed: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama