Add a document for quantization on NNPA #3045
Conversation
Signed-off-by: Tung D. Le <[email protected]>
Signed-off-by: Tung D. Le <[email protected]>
@jenkins-droid test this please
LGTM, just have a few questions.
docs/Quantization-NNPA.md (Outdated)
- supports per-tensor dynamic quantization, and
- quantizes data tensors from float32 to 8-bit signed integer because NNPA supports 8-bit signed integers. If a data tensor in the input model is already in 8-bit signed integer, the compiler will not quantize it again.

The compiler provides two compile flags for quantizing a model at compile time:
Since these flags are targeting only dynamic quantization, shall we specify dynamically quantizing here?
maybe we could use --nnpa-dquant for dynamic quantization, and use it systematically for both options.
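For context on the quoted text above, here is a minimal NumPy sketch of what per-tensor dynamic quantization of float32 data to 8-bit signed integers computes. It is a generic illustration only, not onnx-mlir's implementation, and the function name is made up for this example.

```python
import numpy as np

def dynamic_quantize_per_tensor(x):
    """Illustrative per-tensor dynamic quantization of float32 data to int8.

    'Dynamic' means scale and zero point are derived from the values seen at
    runtime; 'per-tensor' means a single (scale, zero_point) pair covers the
    whole tensor.
    """
    qmin, qmax = -128, 127                          # int8 range
    rmin = min(float(x.min()), 0.0)                 # keep 0.0 exactly representable
    rmax = max(float(x.max()), 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

x = np.random.randn(4, 8).astype(np.float32)
q, scale, zp = dynamic_quantize_per_tensor(x)
x_hat = (q.astype(np.float32) - zp) * scale         # dequantize to check the error
print("max abs error:", np.abs(x - x_hat).max())
```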
docs/Quantization-NNPA.md (Outdated)

# Performance notes

It is often the case that symmetric quantization leads to better inference performance but poorer accuracy than asymetric quantization.
Typo: asymmetric
LGTM, will leave the final word to @Sunny-Anand.
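To illustrate the trade-off the quoted performance note describes, the sketch below contrasts how symmetric and asymmetric per-tensor quantization parameters are typically derived. This is generic quantization math, not code from the document under review; the helper names are made up.

```python
import numpy as np

def symmetric_params(x, qmax=127):
    # Symmetric: zero point fixed at 0, scale taken from the largest magnitude.
    # Cheaper at inference time (no zero-point correction terms in the matmul),
    # but it wastes range when the data is skewed, which can hurt accuracy.
    scale = max(abs(float(x.min())), abs(float(x.max()))) / qmax
    return scale, 0

def asymmetric_params(x, qmin=-128, qmax=127):
    # Asymmetric: the zero point shifts the quantized range to fit the actual
    # min/max, usually giving better accuracy at a small extra runtime cost.
    rmin = min(float(x.min()), 0.0)
    rmax = max(float(x.max()), 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    return scale, int(round(qmin - rmin / scale))

x = np.abs(np.random.randn(1000)).astype(np.float32)  # skewed, all-positive data
print("symmetric :", symmetric_params(x))
print("asymmetric:", asymmetric_params(x))
```

With all-positive data like this, the symmetric scheme effectively uses only half of the int8 range, which is one way the accuracy gap mentioned in the note can show up.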
Signed-off-by: Tung D. Le <[email protected]>
@Sunny-Anand @AlexandreEichenberger could you take another look at my new changes based on your comments? I'd like to make sure everything is clear before merging. Thanks!
LGTM. Thanks for the changes.
docs/Quantization-NNPA.md (Outdated)

# Overview

NNPA in IBM Telum II supports 8-bit signed-integer quantized matrix multiplications. This document shows how to compile an ONNX model for quantization on NNPA.
There seems to be no consensus on what quantization means. It always means going from a higher precision to a lower precision, but I don't think it necessarily implies integer representation. See here for example
https://huggingface.co/docs/optimum/en/concept_guides/quantization
So maybe we could be a bit clearer here.
=====
NNPA in IBM Telum II supports 8-bit signed-integer quantized matrix multiplications. This document shows how to compile an ONNX model for 8-bit quantization on NNPA. When not following these steps, models will still be accelerated when targeting Telum systems using a mixture of 16-bit floating-point numbers for computations mapped to the Telum's Integrated AI accelerator and 32-bit floating-point numbers for computations mapped to the Telum CPUs.
=====
I think that once this is out of the way, we may continue having the text below without changes. Or one could use "8-bit integer quantization" once at the beginning of sections.
Thanks! I updated it with your content.
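As a rough, generic picture of what the 8-bit signed-integer quantized matrix multiplication mentioned in the quoted overview computes (a sketch only, not the NNPA or onnx-mlir implementation), one can quantize both operands to int8, multiply with int32 accumulation, and rescale the result:

```python
import numpy as np

def quantize_sym(x, qmax=127):
    # Symmetric per-tensor quantization to int8 (zero point 0), for brevity.
    scale = max(abs(float(x.min())), abs(float(x.max()))) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

a = np.random.randn(4, 16).astype(np.float32)
b = np.random.randn(16, 8).astype(np.float32)

qa, sa = quantize_sym(a)
qb, sb = quantize_sym(b)

# Integer matmul with int32 accumulation, then rescale back to float32.
acc = qa.astype(np.int32) @ qb.astype(np.int32)
c_quant = acc.astype(np.float32) * (sa * sb)

c_ref = a @ b
print("max abs error:", np.abs(c_quant - c_ref).max())  # small quantization error
```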
Signed-off-by: Tung D. Le <[email protected]>
Jenkins Linux ppc64le Build #15164 [push] Add a document for quant... started at 23:04
Jenkins Linux s390x Build #16138 [push] Add a document for quant... started at 22:47
Jenkins Linux amd64 Build #16136 [push] Add a document for quant... started at 21:47
Jenkins Linux amd64 Build #16136 [push] Add a document for quant... passed after 1 hr 25 min
Jenkins Linux s390x Build #16138 [push] Add a document for quant... passed after 1 hr 27 min
Jenkins Linux ppc64le Build #15164 [push] Add a document for quant... passed after 2 hr 26 min
This PR adds a document file, docs/Quantization-NNPA.md, to explain how to use quantization on NNPA.