
Add a document for quantization on NNPA #3045

Merged: 5 commits merged into onnx:main on Jan 20, 2025

Conversation

@tungld (Collaborator) commented Jan 16, 2025

This PR adds a document, docs/Quantization-NNPA.md, explaining how to use quantization on NNPA.

Signed-off-by: Tung D. Le <[email protected]>
@tungld (Collaborator, Author) commented Jan 16, 2025

@jenkins-droid test this please

@Sunny-Anand (Collaborator) left a comment


LGTM, just have a few questions.

From docs/Quantization-NNPA.md:
- supports per-tensor dynamic quantization, and
- quantizes data tensors from float32 to 8-bit signed integer because NNPA supports 8-bit signed integers. If a data tensor in the input model is already in 8-bit signed integer, the compiler will not quantize it again. (A sketch of this scheme follows below.)
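
To make the excerpt above concrete, here is a minimal NumPy sketch of per-tensor dynamic quantization as the technique is commonly defined. This is an illustration only, not onnx-mlir's actual implementation; details such as range handling and rounding mode are assumptions.

```python
import numpy as np

def dynamic_quantize_int8(x: np.ndarray):
    """Per-tensor asymmetric dynamic quantization: float32 -> int8.

    Sketch only; the NNPA compiler path may differ in range handling,
    rounding mode, and saturation behavior.
    """
    qmin, qmax = -128, 127
    # Extend the range to include 0.0 so that zero is exactly representable.
    rmin = min(float(x.min()), 0.0)
    rmax = max(float(x.max()), 0.0)
    scale = (rmax - rmin) / (qmax - qmin) or 1.0  # guard all-zero tensors
    zero_point = int(round(qmin - rmin / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

x = np.random.randn(4, 8).astype(np.float32)
q, scale, zp = dynamic_quantize_int8(x)
x_hat = (q.astype(np.float32) - zp) * scale  # dequantized approximation of x
```

"Dynamic" here means the scale and zero point are computed from the tensor's actual values at run time, rather than calibrated offline.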

The compiler provides two compile flags for quantizing a model at compile time:
@Sunny-Anand (Collaborator) commented:

Since these flags are targeting only dynamic quantization, shall we specify dynamically quantizing here?

@AlexandreEichenberger (Collaborator) replied:

maybe we could use --nnpa-dquant for dynamic quantization, and use it systematically for both options.


From docs/Quantization-NNPA.md:

# Performance notes

It is often the case that symmetric quantization leads to better inference performance but poorer accuracy than asymetric quantization.

Review comment:
Typo: asymmetric
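
The performance note above can be made concrete: with symmetric quantization the zero point is fixed at 0, so the zero-point correction terms of the quantized matmul vanish (faster), while asymmetric quantization fits the exact value range and preserves accuracy for skewed data. A hedged NumPy sketch of how each scheme would pick its parameters (illustrative; not onnx-mlir's exact code):

```python
import numpy as np

def symmetric_params(x: np.ndarray, qmax: int = 127):
    # Zero point is fixed at 0, so dequantization is a single multiply and
    # the zero-point correction terms in an int8 matmul disappear. Skewed
    # data (e.g. all-positive activations) wastes half the int8 range.
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    return scale, 0

def asymmetric_params(x: np.ndarray, qmin: int = -128, qmax: int = 127):
    # Scale and zero point are fit to the exact [min, max] range, which
    # preserves resolution for skewed data but adds correction terms.
    rmin = min(float(x.min()), 0.0)
    rmax = max(float(x.max()), 0.0)
    scale = max((rmax - rmin) / (qmax - qmin), 1e-12)
    zero_point = int(round(qmin - rmin / scale))
    return scale, zero_point
```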

@AlexandreEichenberger (Collaborator) left a comment


LGTM, will leave the final word to @Sunny-Anand


Signed-off-by: Tung D. Le <[email protected]>
@tungld (Collaborator, Author) commented Jan 17, 2025

@Sunny-Anand @AlexandreEichenberger could you take another look at my new changes based on your comments? Would like to make sure everything is clear before merging. Thanks!

@Sunny-Anand (Collaborator) left a comment

LGTM. Thanks for the changes.


From docs/Quantization-NNPA.md:

# Overview

NNPA in IBM Telum II supports 8-bit signed-integer quantized matrix multiplications. This document shows how to compile an ONNX model for quantization on NNPA.
@AlexandreEichenberger (Collaborator) commented Jan 17, 2025:
There seems to be no consensus on what quantization means. It always means going from a higher precision to a lower precision, but I don't think it necessarily implies integer representation. See here for example

https://huggingface.co/docs/optimum/en/concept_guides/quantization

So maybe we could be a bit clearer here.

=====

NNPA in IBM Telum II supports 8-bit signed-integer quantized matrix multiplications. This document shows how to compile an ONNX model for 8-bit quantization on NNPA. When not following these steps, models will still be accelerated when targeting Telum systems using a mixture of 16-bit floating-point numbers for computations mapped to the Telum's Integrated AI accelerator and 32-bit floating-point numbers for computations mapped to the Telum CPUs.

=====

I think that once this is out of the way, we can keep the text below unchanged. Or one could use "8-bit integer quantization" once at the beginning of each section.
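
For readers new to the term discussed above, here is a minimal sketch of the arithmetic behind an 8-bit signed-integer quantized matrix multiplication, as the technique is generally defined (an illustration, not a description of NNPA's internal implementation):

```python
import numpy as np

def quantized_matmul(xq, x_scale, x_zp, wq, w_scale, w_zp):
    # Widen int8 operands to int32 so the dot products cannot overflow,
    # subtract the zero points, multiply-accumulate in integers, then
    # apply the combined scale to recover an approximation of x @ w.
    acc = (xq.astype(np.int32) - x_zp) @ (wq.astype(np.int32) - w_zp)
    return acc.astype(np.float32) * np.float32(x_scale * w_scale)

# Example: quantize symmetrically (zero points = 0) and compare with x @ w.
x = np.random.randn(2, 4).astype(np.float32)
w = np.random.randn(4, 3).astype(np.float32)
sx, sw = np.abs(x).max() / 127.0, np.abs(w).max() / 127.0
xq = np.clip(np.round(x / sx), -127, 127).astype(np.int8)
wq = np.clip(np.round(w / sw), -127, 127).astype(np.int8)
y = quantized_matmul(xq, sx, 0, wq, sw, 0)  # close to x @ w
```

With symmetric quantization both zero points are 0, so the two subtractions disappear, which is one source of the speed difference noted under Performance notes.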

@tungld (Collaborator, Author) replied:

Thanks! I updated it with your content.

@tungld merged commit bd41f89 into onnx:main on Jan 20, 2025
7 checks passed
@jenkins-droid

Jenkins Linux ppc64le Build #15164 [push] Add a document for quant... started at 23:04

@jenkins-droid

Jenkins Linux s390x Build #16138 [push] Add a document for quant... started at 22:47

@jenkins-droid

Jenkins Linux amd64 Build #16136 [push] Add a document for quant... started at 21:47

@jenkins-droid

Jenkins Linux amd64 Build #16136 [push] Add a document for quant... passed after 1 hr 25 min

@jenkins-droid

Jenkins Linux s390x Build #16138 [push] Add a document for quant... passed after 1 hr 27 min

@jenkins-droid

Jenkins Linux ppc64le Build #15164 [push] Add a document for quant... passed after 2 hr 26 min
