- Introduction
- Get Started
  - 2.1 Static Quantization with IPEX Backend
    - 2.1.1 Usage Sample with IPEX
    - 2.1.2 Specify Quantization Rules
    - 2.1.3 Model Examples
  - 2.2 Static Quantization with PT2E Backend
    - 2.2.1 Usage Sample with PT2E
    - 2.2.2 Model Examples with PT2E
Post-Training Quantization (PTQ) is a technique used to convert a pre-trained floating-point model to a quantized model. This approach does not require model retraining. Instead, it uses calibration data to determine the optimal quantization parameters. Static quantization involves calibrating both weights and activations during the quantization process. Currently, we support two paths to perform static PTQ: Intel Extension for PyTorch (IPEX) and PyTorch 2 Export Quantization (PT2E).
Intel Extension for PyTorch (IPEX) provides optimizations specifically for Intel hardware, improving the performance of PyTorch models through efficient execution on CPUs. IPEX supports PTQ, allowing users to quantize models to lower precision to reduce model size and inference time while maintaining accuracy.
The design philosophy of the quantization interface of Intel(R) Neural Compressor is ease of use. It requires users to provide a `model`, a `calibration function`, and `example_inputs`; these parameters are used to quantize and tune the model.

`model` is the framework model location or the framework model object.

The `calibration function` is used to determine the appropriate quantization parameters, such as the scale and zero-point, for the model's weights and activations. This process is crucial for minimizing the loss of accuracy that can occur when converting from floating-point to a lower-precision format.

IPEX leverages just-in-time (JIT) compilation techniques to optimize the model. `example_inputs` is used to trace the computational graph of the model, enabling various optimizations and transformations that are specific to IPEX. This tracing process captures the operations performed by the model, allowing IPEX to apply quantization optimizations effectively. `example_inputs` should be representative of the actual data the model will process to ensure accurate calibration.
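Before calling the quantization APIs, you therefore need a float model, representative `example_inputs`, and a calibration function (referred to as `run_fn` below). The following is a minimal, self-contained sketch; the toy model, tensor shapes, and random calibration batches are placeholder assumptions to be replaced with your own workload.

```python
import torch
from torch import nn


# A small stand-in float model; replace it with your own network.
class UserFloatModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = UserFloatModel()

# Example inputs: a tuple shaped like the real data, used to trace the graph.
example_inputs = (torch.randn(4, 16),)


# Calibration function: feed representative samples through the prepared model
# so the inserted observers can collect activation statistics.
def run_fn(prepared_model):
    for _ in range(10):
        prepared_model(torch.randn(4, 16))  # replace with real calibration data
```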
```python
import intel_extension_for_pytorch as ipex
from neural_compressor.torch.quantization import StaticQuantConfig, convert, prepare

# Configure static quantization: symmetric activations, min-max calibration.
quant_config = StaticQuantConfig(act_sym=True, act_algo="minmax")
# Insert observers into the model.
prepared_model = prepare(model, quant_config=quant_config, example_inputs=example_inputs)
# Calibrate on representative data.
run_fn(prepared_model)
# Convert to the quantized model.
q_model = convert(prepared_model)
```
Important: To use static quantization with the IPEX backend, explicitly import IPEX at the beginning of your program.
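Once converted, `q_model` can be used like any other PyTorch module. The call below is a minimal sketch, assuming the `q_model` and `example_inputs` from the snippets above; a few warm-up runs are commonly used so the JIT backend can finish its graph optimizations (an assumption about typical practice, not a strict requirement).

```python
# Run inference with the quantized model (q_model and example_inputs are
# taken from the snippets above).
with torch.no_grad():
    for _ in range(3):  # warm-up iterations for JIT graph optimization
        output = q_model(*example_inputs)
```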
Intel(R) Neural Compressor supports specifying quantization rules by operator name or operator type. Users can use `set_local` to fall back either an `op_name` or an `op_type` in `StaticQuantConfig` to achieve this.
- Example of `op_name_dict`

Here we do not quantize the layer named `fc1`.
```python
# Fall back by op_name: keep the layer named "fc1" in FP32.
quant_config.set_local("fc1", StaticQuantConfig(w_dtype="fp32", act_dtype="fp32"))
prepared_model = prepare(fp32_model, quant_config=quant_config, example_inputs=example_inputs)
run_fn(prepared_model)
q_model = convert(prepared_model)
```
- Example of `op_type_dict`

Here we do not quantize `Linear` layers.
```python
# Fall back by op_type: keep all Linear layers in FP32.
quant_config.set_local("Linear", StaticQuantConfig(w_dtype="fp32", act_dtype="fp32"))
prepared_model = prepare(model, quant_config=quant_config, example_inputs=example_inputs)
run_fn(prepared_model)
q_model = convert(prepared_model)
```
Users can refer to the examples to learn how to quantize a new model.
Compared to the IPEX backend, which uses JIT compilation to capture the eager model, the PT2E path uses `torch.dynamo` to capture the eager model into an FX graph model and then inserts observers and Q/DQ pairs on it. Finally, it uses `torch.compile` to perform pattern matching and replace the Q/DQ pairs with optimized quantized operators.
There are four steps to perform W8A8 static quantization with the PT2E backend: `export`, `prepare`, `convert`, and `compile`.
```python
import torch
from neural_compressor.torch.export import export
from neural_compressor.torch.quantization import StaticQuantConfig, prepare, convert

# Prepare the float model and example inputs for exporting the model
model = UserFloatModel()
example_inputs = ...

# Export the eager model into an FX graph model
exported_model = export(model=model, example_inputs=example_inputs)

# Quantize the model
quant_config = StaticQuantConfig()
prepared_model = prepare(exported_model, quant_config=quant_config)

# Calibrate the prepared model on representative data
run_fn(prepared_model)
q_model = convert(prepared_model)

# Compile the quantized model and replace the Q/DQ patterns with quantized operators
from torch._inductor import config

config.freezing = True
opt_model = torch.compile(q_model)
```
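The compiled model can then be used for inference. The sketch below assumes the `opt_model` and `example_inputs` from the sample above and that `example_inputs` is a tuple of tensors; the first call triggers Inductor compilation (including the Q/DQ replacement), so subsequent calls run the optimized quantized kernels.

```python
# First call compiles the graph and performs the Q/DQ replacement;
# subsequent calls reuse the compiled quantized kernels.
with torch.no_grad():
    output = opt_model(*example_inputs)
    output = opt_model(*example_inputs)
```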
Note: `set_local` of `StaticQuantConfig` will be supported after the torch 2.4 release.
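Once available, the per-op fallback usage is expected to mirror the IPEX example above. The snippet below is only an illustrative sketch of that anticipated interface, not confirmed behavior of the PT2E path; the op-type key ("Linear") is borrowed from the IPEX example and may differ for PT2E.

```python
# Hypothetical sketch of per-op fallback for the PT2E path (not yet supported):
# the call mirrors the set_local usage shown for the IPEX backend above.
quant_config = StaticQuantConfig()
quant_config.set_local("Linear", StaticQuantConfig(w_dtype="fp32", act_dtype="fp32"))
```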
Users can refer to the CV examples and LLM examples to learn how to quantize a new model.