Enhance UT #23

Merged: 13 commits, Jul 3, 2024
README.md (10 additions, 11 deletions)

@@ -30,20 +30,20 @@ pip install -r requirements.txt
pip install .
```

> **Note**:
> Further installation methods can be found under [Installation Guide](./docs/installation_guide.md).

## Getting Started

Setting up the environment:
```bash
pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx
```
After successfully installing these packages, try your first quantization program.
> Note: please install from source before the formal PyPI release.

### Weight-Only Quantization (LLMs)
The following example demonstrates Weight-Only Quantization on LLMs; the most efficient device is selected automatically when multiple devices are available.

Run the example:
```python
@@ -59,17 +59,16 @@ quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
)
quant.process()
best_model = quant.model
```
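
The hunk above collapses the quantizer setup, so here is a minimal end-to-end sketch for orientation (an editorial addition, not part of this diff). The import path and the `RTNWeightOnlyQuantConfig` name are assumptions modeled on the analogous onnxruntime MatMul n-bits quantizer API; check the repository docs for the exact signature.

```python
# Hedged sketch of weight-only (RTN) quantization; names marked as assumptions
# below are not confirmed by this diff.
from onnx_neural_compressor.quantization import matmul_nbits_quantizer  # assumed module path

model_path = "llm_decoder.onnx"  # hypothetical input model

# Assumed algorithm-config class mirroring onnxruntime's RTNWeightOnlyQuantConfig
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig()

quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model_path,               # model path or loaded model object
    algo_config=algo_config,  # additional args such as n_bits/block_size may apply
)
quant.process()
best_model = quant.model  # quantized model
```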

### Static Quantization

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import quantize
-from onnx_neural_compressor.quantization import calibrate
+from onnx_neural_compressor.quantization import quantize, config
+from onnx_neural_compressor import data_reader


-class DataReader(calibrate.CalibrationDataReader):
+class DataReader(data_reader.CalibrationDataReader):
def __init__(self):
self.encoded_list = []
# append data into self.encoded_list
@@ -127,6 +126,6 @@ quantize(model, output_model_path, qconfig)
* [Contribution Guidelines](./docs/source/CONTRIBUTING.md)
* [Security Policy](SECURITY.md)

## Communication
- [GitHub Issues](https://github.com/onnx/neural-compressor/issues): mainly for bug reports, new feature requests, asking questions, etc.
- [Email](mailto:[email protected]): you are welcome to raise interesting research ideas on model compression techniques by email for collaboration.
docs/quantization.md (20 additions, 28 deletions)

@@ -4,10 +4,10 @@ Quantization
1. [Quantization Introduction](#quantization-introduction)
2. [Quantization Fundamentals](#quantization-fundamentals)
3. [Accuracy Aware Tuning](#with-or-without-accuracy-aware-tuning)
4. [Get Started](#get-started)
4.1 [Post Training Quantization](#post-training-quantization)
4.2 [Specify Quantization Rules](#specify-quantization-rules)
4.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
5. [Examples](#examples)

## Quantization Introduction
@@ -22,7 +22,7 @@ The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$

**Affine Quantization**

This is the so-called `asymmetric quantization`, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

where:

@@ -34,13 +34,13 @@ If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPo

**Scale Quantization**

This is the so-called `symmetric quantization`, in which we use the maximum absolute value in the float tensor as the float range and map it to the corresponding integer range.

The math equation is like:

where:

If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$.

or

@@ -61,10 +61,10 @@ Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 dat
+ Symmetric Quantization
+ int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
+ Asymmetric Quantization
+ uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)
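
As a concrete illustration of the formulas above, the following NumPy sketch (an editorial addition, not part of the diff) computes the scale and zero point for symmetric int8 and asymmetric uint8 quantization, using the divide-by-scale convention implied by the scale formulas above:

```python
import numpy as np

x = np.array([-2.5, -0.4, 0.0, 1.2, 3.7], dtype=np.float32)
rmin, rmax = float(x.min()), float(x.max())

# Symmetric int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1), zero_point = 0
sym_scale = 2 * max(abs(rmin), abs(rmax)) / (127 - (-128) - 1)
x_int8 = np.clip(np.round(x / sym_scale), -128, 127).astype(np.int8)

# Asymmetric uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)),
# zero_point = min(uint8) - round(rmin / scale)
asym_scale = (rmax - rmin) / (255 - 0)
zero_point = 0 - int(np.round(rmin / asym_scale))
x_uint8 = np.clip(np.round(x / asym_scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantize to check the round-trip error
x_dq = (x_uint8.astype(np.float32) - zero_point) * asym_scale
print(sym_scale, x_int8)
print(asym_scale, zero_point, x_uint8, np.abs(x - x_dq).max())
```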

#### Reference
+ MLAS: [MLAS Quantization](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/onnx_quantizer.py)

### Quantization Approaches

@@ -88,7 +88,7 @@ This approach is major quantization approach people should try because it could

## With or Without Accuracy Aware Tuning

Accuracy-aware tuning is one of the unique features of Neural Compressor compared with other third-party model compression tools. It can be used to address the accuracy loss caused by low-precision quantization and other lossy optimization methods.

This tuning algorithm creates a tuning space based on user-defined configurations, generates a quantized graph, and evaluates the accuracy of that quantized graph. The optimal model is returned once the pre-defined accuracy goal is met.
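
As a rough mental model of this flow, the loop below is an editorial sketch of accuracy-aware tuning as described in this paragraph; it is not the library's actual implementation, and the function and parameter names are hypothetical.

```python
# Editorial sketch of the accuracy-aware tuning loop described above
# (hypothetical names; not the library's actual implementation).
def accuracy_aware_tune(model, config_set, quantize_fn, eval_fn, accuracy_goal):
    for qconfig in config_set:                   # tuning space from user-defined configurations
        candidate = quantize_fn(model, qconfig)  # generate a quantized graph
        if eval_fn(candidate) >= accuracy_goal:  # evaluate accuracy against the goal
            return candidate                     # optimal model once the goal is met
    return None                                  # no configuration met the accuracy goal
```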

@@ -105,7 +105,7 @@ User could refer to below chart to understand the whole tuning flow.

## Get Started

The design philosophy of the ONNX Neural Compressor quantization interface is ease of use. The user provides a `model`, a `calibration dataloader`, and an `evaluation function`; these parameters are used to quantize and tune the model.

`model` is the framework model location or the framework model object.

@@ -123,12 +123,11 @@ User could execute:
This means the user can leverage ONNX Neural Compressor to directly generate a fully quantized model without accuracy-aware tuning. It is the user's responsibility to ensure the accuracy of the quantized model meets expectations. ONNX Neural Compressor supports `Post Training Static Quantization` and `Post Training Dynamic Quantization`.

``` python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import quantize
-from onnx_neural_compressor.quantization import calibrate
+from onnx_neural_compressor.quantization import quantize, config
+from onnx_neural_compressor import data_reader


-class DataReader(calibrate.CalibrationDataReader):
+class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -144,17 +143,10 @@ quantize(model, q_model_path, qconfig)
This means the user can leverage the advanced features of ONNX Neural Compressor to tune for a quantized model with the best accuracy and good performance. The user should provide an `eval_fn`.

``` python
-from onnx_neural_compressor.quantization import calibrate
-from onnx_neural_compressor.quantization import tuning
-    CalibrationDataReader,
-    GPTQConfig,
-    RTNConfig,
-    autotune,
-    get_woq_tuning_config,
-)
+from onnx_neural_compressor import data_reader
+from onnx_neural_compressor.quantization import tuning, config

-class DataReader(calibrate.CalibrationDataReader):
+class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -200,7 +192,7 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<tr>
<th>Backend</th>
<th>Backend Library</th>
<th>Support Device(cpu as default)</th>
</tr>
</thead>
<tbody>
@@ -235,9 +227,9 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<br>

> ***Note***
>
> DmlExecutionProvider support is experimental; please expect exceptions.
>
> Known limitation: the batch size of ONNX models has to be fixed to 1 for DmlExecutionProvider; multi-batch and dynamic batch are not supported yet.


docs/quantization_weight_only.md (1 addition, 2 deletions)

@@ -124,8 +124,7 @@ To find the best algorithm, users can leverage the `autotune` feature to explore
### **User code example**

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import tuning
+from onnx_neural_compressor.quantization import tuning, config

tune_config = tuning.TuningConfig(config_set=config.get_woq_tuning_config())
best_model = tuning.autotune(
docs/smooth_quant.md (7 additions, 9 deletions)

@@ -103,7 +103,7 @@ array([[0.68475647, 0.4742902 , 0.74404275],
7.384850698449426e-07
```

The difference between $W$ and $W_{dq}$ shows that quantization affects precision, and appropriate values of scale and zero point will reduce the loss of precision.

#### Per-channel example

@@ -233,7 +233,7 @@ The image on the left presents a normal MatMul forward with 1x2 input $x$ and 2

### SmoothQuant

In the previous subsection, we explained why per-channel quantization cannot be applied to activations, even though it could lead to lower quantization loss. However, the quantization error of activations plays an important role in the accuracy loss of model quantization[^2][^3][^4].



@@ -274,7 +274,7 @@ j is the index of the input channels.
For most of the models such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced value to split the difficulty of weight and activation quantization. A larger $\alpha$ value could be used on models with more significant activation outliers to migrate more quantization difficulty to weights.
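
To make the migration concrete, the NumPy sketch below (an editorial addition, not part of the diff) computes the per-input-channel smoothing scale $s_j = \max(|X_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}$ and folds it into the weight so the MatMul output is unchanged:

```python
import numpy as np

alpha = 0.5  # balance between activation and weight quantization difficulty
X = np.random.randn(4, 8).astype(np.float32) * 10.0  # activation, shape (tokens, in_channels)
W = np.random.randn(8, 16).astype(np.float32)        # weight, shape (in_channels, out_channels)

act_max = np.abs(X).max(axis=0)  # per-input-channel max of the activation
wgt_max = np.abs(W).max(axis=1)  # per-input-channel max of the weight
s = act_max**alpha / wgt_max ** (1 - alpha)

X_smooth = X / s           # activation becomes easier to quantize
W_smooth = W * s[:, None]  # difficulty migrated into the weight

# The smoothed MatMul is mathematically equivalent to the original one.
assert np.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
```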


### Our enhancement:

#### Algorithm: Auto-tuning of $\alpha$.

@@ -297,7 +297,7 @@ Multiple criteria (e.g min, max and mean) are supported to determine the $\alpha

In our experiments, an $\alpha$ range of [0.0, 1.0] with a step_size of 0.1 is found to be well balanced for the majority of models.

#### Engineering

*fully automated*: users only need to pass a model and dataloader.

@@ -322,7 +322,7 @@ There are two ways to apply smooth quantization: 1) using a fixed `alpha` for th
To set a fixed alpha for the entire model, users can follow this example:

```python
-from onnx_neural_compressor import config
+from onnx_neural_compressor.quantization import config

qconfig = config.StaticQuantConfig(
data_reader, extra_options={"SmoothQuant": True, "SmoothQuantAlpha": 0.5, "SmoothQuantFolding": True}
@@ -344,8 +344,7 @@ The tuning process looks for the optimal `alpha` value from a list of `alpha` va
Here is an example:

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import tuning
+from onnx_neural_compressor.quantization import tuning, config

qconfig = tuning.TuningConfig(config_set=[config.SmoothQuantConfig(alpha=np.arange(0.1, 0.5, 0.05).tolist())])
best_model = tuning.autotune(
@@ -360,8 +359,7 @@ In this case, the tuning process searches the optimal `alpha` of each operator b
Here is an example:

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import quantize
+from onnx_neural_compressor.quantization import quantize, config

qconfig = config.StaticQuantConfig(
data_reader,
onnx_neural_compressor/algorithms/utility.py (3 additions, 3 deletions)

@@ -174,12 +174,12 @@ def quantize_data_per_channel(data, axis, qType, sym, reduce_range=False):
return rmin.reshape(-1, 1), rmax.reshape(-1, 1), zero_point.reshape(-1, 1), scale.reshape(-1, 1), quantized_data


-def dequantize_data_with_scale_zero(tensor_value, scale_value, zo_value): # pragma: no cover
+def dequantize_data_with_scale_zero(tensor_value, scale_value, zo_value):
"""Dequantize tensor with scale and zero point."""
return (tensor_value.astype(scale_value.dtype) - zo_value.astype(scale_value.dtype)) * scale_value


-def dequantize_data(tensor_value, scale_value, zo_value, axis=0): # pragma: no cover
+def dequantize_data(tensor_value, scale_value, zo_value, axis=0):
"""Dequantize tensor."""
if not isinstance(scale_value, np.ndarray):
return dequantize_data_with_scale_zero(tensor_value, scale_value, zo_value)
@@ -386,7 +386,7 @@ def make_matmul_weight_only_node(
# require onnxruntime > 1.16.3
kwargs["accuracy_level"] = accuracy_level

-else:
+else: # pragma: no cover
offset = 5 if zero_point is not None else 4
op_type = "MatMulFpQ4"

onnx_neural_compressor/algorithms/weight_only/awq.py (10 additions, 7 deletions)

@@ -50,7 +50,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):

input_name_to_nodes = model.input_name_to_nodes()
for parent, nodes in absorb_pairs.items():
-if any([node.input[0] not in output_dicts for node in nodes]):
+if any([node.input[0] not in output_dicts for node in nodes]): # pragma: no cover
logger.warning(
"Miss input tensors of nodes {} during AWQ, skip it!".format(
", ".join([node.name for node in nodes if node.input[0] not in output_dicts])
@@ -102,7 +102,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
-): # pragma: no cover
+):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
q_weight = quant_utils.qdq_data(
@@ -154,7 +154,9 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):

if parent.op_type in ["LayerNormalization", "BatchNormalization", "InstanceNormalization"] and len(
input_name_to_nodes[nodes[0].input[0]]
-) == len(nodes):
+) == len(
+nodes
+): # pragma: no cover
for idx in [1, 2]:
tensor = onnx.numpy_helper.to_array(model.get_initializer(parent.input[idx]), base_dir)
dtype = tensor.dtype
@@ -187,7 +189,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):
updated_nodes.append(parent.name)
output_dicts[parent.output[0]] = output_dicts[parent.output[0]] / np.reshape(best_scale, (1, -1))

-else: # pragma: no cover
+else:
# insert mul
scale_tensor = onnx.helper.make_tensor(
name=parent.output[0] + "_weight_only_scale",
@@ -256,7 +258,7 @@ def _apply_awq_clip(model, weight_config, absorb_pairs, output_dicts):
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
-): # pragma: no cover
+):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
weight = quant_utils.qdq_data(
@@ -342,7 +344,8 @@ def awq_quantize(
output_names.append(node.input[0])
output_names = list(set(output_names))
model.add_tensors_to_outputs(output_names)
-if model.is_large_model:
+
+if model.is_large_model: # pragma: no cover
onnx.save_model(
model.model,
model.model_path + "_augment.onnx",
@@ -374,7 +377,7 @@ def awq_quantize(
):
dump_pairs[parent.name].append(model.get_node(node.name))

-if len(dump_pairs[parent.name]) == 0:
+if len(dump_pairs[parent.name]) == 0: # pragma: no cover
continue

output_dicts = {}
onnx_neural_compressor/algorithms/weight_only/gptq.py (3 additions, 3 deletions)

@@ -272,13 +272,13 @@ def gptq_quantize(
weight = onnx.numpy_helper.to_array(
model.get_initializer(model.get_node(node.name).input[1]), base_dir
).copy()
-if len(weight.shape) != 2:
+if len(weight.shape) != 2: # pragma: no cover
continue

weights.append(weight)
node_list.append(model.get_node(node.name))

-if len(weights) == 0:
+if len(weights) == 0: # pragma: no cover
continue

Hs = [np.zeros((i.shape[0], i.shape[0])) for i in weights]
@@ -327,7 +327,7 @@ def gptq_quantize(
if ("CUDAExecutionProvider" in providers and satisfy_MatMulNBits_condition) or (
"CUDAExecutionProvider" not in providers
and (satisfy_MatMulFpQ4_condition or satisfy_MatMulNBits_condition)
-): # pragma: no cover
+):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions, supported by CPU EP
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1, supported by CPU EP AND CUDA EP
org_shape = weight.shape