Enhance UT #23

Merged
merged 13 commits on Jul 3, 2024
Changes from 5 commits
18 changes: 9 additions & 9 deletions README.md
@@ -30,20 +30,20 @@ pip install -r requirements.txt
pip install .
```

> **Note**:
> Further installation methods can be found under [Installation Guide](./docs/installation_guide.md).

## Getting Started

Setting up the environment:
```bash
pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx
```
After successfully installing these packages, try your first quantization program.
> Note: please install from source before the formal PyPI release.

### Weight-Only Quantization (LLMs)
The following example code demonstrates Weight-Only Quantization on LLMs; the device is selected automatically for efficiency when multiple devices are available.

Run the example:
```python
@@ -59,17 +59,17 @@ quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
)
quant.process()
best_model = quant.model
```

### Static Quantization

```python
from onnx_neural_compressor import config
from onnx_neural_compressor.quantization import quantize
from onnx_neural_compressor.quantization import calibrate
from onnx_neural_compressor import data_reader


class DataReader(calibrate.CalibrationDataReader):
class DataReader(data_reader.CalibrationDataReader):
def __init__(self):
self.encoded_list = []
# append data into self.encoded_list
@@ -127,6 +127,6 @@ quantize(model, output_model_path, qconfig)
* [Contribution Guidelines](./docs/source/CONTRIBUTING.md)
* [Security Policy](SECURITY.md)

## Communication
- [GitHub Issues](https://github.com/onnx/neural-compressor/issues): mainly for bug reports, new feature requests, and questions.
- [Email](mailto:[email protected]): you are welcome to share research ideas on model compression techniques by email for collaboration.
44 changes: 19 additions & 25 deletions docs/quantization.md
@@ -4,10 +4,10 @@ Quantization
1. [Quantization Introduction](#quantization-introduction)
2. [Quantization Fundamentals](#quantization-fundamentals)
3. [Accuracy Aware Tuning](#with-or-without-accuracy-aware-tuning)
4. [Get Started](#get-started)
4.1 [Post Training Quantization](#post-training-quantization)
4.2 [Specify Quantization Rules](#specify-quantization-rules)
4.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
5. [Examples](#examples)

## Quantization Introduction
@@ -22,7 +22,7 @@ The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$

**Affine Quantization**

This is the so-called `asymmetric quantization`, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

here:

@@ -34,13 +34,13 @@ If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPo

**Scale Quantization**

This is the so-called `symmetric quantization`, in which we use the maximum absolute value in the float tensor as the float range and map it to the corresponding integer range.

The math equation is like:

here:

If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$.

or

@@ -61,10 +61,10 @@ Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 dat
+ Symmetric Quantization
+ int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
+ Asymmetric Quantization
+ uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)

#### Reference
+ MLAS: [MLAS Quantization](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/onnx_quantizer.py)
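
To make the scale and zero-point formulas above concrete, here is a minimal NumPy sketch that follows the ONNX Runtime convention listed above, quantizing as `q = clip(round(x / scale) + zero_point, qmin, qmax)`. It is an illustration only, not the library's internal implementation.

``` python
import numpy as np


def symmetric_int8_params(x: np.ndarray):
    # int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1), zero_point = 0
    rmin, rmax = float(x.min()), float(x.max())
    scale = 2 * max(abs(rmin), abs(rmax)) / (127 - (-128) - 1) or 1.0  # guard against an all-zero tensor
    return scale, 0


def asymmetric_uint8_params(x: np.ndarray):
    # uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)
    rmin, rmax = float(x.min()), float(x.max())
    scale = (rmax - rmin) / 255 or 1.0  # guard against a constant tensor
    zero_point = 0 - round(rmin / scale)
    return scale, zero_point


def quantize_tensor(x: np.ndarray, scale: float, zero_point: int, qmin: int, qmax: int):
    # q = clip(round(x / scale) + zero_point, qmin, qmax)
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax)


x = np.random.randn(4, 8).astype(np.float32)
scale, zp = asymmetric_uint8_params(x)
q_uint8 = quantize_tensor(x, scale, zp, 0, 255).astype(np.uint8)
scale, zp = symmetric_int8_params(x)
q_int8 = quantize_tensor(x, scale, zp, -128, 127).astype(np.int8)
```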

### Quantization Approaches

@@ -88,7 +88,7 @@ This approach is major quantization approach people should try because it could

## With or Without Accuracy Aware Tuning

Accuracy-aware tuning is one of the unique features provided by Neural Compressor compared with other third-party model compression tools. It can be used to address the accuracy loss caused by low-precision quantization and other lossy optimization methods.

This tuning algorithm creates a tuning space based on user-defined configurations, generates a quantized graph, and evaluates its accuracy. The optimal model is yielded once the pre-defined accuracy goal is met.
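
Conceptually, the loop behaves like the sketch below; the function and variable names are illustrative placeholders, not the actual tuning implementation.

``` python
def accuracy_aware_tune(fp32_model, config_set, eval_fn, baseline_acc, tolerable_loss=0.01):
    """Sketch of accuracy-aware tuning: try candidate configs until the accuracy goal is met."""
    best_model, best_acc = None, float("-inf")
    for qconfig in config_set:  # tuning space built from user-defined configurations
        q_model = apply_quantization(fp32_model, qconfig)  # placeholder: generate a quantized graph
        acc = eval_fn(q_model)  # evaluate the accuracy of the quantized graph
        if baseline_acc - acc <= tolerable_loss:
            return q_model  # pre-defined accuracy goal met: yield this model
        if acc > best_acc:  # otherwise remember the best candidate seen so far
            best_model, best_acc = q_model, acc
    return best_model
```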

@@ -105,7 +105,7 @@ User could refer to below chart to understand the whole tuning flow.

## Get Started

The design philosophy of the quantization interface of ONNX Neural Compressor is ease of use. It requires the user to provide a `model`, a `calibration dataloader`, and an `evaluation function`; these parameters are used to quantize and tune the model.

`model` is the path to the framework model or the framework model object.
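
For reference, the `calibration dataloader` is a `data_reader.CalibrationDataReader` subclass that yields feed dictionaries. A minimal sketch is shown below; the input name, shape, and random data are assumptions for illustration only.

``` python
import numpy as np

from onnx_neural_compressor import data_reader


class RandomDataReader(data_reader.CalibrationDataReader):
    """Minimal calibration data reader sketch (illustrative only)."""

    def __init__(self, num_samples: int = 8):
        # A real reader would load representative samples from the calibration dataset.
        self.encoded_list = [
            {"input_ids": np.random.randint(0, 100, size=(1, 32), dtype=np.int64)}
            for _ in range(num_samples)
        ]
        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        # Return the next feed dict, or None when the calibration data is exhausted.
        return next(self.iter_next, None)

    def rewind(self):
        # Restart iteration so the reader can be reused.
        self.iter_next = iter(self.encoded_list)
```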

@@ -125,10 +125,10 @@ This means user could leverage ONNX Neural Compressor to directly generate a ful
``` python
from onnx_neural_compressor import config
from onnx_neural_compressor.quantization import quantize
from onnx_neural_compressor.quantization import calibrate
from onnx_neural_compressor import data_reader


class DataReader(calibrate.CalibrationDataReader):
class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -144,17 +144,11 @@ quantize(model, q_model_path, qconfig)
This means the user can leverage the advanced features of ONNX Neural Compressor to tune a quantized model with the best accuracy and good performance. The user should provide an `eval_fn`.

``` python
from onnx_neural_compressor.quantization import calibrate
from onnx_neural_compressor import data_reader
from onnx_neural_compressor.quantization import tuning
from onnx_neural_compressor.quantization import (
    CalibrationDataReader,
    GPTQConfig,
    RTNConfig,
    autotune,
    get_woq_tuning_config,
)

from onnx_neural_compressor import config

class DataReader(calibrate.CalibrationDataReader):
class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -200,7 +194,7 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<tr>
<th>Backend</th>
<th>Backend Library</th>
<th>Support Device(cpu as default)</th>
</tr>
</thead>
<tbody>
@@ -235,9 +229,9 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<br>

> ***Note***
>
> DmlExecutionProvider support is experimental; please expect exceptions.
>
> Known limitation: the batch size of ONNX models has to be fixed to 1 for DmlExecutionProvider; there is no multi-batch or dynamic batch support yet.


17 changes: 10 additions & 7 deletions onnx_neural_compressor/algorithms/weight_only/awq.py
@@ -49,7 +49,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,
base_dir = os.path.dirname(model.model_path) if model.model_path is not None else ""

for parent, nodes in absorb_pairs.items():
if any([node.input[0] not in output_dicts for node in nodes]):
if any([node.input[0] not in output_dicts for node in nodes]): # pragma: no cover
logger.warning(
"Miss input tensors of nodes {} during AWQ, skip it!".format(
", ".join([node.name for node in nodes if node.input[0] not in output_dicts])
@@ -101,7 +101,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
): # pragma: no cover
):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
q_weight = woq_utility.qdq_tensor(weight, num_bits, group_size, scheme, "uint") / np.expand_dims(
@@ -153,7 +153,9 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,

if parent.op_type in ["LayerNormalization", "BatchNormalization", "InstanceNormalization"] and len(
model.input_name_to_nodes()[nodes[0].input[0]]
) == len(nodes):
) == len(
nodes
): # pragma: no cover
for idx in [1, 2]:
tensor = onnx.numpy_helper.to_array(model.get_initializer(parent.input[idx]), base_dir)
dtype = tensor.dtype
@@ -186,7 +188,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,
updated_nodes.append(parent.name)
output_dicts[parent.output[0]] = output_dicts[parent.output[0]] / np.reshape(best_scale, (1, -1))

else: # pragma: no cover
else:
# insert mul
scale_tensor = onnx.helper.make_tensor(
name=parent.output[0] + "_weight_only_scale",
@@ -256,7 +258,7 @@ def _apply_awq_clip(model, weight_config, absorb_pairs, output_dicts, num_bits,
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
): # pragma: no cover
):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
weight = woq_utility.qdq_tensor(
@@ -346,7 +348,8 @@ def awq_quantize(
output_names.append(node.input[0])
output_names = list(set(output_names))
model.add_tensors_to_outputs(output_names)
if model.is_large_model:

if model.is_large_model: # pragma: no cover
onnx.save_model(
model.model,
model.model_path + "_augment.onnx",
@@ -376,7 +379,7 @@ ):
):
dump_pairs[parent.name].append(model.get_node(node.name))

if len(dump_pairs[parent.name]) == 0:
if len(dump_pairs[parent.name]) == 0: # pragma: no cover
continue

output_dicts = {}
6 changes: 3 additions & 3 deletions onnx_neural_compressor/algorithms/weight_only/gptq.py
@@ -279,13 +279,13 @@ def gptq_quantize(
weight = onnx.numpy_helper.to_array(
model.get_initializer(model.get_node(node.name).input[1]), base_dir
).copy()
if len(weight.shape) != 2:
if len(weight.shape) != 2: # pragma: no cover
continue

weights.append(weight)
node_list.append(model.get_node(node.name))

if len(weights) == 0:
if len(weights) == 0: # pragma: no cover
continue

Hs = [np.zeros((i.shape[0], i.shape[0])) for i in weights]
@@ -335,7 +335,7 @@ def gptq_quantize(
if ("CUDAExecutionProvider" in providers and satisfy_MatMulNBits_condition) or (
"CUDAExecutionProvider" not in providers
and (satisfy_MatMulFpQ4_condition or satisfy_MatMulNBits_condition)
): # pragma: no cover
):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions, supported by CPU EP
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1, supported by CPU EP AND CUDA EP
org_shape = weight.shape
14 changes: 7 additions & 7 deletions onnx_neural_compressor/algorithms/weight_only/utility.py
@@ -29,11 +29,11 @@

from onnx_neural_compressor import constants, utility

if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"): # pragma: no cover
if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"):
import onnxruntime_extensions


def _get_blob_size(group_size, has_zp): # pragma: no cover
def _get_blob_size(group_size, has_zp):
"""Get blob_size.

Args:
@@ -42,9 +42,9 @@ def _get_blob_size(group_size, has_zp): # pragma: no cover
"""
if version.Version(ort.__version__) > constants.ONNXRT1161_VERSION:
blob_size = group_size // 2
elif has_zp:
elif has_zp: # pragma: no cover
blob_size = group_size // 2 + 4 + 1
else:
else: # pragma: no cover
blob_size = group_size // 2 + 4
return blob_size

@@ -109,7 +109,7 @@ def make_matmul_weight_only_node(

# build zero_point tensor
if zero_point is not None:
if num_bits > 4:
if num_bits > 4: # pragma: no cover
packed_zp = np.reshape(zero_point, (1, -1)).astype("uint8")
else:
packed_zp = np.full((zero_point.shape[0] + 1) // 2, 136, dtype="uint8")
@@ -137,7 +137,7 @@ def make_matmul_weight_only_node(
# require onnxruntime > 1.16.3
kwargs["accuracy_level"] = accuracy_level

else:
else: # pragma: no cover
offset = 5 if zero_point is not None else 4
op_type = "MatMulFpQ4"

@@ -201,7 +201,7 @@ def prepare_inputs(model, data_reader, providers):
"""

so = ort.SessionOptions()
if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"): # pragma: no cover
if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"):
so.register_custom_ops_library(onnxruntime_extensions.get_library_path())
if model.is_large_model:
onnx.save_model(
10 changes: 5 additions & 5 deletions onnx_neural_compressor/config.py
@@ -310,7 +310,7 @@ def to_diff_dict(cls, instance) -> Dict[str, Any]:
def from_json_file(cls, filename):
with open(filename, "r", encoding="utf-8") as file:
config_dict = json.load(file)
return cls.from_dict(**config_dict)
return cls.from_dict(config_dict)

def to_json_file(self, filename):
config_dict = self.to_dict()
@@ -543,7 +543,7 @@ def register_supported_configs(cls):
raise NotImplementedError

@classmethod
def get_config_set_for_tuning(cls) -> None:
def get_config_set_for_tuning(cls) -> None: # pragma: no cover
# TODO (Yi) handle the composable config in `tuning_config`
return None

@@ -706,7 +706,7 @@ def get_model_info(model: Union[onnx.ModelProto, pathlib.Path, str]) -> list:
return filter_result

@classmethod
def get_config_set_for_tuning(cls) -> Union[None, "RTNConfig", List["RTNConfig"]]: # pragma: no cover
def get_config_set_for_tuning(cls) -> Union[None, "RTNConfig", List["RTNConfig"]]:
return RTNConfig(weight_bits=[4, 8], weight_sym=[True, False])


@@ -871,7 +871,7 @@ def get_model_info(model: Union[onnx.ModelProto, pathlib.Path, str]) -> list:
return filter_result

@classmethod
def get_config_set_for_tuning(cls) -> Union[None, "GPTQConfig", List["GPTQConfig"]]: # pragma: no cover
def get_config_set_for_tuning(cls) -> Union[None, "GPTQConfig", List["GPTQConfig"]]:
return GPTQConfig(
weight_bits=[4, 8],
weight_sym=[True, False],
@@ -1022,7 +1022,7 @@ def get_model_info(model: Union[onnx.ModelProto, pathlib.Path, str]) -> list:
return filter_result

@classmethod
def get_config_set_for_tuning(cls) -> Union[None, "AWQConfig", List["AWQConfig"]]: # pragma: no cover
def get_config_set_for_tuning(cls) -> Union[None, "AWQConfig", List["AWQConfig"]]:
return AWQConfig(
weight_bits=[4, 8],
weight_sym=[True, False],