Enhance UT #23

Merged: 13 commits, Jul 3, 2024
README.md (10 additions, 11 deletions)

@@ -30,20 +30,20 @@ pip install -r requirements.txt
pip install .
```

> **Note**:
> Further installation methods can be found under [Installation Guide](./docs/installation_guide.md).

## Getting Started

Setting up the environment:
```bash
pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx
```
After successfully installing these packages, try your first quantization program.
> Note: please install from source before the formal PyPI release.

### Weight-Only Quantization (LLMs)
The following example demonstrates Weight-Only Quantization on LLMs; the most efficient device is selected automatically when multiple devices are available.

Run the example:
```python
@@ -59,17 +59,16 @@ quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
)
quant.process()
best_model = quant.model
```
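
The hunk above collapses the quantizer setup, so here is a minimal end-to-end sketch for orientation (an editorial addition, not part of this diff). The import path and the `RTNWeightOnlyQuantConfig` name are assumptions modeled on the analogous onnxruntime MatMul n-bits quantizer API; check the repository docs for the exact signature.

```python
# Hedged sketch of weight-only (RTN) quantization; names marked as assumptions
# below are not confirmed by this diff.
from onnx_neural_compressor.quantization import matmul_nbits_quantizer  # assumed module path

model_path = "llm_decoder.onnx"  # hypothetical input model

# Assumed algorithm-config class mirroring onnxruntime's RTNWeightOnlyQuantConfig
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig()

quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model_path,               # model path or loaded model object
    algo_config=algo_config,  # additional args such as n_bits/block_size may apply
)
quant.process()
best_model = quant.model  # quantized model
```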

### Static Quantization

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import quantize
-from onnx_neural_compressor.quantization import calibrate
+from onnx_neural_compressor.quantization import quantize, config
+from onnx_neural_compressor import data_reader


-class DataReader(calibrate.CalibrationDataReader):
+class DataReader(data_reader.CalibrationDataReader):
def __init__(self):
self.encoded_list = []
# append data into self.encoded_list
@@ -127,6 +126,6 @@ quantize(model, output_model_path, qconfig)
* [Contribution Guidelines](./docs/source/CONTRIBUTING.md)
* [Security Policy](SECURITY.md)

## Communication
- [GitHub Issues](https://github.com/onnx/neural-compressor/issues): mainly for bug reports, new feature requests, asking questions, etc.
- [Email](mailto:[email protected]): you are welcome to raise interesting research ideas on model compression techniques by email for collaboration.
docs/quantization.md (20 additions, 28 deletions)

@@ -4,10 +4,10 @@ Quantization
1. [Quantization Introduction](#quantization-introduction)
2. [Quantization Fundamentals](#quantization-fundamentals)
3. [Accuracy Aware Tuning](#with-or-without-accuracy-aware-tuning)
4. [Get Started](#get-started)
4.1 [Post Training Quantization](#post-training-quantization)
4.2 [Specify Quantization Rules](#specify-quantization-rules)
4.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
5. [Examples](#examples)

## Quantization Introduction
@@ -22,7 +22,7 @@ The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$

**Affine Quantization**

This is the so-called `asymmetric quantization`, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

where:

@@ -34,13 +34,13 @@ If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPo

**Scale Quantization**

This is the so-called `symmetric quantization`, in which we use the maximum absolute value in the float tensor as the float range and map it to the corresponding integer range.

The math equation is like:

where:

If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$.

or

@@ -61,10 +61,10 @@ Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 dat
+ Symmetric Quantization
+ int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
+ Asymmetric Quantization
+ uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)
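
As a concrete illustration of the formulas above, the following NumPy sketch (an editorial addition, not part of the diff) computes the scale and zero point for symmetric int8 and asymmetric uint8 quantization, using the divide-by-scale convention implied by the scale formulas above:

```python
import numpy as np

x = np.array([-2.5, -0.4, 0.0, 1.2, 3.7], dtype=np.float32)
rmin, rmax = float(x.min()), float(x.max())

# Symmetric int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1), zero_point = 0
sym_scale = 2 * max(abs(rmin), abs(rmax)) / (127 - (-128) - 1)
x_int8 = np.clip(np.round(x / sym_scale), -128, 127).astype(np.int8)

# Asymmetric uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)),
# zero_point = min(uint8) - round(rmin / scale)
asym_scale = (rmax - rmin) / (255 - 0)
zero_point = 0 - int(np.round(rmin / asym_scale))
x_uint8 = np.clip(np.round(x / asym_scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantize to check the round-trip error
x_dq = (x_uint8.astype(np.float32) - zero_point) * asym_scale
print(sym_scale, x_int8)
print(asym_scale, zero_point, x_uint8, np.abs(x - x_dq).max())
```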

#### Reference
+ MLAS: [MLAS Quantization](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/onnx_quantizer.py)

### Quantization Approaches

@@ -88,7 +88,7 @@ This approach is major quantization approach people should try because it could

## With or Without Accuracy Aware Tuning

Accuracy-aware tuning is one of the unique features of Neural Compressor compared with other third-party model compression tools. It can be used to address the accuracy loss caused by low-precision quantization and other lossy optimization methods.

This tuning algorithm creates a tuning space based on user-defined configurations, generates a quantized graph, and evaluates the accuracy of that quantized graph. The optimal model is returned once the pre-defined accuracy goal is met.
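
As a rough mental model of this flow, the loop below is an editorial sketch of accuracy-aware tuning as described in this paragraph; it is not the library's actual implementation, and the function and parameter names are hypothetical.

```python
# Editorial sketch of the accuracy-aware tuning loop described above
# (hypothetical names; not the library's actual implementation).
def accuracy_aware_tune(model, config_set, quantize_fn, eval_fn, accuracy_goal):
    for qconfig in config_set:                   # tuning space from user-defined configurations
        candidate = quantize_fn(model, qconfig)  # generate a quantized graph
        if eval_fn(candidate) >= accuracy_goal:  # evaluate accuracy against the goal
            return candidate                     # optimal model once the goal is met
    return None                                  # no configuration met the accuracy goal
```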

@@ -105,7 +105,7 @@ User could refer to below chart to understand the whole tuning flow.

## Get Started

The design philosophy of the ONNX Neural Compressor quantization interface is ease of use. The user provides a `model`, a `calibration dataloader`, and an `evaluation function`; these parameters are used to quantize and tune the model.

`model` is the framework model location or the framework model object.

@@ -123,12 +123,11 @@ User could execute:
This means the user can leverage ONNX Neural Compressor to directly generate a fully quantized model without accuracy-aware tuning. It is the user's responsibility to ensure the accuracy of the quantized model meets expectations. ONNX Neural Compressor supports `Post Training Static Quantization` and `Post Training Dynamic Quantization`.

``` python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import quantize
-from onnx_neural_compressor.quantization import calibrate
+from onnx_neural_compressor.quantization import quantize, config
+from onnx_neural_compressor import data_reader


-class DataReader(calibrate.CalibrationDataReader):
+class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -144,17 +143,10 @@ quantize(model, q_model_path, qconfig)
This means the user can leverage the advanced features of ONNX Neural Compressor to tune for a quantized model with the best accuracy and good performance. The user should provide an `eval_fn`.

``` python
-from onnx_neural_compressor.quantization import calibrate
-from onnx_neural_compressor.quantization import tuning
-    CalibrationDataReader,
-    GPTQConfig,
-    RTNConfig,
-    autotune,
-    get_woq_tuning_config,
-)
+from onnx_neural_compressor import data_reader
+from onnx_neural_compressor.quantization import tuning, config

-class DataReader(calibrate.CalibrationDataReader):
+class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -200,7 +192,7 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<tr>
<th>Backend</th>
<th>Backend Library</th>
<th>Support Device(cpu as default)</th>
</tr>
</thead>
<tbody>
@@ -235,9 +227,9 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<br>

> ***Note***
>
> DmlExecutionProvider support is experimental; please expect exceptions.
>
> Known limitation: the batch size of ONNX models has to be fixed to 1 for DmlExecutionProvider; multi-batch and dynamic batch are not supported yet.


docs/quantization_weight_only.md (1 addition, 2 deletions)

@@ -124,8 +124,7 @@ To find the best algorithm, users can leverage the `autotune` feature to explore
### **User code example**

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import tuning
+from onnx_neural_compressor.quantization import tuning, config

tune_config = tuning.TuningConfig(config_set=config.get_woq_tuning_config())
best_model = tuning.autotune(
docs/smooth_quant.md (7 additions, 9 deletions)

@@ -103,7 +103,7 @@ array([[0.68475647, 0.4742902 , 0.74404275],
7.384850698449426e-07
```

The difference between $W$ and $W_{dq}$ shows that quantization affects precision, and appropriate values of scale and zero point will reduce the loss of precision.

#### Per-channel example

@@ -233,7 +233,7 @@ The image on the left presents a normal MatMul forward with 1x2 input $x$ and 2

### SmoothQuant

In the previous subsection, we explained why per-channel quantization cannot be applied to activations, even though it could lead to lower quantization loss. However, the quantization error of activations plays an important role in the accuracy loss of model quantization[^2][^3][^4].



@@ -274,7 +274,7 @@ j is the index of the input channels.
For most of the models such as OPT and BLOOM, $\alpha = 0.5$ is a well-balanced value to split the difficulty of weight and activation quantization. A larger $\alpha$ value could be used on models with more significant activation outliers to migrate more quantization difficulty to weights.
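
To make the migration concrete, the NumPy sketch below (an editorial addition, not part of the diff) computes the per-input-channel smoothing scale $s_j = \max(|X_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}$ and folds it into the weight so the MatMul output is unchanged:

```python
import numpy as np

alpha = 0.5  # balance between activation and weight quantization difficulty
X = np.random.randn(4, 8).astype(np.float32) * 10.0  # activation, shape (tokens, in_channels)
W = np.random.randn(8, 16).astype(np.float32)        # weight, shape (in_channels, out_channels)

act_max = np.abs(X).max(axis=0)  # per-input-channel max of the activation
wgt_max = np.abs(W).max(axis=1)  # per-input-channel max of the weight
s = act_max**alpha / wgt_max ** (1 - alpha)

X_smooth = X / s           # activation becomes easier to quantize
W_smooth = W * s[:, None]  # difficulty migrated into the weight

# The smoothed MatMul is mathematically equivalent to the original one.
assert np.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
```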


### Our enhancement:

#### Algorithm: Auto-tuning of $\alpha$.

@@ -297,7 +297,7 @@ Multiple criteria (e.g min, max and mean) are supported to determine the $\alpha

In our experiments, an $\alpha$ range of [0.0, 1.0] with a step_size of 0.1 is found to be well balanced for the majority of models.

#### Engineering

*fully automated*: users only need to pass a model and dataloader.

@@ -322,7 +322,7 @@ There are two ways to apply smooth quantization: 1) using a fixed `alpha` for th
To set a fixed alpha for the entire model, users can follow this example:

```python
-from onnx_neural_compressor import config
+from onnx_neural_compressor.quantization import config

qconfig = config.StaticQuantConfig(
data_reader, extra_options={"SmoothQuant": True, "SmoothQuantAlpha": 0.5, "SmoothQuantFolding": True}
@@ -344,8 +344,7 @@ The tuning process looks for the optimal `alpha` value from a list of `alpha` va
Here is an example:

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import tuning
+from onnx_neural_compressor.quantization import tuning, config

qconfig = tuning.TuningConfig(config_set=[config.SmoothQuantConfig(alpha=np.arange(0.1, 0.5, 0.05).tolist())])
best_model = tuning.autotune(
@@ -360,8 +359,7 @@ In this case, the tuning process searches the optimal `alpha` of each operator b
Here is an example:

```python
-from onnx_neural_compressor import config
-from onnx_neural_compressor.quantization import quantize
+from onnx_neural_compressor.quantization import quantize, config

qconfig = config.StaticQuantConfig(
data_reader,
onnx_neural_compressor/algorithms/utility.py (3 additions, 3 deletions)

@@ -174,12 +174,12 @@ def quantize_data_per_channel(data, axis, qType, sym, reduce_range=False):
return rmin.reshape(-1, 1), rmax.reshape(-1, 1), zero_point.reshape(-1, 1), scale.reshape(-1, 1), quantized_data


-def dequantize_data_with_scale_zero(tensor_value, scale_value, zo_value): # pragma: no cover
+def dequantize_data_with_scale_zero(tensor_value, scale_value, zo_value):
"""Dequantize tensor with scale and zero point."""
return (tensor_value.astype(scale_value.dtype) - zo_value.astype(scale_value.dtype)) * scale_value


-def dequantize_data(tensor_value, scale_value, zo_value, axis=0): # pragma: no cover
+def dequantize_data(tensor_value, scale_value, zo_value, axis=0):
"""Dequantize tensor."""
if not isinstance(scale_value, np.ndarray):
return dequantize_data_with_scale_zero(tensor_value, scale_value, zo_value)
@@ -386,7 +386,7 @@ def make_matmul_weight_only_node(
# require onnxruntime > 1.16.3
kwargs["accuracy_level"] = accuracy_level

-else:
+else: # pragma: no cover
offset = 5 if zero_point is not None else 4
op_type = "MatMulFpQ4"

onnx_neural_compressor/algorithms/weight_only/awq.py (10 additions, 7 deletions)

@@ -50,7 +50,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):

input_name_to_nodes = model.input_name_to_nodes()
for parent, nodes in absorb_pairs.items():
-if any([node.input[0] not in output_dicts for node in nodes]):
+if any([node.input[0] not in output_dicts for node in nodes]): # pragma: no cover
logger.warning(
"Miss input tensors of nodes {} during AWQ, skip it!".format(
", ".join([node.name for node in nodes if node.input[0] not in output_dicts])
@@ -102,7 +102,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
-): # pragma: no cover
+):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
q_weight = quant_utils.qdq_data(
@@ -154,7 +154,9 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):

if parent.op_type in ["LayerNormalization", "BatchNormalization", "InstanceNormalization"] and len(
input_name_to_nodes[nodes[0].input[0]]
-) == len(nodes):
+) == len(
+nodes
+): # pragma: no cover
for idx in [1, 2]:
tensor = onnx.numpy_helper.to_array(model.get_initializer(parent.input[idx]), base_dir)
dtype = tensor.dtype
@@ -187,7 +189,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts):
updated_nodes.append(parent.name)
output_dicts[parent.output[0]] = output_dicts[parent.output[0]] / np.reshape(best_scale, (1, -1))

-else: # pragma: no cover
+else:
# insert mul
scale_tensor = onnx.helper.make_tensor(
name=parent.output[0] + "_weight_only_scale",
@@ -256,7 +258,7 @@ def _apply_awq_clip(model, weight_config, absorb_pairs, output_dicts):
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
-): # pragma: no cover
+):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
weight = quant_utils.qdq_data(
@@ -342,7 +344,8 @@ def awq_quantize(
output_names.append(node.input[0])
output_names = list(set(output_names))
model.add_tensors_to_outputs(output_names)
-if model.is_large_model:
+
+if model.is_large_model: # pragma: no cover
onnx.save_model(
model.model,
model.model_path + "_augment.onnx",
@@ -374,7 +377,7 @@ def awq_quantize(
):
dump_pairs[parent.name].append(model.get_node(node.name))

-if len(dump_pairs[parent.name]) == 0:
+if len(dump_pairs[parent.name]) == 0: # pragma: no cover
continue

output_dicts = {}
onnx_neural_compressor/algorithms/weight_only/gptq.py (3 additions, 3 deletions)

@@ -272,13 +272,13 @@ def gptq_quantize(
weight = onnx.numpy_helper.to_array(
model.get_initializer(model.get_node(node.name).input[1]), base_dir
).copy()
-if len(weight.shape) != 2:
+if len(weight.shape) != 2: # pragma: no cover
continue

weights.append(weight)
node_list.append(model.get_node(node.name))

-if len(weights) == 0:
+if len(weights) == 0: # pragma: no cover
continue

Hs = [np.zeros((i.shape[0], i.shape[0])) for i in weights]
@@ -327,7 +327,7 @@ def gptq_quantize(
if ("CUDAExecutionProvider" in providers and satisfy_MatMulNBits_condition) or (
"CUDAExecutionProvider" not in providers
and (satisfy_MatMulFpQ4_condition or satisfy_MatMulNBits_condition)
-): # pragma: no cover
+):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions, supported by CPU EP
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1, supported by CPU EP AND CUDA EP
org_shape = weight.shape