Enhance UT #23

Merged
merged 13 commits on Jul 3, 2024
Changes from 5 commits
18 changes: 9 additions & 9 deletions README.md
@@ -30,20 +30,20 @@ pip install -r requirements.txt
pip install .
```

> **Note**:
> Further installation methods can be found under [Installation Guide](./docs/installation_guide.md).

## Getting Started

Setting up the environment:
```bash
pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx
```
After successfully installing these packages, try your first quantization program.
> Note: please install from source before the formal PyPI release.

### Weight-Only Quantization (LLMs)
The following example code demonstrates Weight-Only Quantization on LLMs; the device is selected automatically for efficiency when multiple devices are available.

Run the example:
```python
@@ -59,17 +59,17 @@ quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
)
quant.process()
best_model = quant.model
```

### Static Quantization

```python
from onnx_neural_compressor import config
from onnx_neural_compressor.quantization import quantize
from onnx_neural_compressor.quantization import calibrate
from onnx_neural_compressor import data_reader


class DataReader(calibrate.CalibrationDataReader):
class DataReader(data_reader.CalibrationDataReader):
def __init__(self):
self.encoded_list = []
# append data into self.encoded_list
@@ -127,6 +127,6 @@ quantize(model, output_model_path, qconfig)
* [Contribution Guidelines](./docs/source/CONTRIBUTING.md)
* [Security Policy](SECURITY.md)

## Communication
- [GitHub Issues](https://github.com/onnx/neural-compressor/issues): mainly for bug reports, new feature requests, and questions.
- [Email](mailto:[email protected]): you are welcome to share research ideas on model compression techniques by email for collaboration.
44 changes: 19 additions & 25 deletions docs/quantization.md
@@ -4,10 +4,10 @@ Quantization
1. [Quantization Introduction](#quantization-introduction)
2. [Quantization Fundamentals](#quantization-fundamentals)
3. [Accuracy Aware Tuning](#with-or-without-accuracy-aware-tuning)
4. [Get Started](#get-started)
4.1 [Post Training Quantization](#post-training-quantization)
4.2 [Specify Quantization Rules](#specify-quantization-rules)
4.3 [Specify Quantization Backend and Device](#specify-quantization-backend-and-device)
5. [Examples](#examples)

## Quantization Introduction
@@ -22,7 +22,7 @@ The math equation is like: $$X_{int8} = round(Scale \times X_{fp32} + ZeroPoint)$$

**Affine Quantization**

This is the so-called `asymmetric quantization`, in which we map the min/max range of the float tensor to the integer range. Here the int8 range is [-128, 127] and the uint8 range is [0, 255].

here:

@@ -34,13 +34,13 @@ If UINT8 is specified, $Scale = (|X_{f_{max}} - X_{f_{min}}|) / 255$ and $ZeroPo

**Scale Quantization**

This is the so-called `symmetric quantization`, in which we use the maximum absolute value in the float tensor as the float range and map it to the corresponding integer range.

The math equation is like:

here:

If INT8 is specified, $Scale = max(abs(X_{f_{max}}), abs(X_{f_{min}})) / 127$ and $ZeroPoint = 0$.

or

@@ -61,10 +61,10 @@ Sometimes the reduce_range feature, that's using 7 bit width (1 sign bit + 6 dat
+ Symmetric Quantization
+ int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1)
+ Asymmetric Quantization
+ uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)

#### Reference
+ MLAS: [MLAS Quantization](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/onnx_quantizer.py)
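
To make the scale and zero-point formulas above concrete, here is a minimal NumPy sketch that follows the ONNX Runtime convention listed above, quantizing as `q = clip(round(x / scale) + zero_point, qmin, qmax)`. It is an illustration only, not the library's internal implementation.

``` python
import numpy as np


def symmetric_int8_params(x: np.ndarray):
    # int8: scale = 2 * max(abs(rmin), abs(rmax)) / (max(int8) - min(int8) - 1), zero_point = 0
    rmin, rmax = float(x.min()), float(x.max())
    scale = 2 * max(abs(rmin), abs(rmax)) / (127 - (-128) - 1) or 1.0  # guard against an all-zero tensor
    return scale, 0


def asymmetric_uint8_params(x: np.ndarray):
    # uint8: scale = (rmax - rmin) / (max(uint8) - min(uint8)); zero_point = min(uint8) - round(rmin / scale)
    rmin, rmax = float(x.min()), float(x.max())
    scale = (rmax - rmin) / 255 or 1.0  # guard against a constant tensor
    zero_point = 0 - round(rmin / scale)
    return scale, zero_point


def quantize_tensor(x: np.ndarray, scale: float, zero_point: int, qmin: int, qmax: int):
    # q = clip(round(x / scale) + zero_point, qmin, qmax)
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax)


x = np.random.randn(4, 8).astype(np.float32)
scale, zp = asymmetric_uint8_params(x)
q_uint8 = quantize_tensor(x, scale, zp, 0, 255).astype(np.uint8)
scale, zp = symmetric_int8_params(x)
q_int8 = quantize_tensor(x, scale, zp, -128, 127).astype(np.int8)
```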

### Quantization Approaches

@@ -88,7 +88,7 @@ This approach is major quantization approach people should try because it could

## With or Without Accuracy Aware Tuning

Accuracy-aware tuning is one of the unique features provided by Neural Compressor compared with other third-party model compression tools. It can be used to address the accuracy loss caused by low-precision quantization and other lossy optimization methods.

This tuning algorithm creates a tuning space based on user-defined configurations, generates a quantized graph, and evaluates its accuracy. The optimal model is yielded once the pre-defined accuracy goal is met.
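
Conceptually, the loop behaves like the sketch below; the function and variable names are illustrative placeholders, not the actual tuning implementation.

``` python
def accuracy_aware_tune(fp32_model, config_set, eval_fn, baseline_acc, tolerable_loss=0.01):
    """Sketch of accuracy-aware tuning: try candidate configs until the accuracy goal is met."""
    best_model, best_acc = None, float("-inf")
    for qconfig in config_set:  # tuning space built from user-defined configurations
        q_model = apply_quantization(fp32_model, qconfig)  # placeholder: generate a quantized graph
        acc = eval_fn(q_model)  # evaluate the accuracy of the quantized graph
        if baseline_acc - acc <= tolerable_loss:
            return q_model  # pre-defined accuracy goal met: yield this model
        if acc > best_acc:  # otherwise remember the best candidate seen so far
            best_model, best_acc = q_model, acc
    return best_model
```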

@@ -105,7 +105,7 @@ User could refer to below chart to understand the whole tuning flow.

## Get Started

The design philosophy of the quantization interface of ONNX Neural Compressor is ease of use. It requires the user to provide a `model`, a `calibration dataloader`, and an `evaluation function`; these parameters are used to quantize and tune the model.

`model` is the path to the framework model or the framework model object.
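
For reference, the `calibration dataloader` is a `data_reader.CalibrationDataReader` subclass that yields feed dictionaries. A minimal sketch is shown below; the input name, shape, and random data are assumptions for illustration only.

``` python
import numpy as np

from onnx_neural_compressor import data_reader


class RandomDataReader(data_reader.CalibrationDataReader):
    """Minimal calibration data reader sketch (illustrative only)."""

    def __init__(self, num_samples: int = 8):
        # A real reader would load representative samples from the calibration dataset.
        self.encoded_list = [
            {"input_ids": np.random.randint(0, 100, size=(1, 32), dtype=np.int64)}
            for _ in range(num_samples)
        ]
        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        # Return the next feed dict, or None when the calibration data is exhausted.
        return next(self.iter_next, None)

    def rewind(self):
        # Restart iteration so the reader can be reused.
        self.iter_next = iter(self.encoded_list)
```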

@@ -125,10 +125,10 @@ This means user could leverage ONNX Neural Compressor to directly generate a ful
``` python
from onnx_neural_compressor import config
from onnx_neural_compressor.quantization import quantize
from onnx_neural_compressor.quantization import calibrate
from onnx_neural_compressor import data_reader


class DataReader(calibrate.CalibrationDataReader):
class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -144,17 +144,11 @@ quantize(model, q_model_path, qconfig)
This means the user can leverage the advanced features of ONNX Neural Compressor to tune a quantized model with the best accuracy and good performance. The user should provide an `eval_fn`.

``` python
from onnx_neural_compressor.quantization import calibrate
from onnx_neural_compressor import data_reader
from onnx_neural_compressor.quantization import tuning
from onnx_neural_compressor.quantization import (
    CalibrationDataReader,
    GPTQConfig,
    RTNConfig,
    autotune,
    get_woq_tuning_config,
)

from onnx_neural_compressor import config

class DataReader(calibrate.CalibrationDataReader):
class DataReader(data_reader.CalibrationDataReader):
def get_next(self): ...

def rewind(self): ...
@@ -200,7 +194,7 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<tr>
<th>Backend</th>
<th>Backend Library</th>
<th>Support Device(cpu as default)</th>
</tr>
</thead>
<tbody>
@@ -235,9 +229,9 @@ Neural-Compressor will quantized models with user-specified backend or detecting
<br>

> ***Note***
>
> DmlExecutionProvider support is experimental; please expect exceptions.
>
> Known limitation: the batch size of ONNX models has to be fixed to 1 for DmlExecutionProvider; there is no multi-batch or dynamic batch support yet.


17 changes: 10 additions & 7 deletions onnx_neural_compressor/algorithms/weight_only/awq.py
@@ -49,7 +49,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,
base_dir = os.path.dirname(model.model_path) if model.model_path is not None else ""

for parent, nodes in absorb_pairs.items():
if any([node.input[0] not in output_dicts for node in nodes]):
if any([node.input[0] not in output_dicts for node in nodes]): # pragma: no cover
logger.warning(
"Miss input tensors of nodes {} during AWQ, skip it!".format(
", ".join([node.name for node in nodes if node.input[0] not in output_dicts])
@@ -101,7 +101,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
): # pragma: no cover
):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
q_weight = woq_utility.qdq_tensor(weight, num_bits, group_size, scheme, "uint") / np.expand_dims(
@@ -153,7 +153,9 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,

if parent.op_type in ["LayerNormalization", "BatchNormalization", "InstanceNormalization"] and len(
model.input_name_to_nodes()[nodes[0].input[0]]
) == len(nodes):
) == len(
nodes
): # pragma: no cover
for idx in [1, 2]:
tensor = onnx.numpy_helper.to_array(model.get_initializer(parent.input[idx]), base_dir)
dtype = tensor.dtype
@@ -186,7 +188,7 @@ def _apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits,
updated_nodes.append(parent.name)
output_dicts[parent.output[0]] = output_dicts[parent.output[0]] / np.reshape(best_scale, (1, -1))

else: # pragma: no cover
else:
# insert mul
scale_tensor = onnx.helper.make_tensor(
name=parent.output[0] + "_weight_only_scale",
@@ -256,7 +258,7 @@ def _apply_awq_clip(model, weight_config, absorb_pairs, output_dicts, num_bits,
version.Version(ort.__version__) >= constants.ONNXRT116_VERSION
and num_bits == 4
and group_size == 32
): # pragma: no cover
):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1
weight = woq_utility.qdq_tensor(
@@ -346,7 +348,8 @@ def awq_quantize(
output_names.append(node.input[0])
output_names = list(set(output_names))
model.add_tensors_to_outputs(output_names)
if model.is_large_model:

if model.is_large_model: # pragma: no cover
onnx.save_model(
model.model,
model.model_path + "_augment.onnx",
@@ -376,7 +379,7 @@ ):
):
dump_pairs[parent.name].append(model.get_node(node.name))

if len(dump_pairs[parent.name]) == 0:
if len(dump_pairs[parent.name]) == 0: # pragma: no cover
continue

output_dicts = {}
6 changes: 3 additions & 3 deletions onnx_neural_compressor/algorithms/weight_only/gptq.py
@@ -279,13 +279,13 @@ def gptq_quantize(
weight = onnx.numpy_helper.to_array(
model.get_initializer(model.get_node(node.name).input[1]), base_dir
).copy()
if len(weight.shape) != 2:
if len(weight.shape) != 2: # pragma: no cover
continue

weights.append(weight)
node_list.append(model.get_node(node.name))

if len(weights) == 0:
if len(weights) == 0: # pragma: no cover
continue

Hs = [np.zeros((i.shape[0], i.shape[0])) for i in weights]
@@ -335,7 +335,7 @@ def gptq_quantize(
if ("CUDAExecutionProvider" in providers and satisfy_MatMulNBits_condition) or (
"CUDAExecutionProvider" not in providers
and (satisfy_MatMulFpQ4_condition or satisfy_MatMulNBits_condition)
): # pragma: no cover
):
# MatMulFpQ4 support 4 bits and 32 group_size with ort 1.16.0 and 1.16.1 versions, supported by CPU EP
# MatMulNBits supports 4 bits and 2^n group_size with ort > 1.16.1, supported by CPU EP AND CUDA EP
org_shape = weight.shape
14 changes: 7 additions & 7 deletions onnx_neural_compressor/algorithms/weight_only/utility.py
@@ -29,11 +29,11 @@

from onnx_neural_compressor import constants, utility

if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"): # pragma: no cover
if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"):
import onnxruntime_extensions


def _get_blob_size(group_size, has_zp): # pragma: no cover
def _get_blob_size(group_size, has_zp):
"""Get blob_size.

Args:
@@ -42,9 +42,9 @@ def _get_blob_size(group_size, has_zp): # pragma: no cover
"""
if version.Version(ort.__version__) > constants.ONNXRT1161_VERSION:
blob_size = group_size // 2
elif has_zp:
elif has_zp: # pragma: no cover
blob_size = group_size // 2 + 4 + 1
else:
else: # pragma: no cover
blob_size = group_size // 2 + 4
return blob_size

@@ -109,7 +109,7 @@ def make_matmul_weight_only_node(

# build zero_point tensor
if zero_point is not None:
if num_bits > 4:
if num_bits > 4: # pragma: no cover
packed_zp = np.reshape(zero_point, (1, -1)).astype("uint8")
else:
packed_zp = np.full((zero_point.shape[0] + 1) // 2, 136, dtype="uint8")
@@ -137,7 +137,7 @@ def make_matmul_weight_only_node(
# require onnxruntime > 1.16.3
kwargs["accuracy_level"] = accuracy_level

else:
else: # pragma: no cover
offset = 5 if zero_point is not None else 4
op_type = "MatMulFpQ4"

@@ -201,7 +201,7 @@ def prepare_inputs(model, data_reader, providers):
"""

so = ort.SessionOptions()
if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"): # pragma: no cover
if sys.version_info < (3, 11) and util.find_spec("onnxruntime_extensions"):
so.register_custom_ops_library(onnxruntime_extensions.get_library_path())
if model.is_large_model:
onnx.save_model(
10 changes: 5 additions & 5 deletions onnx_neural_compressor/config.py
@@ -310,7 +310,7 @@ def to_diff_dict(cls, instance) -> Dict[str, Any]:
def from_json_file(cls, filename):
with open(filename, "r", encoding="utf-8") as file:
config_dict = json.load(file)
return cls.from_dict(**config_dict)
return cls.from_dict(config_dict)

def to_json_file(self, filename):
config_dict = self.to_dict()
@@ -543,7 +543,7 @@ def register_supported_configs(cls):
raise NotImplementedError

@classmethod
def get_config_set_for_tuning(cls) -> None:
def get_config_set_for_tuning(cls) -> None: # pragma: no cover
# TODO (Yi) handle the composable config in `tuning_config`
return None

@@ -706,7 +706,7 @@ def get_model_info(model: Union[onnx.ModelProto, pathlib.Path, str]) -> list:
return filter_result

@classmethod
def get_config_set_for_tuning(cls) -> Union[None, "RTNConfig", List["RTNConfig"]]: # pragma: no cover
def get_config_set_for_tuning(cls) -> Union[None, "RTNConfig", List["RTNConfig"]]:
return RTNConfig(weight_bits=[4, 8], weight_sym=[True, False])


@@ -871,7 +871,7 @@ def get_model_info(model: Union[onnx.ModelProto, pathlib.Path, str]) -> list:
return filter_result

@classmethod
def get_config_set_for_tuning(cls) -> Union[None, "GPTQConfig", List["GPTQConfig"]]: # pragma: no cover
def get_config_set_for_tuning(cls) -> Union[None, "GPTQConfig", List["GPTQConfig"]]:
return GPTQConfig(
weight_bits=[4, 8],
weight_sym=[True, False],
@@ -1022,7 +1022,7 @@ def get_model_info(model: Union[onnx.ModelProto, pathlib.Path, str]) -> list:
return filter_result

@classmethod
def get_config_set_for_tuning(cls) -> Union[None, "AWQConfig", List["AWQConfig"]]: # pragma: no cover
def get_config_set_for_tuning(cls) -> Union[None, "AWQConfig", List["AWQConfig"]]:
return AWQConfig(
weight_bits=[4, 8],
weight_sym=[True, False],