v2.15.0

@nikita-malininn released this on 06 Feb 10:08

Post-training Quantization:

Features:

  • (TensorFlow) The nncf.quantize() method is now the recommended API for Quantization-Aware Training. Please refer to the example for details on how to use the new approach; a sketch also follows this list.
  • (TensorFlow) The placement of compression layers in a model can now be serialized and restored with the new API functions nncf.tensorflow.get_config() and nncf.tensorflow.load_from_config(). Please see the documentation on saving/loading a quantized model for more details; a sketch follows this list.
  • (OpenVINO) Added an example of LLM quantization to FP8 precision.
  • (TorchFX, Experimental) Introduced preview support for the new quantize_pt2e API, enabling quantization of torch.fx.GraphModule models with the OpenVINOQuantizer and X86InductorQuantizer quantizers. The quantize_pt2e API utilizes the MinMax algorithm's statistic collectors, as well as the SmoothQuant, BiasCorrection, and FastBiasCorrection Post-Training Quantization algorithms; a sketch follows this list.
  • Added unification of scales for the ScaledDotProductAttention operation.
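
As a quick illustration of the new TensorFlow QAT flow, here is a minimal sketch. The model, calibration data, and transform function are placeholders chosen for illustration, not part of this release; substitute your own pipeline.

```python
import nncf
import tensorflow as tf

# Placeholder model and calibration data (assumptions for illustration only).
model = tf.keras.applications.MobileNetV2(weights=None)
calibration_data = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([32, 224, 224, 3])
).batch(1)

# nncf.Dataset wraps a data source; the transform function maps an item to model input.
calibration_dataset = nncf.Dataset(calibration_data, lambda item: item)

# Insert quantizers with the unified API, then fine-tune as usual (the QAT step).
quantized_model = nncf.quantize(model, calibration_dataset)
quantized_model.compile(optimizer="adam", loss="categorical_crossentropy")
# quantized_model.fit(train_data, epochs=1)  # fine-tune on your training data
```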
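
The save/load round trip below is a sketch building on the previous snippet; persisting the weights via Keras checkpointing is an assumption on our part, not something this release prescribes.

```python
import nncf.tensorflow

# Serialize the placement of compression layers in the quantized model.
config = nncf.tensorflow.get_config(quantized_model)
quantized_model.save_weights("quantized.weights.h5")  # weight storage is up to you

# Later: rebuild the compression layers on a fresh float model, then restore weights.
restored_model = nncf.tensorflow.load_from_config(model, config)
restored_model.load_weights("quantized.weights.h5")
```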
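
For the experimental TorchFX path, the sketch below shows the intended shape of the API. The import path for quantize_pt2e and OpenVINOQuantizer is an assumption based on the experimental namespace (check the NNCF documentation for the exact location), and the model and calibration data are placeholders.

```python
import torch
import torchvision.models as models
import nncf
# Assumed import path (experimental); verify against the NNCF documentation.
from nncf.experimental.torch.fx import OpenVINOQuantizer, quantize_pt2e

model = models.resnet18().eval()  # placeholder model
example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture the model as a torch.fx.GraphModule via the PyTorch 2 export path.
fx_model = torch.export.export(model, example_inputs).module()

# Placeholder calibration items; each is passed to the model as-is.
calibration_dataset = nncf.Dataset([torch.randn(1, 3, 224, 224) for _ in range(4)])

# Quantize with OpenVINOQuantizer; X86InductorQuantizer can be used analogously.
quantized_model = quantize_pt2e(fx_model, OpenVINOQuantizer(), calibration_dataset)
```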

Fixes:

  • (ONNX) Fixed sporadic accuracy issues with the BiasCorrection algorithm.
  • (ONNX) Fixed GroupConvolution operation weight quantization, which also improves performance for a number of models.
  • Fixed the AccuracyAwareQuantization algorithm to resolve issue #3118.
  • Fixed an issue with using NNCF when a backend framework installation is potentially corrupted.

Improvements:

  • (TorchFX, Experimental) Added YoloV11 support.
  • (OpenVINO) Improved the performance of the FastBiasCorrection algorithm.
  • Significantly faster data-free weight compression for OpenVINO models: INT4 compression is now up to 10x faster, while INT8 compression is up to 3x faster. The larger the model, the greater the time reduction.
  • AWQ weight compression is now up to 2x faster, improving overall runtime efficiency.
  • Peak memory usage during INT4 data-free weight compression in the OpenVINO backend is reduced by up to 50% for certain models.

Deprecations/Removals:

  • (TensorFlow) The nncf.tensorflow.create_compressed_model() method is now deprecated. Please use nncf.quantize() for quantization initialization instead; a migration sketch follows.
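
A rough migration sketch follows; the placeholder model and data mirror the QAT example above, and the deprecated call is shown commented out since it requires a legacy NncfConfig.

```python
import nncf
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)  # placeholder
data = tf.data.Dataset.from_tensor_slices(tf.random.uniform([8, 224, 224, 3])).batch(1)
calibration_dataset = nncf.Dataset(data)

# Before (deprecated), config-driven initialization:
#   from nncf.tensorflow import create_compressed_model
#   compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# After: a single call with a calibration dataset.
quantized_model = nncf.quantize(model, calibration_dataset)
```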

Requirements:

  • Updated the minimum supported numpy version (>=1.24.0).
  • Removed the tqdm dependency.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@rk119
@devesh-2002