From 42913064e9ece55c790cbdecb82c8d772d6c42c9 Mon Sep 17 00:00:00 2001
From: Alexander
Date: Thu, 26 Dec 2024 16:32:17 +0400
Subject: [PATCH 1/2] LLM optimization documentation fixes and updates.

---
 .../weight-compression.rst                    | 25 ++++++-------------
 .../4-bit-weight-quantization.rst             |  4 +++
 .../openvino-workflow/model-optimization.rst  | 16 ++++--------
 3 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
index 046dde9661c3bb..d6fdea3f36b8b1 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression.rst
@@ -5,8 +5,9 @@ LLM Weight Compression
    :maxdepth: 1
    :hidden:
 
-   weight-compression/microscaling-quantization
    weight-compression/4-bit-weight-quantization
+   weight-compression/microscaling-quantization
+
 
 
 Weight compression enhances the efficiency of models by reducing their memory footprint,
@@ -16,14 +17,13 @@ Unlike full model quantization, where both weights and activations are quantized
 only targets weights, keeping activations as floating-point numbers. This means
 preserving most of the model's accuracy while improving its speed and reducing its size.
 The reduction in size is especially noticeable with larger models.
-For instance the 7 billion parameter Llama 2 model can be reduced
-from about 25GB to 4GB using 4-bit weight compression.
+For instance, the 8 billion parameter Llama 3 model can be reduced
+from about 16.1 GB to 4.8 GB using 4-bit weight quantization on top of a bfloat16 model.
 
 .. note::
 
-   With smaller language models (i.e. less than 1B parameters), weight
+   With smaller language models (i.e. less than 1B parameters), low-bit weight
    compression may result in more accuracy reduction than with larger models.
-   Therefore, weight compression is recommended for use with LLMs only.
 
 LLMs and other GenAI models that require
 extensive memory to store the weights during inference can benefit
@@ -36,7 +36,7 @@ from weight compression as it:
 
 * improves inference speed by reducing the latency of memory access when computing the
   operations with weights, for example, Linear layers. The weights are smaller and thus faster
   to load from memory;
-* unlike quantization, does not require sample data to calibrate the range of
+* unlike full static quantization, does not require sample data to calibrate the range of
   activation values.
 
 Currently, `NNCF `__
@@ -64,7 +64,7 @@ by running the following command:
 
    pip install optimum[openvino]
 
 **8-bit weight quantization** offers a good balance between reducing the size and lowering the
-accuracy of a model. It usually results in significant improvements for transformer-based models
+accuracy of a model. It usually results in significant improvements for Transformer-based models
 and guarantees good model performance for a vast majority of supported CPU and GPU platforms.
 By default, weights are compressed asymmetrically to "INT8_ASYM" mode.
@@ -223,17 +223,6 @@ depending on the model.
 
 For more details, refer to the article on how to
 :doc:`infer LLMs using Optimum Intel <../../../learn-openvino/llm_inference_guide/llm-inference-hf>`.
 
-The code snippet below shows how to do 4-bit quantization of the model weights represented
-in OpenVINO IR using NNCF:
-
-.. tab-set::
-
-   .. tab-item:: OpenVINO
-      :sync: openvino
-
-      .. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py
-         :language: python
-         :fragment: [compression_4bit]
 
 Refer to the article about
 :doc:`4-bit weight quantization <./weight-compression/4-bit-weight-quantization>`
diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst
index ae9bc7d7b8b4a3..4aca6254ac0291 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst
@@ -134,6 +134,10 @@ trade-offs after optimization:
   the original floating-point precision of the model weights (``INT8_ASYM`` is default value).
 |
 
+.. tip::
+
+   NNCF allows stacking the supported optimization methods. For example, AWQ, Scale Estimation
+   and GPTQ methods can be enabled together to achieve better accuracy.
 
 4-bit Weight Quantization with GPTQ
 ###################################
 
diff --git a/docs/articles_en/openvino-workflow/model-optimization.rst b/docs/articles_en/openvino-workflow/model-optimization.rst
index f5a5f97341e960..e44cf556329bd1 100644
--- a/docs/articles_en/openvino-workflow/model-optimization.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization.rst
@@ -21,24 +21,24 @@ In OpenVINO, the default optimization tool is NNCF (Neural Network Compression F
 It is a `set of compression algorithms `__, organized as a Python package, that make your models
 smaller and faster. Note that NNCF is **not part of the OpenVINO package**, so it needs to
 be installed separately. It supports
-models in **PyTorch**, **TensorFlow** , **ONNX**, and **OpenVINO IR** formats, offering
+models in **OpenVINO IR**, **PyTorch**, **ONNX**, and **TensorFlow** formats, offering
 the following main optimizations:
 
 .. image:: ../assets/images/WHAT_TO_USE.svg
 
 | :doc:`Weight Compression `:
-| an easy-to-use method for Large Language Model footprint reduction and inference
+| An easy-to-use method for Large Language Model footprint reduction and inference
   acceleration.
 | :doc:`Post-training Quantization `:
-| designed to optimize deep learning models by applying 8-bit integer quantization. Being
+| Designed to optimize deep learning models by applying 8-bit integer quantization. Being
   the easiest way to optimize a model it does not require its retraining or fine-tuning
   but may result in a drop in accuracy. If the accuracy-performance tradeoff is not acceptable,
   Training-time Optimization may be a better option.
 
 | :doc:`Training-time Optimization `:
-| involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
+| Involves a suite of advanced methods such as Structured or Unstructured Pruning, as well
   as Quantization-aware Training. This kind of optimization requires the use of the model's
   original framework, for NNCF, it is either PyTorch or TensorFlow.
 
@@ -54,13 +54,7 @@ Recommended workflows
 
 3. If the accuracy drop is unacceptable, use quantization-aware training instead. It will give
    you the same level of performance boost, with a smaller impact on accuracy.
 
-* **Weight compression** works **only with LLMs**. Do not try to use it with other models.
-* For **visual-multimodal** use cases, the encoder / decoder split approach may be recommended.
-
-
-
-
-
+* **Weight compression** works with **LLMs**, **VLMs**, and other Transformer-based models.

From 0b530451a6196bab2473cd2ece3eb0c373a557d5 Mon Sep 17 00:00:00 2001
From: Alexander
Date: Fri, 27 Dec 2024 12:13:07 +0400
Subject: [PATCH 2/2] Fixed issue

---
 .../weight-compression/4-bit-weight-quantization.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst
index 4aca6254ac0291..3994e5550c4e2f 100644
--- a/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst
+++ b/docs/articles_en/openvino-workflow/model-optimization-guide/weight-compression/4-bit-weight-quantization.rst
@@ -133,7 +133,8 @@ trade-offs after optimization:
   There are three modes: INT8_ASYM, INT8_SYM, and NONE, which retains
   the original floating-point precision of the model weights (``INT8_ASYM`` is default value).
-|
+
+
 
 .. tip::
 
    NNCF allows stacking the supported optimization methods. For example, AWQ, Scale Estimation
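
For reference, the default 8-bit flow described in the weight-compression.rst changes above maps to a few lines of Optimum Intel code. The sketch below is illustrative rather than part of the patch: the model ID is hypothetical, and it assumes ``optimum[openvino]`` is installed.

.. code-block:: python

   from optimum.intel import OVModelForCausalLM

   # Hypothetical model ID; any Hugging Face causal LM repository works the same way.
   model_id = "meta-llama/Llama-3.1-8B"

   # Export the model to OpenVINO IR with 8-bit weight compression;
   # load_in_8bit makes the INT8_ASYM weight compression explicit.
   model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
   model.save_pretrained("llama-3.1-8b-int8-ov")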
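
The 4-bit NNCF snippet that patch 1 moves out of weight-compression.rst (the ``[compression_4bit]`` fragment) follows the general ``nncf.compress_weights`` pattern. A minimal sketch, assuming an already exported OpenVINO IR model and hypothetical file paths:

.. code-block:: python

   import openvino as ov
   import nncf

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical path to an OpenVINO IR model

   # Compress weights to 4-bit: ratio controls the share of layers compressed to INT4
   # (the rest stays in the 8-bit backup precision), group_size sets the quantization group.
   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,
       group_size=128,
   )

   ov.save_model(compressed_model, "model_int4.xml")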
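
The stacking described in the new tip corresponds to optional, data-aware arguments of ``nncf.compress_weights``. This is a sketch only: ``model`` is the IR model from the previous snippet, ``calibration_data`` is assumed to be prepared elsewhere, and the exact set of supported arguments depends on the NNCF release.

.. code-block:: python

   import nncf

   # AWQ, Scale Estimation, and GPTQ are data-aware, so a calibration dataset is required;
   # calibration_data is an iterable of model inputs prepared elsewhere (assumption).
   dataset = nncf.Dataset(calibration_data)

   # Enable the accuracy-improving methods together, as the tip suggests.
   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       group_size=128,
       dataset=dataset,
       awq=True,
       scale_estimation=True,
       gptq=True,
   )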
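
Post-training quantization, mentioned in the model-optimization.rst changes, is a separate flow that quantizes activations as well and therefore needs calibration data. A minimal sketch, assuming a hypothetical IR path, a hypothetical ``calibration_samples`` iterable, and a simple transform function:

.. code-block:: python

   import openvino as ov
   import nncf

   core = ov.Core()
   model = core.read_model("model.xml")  # hypothetical IR path

   # transform_fn turns one sample of the original dataset into a model input;
   # calibration_samples is assumed to exist (a few hundred samples are typically enough).
   def transform_fn(sample):
       return sample["input"]

   calibration_dataset = nncf.Dataset(calibration_samples, transform_fn)

   # 8-bit quantization of both weights and activations.
   quantized_model = nncf.quantize(model, calibration_dataset)
   ov.save_model(quantized_model, "model_int8.xml")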