# Quantization on Client

1. [Introduction](#introduction)
2. [Get Started](#get-started)

## Introduction

For the RTN and GPTQ algorithms, we provide default configurations for different processor types (client and server). In general, the lightweight configurations are tailored to client devices to improve performance and efficiency.

## Get Started

Here, we take the RTN algorithm as an example to demonstrate usage on a client machine.

```python
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

# Load the model structure without materializing the full weights in memory.
model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# Fetch the default RTN configuration; the processor type (client or server)
# is detected automatically from the hardware.
quant_config = get_default_rtn_config()

# Prepare the model for quantization, then convert it to the quantized model.
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```

> [!TIP]
> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling `get_default_rtn_config`.
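For example, here is a minimal sketch that requests the client configuration explicitly; it assumes `processor_type` is passed as a keyword argument with the string values `"client"` or `"server"`, as described in the tip above:

```python
from neural_compressor.torch.quantization import get_default_rtn_config

# Explicitly request the lightweight client configuration instead of relying
# on automatic hardware detection (string value assumed per the tip above).
quant_config = get_default_rtn_config(processor_type="client")
```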

For Windows machines, run the following command to utilize all available cores automatically:

```bash
python main.py
```

> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
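As an illustration, the commands below set the OpenMP thread count and pin the process to the P-cores before launching the script; the thread count and the core ID range are placeholders that depend on the specific processor:

```bash
# Use one OpenMP thread per P-core (16 is a placeholder; adjust to your CPU).
export OMP_NUM_THREADS=16

# Bind the process to the P-cores; the core ID range 0-15 is illustrative and
# should be replaced with the actual P-core IDs reported by lscpu.
taskset -c 0-15 python main.py
```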

RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., meta-llama/Llama-2-7b-chat-hf. However, for higher accuracy, the GPTQ algorithm is recommended; be prepared for a longer quantization time.