For the RTN and GPTQ algorithms, we provide default algorithm configurations for different processor types (client and server). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency. Here, we take the RTN algorithm as an example to demonstrate the usage on a client machine.
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

# Load the model skeleton without materializing the full FP32 weights.
model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# Use the default RTN configuration; the processor type is detected automatically.
quant_config = get_default_rtn_config()
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
Tip
By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify processor_type as either client or server when calling get_default_rtn_config.
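For example, a minimal sketch of an explicit selection, assuming processor_type is passed as a keyword argument as described above:

from neural_compressor.torch.quantization import get_default_rtn_config

# Request the lightweight client configuration explicitly instead of
# relying on automatic hardware detection (keyword usage assumed here).
quant_config = get_default_rtn_config(processor_type="client")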
For Windows machines, run the following command to utilize all available cores automatically:
python main.py
Tip
For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance, for example by setting OMP_NUM_THREADS explicitly. For processors with a hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using taskset, as sketched below.
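A hypothetical invocation for a machine whose P-cores are logical CPUs 0-7 (the core list, thread count, and script name are placeholders, not values from the documentation):

OMP_NUM_THREADS=8 taskset -c 0-7 python main.py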
RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., meta-llama/Llama-2-7b-chat-hf. However, for higher accuracy, the GPTQ algorithm is recommended, but be prepared for a longer quantization time.
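A minimal sketch of the GPTQ flow, assuming get_default_gptq_config follows the same prepare/convert pattern shown above; run_calibration_fn is a hypothetical placeholder for a user-supplied calibration loop:

from neural_compressor.torch.quantization import get_default_gptq_config, convert, prepare
from neural_compressor.torch import load_empty_model

float_model = load_empty_model("/path/to/model/state/dict")

# Default GPTQ configuration (assumed to mirror get_default_rtn_config).
quant_config = get_default_gptq_config()
prepared_model = prepare(float_model, quant_config)

# GPTQ requires a calibration pass over sample data before conversion;
# run_calibration_fn is a hypothetical function that feeds a few batches
# through prepared_model.
run_calibration_fn(prepared_model)

quantized_model = convert(prepared_model)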