Official PyTorch implementation of the paper EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
- [2024/10] 🔥 We release a new weight-activation quantization algorithm, PrefixQuant, which is the first work to make static activation quantization outperform dynamic activation quantization.
- [2024/08] The new inference backend T-MAC from Microsoft now supports EfficientQAT models.
- [2024/08] We support quantization of Mistral-Large-Instruct. With our EfficientQAT, W2g64 Mistral-Large-Instruct compresses the 123B model to 35 GB with only a 4-point accuracy degradation.
- [2024/07] New feature! We support transferring EfficientQAT quantized models into GPTQ v2 format and BitBLAS format, which can be loaded directly through GPTQModel.
- [2024/07] We release EfficientQAT, which pushes the limits of uniform (INT) quantization in an efficient manner.
- Clone this repository and navigate to the EfficientQAT folder
git clone https://github.com/OpenGVLab/EfficientQAT.git
cd EfficientQAT
- Install package
conda create -n efficientqat python==3.11
conda activate efficientqat
pip install -r requirements.txt
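Both training phases described below assume a CUDA-capable GPU. As an optional sanity check (a minimal sketch, not part of the official scripts), you can verify that the environment resolves PyTorch and sees the GPU:

```python
# Optional sanity check (not part of the official scripts): confirm that
# PyTorch is installed in the new environment and that CUDA is visible.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```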
We provide a number of pre-quantized EfficientQAT models as follows:
- WikiText2 PPL is measured with a context length of 2048.
- Avg. Accuracy indicates the average accuracy over 5 zero-shot reasoning tasks (WinoGrande, PIQA, HellaSwag, ARC-Easy, ARC-Challenge) with lm-eval v0.4.2 (a reproduction sketch is given after the table).
- 1 GB = $10^9$ bytes.
- Hub Link: EQAT indicates the original checkpoints. We also transfer the checkpoints into GPTQ and BitBLAS formats, which can be loaded directly through GPTQModel. (PS: GPTQModel is an official bug-fixed fork of AutoGPTQ, which will be merged into AutoGPTQ in the future.)
Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
---|---|---|---|---|---|
Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
Llama-2-7B | w4g128 | 5.53 | 64.27 | 3.7 | EQAT|GPTQ|BitBLAS |
Llama-2-7B | w3g128 | 5.81 | 64.02 | 3.1 | EQAT |
Llama-2-7B | w2g64 | 6.86 | 60.14 | 2.3 | EQAT|GPTQ|BitBLAS |
Llama-2-7B | w2g128 | 7.17 | 59.50 | 2.2 | EQAT|GPTQ|BitBLAS |
Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
Llama-2-13B | w4g128 | 4.93 | 67.52 | 6.8 | EQAT|GPTQ|BitBLAS |
Llama-2-13B | w3g128 | 5.12 | 67.28 | 5.6 | EQAT |
Llama-2-13B | w2g64 | 5.96 | 64.88 | 4.0 | EQAT|GPTQ|BitBLAS |
Llama-2-13B | w2g128 | 6.08 | 63.88 | 3.8 | EQAT|GPTQ|BitBLAS |
Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
Llama-2-70B | w4g128 | 3.39 | 72.62 | 35.8 | EQAT|GPTQ|BitBLAS |
Llama-2-70B | w3g128 | 3.61 | 71.76 | 29.1 | EQAT |
Llama-2-70B | w2g64 | 4.52 | 69.48 | 20.1 | EQAT|GPTQ|BitBLAS |
Llama-2-70B | w2g128 | 4.61 | 68.93 | 18.9 | EQAT|GPTQ|BitBLAS |
Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
Llama-3-8B | w4g128 | 6.47 | 68.43 | 5.4 | EQAT|GPTQ|BitBLAS |
Llama-3-8B | w3g128 | 7.09 | 67.35 | 4.7 | EQAT |
Llama-3-8B | w2g64 | 9.41 | 60.76 | 3.9 | EQAT|GPTQ|BitBLAS |
Llama-3-8B | w2g128 | 9.80 | 59.36 | 3.8 | EQAT|GPTQ|BitBLAS |
Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
Llama-3-70B | w4g128 | 3.17 | 74.57 | 38.9 | EQAT|GPTQ|BitBLAS |
Llama-3-70B | w3g128 | 4.19 | 72.42 | 32.2 | EQAT |
Llama-3-70B | w2g64 | 6.08 | 67.89 | 23.2 | EQAT|GPTQ |
Llama-3-70B | w2g128 | 6.38 | 67.57 | 22.0 | EQAT|GPTQ|BitBLAS |
Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
Llama-3-8B-Instruct | w4g128 | 7.93 | 68.39 | 5.4 | EQAT|GPTQ|BitBLAS |
Llama-3-8B-Instruct | w3g128 | 8.55 | 67.24 | 4.7 | EQAT |
Llama-3-8B-Instruct | w2g64 | 11.19 | 60.66 | 3.9 | EQAT|GPTQ|BitBLAS |
Llama-3-8B-Instruct | w2g128 | 11.73 | 60.16 | 3.8 | EQAT|GPTQ|BitBLAS |
Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
Llama-3-70B-Instruct | w4g128 | 5.35 | 73.47 | 38.9 | EQAT|GPTQ|BitBLAS |
Llama-3-70B-Instruct | w3g128 | 5.65 | 72.87 | 32.2 | EQAT |
Llama-3-70B-Instruct | w2g64 | 7.86 | 67.64 | 23.2 | EQAT|GPTQ|BitBLAS |
Llama-3-70B-Instruct | w2g128 | 8.14 | 67.54 | 22.0 | EQAT|GPTQ|BitBLAS |
Mistral-Large-Instruct-2407 | fp16 | 2.74 | 77.76 | 228.5 | - |
Mistral-Large-Instruct-2407 | w2g64 | 5.58 | 73.54 | 35.5 | GPTQ |
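For reference, the Avg. Accuracy column can be reproduced with lm-eval. The sketch below uses lm-eval's Python API (simple_evaluate) with a placeholder model path; whether the table uses the acc or acc_norm metric for each task is an assumption on our side, not something stated by this repo.

```python
# Hedged sketch: averaging zero-shot accuracy over the five tasks with
# lm-eval v0.4.x. The model path is a placeholder, and the choice of the
# "acc,none" metric key per task is an assumption.
from lm_eval import simple_evaluate

tasks = ["winogrande", "piqa", "hellaswag", "arc_easy", "arc_challenge"]
results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",  # placeholder
    tasks=tasks,
    batch_size=8,
)
accs = [results["results"][t]["acc,none"] for t in tasks]
print(f"Avg. Accuracy: {100 * sum(accs) / len(accs):.2f}")
```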
EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). The detailed training scripts can be found in ./examples. We give training script examples for Llama-2-7B with w2g64 quantization below.
- Block-AP
You should modify --model to the folder of the full-precision model in the script before running the following command.
bash examples/block_ap/Llama-2-7b/w2g64.sh
Specifically, the --weight_lr is 2e-5 for 2-bit and 1e-5 for 3-/4-bit in our experiments.
Some other important arguments:
- --train_size: number of training data samples, 4096 by default
- --val_size: number of validation data samples, 64 by default
- --off_load_to_disk: save the training dataset to disk, which saves CPU memory but may reduce training speed
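Block-AP trains all parameters of each block (the weights together with the per-group quantization scales and zero points) to reconstruct the full-precision block's outputs. The toy sketch below illustrates this idea on a single linear layer with a straight-through-estimator fake quantizer; it is a simplified illustration under our own assumptions, not the implementation behind main_block_ap.py.

```python
# Toy sketch of the Block-AP idea on a single linear "block": train the
# weights AND the per-group quantization parameters (scale, zero point)
# to reconstruct the full-precision block's outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quant(w, scale, zero, bits=2, group_size=64):
    # Per-group asymmetric fake quantization. The straight-through estimator
    # is applied only around round(), so gradients reach w, scale, and zero.
    qmax = 2 ** bits - 1
    wg = w.reshape(-1, group_size)
    x = wg / scale + zero
    x = x + (torch.round(x) - x).detach()   # STE for the rounding step
    q = torch.clamp(x, 0, qmax)
    return ((q - zero) * scale).reshape_as(w)


torch.manual_seed(0)
fp_block = nn.Linear(256, 256, bias=False)      # frozen full-precision "block"
q_block = nn.Linear(256, 256, bias=False)       # trainable quantized copy
q_block.weight.data.copy_(fp_block.weight.data)

# Initialize per-group scale / zero point from min-max statistics (w2g64).
wg = q_block.weight.data.reshape(-1, 64)
scale = nn.Parameter((wg.max(1, keepdim=True).values
                      - wg.min(1, keepdim=True).values) / 3)  # qmax = 3 for 2-bit
zero = nn.Parameter(-wg.min(1, keepdim=True).values / scale.data)

opt = torch.optim.AdamW([
    {"params": q_block.parameters(), "lr": 2e-5},   # weights (cf. --weight_lr)
    {"params": [scale, zero], "lr": 1e-4},          # quantization parameters
])

for step in range(200):
    x = torch.randn(32, 256)
    with torch.no_grad():
        target = fp_block(x)                        # full-precision block output
    wq = fake_quant(q_block.weight, scale, zero, bits=2, group_size=64)
    loss = F.mse_loss(F.linear(x, wq), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final reconstruction loss:", loss.item())
```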
- E2E-QP
Then, you can load the quantized model from Block-AP for further E2E-QP. Specifically, E2E-QP can adapt to different scenarios by changing the training dataset. You should modify --quant_model_path to the folder of the quantized model in the script before running the following command.
1) Train on RedPajama
bash examples/e2e_qp/Llama-2-7b/w2g64-redpajama.sh
2) Train on Alpaca
bash examples/e2e_qp/Llama-2-7b/w2g64-alpaca.sh
Specifically, the --learning_rate is 2e-5 for 2-bit and 1e-5 for 3-/4-bit in our experiments. You can decrease --per_device_train_batch_size to reduce the memory footprint during training; make sure to increase --gradient_accumulation_steps by the same factor so that the effective batch size stays the same (effective batch size = per_device_train_batch_size × gradient_accumulation_steps × number of GPUs).
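In E2E-QP, the quantized integer weights stay fixed and only the quantization parameters (the per-group scales) are trained end to end on the chosen dataset. The toy sketch below illustrates this, together with gradient accumulation as discussed above; it is a simplified illustration under our own assumptions, not the repository's training loop.

```python
# Toy sketch of the E2E-QP idea: keep the quantized integer weights fixed
# and train only the quantization parameters (per-group scales) end to end,
# using gradient accumulation to emulate a larger effective batch size.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
bits, group_size = 2, 64
qmax = 2 ** bits - 1

weight = torch.randn(256, 256)                      # would come from Block-AP
wg = weight.reshape(-1, group_size)
scale = nn.Parameter((wg.max(1, keepdim=True).values
                      - wg.min(1, keepdim=True).values) / qmax)
zero = torch.round(-wg.min(1, keepdim=True).values / scale.data)

with torch.no_grad():                               # integer weights stay frozen
    q = torch.clamp(torch.round(wg / scale + zero), 0, qmax)


def quant_forward(x):
    wq = (q - zero) * scale                         # gradients flow into scale only
    return F.linear(x, wq.reshape_as(weight))


teacher = nn.Linear(256, 256, bias=False)           # stand-in for the e2e training loss
opt = torch.optim.AdamW([scale], lr=2e-5)           # cf. --learning_rate
accum_steps = 4                                     # cf. --gradient_accumulation_steps

for step in range(100):
    x = torch.randn(8, 256)                         # cf. --per_device_train_batch_size
    with torch.no_grad():
        target = teacher(x)
    loss = F.mse_loss(quant_forward(x), target) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:               # effective batch size = 8 * 4
        opt.step()
        opt.zero_grad()
```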
- Download the pre-quantized EfficientQAT models from Huggingface
pip install huggingface_hub
huggingface-cli download ChenMnZ/Llama-2-7b-EfficientQAT-w2g64 --local-dir ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64
- Evaluate the pre-quantized EfficientQAT model
CUDA_VISIBLE_DEVICES=0 python main_block_ap.py \
--resume_quant ./output/pre_quantized_models/Llama-2-7b-EfficientQAT-w2g64 \
--net Llama-2 \
--wbits 2 \
--group_size 64 \
--output_dir ./output/inference_results/ \
--eval_ppl \
--eval_tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande
First, you should install the gptqmodel package to support the GPTQ and BitBLAS quantization formats:
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
bash install.sh
- In our experiments, we tested with gptqmodel v0.9.8.
Then, we offer three types of transfer as follows:
- Transfer EfficientQAT checkpoints to GPTQ format
bash examples/model_transfer/efficientqat_to_gptq/llama-2-7b.sh
- Note: AutoGPTQ currently has overflow bugs for asymmetric quantization, so we use GPTQModel, the official bug-fixed version, to transfer our asymmetrically quantized models. Therefore, the GPTQ models provided by this repo can only be loaded successfully through GPTQModel, not AutoGPTQ.
- Transfer EfficientQAT checkpoints to BitBLAS format
bash examples/model_transfer/efficientqat_to_bitblas/llama-2-7b.sh
- The speedup currently has some issues; refer to this issue for details.
- Transfer fp32 data in EfficientQAT checkpoints to half-precision counterparts. Some parameters are kept in fp32 for training; you can convert them to half precision after training to further reduce the model size.
bash examples/model_transfer/fp32_to_16/llama-2-7b.sh
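Conceptually, this conversion just casts the remaining fp32 tensors in the saved checkpoint to fp16. A minimal sketch of that idea, with a placeholder checkpoint path and independent of the script above:

```python
# Minimal sketch of the fp32 -> fp16 conversion idea: cast any remaining
# fp32 tensors in a saved state dict to half precision. The checkpoint path
# is a placeholder; the official conversion is done by the script above.
import torch

ckpt_path = "./output/block_ap_models/Llama-2-7b-w2g64/pytorch_model.bin"  # placeholder
state_dict = torch.load(ckpt_path, map_location="cpu")

converted = {
    name: t.half() if t.dtype == torch.float32 else t
    for name, t in state_dict.items()
}
torch.save(converted, ckpt_path.replace(".bin", "-fp16.bin"))
```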
Below is an example of inference with the GPTQ or BitBLAS quantized formats.
from transformers import AutoTokenizer
from gptqmodel import GPTQModel
quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-GPTQ"
# quant_dir = "ChenMnZ/Llama-2-7b-EfficientQAT-w2g128-BitBLAS"
# or local path
tokenizer = AutoTokenizer.from_pretrained(quant_dir, use_fast=True)
# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_dir)
# inference with model.generate
inputs = tokenizer("Model quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs)[0]))
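Note that calling generate without extra arguments only produces a short continuation; you can pass standard Hugging Face generation arguments (e.g. max_new_tokens) through model.generate for longer outputs.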
If you find this work useful, please consider citing:
@article{efficientqat,
title={EfficientQAT: Efficient Quantization-Aware Training for Large Language Models},
author={Chen, Mengzhao and Shao, Wenqi and Xu, Peng and Wang, Jiahao and Gao, Peng and Zhang, Kaipeng and Qiao, Yu and Luo, Ping},
journal={arXiv preprint arXiv:2407.11062},
year={2024}
}