This repository contains the code for FrameQuant, a Post-Training Quantization (PTQ) algorithm for Transformer models. FrameQuant can quantize Transformers to ultra-low bitwidths (roughly 2 bits) while offering additional flexibility through fractional bitwidths. Please see our paper at arXiv:2403.06082 for more details on the algorithm and experiments.
All our dependencies are listed in `requirements.txt` (we also provide a Docker image with all of them preinstalled; see below):

```bash
pip install -r requirements.txt
```
We use the Fast Hadamard Transform implementation from Dao-AILab for our fast random projections. Follow their instructions to install the package, or run

```bash
bash install_fast_hadamard_tx.sh
```
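To sanity-check the installation, here is a minimal sketch (our illustration, not a script from this repo) of applying the package's `hadamard_transform` entry point to a CUDA tensor:

```python
# Minimal sketch: a quick check that fast_hadamard_transform is installed
# and working. Requires a CUDA device.
import torch
from fast_hadamard_transform import hadamard_transform

d = 4096                        # last dimension must be supported (e.g., a power of 2)
x = torch.randn(8, d, device="cuda", dtype=torch.float16)

# Scale by 1/sqrt(d) so the transform is orthonormal.
y = hadamard_transform(x, scale=d ** -0.5)
print(y.shape)                  # torch.Size([8, 4096])
```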
We also provide a Docker image on Docker Hub with all the packages installed. From within the FrameQuant directory, simply run

```bash
docker run --ipc=host --gpus all -it -v "$PWD:/workspace" harshauwm163/fq:0.97
```

to start using FrameQuant.
This repository contains the code for our core algorithm. We provide an implementation for quantizing Llama2 models (more models to come!). Our code is built on top of GPTQ and GPTQ-for-LLaMa. Here is the command to run FrameQuant on Llama2-7B:
```bash
# For a Llama model located at /data/hf-llama-2-7b, run
python llama.py /data/hf-llama-2-7b c4 --eval --new-eval --tff_transform --tff_redundancy 1.0 --wbits 2 --save
```
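The `--wbits` flag sets the bits per quantized coefficient and `--tff_redundancy` sets the frame redundancy; together they determine the effective bits per weight, which is how fractional bitwidths arise. A rough back-of-the-envelope illustration (our sketch, not code from the repo):

```python
# Rough illustration: with frame redundancy r, a weight block of dimension d
# is represented by ~r*d frame coefficients, each stored at `wbits` bits, so
# the effective bitwidth per original weight is approximately wbits * r.
def effective_bitwidth(wbits: int, redundancy: float) -> float:
    return wbits * redundancy

print(effective_bitwidth(2, 1.0))  # 2.0 bits (the command above)
print(effective_bitwidth(2, 1.2))  # 2.4 bits, a fractional effective bitwidth
```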
The command below runs inference with the quantized model:

```bash
# /data/hf-llama-2-7b should contain the config files and the tokenizer for the original model
# ./results/Llama_FQ should contain packed_model.ckpt, generated by the quantization script above
python inference.py /data/hf-llama-2-7b ./results/Llama_FQ
```
Here is a comparison of FrameQuant to other PTQ methods.
Performance of Post-Training Quantized (PTQ) LLMs from the Llama2 class on language modeling (perplexity on WikiText2 and C4; see the paper for the full table).
Performance of Post-Training Quantized (PTQ) LLMs from the OPT class on language modeling (perplexity on WikiText2 and C4; see the paper for the full table).
Downstream performance of FrameQuant
The LLMs quantized with FrameQuant also perform better on downstream tasks. We use LM-eval-harness to evaluate the Llama2-7B model quantized with various methods and show that FrameQuant achieves the best performance on all tasks (a sketch of such an evaluation follows the table below).
Methods compared: Full-Precision, GPTQ, QuIP, and FrameQuant, on ARC (challenge), ARC (easy), BoolQ, HellaSwag, PIQA, and WinoGrande; see the paper for the full table of results.
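For reference, here is a minimal sketch of how such an evaluation can be run with EleutherAI's lm-evaluation-harness Python API (our illustration, not a script from this repo; the model path is an example, and the API assumes lm-eval v0.4+):

```python
# Minimal sketch: evaluate a model on the downstream tasks listed above
# and print the aggregated metrics.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=/data/hf-llama-2-7b",  # example path, not from the repo
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag", "piqa", "winogrande"],
)
print(results["results"])
```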
The quantized models are available on Hugging Face:
model | bitwidth | link
---|---|---
Llama2-7B | 2 bits | https://huggingface.co/uw-madison/Llama2-7B-FrameQuant-2bit
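As an illustration, the checkpoint can be fetched with the `huggingface_hub` client (a sketch, not a script from this repo):

```python
# Sketch: download the quantized checkpoint from the Hugging Face Hub.
from huggingface_hub import snapshot_download

# Downloads every file in the model repo to the local cache and returns the path.
local_dir = snapshot_download(repo_id="uw-madison/Llama2-7B-FrameQuant-2bit")
print(local_dir)
```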
Please cite our work if you find it interesting!
```bibtex
@InProceedings{adepuFQIcml24,
  author    = {Harshavardhan Adepu and Zhanpeng Zeng and Li Zhang and Vikas Singh},
  title     = {FrameQuant: Flexible Low-Bit Quantization for Transformers},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2024},
  month     = {July},
}
```