
# LLM Quantization and Benchmarking

## Selected Model

### Quantized to GPTQ format

The base model is quantized to GPTQ format using AutoGPTQ for GPU inference, at 4-bit precision (medium size, balanced quality).
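
For illustration, here is a minimal sketch of this quantization step with AutoGPTQ; the model id, calibration example, group size, and output directory below are assumptions, not the exact values used here.

```python
# Sketch: 4-bit GPTQ quantization with AutoGPTQ (names are illustrative).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "bigcode/starcoderbase-3b"  # assumed base model id
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit precision, as described above
    group_size=128,  # common default; trades size against quality
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ calibrates on sample inputs; a real run uses a few hundred examples.
examples = [tokenizer("def add(a, b):\n    return a + b")]
model.quantize(examples)
model.save_quantized("starcoderbase-3b-GPTQ")
```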

### Quantized to GGUF format

The base model is quantized to GGUF format using llama.cpp for CPU inference, at 4-bit (q4_k_m) precision (medium size, balanced quality).
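
The llama.cpp flow has two steps: convert the HF checkpoint to a GGUF file, then quantize it. Below is a sketch driving both steps from Python; the script and binary names (`convert-hf-to-gguf.py`, `llama-quantize`) differ across llama.cpp versions, so treat the paths as assumptions.

```python
# Sketch: HF checkpoint -> f16 GGUF -> 4-bit Q4_K_M, via llama.cpp tools.
# Paths and tool names below are assumptions; adjust to your llama.cpp checkout.
import subprocess

# Step 1: convert the HF model directory to an f16 GGUF file.
subprocess.run(
    [
        "python", "llama.cpp/convert-hf-to-gguf.py",
        "./starcoderbase-3b",
        "--outtype", "f16",
        "--outfile", "starcoderbase-3b-f16.gguf",
    ],
    check=True,
)

# Step 2: quantize the f16 GGUF down to 4-bit q4_k_m.
subprocess.run(
    [
        "llama.cpp/llama-quantize",
        "starcoderbase-3b-f16.gguf",
        "starcoderbase-3b-Q4_K_M.gguf",
        "Q4_K_M",
    ],
    check=True,
)
```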

## Benchmark for starcoderbase-3b (Quantized and Non-Quantized)

The benchmarks are run using lm-evaluation-harness.

Here is the benchmarking script.
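
For illustration, the same task set can also be run through the harness's Python API (v0.4+); the sketch below assumes the base model id, the batch size, and the `autogptq` model arg for the quantized run.

```python
# Sketch: running the code2text and bigbench tasks with lm-evaluation-harness.
# Model args and batch size are assumptions, not the exact settings used here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigcode/starcoderbase-3b",  # assumed base model
    tasks=[
        "codexglue_code2text",
        "bigbench_code_line_description_generate_until",
        "bigbench_code_line_description_multiple_choice",
    ],
    batch_size=8,
)
print(results["results"])

# For the GPTQ model, point at the quantized weights instead, e.g.
# model_args="pretrained=./starcoderbase-3b-GPTQ,autogptq=True" (assumed flag).
```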

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 1.3519 | ± 0.3067 | 0.9254 | ± 0.2109 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 1.5781 | ± 0.3734 | 1.4702 | ± 0.4813 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 1.2778 | ± 0.1991 | 0.6907 | ± 0.6907 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 1.1443 | ± 0.1181 | 0.9469 | ± 0.0339 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.5171 | ± 0.5171 | 0.5171 | ± 0.5171 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 2.8338 | ± 1.5323 | 1.1676 | ± 0.2156 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.7601 | ± 0.7601 | 0.7601 | ± 0.7601 |

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 | 0 | ± 0 |
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.25 | ± 0.0564 | 0.3 | ± 0.0597 |

## Benchmark for starcoderbase-1b (Quantized and Non-Quantized)

The benchmarks are run using lm-evaluation-harness.

Here is the benchmarking script.

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.8767 | ± 0.0592 | 0.7959 | ± 0.2180 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 1.0054 | ± 0.0983 | 0.9280 | ± 0.0291 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 1.2158 | ± 0.1657 | 1.2112 | ± 0.1703 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 0.8560 | ± 0.0429 | 0.8848 | ± 0.0391 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.9879 | ± 0.0887 | 0.6055 | ± 0.6055 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 1.1950 | ± 0.2819 | 1.1460 | ± 1.1460 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.0000 | ± 0.0000 | 0.0000 | ± 0.0000 |

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 | 0 | ± 0 |
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.15 | ± 0.0465 | 0.1333 | ± 0.0443 |

## Challenges and Adapted Solutions

- While benchmarking with lm-evaluation-harness, I encountered an issue, which I have raised here in detail. I fixed this issue with an MR.
- While researching, I also tried bigcode-evaluation-harness for benchmarking, but I ran into a usage issue, which I have discussed here at length.
- Since I was working on Colab with a single T4 GPU and a Kaggle kernel with two T4 GPUs, limited compute resources were a major issue.

## Some notable attempts

While researching and implementing, I tried a few approaches that are not included in the final implementation.