
# LLM Quantization and Benchmarking

## Selected Model

### Quantized to GPTQ format

The base model is quantized to GPTQ format using AutoGPTQ for GPU inference, at 4-bit precision (medium size, balanced quality).
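
For illustration, here is a minimal sketch of this quantization step with AutoGPTQ; the model id, calibration example, group size, and output directory below are assumptions, not the exact values used here.

```python
# Sketch: 4-bit GPTQ quantization with AutoGPTQ (names are illustrative).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "bigcode/starcoderbase-3b"  # assumed base model id
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit precision, as described above
    group_size=128,  # common default; trades size against quality
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ calibrates on sample inputs; a real run uses a few hundred examples.
examples = [tokenizer("def add(a, b):\n    return a + b")]
model.quantize(examples)
model.save_quantized("starcoderbase-3b-GPTQ")
```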

### Quantized to GGUF format

The base model is quantized to GGUF format using llama.cpp for CPU inference, at 4-bit (q4_k_m) precision (medium size, balanced quality).
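
The llama.cpp flow has two steps: convert the HF checkpoint to a GGUF file, then quantize it. Below is a sketch driving both steps from Python; the script and binary names (`convert-hf-to-gguf.py`, `llama-quantize`) differ across llama.cpp versions, so treat the paths as assumptions.

```python
# Sketch: HF checkpoint -> f16 GGUF -> 4-bit Q4_K_M, via llama.cpp tools.
# Paths and tool names below are assumptions; adjust to your llama.cpp checkout.
import subprocess

# Step 1: convert the HF model directory to an f16 GGUF file.
subprocess.run(
    [
        "python", "llama.cpp/convert-hf-to-gguf.py",
        "./starcoderbase-3b",
        "--outtype", "f16",
        "--outfile", "starcoderbase-3b-f16.gguf",
    ],
    check=True,
)

# Step 2: quantize the f16 GGUF down to 4-bit q4_k_m.
subprocess.run(
    [
        "llama.cpp/llama-quantize",
        "starcoderbase-3b-f16.gguf",
        "starcoderbase-3b-Q4_K_M.gguf",
        "Q4_K_M",
    ],
    check=True,
)
```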

## Benchmark for starcoderbase-3b (Quantized and Non-Quantized)

The benchmarks are run using lm-evaluation-harness.

Here is the benchmarking script.
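
For illustration, the same task set can also be run through the harness's Python API (v0.4+); the sketch below assumes the base model id, the batch size, and the `autogptq` model arg for the quantized run.

```python
# Sketch: running the code2text and bigbench tasks with lm-evaluation-harness.
# Model args and batch size are assumptions, not the exact settings used here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=bigcode/starcoderbase-3b",  # assumed base model
    tasks=[
        "codexglue_code2text",
        "bigbench_code_line_description_generate_until",
        "bigbench_code_line_description_multiple_choice",
    ],
    batch_size=8,
)
print(results["results"])

# For the GPTQ model, point at the quantized weights instead, e.g.
# model_args="pretrained=./starcoderbase-3b-GPTQ,autogptq=True" (assumed flag).
```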

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 1.3519 | ± 0.3067 | 0.9254 | ± 0.2109 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 1.5781 | ± 0.3734 | 1.4702 | ± 0.4813 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 1.2778 | ± 0.1991 | 0.6907 | ± 0.6907 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 1.1443 | ± 0.1181 | 0.9469 | ± 0.0339 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.5171 | ± 0.5171 | 0.5171 | ± 0.5171 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 2.8338 | ± 1.5323 | 1.1676 | ± 0.2156 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.7601 | ± 0.7601 | 0.7601 | ± 0.7601 |

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 | 0 | ± 0 |
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.25 | ± 0.0564 | 0.3 | ± 0.0597 |

## Benchmark for starcoderbase-1b (Quantized and Non-Quantized)

The benchmarks are run using lm-evaluation-harness.

Here is the benchmarking script.

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| codexglue_code2text | N/A | none | None | smoothed_bleu_4 | 0.8767 | ± 0.0592 | 0.7959 | ± 0.2180 |
| - code2text_go | 1 | none | None | smoothed_bleu_4 | 1.0054 | ± 0.0983 | 0.9280 | ± 0.0291 |
| - code2text_java | 1 | none | None | smoothed_bleu_4 | 1.2158 | ± 0.1657 | 1.2112 | ± 0.1703 |
| - code2text_javascript | 1 | none | None | smoothed_bleu_4 | 0.8560 | ± 0.0429 | 0.8848 | ± 0.0391 |
| - code2text_php | 1 | none | None | smoothed_bleu_4 | 0.9879 | ± 0.0887 | 0.6055 | ± 0.6055 |
| - code2text_python | 1 | none | None | smoothed_bleu_4 | 1.1950 | ± 0.2819 | 1.1460 | ± 1.1460 |
| - code2text_ruby | 3 | none | None | smoothed_bleu_4 | 0.0000 | ± 0.0000 | 0.0000 | ± 0.0000 |

| Tasks | Version | Filter | n-shot | Metric | Value (non-quantized) | Stderr (non-quantized) | Value (GPTQ) | Stderr (GPTQ) |
|---|---|---|---|---|---|---|---|---|
| bigbench_code_line_description_generate_until | 1 | none | None | exact_match | 0 | ± 0 | 0 | ± 0 |
| bigbench_code_line_description_multiple_choice | 0 | none | None | acc | 0.15 | ± 0.0465 | 0.1333 | ± 0.0443 |

## Challenges and Adapted Solutions

- While benchmarking with lm-evaluation-harness, I encountered an issue, which I have raised here in detail. I fixed this issue with an MR.
- While researching, I also tried bigcode-evaluation-harness for benchmarking, but I ran into a usage issue, which I have discussed here at length.
- Since I was working on Colab with a single T4 GPU and a Kaggle kernel with two T4 GPUs, limited compute resources were a major issue.

## Some notable attempts

While researching and implementing, I tried a few approaches that are not included in the final implementation.