# LLMs Quantization Recipes

Intel® Neural Compressor supports advanced large language model (LLM) quantization technologies, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ), and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (code-named Sapphire Rapids) with PyTorch, Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with accuracy loss limited to 1%.
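To give a feel for what SmoothQuant does under the hood, here is a minimal numpy sketch of its core idea: migrating quantization difficulty from activations to weights with per-channel scales s_j = max|X_j|^α / max|W_j|^(1−α), which leaves the layer output mathematically unchanged. This is an illustration of the math only, not the Intel® Neural Compressor API; all names below are made up for the example.

```python
import numpy as np

def smooth_quant_scales(x_absmax, w_absmax, alpha=0.5):
    """Per-input-channel migration scales: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return np.power(x_absmax, alpha) / np.power(w_absmax, 1.0 - alpha)

# Toy per-channel statistics for a linear layer with 4 input channels.
x_absmax = np.array([8.0, 2.0, 16.0, 1.0])   # activations with outlier channels
w_absmax = np.array([0.5, 1.0, 0.25, 2.0])   # comparatively smooth weights

s = smooth_quant_scales(x_absmax, w_absmax, alpha=0.5)

# The transform Y = (X / s) @ (diag(s) W) is an identity on the output,
# but (X / s) has much tamer per-channel ranges and quantizes better.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4)) * x_absmax
W = rng.standard_normal((4, 5)) * w_absmax[:, None]
assert np.allclose(X @ W, (X / s) @ (W * s[:, None]))
```

The hyperparameter `alpha` (0.5 here) controls how much difficulty is shifted to the weights; the recipes below essentially record, per model, which settings of this kind reached the accuracy target.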


## IPEX key models

| Models                    | SQ INT8 | WOQ INT8 | WOQ INT4 |
|---------------------------|---------|----------|----------|
| EleutherAI/gpt-j-6b       |         |          |          |
| facebook/opt-1.3b         |         |          |          |
| facebook/opt-30b          |         |          |          |
| meta-llama/Llama-2-7b-hf  |         |          |          |
| meta-llama/Llama-2-13b-hf |         |          |          |
| meta-llama/Llama-2-70b-hf |         |          |          |
| tiiuae/falcon-40b         |         |          |          |

Detailed recipes can be found HERE.
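For intuition on the WOQ INT8 column, the sketch below shows the basic weight-only idea: only the weights are quantized (symmetric, per output channel) while activations stay in floating point, so the model shrinks roughly 4x with a small reconstruction error. This is a hand-rolled numpy illustration under those assumptions, not the Intel® Neural Compressor or IPEX implementation.

```python
import numpy as np

def quantize_weight_int8(w):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    Each row (output channel) gets its own scale so that the row's
    largest-magnitude value maps to +/-127.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float weight for use at inference time."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
q, scale = quantize_weight_int8(w)      # int8 payload + one float scale per row
w_hat = dequantize(q, scale)            # approximate reconstruction
max_err = np.abs(w - w_hat).max()       # bounded by half a quantization step
```

WOQ INT4 follows the same pattern with a 15-level grid (and usually finer-grained groups per row), which is why its recipes need extra care and are called out separately in the notes below.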

Notes:

- This model list comes from IPEX.
- WOQ INT4 recipes will be published soon.