MODEL FAMILY | MODEL NAME (Huggingface hub) | FP32 | BF16 | Static quantization INT8 | Weight only quantization INT8 | Weight only quantization INT4 |
---|---|---|---|---|---|---|
LLAMA | meta-llama/Llama-2-7b-hf | ✅ | ✅ | ✅ | ✅ | ✅ |
LLAMA | meta-llama/Llama-2-13b-hf | ✅ | ✅ | ✅ | ✅ | ✅ |
LLAMA | meta-llama/Llama-2-70b-hf | ✅ | ✅ | ✅ | ✅ | ✅ |
LLAMA | meta-llama/Meta-Llama-3-8B | ✅ | ✅ | ✅ | ✅ | ✅ |
LLAMA | meta-llama/Meta-Llama-3-70B | ✅ | ✅ | ✅ | ✅ | ✅ |
LLAMA | meta-llama/Meta-Llama-3.1-8B-Instruct | ✅ | ✅ | ✅ | ✅ | ✅ |
LLAMA | meta-llama/Llama-3.2-3B-Instruct | ✅ | ✅ | ✅ | ✅ | ✅ |
LLAMA | meta-llama/Llama-3.2-11B-Vision-Instruct | ✅ | ✅ | ✅ | ✅ | |
GPT-J | EleutherAI/gpt-j-6b | ✅ | ✅ | ✅ | ✅ | ✅ |
GPT-NEOX | EleutherAI/gpt-neox-20b | ✅ | ✅ | ✅ | ✅ | ✅ |
DOLLY | databricks/dolly-v2-12b | ✅ | ✅ | ✅ | ✅ | ✅ |
FALCON | tiiuae/falcon-7b | ✅ | ✅ | ✅ | ✅ | ✅ |
FALCON | tiiuae/falcon-11b | ✅ | ✅ | ✅ | ✅ | ✅ |
FALCON | tiiuae/falcon-40b | ✅ | ✅ | ✅ | ✅ | ✅ |
OPT | facebook/opt-30b | ✅ | ✅ | ✅ | ✅ | ✅ |
OPT | facebook/opt-1.3b | ✅ | ✅ | ✅ | ✅ | ✅ |
Bloom | bigscience/bloom-1b7 | ✅ | ✅ | ✅ | ✅ | ✅ |
CodeGen | Salesforce/codegen-2B-multi | ✅ | ✅ | ✅ | ✅ | ✅ |
Baichuan | baichuan-inc/Baichuan2-7B-Chat | ✅ | ✅ | ✅ | ✅ | ✅ |
Baichuan | baichuan-inc/Baichuan2-13B-Chat | ✅ | ✅ | ✅ | ✅ | ✅ |
Baichuan | baichuan-inc/Baichuan-13B-Chat | ✅ | ✅ | ✅ | ✅ | ✅ |
ChatGLM | THUDM/chatglm3-6b | ✅ | ✅ | ✅ | ✅ | ✅ |
ChatGLM | THUDM/chatglm2-6b | ✅ | ✅ | ✅ | ✅ | ✅ |
GPTBigCode | bigcode/starcoder | ✅ | ✅ | ✅ | ✅ | ✅ |
T5 | google/flan-t5-xl | ✅ | ✅ | ✅ | ✅ | ✅ |
MPT | mosaicml/mpt-7b | ✅ | ✅ | ✅ | ✅ | ✅ |
Mistral | mistralai/Mistral-7B-v0.1 | ✅ | ✅ | ✅ | ✅ | ✅ |
Mixtral | mistralai/Mixtral-8x7B-v0.1 | ✅ | ✅ | ✅ | ✅ | |
Stablelm | stabilityai/stablelm-2-1_6b | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen | Qwen/Qwen-7B-Chat | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen | Qwen/Qwen2-7B | ✅ | ✅ | ✅ | ✅ | ✅ |
LLaVA | liuhaotian/llava-v1.5-7b | ✅ | ✅ | ✅ | ✅ | |
GIT | microsoft/git-base | ✅ | ✅ | ✅ | ✅ | |
Yuan | IEITYuan/Yuan2-102B-hf | ✅ | ✅ | ✅ | | |
Phi | microsoft/phi-2 | ✅ | ✅ | ✅ | ✅ | ✅ |
Phi | microsoft/Phi-3-mini-4k-instruct | ✅ | ✅ | ✅ | ✅ | ✅ |
Phi | microsoft/Phi-3-mini-128k-instruct | ✅ | ✅ | ✅ | ✅ | ✅ |
Phi | microsoft/Phi-3-medium-4k-instruct | ✅ | ✅ | ✅ | ✅ | ✅ |
Phi | microsoft/Phi-3-medium-128k-instruct | ✅ | ✅ | ✅ | ✅ | ✅ |
Phi | microsoft/Phi-4-mini-instruct | ✅ | ✅ | ✅ | ✅ | |
Phi | microsoft/Phi-4-multimodal-instruct | ✅ | ✅ | ✅ | ✅ | |
Whisper | openai/whisper-large-v2 | ✅ | ✅ | ✅ | ✅ | ✅ |
Maira | microsoft/maira-2 | ✅ | ✅ | ✅ | ✅ | |
Jamba | ai21labs/Jamba-v0.1 | ✅ | ✅ | ✅ | ✅ | |
DeepSeek | deepseek-ai/DeepSeek-V2.5-1210 | ✅ | ✅ | ✅ | ✅ | |
MODEL FAMILY | MODEL NAME (Huggingface hub) | BF16 | Weight only quantization INT8 |
---|---|---|---|
LLAMA | meta-llama/Llama-2-7b-hf | ✅ | ✅ |
LLAMA | meta-llama/Llama-2-13b-hf | ✅ | ✅ |
LLAMA | meta-llama/Llama-2-70b-hf | ✅ | ✅ |
LLAMA | meta-llama/Meta-Llama-3-8B | ✅ | ✅ |
LLAMA | meta-llama/Meta-Llama-3-70B | ✅ | ✅ |
LLAMA | meta-llama/Meta-Llama-3.1-8B-Instruct | ✅ | ✅ |
LLAMA | meta-llama/Llama-3.2-3B-Instruct | ✅ | ✅ |
LLAMA | meta-llama/Llama-3.2-11B-Vision-Instruct | ✅ | ✅ |
GPT-J | EleutherAI/gpt-j-6b | ✅ | ✅ |
GPT-NEOX | EleutherAI/gpt-neox-20b | ✅ | ✅ |
DOLLY | databricks/dolly-v2-12b | ✅ | ✅ |
FALCON | tiiuae/falcon-11b | ✅ | ✅ |
FALCON | tiiuae/falcon-40b | ✅ | ✅ |
OPT | facebook/opt-30b | ✅ | ✅ |
OPT | facebook/opt-1.3b | ✅ | ✅ |
Bloom | bigscience/bloom-1b7 | ✅ | ✅ |
CodeGen | Salesforce/codegen-2B-multi | ✅ | ✅ |
Baichuan | baichuan-inc/Baichuan2-7B-Chat | ✅ | ✅ |
Baichuan | baichuan-inc/Baichuan2-13B-Chat | ✅ | ✅ |
Baichuan | baichuan-inc/Baichuan-13B-Chat | ✅ | ✅ |
GPTBigCode | bigcode/starcoder | ✅ | ✅ |
T5 | google/flan-t5-xl | ✅ | ✅ |
Mistral | mistralai/Mistral-7B-v0.1 | ✅ | ✅ |
Mixtral | mistralai/Mixtral-8x7B-v0.1 | ✅ | ✅ |
MPT | mosaicml/mpt-7b | ✅ | ✅ |
Stablelm | stabilityai/stablelm-2-1_6b | ✅ | ✅ |
Qwen | Qwen/Qwen-7B-Chat | ✅ | ✅ |
Qwen | Qwen/Qwen2-7B | ✅ | ✅ |
GIT | microsoft/git-base | ✅ | ✅ |
Phi | microsoft/phi-2 | ✅ | ✅ |
Phi | microsoft/Phi-3-mini-4k-instruct | ✅ | ✅ |
Phi | microsoft/Phi-3-mini-128k-instruct | ✅ | ✅ |
Phi | microsoft/Phi-3-medium-4k-instruct | ✅ | ✅ |
Phi | microsoft/Phi-3-medium-128k-instruct | ✅ | ✅ |
Whisper | openai/whisper-large-v2 | ✅ | ✅ |
DeepSeek | deepseek-ai/DeepSeek-V2.5-1210 | ✅ | ✅ |
Note: The above verified models (including other models in the same model family, like "codellama/CodeLlama-7b-hf" from the LLAMA family) are well supported with all optimizations like indirect access KV cache, fused ROPE, and customized linear kernels. Work is in progress to better support the models in the tables with various data types. In addition, more models will be optimized in the future.
ipex.llm provides a single script to facilitate running generation tasks as below:
python run.py --help # for more detailed usages
Key args of run.py | Notes |
---|---|
generation | default: beam search (beam size = 4), "--greedy" for greedy search |
input tokens or prompt | use "--input-tokens" to set a fixed input prompt size with <INPUT_LENGTH> in [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 32768, 130944]; if "--input-tokens" is not used, use "--prompt" to provide a custom string as input |
input images | default: None, use "--image-url" to choose the image link address for vision-text tasks |
vision text tasks | default: False, add "--vision-text-model" if your model (like the Llama 3.2 11B Vision model) runs vision-text generation tasks; otherwise text-only generation is assumed |
output tokens | default: 32, use "--max-new-tokens" to choose any other size |
batch size | default: 1, use "--batch-size" to choose any other size |
token latency | enable "--token-latency" to print out the first and next token latencies |
generation iterations | use "--num-iter" and "--num-warmup" to control the repeated iterations of generation, default: 100-iter/10-warmup |
streaming mode output | greedy search only (works with "--greedy"), use "--streaming" to enable streaming generation output |
KV Cache dtype | default: auto, use "--kv-cache-dtype=fp8_e5m2" to enable the E5M2 KV Cache; refer to vLLM FP8 E5M2 KV Cache for more information |
input mode | default: 0, use "--input-mode" to choose input mode for multimodal models. 0: language; 1: vision; 2: speech; 3: vision and speech |
input audios | default: None, use "--audio" to choose the audio link address for speech tasks |
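For illustration, several of these arguments can be combined in a single invocation. The example below is hypothetical; the model choice, token sizes, and core settings are arbitrary, and all flags come from the table above:
# greedy, streaming generation with a fixed 1024-token prompt, 128 new tokens, and per-token latency reporting
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex --input-tokens 1024 --max-new-tokens 128 --batch-size 1 --greedy --streaming --token-latency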
Note: You may need to log in to your HuggingFace account to access the model files. Please refer to HuggingFace login.
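For example, you can log in from a terminal before launching the scripts (the CLI ships with the huggingface_hub package):
huggingface-cli login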
Alternatively, you can run the Jupyter Notebook to see ipex.llm with BF16 and various other quick start examples.
Additional setup instructions for running the notebook can be found here.
Note: The following "OMP_NUM_THREADS" and "numactl" settings are based on the assumption that the target server has 56 physical cores per numa socket, and we benchmark with 1 socket. Please adjust the settings per your hardware.
# FP32 (stock PyTorch, without ipex.llm optimizations)
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype float32
# FP32 with ipex.llm optimizations
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype float32 --ipex
# BF16 with ipex.llm optimizations
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex
# Static quantization (SmoothQuant) INT8
wget https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/cpu/2/llama3-1-8b_qconfig.json
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-smooth-quant --qconfig-summary-file llama3-1-8b_qconfig.json --output-dir "saved_results"
# Weight-only quantization INT8
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results"
# Weight-only quantization INT4 (pre-quantized GPTQ checkpoint)
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --local-dir ./Llama-3.1-8B-GPTQ
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m ./Llama-3.1-8B-GPTQ --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --output-dir "saved_results"
# Distributed inference with DeepSpeed: BF16
deepspeed --bind_cores_to_rank run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex --autotp --shard-model
# Distributed inference with DeepSpeed: weight-only quantization INT8
deepspeed --bind_cores_to_rank run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --autotp --shard-model --output-dir "saved_results"
# Distributed inference with DeepSpeed: weight-only quantization INT4 (GPTQ checkpoint)
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --local-dir ./Llama-3.1-8B-GPTQ
deepspeed --bind_cores_to_rank run.py --benchmark -m ./Llama-3.1-8B-GPTQ --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --autotp --output-dir "saved_results"
For the quantized models used in accuracy tests below, we can reuse the model files that are named "best_model.pt" in the "--output-dir" path (generated during inference performance tests above).
Check Advanced Usage for details.
# The following "OMP_NUM_THREADS" and "numactl" settings are based on the assumption that
# the target server has 56 physical cores per numa socket, and we benchmark with 1 socket.
# Please adjust the settings per your hardware.
# run_accuracy.py script is inside single_instance directory.
cd single_instance
# Running FP32 model
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype float32 --ipex --tasks lambada_openai
# Running BF16 model
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex --tasks lambada_openai
# Quantization. Assuming the quantized model is generated at "../saved_results/best_model.pt".
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --quantized-model-path "../saved_results/best_model.pt" --dtype int8 --ipex --quant-with-amp --tasks lambada_openai
# Assuming the pre-sharded Llama model is generated at "saved_results/llama_local_shard/" folder.
# run_accuracy_with_deepspeed.py script is under "distributed" directory.
cd distributed
# Distributed inference in FP32
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model "../saved_results/llama_local_shard/" --dtype float32 --ipex --tasks lambada_openai
# Distributed inference in BF16
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model "../saved_results/llama_local_shard/" --dtype bfloat16 --ipex --tasks lambada_openai
# Distributed inference with Weight-Only Quantization
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model "../saved_results/llama_local_shard/" --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --tasks lambada_openai
A bash script is provided to simplify environment configuration and the command launch.
Steps:
- Enter the llm directory.
- Create a hostfile.txt following the instructions of DeepSpeed.
- Find the network interface name used for node communication via ifconfig or ibv_devices, e.g., eth0.
- Open the tools/run_scaling.sh script and update the required information in lines 3 to 11 according to your environment and needs.
- Run the command below to launch distributed inference among nodes:
bash tools/run_scaling.sh
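For reference, a DeepSpeed hostfile.txt is a plain-text list of hosts and their slot counts, one machine per line. A minimal sketch with placeholder IP addresses and slot counts:
# hostfile.txt (placeholder IPs; set slots to the number of ranks per machine)
192.168.20.1 slots=2
192.168.20.2 slots=2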
The docker image built in the environment setup tutorial supports SSH connections for distributed execution across multiple machines via Ethernet. However, it is intended to run as one single container on each machine. Inside each docker container, multiple inference instances can be launched by the deepspeed command.
Use the command below on all machines to launch the docker containers. This command uses the host network interfaces inside the docker container, so you need to put the host IP addresses into the hostfile.txt. Do NOT launch multiple docker containers on one single machine from the same docker image; such containers would listen on the same port on the same machine, which would result in unpredictable SSH connections.
docker run --rm -it --privileged -v /dev/shm:/dev/shm --net host ipex-llm:main bash
Note: For models on HuggingFace that require access privileges, you need to run the huggingface-cli login command in each docker container to configure a HuggingFace access token.
- Command:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <MODEL_ID> --dtype float32 --ipex
- An example of Llama-3.1-8B model:
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype float32 --ipex
- Command:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex
- An example of Llama-3.1-8B model:
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex
We use the SmoothQuant algorithm, a popular post-training quantization method for LLMs, to get good accuracy with static quantization. By default, we enable quantization mixed with FP32 inference (non-quantized OPs run with fp32 dtype). To get better performance, you may add "--quant-with-amp" to enable quantization with Automatic Mixed Precision inference (non-quantized OPs run with bf16 dtype). Please note that static quantization with AMP is still experimental and may lead to accuracy drops and other issues.
- Command:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <MODEL_ID> --ipex-smooth-quant --qconfig-summary-file <path to the qconfig of the model_id> --output-dir "saved_results"
- An example of Llama-3.1-8B model:
wget https://intel-extension-for-pytorch.s3.amazonaws.com/miscellaneous/llm/cpu/2/llama3-1-8b_qconfig.json
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-smooth-quant --qconfig-summary-file llama3-1-8b_qconfig.json --output-dir "saved_results"
We provide the following qconfig summary files with good quality (calibrated on the "NeelNanda/pile-10k" dataset and evaluated for accuracy on the "lambada_openai" dataset):
Model ID | Download links |
---|---|
meta-llama/Llama-2-13b-hf | link |
meta-llama/Llama-2-70b-hf | link |
meta-llama/Meta-Llama-3.1-8B-Instruct | link |
EleutherAI/gpt-j-6b | link |
tiiuae/falcon-7b | link |
tiiuae/falcon-11b | link |
tiiuae/falcon-40b | link |
facebook/opt-30b | link |
facebook/opt-1.3b | link |
baichuan-inc/Baichuan2-7B-Chat | link |
baichuan-inc/Baichuan-13B-Chat | link |
THUDM/chatglm2-6b | link |
bigscience/bloom-1b7 | link |
Salesforce/codegen-2B-multi | link |
mosaicml/mpt-7b | link |
microsoft/phi-2 | link |
openai/whisper-large-v2 | link |
If you would like to generate qconfig summary files (due to changes on model variants or calibration dataset), please follow the tuning examples provided by Intel® Neural Compressor.
For Weight-only Quantization (WoQ) INT8, weights are quantized by round-to-nearest (RTN).
- Command for WoQ INT8:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <MODEL_ID> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results"
- An example for Llama-3.1-8B model:
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results"
Notes:
- Please note that <MODEL_ID> should be the ID of a non-quantized model instead of any quantized version on HuggingFace.
- Automatic Mixed Precision (AMP) is recommended to get peak performance and fair accuracy. It is turned on by "--quant-with-amp" or turned off by removing the option.
- By default, computation is done in bfloat16 whether AMP is turned on or not. The computation dtype can be specified with "--lowp-mode". Available options are FP32, FP16, BF16, and INT8.
- By default, weights are quantized per channel. Use "--group-size" for group-wise quantization.
- The command above works fine for most models listed. However, to get better accuracy for the following models, some changes to the command are needed:
Model ID | Changes to command |
---|---|
bigcode/starcoder | Add "--group-size 128" |
baichuan-inc/Baichuan-13B-Chat | Remove "--quant-with-amp" |
baichuan-inc/Baichuan2-13B-Chat | Add "--group-size 64" |
bigscience/bloom-1b7 | Remove "--quant-with-amp"; add "--group-size 128" |
EleutherAI/gpt-neox-20b | Remove "--quant-with-amp"; add "--group-size 256" |
facebook/opt-30b | Remove "--quant-with-amp" |
databricks/dolly-v2-12b | Remove "--quant-with-amp"; add "--lowp-mode FP32" |
stabilityai/stablelm-2-1_6b | Add "--group-size 128" |
meta-llama/Meta-Llama-3-70B | Add "--group-size 128" |
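For instance, applying the bigcode/starcoder row above to the generic WoQ INT8 command would give a command like the following (same core/memory settings as the other examples on this page):
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m bigcode/starcoder --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --group-size 128 --output-dir "saved_results"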
For Weight-only Quantization (WoQ) INT4, weights are quantized into int4 by different quantization algorithms. Among them, we support RTN, GPTQ, AWQ and intel/auto-round.
To run with RTN, the command is similar to that of WoQ INT8, and you need to provide the ID of a non-quantized model:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <MODEL_ID> --ipex-weight-only-quantization --weight-dtype INT4 --quant-with-amp --output-dir "saved_results"
To run with GPTQ, AWQ, and intel/auto-round, you need to download or generate quantized weights beforehand.
If the INT4 quantized weight checkpoint files of the desired model can be found in HuggingFace Models, you can download them and benchmark with the following commands:
huggingface-cli download <INT4_MODEL_ID> --local-dir <INT4_CKPT_SAVE_PATH>
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run.py --benchmark -m <INT4_CKPT_SAVE_PATH> --ipex-weight-only-quantization --quant-with-amp --lowp-mode [INT8|BF16]
Here is an example to run Llama-3.1-8B with GPTQ:
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --local-dir ./Llama-3.1-8B-GPTQ
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m ./Llama-3.1-8B-GPTQ --ipex-weight-only-quantization --quant-with-amp --lowp-mode BF16
Notes:
- You cannot use the ID of a quantized model on HuggingFace directly for benchmarking. Please download it and provide the local path.
- By default, computation is done in INT8 for WoQ INT4 if "--lowp-mode" is not specified.
- For GPTQ with desc_act=True, INT8 computation is not available. You have to set "--lowp-mode BF16" explicitly.
If the quantized INT4 checkpoint of the desired model is not available in HuggingFace Models, you can quantize the model using Intel® Neural Compressor (INC). INC supports WoQ INT4 quantization with GPTQ, AWQ and intel/auto-round algorithms.
Please refer to INC's tutorial to generate the INT4 weight checkpoint files in a separate python environment. When the quantization process finishes, use the same command to run the model:
# Switch back to IPEX environment first.
conda activate llm
# "./llama_3_1_8B_INT4_GPTQ" is the example path of the output INT4 checkpoint.
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m ./llama_3_1_8B_INT4_GPTQ --ipex-weight-only-quantization --quant-with-amp --lowp-mode BF16
If your INT4 checkpoints are not from HuggingFace or INC, please make sure the directory has the same structure as those on HuggingFace.
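As a rough illustration only (file names vary by model and by quantization tool), a GPTQ-style INT4 checkpoint directory downloaded from HuggingFace typically contains files along these lines:
Llama-3.1-8B-GPTQ/
├── config.json              # model config; quantization settings may be embedded here or kept in a separate quantize_config.json
├── generation_config.json
├── model.safetensors        # packed INT4 weights (possibly sharded into several files plus an index)
├── tokenizer.json
├── tokenizer_config.json
└── special_tokens_map.json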
(1) numactl is used to specify the memory and cores of your hardware for better performance. <node N> specifies the numa node id (e.g., 0 to use the memory from the first numa node). <physical cores list> specifies the physical cores you are using from the <node N> numa node (e.g., 0-55 for the 56 cores of the first numa node). You can use the lscpu command in Linux to check the numa node information.
(2) The <MODEL_ID> (e.g., "meta-llama/Llama-2-13b-hf") specifies the model you will run. We provide some verified <MODEL_ID> values in the Optimized Model List. You can also try other models from HuggingFace Models.
(3) For all quantization benchmarks, both the quantization and inference stages are triggered by default. The quantization stage auto-generates the quantized model named "best_model.pt" in the "--output-dir" path, and the inference stage then launches inference with the quantized model "best_model.pt". For inference-only benchmarks (to avoid repeating the quantization stage), you can also reuse these quantized models by adding "--quantized-model-path <output_dir + "best_model.pt">", as in the example below.
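For example, an inference-only rerun of the WoQ INT8 Llama-3.1-8B benchmark above could reuse the saved artifact as follows (a sketch assuming the default "saved_results" output directory from the earlier command):
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --quantized-model-path "./saved_results/best_model.pt"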
In the DeepSpeed cases below, we recommend "--shard-model" to shard the model weights more evenly for better memory usage when running with DeepSpeed.
If "--shard-model" is used, a copy of the sharded model weight files is saved in the path given by "--output-dir" (default path is "./saved_results" if not provided). If you have used "--shard-model" and generated such a sharded model path (or your model weight files are already well sharded), in further repeated benchmarks please remove "--shard-model" and replace "-m <MODEL_ID>" with "-m <shard model path>" to skip the repeated sharding steps.
Besides, the standalone shard model function/scripts are also provided in the Advanced Usage section, in case you would like to generate the shard model weights files in advance before running distributed inference.
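For example, after a first DeepSpeed run with "--shard-model" has written the shards (the shard path below is an assumed example, matching the pre-sharded path used elsewhere on this page), a repeated BF16 benchmark can point at the sharded copy directly:
deepspeed --bind_cores_to_rank run.py --benchmark -m ./saved_results/llama_local_shard --dtype bfloat16 --ipex --autotp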
- Command:
deepspeed --bind_cores_to_rank run.py --benchmark -m <MODEL_ID> --dtype float32 --ipex --autotp --shard-model
- An example of Llama-3.1-8B model:
deepspeed --bind_cores_to_rank run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype float32 --ipex --autotp --shard-model
- Command:
deepspeed --bind_cores_to_rank run.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex --autotp --shard-model
- An example of Llama-3.1-8B model:
deepspeed --bind_cores_to_rank run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex --autotp --shard-model
More details about WoQ INT8 can be found in the section above.
For weight-only quantization with DeepSpeed, we quantize the model and then run the benchmark. The quantized model won't be saved.
- Command:
deepspeed --bind_cores_to_rank run.py --benchmark -m <MODEL_ID> --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --autotp --shard-model --output-dir "saved_results"
Similar to single instance usage, we need to update some arguments of the running command specifically for some models to achieve better accuracy.
Model ID | Changes to command |
---|---|
EleutherAI/gpt-j-6b | Remove "--quant-with-amp"; add "--dtype float32" |
EleutherAI/gpt-neox-20b | Remove "--quant-with-amp"; add "--lowp-mode FP32 --dtype float32 --group-size 256" |
bigcode/starcoder | Add "--group-size 128" |
baichuan-inc/Baichuan-13B-Chat | Remove "--quant-with-amp"; add "--dtype float32" |
baichuan-inc/Baichuan2-13B-Chat | Add "--group-size 64" |
bigscience/bloom-1b7 | Remove "--quant-with-amp"; add "--group-size 128" |
facebook/opt-30b | Remove "--quant-with-amp"; add "--dtype float32" |
databricks/dolly-v2-12b | Remove "--quant-with-amp"; add "--lowp-mode FP32 --dtype float32" |
stabilityai/stablelm-2-1_6b | Add "--group-size 128" |
meta-llama/Meta-Llama-3-70B | Add "--group-size 128" |
- An example of Llama-3.1-8B model:
deepspeed --bind_cores_to_rank run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --ipex --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --autotp --shard-model --output-dir "saved_results"
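As another illustration, applying the EleutherAI/gpt-neox-20b row from the table above changes the command to roughly:
deepspeed --bind_cores_to_rank run.py --benchmark -m EleutherAI/gpt-neox-20b --ipex --ipex-weight-only-quantization --weight-dtype INT8 --lowp-mode FP32 --dtype float32 --group-size 256 --autotp --shard-model --output-dir "saved_results"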
We can either download a quantized weight checkpoint from HuggingFace Models, quantize the model using INC with the GPTQ/AWQ/AutoRound algorithms, or quantize the model with the RTN algorithm within IPEX. Please refer to the instructions for details.
- Command:
deepspeed --bind_cores_to_rank run.py --benchmark -m <INT4_CKPT_PATH> --ipex --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --autotp --output-dir "saved_results"
- Example with GPTQ INT4 Llama-3.1-8B model:
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --local-dir ./Llama-3.1-8B-GPTQ
deepspeed --bind_cores_to_rank run.py --benchmark -m ./Llama-3.1-8B-GPTQ --ipex --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --autotp --output-dir "saved_results"
There are some model-specific requirements to be aware of:
- For MPT models from the remote hub, we need to modify config.json to use the modeling_mpt.py in transformers. Therefore, in the following scripts, we need to pass an extra configuration file like "--config-file=model_config/mosaicml_mpt-7b_config.json".
- For Falcon models from the remote hub, we need to modify config.json to use the modeling_falcon.py in transformers. Therefore, in the following scripts, we need to pass an extra configuration file like "--config-file=model_config/tiiuae_falcon-40b_config.json". This is optional for FP32/BF16 but needed for quantization.
- For LLaVA models from the remote hub, additional setup is required, i.e., running "bash ./tools/prepare_llava.sh".
Intel® Xeon® CPU Max Series processors are equipped with high bandwidth memory (HBM), which further accelerates LLM inference. For the common case where both HBM and DDR are installed in a Xeon® CPU Max Series server, the memory mode can be configured to Flat Mode or Cache Mode. Details about memory modes can be found in Section 3.1 of the Xeon® CPU Max Series Configuration Guide.
In cache mode, only the DDR address space is visible to software and HBM functions as a transparent memory-side cache for DDR. Therefore the usage is the same as the common usage described above.
In flat mode, HBM and DDR are exposed to software as separate address spaces.
Therefore we need to check the HBM_NODE_INDEX of interest with commands like lscpu; then the LLM inference command would be like:
- Command:
OMP_NUM_THREADS=<HBM node cores num> numactl -m <HBM_NODE_INDEX> -C <HBM cores list> python run.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex
- An example of Llama-3.1-8B model with HBM numa node index being 2:
OMP_NUM_THREADS=56 numactl -m 2 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex
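If you are unsure which NUMA node index corresponds to HBM, the topology can be inspected with standard tools; this is a quick check, not part of the original instructions, and the exact output depends on your platform (in flat mode HBM usually appears as additional NUMA nodes):
# list NUMA nodes and their sizes to identify the HBM node index
lscpu | grep -i numa
numactl -H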
Note: For some very large models we may get an "OOM Error" due to HBM capacity limitations. In this case we can change the "-m" argument of numactl to "-p" in the above command to enable model inference with the larger DDR memory.
- Command:
OMP_NUM_THREADS=<HBM node cores num> numactl -p <HBM_NODE_INDEX> -C <HBM cores list> python run.py --benchmark -m <MODEL_ID> --dtype bfloat16 --ipex
- An example of Llama-3.1-8B model with HBM numa node index being 2:
OMP_NUM_THREADS=56 numactl -p 2 -C 0-55 python run.py --benchmark -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex
As HBM has memory capacity limitations, we need to shard the model in advance with DDR memory. Please follow the example.
Then we can invoke distributed inference with the deepspeed command:
- Command:
deepspeed --bind_cores_to_rank run.py --benchmark -m <SHARDED_MODEL_PATH> --dtype bfloat16 --ipex --autotp
As the model has been sharded, we specify SHARDED_MODEL_PATH for the "-m" argument instead of the original model name or path, and the "--shard-model" argument is not needed.
- An example of Llama-3.1-8B model:
python utils/create_shard_model.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --save-path ./local_llama3_1_8b
deepspeed --bind_cores_to_rank run.py --benchmark -m ./local_llama3_1_8b --dtype bfloat16 --ipex --autotp
To save memory usage, we could shard the model weights under the local path before we launch distributed tests with DeepSpeed.
cd ./utils
# general command:
python create_shard_model.py -m <MODEL ID> --save-path <SHARD MODEL PATH>
# After sharding the model, use -m <SHARD MODEL PATH> in later tests
# An example of Llama-3.1-8B:
python create_shard_model.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --save-path ./local_llama3_1_8b
ipex.llm focuses on LLM performance optimizations, yet we also provide example scripts for validating the models from an accuracy perspective. We leverage lm-evaluation-harness for the accuracy tests and recommend testing the accuracy of most models with the "lambada_openai" task. For some models, like Salesforce/codegen-2B-multi and mosaicml/mpt-7b, we recommend testing their accuracy with the "hellaswag" task. For more candidate tasks for accuracy validation, please check the lm-evaluation-harness task table.
cd ./single_instance
- Command:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run_accuracy.py -m <MODEL_ID> --dtype float32 --ipex --tasks {TASK_NAME}
- An example of Llama-3.1-8B model:
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype float32 --ipex --tasks lambada_openai
- Command:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <physical cores list> python run_accuracy.py -m <MODEL_ID> --dtype bfloat16 --ipex --tasks {TASK_NAME}
- An example of Llama-3.1-8B model:
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --ipex --tasks lambada_openai
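As noted above, some models are better evaluated with the "hellaswag" task; for example (same hardware assumptions as the commands above):
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m mosaicml/mpt-7b --dtype bfloat16 --ipex --tasks hellaswag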
For the quantized models to be used in accuracy tests, we can reuse the model files that are named "best_model.pt" in the "--output-dir" path (generated during inference performance tests).
- Command:
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_accuracy.py --model <MODEL ID> --quantized-model-path "../saved_results/best_model.pt" --dtype <int8 or int4> --tasks <TASK_NAME>
# Please add "--quant-with-amp" if your model is quantized with this flag
- An example of Llama-3.1-8B model:
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_accuracy.py -m meta-llama/Meta-Llama-3.1-8B-Instruct --quantized-model-path "../saved_results/best_model.pt" --dtype int8 --ipex --quant-with-amp --tasks lambada_openai
We provide a run_accuracy_with_deepspeed.py script for testing the accuracy of models benchmarked in a distributed way via deepspeed.
Prior to the accuracy testing, we need to have the sharded model. The sharded model should have been generated by following the instructions for performance benchmarking with deepspeed, where the "--shard-model" flag is set. The generated model shards will be placed in the folder specified by the "--output-dir" argument.
Alternatively, the model sharding process can also be accomplished in a standalone way.
Then we can test the accuracy with the following commands, in which "-m" or "--model" is given the path of the folder of the sharded model instead of the original model ID.
# Run distributed accuracy with 2 ranks of one node
cd ./distributed
- Command:
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model <SHARD MODEL PATH> --dtype float32 --ipex --tasks <TASK_NAME>
- An example of a pre-sharded Llama model:
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model ../saved_results/llama_local_shard --dtype float32 --ipex --tasks lambada_openai
- Command:
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model <SHARD MODEL PATH> --dtype bfloat16 --ipex --tasks <TASK_NAME>
- An example of a pre-sharded Llama model:
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model ../saved_results/llama_local_shard --dtype bfloat16 --ipex --tasks lambada_openai
- Command:
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model <SHARD MODEL PATH> --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --ipex --tasks <TASK_NAME>
Similar to script usage for performance benchmarking, we need to update some arguments of the running command specifically for some models to achieve better accuracy.
Model ID | Changes to command |
---|---|
EleutherAI/gpt-j-6b | Remove "--quant-with-amp"; add "--dtype float32" |
EleutherAI/gpt-neox-20b | Remove "--quant-with-amp"; add "--lowp-mode FP32 --dtype float32 --group-size 256" |
bigcode/starcoder | Add "--group-size 128" |
baichuan-inc/Baichuan-13B-Chat | Remove "--quant-with-amp"; add "--dtype float32" |
baichuan-inc/Baichuan2-13B-Chat | Add "--group-size 64" |
bigscience/bloom-1b7 | Remove "--quant-with-amp"; add "--group-size 128" |
facebook/opt-30b | Remove "--quant-with-amp"; add "--dtype float32" |
databricks/dolly-v2-12b | Remove "--quant-with-amp"; add "--lowp-mode FP32 --dtype float32" |
- An example of a pre-sharded INT8 Llama model:
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model ../saved_results/llama_local_shard --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --ipex --tasks <TASK_NAME>
Please check the instructions for WoQ INT4 performance benchmarking for details on how to download or generate the INT4 quantized checkpoint files.
INT4 checkpoints cannot be pre-sharded, so in the command "--model" should be set to the path of the downloaded or generated checkpoint.
- Command:
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model <INT4_CKPT_PATH> --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --ipex --tasks <TASK_NAME>
- An example to run Llama-3.1-8B:
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --local-dir ./Llama-3.1-8B-GPTQ
deepspeed --num_accelerators 2 --master_addr `hostname -I | sed -e 's/\s.*$//'` --bind_cores_to_rank run_accuracy_with_deepspeed.py --model ./Llama-3.1-8B-GPTQ --ipex-weight-only-quantization --weight-dtype INT4 --lowp-mode BF16 --quant-with-amp --ipex --tasks lambada_openai
The performance results on AWS instances can be found here.
The LLM inference methods introduced on this page apply to AWS instances as well. Simply follow the instructions above to enjoy the boosted LLM performance with Intel® Extension for PyTorch* optimizations on AWS instances.