# Benchmarking Script for Large Language Models

This script provides a unified approach to estimate performance for Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models using nearly identical code and pre-collected models.

## 1. Prepare Python Virtual Environment for LLM Benchmarking

```bash
python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
pip install -r requirements.txt
```

> **Note:** For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:

```bash
pip install -U --upgrade-strategy eager -r requirements.txt
```
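To sanity-check the environment before moving on, you can print the installed OpenVINO version and confirm that `optimum-cli` is on the PATH. This is a minimal check, not part of the original setup steps; the version API shown assumes OpenVINO 2022.1 or newer:

```bash
# Print the installed OpenVINO version
python -c "from openvino.runtime import get_version; print(get_version())"
# Confirm the optimum-cli exporter is available
optimum-cli export openvino --help
```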

### (Optional) Hugging Face Login

Log in to Hugging Face if you want to use non-public models:

```bash
huggingface-cli login
```

## 2. Convert Model to OpenVINO IR Format

The `optimum-cli` tool simplifies converting Hugging Face models to the OpenVINO IR format.

Usage:

```bash
optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>

optimum-cli export openvino -h # For detailed information
```

- `--model <MODEL_ID>`: model ID for downloading from huggingface_hub, or a path to a local directory containing the PyTorch model.
- `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32`, `fp16`, `int8`, `int4`, `mxfp4`.
- `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.

> **NOTE:** Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.

Example:

```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
```

Resulting file structure:

    models
    └── llama-2-7b-chat
        ├── config.json
        ├── generation_config.json
        ├── openvino_detokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_model.bin
        ├── openvino_model.xml
        ├── openvino_tokenizer.bin
        ├── openvino_tokenizer.xml
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── tokenizer.model
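For a smaller memory footprint, the weights can also be compressed to 4 bits. A hedged example, assuming a recent optimum-intel release that accepts the `--group-size` and `--ratio` options for int4 compression (the output directory name is illustrative):

```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf \
    --weight-format int4 --group-size 128 --ratio 0.8 models/llama-2-7b-chat-int4
```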

## 3. Benchmark LLM Model

To benchmark the performance of the LLM, use the following command:

```bash
python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/ -n 2
python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
```

Parameters:

- `-m`: Path to the model.
- `-d`: Inference device (default: CPU).
- `-r`: Path to the CSV report.
- `-f`: Framework (default: ov).
- `-p`: Interactive prompt text.
- `-pf`: Path to a JSONL file containing prompts.
- `-n`: Number of iterations (default: 0; the first iteration is excluded).
- `-ic`: Limit the output token size (default: 512) for text generation and code generation models.

Additional options:

```bash
python ./benchmark.py -h # for more information
```
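Several of the parameters above can be combined in one run. A sketch that writes a CSV report and caps the output length; the device name and report path are illustrative, not prescribed by the script:

```bash
# Benchmark on GPU for 5 iterations, limit output to 256 tokens, save results to a CSV report
python benchmark.py -m models/llama-2-7b-chat/ -d GPU -r report.csv -n 5 -ic 256
```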

### Benchmarking the Original PyTorch Model

To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch specified as the framework via the `-f pt` parameter:

```bash
# Download the PyTorch model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
# Benchmark with the PyTorch framework
python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
```

> **Note:** If needed, you can install a specific OpenVINO version using pip:

```bash
# e.g.
pip install openvino==2024.4.0
# Optionally, install the OpenVINO nightly package instead.
# OpenVINO nightly is pre-release software and has not undergone full release validation or qualification.
pip uninstall openvino
pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```

## 4. Benchmark LLM with torch.compile()

The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels using a specified backend.

Before benchmarking, download the original PyTorch model locally:

```bash
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
```

To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` or `openvino` (default). Example:

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
```
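For comparison against the default `openvino` backend, the same run can be pointed at the stock PyTorch backend instead; a hedged variant of the command above:

```bash
# Same benchmark, compiled with the pytorch backend instead of openvino
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend pytorch
```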

> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:

```bash
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
```

## 5. Running on 2-Socket Platforms

The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify this behavior.
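For instance, on Linux you can pin the benchmark process and its memory allocations to a single NUMA node with `numactl`. A sketch; the node IDs depend on your machine's topology:

```bash
# Bind both CPU execution and memory allocation to NUMA node 0
numactl --cpunodebind=0 --membind=0 python benchmark.py -m models/llama-2-7b-chat/ -n 2
```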

| OpenVINO Version | Behavior |
| --- | --- |
| Before 2024.0.0 | `streams.num(1)` executes on 2 sockets. |
| 2024.0.0 | `streams.num(1)` executes on the same socket as the application is running on. |

For example, passing `--load_config config.json` with the following content results in `streams.num(1)` executing on 2 sockets:

```json
{
  "INFERENCE_NUM_THREADS": <NUMBER>
}
```

where `<NUMBER>` is the total number of physical cores across the 2 sockets.
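The config file is then passed on the command line; a minimal sketch, assuming the JSON above was saved as `config.json`:

```bash
python benchmark.py -m models/llama-2-7b-chat/ -n 2 --load_config config.json
```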

## 6. Additional Resources

- **Error Troubleshooting:** Check NOTES.md for solutions to known issues.
- **Image Generation Configuration:** Refer to IMAGE_GEN.md for setting parameters for image generation models.