We aim to run these benchmarks and share them with the OPEA community for three primary reasons:
- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.
These benchmarks currently cover the following examples:
- ChatQnA
- DocSum
Before running the benchmarks, ensure you have:
- Kubernetes Environment
  - Kubernetes installation: use kubespray or other official Kubernetes installation guides.
  - (Optional) Kubernetes setup guide on Intel Gaudi product.
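  A quick sanity check before deploying — this is not part of the benchmark scripts, just an assumption that `kubectl` is already configured for your cluster:

  ```bash
  # All nodes should report STATUS=Ready before starting a deployment
  kubectl get nodes -o wide
  ```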
- Configuration YAML

  The configuration file (e.g., `./ChatQnA/benchmark_chatqna.yaml`) consists of two main sections: deployment and benchmarking. Required fields marked with a `# mandatory` comment must be filled with valid values, such as `HUGGINGFACEHUB_API_TOKEN`. For all other fields, you can either customize them according to your needs or leave them empty ("") to use the default values from the Helm charts.

  Default Models:
  - LLM: `meta-llama/Meta-Llama-3-8B-Instruct` (required: must be specified, as it is shared between the deployment and benchmarking phases)
  - Embedding: `BAAI/bge-base-en-v1.5`
  - Reranking: `BAAI/bge-reranker-base`

  You can customize which models to use by setting the `model_id` field in the corresponding service section. Note that the LLM model must be specified in the configuration because it is used by both the deployment and benchmarking processes.

  Important Notes:
  - For Gaudi deployments:
    - The LLM service runs on Gaudi devices.
    - If enabled, the reranking service (teirerank) also runs on Gaudi devices.
  - Llama Model Access:
    - Downloading Llama models requires both:
      - a HuggingFace API token
      - special authorization from Meta
    - Please visit meta-llama/Meta-Llama-3-8B-Instruct to request access.
    - Deployment will fail if the model download is unsuccessful due to missing authorization.
  Node and Replica Configuration:

  ```yaml
  node: [1, 2, 4, 8]          # Number of nodes to deploy
  replicaCount: [1, 2, 4, 8]  # Must align with the node configuration
  ```
  The `replicaCount` values must align with the `node` configuration by index:

  - When deploying on 1 node → uses replicaCount[0] = 1
  - When deploying on 2 nodes → uses replicaCount[1] = 2
  - When deploying on 4 nodes → uses replicaCount[2] = 4
  - When deploying on 8 nodes → uses replicaCount[3] = 8
  Note: Model parameters that accept lists (e.g., `max_batch_size`, `max_num_seqs`) are deployment parameters that affect model service behavior but not the number of service instances. When these parameters are lists, each value triggers a service upgrade followed by a new round of testing, while the number of service instances stays the same.
- Install required Python packages

  Run the following command to install all necessary dependencies:

  ```bash
  pip install -r requirements.txt
  ```
  Note: the benchmark requires `opea-eval>=1.3`; if v1.3 has not been released yet, please build `opea-eval` from source (see the sketch below).
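  One way to build `opea-eval` from source — a sketch only, assuming the package in the GenAIEval repository supports a standard pip install from its checkout:

  ```bash
  # Clone the GenAIEval repository and install opea-eval from source
  git clone https://github.com/opea-project/GenAIEval.git
  cd GenAIEval
  pip install -e .
  ```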
Before running benchmarks, you need to:
- Prepare Test Data

  - Testing for the general benchmark target:

    Download the retrieval file using the command below for data ingestion in RAG:

    ```bash
    wget https://github.com/opea-project/GenAIEval/tree/main/evals/benchmark/data/upload_file.txt
    ```
  - Testing for the pubmed benchmark target:

    For the `chatqna_qlist_pubmed` test case, prepare `pubmed_${max_lines}.txt` by following this README.
  After the data is prepared, please update the absolute path of this file in the benchmark YAML file. For example, in the `ChatQnA/benchmark_chatqna.yaml` file, `/home/sdp/upload_file.txt` should be replaced by your file path (one way to do this is sketched below).
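  A hypothetical one-liner for that replacement, assuming the config still contains the default `/home/sdp/upload_file.txt` placeholder; adjust the target path to your own copy of the file:

  ```bash
  # Point the benchmark config at your own copy of upload_file.txt
  sed -i 's|/home/sdp/upload_file.txt|/path/to/your/upload_file.txt|g' ./ChatQnA/benchmark_chatqna.yaml
  ```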
- Prepare Model Files (Recommended)

  ```bash
  pip install -U "huggingface_hub[cli]"
  sudo mkdir -p /mnt/models
  sudo chmod 777 /mnt/models
  huggingface-cli download --cache-dir /mnt/models meta-llama/Meta-Llama-3-8B-Instruct
  ```
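  Since Meta-Llama-3 is a gated model (see the Llama Model Access notes above), the download only succeeds after your HuggingFace account has been granted access; you will likely also need to authenticate the CLI first:

  ```bash
  # Log in with the same HuggingFace account that was granted Llama access
  huggingface-cli login
  ```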
The benchmarking process consists of two main components: deployment and benchmarking. We provide `deploy_and_benchmark.py` as a unified entry point that combines both steps.

The script `deploy_and_benchmark.py` serves as the main entry point. You can use any example's configuration YAML file. Here are examples using the ChatQnA configuration:
- For a specific number of nodes:

  ```bash
  # Default OOB (Out of Box) mode
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1

  # Or specify the test mode explicitly
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode [oob|tune]
  ```
- For all node configurations:

  ```bash
  # Default OOB (Out of Box) mode
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml

  # Or specify the test mode explicitly
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --test-mode [oob|tune]
  ```
This will process all node configurations defined in your YAML file.
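The same entry point can be used for the other supported examples. For instance, a hypothetical DocSum run (the actual configuration file name may differ — check the DocSum directory for its benchmark YAML):

```bash
# Hypothetical: run the benchmark for the DocSum example
python deploy_and_benchmark.py ./DocSum/benchmark_docsum.yaml
```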
The script provides two test modes controlled by the `--test-mode` parameter:
- OOB (Out of Box) Mode - Default

  ```bash
  --test-mode oob  # or omit the parameter
  ```

  - Uses enabled configurations only:
    - Resources: only used when `resources.enabled` is True
    - Model parameters:
      - Batch parameters are used when `batch_params.enabled` is True
      - Token parameters are used when `token_params.enabled` is True
  - Suitable for basic functionality testing with selected optimizations
- Tune Mode

  ```bash
  --test-mode tune
  ```

  - Applies all configurations regardless of enabled status:
    - Resource-related parameters:
      - `resources.cores_per_instance`: CPU cores allocation
      - `resources.memory_capacity`: Memory allocation
      - `resources.cards_per_instance`: GPU/Accelerator cards allocation
    - Model parameters:
      - Batch parameters:
        - `max_batch_size`: Maximum batch size (TGI engine)
        - `max_num_seqs`: Maximum number of sequences (vLLM engine)
      - Token parameters:
        - `max_input_length`: Maximum input sequence length
        - `max_total_tokens`: Maximum total tokens per request
        - `max_batch_total_tokens`: Maximum total tokens in a batch
        - `max_batch_prefill_tokens`: Maximum tokens in the prefill phase
Choose "oob" mode when you want to selectively enable optimizations, or "tune" mode when you want to apply all available optimizations regardless of their enabled status.
Helm Chart Directory Issues
- During execution, the script downloads and extracts the Helm chart to a directory named after your example.
- The directory name is derived from your input YAML file path.
  - For example: if your input is `./ChatQnA/benchmark_chatqna.yaml`, the extracted directory will be `chatqna/`.
- In some error cases, this directory might not be properly cleaned up.
- If you encounter deployment issues, check whether there is a leftover Helm chart directory:

  ```bash
  # Example: for ./ChatQnA/benchmark_chatqna.yaml
  ls -la chatqna/

  # Clean up if needed
  rm -rf chatqna/
  ```
- After cleaning up the directory, try running the deployment again.
Note: Always ensure there are no leftover Helm chart directories from previous failed runs before starting a new deployment.
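A small sketch of that pre-run check — it assumes the default lowercase directory naming described above and that only the currently supported examples (ChatQnA, DocSum) have been deployed from this working directory:

```bash
# Remove leftover extracted Helm chart directories from earlier failed runs
for dir in chatqna docsum; do
  if [ -d "$dir" ]; then
    echo "Removing leftover chart directory: $dir/"
    rm -rf "$dir"
  fi
done
```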