# Large Language Model Throughput Testing Framework

This framework tests the **throughput performance** of Large Language Models (LLMs) across different deployment methods: **offline inference** (using PaddlePaddle and PyTorch) and **online inference** (via API). It focuses on evaluating the model's ability to handle **batch queries**, measuring throughput in tokens per second under various configurations and batch sizes.

## Features

 * **Diverse Deployment Method Support:** Tests LLMs served through an online API as well as offline inference with PaddlePaddle and PyTorch.
 * **Batch Query Throughput Calculation:** Accurately measures throughput (tokens/s) for concurrent queries, providing insight into the model's performance under load.
 * **Detailed Time Logging:** Records the total time for each batch processing operation.

## How It Works

This framework operates by sending batched requests to the specified endpoints (API or local inference scripts) and collecting performance data on how the model generates responses for multiple queries simultaneously.

1. **Input Queries:** You provide a set of questions as test input, typically from a `.parquet` or text file.
2. **Batch Processing:** The framework groups these questions into batches of a specified `rollout_input_batch_size`.
3. **Generate Responses:** For each query within a batch, the framework requests the model to generate `rollout_n` responses.
4. **Time Measurement:** The total time from sending a batch of questions to receiving all corresponding responses is recorded.
5. **Token Statistics:** The total number of tokens generated across all responses in the batch is summed.
6. **Throughput Calculation:** Throughput (tokens/s) is the total number of generated tokens divided by the total time taken for the batch to complete (a minimal sketch of this calculation follows below).
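
As a concrete illustration of steps 4–6, the sketch below times one batch of rollouts and computes tokens/s. The `generate_batch` and `count_tokens` helpers are hypothetical placeholders for whichever backend (API, PaddlePaddle, or PyTorch) and tokenizer are under test; they are not functions from this framework.

```python
import time

def measure_batch_throughput(prompts, rollout_n, generate_batch, count_tokens):
    """Time one batch of rollouts and report throughput in tokens/s.

    `generate_batch(prompts, n)` and `count_tokens(text)` are hypothetical hooks
    standing in for the actual backend and tokenizer.
    """
    start = time.time()
    # One request per prompt, each asking for `rollout_n` responses.
    responses = generate_batch(prompts, n=rollout_n)  # list of lists of strings
    completion_time = time.time() - start

    total_tokens = sum(count_tokens(text) for group in responses for text in group)
    return {
        "completion_time": completion_time,
        "total_response_tokens": total_tokens,
        "throughput_tokens_per_sec": total_tokens / completion_time,
    }
```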

## Usage

This section details how to run throughput tests for each deployment method using the provided shell scripts.

### Data Preparation

To run the tests, first download the `rl_data.tar.gz` archive, which contains the GSM8K dataset in a format suitable for testing.

```bash
cd llm/benchmark/rl
wget https://paddle-qa.bj.bcebos.com/paddlenlp/rl_data.tar.gz
tar -zxvf rl_data.tar.gz
```

Extracting the archive creates a `data` folder containing the GSM8K dataset.
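
To sanity-check the extracted data before running a benchmark, you can peek at the parquet file with pandas. The snippet assumes nothing about the column layout; it simply prints whatever the file contains:

```python
import pandas as pd

# Load the GSM8K split used as benchmark input and show its shape and columns.
df = pd.read_parquet("./data/gsm8k/instruct/train.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.head(3))
```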

### Online API Inference

This script tests the throughput of a remote LLM API.

**Configuration (`api_serve.sh`):**

```bash
output_dir="api_serve_results"

python api_serve.py \
    --openai_urls "your_url1" "your_url2" \
    --api_keys "key1" "key2" \
    --model "Qwen2.5-7B-Instruct-1M" \
    --tokenizer "Qwen/Qwen2.5-7B-Instruct-1M" \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --use_fastdeploy true \
    --rollout_input_batch_size 8 \
    --rollout_n 8 \
    --top_p 1.0 \
    --temperature 0.7 \
    --max_dec_len 8192 \
    --limit_rows 512
```

 * **`--openai_urls`**: URLs of the API endpoints to test.
 * **`--api_keys`**: API keys for authentication (if required).
 * **`--model`**: Name of the model being tested.
 * **`--tokenizer`**: Path or name of the tokenizer.
 * **`--input_file`**: Path to the input dataset file.
 * **`--output_dir`**: Directory to save output results.
 * **`--use_fastdeploy`**: Use FastDeploy if true, otherwise use vLLM (default: true).
 * **`--rollout_input_batch_size`**: The batch size for API requests.
 * **`--rollout_n`**: Number of responses to generate for each input query.
 * **`--max_dec_len`**: Maximum decoding length for responses.
 * **`--limit_rows`**: Limit on the number of input rows processed.

**Run command:**

```bash
bash scripts/api_serve.sh
```
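
The request pattern behind this script (one call per prompt, asking the endpoint for `rollout_n` completions) can be approximated against any OpenAI-compatible server with the official `openai` client. The sketch below is an illustration under that assumption, not the actual implementation in `api_serve.py`; the URL, key, and prompt are placeholders:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="your_url1", api_key="key1")  # placeholder endpoint and key

prompts = ["A GSM8K-style question goes here."]  # placeholder input
rollout_n = 8

start = time.time()
responses = []
for prompt in prompts:
    # Ask the server for `rollout_n` sampled completions of the same prompt.
    resp = client.chat.completions.create(
        model="Qwen2.5-7B-Instruct-1M",
        messages=[{"role": "user", "content": prompt}],
        n=rollout_n,
        temperature=0.7,
        top_p=1.0,
        max_tokens=8192,
    )
    responses.extend(choice.message.content for choice in resp.choices)
elapsed = time.time() - start

print(f"{len(responses)} responses in {elapsed:.2f}s")
```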

### Offline PaddlePaddle Inference

This script tests the throughput of an LLM using offline PaddlePaddle inference, optionally with distributed (tensor-parallel) execution.

**Configuration (`paddle_infer.sh`):**

```bash
unset PADDLE_TRAINERS_NUM
unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT

export PYTHONPATH="your_paddlenlp_path/PaddleNLP":$PYTHONPATH
export PYTHONPATH="your_paddlenlp_path/PaddleNLP/llm":$PYTHONPATH

export FLAGS_set_to_1d=False
export NVIDIA_TF32_OVERRIDE=0
export FLAGS_dataloader_use_file_descriptor=False
export HF_DATASETS_DOWNLOAD_TIMEOUT=1
export FLAGS_gemm_use_half_precision_compute_type=False
export FLAGS_force_cublaslt_no_reduced_precision_reduction=True

export FLAGS_custom_allreduce=0
export FLAGS_mla_use_tensorcore=0
export FLAGS_cascade_attention_max_partition_size=2048

export CUDA_VISIBLE_DEVICES=4,5,6,7
output_dir="pdpd_bf16_offline"

python -u -m paddle.distributed.launch --log_dir ${output_dir}/logs --gpus ${CUDA_VISIBLE_DEVICES} paddle_infer.py \
    --actor_model_name_or_path your_model_name \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --rollout_max_num_seqs 24 \
    --rollout_quant_type "" \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file file.parquet \
    --output_dir ${output_dir} > ./paddleinfer.log 2>&1
```

 * **`CUDA_VISIBLE_DEVICES`**: Specifies the GPUs to be used.
 * **`paddle.distributed.launch`**: Launches a distributed PaddlePaddle inference job across the listed GPUs.
 * **`--actor_model_name_or_path`**: Path to the pre-trained model.
 * **`--max_src_len`**: Maximum source sequence length.
 * **`--rollout_input_batch_size`**: The batch size for inference.
 * **`--rollout_n`**: Number of responses to generate for each input query.
 * **`--tensor_parallel_degree`**: Degree of tensor parallelism for distributed inference.
 * **`--input_file`**: Path to the input dataset file.
 * **`--output_dir`**: Directory to save output results and logs.
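
A point worth noting about the sizing flags: with `--rollout_input_batch_size 4` and `--rollout_n 8`, each batch requests 4 × 8 = 32 sequences, while `--rollout_max_num_seqs 24` presumably caps how many of them decode concurrently (that reading of the flag is an assumption, not documented here). The batching itself is just chunking, as in this sketch:

```python
def chunk_prompts(prompts, rollout_input_batch_size):
    """Yield successive prompt batches, as in step 2 of "How It Works"."""
    for i in range(0, len(prompts), rollout_input_batch_size):
        yield prompts[i : i + rollout_input_batch_size]
```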

### Offline PyTorch Inference

This script tests the throughput of an LLM using offline PyTorch inference.

**Configuration (`torch_infer.sh`):**

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7

output_dir="vllm_bf16_offline_flashattn"

python torch_infer.py \
    --actor_model_name_or_path Qwen/Qwen2.5-7B-Instruct-1M \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --gpu_memory_utilization 0.8 > ./torchinferflashattn.log 2>&1
```

 * **`--gpu_memory_utilization`**: Fraction of GPU memory to be reserved for the model.
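
The `--gpu_memory_utilization` flag and the vLLM-style output directory name suggest the PyTorch path is backed by vLLM. Under that assumption, a minimal offline-generation sketch with equivalent sampling settings looks like the following (it mirrors the flags above but is not the `torch_infer.py` source; the prompt is a placeholder):

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Offline engine with the same parallelism and memory settings as torch_infer.sh.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.8,
)

# One SamplingParams object asking for rollout_n = 8 completions per prompt.
params = SamplingParams(n=8, temperature=1.0, top_p=1.0, max_tokens=30720)

prompts = ["A GSM8K-style question goes here."]  # placeholder prompt
outputs = llm.generate(prompts, params)

for out in outputs:
    # Each RequestOutput carries its n completions in `out.outputs`.
    print(len(out.outputs), "completions for prompt:", out.prompt[:40])
```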

-----

## Output Results

The `output_dir` contains the following files:

**1. Statistics Files**

 * `dispersed_stats.csv`

   Per-batch request length and throughput statistics. Fields:
   `batch_index, rollout_lengths, min_length, max_length, avg_length, completion_time, throughput_tokens_per_sec`

 * `global_stats.csv`

   Aggregated global metrics. Fields:
   `batch_index, min_response_tokens, max_response_tokens, avg_response_tokens, total_response_tokens, completion_time, throughput_tokens_per_sec`

**2. Detailed Records**

 * `rollout_details.jsonl`

   Raw per-request outputs (JSON Lines format), including input/output text.
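
To aggregate results after a run, the CSVs can be loaded directly with pandas. The sketch below assumes only the field names listed above and uses the `output_dir` from the API example as a placeholder:

```python
import json
import pandas as pd

output_dir = "api_serve_results"  # placeholder; use the output_dir from your run

# Per-batch throughput and the overall average across batches.
per_batch = pd.read_csv(f"{output_dir}/dispersed_stats.csv")
print(per_batch[["batch_index", "completion_time", "throughput_tokens_per_sec"]])
print("mean throughput (tokens/s):", per_batch["throughput_tokens_per_sec"].mean())

# Globally aggregated metrics.
print(pd.read_csv(f"{output_dir}/global_stats.csv"))

# Raw per-request records, one JSON object per line.
with open(f"{output_dir}/rollout_details.jsonl", encoding="utf-8") as f:
    first_record = json.loads(f.readline())
print(sorted(first_record.keys()))
```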