Commit eee766b

[RL] Reinforcement learning benchmark framework (#10619)

* Add init.py file under the rl directory
* Add copyright headers and reformat
* Add reinforcement learning benchmark framework
* Update copyright
* Update
* Add reinforcement learning benchmark framework
* Add elapsed-time metrics to the API serve framework
* Fix some bugs
* Add reinforcement learning framework with scripts file
* Remove vLLM quant type
* Reformat RL benchmark code and scripts
* Add README file
* Add FastDeploy engine support for API serve

1 parent 134a7b9 commit eee766b

File tree: 8 files changed, +1433 -0 lines changed

llm/benchmark/rl/README.md

Lines changed: 188 additions & 0 deletions
# Large Language Model Throughput Testing Framework

This framework is designed to test the **throughput performance** of Large Language Models (LLMs) across different deployment methods: **offline inference** (using PaddlePaddle and PyTorch) and **online inference** (via API). It specifically focuses on evaluating the model's ability to handle **batch queries**, measuring throughput in tokens per second under various configurations and batch sizes.

## Features

* **Diverse Deployment Method Support:** Tests LLMs deployed via an online API, as well as offline inference with PaddlePaddle and PyTorch.
* **Batch Query Throughput Calculation:** Accurately measures throughput (tokens/s) for concurrent queries, providing insight into the model's performance under load.
* **Detailed Time Logging:** Records the total time for each batch processing operation.

## How It Works

This framework operates by sending batched requests to specified endpoints (API or local inference scripts) and collecting performance data on how the model generates responses for multiple queries simultaneously.

1. **Input Queries:** You provide a set of questions as test input, typically from a `.parquet` or text file.
2. **Batch Processing:** The framework groups these questions into batches of a specified `rollout_input_batch_size`.
3. **Generate Responses:** For each query within a batch, the framework requests the model to generate `rollout_n` responses.
4. **Time Measurement:** The total time from sending a batch of questions to receiving all corresponding responses is recorded.
5. **Token Statistics:** The total number of tokens generated across all responses in a batch is summed.
6. **Throughput Calculation:** Throughput (tokens/s) is calculated by dividing the total tokens generated by the total time taken for the batch to complete, as in the sketch below.
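
In code, the per-batch measurement reduces to a few lines. The sketch below is illustrative rather than the framework's actual implementation; `generate_batch` and `count_tokens` are hypothetical stand-ins for whichever backend and tokenizer are under test:

```python
import time

def benchmark_batch(questions, generate_batch, count_tokens, rollout_n=8):
    """Time one batch end to end and return (tokens, seconds, tokens/s)."""
    start = time.perf_counter()
    # The backend returns rollout_n responses for every question in the batch.
    responses = generate_batch(questions, n=rollout_n)
    elapsed = time.perf_counter() - start

    # Sum tokens across all len(questions) * rollout_n responses.
    total_tokens = sum(count_tokens(r) for r in responses)
    return total_tokens, elapsed, total_tokens / elapsed
```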

## Usage

This section details how to run throughput tests for each deployment method using the provided shell scripts.

### Data Preparation

To run the tests, first download and extract the `rl_data.tar.gz` archive, which contains the GSM8K dataset in a format suitable for testing:

```bash
cd llm/benchmark/rl
wget https://paddle-qa.bj.bcebos.com/paddlenlp/rl_data.tar.gz
tar -zxvf rl_data.tar.gz
```

Extracting the archive creates a `data` folder containing the GSM8K dataset.

### Online API Inference

This script tests the throughput of a remote LLM API.

**Configuration (`api_serve.sh`):**

```bash
output_dir="api_serve_results"

python api_serve.py \
    --openai_urls "your_url1" "your_url2" \
    --api_keys "key1" "key2" \
    --model "Qwen2.5-7B-Instruct-1M" \
    --tokenizer "Qwen/Qwen2.5-7B-Instruct-1M" \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --use_fastdeploy true \
    --rollout_input_batch_size 8 \
    --rollout_n 8 \
    --top_p 1.0 \
    --temperature 0.7 \
    --max_dec_len 8192 \
    --limit_rows 512
```

* **`--openai_urls`**: URLs of the API endpoints to test (see the request sketch below).
* **`--api_keys`**: API keys for authentication (if required).
* **`--model`**: Name of the model being tested.
* **`--tokenizer`**: Path or name of the tokenizer.
* **`--input_file`**: Path to the input dataset file.
* **`--output_dir`**: Directory to save output results.
* **`--use_fastdeploy`**: Use FastDeploy if true, otherwise use vLLM (default: true).
* **`--rollout_input_batch_size`**: The batch size for API requests.
* **`--rollout_n`**: Number of responses to generate for each input query.
* **`--max_dec_len`**: Maximum decoding length for responses.
* **`--limit_rows`**: Limit on the number of input rows processed.
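
Each endpoint passed via `--openai_urls` is expected to serve an OpenAI-compatible API. As a rough illustration of what the framework sends (this is not code from `api_serve.py`), a single request with the settings above might look like the following, assuming the `openai` Python client and a reachable endpoint:

```python
from openai import OpenAI

# One endpoint/key pair from --openai_urls / --api_keys.
client = OpenAI(base_url="your_url1", api_key="key1")

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-1M",
    messages=[{"role": "user", "content": "A GSM8K question goes here."}],
    n=8,              # rollout_n: responses per query
    top_p=1.0,
    temperature=0.7,
    max_tokens=8192,  # max_dec_len
)

# usage.completion_tokens counts generated tokens across all n choices.
print(response.usage.completion_tokens)
```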

**Run command:**

```bash
bash scripts/api_serve.sh
```

### Offline PaddlePaddle Inference

This script tests the throughput of an LLM using offline PaddlePaddle inference, optionally with distributed processing.

**Configuration (`paddle_infer.sh`):**

```bash
unset PADDLE_TRAINERS_NUM
unset PADDLE_ELASTIC_JOB_ID
unset PADDLE_TRAINER_ENDPOINTS
unset DISTRIBUTED_TRAINER_ENDPOINTS
unset FLAGS_START_PORT
unset PADDLE_ELASTIC_TIMEOUT

export PYTHONPATH="your_paddlenlp_path/PaddleNLP":$PYTHONPATH
export PYTHONPATH="your_paddlenlp_path/PaddleNLP/llm":$PYTHONPATH

export FLAGS_set_to_1d=False
export NVIDIA_TF32_OVERRIDE=0
export FLAGS_dataloader_use_file_descriptor=False
export HF_DATASETS_DOWNLOAD_TIMEOUT=1
export FLAGS_gemm_use_half_precision_compute_type=False
export FLAGS_force_cublaslt_no_reduced_precision_reduction=True

export FLAGS_custom_allreduce=0
export FLAGS_mla_use_tensorcore=0
export FLAGS_cascade_attention_max_partition_size=2048

export CUDA_VISIBLE_DEVICES=4,5,6,7
output_dir="pdpd_bf16_offline"

python -u -m paddle.distributed.launch --log_dir ${output_dir}/logs --gpus ${CUDA_VISIBLE_DEVICES} paddle_infer.py \
    --actor_model_name_or_path your_model_name \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --rollout_max_num_seqs 24 \
    --rollout_quant_type "" \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file file.parquet \
    --output_dir ${output_dir} > ./paddleinfer.log 2>&1
```

* **`CUDA_VISIBLE_DEVICES`**: Specifies the GPUs to use.
* **`paddle.distributed.launch`**: Launches a distributed PaddlePaddle job, used here for inference.
* **`--actor_model_name_or_path`**: Path to the pre-trained model.
* **`--max_src_len`**: Maximum source sequence length.
* **`--rollout_input_batch_size`**: The batch size for inference.
* **`--rollout_n`**: Number of responses to generate for each input query.
* **`--tensor_parallel_degree`**: Degree of tensor parallelism for distributed inference.
* **`--input_file`**: Path to the input dataset file.
* **`--output_dir`**: Directory to save output results and logs.
### Offline PyTorch Inference

This script tests the throughput of an LLM using offline PyTorch inference.

**Configuration (`torch_infer.sh`):**

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7

output_dir="vllm_bf16_offline_flashattn"

python torch_infer.py \
    --actor_model_name_or_path Qwen/Qwen2.5-7B-Instruct-1M \
    --max_src_len 2048 \
    --min_dec_len 32 \
    --max_dec_len 30720 \
    --top_p 1.0 \
    --temperature 1.0 \
    --rollout_input_batch_size 4 \
    --rollout_n 8 \
    --tensor_parallel_degree 4 \
    --limit_rows 640 \
    --input_file ./data/gsm8k/instruct/train.parquet \
    --output_dir ${output_dir} \
    --gpu_memory_utilization 0.8 > ./torchinferflashattn.log 2>&1
```

* **`--gpu_memory_utilization`**: Fraction of GPU memory to be reserved for the model.

-----
## Output Results

The `output_dir` contains the following files:

**1. Statistics Files**

`dispersed_stats.csv`

Per-batch request length and throughput statistics. Fields:
`batch_index, rollout_lengths, min_length, max_length, avg_length, completion_time, throughput_tokens_per_sec`

`global_stats.csv`

Aggregated global metrics. Fields:
`batch_index, min_response_tokens, max_response_tokens, avg_response_tokens, total_response_tokens, completion_time, throughput_tokens_per_sec`
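
Both CSVs load directly with standard tooling. A minimal post-processing sketch using pandas (paths assume the `api_serve_results` directory from the example above):

```python
import pandas as pd

# Per-batch statistics, with the columns listed above.
batch_stats = pd.read_csv("api_serve_results/dispersed_stats.csv")

# Distribution of per-batch throughput across the run.
print(batch_stats["throughput_tokens_per_sec"].describe())

# The slowest batches are typically those with the longest rollouts.
print(batch_stats.sort_values("completion_time", ascending=False).head())
```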

**2. Detailed Records**

`rollout_details.jsonl`

Raw per-request outputs (JSON Lines format), including input/output text.
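
For spot-checking individual generations, the JSONL file can be streamed line by line. A small sketch (the exact field names per record are not documented here, so inspect the keys first):

```python
import json

# Each line of rollout_details.jsonl is one JSON record for one request.
with open("api_serve_results/rollout_details.jsonl", encoding="utf-8") as f:
    first = json.loads(next(f))
    print(sorted(first.keys()))  # discover the available fields
```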