tensorrtllm and vllm backend results are different using genai-perf #779

Open
upskyy opened this issue Sep 5, 2024 · 1 comment
upskyy commented Sep 5, 2024

Thank you for releasing a great project.

I benchmarked the rtzr/ko-gemma-2-9b-it model (a gemma-2-9b-it fine-tune) with genai-perf against both the Triton Server vllm backend and the Triton Server tensorrt_llm backend.
However, the Output sequence length metrics differ between the two backends, which in turn makes the Output token throughput (per sec) differ.

Since --output-tokens-mean was set to 100, vllm reports an output sequence length of 100, but tensorrtllm appears to report the input sequence length plus 100.
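A minimal sketch of the arithmetic behind the observed numbers, assuming the TensorRT-LLM backend echoes the prompt tokens into the response (the variable names here are illustrative, not from genai-perf):

```python
# Requested workload, matching the genai-perf flags below.
input_len = 200         # --synthetic-input-tokens-mean
requested_output = 100  # --output-tokens-mean

# vLLM reports only the generated tokens.
osl_vllm = requested_output

# TensorRT-LLM (with input echoed into the output) reports prompt + generation.
osl_trtllm = input_len + requested_output

print(osl_vllm, osl_trtllm)  # 100 300
```

This matches the tables below: the vllm runs show an output sequence length near 100, while the tensorrtllm runs show roughly 300, and the inflated count then inflates Output token throughput by the same factor.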

I ran genai-perf inside the nvcr.io/nvidia/tritonserver:24.07-py3-sdk Docker image.

Please let me know if I did something wrong or if something needs to be corrected.
I've attached the commands and the results below.

  • tensorrtllm
genai-perf -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer rtzr/ko-gemma-2-9b-it \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:8001


concurrency: 1

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 1,606.41 │ 1,593.64 │ 1,617.31 │ 1,617.19 │ 1,616.06 │ 1,610.55 │
│ Output sequence length │   299.50 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 186.43
Request throughput (per sec): 0.62
2024-09-04 09:48 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.json
2024-09-04 09:48 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.csv




concurrency: 4
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 1,781.95 │ 1,740.25 │ 2,142.17 │ 2,103.83 │ 1,777.44 │ 1,765.17 │
│ Output sequence length │   299.77 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.84 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 649.28
Request throughput (per sec): 2.17
2024-09-04 09:51 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.json
2024-09-04 09:51 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.csv



concurrency: 8
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 2,091.10 │ 1,970.12 │ 2,943.90 │ 2,881.30 │ 2,313.61 │ 2,029.94 │
│ Output sequence length │   299.64 │   297.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.90 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 1054.81
Request throughput (per sec): 3.52
2024-09-04 09:53 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.json
2024-09-04 09:53 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.csv
  • vllm
genai-perf -m rtzr_gemma2 \
  --service-kind triton \
  --backend vllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer rtzr/ko-gemma-2-9b-it \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:18001


concurrency: 1

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 3,792.74 │ 3,781.30 │ 3,812.85 │ 3,812.27 │ 3,807.09 │ 3,798.46 │
│ Output sequence length │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 26.37
Request throughput (per sec): 0.26
2024-09-05 04:01 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.json
2024-09-05 04:01 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.csv



concurrency: 4

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 3,996.60 │ 3,990.91 │ 4,007.69 │ 4,007.69 │ 4,007.66 │ 4,007.18 │
│ Output sequence length │    99.67 │    96.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 99.75
Request throughput (per sec): 1.00
2024-09-05 04:02 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.json
2024-09-05 04:02 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.csv



concurrency: 8
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 4,125.69 │ 4,090.61 │ 4,192.69 │ 4,192.68 │ 4,192.45 │ 4,191.99 │
│ Output sequence length │    99.92 │    98.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   199.88 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 193.71
Request throughput (per sec): 1.94
2024-09-05 04:04 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.json
2024-09-05 04:04 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.csv
dyastremsky (Contributor) commented:
Apologies for the delayed response. For TensorRT-LLM, you need to set exclude_input_in_output to true in the model config so that the input tokens are not echoed in the output.

There was a limitation in TensorRT-LLM that prevented GenAI-Perf from setting this value automatically. That limitation might have been lifted recently. We have it in our queue to investigate whether GenAI-Perf can now take care of this for you.
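For reference, a minimal sketch of how that setting can look in the tensorrt_llm model's config.pbtxt, following the parameter style used by the Triton TensorRT-LLM backend (verify the exact key against your backend version):

```
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```

With this set, the reported Output sequence length should cover only the generated tokens, making the metric directly comparable with the vllm backend's numbers.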
