tensorrtllm and vllm backend results are different using genai-perf #779

Open
upskyy opened this issue Sep 5, 2024 · 1 comment
upskyy commented Sep 5, 2024

Thank you for releasing a great project.

I benchmarked the rtzr/ko-gemma-2-9b-it model (a gemma-2-9b-it fine-tune) with genai-perf against both the Triton Server vllm backend and the Triton Server tensorrt_llm backend.
However, the Output sequence length metrics differ between the two backends, which in turn makes the Output token throughput (per sec) differ.

Since --output-tokens-mean was set to 100, vllm reports an output sequence length of 100, but tensorrtllm appears to report the input sequence length plus 100.
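A minimal sketch of the arithmetic behind the observed numbers, assuming the TensorRT-LLM backend echoes the prompt tokens into the response (the variable names here are illustrative, not from genai-perf):

```python
# Requested workload, matching the genai-perf flags below.
input_len = 200         # --synthetic-input-tokens-mean
requested_output = 100  # --output-tokens-mean

# vLLM reports only the generated tokens.
osl_vllm = requested_output

# TensorRT-LLM (with input echoed into the output) reports prompt + generation.
osl_trtllm = input_len + requested_output

print(osl_vllm, osl_trtllm)  # 100 300
```

This matches the tables below: the vllm runs show an output sequence length near 100, while the tensorrtllm runs show roughly 300, and the inflated count then inflates Output token throughput by the same factor.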

I ran genai-perf inside the nvcr.io/nvidia/tritonserver:24.07-py3-sdk Docker image.

Please let me know if I did something wrong or if something needs to be corrected.
I've attached the commands and the results below.

  • tensorrtllm
genai-perf -m ensemble \
  --service-kind triton \
  --backend tensorrtllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer rtzr/ko-gemma-2-9b-it \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:8001


concurrency: 1

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 1,606.41 │ 1,593.64 │ 1,617.31 │ 1,617.19 │ 1,616.06 │ 1,610.55 │
│ Output sequence length │   299.50 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 186.43
Request throughput (per sec): 0.62
2024-09-04 09:48 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.json
2024-09-04 09:48 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.csv




concurrency: 4
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 1,781.95 │ 1,740.25 │ 2,142.17 │ 2,103.83 │ 1,777.44 │ 1,765.17 │
│ Output sequence length │   299.77 │   298.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.84 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 649.28
Request throughput (per sec): 2.17
2024-09-04 09:51 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.json
2024-09-04 09:51 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency4/profile_export_genai_perf.csv



concurrency: 8
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 2,091.10 │ 1,970.12 │ 2,943.90 │ 2,881.30 │ 2,313.61 │ 2,029.94 │
│ Output sequence length │   299.64 │   297.00 │   300.00 │   300.00 │   300.00 │   300.00 │
│  Input sequence length │   199.90 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 1054.81
Request throughput (per sec): 3.52
2024-09-04 09:53 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.json
2024-09-04 09:53 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency8/profile_export_genai_perf.csv
  • vllm
genai-perf -m rtzr_gemma2 \
  --service-kind triton \
  --backend vllm \
  --num-prompts 100 \
  --random-seed 123 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --output-tokens-mean-deterministic \
  --tokenizer rtzr/ko-gemma-2-9b-it \
  --concurrency 1 \
  --measurement-interval 4000 \
  --profile-export-file my_profile_export.json \
  --url localhost:18001


concurrency: 1

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 3,792.74 │ 3,781.30 │ 3,812.85 │ 3,812.27 │ 3,807.09 │ 3,798.46 │
│ Output sequence length │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 26.37
Request throughput (per sec): 0.26
2024-09-05 04:01 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.json
2024-09-05 04:01 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency1/profile_export_genai_perf.csv



concurrency: 4

                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 3,996.60 │ 3,990.91 │ 4,007.69 │ 4,007.69 │ 4,007.66 │ 4,007.18 │
│ Output sequence length │    99.67 │    96.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   199.75 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 99.75
Request throughput (per sec): 1.00
2024-09-05 04:02 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.json
2024-09-05 04:02 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency4/profile_export_genai_perf.csv



concurrency: 8
                                        LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃              Statistic ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   Request latency (ms) │ 4,125.69 │ 4,090.61 │ 4,192.69 │ 4,192.68 │ 4,192.45 │ 4,191.99 │
│ Output sequence length │    99.92 │    98.00 │   100.00 │   100.00 │   100.00 │   100.00 │
│  Input sequence length │   199.88 │   199.00 │   200.00 │   200.00 │   200.00 │   200.00 │
└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
Output token throughput (per sec): 193.71
Request throughput (per sec): 1.94
2024-09-05 04:04 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.json
2024-09-05 04:04 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/rtzr_gemma2-triton-vllm-concurrency8/profile_export_genai_perf.csv
dyastremsky (Contributor) commented:
Apologies for the delayed response. For TensorRT-LLM, you need to set exclude_input_in_output to true in the model config so that the input tokens are not echoed in the output.

There was a limitation in TensorRT-LLM that prevented GenAI-Perf from setting this value automatically. That limitation might have been lifted recently. We have it in our queue to investigate whether GenAI-Perf can now take care of this for you.
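For reference, a minimal sketch of how that setting can look in the tensorrt_llm model's config.pbtxt, following the parameter style used by the Triton TensorRT-LLM backend (verify the exact key against your backend version):

```
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```

With this set, the reported Output sequence length should cover only the generated tokens, making the metric directly comparable with the vllm backend's numbers.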
