[Blog] AMD MI300X inference benchmark #1806 (Fixed Conclusion)
peterschmidt85 committed Oct 10, 2024
1 parent 10fd534 commit d17a84a
Showing 1 changed file with 16 additions and 12 deletions.
28 changes: 16 additions & 12 deletions docs/blog/posts/amd-mi300x-inference-benchmark.md
@@ -154,14 +154,14 @@ or maintain Time to First Token (TTFT).

At 1 RPS, vLLM performs slightly better than TGI. However, between 2 and 4 RPS, TGI outperforms vLLM in both throughput and TTFT.

-Notably, TGI begins to drop requests once it reaches 5 RPS.
+> Notably, TGI begins to drop requests once it reaches 5 RPS.
We repeated the test using a higher number of requests, ranging from 300 to 900.

<img src="https://raw.githubusercontent.com/dstackai/benchmarks/refs/heads/main/amd/inference/charts_rps/token_per_second_tpi_vllm.png" width="725" style="padding: 0 40px 0 50px"/>

-At 900 requests with a rate of 3 requests per second (RPS), TGI dropped a majority of the requests. However, its
-performance improved notably when the number of requests was below 900.
+> At 900 requests with a rate of 3 requests per second (RPS), TGI dropped a majority of the requests. However, its
+> performance improved notably when the number of requests was below 900.
<img src="https://raw.githubusercontent.com/dstackai/benchmarks/refs/heads/main/amd/inference/charts_rps/mean_ttft_tgi_vllm.png" width="725" style="padding: 0 40px 0 50px"/>
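
For context on how a request-rate sweep like this is typically driven, below is a minimal sketch of a load-test client that fires requests at a fixed RPS and records time to first token (TTFT) plus a rough streaming rate. It assumes an OpenAI-compatible streaming completions endpoint; the URL, model id, prompt, and route are placeholders, and this is not the harness behind the published numbers.

```python
# Hypothetical RPS load-test sketch; adjust the route and payload to the backend you run.
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
MODEL = "placeholder-model-id"                 # placeholder model id


async def one_request(session: aiohttp.ClientSession, prompt: str) -> tuple[float, int]:
    """Send one streaming completion; return (TTFT in seconds, stream chunks received)."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": 128, "stream": True}
    start = time.perf_counter()
    ttft, chunks = 0.0, 0
    async with session.post(URL, json=payload) as resp:
        async for line in resp.content:        # one server-sent-event line per streamed chunk
            if not line.strip():
                continue
            if chunks == 0:
                ttft = time.perf_counter() - start
            chunks += 1
    return ttft, chunks


async def run(rps: float, total: int) -> None:
    t0 = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(total):
            tasks.append(asyncio.create_task(one_request(session, f"Prompt {i}: ...")))
            await asyncio.sleep(1.0 / rps)     # fixed arrival rate, as in the RPS sweep
        results = await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - t0
    mean_ttft = sum(r[0] for r in results) / len(results)
    chunk_rate = sum(r[1] for r in results) / elapsed
    print(f"{rps} RPS, {total} requests: mean TTFT {mean_ttft:.3f}s, ~{chunk_rate:.1f} chunks/s")


if __name__ == "__main__":
    asyncio.run(run(rps=3, total=300))
```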

@@ -176,22 +176,26 @@ This difference may be related to how vLLM [pre-allocates GPU cache :material-ar

## Conclusion

-- TGI is highly efficient at handling medium to high workloads. In our tests on 8x AMD MI300X GPU, medium workloads
-are defined as RPS between 2 and 4. In these cases, it delivers faster time to first token (TTFT) and higher
-throughput.
-- Conversely, vLLM works well with lower RPS but struggles to scale, making it less ideal for more demanding workloads.
-- TGI's edge comes from
-its [continuous batching algorithm :material-arrow-top-right-thin:{ .external }](https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi){:target="_blank"}, which dynamically modifies batch sizes to optimize GPU usage.
+1. For small sequence lengths, starting with a batch size of 64, TGI significantly outperforms vLLM in terms of throughput and TTFT.
+2. For larger sequence lengths, TGI outperforms vLLM even more in both throughput and TTFT, with the difference increasing as the batch size grows.
+3. At higher request rates, TGI continues to outperform vLLM, likely due to its superior ability to batch requests efficiently.
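
As a rough illustration of the continuous batching idea referenced in the conclusion, the toy loop below refills the batch from a waiting queue at every decode step, so the batch size adapts to demand and finished sequences free their slots immediately. This is a conceptual sketch only, not TGI's or vLLM's actual scheduler.

```python
# Toy continuous-batching loop: purely illustrative, not a real serving scheduler.
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: int
    tokens_left: int                    # decode steps still needed
    generated: list = field(default_factory=list)


def continuous_batching(queue: deque, max_batch_size: int = 4) -> None:
    active: list[Request] = []
    step = 0
    while queue or active:
        # Refill free slots from the waiting queue every step (the "continuous" part).
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        # One decode step for every active sequence (conceptually one fused GPU pass).
        for req in active:
            req.generated.append(f"tok{step}")
            req.tokens_left -= 1
        # Finished sequences leave immediately, freeing slots for new requests.
        for req in [r for r in active if r.tokens_left == 0]:
            print(f"step {step}: request {req.rid} finished ({len(req.generated)} tokens)")
        active = [r for r in active if r.tokens_left > 0]
        step += 1


if __name__ == "__main__":
    random.seed(0)
    requests = deque(Request(rid=i, tokens_left=random.randint(2, 8)) for i in range(10))
    continuous_batching(requests)
```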

+!!! info "Limitation"
+    * In certain circumstances (e.g., at higher request rates), for unknown reasons, TGI dropped requests, making it
+      impossible to accurately track throughput and TTFT.
+    * With vLLM, we used the default backend configuration. With better tuning, we might have achieved improved performance.
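
As an example of the kind of tuning referred to above, the sketch below shows vLLM engine arguments that are commonly adjusted (GPU memory utilization, maximum batched sequences, context length). Parameter names follow vLLM's Python API; the model id and values are illustrative assumptions, not the configuration used in this benchmark.

```python
# Hypothetical vLLM tuning sketch; values are illustrative, not the benchmark's settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="placeholder-model-id",     # placeholder model id
    tensor_parallel_size=8,           # shard across the 8 MI300X GPUs
    gpu_memory_utilization=0.95,      # fraction of vRAM pre-allocated for weights + KV cache
    max_num_seqs=512,                 # cap on sequences batched together
    max_model_len=8192,               # shorter context leaves more room for the KV cache
)

sampling = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["A quick smoke-test prompt"], sampling)
print(outputs[0].outputs[0].text)
```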

-To gain a more complete understanding of the performance potential, a wider variety of backend configurations should be tested.
+In general, the 8x AMD MI300X is a good fit for larger models and allows us to make the most of its vRAM, especially for
+larger batches.

-If you’d like to support us in doing more benchmarks, please let us know.

## What's next?

While we wait for AMD to announce new GPUs and for data centers to offer them, we’re considering tests with NVIDIA GPUs
-like the H100 and H200, and possibly Google TPU.
+like the H100 and H200, as well as possibly Google TPU.

+If you’d like to support us in doing more benchmarks, please let us know.
+> Also, the next step is to measure how the FP8 version of the model would perform on this hardware.
### Source code
