Unexpected throughput test result of llama-7b on A100 40G #275
Replies: 5 comments 8 replies
-
I tried your setting in my environment with a single A100-40G and got the following result:
Is your tokenizer correct? Can you try …
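Not a definitive diagnosis, but a quick sanity check (a sketch, assuming a Hugging Face tokenizer; the model path below is just an example, not necessarily the one being benchmarked) to confirm the tokenizer actually matches the model, since a mismatched tokenizer skews the token counts behind the tokens/s figure:

```python
# Sketch: verify the tokenizer loads and produces sane token counts for the model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # example model path
sample = "The quick brown fox jumps over the lazy dog."
ids = tok(sample).input_ids
print(len(ids), tok.convert_ids_to_tokens(ids)[:8])  # token count and first few tokens
```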
-
Maybe the tokenizer is the reason? Is tokenization time included when calculating throughput?
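For context, here is a minimal sketch of one way to measure requests/s and tokens/s with the offline API, just to make explicit what sits inside the timed region. The model path and prompts are placeholders, and the `RequestOutput` fields used (`prompt_token_ids`, `outputs[i].token_ids`) reflect my understanding of the current API rather than the exact benchmark script:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")          # example model path
prompts = ["Hello, my name is"] * 64            # toy workload
sampling_params = SamplingParams(n=1, temperature=1.0, max_tokens=128)

start = time.perf_counter()
# Prompt tokenization happens inside generate(), so it is part of the timed region here.
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count prompt tokens plus generated tokens across all completions.
total_tokens = sum(
    len(o.prompt_token_ids) + sum(len(c.token_ids) for c in o.outputs)
    for o in outputs
)
print(f"{len(prompts) / elapsed:.2f} requests/s, {total_tokens / elapsed:.2f} tokens/s")
```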
-
In the StarCoder case, it should not be due to the tokenizer. Maybe the current implementation of paged attention is just not good enough, since it does not implement MQA.
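For readers unfamiliar with the term, here is a minimal, illustration-only sketch of multi-query attention (MQA), the variant StarCoder uses; it is not vLLM's paged-attention kernel. All query heads share a single key/value head, which shrinks the KV cache relative to standard multi-head attention:

```python
import torch
import torch.nn.functional as F

def multi_query_attention(q, k, v):
    """
    q: (batch, num_heads, seq_len, head_dim)  -- per-head queries
    k: (batch, 1,        seq_len, head_dim)   -- single shared key head
    v: (batch, 1,        seq_len, head_dim)   -- single shared value head
    """
    scale = q.shape[-1] ** -0.5
    # Broadcasting expands the shared K/V across all query heads.
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Example: 8 query heads, but only one K/V head needs to be cached.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 1, 16, 64)
v = torch.randn(1, 1, 16, 64)
out = multi_query_attention(q, k, v)   # (1, 8, 16, 64)
```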
-
I tested vicuna-7b (whose performance should be similar to llama-7b) on a 3090 24G and got 2.66 requests/s, 1269.87 tokens/s, which is faster than your A100 40G.
-
Tested throughput of llama-7b with a single A800 80G; the result is 0.98 requests/s, 476.76 tokens/s, far lower than …
-
Tested throughput of llama-7b with a single A100 40G; the result is 1.49 requests/s, 714.33 tokens/s.
I wonder why it is even lower than the 154.2 requests/min result for llama-13b in README.md.
Also, I'm not quite sure what "each request asks for 1 output completion" means; is it the `--n` option in the demo code?
Here is my command and outputs, thanks!
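For what it's worth, here is a minimal sketch of what the `n` option maps to in the offline API, assuming the usual `LLM`/`SamplingParams` interface; the model path is an example, not the original command. With `n=1` each prompt yields one completion, so requests/s and completions/s coincide; with `n>1`, tokens/s can rise while requests/s stays the same:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")  # example model path
outs = llm.generate(["Once upon a time"], SamplingParams(n=2, max_tokens=32))
print(len(outs))              # 1 request ...
print(len(outs[0].outputs))   # ... with 2 completions
```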