Unexpected throughput test result of llama-7b on A100 40G #275
Replies: 5 comments 8 replies
-
I tried your setting in my environment with a single A100-40G and got the following result:
Is your tokenizer correct? Can you try …
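Not a definitive diagnosis, but a quick sanity check (a sketch, assuming a Hugging Face tokenizer; the model path below is just an example, not necessarily the one being benchmarked) to confirm the tokenizer actually matches the model, since a mismatched tokenizer skews the token counts behind the tokens/s figure:

```python
# Sketch: verify the tokenizer loads and produces sane token counts for the model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # example model path
sample = "The quick brown fox jumps over the lazy dog."
ids = tok(sample).input_ids
print(len(ids), tok.convert_ids_to_tokens(ids)[:8])  # token count and first few tokens
```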
-
Maybe the tokenizer is the reason? Is tokenization time included when calculating throughput?
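For context, here is a minimal sketch of one way to measure requests/s and tokens/s with the offline API, just to make explicit what sits inside the timed region. The model path and prompts are placeholders, and the `RequestOutput` fields used (`prompt_token_ids`, `outputs[i].token_ids`) reflect my understanding of the current API rather than the exact benchmark script:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")          # example model path
prompts = ["Hello, my name is"] * 64            # toy workload
sampling_params = SamplingParams(n=1, temperature=1.0, max_tokens=128)

start = time.perf_counter()
# Prompt tokenization happens inside generate(), so it is part of the timed region here.
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count prompt tokens plus generated tokens across all completions.
total_tokens = sum(
    len(o.prompt_token_ids) + sum(len(c.token_ids) for c in o.outputs)
    for o in outputs
)
print(f"{len(prompts) / elapsed:.2f} requests/s, {total_tokens / elapsed:.2f} tokens/s")
```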
-
In the StarCoder case, it should not be due to the tokenizer. Maybe the current implementation of paged attention is just not good enough, since it does not implement MQA.
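For readers unfamiliar with the term, here is a minimal, illustration-only sketch of multi-query attention (MQA), the variant StarCoder uses; it is not vLLM's paged-attention kernel. All query heads share a single key/value head, which shrinks the KV cache relative to standard multi-head attention:

```python
import torch
import torch.nn.functional as F

def multi_query_attention(q, k, v):
    """
    q: (batch, num_heads, seq_len, head_dim)  -- per-head queries
    k: (batch, 1,        seq_len, head_dim)   -- single shared key head
    v: (batch, 1,        seq_len, head_dim)   -- single shared value head
    """
    scale = q.shape[-1] ** -0.5
    # Broadcasting expands the shared K/V across all query heads.
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Example: 8 query heads, but only one K/V head needs to be cached.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 1, 16, 64)
v = torch.randn(1, 1, 16, 64)
out = multi_query_attention(q, k, v)   # (1, 8, 16, 64)
```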
-
I tested vicuna-7b (whose performance should be similar to llama-7b) on a 3090 24G and got 2.66 requests/s, 1269.87 tokens/s, which is faster than your A100 40G.
-
Tested throughput of llama-7b with a single A800 80G; the result is 0.98 requests/s, 476.76 tokens/s, far lower than …
-
Tested throughput of llama-7b with a single A100 40G; the result is 1.49 requests/s, 714.33 tokens/s.
I wonder why it is even lower than the 154.2 requests/min result for llama-13b in README.md.
Also, I'm not quite sure what "each request asks for 1 output completion" means; is it the `--n` option in the demo code?
Here is my command and outputs, thanks!
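For what it's worth, here is a minimal sketch of what the `n` option maps to in the offline API, assuming the usual `LLM`/`SamplingParams` interface; the model path is an example, not the original command. With `n=1` each prompt yields one completion, so requests/s and completions/s coincide; with `n>1`, tokens/s can rise while requests/s stays the same:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")  # example model path
outs = llm.generate(["Once upon a time"], SamplingParams(n=2, max_tokens=32))
print(len(outs))              # 1 request ...
print(len(outs[0].outputs))   # ... with 2 completions
```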