Skip to content

Latest commit

 

History

History
73 lines (49 loc) · 2.38 KB

README.md

File metadata and controls

73 lines (49 loc) · 2.38 KB

Investigate Overlapping Text Effect on vLLM Performance

If multiple concurrent queries have large overlapping sections vs. not having such sections, does this effect vLLM performance? Check this with some experiments.

First test is to detect random number in list, both with same list of numbers for each query vs. different list for different queries. Can be generated automatically and easy to test.

Test on Lambda Labs GPU Instance with 4-Bit Llama 3.1 8B

This test was done on a Lambda Labs gpu_1x_a100_sxm4 server in the Virginia, USA region.

The 4-bit GPTQ quantized Llama 3.1 8B LLM was used for this test.

Environment Setup

Following steps done in conda environment with Python 3.12. Without conda environment the vLLM server didn't work properly (errors with undefined symbols occurred).

Setup the environment:

sudo apt install parallel
pip install vllm openai fire
pip install --upgrade jinja2

git clone https://github.com/StanHatko/benchmark_llm_overlap_queries
cd benchmark_llm_overlap_queries

Start vLLM server:

vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --host 127.0.0.1 --port 8000 --enable-prefix-caching

Initial test of LLM:

./generate_detect_num_list.py /tmp/llm_test_basic 10 0
./send_local_llm_query.py /tmp/llm_test_basic_000.json

ls -1 /tmp/llm_test_basic_*.json
ls -1 /tmp/llm_test_basic_*.json | parallel -j 10 ./send_local_llm_query.py
time ( ls -1 /tmp/llm_test_basic_*.json | parallel -j 10 ./send_local_llm_query.py )

Performance Test with Different

Check time per run with 100 runs, with 100 entries generated per run, each list being different, 50 threads:

./test_llm_detect_num_list_diff.sh >~/test_diff_log.txt 2>~/test_diff_time.txt
cat ~/test_diff_time.txt | grep real | perl -pe 's/.*0m//' | perl -pe 's/s$//'

The time results are in the file time-taken-4bit-diff.txt. The time in seconds has mean $6.45708$ and standard deviation $0.5997773$.

Performance Test with Same

Check time per run with 100 runs, with 100 entries generated per run, within each run the list is same, 50 threads:

./test_llm_detect_num_list_same.sh >~/test_same_log.txt 2>~/test_same_time.txt
cat ~/test_same_time.txt | grep real | perl -pe 's/.*0m//' | perl -pe 's/s$//'

The time results are in the file time-taken-4bit-same.txt. The time in seconds has mean $4.58922$ and standard deviation $0.591287$.