fix: skip probing for incompatible scenario #879

Merged (1 commit) on Feb 13, 2025

Conversation

zhuangqh
Collaborator

When determining the max available context length, vllm will
generate dummy inputs with `max_num_batched_tokens` tokens
and profile peak VRAM usage.
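
For intuition, here is a minimal sketch of what such probing can look like: binary-search the largest max_model_len whose profiling run fits in VRAM. `try_load_with_context_len` is a hypothetical callable standing in for an engine-initialization attempt; this is an illustration, not vllm's or kaito's actual code.

```python
# Minimal sketch of context-length probing (illustrative only; not the
# actual kaito/vllm implementation). `try_load_with_context_len` is a
# hypothetical callable that attempts to initialize the engine with the
# given max_model_len and raises MemoryError when profiling exceeds VRAM.

def find_max_context_len(try_load_with_context_len, upper: int, lower: int = 1024) -> int:
    """Binary-search the largest max_model_len whose profiling run fits."""
    best = 0
    lo, hi = lower, upper
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            try_load_with_context_len(mid)  # profiles peak VRAM at this length
        except MemoryError:
            hi = mid - 1                    # out of VRAM: shrink the context
        else:
            best = mid                      # fits: try a longer context
            lo = mid + 1
    return best
```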

  1. skip probing for multi-gpu serving

If tensor-parallelism is enabled, there is more than one GPU worker.
However, we have no way in vllm to adjust the context length used for
profiling on these workers (see the guard sketched after this list).

  2. skip probing when chunked-prefill is enabled

[Chunked prefill](https://arxiv.org/pdf/2308.16369) is a feature
that improves throughput for long-context serving (max_model_len > 32K).
It makes peak memory usage nonlinearly related to max_model_len,
so probing results would be misleading.
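
The guard introduced by this PR can be summarized as follows. The parameter names mirror vllm's engine arguments (`tensor_parallel_size`, `enable_chunked_prefill`), but the helper itself is an illustrative sketch, not the exact kaito code:

```python
def should_probe(tensor_parallel_size: int, enable_chunked_prefill: bool) -> bool:
    """Return True only for scenarios where probing max_model_len is safe."""
    if tensor_parallel_size > 1:
        # Multi-GPU serving: vllm offers no way to adjust the context
        # length used for profiling on each worker, so skip probing.
        return False
    if enable_chunked_prefill:
        # Chunked prefill makes peak memory usage nonlinear in
        # max_model_len, so probe results would be misleading.
        return False
    return True
```

When probing is skipped, serving presumably proceeds with the configured max_model_len unchanged (the PR only states that probing is skipped).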

Notes for Reviewers:

Test cases:

  • single GPU with enough VRAM to run max-model-len
  • single GPU, find max-model-len after several probes
  • single GPU, unable to run with even a small model len
  • multi GPU
  • chunked prefill enabled

Signed-off-by: jerryzhuang <[email protected]>
@zhuangqh merged commit 5895b9b into kaito-project:main on Feb 13, 2025
3 of 8 checks passed