fix: skip probing for incompatible scenario #879

Merged (1 commit) on Feb 13, 2025

Conversation

zhuangqh
Collaborator

When determining the max available context length, vllm will
generate dummy inputs with `max_num_batched_tokens` tokens
and profile peak VRAM usage.
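
For intuition, here is a minimal sketch of what such probing can look like: binary-search the largest max_model_len whose profiling run fits in VRAM. `try_load_with_context_len` is a hypothetical callable standing in for an engine-initialization attempt; this is an illustration, not vllm's or kaito's actual code.

```python
# Minimal sketch of context-length probing (illustrative only; not the
# actual kaito/vllm implementation). `try_load_with_context_len` is a
# hypothetical callable that attempts to initialize the engine with the
# given max_model_len and raises MemoryError when profiling exceeds VRAM.

def find_max_context_len(try_load_with_context_len, upper: int, lower: int = 1024) -> int:
    """Binary-search the largest max_model_len whose profiling run fits."""
    best = 0
    lo, hi = lower, upper
    while lo <= hi:
        mid = (lo + hi) // 2
        try:
            try_load_with_context_len(mid)  # profiles peak VRAM at this length
        except MemoryError:
            hi = mid - 1                    # out of VRAM: shrink the context
        else:
            best = mid                      # fits: try a longer context
            lo = mid + 1
    return best
```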

  1. skip probing for multi-gpu serving

If tensor-parallelism is enabled, there is more than one GPU worker.
However, we have no way in vllm to adjust the context length used for
profiling on these workers (see the guard sketched after this list).

  2. skip probing when chunked-prefill is enabled

[Chunked prefill](https://arxiv.org/pdf/2308.16369) is a feature
that improves throughput for long-context serving (max_model_len > 32K).
It makes peak memory usage nonlinearly related to max_model_len,
so probing results would be misleading.
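
The guard introduced by this PR can be summarized as follows. The parameter names mirror vllm's engine arguments (`tensor_parallel_size`, `enable_chunked_prefill`), but the helper itself is an illustrative sketch, not the exact kaito code:

```python
def should_probe(tensor_parallel_size: int, enable_chunked_prefill: bool) -> bool:
    """Return True only for scenarios where probing max_model_len is safe."""
    if tensor_parallel_size > 1:
        # Multi-GPU serving: vllm offers no way to adjust the context
        # length used for profiling on each worker, so skip probing.
        return False
    if enable_chunked_prefill:
        # Chunked prefill makes peak memory usage nonlinear in
        # max_model_len, so probe results would be misleading.
        return False
    return True
```

When probing is skipped, serving presumably proceeds with the configured max_model_len unchanged (the PR only states that probing is skipped).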

Notes for Reviewers:

Test cases:

  • single GPU with enough VRAM to run max-model-len
  • single GPU, find max-model-len after several probes
  • single GPU, unable to run with even a small model len
  • multi GPU
  • chunked prefill enabled

Signed-off-by: jerryzhuang <[email protected]>
@zhuangqh merged commit 5895b9b into kaito-project:main on Feb 13, 2025
3 of 8 checks passed