Add VLLM_SCHED_PREFILL_KVC_FREEPCT #89

sanyalington · 2024-07-18T09:45:41Z

Add the VLLM_SCHED_PREFILL_KVC_FREEPCT feature, which schedules prefills only when at least a given percentage of the KV cache is free.
To enable it, export VLLM_SCHED_PREFILL_KVC_FREEPCT=<value>, where <value> is a float greater than 0.0.
This helps in large-batch-size offline inference scenarios, where prefills can be batched and scheduled once a certain percentage of the KV cache is free. If the free KV cache percentage is below VLLM_SCHED_PREFILL_KVC_FREEPCT, no prefills are scheduled and decode gets priority. As sequences finish decoding, KV cache is freed up.
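
The gating rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `can_schedule_prefill`, `num_free_blocks`, and `num_total_blocks` are hypothetical names, and the sketch assumes the threshold is expressed on a 0 to 100 scale, which the description does not state explicitly.

```python
import os


def read_threshold_pct() -> float:
    # A value of 0.0 (the default in the diff) is treated here as "feature off".
    return float(os.getenv("VLLM_SCHED_PREFILL_KVC_FREEPCT", "0.0"))


def can_schedule_prefill(num_free_blocks: int, num_total_blocks: int,
                         threshold_pct: float) -> bool:
    """Hypothetical helper: allow prefill only if enough KV cache is free.

    Assumes threshold_pct is on a 0-100 scale (an assumption, not confirmed
    by the PR description).
    """
    if threshold_pct <= 0.0:
        # Feature disabled: never hold prefills back.
        return True
    if num_total_blocks <= 0:
        raise ValueError("num_total_blocks must be positive")
    free_pct = 100.0 * num_free_blocks / num_total_blocks
    return free_pct >= threshold_pct
```

With a threshold of 60.0, a cache with 50 of 100 blocks free would hold prefills back and keep decoding, while 70 of 100 free would let a batch of prefills through.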


gshtras · 2024-07-18T15:33:25Z

vllm/core/scheduler.py

@@ -24,6 +24,9 @@
 ARTIFICIAL_PREEMPTION_MAX_CNT = 500


+VLLM_SCHED_PREFILL_KVC_FREEPCT = float(
+    os.getenv("VLLM_SCHED_PREFILL_KVC_FREEPCT", 0.0))  # noqa


Please move to envs.py
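
One way the requested move might look, as a self-contained sketch: it assumes envs.py keeps a dictionary of lazily evaluated lambdas keyed by variable name, and `get_env` is a hypothetical stand-in for however that module exposes the values. Neither detail is confirmed by this PR.

```python
import os
from typing import Any, Callable, Dict

# Assumed layout: each entry is evaluated on lookup, so the environment is
# read when the value is requested rather than at import time.
environment_variables: Dict[str, Callable[[], Any]] = {
    "VLLM_SCHED_PREFILL_KVC_FREEPCT":
    lambda: float(os.getenv("VLLM_SCHED_PREFILL_KVC_FREEPCT", "0.0")),
}


def get_env(name: str) -> Any:
    # Hypothetical accessor used only for this illustration.
    return environment_variables[name]()
```

The scheduler would then read the value through the shared module instead of calling `os.getenv` itself, which keeps all environment variable parsing in one place.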

Add VLLM_SCHED_PREFILL_KVC_FREEPCT feature to schedule prefill only w…

402f99d

…hen percentage of kv cache is free

shajrawi requested a review from gshtras July 18, 2024 15:16

gshtras reviewed Jul 18, 2024
