Extremely high CPU memory (RAM) usage when running vLLM on k8s #10571
Unanswered
BrianPark314 asked this question in Q&A
I am not sure if this is a bug or intended behavior, so I'll post it here.
I am hosting a Qwen 2.5 72B GPTQ Int4 model, which should consume roughly 40 GB of VRAM across two Tesla V100s. The model loads fine, but I then discovered that the pod was consuming 34 GB of RAM.
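For reference, the deployment looks roughly like the sketch below; the image tag, model id, and flag values are assumptions standing in for my exact manifest:

```yaml
# A sketch of the relevant container spec, not the exact manifest;
# image tag, model id, and flag values are assumptions.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model=Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"
      - "--tensor-parallel-size=2"       # split across the two Tesla V100s
      - "--gpu-memory-utilization=0.9"
```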
After restricting the pod's memory limit to 8 GB, the server died with a ZeroMQ error.
I then increased the RAM limit to 16 GB. Everything seems to work, but the pod is occupying almost 100% of its RAM, which cannot be good for stability.
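Expressed as a pod resources block, the limit change looks like this (only the memory values reflect what I described; the rest of the spec is omitted):

```yaml
# Only the memory values reflect the change described above.
resources:
  requests:
    memory: "16Gi"
  limits:
    memory: "16Gi"   # with an 8Gi limit here, the server died on the ZeroMQ error
```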
The seemingly excessive RAM consumption starts after the model weights are loaded into VRAM, while CUDA graphs are being captured. Within 15 seconds, RAM usage jumps from ~4 GB to ~16 GB.
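For context, these are the vLLM flags I understand to influence host-RAM use around this phase; a sketch of the options, not my actual configuration:

```yaml
# Memory-related server flags (a sketch, not the actual configuration):
args:
  - "--swap-space=4"    # CPU swap space in GiB, pinned per GPU worker (4 is the default)
  - "--enforce-eager"   # skips CUDA graph capture at the cost of throughput
```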
Is this intended behavior?
Replies: 1 comment

If this fits more in issues, please let me know.