Describe the bug
I use Kaito to deploy deepseek-r1-distill-llama-8b on Azure Arc, using the Standard_NC32_A2 SKU, which has 2 GPUs with 16 GB of memory each.
The latest Helm chart sets featureGates: vLLM: "True". After installing it and trying to run DeepSeek, the deployment failed with an out-of-memory error, and the log reports a total GPU memory of only 14 GiB. After I switch featureGates: vLLM to "false", it deploys successfully.
Steps To Reproduce
1. Use a node with 2 GPUs, each with 16 GB of GPU memory.
2. Install the latest Helm chart and deploy the deepseek-r1-distill-llama-8b model with preferredNodes set to that node's name; the deployment should fail.
3. Delete the workspace, run helm uninstall, then reinstall the chart with featureGates: vLLM: "false" and apply the same CR (see the example commands below); it should work.
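For step 3, a minimal sketch of the workaround; the release name, namespace, chart path, and CR file name are assumptions and may differ in your setup:
helm uninstall kaito-workspace -n kaito-workspace                # remove the existing installation
helm install kaito-workspace ./charts/kaito/workspace \
  -n kaito-workspace --create-namespace \
  --set-string featureGates.vLLM=false                           # disable the vLLM runtime feature gate
kubectl apply -f workspace-deepseek-r1-distill-llama-8b.yaml     # re-apply the unchanged Workspace CR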
Logs
/usr/local/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
from vllm.version import __version__ as VLLM_VERSION
INFO 02-05 07:35:23 inference_api.py:113] Loading LoRA adapters from /mnt/adapter
WARNING 02-05 07:35:23 config.py:1674] Casting torch.bfloat16 to torch.float16.
WARNING 02-05 07:35:29 arg_utils.py:953] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 02-05 07:35:29 config.py:1005] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 02-05 07:35:29 inference_api.py:222] Try run profiler to find max available seq len
INFO 02-05 07:35:29 model_runner.py:1060] Starting to load model /workspace/vllm/weights...
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/vllm/inference_api.py", line 244, in
[rank0]: try_set_max_available_seq_len(args)
[rank0]: File "/workspace/vllm/inference_api.py", line 223, in try_set_max_available_seq_len
[rank0]: available_seq_len = find_max_available_seq_len(engine_config, max_probe_steps)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/vllm/inference_api.py", line 130, in find_max_available_seq_len
[rank0]: executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 47, in init
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1062, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 175, in _initialize_model
[rank0]: return build_model(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
[rank0]: return model_class(config=hf_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 522, in init
[rank0]: self.lm_head = ParallelLMHead(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 443, in init
[rank0]: super().init(num_embeddings, embedding_dim, params_dtype,
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 260, in init
[rank0]: self.linear_method.create_weights(self,
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 28, in create_weights
[rank0]: weight = Parameter(torch.empty(sum(output_partition_sizes),
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/torch/utils/_device.py", line 79, in torch_function
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 14.53 GiB of which 446.00 MiB is free. Process 102225 has 14.09 GiB memory in use. Of the allocated memory 14.01 GiB is allocated by PyTorch, and 17.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Environment
Azure Arc hybrid AKS.
Kubernetes version (use kubectl version):
Client Version: v1.30.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
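Note: the 14.53 GiB in the OOM error above is the per-GPU capacity reported by the driver, which is typically a bit less than the nominal 16 GB. If you have access to the node, it can be checked with nvidia-smi (assuming the NVIDIA driver and CLI are installed on the host):
nvidia-smi --query-gpu=index,name,memory.total --format=csv   # per-GPU total memory as seen by the driver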
qfai changed the title from "Vllm seems unable to detect and use multiple GPUs" to "Vllm seems unable to detect and use multiple GPUs (the issue can be bypassed by switching off the vLLM feature gate)" on Feb 5, 2025
We found that the automatic probing for a model window size that fits the GPU memory would get stuck on 2x A100, so we disabled tensor parallelism for the DeepSeek presets by default.
You can turn this feature back on manually by providing a config:
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-deepseek-r1-distill-qwen-14b
resource:
  instanceType: "Standard_NC48ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: deepseek-r1-distill-qwen-14b
inference:
  preset:
    name: "deepseek-r1-distill-qwen-14b"
  config: "ds-inference-params" # here
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ds-inference-params
data:
  inference_config.yaml: |
    # Maximum number of steps to find the max available seq len fitting in the GPU memory.
    max_probe_steps: 6

    vllm:
      cpu-offload-gb: 0
      gpu-memory-utilization: 0.95
      swap-space: 4
      max-model-len: 131072 # --> decrease this value according to the GPU memory
      tensor-parallel-size: 2 # --> set to gpu count
      # max-seq-len-to-capture: 8192
      # num-scheduler-steps: 1
      # enable-chunked-prefill: false
      # see https://docs.vllm.ai/en/latest/serving/engine_args.html for more options.
Change these two values according to your SKU:
max-model-len: 131072 # --> decrease this value according to the GPU memory
tensor-parallel-size: 2 # --> set to the GPU count
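As a rough sanity check on why tensor parallelism matters for the original report: an 8B-parameter model in float16 needs about 16 GB for its weights alone (8B parameters x 2 bytes), which already exceeds the ~14.5 GiB a single 16 GB GPU exposes, so tensor-parallel-size: 2 is needed to split the weights across both GPUs before any KV cache is allocated. A minimal sketch of applying the config above; the manifest file names are placeholders:
kubectl apply -f ds-inference-params.yaml                       # the ConfigMap holding inference_config.yaml
kubectl apply -f workspace-deepseek.yaml                        # the Workspace CR that references it via inference.config
kubectl get workspace workspace-deepseek-r1-distill-qwen-14b    # wait for the workspace to report ready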
Additional context
Node GPU info:
allocatable:
  cpu: 31580m
  ephemeral-storage: "95026644016"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 121761172Ki
  nvidia.com/gpu: "2"
  pods: "110"