vLLM seems unable to detect and use multiple GPUs (the issue can be bypassed by switching off the vLLM featureGate) #866

Open
qfai opened this issue Feb 5, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@qfai
Contributor

qfai commented Feb 5, 2025

Describe the bug
I used KAITO to deploy deepseek-r1-distill-llama-8b on Azure Arc with the Standard_NC32_A2 SKU, which has 2 GPUs with 16 GiB of memory each.
The latest Helm chart sets featureGates: vLLM: "True". After installing it and trying to run DeepSeek, the deployment failed with out of memory, and the log reports a total GPU memory of only ~14 GiB (a single GPU's worth). After I switched featureGates: vLLM to "false", it deployed successfully.
Steps To Reproduce

  1. Use a node with 2 GPUs, each with 16 GiB of GPU memory.
  2. Helm install the latest chart and deploy the deepseek-r1-distill-llama-8b model with the preferred node name set to that node; the deployment should fail.
  3. Delete the workspace, helm uninstall, then reinstall the chart with featureGates: vLLM: "false" and apply the same CR; it should work (see the values sketch below).
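
For reference, a minimal sketch of the values override used in step 3. The file name is hypothetical and the featureGates.vLLM key is assumed from the chart behaviour described above; check the chart's own values.yaml for the exact path.

# values-override.yaml (hypothetical file name)
# Assumption: the KAITO workspace chart exposes the featureGates.vLLM value mentioned above.
featureGates:
  vLLM: "false"   # fall back to the default (non-vLLM) runtime

Passing this file with helm install -f values-override.yaml matches step 3; the same value can also be set with --set featureGates.vLLM=false.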

Logs
/usr/local/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
from vllm.version import __version__ as VLLM_VERSION
INFO 02-05 07:35:23 inference_api.py:113] Loading LoRA adapters from /mnt/adapter
WARNING 02-05 07:35:23 config.py:1674] Casting torch.bfloat16 to torch.float16.
WARNING 02-05 07:35:29 arg_utils.py:953] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 02-05 07:35:29 config.py:1005] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 02-05 07:35:29 inference_api.py:222] Try run profiler to find max available seq len
INFO 02-05 07:35:29 model_runner.py:1060] Starting to load model /workspace/vllm/weights...
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/vllm/inference_api.py", line 244, in
[rank0]: try_set_max_available_seq_len(args)
[rank0]: File "/workspace/vllm/inference_api.py", line 223, in try_set_max_available_seq_len
[rank0]: available_seq_len = find_max_available_seq_len(engine_config, max_probe_steps)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/vllm/inference_api.py", line 130, in find_max_available_seq_len
[rank0]: executor = executor_class(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 47, in init
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1062, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/init.py", line 19, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
[rank0]: model = _initialize_model(model_config, self.load_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 175, in _initialize_model
[rank0]: return build_model(
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
[rank0]: return model_class(config=hf_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 522, in init
[rank0]: self.lm_head = ParallelLMHead(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 443, in init
[rank0]: super().init(num_embeddings, embedding_dim, params_dtype,
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 260, in init
[rank0]: self.linear_method.create_weights(self,
[rank0]: File "/usr/local/lib/python3.12/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 28, in create_weights
[rank0]: weight = Parameter(torch.empty(sum(output_partition_sizes),
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/site-packages/torch/utils/_device.py", line 79, in torch_function
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 14.53 GiB of which 446.00 MiB is free. Process 102225 has 14.09 GiB memory in use. Of the allocated memory 14.01 GiB is allocated by PyTorch, and 17.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Environment
Azure arc hybrid aks.

  • Kubernetes version (use kubectl version):
    Client Version: v1.30.5
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.29.4
  • OS (e.g: cat /etc/os-release):
  • Install tools:
  • Others:

Additional context

node GPU info
allocatable:
cpu: 31580m
ephemeral-storage: "95026644016"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 121761172Ki
nvidia.com/gpu: "2"
pods: "110"

@zhuangqh
Collaborator

zhuangqh commented Feb 6, 2025

We found that it would get stuck when trying to automatically find the fitting model window size on A100 x2, so we disabled tensor parallelism for DeepSeek by default.
You can turn this feature back on manually by providing a config:

apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-deepseek-r1-distill-qwen-14b
resource:
  instanceType: "Standard_NC48ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: deepseek-r1-distill-qwen-14b
inference:
  preset:
    name: "deepseek-r1-distill-qwen-14b"
  config: "ds-inference-params"  # here
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ds-inference-params
data:
  inference_config.yaml: |
    # Maximum number of steps to find the max available seq len fitting in the GPU memory.
    max_probe_steps: 6

    vllm:
      cpu-offload-gb: 0
      gpu-memory-utilization: 0.95
      swap-space: 4
      max-model-len: 131072   # --> decrease this value according to the GPU memory
      tensor-parallel-size: 2     # --> set to gpu count

      # max-seq-len-to-capture: 8192
      # num-scheduler-steps: 1
      # enable-chunked-prefill: false
      # see https://docs.vllm.ai/en/latest/serving/engine_args.html for more options.

Change these two values according to your SKU:

      max-model-len: 131072   # --> decrease this value according to the GPU memory
      tensor-parallel-size: 2     # --> set to gpu count
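
For the 2x 16 GiB node from the report above, a hedged adjustment of those two values might look like the snippet below; the max-model-len figure is an assumption and may need to be lowered further if the OOM persists.

    vllm:
      tensor-parallel-size: 2     # set to the node's GPU count (node reports nvidia.com/gpu: "2")
      max-model-len: 32768        # assumption: reduced from 131072 to fit 2x 16 GiB GPUs; tune for your workload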

We will fix this issue soon.
