vLLM quantization BrokenPipeError #252

Open
j-irion opened this issue Aug 29, 2024 · 4 comments

@j-irion

j-irion commented Aug 29, 2024

Hello,
I am trying to benchmark a model quantized to NF4 with bitsandbytes. How can I run it with the vLLM backend without getting a BrokenPipeError? Also, how can I utilize both GPUs of my machine?
Thank you for your help!

defaults:
  - benchmark
  - scenario: inference
  - launcher: process
  - backend: vllm
  - _base_
  - _self_

name: vllm_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  model: hugging-quants/Meta-Llama-3.1-70B-BNB-NF4-BF16
  device: cuda
  device_ids: 0,1
  engine_args:
    enforce_eager: true
  #torch_dtype: float16
  #quantization_scheme: bnb
  #quantization_config:
    #load_in_4bit: true
    #bnb_4bit_compute_dtype: float16

scenario:
  memory: true
  latency: true
  energy: true

  warmup_runs: 10
  iterations: 10
  duration: 10

  input_shapes:
    batch_size: 2
    sequence_length: ${seq_len}
  generate_kwargs:
    max_new_tokens: 32
    min_new_tokens: 32

hydra:
  sweeper:
    params:
      scenario.input_shapes.sequence_length: 512, 768, 1024, 1536, 2048, 3072, 4096, 6144
  run:
    dir: outputs/${hydra.job.name}/${hydra.job.num}_${scenario.input_shapes.sequence_length}
@IlyasMoutawwakil
Member

I don't think this is because of NF4 or quantization in general. vLLM multi-GPU used to work in optimum-benchmark when it used Ray for distribution, but the last time I tried it, it didn't work, so I would appreciate a PR if you can get it working.
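
For anyone who wants to experiment with the multi-GPU path: vLLM itself handles multi-GPU inference through tensor parallelism, so a rough starting point (untested here, and assuming the backend forwards engine_args to vLLM unchanged) would look something like:

backend:
  device: cuda
  device_ids: 0,1
  engine_args:
    enforce_eager: true
    tensor_parallel_size: 2  # shard the model across both visible GPUs; vLLM handles the distribution

Whether this actually works from inside optimum-benchmark's launcher is exactly the open question mentioned above.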

@j-irion
Author

j-irion commented Sep 4, 2024

Thanks for your answer! Understood regarding the use of dual GPUs.

I think the BrokenPipeError does happen because of the quantization. I can run the benchmark with the same configuration and the full meta-llama/Meta-Llama-3.1-8B model on one GPU. If I try the BNB NF4 quantized version, hugging-quants/Meta-Llama-3.1-8B-BNB-NF4, it throws the error.
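
One thing I still want to try is telling vLLM explicitly that the checkpoint is bitsandbytes-quantized. Untested, and assuming these engine_args are passed through to vLLM and that my vLLM version supports pre-quantized bitsandbytes checkpoints, it would look roughly like:

backend:
  model: hugging-quants/Meta-Llama-3.1-8B-BNB-NF4
  device: cuda
  engine_args:
    enforce_eager: true
    quantization: bitsandbytes  # tell vLLM the weights are BNB-quantized
    load_format: bitsandbytes   # load the pre-quantized BNB checkpoint directly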

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Sep 4, 2024

The broken pipe error is not a problem in itself; it happens when the process you're running the benchmark in exits abruptly.
The inline launcher is good for debugging these issues, so can you try running your benchmark with launcher: inline? Also, can you confirm that this model works with vLLM outside of optimum-benchmark?
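
For reference, switching to the inline launcher should just be a matter of changing the launcher group in the defaults list (a sketch based on the config above; the existing launcher: block with device_isolation would likely need to be dropped or adjusted, since that is a process-launcher feature):

defaults:
  - benchmark
  - scenario: inference
  - launcher: inline
  - backend: vllm
  - _base_
  - _self_

That way the backend runs in the main process and the underlying traceback should show up instead of the BrokenPipeError.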

@IlyasMoutawwakil
Member

Any news?
