vLLM quantization BrokenPipeError #252

Open
j-irion opened this issue Aug 29, 2024 · 4 comments

@j-irion

j-irion commented Aug 29, 2024

Hello,
I am trying to benchmark a model quantized to NF4 with bitsandbytes. How can I run it with the vLLM backend without getting a BrokenPipeError? Also, how can I utilize both GPUs of my machine?
Thank you for your help!

defaults:
  - benchmark
  - scenario: inference
  - launcher: process
  - backend: vllm
  - _base_
  - _self_

name: vllm_llama

launcher:
  device_isolation: true
  device_isolation_action: warn

backend:
  model: hugging-quants/Meta-Llama-3.1-70B-BNB-NF4-BF16
  device: cuda
  device_ids: 0,1
  engine_args:
    enforce_eager: true
  #torch_dtype: float16
  #quantization_scheme: bnb
  #quantization_config:
    #load_in_4bit: true
    #bnb_4bit_compute_dtype: float16

scenario:
  memory: true
  latency: true
  energy: true

  warmup_runs: 10
  iterations: 10
  duration: 10

  input_shapes:
    batch_size: 2
    sequence_length: ${seq_len}
  generate_kwargs:
    max_new_tokens: 32
    min_new_tokens: 32

hydra:
  sweeper:
    params:
      scenario.input_shapes.sequence_length: 512, 768, 1024, 1536, 2048, 3072, 4096, 6144
  run:
    dir: outputs/${hydra.job.name}/${hydra.job.num}_${scenario.input_shapes.sequence_length}
@IlyasMoutawwakil
Member

I don't think this is because of NF4 or quantization in general. vLLM multi-GPU used to work in optimum-benchmark when it used Ray for distribution, but the last time I tried it, it didn't work, so I would appreciate a PR if you can get it working.
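
For anyone who wants to experiment with the multi-GPU path: vLLM itself handles multi-GPU inference through tensor parallelism, so a rough starting point (untested here, and assuming the backend forwards engine_args to vLLM unchanged) would look something like:

backend:
  device: cuda
  device_ids: 0,1
  engine_args:
    enforce_eager: true
    tensor_parallel_size: 2  # shard the model across both visible GPUs; vLLM handles the distribution

Whether this actually works from inside optimum-benchmark's launcher is exactly the open question mentioned above.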

@j-irion
Author

j-irion commented Sep 4, 2024

Thanks for your answer! Understood regarding the use of dual GPUs.

I think the BrokenPipeError does happen because of the quantization. I can run the benchmark with the same configuration and the full meta-llama/Meta-Llama-3.1-8B model on one GPU. If I try the BNB NF4 quantized version, hugging-quants/Meta-Llama-3.1-8B-BNB-NF4, it throws the error.
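
One thing I still want to try is telling vLLM explicitly that the checkpoint is bitsandbytes-quantized. Untested, and assuming these engine_args are passed through to vLLM and that my vLLM version supports pre-quantized bitsandbytes checkpoints, it would look roughly like:

backend:
  model: hugging-quants/Meta-Llama-3.1-8B-BNB-NF4
  device: cuda
  engine_args:
    enforce_eager: true
    quantization: bitsandbytes  # tell vLLM the weights are BNB-quantized
    load_format: bitsandbytes   # load the pre-quantized BNB checkpoint directly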

@IlyasMoutawwakil
Member

IlyasMoutawwakil commented Sep 4, 2024

The broken pipe error is not a problem in itself; it happens when the process you're running the benchmark in exits abruptly.
The inline launcher is good for debugging these issues, so can you try running your benchmark with launcher: inline? Also, can you confirm that this model works with vLLM outside of optimum-benchmark?
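
For reference, switching to the inline launcher should just be a matter of changing the launcher group in the defaults list (a sketch based on the config above; the existing launcher: block with device_isolation would likely need to be dropped or adjusted, since that is a process-launcher feature):

defaults:
  - benchmark
  - scenario: inference
  - launcher: inline
  - backend: vllm
  - _base_
  - _self_

That way the backend runs in the main process and the underlying traceback should show up instead of the BrokenPipeError.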

@IlyasMoutawwakil
Member

Any news?
