Running FastChat on GPTQ (and quantized) models #3530

Open
NamburiSrinath opened this issue Sep 19, 2024 · 0 comments
NamburiSrinath commented Sep 19, 2024

Hi team,

I have a question about generating model responses from a GPTQ-quantized model.

I compressed Llama-2-7B-chat with basic GPTQ quantization (optimum's GPTQQuantizer) via transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# 4-bit GPTQ quantization calibrated on the C4 dataset
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Save the quantized weights and the tokenizer
save_folder = "directory/"
quantizer.save(model, save_folder)
tokenizer.save_pretrained(save_folder)
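
For completeness, a minimal sketch of how I'd expect the saved checkpoint to load back entirely on one GPU (the folder path and device string are placeholders matching the snippet above):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

save_folder = "directory/"  # same placeholder folder as above

# Keeping every module on a single GPU is what the exllama kernels expect;
# any module left on cpu/disk triggers the ValueError shown below.
tokenizer = AutoTokenizer.from_pretrained(save_folder)
model = AutoModelForCausalLM.from_pretrained(
    save_folder,
    device_map="cuda:0",
    torch_dtype=torch.float16,
)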

Once the model is saved, I try to generate answers with the following command:

python gen_model_answer.py --model-path directory/ --model-id llama-2-7b-gptq-4

But this throws the following error:

File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/Compress_Align/FastChat/fastchat/llm_judge/gen_model_answer.py", line 103, in get_model_answers
    model, tokenizer = load_model(
  File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 379, in load_model
    model, tokenizer = adapter.load_model(model_path, kwargs)
  File "/home/ubuntu/Compress_Align/FastChat/fastchat/model/model_adapter.py", line 124, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3823, in from_pretrained
    hf_quantizer.postprocess_model(model)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/base.py", line 195, in postprocess_model
    return self._process_model_after_weight_loading(model, **kwargs)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/transformers/quantizers/quantizer_gptq.py", line 80, in _process_model_after_weight_loading
    model = self.optimum_quantizer.post_init_model(model)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/optimum/gptq/quantizer.py", line 595, in post_init_model
    raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
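
For reference, the workaround the error message points at (which I'd like to avoid because of the slowdown) would look roughly like the sketch below; the exact flag depends on the transformers version, with newer releases taking use_exllama=False instead of disable_exllama=True:

from transformers import AutoModelForCausalLM, GPTQConfig

# Sketch of loading with the exllama kernels turned off, as the error suggests.
quantization_config = GPTQConfig(bits=4, use_exllama=False)
model = AutoModelForCausalLM.from_pretrained(
    "directory/",  # placeholder path to the quantized checkpoint
    quantization_config=quantization_config,
    device_map="auto",
)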

I was able to pass device_map='cuda' in the load_model() function of the BaseModelAdapter class, but inference is far too slow (I suspect the model is loaded on the GPU while the computations run on the CPU, which is not what I expected).

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    # device_map='cuda',  # change made to support the quantized model instead of disabling exllama
    **from_pretrained_kwargs,
)
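
As a sanity check (just a sketch, not FastChat code), this is roughly how I'm verifying where the weights actually land after load_model() returns; hf_device_map is only populated when a device_map was passed:

from collections import Counter

# Per-module placement chosen by accelerate (None if no device_map was used).
print(getattr(model, "hf_device_map", None))

# Count parameters per device to see whether anything stayed on the CPU.
print(Counter(str(p.device) for p in model.parameters()))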

Is there a way to generate answers and evaluate quantized models the same way as described in https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/README.md? Or am I missing something fundamental?

Tagging relevant issues:

  1. ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config objec #2459
  2. Using Exllama backend requires all the modules to be on GPU - how? AutoGPTQ/AutoGPTQ#406

The main fix suggested is to disable exllama, but that increases inference time a lot!
