
ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object #2459

Open
dinchu opened this issue Sep 21, 2023 · 9 comments

Comments


dinchu commented Sep 21, 2023

When trying to load quantized models, I always get:

ValueError: Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object


aliozts commented Sep 21, 2023

Hi, may I ask how you load the model? In my case, with a single GPU, I also had this problem and had to pass disable_exllama=True while loading the model (or, if you are loading it directly from a file, edit the model's config.json and add "disable_exllama": true under quantization_config). When I worked with 2 GPUs, I did not have this problem. Sorry if this does not answer your question, but I hope it helps; sadly, I do not know why this happens.
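
A minimal sketch of that loading-time workaround, assuming transformers with auto-gptq installed; the model id below is only a placeholder:

```python
# Sketch: disable the exllama kernels at load time instead of editing config.json.
# "TheBloke/Llama-2-7B-GPTQ" is a placeholder model id; bits must match the checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Llama-2-7B-GPTQ"

# Newer transformers versions spell this GPTQConfig(bits=4, use_exllama=False).
quantization_config = GPTQConfig(bits=4, disable_exllama=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
```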

@ilovesouthpark

That works, thanks @aliozts.

@tigerinus


Disabling Exllama makes inference much slower.

Check out AutoGPTQ/AutoGPTQ#406 for how to enable Exllama.
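
If you want to keep Exllama enabled, the error itself is the hint: device_map="auto" has offloaded some layers to CPU or disk. A hedged sketch that pins the whole model onto one GPU instead (only viable if the quantized model fits in that card's VRAM; the model id is a placeholder):

```python
# Sketch: force every module onto a single GPU so the exllama backend can be used.
# Only works when the quantized model fits entirely in that GPU's VRAM.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # placeholder model id
    device_map={"": 0},          # everything on cuda:0, no CPU/disk offload
)
```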

@Saravan004

Why can't I run it on the GPU, even though I have an NVIDIA GeForce MX450?
Could anyone please help?

@apoorvpandey0

I have an NVIDIA GTX 1650 and am still getting the same error.

@chenyujiang11

Adding "disable_exllama": true under quantization_config in config.json solves the problem.
This error only appears with a single GPU; it never showed up with multiple GPUs. The GPU used was a Tesla T4.
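
For reference, a small hedged script that applies that config.json edit to a locally downloaded model (the path is a placeholder):

```python
# Sketch: add "disable_exllama": true to the quantization_config block of config.json.
import json
from pathlib import Path

config_path = Path("/path/to/model/config.json")  # placeholder path

config = json.loads(config_path.read_text())
config.setdefault("quantization_config", {})["disable_exllama"] = True
config_path.write_text(json.dumps(config, indent=2))
```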

@UmiVilbig

I was running into a similar problem running GPTQ in a Docker container: I was getting the disable_exllama error. In short, the issue showed up when I ran the container without the --gpus all flag. Below is my system config:

GPU: 1660Ti
transformers==4.36.2
optimum==1.16.1
auto-gptq==0.6.0+cu118
CUDA=12.3

SOLUTION: I fixed the disable_exllama error by running the container with --gpus all.
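
A quick sanity check you can run inside the container (assuming PyTorch is installed there): if it prints False, accelerate will offload modules to CPU and you will hit the same error.

```python
# Sketch: confirm the GPU is visible to PyTorch inside the container.
# If this prints False, the container was likely started without --gpus all.
import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
```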

@NamburiSrinath

I am also facing the same issue. Disabling exllama slows down inference a lot, so I am not sure that is the ideal way to go.

Here are more details - #3530


kj-1024 commented Oct 11, 2024

I ran into the same problem: launching Qwen2.5-32B-Instruct-GPTQ-Int4 with the dbgpt project failed with the error in the title. I tracked it down to config.json; changing use_exllama from true to false solved it. Hope this helps.
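
The same switch can also be passed at load time; a hedged sketch using the newer parameter name in recent transformers versions (the Hugging Face repo id is assumed from the model name above):

```python
# Sketch: recent transformers versions expose the switch as use_exllama
# instead of disable_exllama.
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",  # assumed repo id
    device_map="auto",
    quantization_config=GPTQConfig(bits=4, use_exllama=False),
)
```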
