Intel demo AutoModelForCausalLM model.generate wrong response when docker runs the same chatglm3-int4 model bin file #1302

ahlwjnj opened this issue Jul 29, 2024 · 0 comments


System Info

Ubuntu 22.04, Docker

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts and tasks

Reproduction

from transformers import TextIteratorStreamer, AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

model_name = "./models/chatglm3-6b"
max_length = 512  # not defined in the original snippet; assumed generation limit

# Load the model with 4-bit RTN quantization via Neural Speed
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=RtnConfig(
        bits=4,
        compute_dtype="int8",
        weight_dtype="int4_fullrange",
        use_neural_speed=True,
    ),
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# The system message means "Your name is 'XX Chat'."
history = [{"role": "system", "content": "你的名字是'XX Chat'."}]
prompt = {"role": "user", "content": "Hi, please introduce yourself in Chinese."}
messages = history + [prompt]

## Start to chat
model_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)
output = model.generate(input_ids=model_inputs, max_new_tokens=max_length)
print("output=", output)
response = tokenizer.decode(output[0], skip_special_tokens=False)
print("origin response=\n", response)

Running the above code directly on the development PC (Ubuntu) produces the correct response, but running the same code on another PC from a Docker image produces a wrong response. The wrong response looks like the following, regardless of whether the host is Ubuntu or Windows:
2024-07-23 22:26:38 output=
2024-07-23 22:26:38 [[64790, 64792, 906, 31007, 13361, 31007, 30994, 13, 30910, 31822, 32873, 54532, 30953, 11214, 22011, 6263, 64795, 30910, 13, 8686, 30932, 2929, 9571, 3040, 291, 4463, 30930, 64796, 4033, 37204, 37204, 37204, 37204, 37204, 37204, 37204,
...
37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 37204, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
2024-07-23 22:26:38 origin response=
2024-07-23 22:26:38 [gMASK] sop <|system|>
2024-07-23 22:26:38 你的名字是'XX Chat'. <|user|>
2024-07-23 22:26:38 Hi, please introduce yourself in Chinese. <|assistant|> Gold负面负面负面负面负面负面负面负面负面负面负面负面 ...

Q1: Is this a quantization_config problem? I have changed weight_dtype to "int4_fullrange", "int4_clip", and "int4", and the wrong response is the same.
Q2: Can anything go wrong when the Docker image containing the quantized model bin file is copied from one PC to another?
Q3: How can I debug this problem when the Docker image with the quantized model bin file is copied? (A small verification sketch follows below.)
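
One way to start on Q2/Q3 is to rule out a corrupted or mismatched model file by comparing checksums of the quantized bin files on the development PC and inside the container. This is only a minimal sketch, assuming the model files live under ./models/chatglm3-6b; the path and glob pattern are assumptions, not taken from the actual setup.

import hashlib
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical location of the quantized model files; adjust to your setup.
model_dir = Path("./models/chatglm3-6b")
for f in sorted(model_dir.glob("*.bin")):
    print(f.name, sha256_of(f))

If the digests printed on the development PC and inside the running container (e.g. via docker exec) match, the copied bin file itself is probably fine, and the difference is more likely in the runtime environment inside the image (CPU instruction sets available to the int4 kernels, library versions, etc.); if they differ, the problem is in how the image or the model files were copied.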

Expected behavior

AutoModelForCausalLM model.generate should return the correct response when docker runs the same chatglm3-int4 model bin file.
