Common Issue Summary 常见问题汇总 #232
I will summarize common issues here.

**1. Multi-GPU Inference: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!`**

Many people have encountered this bug, and we haven't yet found a good method that handles all cases. There is, however, a workaround: assign devices to the model manually. For example, to deploy this 26B model on two V100 GPUs: the model is 26B in total, so ideally each card holds 13B. After excluding the 6B ViT on card 0, card 0 can take 7B of the 20B LLM, i.e. roughly 1/3 of the LLM (16 of its 48 layers) goes on card 0 and 2/3 (32 layers) on card 1. In code, it looks like this:

```python
import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-5'

device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of the LLM
    'language_model.model.norm': 1,            # near the last layer of the LLM
    'language_model.output.weight': 1,         # near the last layer of the LLM
}
for i in range(16):      # first 16 of the 48 LLM layers on card 0
    device_map[f'language_model.model.layers.{i}'] = 0
for i in range(16, 48):  # remaining 32 layers on card 1
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map,
).eval()
```
**2. Multi-Image Inference: with multiple images, the model treats all the input as one image**

When the number of images exceeds two, the model seems to treat all the input as a single image. Looking at the code, all image blocks are fed to the model together, without distinguishing between different images (see the sketch below); the problem is the same even with lmdeploy. Issues: #223. The current V1.5 model was not trained on such (interleaved) data. Modifying the inference interface can support it, but the results are unstable. The June version will include multi-image interleaved training, which should improve performance, and the code will support this feature at that time.
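For context, here is a minimal sketch of why this happens, assuming the usage pattern from the V1.5 model card, where a `load_image` helper (defined in the model card, not shown here) tiles one image into pixel blocks and users concatenate the blocks of several images before calling `model.chat`:

```python
import torch

# load_image is the preprocessing helper from the InternVL-Chat-V1-5 model card
# (assumed available here); it returns a (num_tiles, 3, 448, 448) tensor per image.
pixel_values_1 = load_image('image1.jpg').to(torch.bfloat16).cuda()  # placeholder paths
pixel_values_2 = load_image('image2.jpg').to(torch.bfloat16).cuda()

# The common multi-image workaround: concatenate the tiles of both images.
pixel_values = torch.cat([pixel_values_1, pixel_values_2], dim=0)

# V1.5 receives one flat stack of tiles with nothing marking the boundary
# between images, so it effectively treats them as tiles of a single image.
generation_config = dict(max_new_tokens=512)
response = model.chat(tokenizer, pixel_values, 'Describe the two images.', generation_config)
```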
**3. Prompt Format**

Issues: #227. TODO.

**4. Quantization: AWQ / INT4 quantization, low GPU utilization during INT8 model inference**

Issues: #209, #210, #193, #167. Thanks to the lmdeploy team for providing AWQ quantization support. The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ; you can try that one (a minimal usage sketch follows below).
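A minimal usage sketch with lmdeploy's pipeline API, assuming a recent lmdeploy version with InternVL support (the image path is a placeholder):

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Load the 4-bit AWQ checkpoint with the TurboMind backend.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq'))

image = load_image('example.jpg')  # placeholder path
response = pipe(('Describe this image.', image))
print(response.text)
```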
Hi everyone,

This is a Common Issue Summary where I will compile frequently encountered issues. If you notice any omissions, please feel free to help add to the list. Thank you!