Common Issue Summary #232

Closed
czczup opened this issue May 30, 2024 · 1 comment
czczup commented May 30, 2024

Hi everyone,

This is the Common Issue Summary, where I will compile frequently encountered issues. If you notice any omissions, please feel free to help add to the list. Thank you!


czczup pinned this issue May 30, 2024
czczup commented May 30, 2024

I will summarize common issues here.

1. Multi-GPU Inference - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

Issues: #229, #118

Many people have encountered this bug, and we haven't yet found a good method that handles all cases. However, there is a workaround: manually assign the model's modules to devices.

For example, to deploy this 26B model on two V100 GPUs:

The model has 26B parameters in total, so ideally each card holds 13B. After excluding the 6B ViT, card 0 still needs to hold 7B of the LLM, i.e. roughly 1/3 of the 20B LLM goes on card 0 and the remaining 2/3 on card 1.

In code, it would look like this:

import torch
from transformers import AutoModel

path = 'OpenGVLab/InternVL-Chat-V1-5'  # local path or HF repo id

# Keep the ViT and projector on GPU 0, the embeddings near the first LLM
# layer, and the norm/output head near the last LLM layer.
device_map = {
    'vision_model': 0,
    'mlp1': 0,
    'language_model.model.tok_embeddings': 0,  # near the first layer of LLM
    'language_model.model.norm': 1,  # near the last layer of LLM
    'language_model.output.weight': 1  # near the last layer of LLM
}
# 16 of the 48 LLM layers (1/3) on GPU 0, the remaining 32 (2/3) on GPU 1.
for i in range(16):
    device_map[f'language_model.model.layers.{i}'] = 0
for i in range(16, 48):
    device_map[f'language_model.model.layers.{i}'] = 1
print(device_map)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map=device_map
).eval()
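
If you need a different split (another layer count or ratio), a small helper along these lines can build the same kind of map. This is a hypothetical sketch, not part of the repo; the module names are taken from the snippet above:

def make_device_map(num_layers, num_layers_on_gpu0):
    # Same layout as above: ViT, projector, and embeddings on GPU 0;
    # norm and output head on GPU 1, near the last LLM layer.
    device_map = {
        'vision_model': 0,
        'mlp1': 0,
        'language_model.model.tok_embeddings': 0,
        'language_model.model.norm': 1,
        'language_model.output.weight': 1,
    }
    for i in range(num_layers):
        device_map[f'language_model.model.layers.{i}'] = 0 if i < num_layers_on_gpu0 else 1
    return device_map

# e.g. the split used above: 48 LLM layers, 16 of them on GPU 0
device_map = make_device_map(48, 16)

After loading, you can print model.hf_device_map to verify where each module actually landed.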

2. Multi-Image Inference - When the number of images exceeds two, the model seems to treat all the input as one image. From the code, all image blocks appear to be fed to the model together, without distinguishing between different images. The problem is the same even with lmdeploy.

Issues: #223

The current V1.5 model was not trained with such (interleaved) data. Modifying the inference interface can support it, but the results are unstable.

The June version will include multi-image interleaved training, which should improve performance. The code will also support this feature at that time.
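
For reference, a rough sketch of how the current single-image interface can be coaxed into taking two images, by concatenating their tiles into one batch. This is an assumption-laden sketch: load_image is the tiling/preprocessing helper from the repo's README, model and tokenizer are assumed already loaded as in section 1, and the model will still see the concatenated tiles as one image:

import torch

# `load_image` is assumed to be the tiling helper from the InternVL README;
# it returns a (num_tiles, 3, H, W) tensor for one image.
pixel_values1 = load_image('image1.jpg', max_num=6).to(torch.bfloat16).cuda()
pixel_values2 = load_image('image2.jpg', max_num=6).to(torch.bfloat16).cuda()

# Concatenating the tiles puts both images into one batch; the model gets no
# marker telling it where one image ends and the next begins.
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
response = model.chat(tokenizer, pixel_values,
                      'Describe the two images.',
                      dict(max_new_tokens=512))
print(response)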

3. Prompt Format

Issues: #227

TODO

4. Quantization - AWQ / INT4 quantization, low GPU utilization during INT8 model inference

Issues: #209, #210, #193, #167

Thanks to the lmdeploy team for providing AWQ quantization support.

The 4-bit model is available at OpenGVLab/InternVL-Chat-V1-5-AWQ. You can try this one.
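
A minimal usage sketch with lmdeploy's VLM pipeline, assuming a recent lmdeploy version; model_format='awq' tells the TurboMind backend that the weights are AWQ-quantized:

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Assumption: a recent lmdeploy build with InternVL + AWQ support in TurboMind.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq'))
image = load_image('image.jpg')
response = pipe(('Describe this image.', image))
print(response.text)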

czczup closed this as completed Sep 6, 2024