[VLM] Merged multi-modal processor for InternVL-based models #12553
Conversation
@@ -723,7 +723,7 @@ See [this page](#generative-models) for more information on how to use generative models.
   * ✅︎
 - * `NVLM_D_Model`
   * NVLM-D 1.0
-  * T + I<sup>E+</sup>
+  * T + I<sup>+</sup>
Embedding inputs is not supported because we need image size information to calculate the number of patches for the multi-modal processor.
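For context, here is a rough sketch (hypothetical code, not the actual vLLM implementation) of why the patch count depends on the image size under InternVL-style dynamic tiling; the min/max tile counts and thumbnail rule come from the HF config:

# Hypothetical sketch of InternVL-style dynamic tiling: the number of
# patches is derived from the image's aspect ratio, so it cannot be
# recovered from precomputed image embeddings alone.
def get_num_patches(width: int, height: int, min_num: int = 1,
                    max_num: int = 12, use_thumbnail: bool = True) -> int:
    aspect_ratio = width / height
    # Candidate tilings (cols, rows) whose total tile count is in range.
    candidates = [(i, j) for i in range(1, max_num + 1)
                  for j in range(1, max_num + 1)
                  if min_num <= i * j <= max_num]
    # Pick the tiling whose aspect ratio is closest to the image's.
    cols, rows = min(candidates,
                     key=lambda t: abs(aspect_ratio - t[0] / t[1]))
    blocks = cols * rows
    # An extra thumbnail tile is appended whenever the image is split.
    if use_thumbnail and blocks > 1:
        blocks += 1
    return blocks

print(get_num_patches(1920, 1080))  # a wide image is split into more tiles
print(get_num_patches(448, 448))    # a square image fitting one tile -> 1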
@staticmethod
def flat_from_sizes(modality: str, size_per_item: torch.Tensor):
    slice_idxs = [0, *accumulate(size_per_item)]
    slices = [
        slice(slice_idxs[i], slice_idxs[i + 1])
        for i in range(len(size_per_item))
    ]

    return MultiModalFieldConfig.flat(modality, slices)
Added a factory method to avoid repeating the logic of computing the slice indices.
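As a quick illustration (hypothetical sizes, standard library only), this is what the slice computation in the factory produces:

from itertools import accumulate

# Hypothetical per-item sizes, e.g. the number of patches for three images.
size_per_item = [3, 5, 2]

slice_idxs = [0, *accumulate(size_per_item)]   # [0, 3, 8, 10]
slices = [
    slice(slice_idxs[i], slice_idxs[i + 1])
    for i in range(len(size_per_item))
]
print(slices)  # [slice(0, 3), slice(3, 8), slice(8, 10)]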
Overall LGTM! Thanks for refactoring this model!
min_num = hf_config.min_dynamic_patch
if dynamic_image_size is None:
    dynamic_image_size = hf_config.dynamic_image_size

class InternVLProcessor:
I think we can also use this to replace the patched InternVLProcessor in the tests:
vllm/tests/models/decoder_only/vision_language/vlm_utils/model_utils.py
Lines 390 to 428 in f17f1d4
class InternVLProcessor:
    """A simple processor for InternVL2 which misses a processor."""

    def __init__(self, hf_runner: HfRunner):
        self.num_image_token = hf_runner.model.num_image_token
        self.tokenizer = hf_runner.tokenizer
        self.dtype = hf_runner.model.dtype

        self.config = AutoConfig.from_pretrained(hf_runner.model_name,
                                                 trust_remote_code=True)
        self.vision_config = self.config.vision_config
        self.use_thumbnail = self.config.use_thumbnail
        self.min_num = self.config.min_dynamic_patch
        self.max_num = self.config.max_dynamic_patch
        self.image_size = self.vision_config.image_size

    def __call__(self, text: str, images: Union[Image, List[Image]],
                 **kwargs):
        from vllm.model_executor.models.internvl import (
            IMG_CONTEXT, IMG_END, IMG_START, image_to_pixel_values)
        images = [images] if isinstance(images, Image) else images
        pixel_values = [
            image_to_pixel_values(image, self.image_size, self.min_num,
                                  self.max_num,
                                  self.use_thumbnail).to(self.dtype)
            for image in images
        ]
        num_patches_list = [
            pixel_value.shape[0] for pixel_value in pixel_values
        ]
        pixel_values = torch.cat(pixel_values, dim=0)
        for num_patches in num_patches_list:
            context_tokens = IMG_CONTEXT * self.num_image_token \
                * num_patches
            image_tokens = IMG_START + context_tokens + IMG_END
            text = text.replace('<image>', image_tokens, 1)
        prompt = self.tokenizer(text, return_tensors="pt")
        prompt.update({"pixel_values": pixel_values})
        return prompt
The new implementation of HF processor flattens the pixel values, so it is incompatible with this alternative implementation.
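To illustrate the shape difference (hypothetical numbers): the merged processor emits a single flattened tensor of patches for the whole prompt together with the per-image patch counts, from which the per-image tensors can be recovered, whereas the old test processor worked with per-image tensors directly. A minimal sketch:

import torch

# Hypothetical example: two images tiled into 7 and 13 patches.
image_num_patches = torch.tensor([7, 13])

# The merged processor returns one flattened tensor over all patches.
pixel_values_flat = torch.randn(int(image_num_patches.sum()), 3, 448, 448)

# Per-image tensors can be recovered by splitting along the patch dim.
per_image = torch.split(pixel_values_flat,
                        image_num_patches.tolist(),
                        dim=0)
print([t.shape[0] for t in per_image])  # [7, 13]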
Actually, let me try updating the tests to not use the HF processor - perhaps I can patch the generate method to be based on chat... It's difficult to keep the implementations separate since I need to copy the processing code from the README anyway.
Have you tested NVLM-D already? Turns out that I also have to update H2O-VL because it depends on InternVL.
It seems that NVLM-D doesn't work yet (tested with aleiko/NVLM-D-72B-w4a16, which used about 52 GB of VRAM in total):
Converting this to a draft until I get the tests to pass locally.
The NVLM-D model should work on the latest commit (bd67b1a) now.
V0 outputs:
V1 outputs:
Update InternVL (and, similarly, H2OVL and NVLM-D) to use the merged multi-modal processor.
Note:
BaseProcessingInfo.get_mm_max_tokens_per_item
now takes an mm_counts argument to accommodate H2O-VL, which has different processing logic depending on how many images are passed (see the sketch after the output sections below).
InternVL V0 output
InternVL V1 output
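A minimal sketch of the idea behind the new mm_counts argument (the helper below is hypothetical, not the actual H2O-VL or vLLM code): the maximum number of tokens per image can differ once more than one image is passed, so the hook needs to know the requested item counts.

from typing import Mapping, Optional

# Hypothetical helper mirroring the idea: with several images in the
# prompt, H2O-VL-style processing may cap the dynamic tiling, so the
# per-item token budget shrinks compared to the single-image case.
def get_max_image_tokens(num_image_token: int,
                         max_dynamic_patch: int,
                         mm_counts: Optional[Mapping[str, int]]) -> int:
    num_images = (mm_counts or {}).get("image", 1)
    # Assumed rule for illustration only: disable dynamic tiling (one
    # tile per image) as soon as more than one image appears.
    patches = max_dynamic_patch if num_images <= 1 else 1
    return num_image_token * patches

print(get_max_image_tokens(256, 6, {"image": 1}))  # 1536
print(get_max_image_tokens(256, 6, {"image": 3}))  # 256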
@ywang96 can you help check NVLM-D? My current setup doesn't have enough memory for it.
After this PR, only Molmo and Pixtral still use the legacy input mapper in V1.