Failed to process batch: Currently, only support batch_size=1
#19
We have updated the model code on Hugging Face to support batch generation: https://huggingface.co/AIDC-AI/Ovis1.6-Gemma2-9B/commit/68fded4ab4ae6626f38fbf1cf5b18a01ccf8a41d. Below is an example of how to run batch generation:

import torch
from PIL import Image

# assumes model, text_tokenizer, and visual_tokenizer are already loaded
batch_inputs = [
    ('example_image1.jpeg', 'Describe the content of this image.'),
    ('example_image2.jpeg', 'What is the equation in the image?')
]

batch_input_ids = []
batch_attention_mask = []
batch_pixel_values = []

# preprocess each image-prompt pair separately
for image_path, text in batch_inputs:
    image = Image.open(image_path)
    query = f'<image>\n{text}'
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    input_ids = input_ids.unsqueeze(0).to(device=model.device)
    attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
    pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]
    batch_input_ids.append(input_ids.squeeze())
    batch_attention_mask.append(attention_mask.squeeze())
    batch_pixel_values.append(pixel_values)

# left-pad to a common length: flip each sequence, right-pad, then flip back
pad_batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True, padding_value=0.0).flip(dims=[1])
pad_batch_input_ids = pad_batch_input_ids[:, -model.config.multimodal_max_length:]
pad_batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask], batch_first=True, padding_value=False).flip(dims=[1])
pad_batch_attention_mask = pad_batch_attention_mask[:, -model.config.multimodal_max_length:]
# flatten the per-sample lists of pixel tensors into one list
pad_batch_pixel_values = [item for sublist in batch_pixel_values for item in sublist]

# generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(pad_batch_input_ids, pixel_values=pad_batch_pixel_values, attention_mask=pad_batch_attention_mask, **gen_kwargs)

for i in range(len(batch_input_ids)):
    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
    print(f'Output_{i}:\n{output}')
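
The flip, `pad_sequence`, flip sequence in the example above left-pads each prompt so that every row ends at the same position before generation. The following is a minimal pure-Python sketch of that idea on toy token ids. The `left_pad` helper and the sample values are hypothetical and not part of the Ovis code. Unlike the example above, the mask here simply marks every original token as True.

```python
def left_pad(sequences, pad_value, max_length=None):
    """Left-pad lists to a common length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        # padding goes in front, so real tokens stay right-aligned
        padded.append([pad_value] * n_pad + list(seq))
        masks.append([False] * n_pad + [True] * len(seq))
    if max_length is not None:
        # keep only the last max_length positions, like tensor[:, -max_length:]
        padded = [row[-max_length:] for row in padded]
        masks = [row[-max_length:] for row in masks]
    return padded, masks


ids, mask = left_pad([[5, 6, 7], [9]], pad_value=0)
print(ids)   # [[5, 6, 7], [0, 0, 9]]
print(mask)  # [[True, True, True], [False, False, True]]
```

Padding on the left keeps the last real token of every prompt in the final column, which is where a decoder-only model continues generating from.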
Besides batch inference, which takes image-prompt pairs, is there a way to pass multiple images with a single prompt?
After modifying the code according to the example, I'm still encountering AssertionError: Currently, only support batch_size=1. Do I need to update the model, or did I miss something in the modifications? My transformers version is correct. My code is as follows:

import torch
model = AutoModelForCausalLM.from_pretrained("/opt/cv/xxx/demo/models/Ovis1.6-Gemma2-9B/",
batch_inputs = [
batch_input_ids = []
for image_path, text in batch_inputs:
pad_batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids],batch_first=True, padding_value=0.0).flip(dims=[1])
# generate output
with torch.inference_mode():
for i in range(len(batch_input_ids)):
After updating modeling_ovis.py, I was able to run it successfully.
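
A quick way to tell whether a local model directory still holds the old `modeling_ovis.py` is to search that file for the assertion message reported in this thread. The sketch below rests on two assumptions: that the old file contains the message text literally, and that the updated file no longer does. The `has_batch_size_limit` helper is hypothetical.

```python
from pathlib import Path

# Message fragment taken from the AssertionError reported in this thread.
OLD_ASSERT_MESSAGE = "only support batch_size=1"


def has_batch_size_limit(model_dir):
    """Return True if modeling_ovis.py under model_dir still mentions the batch_size=1 limit."""
    source_file = Path(model_dir) / "modeling_ovis.py"
    if not source_file.is_file():
        raise FileNotFoundError(f"{source_file} not found")
    return OLD_ASSERT_MESSAGE in source_file.read_text(encoding="utf-8")


# Usage (the path is a placeholder for your local model directory):
# has_batch_size_limit("path/to/Ovis1.6-Gemma2-9B")
```

If this returns True for a locally downloaded copy, replacing `modeling_ovis.py` with the version from the commit linked above is the likely fix.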
@YichengShen While Ovis1.6 is primarily trained on single-image samples, it also supports multi-image inputs. Here is an example demonstrating how to handle two-image inputs:

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# load model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis1.6-Gemma2-9B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=8192,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# enter image path and prompt
images = []
for i in range(2):
    image_path = input(f"Enter image_{i+1} path: ")
    images.append(Image.open(image_path))
text = input("Enter prompt: ")
query = f'Image 1: <image>\nImage 2: <image>\n{text}'

# format conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

# generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'Output:\n{output}')
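
The two-image query above uses one `<image>` placeholder per image, each prefixed with an `Image k:` label. A small sketch of building that string for any number of images follows. The `build_multi_image_query` helper is hypothetical, and extending the labelling pattern beyond two images is an assumption, not something confirmed in this thread.

```python
def build_multi_image_query(text, num_images):
    """Build a query with one '<image>' placeholder per image, labelled 'Image k:'."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # one labelled placeholder line per image, in order
    placeholders = "".join(f"Image {k}: <image>\n" for k in range(1, num_images + 1))
    return placeholders + text


print(build_multi_image_query("What differs between the images?", 2))
```

The resulting string would replace the hard-coded `query` in the example, with the `images` list presumably holding the same number of images as there are placeholders.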
When will support for batch size > 1 be available, or where should I make modifications to enable this feature?