Multi-Image or Multi-Video Inference Example #97

chancharikmitra · 2024-07-20T10:16:19Z

Hello, and thanks for such a great contribution to the field of interleaved LMMs! This is really great work. I was wondering if there was an example of the format for multiple image or multiple video inference (similar to what is shown in the in-context learning examples)? Does it involve appending multiple <image> tokens at the specified locations? And then are the images and videos inserted sequentially?

From my understanding of the run_vila.py script, the way to have an ICL input for images (and the corresponding structure for videos, of course) would be as follows:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<image>\n ICL text 1 <image>\n ICL text 2 <image>\n" \
    --image-file "img1.png,img2.png,img3.png"

However, I am not sure if the positions of the <image> tokens are considered by the model during generation because looking at the llava_llama.py, the method for preparing the multimodal inputs is inherited from LLaVA, which I believe just concatenates the image features and does not embed them specifically in the locations of the <image> tokens.

I may have missed something as I am still new to the codebase and exploring the model more deeply. Would appreciate any clarification on the point about multi-image and multi-video inputs. Thanks!

Edit: After having looked more deeply, it seems to me at least that the way I have formatted the prompt (with '\n' included) aligns with your code. However, I see in your paper that the image tokens are enumerated:

Edit 2:

As a side note, I do get this warning a lot.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:128001 for open-end generation.

I think the pad token is fine as it is automatically set to the eos_token. But what about the mask? I see no mention of that when I try to evaluate on datasets like SEEDBench. I do seem to get uncharacteristically low acc. on these benchmarks, and I am trying to find out why.

I also noticed that the run_vila.py script does not have 'llama_3' as a conv_mode option. Is it possible that VILA-1.5 used a different conv_mode?

The text was updated successfully, but these errors were encountered:

DtYXs · 2024-07-26T11:46:05Z

Hello, I think in the VILA code, images are embed specifically in the locations of the <image> tokens.

VILA/llava/model/llava_arch.py

Lines 371 to 391 in 0085724

    
           for i in range(num_images + 1): 
        
               cur_new_input_embeds.append(cur_input_embeds_no_im[i]) 
        
               cur_new_labels.append(cur_labels_noim[i]) 
        
               if i < num_images: 
        
                   cur_image_features = image_features[cur_image_idx] 
        
                   cur_image_idx += 1 
        
                   cur_new_input_embeds.append(cur_image_features) 
        
                   cur_new_labels.append( 
        
                       torch.full( 
        
                           (cur_image_features.shape[0],), 
        
                           IGNORE_INDEX, 
        
                           device=cur_labels.device, 
        
                           dtype=cur_labels.dtype, 
        
                       ) 
        
                   ) 
        
           cur_new_input_embeds = torch.cat(cur_new_input_embeds) 
        
           cur_new_labels = torch.cat(cur_new_labels) 
        
           new_input_embeds.append(cur_new_input_embeds) 
        
           new_labels.append(cur_new_labels)

chancharikmitra · 2024-07-27T08:39:17Z

Thank you @DtYXs for the clarification about the <image> token and its placement! Given that, do you have any insights on why zero-shot performance on VILA-1.5-8b might be lower than what is being reported? Few-shot improvements are fantastic as advertised. Perhaps it is related to my concerns regarding masking and the conv_mode formatting. However, looking deeper at the eval scripts, I see that the conv_mode was passed directly - so 'llama_3' indeed would have been used.

chancharikmitra changed the title ~~Unquantized Model Availability~~ Multi-Image or Multi-Image Inference Example Jul 20, 2024

chancharikmitra changed the title ~~Multi-Image or Multi-Image Inference Example~~ Multi-Image or Multi-Video Inference Example Jul 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-Image or Multi-Video Inference Example #97

Multi-Image or Multi-Video Inference Example #97

chancharikmitra commented Jul 20, 2024 •

edited

Loading

DtYXs commented Jul 26, 2024 •

edited

Loading

chancharikmitra commented Jul 27, 2024

Multi-Image or Multi-Video Inference Example #97

Multi-Image or Multi-Video Inference Example #97

Comments

chancharikmitra commented Jul 20, 2024 • edited Loading

DtYXs commented Jul 26, 2024 • edited Loading

chancharikmitra commented Jul 27, 2024

chancharikmitra commented Jul 20, 2024 •

edited

Loading

DtYXs commented Jul 26, 2024 •

edited

Loading