How to run inference with the fine-tuned weights / model #83

thisurawz1 · 2024-08-29T16:28:42Z

I have already fine-tuned VideoLLaMA2 on a custom dataset using QLoRA, and the fine-tuning produced the files shown above. How can I run inference with those weights? How can I use the fine-tuned model with the inference script you provided (below)?

I look forward to a solution as soon as possible. Thank you.

`
import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference (these assignments override the video settings above; comment out one of the two blocks)
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()
`


clownrat6 · 2024-09-06T07:28:55Z

Yes, you can. The latest commit supports loading the LoRA model directly.
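
For readers unsure whether their fine-tuning output directory is something a LoRA-aware loader can pick up, a rough pre-flight check might look like the sketch below. The file names are assumptions based on what PEFT and LLaVA-style trainers usually write (`adapter_config.json`, `adapter_model.safetensors` or `adapter_model.bin`, `non_lora_trainables.bin`); they are not confirmed anywhere in this thread, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed file names (not confirmed by this thread): PEFT adapter files plus
# the LLaVA-style file that stores non-LoRA trainable weights.
ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")
OPTIONAL_FILES = ("config.json", "non_lora_trainables.bin")


def describe_checkpoint(ckpt_dir):
    """Report which of the assumed LoRA checkpoint files exist in ckpt_dir."""
    root = Path(ckpt_dir)
    if root.is_dir():
        present = {p.name for p in root.iterdir() if p.is_file()}
    else:
        present = set()
    return {
        "has_adapter_config": ADAPTER_CONFIG in present,
        "has_adapter_weights": any(name in present for name in ADAPTER_WEIGHTS),
        "missing_optional": [name for name in OPTIONAL_FILES if name not in present],
    }


def looks_like_lora_checkpoint(ckpt_dir):
    """True when both an adapter config and adapter weights are present."""
    info = describe_checkpoint(ckpt_dir)
    return info["has_adapter_config"] and info["has_adapter_weights"]
```

Calling `describe_checkpoint('work_dirs/...')` on your own output directory before `model_init` gives a quick hint about whether a loading failure comes from missing files or from something else.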

thisurawz1 · 2024-09-06T08:14:38Z

Can you share the script for it, please? Do we just have to change the current model path to the LoRA path? I tried that, but it didn't work at all.

thisurawz1 · 2024-09-10T08:54:13Z

Can you share the exact script for running inference with the LoRA weights, please?

thisurawz1 · 2024-09-11T03:44:06Z

Can you share a script showing how to load the LoRA model directly? I have already finished fine-tuning and have those files, but I don't know how to run inference with them.

LiangMeng89 · 2024-10-14T08:02:55Z

Hello! I have the same problem. Have you solved it?

ffcarina · 2024-10-14T09:19:25Z

@thisurawz1 I successfully loaded the LoRA fine-tuned model for inference with the following code. Hope this helps.

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

modal = 'video'
modal_path = 'VideoLLaMA2/videollama2/serve/examples/sample_demo_1.mp4'
instruct = 'What is the baby wearing and what is he doing?'
model_path = 'VideoLLaMA2/work_dirs/videollama2/finetune_downstream_sft_settings_qlora_MESC' # your model dir

model, processor, tokenizer = model_init(model_path)
output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
print(output)

thisurawz1 · 2024-10-15T02:30:31Z

@LiangMeng89 Yes. Update the VideoLLaMA2 repository to the latest commit, then use the following script. You only have to change the model path in the original inference script; that's all.

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4' 
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    # Fine-tuned (LoRA/QLoRA) inference: point model_path at your fine-tuned weights directory
    model_path = 'work_dirs/videollama2/finetune_downstream_sft_settings_qlora'
    # Released model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()
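
Since the only thing that changes between the released model and a fine-tuned one is `model_path`, it may be convenient to pass it on the command line instead of editing the script. The following is a minimal sketch under that assumption; `parse_args`, `run`, and the flag names are hypothetical, while `model_init`, `mm_infer`, and `disable_torch_init` are used exactly as in the script above.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; defaults mirror the script above."""
    parser = argparse.ArgumentParser(description="VideoLLaMA2 inference with a configurable checkpoint")
    parser.add_argument("--model-path", default="DAMO-NLP-SG/VideoLLaMA2-7B",
                        help="Hub ID or local directory (for example, your fine-tuned weights directory)")
    parser.add_argument("--modal", choices=["video", "image"], default="video")
    parser.add_argument("--modal-path", default="assets/cat_and_chicken.mp4")
    parser.add_argument("--instruct",
                        default="What animals are in the video, what are they doing, and how does the video feel?")
    return parser.parse_args(argv)


def run(args):
    """Load the model named by args.model_path and answer args.instruct."""
    # Imported lazily so that argument parsing works even where the
    # videollama2 package is not importable.
    from videollama2 import model_init, mm_infer
    from videollama2.utils import disable_torch_init

    disable_torch_init()
    model, processor, tokenizer = model_init(args.model_path)
    return mm_infer(processor[args.modal](args.modal_path), args.instruct,
                    model=model, tokenizer=tokenizer, do_sample=False, modal=args.modal)
```

From the repository root, `print(run(parse_args()))` at the bottom of the file would then let you switch checkpoints with `--model-path work_dirs/videollama2/finetune_downstream_sft_settings_qlora`.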

thisurawz1 · 2024-10-15T02:33:01Z

Thank you so much

LiangMeng89 · 2024-10-28T17:35:39Z

Thank you, I will try this.

LiangMeng89 · 2024-10-30T18:23:49Z

Dear author, I applied your LoRA checkpoint folder structure and loading example code (#36) to my QLoRA fine-tuned inference script, running on my own experimental video data, but it still fails with errors. The old inference code from the README works. I simply put your loading code into my script. Please help me!

1: My finetune_qlora inference code:

import torch
import transformers

import sys
sys.path.append('./')

from videollama2.conversation import conv_templates
from videollama2.constants import DEFAULT_MMODAL_TOKEN, MMODAL_TOKEN_INDEX
from videollama2.mm_utils import get_model_name_from_path, tokenizer_MMODAL_token, process_video, process_image
from videollama2.model.builder import load_pretrained_model

def inference():
    # Video Inference
    paths = ['./datasets/test_data/videos/video_202.mp4']
    questions = ['hidden****']
    # Reply:
    modal_list = ['video']

    # Image Inference
    #paths = ['assets/sora.png']
    #questions = ['What is the woman wearing, what is she doing, and how does the image feel?']
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    #modal_list = ['image']

    # 1. Initialize the model.
    model_path = './checkpoints/VideoLLaMA2-7B-qlora'   #./checkpoints/VideoLLaMA2-7B
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name)  # None
    model = model.to('cuda:0')
    conv_mode = 'llama2'

    # 2. Visual preprocess (load & transform image or video).
    if modal_list[0] == 'video':
        tensor = process_video(paths[0], processor, model.config.image_aspect_ratio).to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["VIDEO"]
        modal_token_index = MMODAL_TOKEN_INDEX["VIDEO"]
    else:
        tensor = process_image(paths[0], processor, model.config.image_aspect_ratio)[0].to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["IMAGE"]
        modal_token_index = MMODAL_TOKEN_INDEX["IMAGE"]
    tensor = [tensor]

    # 3. Text preprocess (tag process & generate prompt).
    question = default_mm_token + "\n" + questions[0]
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_MMODAL_token(prompt, tokenizer, modal_token_index, return_tensors='pt').unsqueeze(0).to('cuda:0')

    # 4. Generate the answer and decode it.
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images_or_videos=tensor,
            modal_list=modal_list,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
        )

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(outputs[0])

if __name__ == "__main__":
    inference()

2: Terminal errors:
(videollama2) lm@SR6430G23:~/videollama2/VideoLLaMA2$ /home/lm/anaconda3/envs/videollama2/bin/python inference.py
200
Loading VideoLLaMA from base model...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:13<00:00, 4.36s/it]
Some weights of Videollama2MistralForCausalLM were not initialized from the model checkpoint at ./checkpoints/Mistral-7B-Instruct-v0.2 and are newly initialized: ['model.mm_projector.readout.0.bias', 'model.mm_projector.readout.0.weight', 'model.mm_projector.readout.2.bias', 'model.mm_projector.readout.2.weight', 'model.mm_projector.s1.b1.conv1.bn.bias', 'model.mm_projector.s1.b1.conv1.bn.weight', 'model.mm_projector.s1.b1.conv1.conv.weight', 'model.mm_projector.s1.b1.conv2.bn.bias', 'model.mm_projector.s1.b1.conv2.bn.weight', 'model.mm_projector.s1.b1.conv2.conv.weight', 'model.mm_projector.s1.b1.conv3.bn.bias', 'model.mm_projector.s1.b1.conv3.bn.weight', 'model.mm_projector.s1.b1.conv3.conv.weight', 'model.mm_projector.s1.b1.downsample.bn.bias', 'model.mm_projector.s1.b1.downsample.bn.weight', 'model.mm_projector.s1.b1.downsample.conv.weight', 'model.mm_projector.s1.b1.se.fc1.bias', 'model.mm_projector.s1.b1.se.fc1.weight', 'model.mm_projector.s1.b1.se.fc2.bias', 'model.mm_projector.s1.b1.se.fc2.weight', 'model.mm_projector.s1.b2.conv1.bn.bias', 'model.mm_projector.s1.b2.conv1.bn.weight', 'model.mm_projector.s1.b2.conv1.conv.weight', 'model.mm_projector.s1.b2.conv2.bn.bias', 'model.mm_projector.s1.b2.conv2.bn.weight', 'model.mm_projector.s1.b2.conv2.conv.weight', 'model.mm_projector.s1.b2.conv3.bn.bias', 'model.mm_projector.s1.b2.conv3.bn.weight', 'model.mm_projector.s1.b2.conv3.conv.weight', 'model.mm_projector.s1.b2.se.fc1.bias', 'model.mm_projector.s1.b2.se.fc1.weight', 'model.mm_projector.s1.b2.se.fc2.bias', 'model.mm_projector.s1.b2.se.fc2.weight', 'model.mm_projector.s1.b3.conv1.bn.bias', 'model.mm_projector.s1.b3.conv1.bn.weight', 'model.mm_projector.s1.b3.conv1.conv.weight', 'model.mm_projector.s1.b3.conv2.bn.bias', 'model.mm_projector.s1.b3.conv2.bn.weight', 'model.mm_projector.s1.b3.conv2.conv.weight', 'model.mm_projector.s1.b3.conv3.bn.bias', 'model.mm_projector.s1.b3.conv3.bn.weight', 'model.mm_projector.s1.b3.conv3.conv.weight', 
'model.mm_projector.s1.b3.se.fc1.bias', 'model.mm_projector.s1.b3.se.fc1.weight', 'model.mm_projector.s1.b3.se.fc2.bias', 'model.mm_projector.s1.b3.se.fc2.weight', 'model.mm_projector.s1.b4.conv1.bn.bias', 'model.mm_projector.s1.b4.conv1.bn.weight', 'model.mm_projector.s1.b4.conv1.conv.weight', 'model.mm_projector.s1.b4.conv2.bn.bias', 'model.mm_projector.s1.b4.conv2.bn.weight', 'model.mm_projector.s1.b4.conv2.conv.weight', 'model.mm_projector.s1.b4.conv3.bn.bias', 'model.mm_projector.s1.b4.conv3.bn.weight', 'model.mm_projector.s1.b4.conv3.conv.weight', 'model.mm_projector.s1.b4.se.fc1.bias', 'model.mm_projector.s1.b4.se.fc1.weight', 'model.mm_projector.s1.b4.se.fc2.bias', 'model.mm_projector.s1.b4.se.fc2.weight', 'model.mm_projector.s2.b1.conv1.bn.bias', 'model.mm_projector.s2.b1.conv1.bn.weight', 'model.mm_projector.s2.b1.conv1.conv.weight', 'model.mm_projector.s2.b1.conv2.bn.bias', 'model.mm_projector.s2.b1.conv2.bn.weight', 'model.mm_projector.s2.b1.conv2.conv.weight', 'model.mm_projector.s2.b1.conv3.bn.bias', 'model.mm_projector.s2.b1.conv3.bn.weight', 'model.mm_projector.s2.b1.conv3.conv.weight', 'model.mm_projector.s2.b1.se.fc1.bias', 'model.mm_projector.s2.b1.se.fc1.weight', 'model.mm_projector.s2.b1.se.fc2.bias', 'model.mm_projector.s2.b1.se.fc2.weight', 'model.mm_projector.s2.b2.conv1.bn.bias', 'model.mm_projector.s2.b2.conv1.bn.weight', 'model.mm_projector.s2.b2.conv1.conv.weight', 'model.mm_projector.s2.b2.conv2.bn.bias', 'model.mm_projector.s2.b2.conv2.bn.weight', 'model.mm_projector.s2.b2.conv2.conv.weight', 'model.mm_projector.s2.b2.conv3.bn.bias', 'model.mm_projector.s2.b2.conv3.bn.weight', 'model.mm_projector.s2.b2.conv3.conv.weight', 'model.mm_projector.s2.b2.se.fc1.bias', 'model.mm_projector.s2.b2.se.fc1.weight', 'model.mm_projector.s2.b2.se.fc2.bias', 'model.mm_projector.s2.b2.se.fc2.weight', 'model.mm_projector.s2.b3.conv1.bn.bias', 'model.mm_projector.s2.b3.conv1.bn.weight', 'model.mm_projector.s2.b3.conv1.conv.weight', 
'model.mm_projector.s2.b3.conv2.bn.bias', 'model.mm_projector.s2.b3.conv2.bn.weight', 'model.mm_projector.s2.b3.conv2.conv.weight', 'model.mm_projector.s2.b3.conv3.bn.bias', 'model.mm_projector.s2.b3.conv3.bn.weight', 'model.mm_projector.s2.b3.conv3.conv.weight', 'model.mm_projector.s2.b3.se.fc1.bias', 'model.mm_projector.s2.b3.se.fc1.weight', 'model.mm_projector.s2.b3.se.fc2.bias', 'model.mm_projector.s2.b3.se.fc2.weight', 'model.mm_projector.s2.b4.conv1.bn.bias', 'model.mm_projector.s2.b4.conv1.bn.weight', 'model.mm_projector.s2.b4.conv1.conv.weight', 'model.mm_projector.s2.b4.conv2.bn.bias', 'model.mm_projector.s2.b4.conv2.bn.weight', 'model.mm_projector.s2.b4.conv2.conv.weight', 'model.mm_projector.s2.b4.conv3.bn.bias', 'model.mm_projector.s2.b4.conv3.bn.weight', 'model.mm_projector.s2.b4.conv3.conv.weight', 'model.mm_projector.s2.b4.se.fc1.bias', 'model.mm_projector.s2.b4.se.fc1.weight', 'model.mm_projector.s2.b4.se.fc2.bias', 'model.mm_projector.s2.b4.se.fc2.weight', 'model.mm_projector.sampler.0.bias', 'model.mm_projector.sampler.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading additional VideoLLaMA weights...
Loading LoRA weights...
Merging LoRA weights...
Model is loaded...
Loading VideoLLaMA 2 from base model...
You are using a model of type mistral to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 166, in <module>
    inference()
  File "inference.py", line 127, in inference
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name) # None
  File "/home/lm/videollama2/VideoLLaMA2/videollama2/model/builder.py", line 140, in load_pretrained_model
    model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 889, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 190, in create_quantized_param
    raise ValueError(
ValueError: Supplied state dict for model.layers.0.mlp.down_proj.weight does not contain bitsandbytes__* and possibly other quantized_stats components.
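
One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the `Unused kwargs ... BitsAndBytesConfig` line suggests that the `config.json` saved by QLoRA training carries a `quantization_config` entry, so `from_pretrained` tries to load the full-precision Mistral base weights as if they were already 4-bit quantized. If that is the cause, loading with a config that lacks this entry may avoid the error. The sketch below writes such a copy without touching the original file; `strip_quantization_config` is a hypothetical helper, not part of VideoLLaMA2 or transformers, and it has not been tested against this repository.

```python
import json
from pathlib import Path


def strip_quantization_config(config_path, output_path=None):
    """Write a copy of a config.json without its 'quantization_config' entry.

    Returns True if the entry was present and removed, False otherwise.
    The original file is left unchanged unless output_path equals config_path.
    """
    src = Path(config_path)
    config = json.loads(src.read_text(encoding="utf-8"))
    removed = config.pop("quantization_config", None) is not None
    # Default output name is an arbitrary choice for this sketch.
    dst = Path(output_path) if output_path else src.with_name(src.stem + ".noquant.json")
    dst.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return removed
```

After inspecting the written copy, you would back up the original `config.json` in the checkpoint directory and swap the copy in before retrying the load.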

thisurawz1 closed this as completed Oct 15, 2024
