how to do the inference with the finetune weights / model #83

Closed
thisurawz1 opened this issue Aug 29, 2024 · 11 comments

Comments

@thisurawz1

thisurawz1 commented Aug 29, 2024

[Screenshot: files produced by fine-tuning]
I have already fine-tuned VideoLLaMA2 on a custom dataset using QLoRA. After fine-tuning I got the files shown above. How can I run inference with those weights? How can I use the fine-tuned model with the inference script you provided?

Looking forward to a solution as soon as possible. Thank you.

```python
import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference (these assignments override the video settings above; keep only the pair you need)
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()
```

@clownrat6
Member

Yes, you can. The latest commit supports loading a LoRA model directly.

@thisurawz1
Author

Can you share the script for it, please? Do we just have to change the current model path to the LoRA path? I tried that, but it didn't work at all.

@thisurawz1
Author

Can you please share the exact script for running inference with the LoRA weights?

@thisurawz1
Author

thisurawz1 commented Sep 11, 2024

@clownrat6 Can you share the script showing how to load the LoRA model directly? I have already finished fine-tuning and got the files shown below, but I don't know how to run inference with them.

[Two screenshots of the fine-tuning output files]
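Before pointing the inference script at a fine-tuning output directory, it can help to confirm what the directory actually contains. The following is a minimal sketch, not part of the VideoLLaMA2 repository. The helper name `inspect_lora_dir` is hypothetical, and the file names are assumptions: `adapter_config.json` and `adapter_model.safetensors` (or `adapter_model.bin`) are what PEFT normally writes, while `non_lora_trainables.bin` follows the LLaVA-style convention and may not match every commit.

```python
import json
from pathlib import Path

# Assumed file names (see the note above); adjust them to match your output directory.
ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")
EXTRA_WEIGHTS = "non_lora_trainables.bin"


def inspect_lora_dir(checkpoint_dir):
    """Summarize which LoRA-related files a fine-tuning output directory contains."""
    root = Path(checkpoint_dir)
    summary = {
        "has_adapter_config": (root / ADAPTER_CONFIG).is_file(),
        "has_adapter_weights": any((root / name).is_file() for name in ADAPTER_WEIGHTS),
        "has_extra_weights": (root / EXTRA_WEIGHTS).is_file(),
        "base_model": None,
    }
    if summary["has_adapter_config"]:
        # PEFT records the base model the adapter was trained on under this key.
        with open(root / ADAPTER_CONFIG, encoding="utf-8") as f:
            summary["base_model"] = json.load(f).get("base_model_name_or_path")
    return summary
```

Calling `inspect_lora_dir('work_dirs/videollama2/finetune_downstream_sft_settings_qlora')` returns a small dictionary. If `has_adapter_config` is false, the directory is probably not a LoRA checkpoint, and changing `model_path` alone is unlikely to be enough.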

@LiangMeng89

@thisurawz1 Hello! I have the same problem. Have you solved it?

@ffcarina

@thisurawz1 I successfully loaded the LoRA fine-tuned model for inference with the following code. Hope this helps.

from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

modal = 'video'
modal_path = 'VideoLLaMA2/videollama2/serve/examples/sample_demo_1.mp4'
instruct = 'What is the baby wearing and what is he doing?'
model_path = 'VideoLLaMA2/work_dirs/videollama2/finetune_downstream_sft_settings_qlora_MESC' # your model dir

model, processor, tokenizer = model_init(model_path)
output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
print(output)

@thisurawz1
Author

@LiangMeng89 Yes. You have to update the VideoLLaMA2 repository to the latest commit, then use the following script. You only need to change the model path in the original inference script. That's all.

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init


def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.

    # Image Inference (these assignments override the video settings above; keep only the pair you need)
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.

    # Released model:
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Fine-tuned model inference (only need to replace model_path with your fine-tuned weights directory)
    model_path = 'work_dirs/videollama2/finetune_downstream_sft_settings_qlora'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()
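Since the only difference between the released model and a fine-tuned one is the value of `model_path`, one option is to pass it on the command line instead of editing the script each time. The sketch below assumes the `model_init` and `mm_infer` calls behave as in the script above. The function names `parse_args` and `run` and the flag names are hypothetical, chosen here for illustration only.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; every flag name here is illustrative."""
    parser = argparse.ArgumentParser(description="VideoLLaMA2 inference (sketch)")
    parser.add_argument("--model-path", default="DAMO-NLP-SG/VideoLLaMA2-7B",
                        help="Hub ID or local fine-tuned (LoRA/QLoRA) output directory")
    parser.add_argument("--modal", choices=["video", "image"], default="video")
    parser.add_argument("--modal-path", default="assets/cat_and_chicken.mp4")
    parser.add_argument("--instruct", default="What animals are in the video?")
    return parser.parse_args(argv)


def run(args):
    """Load the model named by args.model_path and answer args.instruct."""
    # Imported lazily so that argument parsing works even when the
    # VideoLLaMA2 repository is not on sys.path.
    from videollama2 import model_init, mm_infer
    from videollama2.utils import disable_torch_init

    disable_torch_init()
    model, processor, tokenizer = model_init(args.model_path)
    return mm_infer(processor[args.modal](args.modal_path), args.instruct,
                    model=model, tokenizer=tokenizer, do_sample=False, modal=args.modal)
```

From the repository root, `print(run(parse_args()))` at the end of the file would then let you switch checkpoints with `--model-path work_dirs/videollama2/finetune_downstream_sft_settings_qlora`.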

@thisurawz1
Author

@ffcarina Thank you so much.

@LiangMeng89

@thisurawz1 Thank you, I will try this.

@LiangMeng89

@clownrat6 Dear author, I applied the LoRA checkpoint folder structure and the loading example code from #36 to my QLoRA fine-tuned inference code, running on my own experimental video data, but it still fails with errors. The old inference code from the README works. I only inserted your loading code into my script. Please help me!

1: My QLoRA fine-tuned inference code:

import torch
import transformers

import sys
sys.path.append('./')

from videollama2.conversation import conv_templates
from videollama2.constants import DEFAULT_MMODAL_TOKEN, MMODAL_TOKEN_INDEX
from videollama2.mm_utils import get_model_name_from_path, tokenizer_MMODAL_token, process_video, process_image
from videollama2.model.builder import load_pretrained_model

def inference():
    # Video Inference
    paths = ['./datasets/test_data/videos/video_202.mp4']
    questions = ['hidden****']
    # Reply:
    modal_list = ['video']

    # Image Inference
    #paths = ['assets/sora.png']
    #questions = ['What is the woman wearing, what is she doing, and how does the image feel?']
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    #modal_list = ['image']

    # 1. Initialize the model.
    model_path = './checkpoints/VideoLLaMA2-7B-qlora'   #./checkpoints/VideoLLaMA2-7B
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name)  # None
    model = model.to('cuda:0')
    conv_mode = 'llama2'

    # 2. Visual preprocess (load & transform image or video).
    if modal_list[0] == 'video':
        tensor = process_video(paths[0], processor, model.config.image_aspect_ratio).to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["VIDEO"]
        modal_token_index = MMODAL_TOKEN_INDEX["VIDEO"]
    else:
        tensor = process_image(paths[0], processor, model.config.image_aspect_ratio)[0].to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["IMAGE"]
        modal_token_index = MMODAL_TOKEN_INDEX["IMAGE"]
    tensor = [tensor]

    # 3. text preprocess (tag process & generate prompt).
    question = default_mm_token + "\n" + questions[0]
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_MMODAL_token(prompt, tokenizer, modal_token_index, return_tensors='pt').unsqueeze(0).to('cuda:0')

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images_or_videos=tensor,
            modal_list=modal_list,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
        )

    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(outputs[0])

if __name__ == "__main__":
    inference()

2: Terminal errors:
(videollama2) lm@SR6430G23:~/videollama2/VideoLLaMA2$ /home/lm/anaconda3/envs/videollama2/bin/python inference.py
200
Loading VideoLLaMA from base model...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:13<00:00, 4.36s/it]
Some weights of Videollama2MistralForCausalLM were not initialized from the model checkpoint at ./checkpoints/Mistral-7B-Instruct-v0.2 and are newly initialized: ['model.mm_projector.readout.0.bias', 'model.mm_projector.readout.0.weight', 'model.mm_projector.readout.2.bias', 'model.mm_projector.readout.2.weight', 'model.mm_projector.s1.b1.conv1.bn.bias', 'model.mm_projector.s1.b1.conv1.bn.weight', 'model.mm_projector.s1.b1.conv1.conv.weight', 'model.mm_projector.s1.b1.conv2.bn.bias', 'model.mm_projector.s1.b1.conv2.bn.weight', 'model.mm_projector.s1.b1.conv2.conv.weight', 'model.mm_projector.s1.b1.conv3.bn.bias', 'model.mm_projector.s1.b1.conv3.bn.weight', 'model.mm_projector.s1.b1.conv3.conv.weight', 'model.mm_projector.s1.b1.downsample.bn.bias', 'model.mm_projector.s1.b1.downsample.bn.weight', 'model.mm_projector.s1.b1.downsample.conv.weight', 'model.mm_projector.s1.b1.se.fc1.bias', 'model.mm_projector.s1.b1.se.fc1.weight', 'model.mm_projector.s1.b1.se.fc2.bias', 'model.mm_projector.s1.b1.se.fc2.weight', 'model.mm_projector.s1.b2.conv1.bn.bias', 'model.mm_projector.s1.b2.conv1.bn.weight', 'model.mm_projector.s1.b2.conv1.conv.weight', 'model.mm_projector.s1.b2.conv2.bn.bias', 'model.mm_projector.s1.b2.conv2.bn.weight', 'model.mm_projector.s1.b2.conv2.conv.weight', 'model.mm_projector.s1.b2.conv3.bn.bias', 'model.mm_projector.s1.b2.conv3.bn.weight', 'model.mm_projector.s1.b2.conv3.conv.weight', 'model.mm_projector.s1.b2.se.fc1.bias', 'model.mm_projector.s1.b2.se.fc1.weight', 'model.mm_projector.s1.b2.se.fc2.bias', 'model.mm_projector.s1.b2.se.fc2.weight', 'model.mm_projector.s1.b3.conv1.bn.bias', 'model.mm_projector.s1.b3.conv1.bn.weight', 'model.mm_projector.s1.b3.conv1.conv.weight', 'model.mm_projector.s1.b3.conv2.bn.bias', 'model.mm_projector.s1.b3.conv2.bn.weight', 'model.mm_projector.s1.b3.conv2.conv.weight', 'model.mm_projector.s1.b3.conv3.bn.bias', 'model.mm_projector.s1.b3.conv3.bn.weight', 'model.mm_projector.s1.b3.conv3.conv.weight', 
'model.mm_projector.s1.b3.se.fc1.bias', 'model.mm_projector.s1.b3.se.fc1.weight', 'model.mm_projector.s1.b3.se.fc2.bias', 'model.mm_projector.s1.b3.se.fc2.weight', 'model.mm_projector.s1.b4.conv1.bn.bias', 'model.mm_projector.s1.b4.conv1.bn.weight', 'model.mm_projector.s1.b4.conv1.conv.weight', 'model.mm_projector.s1.b4.conv2.bn.bias', 'model.mm_projector.s1.b4.conv2.bn.weight', 'model.mm_projector.s1.b4.conv2.conv.weight', 'model.mm_projector.s1.b4.conv3.bn.bias', 'model.mm_projector.s1.b4.conv3.bn.weight', 'model.mm_projector.s1.b4.conv3.conv.weight', 'model.mm_projector.s1.b4.se.fc1.bias', 'model.mm_projector.s1.b4.se.fc1.weight', 'model.mm_projector.s1.b4.se.fc2.bias', 'model.mm_projector.s1.b4.se.fc2.weight', 'model.mm_projector.s2.b1.conv1.bn.bias', 'model.mm_projector.s2.b1.conv1.bn.weight', 'model.mm_projector.s2.b1.conv1.conv.weight', 'model.mm_projector.s2.b1.conv2.bn.bias', 'model.mm_projector.s2.b1.conv2.bn.weight', 'model.mm_projector.s2.b1.conv2.conv.weight', 'model.mm_projector.s2.b1.conv3.bn.bias', 'model.mm_projector.s2.b1.conv3.bn.weight', 'model.mm_projector.s2.b1.conv3.conv.weight', 'model.mm_projector.s2.b1.se.fc1.bias', 'model.mm_projector.s2.b1.se.fc1.weight', 'model.mm_projector.s2.b1.se.fc2.bias', 'model.mm_projector.s2.b1.se.fc2.weight', 'model.mm_projector.s2.b2.conv1.bn.bias', 'model.mm_projector.s2.b2.conv1.bn.weight', 'model.mm_projector.s2.b2.conv1.conv.weight', 'model.mm_projector.s2.b2.conv2.bn.bias', 'model.mm_projector.s2.b2.conv2.bn.weight', 'model.mm_projector.s2.b2.conv2.conv.weight', 'model.mm_projector.s2.b2.conv3.bn.bias', 'model.mm_projector.s2.b2.conv3.bn.weight', 'model.mm_projector.s2.b2.conv3.conv.weight', 'model.mm_projector.s2.b2.se.fc1.bias', 'model.mm_projector.s2.b2.se.fc1.weight', 'model.mm_projector.s2.b2.se.fc2.bias', 'model.mm_projector.s2.b2.se.fc2.weight', 'model.mm_projector.s2.b3.conv1.bn.bias', 'model.mm_projector.s2.b3.conv1.bn.weight', 'model.mm_projector.s2.b3.conv1.conv.weight', 
'model.mm_projector.s2.b3.conv2.bn.bias', 'model.mm_projector.s2.b3.conv2.bn.weight', 'model.mm_projector.s2.b3.conv2.conv.weight', 'model.mm_projector.s2.b3.conv3.bn.bias', 'model.mm_projector.s2.b3.conv3.bn.weight', 'model.mm_projector.s2.b3.conv3.conv.weight', 'model.mm_projector.s2.b3.se.fc1.bias', 'model.mm_projector.s2.b3.se.fc1.weight', 'model.mm_projector.s2.b3.se.fc2.bias', 'model.mm_projector.s2.b3.se.fc2.weight', 'model.mm_projector.s2.b4.conv1.bn.bias', 'model.mm_projector.s2.b4.conv1.bn.weight', 'model.mm_projector.s2.b4.conv1.conv.weight', 'model.mm_projector.s2.b4.conv2.bn.bias', 'model.mm_projector.s2.b4.conv2.bn.weight', 'model.mm_projector.s2.b4.conv2.conv.weight', 'model.mm_projector.s2.b4.conv3.bn.bias', 'model.mm_projector.s2.b4.conv3.bn.weight', 'model.mm_projector.s2.b4.conv3.conv.weight', 'model.mm_projector.s2.b4.se.fc1.bias', 'model.mm_projector.s2.b4.se.fc1.weight', 'model.mm_projector.s2.b4.se.fc2.bias', 'model.mm_projector.s2.b4.se.fc2.weight', 'model.mm_projector.sampler.0.bias', 'model.mm_projector.sampler.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading additional VideoLLaMA weights...
Loading LoRA weights...
Merging LoRA weights...
Model is loaded...
Loading VideoLLaMA 2 from base model...
You are using a model of type mistral to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 166, in <module>
    inference()
  File "inference.py", line 127, in inference
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name) # None
  File "/home/lm/videollama2/VideoLLaMA2/videollama2/model/builder.py", line 140, in load_pretrained_model
    model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 889, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
  File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 190, in create_quantized_param
    raise ValueError(
ValueError: Supplied state dict for model.layers.0.mlp.down_proj.weight does not contain bitsandbytes__* and possibly other quantized_stats components.
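
The "Unused kwargs" warning together with the bitsandbytes `ValueError` suggests, although the log alone does not prove it, that the config loaded from the QLoRA checkpoint carries quantization settings while the base Mistral weights on disk are unquantized. One way to check that guess is to look for a `quantization_config` entry in the checkpoint's `config.json`. This is a minimal diagnostic sketch under that assumption. The helper name `find_quantization_config` is hypothetical and it only reads the file.

```python
import json
from pathlib import Path


def find_quantization_config(checkpoint_dir):
    """Return the quantization_config stored in config.json, or None if absent."""
    config_file = Path(checkpoint_dir) / "config.json"
    if not config_file.is_file():
        return None
    with open(config_file, encoding="utf-8") as f:
        # Transformers usually saves bitsandbytes settings under this key
        # when a model was loaded in 4-bit or 8-bit mode.
        return json.load(f).get("quantization_config")
```

If `find_quantization_config('./checkpoints/VideoLLaMA2-7B-qlora')` returns a dictionary, the saved config is a plausible source of the quantizer seen in the traceback. Whether removing that entry is safe depends on the repository version, so treat the result as a lead for debugging, not a fix.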
