Install packages.
```Shell
conda create -n chatunivi python=3.10 -y
conda activate chatunivi
pip install --upgrade pip
pip install -e .
pip install ninja
pip install flash-attn --no-build-isolation
```
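Before launching any training or evaluation, it can help to confirm that the environment is healthy. The snippet below is a minimal sanity check (not part of the official setup) that verifies CUDA and flash-attn are importable:

```python
# sanity_check.py -- minimal environment check (illustrative, not part of the repo)
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # installed via `pip install flash-attn` above
    print("flash-attn:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn not importable:", err)
```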
Download the pre-trained models.
Name | Weight |
---|---|
Chat-UniVi-7B | Download |
Chat-UniVi-13B | Download |
Chat-UniVi-ScienceQA-7B | Download |
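If you prefer scripting the download, the checkpoints can also be fetched with huggingface_hub. This is only a sketch: the repository ID below is a placeholder and should be replaced with the one behind the Download link above.

```python
# download_weights.py -- illustrative download via huggingface_hub
from huggingface_hub import snapshot_download

# "Chat-UniVi/Chat-UniVi-7B" is a placeholder repository ID; substitute the
# actual ID from the Download link in the table above.
local_dir = snapshot_download(repo_id="Chat-UniVi/Chat-UniVi-7B")
print("Model downloaded to:", local_dir)
```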
Prepare the base model.
Base model | Weight |
---|---|
Vicuna-7B | Download |
Vicuna-13B | Download |
Stage 1: multimodal pre-training. This stage tunes only the multimodal projector and saves it to ${stage1 save path}.

```Shell
deepspeed \
--include localhost:0,1,2,3,4,5,6,7 \
--master_port=29602 \
ChatUniVi/train/train_mem.py \
--deepspeed scripts/zero3.json \
--model_name_or_path ${LLM model path} \
--version v1 \
--model_use PRETUNE \
--dataset_use Pretrain \
--vision_tower openai/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ${stage1 save path} \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 24000 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
```
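Stage 1 writes the trained projector to ${stage1 save path}/mm_projector.bin, which stage 2 consumes via --pretrain_mm_mlp_adapter. The sketch below is a quick way to confirm the checkpoint exists and inspect its contents, assuming it is a standard torch-saved state dict (the repo's exact format may differ):

```python
# check_projector.py -- illustrative inspection of the stage-1 output
import torch

# Path corresponds to ${stage1 save path}/mm_projector.bin from the command above.
state = torch.load("mm_projector.bin", map_location="cpu")
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```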
Stage 2: joint instruction tuning, initialized from the stage-1 projector (${stage1 save path}/mm_projector.bin).

```Shell
deepspeed \
--include localhost:0,1,2,3,4,5,6,7 \
--master_port=29601 \
ChatUniVi/train/train_mem.py \
--deepspeed scripts/zero2.json \
--model_name_or_path ${LLM model path} \
--version v1 \
--model_use FINETUNE \
--dataset_use FINETUNE \
--vision_tower openai/clip-vit-large-patch14 \
--pretrain_mm_mlp_adapter ${stage1 save path}/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ${stage2 save path} \
--num_train_epochs 2 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
```
Our quantitative evaluation protocol follows LLaVA: we use 90 questions based on 30 COCO validation images, covering conversation, detailed description, and complex reasoning. For more details, please refer to LLaVA.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_vqa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/coco2014_val_qa_eval/qa90_questions.jsonl \
--image-folder ${image folder} \
--answers-file results/answer-file-vqa.jsonl
OPENAI_API_KEY=[openai api key] \
python ChatUniVi/eval/evaluate/evaluate_gpt_review_visual.py \
--question ChatUniVi/eval/questions/coco2014_val_qa_eval/qa90_questions.jsonl \
--context ChatUniVi/eval/table/caps_boxes_coco2014_val_80.jsonl \
--answer-list ChatUniVi/eval/questions/coco2014_val_qa_eval/qa90_gpt4_answer.jsonl results/answer-file-vqa.jsonl \
--rule ChatUniVi/eval/table/rule.json \
--output results/review-file-vqa.json
python ChatUniVi/eval/evaluate/summarize_gpt_review.py \
-d results/review-file-vqa.json
```
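summarize_gpt_review.py reports the model's performance relative to the GPT-4 reference answers. The sketch below illustrates the aggregation; the assumption that each review record is one JSON object per line with a tuple field holding the two GPT-assigned scores is ours, not a documented format.

```python
# summarize_sketch.py -- illustrative aggregation of GPT review scores
import json

gpt4_scores, model_scores = [], []
with open("results/review-file-vqa.json") as f:
    for line in f:
        review = json.loads(line)
        ref, pred = review["tuple"]  # assumed: [GPT-4 answer score, model answer score]
        gpt4_scores.append(ref)
        model_scores.append(pred)

# Relative score: the model's average as a percentage of the GPT-4 reference average.
print("relative score: %.1f%%" % (100 * sum(model_scores) / sum(gpt4_scores)))
```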
The quantitative evaluation protocol for video understanding follows the methodology introduced by Video-ChatGPT. Specifically, Video-ChatGPT curates a test set based on the ActivityNet-200 dataset, which includes videos with rich, dense descriptive captions and associated question-answer pairs from human annotations. For more details, please refer to Video-ChatGPT.
Note that the raw scores range from 0 to 5; to standardize the metrics, we normalize all scores to a 0-100 scale in the paper.
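For example, converting a raw GPT score to the reported scale is a simple linear rescaling (assuming nothing beyond the 0-5 to 0-100 mapping stated above):

```python
# Normalize a 0-5 GPT evaluation score to the 0-100 scale used in the paper
# (assuming a plain linear rescaling).
def normalize(score: float) -> float:
    return score / 5.0 * 100.0

print(normalize(3.0))  # -> 60.0
```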
```Shell
python ChatUniVi/eval/model_video_general.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/video_qa/generic_qa.json \
--video-folder ${video folder} \
--answers-file results/answer-video-generic.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_video_general.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/video_qa/temporal_qa.json \
--video-folder ${video folder} \
--answers-file results/answer-video-temporal.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_video_consistency.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/video_qa/consistency_qa.json \
--video-folder ${video folder} \
--answers-file results/answer-video-consistency.jsonl
```
```Shell
# Correctness of Information
python ChatUniVi/eval/evaluate/evaluate_benchmark_1_correctness.py \
--pred_path results/answer-video-generic.jsonl \
--output_dir results/correctness \
--output_json results/review-video-correctness.jsonl \
--api_key [openai api key] \
--num_tasks 1
# Detail Orientation
python ChatUniVi/eval/evaluate/evaluate_benchmark_2_detailed_orientation.py \
--pred_path results/answer-video-generic.jsonl \
--output_dir results/detailed_orientation \
--output_json results/review-video-detailed_orientation.jsonl \
--api_key [openai api key] \
--num_tasks 1
# Contextual Understanding
python ChatUniVi/eval/evaluate/evaluate_benchmark_3_context.py \
--pred_path results/answer-video-generic.jsonl \
--output_dir results/context \
--output_json results/review-video-context.jsonl \
--api_key [openai api key] \
--num_tasks 1
# Temporal Understanding
python ChatUniVi/eval/evaluate/evaluate_benchmark_4_temporal.py \
--pred_path results/answer-video-temporal.jsonl \
--output_dir results/temporal \
--output_json results/review-video-temporal.jsonl \
--api_key [openai api key] \
--num_tasks 1
# Consistency
python ChatUniVi/eval/evaluate/evaluate_benchmark_5_consistency.py \
--pred_path results/answer-video-consistency.jsonl \
--output_dir results/consistency \
--output_json results/review-video-consistency.jsonl \
--api_key [openai api key] \
--num_tasks 1
```
ScienceQA is a comprehensive multimodal science question-answering dataset comprising 21k multiple-choice questions. It covers a wide range of domains, spanning 3 subjects, 26 topics, 127 categories, and 379 skills. Each example in ScienceQA contains a visual context, a textual context, a question, multiple options, and the correct answer. For the input of Chat-UniVi, we concatenate the question, textual context, and options sequentially into a single sentence.
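The snippet below illustrates this input construction. It is a sketch only: the field names and the exact template are illustrative, not the repo's preprocessing code.

```python
# Illustrative ScienceQA prompt construction: question, textual context, and
# options concatenated into a single sentence. The template is an assumption.
def build_prompt(question: str, context: str, options: list[str]) -> str:
    letters = "ABCDE"
    option_str = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    parts = [f"Question: {question}"]
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Options: {option_str}")
    return " ".join(parts)

print(build_prompt(
    "Which property do these objects have in common?",
    "Look at each object and think about its material.",
    ["hard", "soft", "fragile"],
))
```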
Fine-tune Chat-UniVi on ScienceQA, starting from the stage-1 projector:

```Shell
deepspeed \
--include localhost:0,1,2,3,4,5,6,7 \
--master_port=29603 ChatUniVi/train/train.py \
--deepspeed scripts/zero.json \
--model_name_or_path ${LLM model path} \
--version v1 \
--model_use FINETUNE \
--dataset_use SQA \
--vision_tower openai/clip-vit-large-patch14 \
--pretrain_mm_mlp_adapter ${stage1 save path}/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ${save path} \
--num_train_epochs 9 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 5000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
```
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_vqa_scienceqa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/scienceqa/test_QCM-LEA.json \
--image-folder ${image folder} \
--answers-file results/answer-scienceqa.jsonl
python ChatUniVi/eval/evaluate/evaluate_science_qa.py \
--base-dir ChatUniVi/eval/questions/scienceqa \
--result-file results/answer-scienceqa.jsonl \
--output-file results/output-scienceqa.json \
--output-result results/output-result-scienceqa.json
```
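evaluate_science_qa.py scores the predictions by comparing the predicted option with the ground truth. The sketch below shows the general idea; the answer template ("The answer is X.") and the toy data are assumptions for illustration, not the script's actual parsing logic.

```python
import re

def extract_choice(answer: str) -> str | None:
    """Pull the predicted option letter out of a generated answer string."""
    m = re.search(r"answer is ([A-E])", answer, flags=re.IGNORECASE)
    return m.group(1).upper() if m else None

predictions = ["The answer is B.", "The answer is D."]  # made-up model outputs
labels = ["B", "A"]                                     # made-up ground-truth letters
accuracy = sum(extract_choice(p) == l for p, l in zip(predictions, labels)) / len(labels)
print(f"accuracy: {accuracy:.2%}")  # -> 50.00%
```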
Our evaluation protocol follows that of Video-ChatGPT, utilizing GPT-assisted evaluation to assess the capabilities of models. For more details, please refer to Video-ChatGPT.
```Shell
# MSRVTT QA
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_video_qa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/video_qa/msrvtt_qa.json \
--video-folder ${video folder} \
--answers-list ChatUniVi/eval/questions/video_qa/msrvtt_a_list.json \
--answers-file results/answer-msrvtt-qa.jsonl
# MSVD QA
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_video_qa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/video_qa/msvd_qa.json \
--video-folder ${video folder} \
--answers-list ChatUniVi/eval/questions/video_qa/msvd_a_list.json \
--answers-file results/answer-msvd-qa.jsonl
# TGIF QA
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_video_qa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/video_qa/tgif_qa.json \
--video-folder ${video folder} \
--answers-list ChatUniVi/eval/questions/video_qa/tgif_a_list.json \
--answers-file results/answer-tgif-qa.jsonl
# ActivityNet QA
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_video_qa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/video_qa/activitynet_qa.json \
--video-folder ${video folder} \
--answers-list ChatUniVi/eval/questions/video_qa/activitynet_a_list.json \
--answers-file results/answer-activitynet-qa.jsonl
```
```Shell
# MSRVTT QA
python ChatUniVi/eval/evaluate/evaluate_video_qa.py \
--pred_path results/answer-msrvtt-qa.jsonl \
--output_dir results/msrvtt-qa \
--output_json results/review-msrvtt-qa.jsonl \
--api_key [openai api key] \
--num_tasks 1
# MSVD QA
python ChatUniVi/eval/evaluate/evaluate_video_qa.py \
--pred_path results/answer-msvd-qa.jsonl \
--output_dir results/msvd-qa \
--output_json results/review-msvd-qa.jsonl \
--api_key [openai api key] \
--num_tasks 1
# TGIF QA
python ChatUniVi/eval/evaluate/evaluate_video_qa.py \
--pred_path results/answer-tgif-qa.jsonl \
--output_dir results/tgif-qa \
--output_json results/review-tgif-qa.jsonl \
--api_key [openai api key] \
--num_tasks 1
# ActivityNet QA
python ChatUniVi/eval/evaluate/evaluate_video_qa.py \
--pred_path results/answer-activitynet-qa.jsonl \
--output_dir results/activitynet-qa \
--output_json results/review-activitynet-qa.jsonl \
--api_key [openai api key] \
--num_tasks 1
```
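Each GPT-assisted review yields a yes/no judgement and a 0-5 score, from which accuracy and an average score are reported. The aggregation sketch below assumes each review record is one JSON object per line with pred and score fields; this layout is an assumption, not the script's documented output.

```python
# Illustrative aggregation of GPT-assisted video QA reviews.
import json

yes = total = 0
score_sum = 0.0
with open("results/review-msrvtt-qa.jsonl") as f:  # or any of the review files above
    for line in f:
        review = json.loads(line)
        yes += int(str(review["pred"]).strip().lower() == "yes")  # assumed field: yes/no judgement
        score_sum += float(review["score"])                       # assumed field: 0-5 score
        total += 1

print(f"accuracy: {yes / total:.2%}")
print(f"average score (0-5): {score_sum / total:.2f}")
```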
To quantitatively evaluate object hallucination, we adopt the Polling-based Object Probing Evaluation (POPE) protocol, which asks binary yes/no questions about whether specific objects are present in an image under three sampling strategies: random, popular, and adversarial.
```Shell
# Random
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_coco_vqa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/coco_pope/coco_pope_random.jsonl \
--image-folder ${image folder} \
--answers-file results/pope-random.jsonl
# Popular
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_coco_vqa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/coco_pope/coco_pope_popular.jsonl \
--image-folder ${image folder} \
--answers-file results/pope-popular.jsonl
# Adversarial
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python ChatUniVi/eval/model_coco_vqa.py \
--model-path ${model path} \
--question-file ChatUniVi/eval/questions/coco_pope/coco_pope_adversarial.jsonl \
--image-folder ${image folder} \
--answers-file results/pope-adversarial.jsonl
```
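POPE questions expect binary yes/no answers, so the three answer files above can be scored with standard accuracy, precision, recall, and F1. A minimal sketch for one split, assuming each question line carries a label field and each prediction line a text field (both field names are assumptions):

```python
# Illustrative POPE scoring for the "random" split.
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

questions = load_jsonl("ChatUniVi/eval/questions/coco_pope/coco_pope_random.jsonl")
answers = load_jsonl("results/pope-random.jsonl")

tp = fp = tn = fn = 0
for q, a in zip(questions, answers):
    label = q["label"].strip().lower()                            # assumed field: "yes" / "no"
    pred = "yes" if "yes" in a["text"].strip().lower() else "no"  # assumed field: model reply
    if pred == "yes" and label == "yes":
        tp += 1
    elif pred == "yes" and label == "no":
        fp += 1
    elif pred == "no" and label == "no":
        tn += 1
    else:
        fn += 1

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"accuracy {accuracy:.4f}  precision {precision:.4f}  recall {recall:.4f}  f1 {f1:.4f}")
```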