- We follow LLaVA's evaluation instructions. You MUST first download `eval.zip`, which contains custom annotations, scripts, and the prediction files for LLaVA v1.5. Extract it to `eval`; this also provides the general directory structure for all datasets (a download-and-extract sketch follows the tree below).
- After downloading all of the datasets, organize the data as follows in `eval`.
```
eval
├── gqa
│   ├── answers
│   ├── data
│   └── llava_gqa_testdev_balanced.jsonl
├── llava-bench-in-the-wild
│   ├── answers
│   ├── answers_gpt4.jsonl
│   ├── bard_0718.jsonl
│   ├── bing_chat_0629.jsonl
│   ├── context.jsonl
│   ├── images
│   ├── questions.jsonl
│   ├── README.md
│   └── reviews
├── mmbench
│   ├── answers
│   ├── answers_upload
│   ├── mmbench_dev_20230712.tsv
│   └── mmbench_dev_cn_20231003.tsv
├── MME
│   ├── answers
│   ├── convert_answer_to_mme.py
│   └── llava_mme.jsonl
├── mm-vet
│   ├── answers
│   ├── bard_set.json
│   ├── convert_answers.py
│   ├── images
│   ├── llava-mm-vet.jsonl
│   ├── mm-vet.json
│   └── results
├── pope
│   ├── answers
│   ├── coco
│   ├── llava_pope_test.jsonl
│   └── val2014
├── scienceqa
│   ├── answers
│   ├── images
│   ├── llava_test_CQM-A.json
│   ├── pid_splits.json
│   └── problems.json
├── seed_bench
│   ├── answers
│   ├── answers_upload
│   ├── extract_video_frames.py
│   └── llava-seed-bench.jsonl
├── textvqa
│   ├── answers
│   ├── llava_textvqa_val_v051_ocr.jsonl
│   ├── TextVQA_0.5.1_val.json
│   └── train_images
├── vizwiz
│   ├── answers
│   ├── answers_upload
│   ├── llava_test.jsonl
│   ├── test
│   ├── test.json
│   ├── train.json
│   └── val.json
└── vqav2
    ├── answers
    ├── answers_upload
    ├── llava_vqav2_mscoco_test2015.jsonl
    ├── llava_vqav2_mscoco_test-dev2015.jsonl
    └── test2015
```
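A minimal extraction sketch, assuming `eval.zip` has already been downloaded to the repository root and contains the top-level `eval/` folder (adjust the paths if your archive layout differs):

```Shell
# Assumes eval.zip sits in the current directory and contains eval/ itself.
unzip eval.zip
ls eval   # should match the tree above: gqa, llava-bench-in-the-wild, mmbench, ...
```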
Our image validation code comes from LLaVA; thanks for their contribution! You can refer to the official repository for validation, but we also provide off-the-shelf scripts.
### VQAv2

- Download `test2015` and put it under `eval/vqav2` (a download sketch follows this section).
- Multi-GPU inference.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1/eval/llava/vqav2.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/vqav2.sh
```
- Submit the results to the evaluation server: `eval/vqav2/answers_upload`.
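For the first step, one possible way to fetch the COCO `test2015` images; the URL is the usual COCO mirror and an assumption here, not something this repo pins down:

```Shell
# Assumed standard COCO hosting for the test2015 images (large download).
wget http://images.cocodataset.org/zips/test2015.zip
unzip test2015.zip -d eval/vqav2   # yields eval/vqav2/test2015
```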
### GQA

- Download the data following the official GQA instructions and put it under `eval/gqa/data` (a hedged sketch follows this section).
- Multi-GPU inference.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1/eval/llava/gqa.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/gqa.sh
```
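A minimal sketch for the first step. The URLs below are the mirrors the GQA download page has historically served and are assumptions here, so check https://cs.stanford.edu/people/dorarad/gqa/download.html first:

```Shell
mkdir -p eval/gqa/data && cd eval/gqa/data
# Assumed GQA mirrors; the images archive is a very large download.
wget https://downloads.cs.stanford.edu/nlp/data/gqa/questions1.2.zip
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
unzip questions1.2.zip && unzip images.zip
```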
### VizWiz

- Download `test.json` and the test images, and put them under `eval/vizwiz` as shown in the tree above (a hedged download sketch follows this section).
- Single-GPU inference.

LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/vizwiz.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/vizwiz.sh
```
- Submit the results to the evaluation server: `eval/vizwiz/answers_upload`.
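A hedged sketch for fetching the VizWiz-VQA test split; the URLs are assumptions based on the VizWiz challenge site, so verify them at https://vizwiz.org before downloading:

```Shell
mkdir -p eval/vizwiz && cd eval/vizwiz
# Assumed VizWiz-VQA hosting; verify on the official site first.
wget https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip
wget https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip
unzip test.zip -d test      # adjust -d if the archive already contains test/
unzip Annotations.zip       # provides test.json, train.json, val.json
```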
### ScienceQA

- Under `eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo (a fetch sketch follows this section).
- Single-GPU inference and evaluate.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/sqa.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/sqa.sh
```
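A minimal sketch for the metadata; the JSON files live in the official ScienceQA GitHub repo, while the images are hosted separately (follow that repo's README for them):

```Shell
# Fetch pid_splits.json and problems.json from the official ScienceQA repo.
git clone https://github.com/lupantech/ScienceQA.git
cp ScienceQA/data/scienceqa/pid_splits.json ScienceQA/data/scienceqa/problems.json eval/scienceqa/
# The images/ folder is distributed separately; see the ScienceQA README.
```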
### TextVQA

- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `eval/textvqa` (a download sketch follows this section).
- Single-GPU inference and evaluate.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/textvqa.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/textvqa.sh
```
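A sketch using the TextVQA download links that LLaVA's evaluation docs point to; treat them as assumptions and verify on the TextVQA site:

```Shell
cd eval/textvqa
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip   # yields train_images/
```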
### POPE

- Download `coco` from POPE and put it under `eval/pope` (a fetch sketch follows this section).
- Single-GPU inference and evaluate.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/pope.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/pope.sh
```
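A hedged sketch: the `coco` annotation JSONs sit in the POPE GitHub repo, and the `val2014` images in the tree above are the standard COCO validation set; both locations are assumptions here:

```Shell
# POPE annotations (the coco/ folder lives under output/ in the POPE repo).
git clone https://github.com/AoiDragon/POPE.git
cp -r POPE/output/coco eval/pope/coco
# COCO val2014 images, assumed from the usual COCO mirror.
wget http://images.cocodataset.org/zips/val2014.zip
unzip val2014.zip -d eval/pope
```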
### MME

- Download the data following the official instructions.
- Download the images to `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `eval/MME` (a layout check follows this section).
- Single-GPU inference and evaluate.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mme.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/mme.sh
```
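A quick sanity check of the expected layout (the two official folders plus the files from the tree above):

```Shell
ls eval/MME
# Expected at minimum: answers  convert_answer_to_mme.py  eval_tool
#                      llava_mme.jsonl  MME_Benchmark_release_version
```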
### MMBench

- Download `mmbench_dev_20230712.tsv` and put it under `eval/mmbench` (a download sketch follows this section).
- Single-GPU inference.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mmbench.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/mmbench.sh
```
- Submit the results to the evaluation server: `eval/mmbench/answers_upload/mmbench_dev_20230712`.
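The dev TSV has historically been served from the OpenMMLab download host; the URL is an assumption, so verify it on the MMBench site:

```Shell
# Assumed OpenMMLab mirror for the MMBench dev split.
wget https://download.openmmlab.com/mmbench/mmbench_dev_20230712.tsv -P eval/mmbench
```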
### MMBench-CN

- Download `mmbench_dev_cn_20231003.tsv` and put it under `eval/mmbench` (a download sketch follows this section).
- Single-GPU inference.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mmbench_cn.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/mmbench_cn.sh
```
- Submit the results to the evaluation server: `eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
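As with the English split, an assumed OpenMMLab mirror (verify before use):

```Shell
wget https://download.openmmlab.com/mmbench/mmbench_dev_cn_20231003.tsv -P eval/mmbench
```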
### SEED-Bench

- Download the images and the videos following the official instructions. Put the images under `eval/seed_bench/SEED-Bench-image`.
- Extract the middle frame from each downloaded video and put the frames under `eval/seed_bench/SEED-Bench-video-image` (the repo ships `eval/seed_bench/extract_video_frames.py` for this; a minimal ffmpeg sketch also follows this section).
- Multi-GPU inference and evaluate.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1/eval/llava/seed.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/seed.sh
```
- Optionally, submit the results to the leaderboard: `eval/seed_bench/answers_upload`, using the official Jupyter notebook.
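The supported path for frame extraction is the repo's `extract_video_frames.py`; purely as a hedged alternative, a minimal ffmpeg/ffprobe sketch (the `SEED-Bench-video/` source directory name is an assumption):

```Shell
# Grab the middle frame of every downloaded video.
# Assumption: videos sit in SEED-Bench-video/ as .mp4 files.
mkdir -p eval/seed_bench/SEED-Bench-video-image
for video in SEED-Bench-video/*.mp4; do
  name=$(basename "$video" .mp4)
  # Total duration in seconds, then seek to the midpoint and grab one frame.
  duration=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$video")
  midpoint=$(echo "$duration / 2" | bc -l)
  ffmpeg -y -loglevel error -ss "$midpoint" -i "$video" -frames:v 1 \
    "eval/seed_bench/SEED-Bench-video-image/${name}.png"
done
```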
### LLaVA-Bench-in-the-Wild

- Extract the contents of `llava-bench-in-the-wild` to `eval/llava-bench-in-the-wild` (a fetch sketch follows this section).
- Single-GPU inference and evaluate.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/llavabench.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/llavabench.sh
```
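If you still need the files, the benchmark is distributed as a Hugging Face dataset repo (`liuhaotian/llava-bench-in-the-wild`); cloning it directly is one hedged option (git-lfs is required for the images):

```Shell
# Clone the dataset repo into the expected location.
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild \
  eval/llava-bench-in-the-wild
```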
### MM-Vet

- Extract `mm-vet.zip` to `eval/mm-vet` (a download sketch follows this section).
- Single-GPU inference.
LLaVA-based model

```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1/eval/llava/mmvet.sh
```

MoE-based model

```Shell
bash scripts/v1/eval/moe_llava/mmvet.sh
```
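A hedged fetch sketch; the release URL below is the one LLaVA's evaluation docs have pointed to and is an assumption here, so verify it against the MM-Vet repo:

```Shell
wget https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip
# Adjust -d if the archive already contains a top-level mm-vet/ folder.
unzip mm-vet.zip -d eval/mm-vet
```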