diff --git a/README.md b/README.md index 640452eb..da46b700 100644 --- a/README.md +++ b/README.md @@ -2,9 +2,16 @@ # image InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT-4o -[\[🔥 Mini-InternVL\]](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/mini_internvl) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🤔 FAQs\]](https://internvl.readthedocs.io/en/latest/tutorials/faqs.html) [\[🚀 InternVL2 Blog\]](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[📖 Document\]](https://internvl.readthedocs.io/en/latest/) [\[🌐 API\]](https://internvl.readthedocs.io/en/latest/get_started/internvl_chat_api.html) [\[🚀 Quick Start\]](#quick-start-with-huggingface) +
+ image +
+
+ +[\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🤔 FAQs\]](https://internvl.readthedocs.io/en/latest/tutorials/faqs.html) [\[🚀 InternVL2 Blog\]](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[📖 Document\]](https://internvl.readthedocs.io/en/latest/) [\[🌐 API\]](https://internvl.readthedocs.io/en/latest/get_started/internvl_chat_api.html) [\[🚀 Quick Start\]](#quick-start-with-huggingface) -[\[🔥 Mini-InternVL Report\]](https://arxiv.org/abs/2410.16261) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📖 2.0 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 1.5 中文解读\]](https://zhuanlan.zhihu.com/p/699439759) [\[📖 1.0 中文解读\]](https://zhuanlan.zhihu.com/p/702946079) +[\[🔥 Mini-InternVL Report\]](https://arxiv.org/abs/2410.16261) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) + +[\[📖 2.0 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 1.5 中文解读\]](https://zhuanlan.zhihu.com/p/699439759) [\[📖 1.0 中文解读\]](https://zhuanlan.zhihu.com/p/702946079) [Switch to the Chinese version (切换至中文版)](/README_zh.md) @@ -16,9 +23,10 @@ ## News 🚀🚀🚀 -- `2024/10/21`: We release the Mini-InternVL series, which includes three chat models: __Mini-InternVL-1B__, __Mini-InternVL-2B__ and __Mini-InternVL-4B__. These models achieve impressive performance with minimal size: the 4B model achieves 90% of the performance with just 5% of the model size. For more details, please check our [Project page](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/mini_internvl) and [Document](https://internvl.readthedocs.io/en/latest/internvl2.0/domain_adaptation.html). + +- `2024/10/21`: We release the Mini-InternVL series. These models achieve impressive performance with minimal size: the 4B model achieves 90% of the performance with just 5% of the model size. For more details, please check our [project page](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/mini_internvl) and [document](https://internvl.readthedocs.io/en/latest/internvl2.0/domain_adaptation.html). - `2024/08/01`: The [Chartmimic](https://chartmimic.github.io/) team evaluated the InternVL2 series models on their benchmark. The InternVL2-26B and 76B models achieved the top two performances among open-source models, with the InternVL2 76B model surpassing GeminiProVision and exhibiting comparable results to Claude-3-opus. -- `2024/08/01`: InternVL2-Pro achieved the SOTA performance among open-source models on the [CharXiv](https://charxiv.github.io/#leaderboard) dataset, surpassing some well-known closed-source models such as GPT-4V, Gemini 1.5 Flash, and Claude 3 Sonnet. +- `2024/08/01`: InternVL2-Pro achieved the SOTA performance among open-source models on the [CharXiv](https://charxiv.github.io/#leaderboard) dataset, surpassing many closed-source models such as GPT-4V, Gemini 1.5 Flash, and Claude 3 Sonnet. - `2024/07/24`: The [MLVU](https://github.com/JUNJIE99/MLVU) team evaluated InternVL-1.5 on their benchmark. The average performance on the multiple-choice task was 50.4%, while the performance on the generative tasks was 4.02. The performance on the multiple-choice task ranked #1 among all open-source MLLMs. 
- `2024/07/18`: 🔥🔥 InternVL2-40B achieved SOTA performance among open-source models on the [Video-MME](https://github.com/BradyFU/Video-MME) dataset, scoring 61.2 when inputting 16 frames and 64.4 when inputting 32 frames. It significantly outperforms other open-source models and is the closest open-source model to GPT-4o mini. - `2024/07/18`: 🔥 InternVL2-Pro achieved the SOTA performance on the [DocVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=1) and [InfoVQA](https://rrc.cvc.uab.es/?ch=17&com=evaluation&task=3) benchmarks. @@ -29,7 +37,6 @@ - `2024/05/13`: InternVL 1.0 can now be used as the [text encoder](https://huggingface.co/OpenGVLab/InternVL-14B-224px) for diffusion models to support multilingual generation natively in over 110 languages worldwide. See [MuLan](https://github.com/mulanai/MuLan) for more details. - `2024/04/18`: InternVL-Chat-V1-5 has been released at [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5), approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. - `2024/02/27`: InternVL is accepted by CVPR 2024 (Oral)! 🎉 -- `2024/02/24`: InternVL-Chat models have been included in the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). - `2024/02/21`: [InternVL-Chat-V1-2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our [blog](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/) for more details. - `2024/02/12`: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/) and [SFT data](./internvl_chat#prepare-training-datasets). The model is now available on [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2), and both training / evaluation data and scripts are open-sourced. - `2024/01/24`: InternVL-Chat-V1-1 is released, it supports Chinese and has stronger OCR capability, see [here](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1). diff --git a/README_zh.md b/README_zh.md index ea2eb817..1ecb3257 100644 --- a/README_zh.md +++ b/README_zh.md @@ -2,9 +2,16 @@ # image InternVL家族:通过开源组件缩小与商业多模态模型的差距 —— GPT-4o的开源替代方案 +
+ image +
+
+ [\[🆕 博客\]](https://internvl.github.io/blog/) [\[🤔 常见问题\]](https://internvl.readthedocs.io/en/latest/tutorials/faqs.html) [\[🚀 InternVL2 博客\]](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) [\[🗨️ 对话Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[📖 文档\]](https://internvl.readthedocs.io/en/latest/) [\[🌐 API\]](https://internvl.readthedocs.io/en/latest/get_started/internvl_chat_api.html) [\[🚀 快速开始\]](#使用-huggingface-快速开始) -[\[📜 InternVL 1.0 论文\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 报告\]](https://arxiv.org/abs/2404.16821) [\[📖 1.0 中文解读\]](https://zhuanlan.zhihu.com/p/702946079) [\[📖 1.5 中文解读\]](https://zhuanlan.zhihu.com/p/699439759) [\[📖 2.0 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) +[\[🔥 Mini-InternVL 报告\]](https://arxiv.org/abs/2410.16261) [\[📜 InternVL 1.5 报告\]](https://arxiv.org/abs/2404.16821) [\[📜 InternVL 1.0 论文\]](https://arxiv.org/abs/2312.14238) + +[\[📖 2.0 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 1.5 中文解读\]](https://zhuanlan.zhihu.com/p/699439759) [\[📖 1.0 中文解读\]](https://zhuanlan.zhihu.com/p/702946079) [Switch to the English version (切换至英文版)](/README.md) @@ -17,6 +24,7 @@ ## 最新消息 🚀🚀🚀 +- `2024/10/21`: 我们发布了 Mini-InternVL 系列。这些模型在保持极小模型体积的同时实现了出色的性能:4B 模型仅用 5% 的模型大小便达到了 90% 的性能。有关更多详细信息,请查看我们的 [项目页面](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/mini_internvl) 和 [文档](https://internvl.readthedocs.io/en/latest/internvl2.0/domain_adaptation.html)。 - `2024/08/01`: [Chartmimic](https://chartmimic.github.io/) 团队在他们的基准测试中评估了 InternVL2 系列模型。InternVL2-26B 和 76B 模型在开源模型中取得了前两名的成绩,其中 InternVL2-Llama3-76B 模型超过了 GeminiProVision,并表现出与 Claude-3-opus 相当的结果。 - `2024/08/01`: InternVL2-Pro 在 [CharXiv](https://charxiv.github.io/#leaderboard) 数据集中实现了开源模型中的 SOTA 性能,也比部分知名闭源模型如 GPT-4V、Gemini 1.5 Flash、Claude 3 Sonnet 取得了更好成绩 - `2024/07/24`: [MLVU](https://github.com/JUNJIE99/MLVU)团队在它们的基准测试中评估了InternVL-1.5。在多项选择任务上的平均表现为50.4%,而在生成任务上的表现为4.02。多项选择任务的表现在所有开源多模态大语言模型中排名第一。 @@ -26,11 +34,9 @@ - `2024/06/19`: 我们提出了 Needle In A Multimodal Haystack ([MM-NIAH](https://github.com/OpenGVLab/MM-NIAH)),这是第一个针对模型关于长多模态文档理解能力的评测基准。 - `2024/05/30`: 我们发布了 [ShareGPT-4o](https://sharegpt4o.github.io/),这是一个大规模、高质量的多模态数据集。我们计划开源一批使用 GPT-4o 精心标注的数据,包括 200K 条图像详细描述、10K 条视频详细描述,以及 10K 条音频详细描述。 - `2024/05/29`: 我们开源了 Mini-InternVL 系列,包括以下两个对话模型:[Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) 和 [Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)。这些模型在极小的尺寸下实现了令人印象深刻的性能:2B 模型以 8% 的模型尺寸实现了 80% 的性能,4B 模型以 16% 的模型尺寸实现了 90% 的性能。更多细节请查看我们的[博客](https://internvl.github.io/blog/2024-05-25-Mini-InternVL-1.5/)。 -- `2024/05/28`: 感谢 [lmdeploy](https://github.com/InternLM/lmdeploy) 团队提供的 AWQ 量化支持。InternVL 1.5 的 4-bit 模型发布在 [OpenGVLab/InternVL-Chat-V1-5-AWQ](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-AWQ)。 - `2024/05/13`: InternVL 1.0 现在可以作为扩散模型的 [文本编码器](https://huggingface.co/OpenGVLab/InternVL-14B-224px),支持全球超过 110 种语言的多语言生成。详情请看 [MuLan](https://github.com/mulanai/MuLan)。 - `2024/04/18`: InternVL-Chat-V1-5 已经在 [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5) 发布,在 MMMU、DocVQA、ChartQA、MathVista 等各种基准测试中,性能接近 GPT-4V 和 Gemini Pro。 - `2024/02/27`: InternVL 已被 CVPR 2024 (Oral) 接收!🎉 -- `2024/02/24`: InternVL-Chat 系列模型已经接入 [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) 评测框架。 - `2024/02/21`: [InternVL-Chat-V1-2-Plus](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) 在 MathVista(59.9)、MMBench(83.8)和 
MMVP(58.7)上实现了 SOTA 性能。详情请看我们的[博客](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/)。 - `2024/02/12`: InternVL-Chat-V1-2 已经发布,它在 MMMU 验证集上达到了 51.6,在 MMBench 测试集上达到了 82.3。 更多信息请参考我们的[博客](https://internvl.github.io/blog/2024-02-21-InternVL-1.2/)以及 [SFT 数据](./internvl_chat#prepare-training-datasets)。该模型已经在 [HuggingFace](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) 发布,训练、测评的数据和脚本均已开源。 - `2024/01/24`: InternVL-Chat-V1-1 已经发布,它支持中文对话,并具备强大的 OCR 能力,详情请看[这里](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)。 @@ -929,6 +935,12 @@ print(f'User: {question}\nAssistant: {response}') journal={arXiv preprint arXiv:2404.16821}, year={2024} } +@article{gao2024mini, + title={Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5\% Parameters and 90\% Performance}, + author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others}, + journal={arXiv preprint arXiv:2410.16261}, + year={2024} +} ``` ## 致谢 diff --git a/internvl_chat/eval/domain_specific/drivelm/evaluate.py b/internvl_chat/eval/domain_specific/drivelm/evaluate.py index fe9a5a33..2c6b0024 100644 --- a/internvl_chat/eval/domain_specific/drivelm/evaluate.py +++ b/internvl_chat/eval/domain_specific/drivelm/evaluate.py @@ -1,302 +1,302 @@ -import argparse -import itertools -import json -import os -import random -import time -from functools import partial - -import torch -from datasets import concatenate_datasets, load_dataset -from internvl.model.internvl_chat import InternVLChatModel -from internvl.train.dataset import build_transform, dynamic_preprocess -from torch.utils.data import Dataset -from tqdm import tqdm -from transformers import AutoTokenizer -from PIL import Image -import re - -ds_collections = { - 'DriveLM_val': { - 'root': 'InternVL-Domain-Adaptation-Data/val/drivelm_val.jsonl', - 'max_new_tokens': 200, - 'min_new_tokens': 1, - 'split': 'validation', - "image_root":"InternVL-Domain-Adaptation-Data/images/drivelm/stitch", - } -} - - - -def post_process(pred): - pred = pred.strip() - pattern = r"" - mapping={"CAM_FRONT_LEFT":[0,0],"CAM_FRONT":[1,0],"CAM_FRONT_RIGHT":[2,0],"CAM_BACK_LEFT":[0,1],"CAM_BACK":[1,1],"CAM_BACK_RIGHT":[2,1]} - patch_size = 448 - width = patch_size * 2 - height = patch_size - whole_img_width=width*3 - whole_img_height=height*2 - matches = re.findall(pattern, pred) - for object_id in matches: - - object_id_c = object_id.replace("<","").replace(">","") - try: - ctag = object_id_c.split(",")[0] - cxcy = json.loads(",".join(object_id_c.split(",")[2:])) - cam = object_id_c.split(",")[1] - if cam in mapping: - mx,my=mapping[cam] - # old_wide,old_height = images_size[cam] - old_wide,old_height = 1600, 900 - cx ,cy = cxcy - cx = (cx / 1000) * whole_img_width - cy = (cy/1000) * whole_img_height - cx -= mx*width - cy -= my*height - cx = cx/width * old_wide - cy = cy/height * old_height - # cx =max(0,min(old_wide,cx)) - # cy =max(0,min(old_height,cy)) - cx =round(max(0,min(old_wide,cx)),1) - cy =round(max(0,min(old_height,cy)),1) - new_object_id = f"<{ctag},{cam},{cx},{cy}>" - - pred = pred.replace(object_id,new_object_id) - except Exception as e: - print(e) - return pred - -def collate_fn(batches, tokenizer): - pixel_values = torch.cat([_['pixel_values'] for _ in batches], dim=0) - questions = [_['question'] for _ in batches] - questions_old = [_['question_old'] for _ in batches] - answers = [_['answer'] for _ in batches] - data_ids = [_['data_id'] for _ in batches] - # images_sizes = 
[_['images_size'] for _ in batches] - return pixel_values, questions_old,questions, answers, data_ids - -class DriveLMDataset(torch.utils.data.Dataset): - - def __init__(self, root, split, prompt, image_path, input_size=224, dynamic_image_size=False, - use_thumbnail=False, max_num=6,): - # run for each subject - - with open(root,"r") as f: - self.data = [json.loads(line) for line in f.readlines()] - # data_val = json.load(f) - # merge all dataset - # self.data = concatenate_datasets(sub_dataset_list) - self.prompt = prompt - self.input_size = input_size - self.dynamic_image_size = dynamic_image_size - self.use_thumbnail = use_thumbnail - self.max_num = max_num - self.transform = build_transform(is_train=False, input_size=input_size) - self.image_path =image_path - - # with open(image_meta,"r") as f: - # self.image_meta = json.load(f) - - def __len__(self): - return len(self.data) - - def __getitem__(self, idx): - - data = self.data[idx] - data_id = data['id'] - question = data["conversations"][0]["value"].strip() - question_old = data["question_old"] - image_file = os.path.join(self.image_path,data['image']) - image = Image.open(image_file).convert("RGB") - # question_type = data['question_type'] - - # choices = eval(data['options']) - answer = data["conversations"][1]["value"].strip() - - if self.dynamic_image_size: - # images = [] - - pil_image = dynamic_preprocess(image, image_size=self.input_size, - use_thumbnail=self.use_thumbnail, - max_num=self.max_num) - images = pil_image - else: - images = [image] - pixel_values = [self.transform(image) for image in images] - pixel_values = torch.stack(pixel_values) - - # image_id = os.path.basename(image_file).split(".")[0] - # images_size = self.image_meta[image_id]["images_size"] - - - return { - "question_old":question_old, - 'question': question, - 'pixel_values': pixel_values, - # 'images_size':images_size, - 'answer': answer, - 'data_id': data_id - } - - -class InferenceSampler(torch.utils.data.sampler.Sampler): - - def __init__(self, size): - self._size = int(size) - assert size > 0 - self._rank = torch.distributed.get_rank() - self._world_size = torch.distributed.get_world_size() - self._local_indices = self._get_local_indices(size, self._world_size, self._rank) - - @staticmethod - def _get_local_indices(total_size, world_size, rank): - shard_size = total_size // world_size - left = total_size % world_size - shard_sizes = [shard_size + int(r < left) for r in range(world_size)] - - begin = sum(shard_sizes[:rank]) - end = min(sum(shard_sizes[:rank + 1]), total_size) - return range(begin, end) - - def __iter__(self): - yield from self._local_indices - - def __len__(self): - return len(self._local_indices) - -def evaluate_chat_model(): - - random.seed(args.seed) - prompt = None - for ds_name in args.datasets: - dataset = DriveLMDataset( - root=ds_collections[ds_name]['root'], - split=ds_collections[ds_name]['split'], - prompt=prompt, - image_path = ds_collections[ds_name]["image_root"], - # image_meta = ds_collections[ds_name]["image_meta"], - input_size=image_size, - dynamic_image_size=args.dynamic, - use_thumbnail=use_thumbnail, - max_num=args.max_num - ) - dataloader = torch.utils.data.DataLoader( - dataset=dataset, - sampler=InferenceSampler(len(dataset)), - batch_size=args.batch_size, - num_workers=args.num_workers, - pin_memory=True, - drop_last=False, - collate_fn=partial(collate_fn, tokenizer=tokenizer), - ) - - outputs = [] - for _, (pixel_values, questions_old, questions, answers, data_ids) in tqdm(enumerate(dataloader)): - 
pixel_values = pixel_values.to(torch.bfloat16).cuda() - generation_config = dict( - num_beams=args.num_beams, - max_new_tokens=ds_collections[ds_name]['max_new_tokens'], - min_new_tokens=ds_collections[ds_name]['min_new_tokens'], - do_sample=True if args.temperature > 0 else False, - temperature=args.temperature, - ) - pred = model.chat( - tokenizer=tokenizer, - pixel_values=pixel_values, - question=questions[0], - generation_config=generation_config - ) - - # preds = [pred] - # if len(options[0]) == 0: - # preds = [pred] - # else: - preds = [post_process(pred)] - - for question, pred, answer, data_id,question_old in zip(questions, preds, answers, data_ids,questions_old): - outputs.append({ - 'question': question_old, - 'answer': pred, - 'gt_answers': answer, - 'id': data_id - }) - - torch.distributed.barrier() - - world_size = torch.distributed.get_world_size() - merged_outputs = [None for _ in range(world_size)] - torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs)) - - merged_outputs = [json.loads(_) for _ in merged_outputs] - merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] - - if torch.distributed.get_rank() == 0: - - print(f'Evaluating {ds_name} ...') - time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) - results_file = f'{ds_name}_{time_prefix}.json' - output_path = os.path.join(args.out_dir, results_file) - - with open(output_path, 'w') as f: - json.dump(merged_outputs, f, indent=4) - print('Results saved to {}'.format(output_path)) - - - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--checkpoint', type=str, default='') - parser.add_argument('--datasets', type=str, default='MMMU_dev') - parser.add_argument('--batch-size', type=int, default=1) - parser.add_argument('--num-workers', type=int, default=1) - parser.add_argument('--num-beams', type=int, default=5) - parser.add_argument('--temperature', type=float, default=0.0) - parser.add_argument('--out-dir', type=str, default='results') - parser.add_argument('--seed', type=int, default=0) - parser.add_argument('--dynamic', action='store_true') - parser.add_argument('--max-num', type=int, default=12) - parser.add_argument('--load-in-8bit', action='store_true') - parser.add_argument('--auto', action='store_true') - args = parser.parse_args() - - if not os.path.exists(args.out_dir): - os.makedirs(args.out_dir) - - args.datasets = args.datasets.split(',') - print('datasets:', args.datasets) - assert args.batch_size == 1, 'Only batch size 1 is supported' - - torch.distributed.init_process_group( - backend='nccl', - world_size=int(os.getenv('WORLD_SIZE', '1')), - rank=int(os.getenv('RANK', '0')), - ) - - torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) - - if args.auto: - os.environ['CUDA_LAUNCH_BLOCKING'] = '1' - kwargs = {'device_map': 'auto'} if args.auto else {} - tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True, use_fast=False) - model = InternVLChatModel.from_pretrained( - args.checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16, - load_in_8bit=args.load_in_8bit, **kwargs).eval() - if not args.load_in_8bit and not args.auto: - model = model.cuda() - image_size = model.config.force_image_size or model.config.vision_config.image_size - use_thumbnail = model.config.use_thumbnail - - total_params = sum(p.numel() for p in model.parameters()) / 1e9 - if total_params > 20 or args.dynamic: - args.num_beams = 1 - print(f'[test] total_params: {total_params}B, use num_beams: {args.num_beams}') - 
else: - print(f'[test] total_params: {total_params}B') - print(f'[test] image_size: {image_size}') - print(f'[test] template: {model.config.template}') - print(f'[test] dynamic_image_size: {args.dynamic}') - print(f'[test] use_thumbnail: {use_thumbnail}') - print(f'[test] max_num: {args.max_num}') - - evaluate_chat_model() +import argparse +import itertools +import json +import os +import random +import re +import time +from functools import partial + +import torch +from datasets import concatenate_datasets, load_dataset +from internvl.model.internvl_chat import InternVLChatModel +from internvl.train.dataset import build_transform, dynamic_preprocess +from PIL import Image +from torch.utils.data import Dataset +from tqdm import tqdm +from transformers import AutoTokenizer + +ds_collections = { + 'DriveLM_val': { + 'root': 'InternVL-Domain-Adaptation-Data/val/drivelm_val.jsonl', + 'max_new_tokens': 200, + 'min_new_tokens': 1, + 'split': 'validation', + 'image_root': 'InternVL-Domain-Adaptation-Data/images/drivelm/stitch', + } +} + + +def post_process(pred): + pred = pred.strip() + pattern = r'' + mapping = {'CAM_FRONT_LEFT': [0, 0], 'CAM_FRONT': [1, 0], 'CAM_FRONT_RIGHT': [2, 0], 'CAM_BACK_LEFT': [0, 1], + 'CAM_BACK': [1, 1], 'CAM_BACK_RIGHT': [2, 1]} + patch_size = 448 + width = patch_size * 2 + height = patch_size + whole_img_width = width * 3 + whole_img_height = height * 2 + matches = re.findall(pattern, pred) + for object_id in matches: + + object_id_c = object_id.replace('<', '').replace('>', '') + try: + ctag = object_id_c.split(',')[0] + cxcy = json.loads(','.join(object_id_c.split(',')[2:])) + cam = object_id_c.split(',')[1] + if cam in mapping: + mx, my = mapping[cam] + # old_wide,old_height = images_size[cam] + old_wide, old_height = 1600, 900 + cx, cy = cxcy + cx = (cx / 1000) * whole_img_width + cy = (cy / 1000) * whole_img_height + cx -= mx * width + cy -= my * height + cx = cx / width * old_wide + cy = cy / height * old_height + # cx =max(0,min(old_wide,cx)) + # cy =max(0,min(old_height,cy)) + cx = round(max(0, min(old_wide, cx)), 1) + cy = round(max(0, min(old_height, cy)), 1) + new_object_id = f'<{ctag},{cam},{cx},{cy}>' + + pred = pred.replace(object_id, new_object_id) + except Exception as e: + print(e) + return pred + + +def collate_fn(batches, tokenizer): + pixel_values = torch.cat([_['pixel_values'] for _ in batches], dim=0) + questions = [_['question'] for _ in batches] + questions_old = [_['question_old'] for _ in batches] + answers = [_['answer'] for _ in batches] + data_ids = [_['data_id'] for _ in batches] + # images_sizes = [_['images_size'] for _ in batches] + return pixel_values, questions_old, questions, answers, data_ids + + +class DriveLMDataset(torch.utils.data.Dataset): + + def __init__(self, root, split, prompt, image_path, input_size=224, dynamic_image_size=False, + use_thumbnail=False, max_num=6, ): + # run for each subject + + with open(root, 'r') as f: + self.data = [json.loads(line) for line in f.readlines()] + # data_val = json.load(f) + # merge all dataset + # self.data = concatenate_datasets(sub_dataset_list) + self.prompt = prompt + self.input_size = input_size + self.dynamic_image_size = dynamic_image_size + self.use_thumbnail = use_thumbnail + self.max_num = max_num + self.transform = build_transform(is_train=False, input_size=input_size) + self.image_path = image_path + + # with open(image_meta,"r") as f: + # self.image_meta = json.load(f) + + def __len__(self): + return len(self.data) + + def __getitem__(self, idx): + + data = self.data[idx] 
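+        # Each JSONL record provides an id, the stitched multi-camera image filename
+        # (joined with image_path), the original question text, and a two-turn
+        # conversation whose first value is the prompt and second value is the answer.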
+ data_id = data['id'] + question = data['conversations'][0]['value'].strip() + question_old = data['question_old'] + image_file = os.path.join(self.image_path, data['image']) + image = Image.open(image_file).convert('RGB') + # question_type = data['question_type'] + + # choices = eval(data['options']) + answer = data['conversations'][1]['value'].strip() + + if self.dynamic_image_size: + # images = [] + + pil_image = dynamic_preprocess(image, image_size=self.input_size, + use_thumbnail=self.use_thumbnail, + max_num=self.max_num) + images = pil_image + else: + images = [image] + pixel_values = [self.transform(image) for image in images] + pixel_values = torch.stack(pixel_values) + + # image_id = os.path.basename(image_file).split(".")[0] + # images_size = self.image_meta[image_id]["images_size"] + + return { + 'question_old': question_old, + 'question': question, + 'pixel_values': pixel_values, + # 'images_size':images_size, + 'answer': answer, + 'data_id': data_id + } + + +class InferenceSampler(torch.utils.data.sampler.Sampler): + + def __init__(self, size): + self._size = int(size) + assert size > 0 + self._rank = torch.distributed.get_rank() + self._world_size = torch.distributed.get_world_size() + self._local_indices = self._get_local_indices(size, self._world_size, self._rank) + + @staticmethod + def _get_local_indices(total_size, world_size, rank): + shard_size = total_size // world_size + left = total_size % world_size + shard_sizes = [shard_size + int(r < left) for r in range(world_size)] + + begin = sum(shard_sizes[:rank]) + end = min(sum(shard_sizes[:rank + 1]), total_size) + return range(begin, end) + + def __iter__(self): + yield from self._local_indices + + def __len__(self): + return len(self._local_indices) + + +def evaluate_chat_model(): + random.seed(args.seed) + prompt = None + for ds_name in args.datasets: + dataset = DriveLMDataset( + root=ds_collections[ds_name]['root'], + split=ds_collections[ds_name]['split'], + prompt=prompt, + image_path=ds_collections[ds_name]['image_root'], + # image_meta = ds_collections[ds_name]["image_meta"], + input_size=image_size, + dynamic_image_size=args.dynamic, + use_thumbnail=use_thumbnail, + max_num=args.max_num + ) + dataloader = torch.utils.data.DataLoader( + dataset=dataset, + sampler=InferenceSampler(len(dataset)), + batch_size=args.batch_size, + num_workers=args.num_workers, + pin_memory=True, + drop_last=False, + collate_fn=partial(collate_fn, tokenizer=tokenizer), + ) + + outputs = [] + for _, (pixel_values, questions_old, questions, answers, data_ids) in tqdm(enumerate(dataloader)): + pixel_values = pixel_values.to(torch.bfloat16).cuda() + generation_config = dict( + num_beams=args.num_beams, + max_new_tokens=ds_collections[ds_name]['max_new_tokens'], + min_new_tokens=ds_collections[ds_name]['min_new_tokens'], + do_sample=True if args.temperature > 0 else False, + temperature=args.temperature, + ) + pred = model.chat( + tokenizer=tokenizer, + pixel_values=pixel_values, + question=questions[0], + generation_config=generation_config + ) + + # preds = [pred] + # if len(options[0]) == 0: + # preds = [pred] + # else: + preds = [post_process(pred)] + + for question, pred, answer, data_id, question_old in zip(questions, preds, answers, data_ids, + questions_old): + outputs.append({ + 'question': question_old, + 'answer': pred, + 'gt_answers': answer, + 'id': data_id + }) + + torch.distributed.barrier() + + world_size = torch.distributed.get_world_size() + merged_outputs = [None for _ in range(world_size)] + 
torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs)) + + merged_outputs = [json.loads(_) for _ in merged_outputs] + merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] + + if torch.distributed.get_rank() == 0: + print(f'Evaluating {ds_name} ...') + time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) + results_file = f'{ds_name}_{time_prefix}.json' + output_path = os.path.join(args.out_dir, results_file) + + with open(output_path, 'w') as f: + json.dump(merged_outputs, f, indent=4) + print('Results saved to {}'.format(output_path)) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--checkpoint', type=str, default='') + parser.add_argument('--datasets', type=str, default='MMMU_dev') + parser.add_argument('--batch-size', type=int, default=1) + parser.add_argument('--num-workers', type=int, default=1) + parser.add_argument('--num-beams', type=int, default=5) + parser.add_argument('--temperature', type=float, default=0.0) + parser.add_argument('--out-dir', type=str, default='results') + parser.add_argument('--seed', type=int, default=0) + parser.add_argument('--dynamic', action='store_true') + parser.add_argument('--max-num', type=int, default=12) + parser.add_argument('--load-in-8bit', action='store_true') + parser.add_argument('--auto', action='store_true') + args = parser.parse_args() + + if not os.path.exists(args.out_dir): + os.makedirs(args.out_dir) + + args.datasets = args.datasets.split(',') + print('datasets:', args.datasets) + assert args.batch_size == 1, 'Only batch size 1 is supported' + + torch.distributed.init_process_group( + backend='nccl', + world_size=int(os.getenv('WORLD_SIZE', '1')), + rank=int(os.getenv('RANK', '0')), + ) + + torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) + + if args.auto: + os.environ['CUDA_LAUNCH_BLOCKING'] = '1' + kwargs = {'device_map': 'auto'} if args.auto else {} + tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True, use_fast=False) + model = InternVLChatModel.from_pretrained( + args.checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16, + load_in_8bit=args.load_in_8bit, **kwargs).eval() + if not args.load_in_8bit and not args.auto: + model = model.cuda() + image_size = model.config.force_image_size or model.config.vision_config.image_size + use_thumbnail = model.config.use_thumbnail + + total_params = sum(p.numel() for p in model.parameters()) / 1e9 + if total_params > 20 or args.dynamic: + args.num_beams = 1 + print(f'[test] total_params: {total_params}B, use num_beams: {args.num_beams}') + else: + print(f'[test] total_params: {total_params}B') + print(f'[test] image_size: {image_size}') + print(f'[test] template: {model.config.template}') + print(f'[test] dynamic_image_size: {args.dynamic}') + print(f'[test] use_thumbnail: {use_thumbnail}') + print(f'[test] max_num: {args.max_num}') + + evaluate_chat_model() diff --git a/internvl_chat/eval/domain_specific/mme_rw/evaluate.py b/internvl_chat/eval/domain_specific/mme_rw/evaluate.py index 19bcc30c..f5be2d44 100644 --- a/internvl_chat/eval/domain_specific/mme_rw/evaluate.py +++ b/internvl_chat/eval/domain_specific/mme_rw/evaluate.py @@ -1,333 +1,335 @@ -import argparse -import base64 -import itertools -import json -import os -import random -import time -from functools import partial -from io import BytesIO -import re -import pandas as pd -import torch -from internvl.model.internvl_chat import InternVLChatModel -from internvl.train.dataset import build_transform, 
dynamic_preprocess -from PIL import Image -from torch.utils.data import Dataset -from tqdm import tqdm -from transformers import AutoTokenizer -from typing import Literal - -ds_collections = { - 'MME_RealWorld':{ - 'root': 'InternVL-Domain-Adaptation-DataMME-RealWorld/val/MME_RealWorld.json', - 'max_new_tokens': 100, - 'min_new_tokens': 1, - 'img_root':'InternVL-Domain-Adaptation-DataMME-RealWorld/images/MME-RealWorld/data', - 'type': 'dev', - 'language': 'en' - } -} - - -def collate_fn(batches, tokenizer): - pixel_values = torch.cat([_['pixel_values'] for _ in batches], dim=0) - questions = [_['question'] for _ in batches] - answers = [_['answer'] for _ in batches] - indexes = [_['index'] for _ in batches] - choices = [_['choice'] for _ in batches] - categorys = [_['category'] for _ in batches] - tasks = [_['task'] for _ in batches] - return pixel_values, questions, answers, indexes, choices,categorys,tasks - - -class MMERealworldDataset(torch.utils.data.Dataset): - - def __init__(self, root, prompt, language, subtask:Literal["Monitoring","OCR with Complex Context","Diagram and Table",'Autonomous_Driving','Remote Sensing'], - img_root,input_size=224, dynamic_image_size=False, - use_thumbnail=False, max_num=6): - with open(root,"r") as f: - self.data_meta = json.load(f) - self.subtask = subtask - self.data_meta = [item for item in self.data_meta if item["Subtask"]==self.subtask] - self.img_root = img_root - self.prompt = prompt - self.language = language - self.input_size = input_size - self.dynamic_image_size = dynamic_image_size - self.use_thumbnail = use_thumbnail - self.max_num = max_num - self.transform = build_transform(is_train=False, input_size=input_size) - - def __len__(self): - return len(self.data_meta) - - def __getitem__(self, idx): - index = self.data_meta[idx]["Question_id"] - assert self.data_meta[idx]["Question Type"] == "Multiple Choice" - image = os.path.join( self.img_root,self.data_meta[idx]['Image']) - question = self.data_meta[idx]['Text'] - choices = self.data_meta[idx]["Answer choices"] - answer = self.data_meta[idx]["Ground truth"] - category =self.data_meta[idx]["Category"] - task =self.data_meta[idx]["Task"] - # catetory = self.df.iloc[idx]['category'] - # l2_catetory = self.df.iloc[idx]['l2-category'] - - image = Image.open(image).convert('RGB') - if self.dynamic_image_size: - images = dynamic_preprocess(image, image_size=self.input_size, - use_thumbnail=self.use_thumbnail, - max_num=self.max_num) - else: - images = [image] - pixel_values = [self.transform(image) for image in images] - pixel_values = torch.stack(pixel_values) - - if self.language == 'cn': - question = question + 'The choices are listed below:\n' + '\n'.join(choices) + '\n' + self.prompt['cn'] - else: - question = question + '选项如下所示:\n'+'\n'.join(choices) + '\n' + self.prompt['en'] - - return { - 'question': question, - 'pixel_values': pixel_values, - 'answer': answer, - 'index': index, - 'choice': choices, - 'category':category, - 'task':task - } - - - -class InferenceSampler(torch.utils.data.sampler.Sampler): - - def __init__(self, size): - self._size = int(size) - assert size > 0 - self._rank = torch.distributed.get_rank() - self._world_size = torch.distributed.get_world_size() - self._local_indices = self._get_local_indices(size, self._world_size, self._rank) - - @staticmethod - def _get_local_indices(total_size, world_size, rank): - shard_size = total_size // world_size - left = total_size % world_size - shard_sizes = [shard_size + int(r < left) for r in range(world_size)] - - begin = 
sum(shard_sizes[:rank]) - end = min(sum(shard_sizes[:rank + 1]), total_size) - return range(begin, end) - - def __iter__(self): - yield from self._local_indices - - def __len__(self): - return len(self._local_indices) - - -def post_process(s, choices): - s = s.strip() - answer_prefixes = [ - "The best answer is", - "The correct answer is", - "The answer is", - "The answer", - "The best option is" - "The correct option is", - "Best answer:", - "Best option:", - ] - for answer_prefix in answer_prefixes: - s = s.replace(answer_prefix, "") - - if len(s.split()) > 10 and not re.search("[ABCDE]", s): - return "" - matches = re.search(r'[ABCDE]', s) - if matches is None: - for choice in choices: - if s.lower() in choice.lower(): - return choice[1] - return "" - return matches[0] - -def evaluate(outputs): - results= {"Reasoning":{}, - "Perception":{}} - for data_item in outputs: - cnt = data_item["answer"] == data_item['gt_answers'] - category = data_item['category'] - task = data_item['task'] - if category not in results[task]: - results[task][category] = {'true': cnt, 'false': 1-cnt} - else: - results[task][category]['true'] += cnt - results[task][category]['false'] += 1 - cnt - - cnt_subtask, sum_subtask = 0, 0 - for task, tasks_values in results.items(): - cnt_task, sum_task = 0, 0 - for category, category_dict in tasks_values.items(): - cnt_task += category_dict['true'] - sum_task += category_dict['false'] + category_dict['true'] - acc = category_dict['true'] / (category_dict['false'] + category_dict['true']) - print(f'-'*4 + f'\t' + 'Acc ' + '{:.4f}'.format(acc) + f'\t{category.capitalize()}') - - cnt_subtask +=cnt_task - sum_subtask += sum_task - if sum_task == 0: - acc_task = 0 - else: - acc_task = cnt_task / sum_task - print(f'*'*32 + f'Acc' + '{:.4f}'.format(acc_task) + f'\t{task}') - - if sum_subtask == 0: - acc_subtasks = 0 - else: - acc_subtasks = cnt_subtask / sum_subtask - print(f'+'*16 + f'\t Acc ' + '{:.4f}'.format(acc_subtasks)) - return acc_subtasks - -def evaluate_chat_model(): - random.seed(args.seed) - - for ds_name in args.datasets: - dataset = MMERealworldDataset( - root=ds_collections[ds_name]['root'], - prompt=prompt, - language=ds_collections[ds_name]['language'], - subtask=args.subtask, - img_root=ds_collections[ds_name]['img_root'], - input_size=image_size, - dynamic_image_size=args.dynamic, - use_thumbnail=use_thumbnail, - max_num=args.max_num - ) - dataloader = torch.utils.data.DataLoader( - dataset=dataset, - sampler=InferenceSampler(len(dataset)), - batch_size=args.batch_size, - num_workers=args.num_workers, - pin_memory=True, - drop_last=False, - collate_fn=partial(collate_fn, tokenizer=tokenizer), - ) - - outputs = [] - for pixel_values, questions, answers, indexes, options,categorys,tasks in tqdm(dataloader): - pixel_values = pixel_values.to(torch.bfloat16).cuda() - generation_config = dict( - num_beams=args.num_beams, - max_new_tokens=ds_collections[ds_name]['max_new_tokens'], - min_new_tokens=ds_collections[ds_name]['min_new_tokens'], - do_sample=True if args.temperature > 0 else False, - temperature=args.temperature, - ) - out = model.chat( - tokenizer=tokenizer, - pixel_values=pixel_values, - question=questions[0], - generation_config=generation_config - ) - outs = [out] - preds = [post_process(out, options[0])] - - for question, pred, answer, index, out,category,task in zip(questions, preds, answers, indexes,outs,categorys,tasks): - outputs.append({ - 'question': question, - 'output':out, - 'answer': pred, - 'gt_answers': answer, - 'index': index, - 
'category':category, - 'task':task - }) - - - torch.distributed.barrier() - - world_size = torch.distributed.get_world_size() - merged_outputs = [None for _ in range(world_size)] - torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs)) - - merged_outputs = [json.loads(_) for _ in merged_outputs] - merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] - - if torch.distributed.get_rank() == 0: - - print(f'Evaluating {ds_name} ...') - time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) - results_file = f'{ds_name}_{args.subtask}_{time_prefix}.json' - output_path = os.path.join(args.out_dir, results_file) - - with open(output_path,"w") as f: - json.dump(merged_outputs,f,indent=4) - evaluate(merged_outputs) - - print('Results saved to {}'.format(output_path)) - - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--checkpoint', type=str, default='') - parser.add_argument('--datasets', type=str, default='mmbench_dev_20230712') - parser.add_argument('--subtask', type=str, default='Autonomous_Driving') - parser.add_argument('--batch-size', type=int, default=1) - parser.add_argument('--num-workers', type=int, default=1) - parser.add_argument('--num-beams', type=int, default=5) - parser.add_argument('--temperature', type=float, default=0.0) - parser.add_argument('--out-dir', type=str, default='results') - parser.add_argument('--seed', type=int, default=0) - parser.add_argument('--dynamic', action='store_true') - parser.add_argument('--max-num', type=int, default=6) - parser.add_argument('--load-in-8bit', action='store_true') - parser.add_argument('--load-in-4bit', action='store_true') - parser.add_argument('--auto', action='store_true') - args = parser.parse_args() - - if not os.path.exists(args.out_dir): - os.makedirs(args.out_dir) - - args.datasets = args.datasets.split(',') - print('datasets:', args.datasets) - assert args.batch_size == 1, 'Only batch size 1 is supported' - - torch.distributed.init_process_group( - backend='nccl', - world_size=int(os.getenv('WORLD_SIZE', '1')), - rank=int(os.getenv('RANK', '0')), - ) - - torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) - - if args.auto: - os.environ['CUDA_LAUNCH_BLOCKING'] = '1' - kwargs = {'device_map': 'auto'} if args.auto else {} - tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True, use_fast=False) - model = InternVLChatModel.from_pretrained( - args.checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16, - load_in_8bit=args.load_in_8bit, load_in_4bit=args.load_in_4bit,**kwargs).eval() - if not args.load_in_8bit and not args.load_in_4bit and not args.auto: - model = model.cuda() - image_size = model.config.force_image_size or model.config.vision_config.image_size - use_thumbnail = model.config.use_thumbnail - - total_params = sum(p.numel() for p in model.parameters()) / 1e9 - if total_params > 20 or args.dynamic: - args.num_beams = 1 - print(f'[test] total_params: {total_params}B, use num_beams: {args.num_beams}') - else: - print(f'[test] total_params: {total_params}B') - print(f'[test] image_size: {image_size}') - print(f'[test] template: {model.config.template}') - print(f'[test] dynamic_image_size: {args.dynamic}') - print(f'[test] use_thumbnail: {use_thumbnail}') - print(f'[test] max_num: {args.max_num}') - - prompt = { - 'en': 'Select the best answer to the above multiple-choice question based on the image. \ - Respond with only the letter (A, B, C, D, or E) of the correct option. 
\nThe best answer is:', - 'cn': '根据图像选择上述多项选择题的最佳答案。只需回答正确选项的字母(A, B, C, D 或 E)。\n 最佳答案为:', - } - evaluate_chat_model() +import argparse +import base64 +import itertools +import json +import os +import random +import re +import time +from functools import partial +from io import BytesIO +from typing import Literal + +import pandas as pd +import torch +from internvl.model.internvl_chat import InternVLChatModel +from internvl.train.dataset import build_transform, dynamic_preprocess +from PIL import Image +from torch.utils.data import Dataset +from tqdm import tqdm +from transformers import AutoTokenizer + +ds_collections = { + 'MME_RealWorld': { + 'root': 'InternVL-Domain-Adaptation-DataMME-RealWorld/val/MME_RealWorld.json', + 'max_new_tokens': 100, + 'min_new_tokens': 1, + 'img_root': 'InternVL-Domain-Adaptation-DataMME-RealWorld/images/MME-RealWorld/data', + 'type': 'dev', + 'language': 'en' + } +} + + +def collate_fn(batches, tokenizer): + pixel_values = torch.cat([_['pixel_values'] for _ in batches], dim=0) + questions = [_['question'] for _ in batches] + answers = [_['answer'] for _ in batches] + indexes = [_['index'] for _ in batches] + choices = [_['choice'] for _ in batches] + categorys = [_['category'] for _ in batches] + tasks = [_['task'] for _ in batches] + return pixel_values, questions, answers, indexes, choices, categorys, tasks + + +class MMERealworldDataset(torch.utils.data.Dataset): + + def __init__(self, root, prompt, language, subtask: Literal[ + 'Monitoring', 'OCR with Complex Context', 'Diagram and Table', 'Autonomous_Driving', 'Remote Sensing'], + img_root, input_size=224, dynamic_image_size=False, + use_thumbnail=False, max_num=6): + with open(root, 'r') as f: + self.data_meta = json.load(f) + self.subtask = subtask + self.data_meta = [item for item in self.data_meta if item['Subtask'] == self.subtask] + self.img_root = img_root + self.prompt = prompt + self.language = language + self.input_size = input_size + self.dynamic_image_size = dynamic_image_size + self.use_thumbnail = use_thumbnail + self.max_num = max_num + self.transform = build_transform(is_train=False, input_size=input_size) + + def __len__(self): + return len(self.data_meta) + + def __getitem__(self, idx): + index = self.data_meta[idx]['Question_id'] + assert self.data_meta[idx]['Question Type'] == 'Multiple Choice' + image = os.path.join(self.img_root, self.data_meta[idx]['Image']) + question = self.data_meta[idx]['Text'] + choices = self.data_meta[idx]['Answer choices'] + answer = self.data_meta[idx]['Ground truth'] + category = self.data_meta[idx]['Category'] + task = self.data_meta[idx]['Task'] + # catetory = self.df.iloc[idx]['category'] + # l2_catetory = self.df.iloc[idx]['l2-category'] + + image = Image.open(image).convert('RGB') + if self.dynamic_image_size: + images = dynamic_preprocess(image, image_size=self.input_size, + use_thumbnail=self.use_thumbnail, + max_num=self.max_num) + else: + images = [image] + pixel_values = [self.transform(image) for image in images] + pixel_values = torch.stack(pixel_values) + + if self.language == 'cn': + question = question + 'The choices are listed below:\n' + '\n'.join(choices) + '\n' + self.prompt['cn'] + else: + question = question + '选项如下所示:\n' + '\n'.join(choices) + '\n' + self.prompt['en'] + + return { + 'question': question, + 'pixel_values': pixel_values, + 'answer': answer, + 'index': index, + 'choice': choices, + 'category': category, + 'task': task + } + + +class InferenceSampler(torch.utils.data.sampler.Sampler): + + def __init__(self, size): + 
self._size = int(size) + assert size > 0 + self._rank = torch.distributed.get_rank() + self._world_size = torch.distributed.get_world_size() + self._local_indices = self._get_local_indices(size, self._world_size, self._rank) + + @staticmethod + def _get_local_indices(total_size, world_size, rank): + shard_size = total_size // world_size + left = total_size % world_size + shard_sizes = [shard_size + int(r < left) for r in range(world_size)] + + begin = sum(shard_sizes[:rank]) + end = min(sum(shard_sizes[:rank + 1]), total_size) + return range(begin, end) + + def __iter__(self): + yield from self._local_indices + + def __len__(self): + return len(self._local_indices) + + +def post_process(s, choices): + s = s.strip() + answer_prefixes = [ + 'The best answer is', + 'The correct answer is', + 'The answer is', + 'The answer', + 'The best option is' + 'The correct option is', + 'Best answer:', + 'Best option:', + ] + for answer_prefix in answer_prefixes: + s = s.replace(answer_prefix, '') + + if len(s.split()) > 10 and not re.search('[ABCDE]', s): + return '' + matches = re.search(r'[ABCDE]', s) + if matches is None: + for choice in choices: + if s.lower() in choice.lower(): + return choice[1] + return '' + return matches[0] + + +def evaluate(outputs): + results = {'Reasoning': {}, + 'Perception': {}} + for data_item in outputs: + cnt = data_item['answer'] == data_item['gt_answers'] + category = data_item['category'] + task = data_item['task'] + if category not in results[task]: + results[task][category] = {'true': cnt, 'false': 1 - cnt} + else: + results[task][category]['true'] += cnt + results[task][category]['false'] += 1 - cnt + + cnt_subtask, sum_subtask = 0, 0 + for task, tasks_values in results.items(): + cnt_task, sum_task = 0, 0 + for category, category_dict in tasks_values.items(): + cnt_task += category_dict['true'] + sum_task += category_dict['false'] + category_dict['true'] + acc = category_dict['true'] / (category_dict['false'] + category_dict['true']) + print(f'-' * 4 + f'\t' + 'Acc ' + '{:.4f}'.format(acc) + f'\t{category.capitalize()}') + + cnt_subtask += cnt_task + sum_subtask += sum_task + if sum_task == 0: + acc_task = 0 + else: + acc_task = cnt_task / sum_task + print(f'*' * 32 + f'Acc' + '{:.4f}'.format(acc_task) + f'\t{task}') + + if sum_subtask == 0: + acc_subtasks = 0 + else: + acc_subtasks = cnt_subtask / sum_subtask + print(f'+' * 16 + f'\t Acc ' + '{:.4f}'.format(acc_subtasks)) + return acc_subtasks + + +def evaluate_chat_model(): + random.seed(args.seed) + + for ds_name in args.datasets: + dataset = MMERealworldDataset( + root=ds_collections[ds_name]['root'], + prompt=prompt, + language=ds_collections[ds_name]['language'], + subtask=args.subtask, + img_root=ds_collections[ds_name]['img_root'], + input_size=image_size, + dynamic_image_size=args.dynamic, + use_thumbnail=use_thumbnail, + max_num=args.max_num + ) + dataloader = torch.utils.data.DataLoader( + dataset=dataset, + sampler=InferenceSampler(len(dataset)), + batch_size=args.batch_size, + num_workers=args.num_workers, + pin_memory=True, + drop_last=False, + collate_fn=partial(collate_fn, tokenizer=tokenizer), + ) + + outputs = [] + for pixel_values, questions, answers, indexes, options, categorys, tasks in tqdm(dataloader): + pixel_values = pixel_values.to(torch.bfloat16).cuda() + generation_config = dict( + num_beams=args.num_beams, + max_new_tokens=ds_collections[ds_name]['max_new_tokens'], + min_new_tokens=ds_collections[ds_name]['min_new_tokens'], + do_sample=True if args.temperature > 0 else False, + 
temperature=args.temperature, + ) + out = model.chat( + tokenizer=tokenizer, + pixel_values=pixel_values, + question=questions[0], + generation_config=generation_config + ) + outs = [out] + preds = [post_process(out, options[0])] + + for question, pred, answer, index, out, category, task in zip(questions, preds, answers, indexes, outs, + categorys, tasks): + outputs.append({ + 'question': question, + 'output': out, + 'answer': pred, + 'gt_answers': answer, + 'index': index, + 'category': category, + 'task': task + }) + + torch.distributed.barrier() + + world_size = torch.distributed.get_world_size() + merged_outputs = [None for _ in range(world_size)] + torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs)) + + merged_outputs = [json.loads(_) for _ in merged_outputs] + merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] + + if torch.distributed.get_rank() == 0: + print(f'Evaluating {ds_name} ...') + time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) + results_file = f'{ds_name}_{args.subtask}_{time_prefix}.json' + output_path = os.path.join(args.out_dir, results_file) + + with open(output_path, 'w') as f: + json.dump(merged_outputs, f, indent=4) + evaluate(merged_outputs) + + print('Results saved to {}'.format(output_path)) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--checkpoint', type=str, default='') + parser.add_argument('--datasets', type=str, default='mmbench_dev_20230712') + parser.add_argument('--subtask', type=str, default='Autonomous_Driving') + parser.add_argument('--batch-size', type=int, default=1) + parser.add_argument('--num-workers', type=int, default=1) + parser.add_argument('--num-beams', type=int, default=5) + parser.add_argument('--temperature', type=float, default=0.0) + parser.add_argument('--out-dir', type=str, default='results') + parser.add_argument('--seed', type=int, default=0) + parser.add_argument('--dynamic', action='store_true') + parser.add_argument('--max-num', type=int, default=6) + parser.add_argument('--load-in-8bit', action='store_true') + parser.add_argument('--load-in-4bit', action='store_true') + parser.add_argument('--auto', action='store_true') + args = parser.parse_args() + + if not os.path.exists(args.out_dir): + os.makedirs(args.out_dir) + + args.datasets = args.datasets.split(',') + print('datasets:', args.datasets) + assert args.batch_size == 1, 'Only batch size 1 is supported' + + torch.distributed.init_process_group( + backend='nccl', + world_size=int(os.getenv('WORLD_SIZE', '1')), + rank=int(os.getenv('RANK', '0')), + ) + + torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) + + if args.auto: + os.environ['CUDA_LAUNCH_BLOCKING'] = '1' + kwargs = {'device_map': 'auto'} if args.auto else {} + tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True, use_fast=False) + model = InternVLChatModel.from_pretrained( + args.checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16, + load_in_8bit=args.load_in_8bit, load_in_4bit=args.load_in_4bit, **kwargs).eval() + if not args.load_in_8bit and not args.load_in_4bit and not args.auto: + model = model.cuda() + image_size = model.config.force_image_size or model.config.vision_config.image_size + use_thumbnail = model.config.use_thumbnail + + total_params = sum(p.numel() for p in model.parameters()) / 1e9 + if total_params > 20 or args.dynamic: + args.num_beams = 1 + print(f'[test] total_params: {total_params}B, use num_beams: {args.num_beams}') + else: + print(f'[test] 
total_params: {total_params}B') + print(f'[test] image_size: {image_size}') + print(f'[test] template: {model.config.template}') + print(f'[test] dynamic_image_size: {args.dynamic}') + print(f'[test] use_thumbnail: {use_thumbnail}') + print(f'[test] max_num: {args.max_num}') + + prompt = { + 'en': 'Select the best answer to the above multiple-choice question based on the image. \ + Respond with only the letter (A, B, C, D, or E) of the correct option. \nThe best answer is:', + 'cn': '根据图像选择上述多项选择题的最佳答案。只需回答正确选项的字母(A, B, C, D 或 E)。\n 最佳答案为:', + } + evaluate_chat_model() diff --git a/internvl_chat/eval/domain_specific/rs_det/caculate.py b/internvl_chat/eval/domain_specific/rs_det/caculate.py index 831610cb..6e115229 100644 --- a/internvl_chat/eval/domain_specific/rs_det/caculate.py +++ b/internvl_chat/eval/domain_specific/rs_det/caculate.py @@ -1,118 +1,122 @@ -import json -import os -import re -import argparse -import torch -from torchvision.ops.boxes import box_area - -def calculate_iou(box1, box2): - x1, y1, x2, y2 = box1 - x3, y3, x4, y4 = box2 - - intersection_x1 = max(x1, x3) - intersection_y1 = max(y1, y3) - intersection_x2 = min(x2, x4) - intersection_y2 = min(y2, y4) - - intersection_area = max(0, intersection_x2 - intersection_x1 + 1) * max( - 0, intersection_y2 - intersection_y1 + 1 - ) - - box1_area = (x2 - x1 + 1) * (y2 - y1 + 1) - box2_area = (x4 - x3 + 1) * (y4 - y3 + 1) - - union_area = box1_area + box2_area - intersection_area - - iou = intersection_area / union_area - - return iou - -def box_iou(boxes1, boxes2): - area1 = box_area(boxes1) - area2 = box_area(boxes2) - - lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] - rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] - - wh = (rb - lt).clamp(min=0) # [N,M,2] - inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] - - union = area1[:, None] + area2 - inter - - iou = inter / union - return iou, union - - -def transform_bbox(bbox,image_size): - x1,y1,x2,y2 = bbox - W,H = image_size - x1 = min(max(x1/1000 * W,0),W) - x2 = min(max(x2/1000 * W,0),W) - y1 = min(max(y1/1000 * H,0),H) - y2 = min(max(y2/1000 * H,0),H) - - return [x1,y1,x2,y2] -def evaluation_metrics(outputs): - - correct=0 - incorrect=0 - pattern = r'\[*\[.*?,.*?,.*?,.*?\]\]*' - # pattern = r'\[*\[(.*?),(.*?),(.*?),(.*?)\]\]*' - # print(outputs) - for output in outputs: - bbox = output['gt_answers'] - image_size = output["image_size"] - pred = output["answer"] - # 查找所有匹配 - matches = re.findall(pattern, pred) - if len(matches) > 1: - print("大于一个匹配") - print(matches) - if len(matches) ==0: - incorrect=incorrect+1 - else: - try: - pred_bbox = json.loads(matches[0]) - pred_bbox = transform_bbox(pred_bbox[0],image_size) - iou_score = calculate_iou(pred_bbox,bbox) - if iou_score > 0.5: - correct=correct+1 - else: - incorrect=incorrect+1 - except Exception as e: - print(e) - print(output) - incorrect=incorrect+1 - - # else: - # continue - print('correct:',correct) - print('incorrect:',incorrect) - print('Total:',correct+incorrect) - print('Acc@0.5:',(correct/(correct+incorrect))) - - return { - 'correct:':correct, - 'incorrect:':incorrect, - 'Total:':correct+incorrect, - 'Acc@0.5:':correct/(correct+incorrect) - } - -if __name__ == "__main__": - - parser = argparse.ArgumentParser() - parser.add_argument('--output_file', type=str, default='') - args = parser.parse_args() - with open(args.output_file,"r") as f: - data= json.load(f) - if "outputs" in data: - data = data["outputs"] - outputs = data - results = evaluation_metrics(outputs) - results_file = 
args.output_file - with open(results_file,"w") as f: - json.dump({ - "results":results, - "outputs":outputs - },f,indent=4) - \ No newline at end of file +import argparse +import json +import os +import re + +import torch +from torchvision.ops.boxes import box_area + + +def calculate_iou(box1, box2): + x1, y1, x2, y2 = box1 + x3, y3, x4, y4 = box2 + + intersection_x1 = max(x1, x3) + intersection_y1 = max(y1, y3) + intersection_x2 = min(x2, x4) + intersection_y2 = min(y2, y4) + + intersection_area = max(0, intersection_x2 - intersection_x1 + 1) * max( + 0, intersection_y2 - intersection_y1 + 1 + ) + + box1_area = (x2 - x1 + 1) * (y2 - y1 + 1) + box2_area = (x4 - x3 + 1) * (y4 - y3 + 1) + + union_area = box1_area + box2_area - intersection_area + + iou = intersection_area / union_area + + return iou + + +def box_iou(boxes1, boxes2): + area1 = box_area(boxes1) + area2 = box_area(boxes2) + + lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2] + rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2] + + wh = (rb - lt).clamp(min=0) # [N,M,2] + inter = wh[:, :, 0] * wh[:, :, 1] # [N,M] + + union = area1[:, None] + area2 - inter + + iou = inter / union + return iou, union + + +def transform_bbox(bbox, image_size): + x1, y1, x2, y2 = bbox + W, H = image_size + x1 = min(max(x1 / 1000 * W, 0), W) + x2 = min(max(x2 / 1000 * W, 0), W) + y1 = min(max(y1 / 1000 * H, 0), H) + y2 = min(max(y2 / 1000 * H, 0), H) + + return [x1, y1, x2, y2] + + +def evaluation_metrics(outputs): + correct = 0 + incorrect = 0 + pattern = r'\[*\[.*?,.*?,.*?,.*?\]\]*' + # pattern = r'\[*\[(.*?),(.*?),(.*?),(.*?)\]\]*' + # print(outputs) + for output in outputs: + bbox = output['gt_answers'] + image_size = output['image_size'] + pred = output['answer'] + # 查找所有匹配 + matches = re.findall(pattern, pred) + if len(matches) > 1: + print('大于一个匹配') + print(matches) + if len(matches) == 0: + incorrect = incorrect + 1 + else: + try: + pred_bbox = json.loads(matches[0]) + pred_bbox = transform_bbox(pred_bbox[0], image_size) + iou_score = calculate_iou(pred_bbox, bbox) + if iou_score > 0.5: + correct = correct + 1 + else: + incorrect = incorrect + 1 + except Exception as e: + print(e) + print(output) + incorrect = incorrect + 1 + + # else: + # continue + print('correct:', correct) + print('incorrect:', incorrect) + print('Total:', correct + incorrect) + print('Acc@0.5:', (correct / (correct + incorrect))) + + return { + 'correct:': correct, + 'incorrect:': incorrect, + 'Total:': correct + incorrect, + 'Acc@0.5:': correct / (correct + incorrect) + } + + +if __name__ == '__main__': + + parser = argparse.ArgumentParser() + parser.add_argument('--output_file', type=str, default='') + args = parser.parse_args() + with open(args.output_file, 'r') as f: + data = json.load(f) + if 'outputs' in data: + data = data['outputs'] + outputs = data + results = evaluation_metrics(outputs) + results_file = args.output_file + with open(results_file, 'w') as f: + json.dump({ + 'results': results, + 'outputs': outputs + }, f, indent=4) diff --git a/internvl_chat/eval/domain_specific/rs_det/evaluate.py b/internvl_chat/eval/domain_specific/rs_det/evaluate.py index b4247338..3076a02b 100644 --- a/internvl_chat/eval/domain_specific/rs_det/evaluate.py +++ b/internvl_chat/eval/domain_specific/rs_det/evaluate.py @@ -1,275 +1,273 @@ -import argparse -import base64 -import itertools -import json -import os -import random -import time -from functools import partial -from io import BytesIO - -import pandas as pd -import torch -from 
internvl.model.internvl_chat import InternVLChatModel -from internvl.train.dataset import build_transform, dynamic_preprocess -from PIL import Image -from torch.utils.data import Dataset -from tqdm import tqdm -from transformers import AutoTokenizer -import math -ds_collections = { - 'DIOR_RSVG': { - 'root': 'InternVL-Domain-Adaptation-Data/val/dior_rsvg_test.json', - 'max_new_tokens':200, - 'min_new_tokens': 1, - 'type': 'test', - 'image_root':"InternVL-Domain-Adaptation-Data/images/" -}, -} - - -def collate_fn(batches, tokenizer): - pixel_values = torch.cat([_['pixel_values'] for _ in batches], dim=0) - questions = [_['question'] for _ in batches] - answers = [_['answer'] for _ in batches] - image_sizes = [_['image_size'] for _ in batches] - - return pixel_values, questions, answers, image_sizes - - -class GroundingDataset(torch.utils.data.Dataset): - - def __init__(self, root, image_root,prompt="", input_size=224, dynamic_image_size=False, - use_thumbnail=False, max_num=6): - - with open(root,"r") as f: - self.ann_data = json.load(f) - self.image_root = image_root - self.input_size = input_size - self.dynamic_image_size = dynamic_image_size - self.use_thumbnail = use_thumbnail - self.max_num = max_num - self.transform = build_transform(is_train=False, input_size=input_size) - self.prompt = prompt - def __len__(self): - return len(self.ann_data) - - def __getitem__(self, idx): - data_item = self.ann_data[idx] - # index = data_item["id"] - image = data_item['image'] - question = self.prompt + data_item['prompt'] - answer = data_item['bbox'] - image_size_ = data_item["size"] - # catetory = self.df.iloc[idx]['category'] - # l2_catetory = self.df.iloc[idx]['l2-category'] - image = Image.open(os.path.join(self.image_root,image)).convert('RGB') - if self.dynamic_image_size: - images = dynamic_preprocess(image, image_size=self.input_size, - use_thumbnail=self.use_thumbnail, - max_num=self.max_num) - else: - images = [image] - pixel_values = [self.transform(image) for image in images] - pixel_values = torch.stack(pixel_values) - - - return { - 'question': question, - 'pixel_values': pixel_values, - 'answer': answer, - "image_size":image_size_ - # 'index': index, - } - -def calculate_iou(box1, box2): - x1, y1, x2, y2 = box1 - x3, y3, x4, y4 = box2 - - intersection_x1 = max(x1, x3) - intersection_y1 = max(y1, y3) - intersection_x2 = min(x2, x4) - intersection_y2 = min(y2, y4) - - intersection_area = max(0, intersection_x2 - intersection_x1 + 1) * max( - 0, intersection_y2 - intersection_y1 + 1 - ) - - box1_area = (x2 - x1 + 1) * (y2 - y1 + 1) - box2_area = (x4 - x3 + 1) * (y4 - y3 + 1) - - union_area = box1_area + box2_area - intersection_area - - iou = intersection_area / union_area - - return iou - - - - -class InferenceSampler(torch.utils.data.sampler.Sampler): - - def __init__(self, size): - self._size = int(size) - assert size > 0 - self._rank = torch.distributed.get_rank() - self._world_size = torch.distributed.get_world_size() - self._local_indices = self._get_local_indices(size, self._world_size, self._rank) - - @staticmethod - def _get_local_indices(total_size, world_size, rank): - shard_size = total_size // world_size - left = total_size % world_size - shard_sizes = [shard_size + int(r < left) for r in range(world_size)] - - begin = sum(shard_sizes[:rank]) - end = min(sum(shard_sizes[:rank + 1]), total_size) - return range(begin, end) - - def __iter__(self): - yield from self._local_indices - - def __len__(self): - return len(self._local_indices) - - - -def evaluate_chat_model(): - 
random.seed(args.seed) - - for ds_name in args.datasets: - dataset = GroundingDataset( - root=ds_collections[ds_name]['root'], - image_root = ds_collections[ds_name]['image_root'], - prompt=prompt_prefix, - - # language=ds_collections[ds_name]['language'], - input_size=image_size, - dynamic_image_size=args.dynamic, - use_thumbnail=use_thumbnail, - max_num=args.max_num - ) - dataloader = torch.utils.data.DataLoader( - dataset=dataset, - sampler=InferenceSampler(len(dataset)), - batch_size=args.batch_size, - num_workers=args.num_workers, - pin_memory=True, - drop_last=False, - collate_fn=partial(collate_fn, tokenizer=tokenizer), - ) - - outputs = [] - for _, (pixel_values, questions, answers, image_sizes) in tqdm(enumerate(dataloader)): - pixel_values = pixel_values.to(torch.bfloat16).cuda() - generation_config = dict( - num_beams=args.num_beams, - max_new_tokens=ds_collections[ds_name]['max_new_tokens'], - min_new_tokens=ds_collections[ds_name]['min_new_tokens'], - do_sample=True if args.temperature > 0 else False, - temperature=args.temperature, - ) - pred= model.chat( - tokenizer=tokenizer, - pixel_values=pixel_values, - question=questions[0], - generation_config=generation_config - ) - preds = [pred] - # preds = [post_process(output)] - - for question, pred, answer, image_size_ in zip(questions, preds, answers, image_sizes): - outputs.append({ - 'question': question, - 'answer': pred, - 'gt_answers': answer, - 'image_size': image_size_ - }) - - torch.distributed.barrier() - - world_size = torch.distributed.get_world_size() - merged_outputs = [None for _ in range(world_size)] - torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs)) - - merged_outputs = [json.loads(_) for _ in merged_outputs] - merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] - - if torch.distributed.get_rank() == 0: - - print(f'Evaluating {ds_name} ...') - time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) - results_file = f'{ds_name}_{time_prefix}.json' - output_path = os.path.join(args.out_dir, results_file) - # results = evaluation_metrics(merged_outputs) - with open(output_path,"w") as f: - json.dump({ - # "results":results, - "outputs":merged_outputs - },f,indent=4) - print('Results saved to {}'.format(output_path)) - cmd = f'python eval/rs_det/caculate.py --output_file {output_path}' - print(cmd) - os.system(cmd) - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--checkpoint', type=str, default='') - parser.add_argument('--datasets', type=str, default='mmbench_dev_20230712') - parser.add_argument('--batch-size', type=int, default=1) - parser.add_argument('--num-workers', type=int, default=1) - parser.add_argument('--num-beams', type=int, default=5) - parser.add_argument('--temperature', type=float, default=0.0) - parser.add_argument('--out-dir', type=str, default='results') - parser.add_argument('--seed', type=int, default=0) - parser.add_argument('--dynamic', action='store_true') - parser.add_argument('--max-num', type=int, default=6) - parser.add_argument('--load-in-8bit', action='store_true') - parser.add_argument('--auto', action='store_true') - args = parser.parse_args() - - if not os.path.exists(args.out_dir): - os.makedirs(args.out_dir) - - args.datasets = args.datasets.split(',') - print('datasets:', args.datasets) - assert args.batch_size == 1, 'Only batch size 1 is supported' - - torch.distributed.init_process_group( - backend='nccl', - world_size=int(os.getenv('WORLD_SIZE', '1')), - rank=int(os.getenv('RANK', 
'0')), - ) - - torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) - - if args.auto: - os.environ['CUDA_LAUNCH_BLOCKING'] = '1' - - kwargs = {'device_map': "auto"} if args.auto else {} - - tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True, use_fast=False) - model = InternVLChatModel.from_pretrained( - args.checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16, - load_in_8bit=args.load_in_8bit, **kwargs).eval() - if not args.load_in_8bit and not args.auto: - model = model.cuda() - image_size = model.config.force_image_size or model.config.vision_config.image_size - use_thumbnail = model.config.use_thumbnail - - total_params = sum(p.numel() for p in model.parameters()) / 1e9 - if total_params > 20 or args.dynamic: - args.num_beams = 1 - print(f'[test] total_params: {total_params}B, use num_beams: {args.num_beams}') - else: - print(f'[test] total_params: {total_params}B') - print(f'[test] image_size: {image_size}') - print(f'[test] template: {model.config.template}') - print(f'[test] dynamic_image_size: {args.dynamic}') - print(f'[test] use_thumbnail: {use_thumbnail}') - print(f'[test] max_num: {args.max_num}') - - prompt_prefix = "Detect " - # prompt_prefix = "Please provide the bounding box coordinate of the region this sentence describes: " - - evaluate_chat_model() +import argparse +import base64 +import itertools +import json +import math +import os +import random +import time +from functools import partial +from io import BytesIO + +import pandas as pd +import torch +from internvl.model.internvl_chat import InternVLChatModel +from internvl.train.dataset import build_transform, dynamic_preprocess +from PIL import Image +from torch.utils.data import Dataset +from tqdm import tqdm +from transformers import AutoTokenizer + +ds_collections = { + 'DIOR_RSVG': { + 'root': 'InternVL-Domain-Adaptation-Data/val/dior_rsvg_test.json', + 'max_new_tokens': 200, + 'min_new_tokens': 1, + 'type': 'test', + 'image_root': 'InternVL-Domain-Adaptation-Data/images/' + }, +} + + +def collate_fn(batches, tokenizer): + pixel_values = torch.cat([_['pixel_values'] for _ in batches], dim=0) + questions = [_['question'] for _ in batches] + answers = [_['answer'] for _ in batches] + image_sizes = [_['image_size'] for _ in batches] + + return pixel_values, questions, answers, image_sizes + + +class GroundingDataset(torch.utils.data.Dataset): + + def __init__(self, root, image_root, prompt='', input_size=224, dynamic_image_size=False, + use_thumbnail=False, max_num=6): + + with open(root, 'r') as f: + self.ann_data = json.load(f) + self.image_root = image_root + self.input_size = input_size + self.dynamic_image_size = dynamic_image_size + self.use_thumbnail = use_thumbnail + self.max_num = max_num + self.transform = build_transform(is_train=False, input_size=input_size) + self.prompt = prompt + + def __len__(self): + return len(self.ann_data) + + def __getitem__(self, idx): + data_item = self.ann_data[idx] + # index = data_item["id"] + image = data_item['image'] + question = self.prompt + data_item['prompt'] + answer = data_item['bbox'] + image_size_ = data_item['size'] + # catetory = self.df.iloc[idx]['category'] + # l2_catetory = self.df.iloc[idx]['l2-category'] + image = Image.open(os.path.join(self.image_root, image)).convert('RGB') + if self.dynamic_image_size: + images = dynamic_preprocess(image, image_size=self.input_size, + use_thumbnail=self.use_thumbnail, + max_num=self.max_num) + else: + images = [image] + pixel_values = [self.transform(image) for image in images] 
+ pixel_values = torch.stack(pixel_values) + + return { + 'question': question, + 'pixel_values': pixel_values, + 'answer': answer, + 'image_size': image_size_ + # 'index': index, + } + + +def calculate_iou(box1, box2): + x1, y1, x2, y2 = box1 + x3, y3, x4, y4 = box2 + + intersection_x1 = max(x1, x3) + intersection_y1 = max(y1, y3) + intersection_x2 = min(x2, x4) + intersection_y2 = min(y2, y4) + + intersection_area = max(0, intersection_x2 - intersection_x1 + 1) * max( + 0, intersection_y2 - intersection_y1 + 1 + ) + + box1_area = (x2 - x1 + 1) * (y2 - y1 + 1) + box2_area = (x4 - x3 + 1) * (y4 - y3 + 1) + + union_area = box1_area + box2_area - intersection_area + + iou = intersection_area / union_area + + return iou + + +class InferenceSampler(torch.utils.data.sampler.Sampler): + + def __init__(self, size): + self._size = int(size) + assert size > 0 + self._rank = torch.distributed.get_rank() + self._world_size = torch.distributed.get_world_size() + self._local_indices = self._get_local_indices(size, self._world_size, self._rank) + + @staticmethod + def _get_local_indices(total_size, world_size, rank): + shard_size = total_size // world_size + left = total_size % world_size + shard_sizes = [shard_size + int(r < left) for r in range(world_size)] + + begin = sum(shard_sizes[:rank]) + end = min(sum(shard_sizes[:rank + 1]), total_size) + return range(begin, end) + + def __iter__(self): + yield from self._local_indices + + def __len__(self): + return len(self._local_indices) + + +def evaluate_chat_model(): + random.seed(args.seed) + + for ds_name in args.datasets: + dataset = GroundingDataset( + root=ds_collections[ds_name]['root'], + image_root=ds_collections[ds_name]['image_root'], + prompt=prompt_prefix, + # language=ds_collections[ds_name]['language'], + input_size=image_size, + dynamic_image_size=args.dynamic, + use_thumbnail=use_thumbnail, + max_num=args.max_num + ) + dataloader = torch.utils.data.DataLoader( + dataset=dataset, + sampler=InferenceSampler(len(dataset)), + batch_size=args.batch_size, + num_workers=args.num_workers, + pin_memory=True, + drop_last=False, + collate_fn=partial(collate_fn, tokenizer=tokenizer), + ) + + outputs = [] + for _, (pixel_values, questions, answers, image_sizes) in tqdm(enumerate(dataloader)): + pixel_values = pixel_values.to(torch.bfloat16).cuda() + generation_config = dict( + num_beams=args.num_beams, + max_new_tokens=ds_collections[ds_name]['max_new_tokens'], + min_new_tokens=ds_collections[ds_name]['min_new_tokens'], + do_sample=True if args.temperature > 0 else False, + temperature=args.temperature, + ) + pred = model.chat( + tokenizer=tokenizer, + pixel_values=pixel_values, + question=questions[0], + generation_config=generation_config + ) + preds = [pred] + # preds = [post_process(output)] + + for question, pred, answer, image_size_ in zip(questions, preds, answers, image_sizes): + outputs.append({ + 'question': question, + 'answer': pred, + 'gt_answers': answer, + 'image_size': image_size_ + }) + + torch.distributed.barrier() + + world_size = torch.distributed.get_world_size() + merged_outputs = [None for _ in range(world_size)] + torch.distributed.all_gather_object(merged_outputs, json.dumps(outputs)) + + merged_outputs = [json.loads(_) for _ in merged_outputs] + merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] + + if torch.distributed.get_rank() == 0: + print(f'Evaluating {ds_name} ...') + time_prefix = time.strftime('%y%m%d%H%M%S', time.localtime()) + results_file = f'{ds_name}_{time_prefix}.json' + output_path = 
os.path.join(args.out_dir, results_file) + # results = evaluation_metrics(merged_outputs) + with open(output_path, 'w') as f: + json.dump({ + # "results":results, + 'outputs': merged_outputs + }, f, indent=4) + print('Results saved to {}'.format(output_path)) + cmd = f'python eval/rs_det/caculate.py --output_file {output_path}' + print(cmd) + os.system(cmd) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--checkpoint', type=str, default='') + parser.add_argument('--datasets', type=str, default='mmbench_dev_20230712') + parser.add_argument('--batch-size', type=int, default=1) + parser.add_argument('--num-workers', type=int, default=1) + parser.add_argument('--num-beams', type=int, default=5) + parser.add_argument('--temperature', type=float, default=0.0) + parser.add_argument('--out-dir', type=str, default='results') + parser.add_argument('--seed', type=int, default=0) + parser.add_argument('--dynamic', action='store_true') + parser.add_argument('--max-num', type=int, default=6) + parser.add_argument('--load-in-8bit', action='store_true') + parser.add_argument('--auto', action='store_true') + args = parser.parse_args() + + if not os.path.exists(args.out_dir): + os.makedirs(args.out_dir) + + args.datasets = args.datasets.split(',') + print('datasets:', args.datasets) + assert args.batch_size == 1, 'Only batch size 1 is supported' + + torch.distributed.init_process_group( + backend='nccl', + world_size=int(os.getenv('WORLD_SIZE', '1')), + rank=int(os.getenv('RANK', '0')), + ) + + torch.cuda.set_device(int(os.getenv('LOCAL_RANK', 0))) + + if args.auto: + os.environ['CUDA_LAUNCH_BLOCKING'] = '1' + + kwargs = {'device_map': 'auto'} if args.auto else {} + + tokenizer = AutoTokenizer.from_pretrained(args.checkpoint, trust_remote_code=True, use_fast=False) + model = InternVLChatModel.from_pretrained( + args.checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16, + load_in_8bit=args.load_in_8bit, **kwargs).eval() + if not args.load_in_8bit and not args.auto: + model = model.cuda() + image_size = model.config.force_image_size or model.config.vision_config.image_size + use_thumbnail = model.config.use_thumbnail + + total_params = sum(p.numel() for p in model.parameters()) / 1e9 + if total_params > 20 or args.dynamic: + args.num_beams = 1 + print(f'[test] total_params: {total_params}B, use num_beams: {args.num_beams}') + else: + print(f'[test] total_params: {total_params}B') + print(f'[test] image_size: {image_size}') + print(f'[test] template: {model.config.template}') + print(f'[test] dynamic_image_size: {args.dynamic}') + print(f'[test] use_thumbnail: {use_thumbnail}') + print(f'[test] max_num: {args.max_num}') + + prompt_prefix = 'Detect ' + # prompt_prefix = "Please provide the bounding box coordinate of the region this sentence describes: " + + evaluate_chat_model() diff --git a/internvl_chat/eval/domain_specific/rs_vqa/evaluate.py b/internvl_chat/eval/domain_specific/rs_vqa/evaluate.py index 7112ed9f..b6db829b 100644 --- a/internvl_chat/eval/domain_specific/rs_vqa/evaluate.py +++ b/internvl_chat/eval/domain_specific/rs_vqa/evaluate.py @@ -50,15 +50,15 @@ def collate_fn(batches, tokenizer): indexes = [_['index'] for _ in batches] question_types = [_['question_type'] for _ in batches] - return pixel_values, questions, answers, indexes,question_types + return pixel_values, questions, answers, indexes, question_types class RSVQADataset(torch.utils.data.Dataset): - def __init__(self, root, prompt,image_root, input_size=224, dynamic_image_size=False, 
+ def __init__(self, root, prompt, image_root, input_size=224, dynamic_image_size=False, use_thumbnail=False, max_num=6): - - with open(root,"r") as f: + + with open(root, 'r') as f: self.ann_data = json.load(f) self.prompt = prompt @@ -74,16 +74,16 @@ def __len__(self): def __getitem__(self, idx): data_item = self.ann_data[idx] - index = data_item["id"] + index = data_item['id'] image = data_item['image'] # print(data_item) # print( self.prompt) - question = data_item['question'] + "\n" + self.prompt + question = data_item['question'] + '\n' + self.prompt answer = data_item['gt_answer'] - question_type = data_item['type'] + question_type = data_item['type'] # catetory = self.df.iloc[idx]['category'] # l2_catetory = self.df.iloc[idx]['l2-category'] - image = Image.open(os.path.join(self.image_root,image)).convert('RGB') + image = Image.open(os.path.join(self.image_root, image)).convert('RGB') if self.dynamic_image_size: images = dynamic_preprocess(image, image_size=self.input_size, use_thumbnail=self.use_thumbnail, @@ -93,43 +93,40 @@ def __getitem__(self, idx): pixel_values = [self.transform(image) for image in images] pixel_values = torch.stack(pixel_values) - return { 'question': question, 'pixel_values': pixel_values, 'answer': answer, 'index': index, - 'question_type':question_type + 'question_type': question_type } def evaluation_metrics(outputs): - - correct=0 - incorrect=0 + correct = 0 + incorrect = 0 for output in outputs: - gt=output['gt_answers'] - answer=output['answer'].split(',')[0].lower().replace('.','') - if gt==answer: - correct=correct+1 + gt = output['gt_answers'] + answer = output['answer'].split(',')[0].lower().replace('.', '') + if gt == answer: + correct = correct + 1 else: - incorrect=incorrect+1 + incorrect = incorrect + 1 # else: # continue - print('correct:',correct) - print('incorrect:',incorrect) - print('Total:',correct+incorrect) - print('Acc:',(correct/(correct+incorrect))) + print('correct:', correct) + print('incorrect:', incorrect) + print('Total:', correct + incorrect) + print('Acc:', (correct / (correct + incorrect))) return { - 'correct:':correct, - 'incorrect:':incorrect, - 'Total:':correct+incorrect, - 'Acc:':correct/(correct+incorrect) + 'correct:': correct, + 'incorrect:': incorrect, + 'Total:': correct + incorrect, + 'Acc:': correct / (correct + incorrect) } - class InferenceSampler(torch.utils.data.sampler.Sampler): def __init__(self, size): @@ -156,7 +153,6 @@ def __len__(self): return len(self._local_indices) - def evaluate_chat_model(): random.seed(args.seed) @@ -191,7 +187,7 @@ def evaluate_chat_model(): do_sample=True if args.temperature > 0 else False, temperature=args.temperature, ) - pred= model.chat( + pred = model.chat( tokenizer=tokenizer, pixel_values=pixel_values, question=questions[0], @@ -200,13 +196,13 @@ def evaluate_chat_model(): preds = [pred] # preds = [post_process(output)] - for question, pred, answer, index,question_type in zip(questions, preds, answers, indexes, question_types): + for question, pred, answer, index, question_type in zip(questions, preds, answers, indexes, question_types): outputs.append({ 'question': question, 'response': pred, 'gt_answer': answer, 'index': int(index), - "question_type":question_type + 'question_type': question_type }) torch.distributed.barrier() @@ -219,19 +215,16 @@ def evaluate_chat_model(): merged_outputs = [_ for _ in itertools.chain.from_iterable(merged_outputs)] if torch.distributed.get_rank() == 0: - print(f'Evaluating {ds_name} ...') time_prefix = time.strftime('%y%m%d%H%M%S', 
time.localtime()) results_file = f'{ds_name}_{time_prefix}.json' output_path = os.path.join(args.out_dir, results_file) - with open(output_path,"w") as f: + with open(output_path, 'w') as f: json.dump(merged_outputs - ,f,indent=4) + , f, indent=4) cmd = f'python eval/rs_vqa/score.py --output_file {output_path}' print(cmd) os.system(cmd) - - if __name__ == '__main__': @@ -289,5 +282,5 @@ def evaluate_chat_model(): print(f'[test] use_thumbnail: {use_thumbnail}') print(f'[test] max_num: {args.max_num}') - prompt = "Answer the question using a single word or phrase." + prompt = 'Answer the question using a single word or phrase.' evaluate_chat_model() diff --git a/internvl_chat/eval/domain_specific/rs_vqa/score.py b/internvl_chat/eval/domain_specific/rs_vqa/score.py index 05685fcc..231418e1 100644 --- a/internvl_chat/eval/domain_specific/rs_vqa/score.py +++ b/internvl_chat/eval/domain_specific/rs_vqa/score.py @@ -20,6 +20,7 @@ def is_correct_count(response, answer): return True return False + def is_correct_area(response, answer): try: response = int(response) if response is not None else 0 @@ -28,6 +29,7 @@ def is_correct_area(response, answer): return False return is_correct_count(response, answer) + def calculate_scores(data): type_counts = {} type_correct = {} @@ -60,33 +62,31 @@ def calculate_scores(data): total_count = sum(type_counts.values()) total_score = round(total_correct / total_count, 4) if total_count > 0 else 0.0 - total_correct_useful = sum([v for k,v in type_correct.items() if k not in ["count","area"] ]) - total_count_useful = sum([v for k,v in type_counts.items() if k not in ["count","area"] ]) + total_correct_useful = sum([v for k, v in type_correct.items() if k not in ['count', 'area']]) + total_count_useful = sum([v for k, v in type_counts.items() if k not in ['count', 'area']]) total_score_useful = round(total_correct_useful / total_count_useful, 4) if total_count_useful > 0 else 0.0 - print(f"{type_scores=}") - print(f"{total_score_useful=}") - return type_scores, total_score,total_score_useful,type_counts + print(f'{type_scores=}') + print(f'{total_score_useful=}') + return type_scores, total_score, total_score_useful, type_counts + if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('--output_file', type=str, default='') args = parser.parse_args() - - with open(args.output_file,"r") as f: - data= json.load(f) - if "outputs" in data: - data = data["outputs"] - type_scores, total_score,total_score_useful,type_counts = calculate_scores(data) + + with open(args.output_file, 'r') as f: + data = json.load(f) + if 'outputs' in data: + data = data['outputs'] + type_scores, total_score, total_score_useful, type_counts = calculate_scores(data) results = { - "type_scores":type_scores, - "type_counts":type_counts, - "total_score":total_score, - "total_score_useful":total_score_useful, - "outputs":data + 'type_scores': type_scores, + 'type_counts': type_counts, + 'total_score': total_score, + 'total_score_useful': total_score_useful, + 'outputs': data } - with open(args.output_file,"w") as f: - json.dump(results,f,indent=4) - - - + with open(args.output_file, 'w') as f: + json.dump(results, f, indent=4) diff --git a/internvl_chat/evaluate.sh b/internvl_chat/evaluate.sh index 4233ca89..9da3e762 100644 --- a/internvl_chat/evaluate.sh +++ b/internvl_chat/evaluate.sh @@ -640,4 +640,4 @@ if [ ${DATASET} == "rsvqa-hr-test2" ]; then --nproc_per_node=${GPUS} \ --master_port=${MASTER_PORT} \ eval/domain_specific/rs_vqa/evaluate.py --checkpoint ${CHECKPOINT} 
--datasets RSVQA_L "${ARGS[@]:2}" -fi \ No newline at end of file +fi diff --git a/internvl_chat/shell/mini_internvl/README.md b/internvl_chat/shell/mini_internvl/README.md index 68f5cc8a..d43b101f 100644 --- a/internvl_chat/shell/mini_internvl/README.md +++ b/internvl_chat/shell/mini_internvl/README.md @@ -1,72 +1,70 @@ -# Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5\% Parameters and 90\% Performance - -## Introduction - -We introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90\% of the performance with only 5\% of the parameters. -This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. - -![internvl 1 5_wwh_33_2](https://github.com/user-attachments/assets/820ed173-4bd1-45a6-95d6-59c1be01d53f) - -- InternViT-300M - -We employ InternViT-300M as our visual encoder, a lightweight vision model that inherits the capabilities of a powerful vision encoder. We directly leverage InternViT-6B that has undergone generative training on diverse datasets to transfer knowledge to a lightweight vision model, CLIP-ViT-L-336px. - -- Adaptation for Mini-InternVL - -To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical images, and remote sensing. We hope to provide insights into the application of MLLMs. - -## Models and Performance - -| Model | MMMU (val)| MathVista (testmini) |AI2D |ChartQA |DocVQA | InfoVQA |OCRBench| MMB-EN | MMB-CN |Avg. Score | -|:--------:|:-----:|:-----:|:-----: |:------:|:------:|:----:|:----:|:------:|----:| :------:| -|Claude3.5-Sonnet|65.9 | 67.7| 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 | 81.7 | -|InternVL2-Llama3-76B|58.2 | 65.5 | 87.6 | 88.4 | 94.1 | 82.0 | 839 | 86.5 | 86.3 | 81.4 | -|Mini-InternVL-1B ([🤗](https://huggingface.co/OpenGVLab/InternVL2-1B))| 36.7 | 37.7 | 64.1 | 72.9 | 81.7 | 50.9 | 754 | 65.4 | 60.7 | 60.6 (74\%) | -|Mini-InternVL-2B ([🤗](https://huggingface.co/OpenGVLab/InternVL2-2B))|36.3| 46.3 | 74.1 | 76.2 | 86.9 | 58.9 | 784 | 73.2 | 70.9 | 66.8 (82\%)| -|Mini-InternVL-4B ([🤗](https://huggingface.co/OpenGVLab/InternVL2-4B))|48.3 | 58.6 | 78.9 | 81.5 | 89.2 | 67.0 | 788 | 78.6 | 73.9 | 72.8 (90\%) | - -- We evaluate models using InternVL and VLMEvalKit repositories. AI2D, ChartQA, DocVQA, InfoVQA, and MMBench are tested with InternVL, while MathVistaand OCRBench use VLMEvalKit. For MMMU, we report scores from OpenCompass leaderboard. - -- The Avg. Score is the average of the scores from all tested benchmarks, with the OCR-Bench score divided by 10. The values in parentheses represent the relative parameters and performance of Mini-InternVL compared to *InternVL2-Llama3-76B*, which is considered as 100\%. - -## Domain Adaptation - -Visual tasks (*e.g.* Image classification, region perception, multi-view images tasks, video related tasks and visual grounding) can be formulated into VQA format. - -![framework_03_2](https://github.com/user-attachments/assets/63bffb31-cf05-4f52-a679-4700650d0c37) - - -In the [document](https://internvl.readthedocs.io/en/latest/internvl2.0/domain_adaptation.html), we provide detailed information on the datasets and the fine-tuning process. 
-
-## Citation
-If you find this project useful in your research, please consider citing:
-
-```BibTeX
-@article{chen2023internvl,
-  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
-  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
-  journal={arXiv preprint arXiv:2312.14238},
-  year={2023}
-}
-@article{chen2024far,
-  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
-  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
-  journal={arXiv preprint arXiv:2404.16821},
-  year={2024}
-}
-@article{gao2024mini,
-  title={Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5\% Parameters and 90\% Performance},
-  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
-  journal={arXiv preprint arXiv:2410.16261},
-  year={2024}
-}
-```
-
-
-## Acknowledgements
-
-[DriveGPT4](https://tonyxuqaq.github.io/projects/DriveGPT4/),
-[GeoChat](https://github.com/mbzuai-oryx/GeoChat),
-[SkySenseGPT](https://github.com/Luo-Z13/SkySenseGPT),
-[DriveLM](https://github.com/OpenDriveLab/DriveLM)
-
+# Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
+
+## Introduction
+
+We introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieve 90% of the performance of InternVL2-Llama3-76B with only 5% of its parameters.
+This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios.
+
+![internvl 1 5_wwh_33_2](https://github.com/user-attachments/assets/820ed173-4bd1-45a6-95d6-59c1be01d53f)
+
+- InternViT-300M
+
+We employ InternViT-300M as our visual encoder, a lightweight vision model that inherits the capabilities of a much stronger one: knowledge from InternViT-6B, which has undergone generative training on diverse datasets, is transferred into the lightweight CLIP-ViT-L-336px to obtain InternViT-300M.
+
+- Adaptation for Mini-InternVL
+
+To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to and outperform specialized models on downstream tasks, including autonomous driving, medical imaging, and remote sensing. We hope this provides insights into the application of MLLMs.
+
+## Models and Performance
+
+| Model | MMMU (val) | MathVista (testmini) | AI2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMB-EN | MMB-CN | Avg. Score |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Claude3.5-Sonnet | 65.9 | 67.7 | 94.7 | 90.8 | 95.2 | - | 788 | 79.7 | 80.7 | 81.7 |
+| InternVL2-Llama3-76B | 58.2 | 65.5 | 87.6 | 88.4 | 94.1 | 82.0 | 839 | 86.5 | 86.3 | 81.4 |
+| Mini-InternVL-1B ([🤗](https://huggingface.co/OpenGVLab/InternVL2-1B)) | 36.7 | 37.7 | 64.1 | 72.9 | 81.7 | 50.9 | 754 | 65.4 | 60.7 | 60.6 (74%) |
+| Mini-InternVL-2B ([🤗](https://huggingface.co/OpenGVLab/InternVL2-2B)) | 36.3 | 46.3 | 74.1 | 76.2 | 86.9 | 58.9 | 784 | 73.2 | 70.9 | 66.8 (82%) |
+| Mini-InternVL-4B ([🤗](https://huggingface.co/OpenGVLab/InternVL2-4B)) | 48.3 | 58.6 | 78.9 | 81.5 | 89.2 | 67.0 | 788 | 78.6 | 73.9 | 72.8 (90%) |
+
+- We evaluate the models using the InternVL and VLMEvalKit repositories. AI2D, ChartQA, DocVQA, InfoVQA, and MMBench are tested with InternVL, while MathVista and OCRBench use VLMEvalKit. For MMMU, we report scores from the OpenCompass leaderboard.
+
+- The Avg. Score is the average of the scores on all tested benchmarks, with the OCRBench score divided by 10. The values in parentheses give the parameters and performance of Mini-InternVL relative to *InternVL2-Llama3-76B*, which is taken as 100%.
+
+## Domain Adaptation
+
+Visual tasks (*e.g.*, image classification, region perception, multi-view image tasks, video-related tasks, and visual grounding) can be formulated in a VQA format.
+
+![framework_03_2](https://github.com/user-attachments/assets/63bffb31-cf05-4f52-a679-4700650d0c37)
+
+In the [document](https://internvl.readthedocs.io/en/latest/internvl2.0/domain_adaptation.html), we provide detailed information on the datasets and the fine-tuning process.
+
+## Citation
+
+If you find this project useful in your research, please consider citing:
+
+```BibTeX
+@article{chen2023internvl,
+  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
+  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
+  journal={arXiv preprint arXiv:2312.14238},
+  year={2023}
+}
+@article{chen2024far,
+  title={How Far Are We to GPT-4V?
Closing the Gap to Commercial Multimodal Models with Open-Source Suites}, + author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others}, + journal={arXiv preprint arXiv:2404.16821}, + year={2024} +} +@article{gao2024mini, + title={Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5\% Parameters and 90\% Performance}, + author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others}, + journal={arXiv preprint arXiv:2410.16261}, + year={2024} +} +``` + +## Acknowledgements + +[DriveGPT4](https://tonyxuqaq.github.io/projects/DriveGPT4/), +[GeoChat](https://github.com/mbzuai-oryx/GeoChat), +[SkySenseGPT](https://github.com/Luo-Z13/SkySenseGPT), +[DriveLM](https://github.com/OpenDriveLab/DriveLM) diff --git a/internvl_chat/tools/images_stitching.py b/internvl_chat/tools/images_stitching.py index 017b46f7..fab0b063 100644 --- a/internvl_chat/tools/images_stitching.py +++ b/internvl_chat/tools/images_stitching.py @@ -1,81 +1,79 @@ -import os -from PIL import Image, ImageDraw, ImageFont -import json -from tqdm import tqdm -import argparse - - -FOOT = ImageFont.truetype("/usr/share/fonts/dejavu/DejaVuSans-Bold.ttf", 50) - -def custom_image(img_paths,save_path,image_size=448): - captions=["CAM_FRONT_LEFT","CAM_FRONT","CAM_FRONT_RIGHT","CAM_BACK_LEFT","CAM_BACK","CAM_BACK_RIGHT"] - - width = image_size*2 - height = image_size - # count = 0 - all_images = {} - for image_id,image_files in tqdm(img_paths.items()): - all_images[image_id] = dict() - all_images[image_id]["images_path"] = image_files - all_images[image_id]["images_size"] = {k:(0,0) for k in image_files.keys()} - imgs = {} - for caption, image_file in image_files.items(): - image_path = os.path.join(args.data_root, image_file.replace("../nuscenes/samples/","/nuscenes/samples/")) - img = Image.open(image_path).convert('RGB') - old_wide,old_height = img.size - all_images[image_id]["images_size"][caption] = (old_wide,old_height) - img=img.resize((width, height)) - - draw = ImageDraw.Draw(img) - text = caption - draw.text((0,0), text, fill=(255, 0, 255), font=FOOT) - imgs[caption] = img - - - result_width = width * 3 - result_height = height * 2 - result_img = Image.new('RGB', (result_width, result_height)) - - imgs = [imgs[caption] for caption in captions] - for i in range(len(imgs)): - row = i // 3 - col = i % 3 - - left = col * width - top = row * height - right = left + width - bottom = top + height - result_img.paste(imgs[i], (left,top)) - - result_path = os.path.join(save_path,image_id + ".jpg") - result_img.save(result_path) - -def get_images(ann_file): - with open(ann_file, 'r') as f :#, \ - train_file = json.load(f) - - images = {} - for scene_id in train_file.keys(): - scene_data = train_file[scene_id]['key_frames'] - for frame_id in scene_data.keys(): - image_id = scene_id + "_" + frame_id - if image_id not in images: - images[image_id] = scene_data[frame_id]['image_paths'] - else: - print(image_id) - - return images - - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--data-root', type=str, default="InternVL-Domain-Adaptation-Data/images/drivelm") - parser.add_argument('--ann-file', type=str, default="path/to/v1_1_val_nus_q_only.json") - args = parser.parse_args() - images = get_images(args.ann_file) - save_path = os.path.join(args.data_root,"stitch") - 
os.makedirs(save_path,exist_ok=True) - custom_image(img_paths=images,save_path=save_path) - - - +import argparse +import json +import os + +from PIL import Image, ImageDraw, ImageFont +from tqdm import tqdm + +FOOT = ImageFont.truetype('/usr/share/fonts/dejavu/DejaVuSans-Bold.ttf', 50) + + +def custom_image(img_paths, save_path, image_size=448): + captions = ['CAM_FRONT_LEFT', 'CAM_FRONT', 'CAM_FRONT_RIGHT', 'CAM_BACK_LEFT', 'CAM_BACK', 'CAM_BACK_RIGHT'] + + width = image_size * 2 + height = image_size + # count = 0 + all_images = {} + for image_id, image_files in tqdm(img_paths.items()): + all_images[image_id] = dict() + all_images[image_id]['images_path'] = image_files + all_images[image_id]['images_size'] = {k: (0, 0) for k in image_files.keys()} + imgs = {} + for caption, image_file in image_files.items(): + image_path = os.path.join(args.data_root, image_file.replace('../nuscenes/samples/', '/nuscenes/samples/')) + img = Image.open(image_path).convert('RGB') + old_wide, old_height = img.size + all_images[image_id]['images_size'][caption] = (old_wide, old_height) + img = img.resize((width, height)) + + draw = ImageDraw.Draw(img) + text = caption + draw.text((0, 0), text, fill=(255, 0, 255), font=FOOT) + imgs[caption] = img + + result_width = width * 3 + result_height = height * 2 + result_img = Image.new('RGB', (result_width, result_height)) + + imgs = [imgs[caption] for caption in captions] + for i in range(len(imgs)): + row = i // 3 + col = i % 3 + + left = col * width + top = row * height + right = left + width + bottom = top + height + result_img.paste(imgs[i], (left, top)) + + result_path = os.path.join(save_path, image_id + '.jpg') + result_img.save(result_path) + + +def get_images(ann_file): + with open(ann_file, 'r') as f: # , \ + train_file = json.load(f) + + images = {} + for scene_id in train_file.keys(): + scene_data = train_file[scene_id]['key_frames'] + for frame_id in scene_data.keys(): + image_id = scene_id + '_' + frame_id + if image_id not in images: + images[image_id] = scene_data[frame_id]['image_paths'] + else: + print(image_id) + + return images + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument('--data-root', type=str, default='InternVL-Domain-Adaptation-Data/images/drivelm') + parser.add_argument('--ann-file', type=str, default='path/to/v1_1_val_nus_q_only.json') + args = parser.parse_args() + images = get_images(args.ann_file) + save_path = os.path.join(args.data_root, 'stitch') + os.makedirs(save_path, exist_ok=True) + custom_image(img_paths=images, save_path=save_path) diff --git a/streamlit_demo/app.py b/streamlit_demo/app.py index 04d519ee..4fc9a5a7 100644 --- a/streamlit_demo/app.py +++ b/streamlit_demo/app.py @@ -49,7 +49,7 @@ def get_model_list(): assert ret.status_code == 200 ret = requests.post(controller_url + '/list_models') models = ret.json()['models'] - models = [item for item in models if 'InternVL2-Det' not in item] + models = [item for item in models if 'InternVL2-Det' not in item and 'InternVL2-Gen' not in item] return models
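
For reference, the remote-sensing detection evaluation patched above (`eval/domain_specific/rs_det/evaluate.py` together with `caculate.py`) parses a bounding box from the model's free-form answer, rescales it from the 0-1000 coordinate space to the actual image resolution, and counts a prediction as correct when its IoU with the ground-truth box exceeds 0.5 (Acc@0.5). The snippet below is a minimal, self-contained sketch of that scoring step, not part of the patch; the answer string, box values, and image size are made-up examples.

```python
# Sketch (not part of the patch): the Acc@0.5 scoring step used by
# eval/domain_specific/rs_det/caculate.py. The model is assumed to answer with
# a box in [[x1, y1, x2, y2]] form on a 0-1000 scale.
import json
import re


def rescale(box, image_size):
    # Map a 0-1000 normalized box onto the actual image width/height.
    x1, y1, x2, y2 = box
    w, h = image_size
    return [min(max(v / 1000 * s, 0), s) for v, s in zip((x1, y1, x2, y2), (w, h, w, h))]


def iou(box1, box2):
    # Plain intersection-over-union of two [x1, y1, x2, y2] boxes.
    ax1, ay1, ax2, ay2 = box1
    bx1, by1, bx2, by2 = box2
    iw = max(0, min(ax2, bx2) - max(ax1, bx1) + 1)
    ih = max(0, min(ay2, by2) - max(ay1, by1) + 1)
    inter = iw * ih
    union = (ax2 - ax1 + 1) * (ay2 - ay1 + 1) + (bx2 - bx1 + 1) * (by2 - by1 + 1) - inter
    return inter / union


answer = 'The object is at [[120, 250, 480, 610]].'   # hypothetical model output
gt_box = [96, 200, 384, 488]                          # ground-truth box in pixels
image_size = (800, 800)                               # (width, height)

# Same regex as caculate.py: grab the first [[x1, y1, x2, y2]] group in the answer.
match = re.findall(r'\[*\[.*?,.*?,.*?,.*?\]\]*', answer)[0]
pred_box = rescale(json.loads(match)[0], image_size)
print('hit at IoU 0.5:', iou(pred_box, gt_box) > 0.5)
```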