We currently have a large backlog of issues. Our team will review and resolve them one by one; please be patient.

INTERN-2.5 - Multimodal Multitask General Large Model

This repository is the official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

[Paper] [Zhihu column]

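Highlights

👍 The strongest general-purpose vision backbone, with up to 3 billion parameters
🏆 90.1% Top-1 accuracy on the ImageNet image classification benchmark, the highest among open-source models
🏆 65.5 mAP on the COCO object detection benchmark, the only model to exceed 65 mAP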

Applications in Challenges

2022 Waymo 3D Camera-Only Detection Challenge: BEVFormer++, built on INTERN-2.5, won first place in the track
nuScenes 3D detection task: BEVFormer v2 achieves SOTA performance (64.8 NDS) on nuScenes camera-only detection
CVPR 2023 Workshop End-to-End Autonomous Driving: InternImage served as a baseline for the 3D Occupancy Prediction Challenge and the OpenLane Topology Challenge

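Features

Various downstream tasks
Support for the CVPR 2023 Workshop on End-to-End Autonomous Driving (see linked details)
Support for Segment Anything
Support for extracting intermediate-layer features (see linked details)
Support for low-cost training with DeepSpeed (see linked details)
Precompiled .whl packages for the DCNv3 operator (see linked details)
InternImage-H(1B)/G(3B)
TensorRT inference for classification, detection, and segmentation
Classification code for the InternImage series
InternImage-T/S/B/L/XL ImageNet-1K pretrained models
InternImage-L/XL ImageNet-22K pretrained models
InternImage-T/S/B/L/XL detection and instance segmentation models
InternImage-T/S/B/L/XL semantic segmentation models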

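Introduction

INTERN-2.5 is a multimodal multitask general large model jointly released by SenseTime and Shanghai AI Laboratory. It comprises the large-scale vision foundation model "InternImage", the pre-training algorithm "M3I-Pretraining", the "Uni-Perceiver" series of generic decoders, and the "BEVFormer" series of generic encoders for autonomous driving perception.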

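Applications of INTERN-2.5

1. Image modality task performance

On the ImageNet image classification benchmark, INTERN-2.5 reaches 90.1% Top-1 accuracy using only public data. Excluding two non-public models from Google and Microsoft that rely on additional datasets, it is the only model with accuracy above 90.0%. It is also the most accurate and the largest open-source model on ImageNet.
On the COCO object detection benchmark, INTERN-2.5 achieves 65.5 mAP, making it the only model to exceed 65 mAP.
On 16 other important vision benchmark datasets, covering classification, detection, and segmentation, it achieves the best reported performance.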

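Classification tasks

| Task | Dataset | Result |
| --- | --- | --- |
| Image classification | ImageNet | 90.1 |
| Scene classification | Places365 | 61.2 |
| Scene classification | Places 205 | 71.7 |
| Long-tail classification | iNaturalist 2018 | 92.3 |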

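Detection tasks

| Task | Dataset | Result |
| --- | --- | --- |
| Conventional object detection | COCO | 65.5 |
| Conventional object detection | VOC 2007 | 94.0 |
| Conventional object detection | VOC 2012 | 97.2 |
| Conventional object detection | OpenImage | 74.1 |
| Long-tail object detection | LVIS minival | 65.8 |
| Long-tail object detection | LVIS val | 63.2 |
| Autonomous driving object detection | BDD100K | 38.8 |
| Autonomous driving object detection | nuScenes | 64.8 |
| Dense object detection | CrowdHuman | 97.2 |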

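Segmentation tasks

| Task | Dataset | Result |
| --- | --- | --- |
| Semantic segmentation | ADE20K | 62.9 |
| Semantic segmentation | COCO Stuff-10K | 59.6 |
| Semantic segmentation | Pascal Context | 70.3 |
| Street scene segmentation | CityScapes | 86.1 |
| RGBD segmentation | NYU Depth V2 | 69.7 |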

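2. Image-text cross-modal task performance

Image-text retrieval

INTERN-2.5 can quickly locate and retrieve the images most semantically relevant to a text query. This capability applies to both videos and image collections and can be further combined with object detection boxes. It supports a wide range of uses that help users find the images they need more quickly and conveniently, for example returning the photos in an album that match a given text description.

Image-to-text generation

The image-to-text capability of INTERN-2.5 shows strong understanding across image captioning, visual question answering, visual reasoning, and text recognition. In autonomous driving, for example, it can improve scene perception and understanding by helping the vehicle determine traffic light states, road signs, and similar information, which provides effective perception input for decision-making and planning.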

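Image-text multimodal tasks

| Task | Dataset | Result |
| --- | --- | --- |
| Image captioning | COCO Caption | 148.2 |
| Fine-tuned image-text retrieval | COCO Caption | 76.4 |
| Fine-tuned image-text retrieval | Flickr30k | 94.8 |
| Zero-shot image-text retrieval | Flickr30k | 89.1 |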

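Pretrained models

Open-source vision pretrained models

| name | pretrain | pre-training resolution | #param | download |
| --- | --- | --- | --- | --- |
| InternImage-L | ImageNet-22K | 384x384 | 223M | ckpt |
| InternImage-XL | ImageNet-22K | 384x384 | 335M | ckpt |
| InternImage-H | Joint 427M | 384x384 | 1.08B | ckpt |
| InternImage-G | - | 384x384 | 3B | ckpt |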

ImageNet-1K image classification

| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| --- | --- | --- | --- | --- | --- | --- |
| InternImage-T | ImageNet-1K | 224x224 | 83.5 | 30M | 5G | ckpt \| cfg |
| InternImage-S | ImageNet-1K | 224x224 | 84.2 | 50M | 8G | ckpt \| cfg |
| InternImage-B | ImageNet-1K | 224x224 | 84.9 | 97M | 16G | ckpt \| cfg |
| InternImage-L | ImageNet-22K | 384x384 | 87.7 | 223M | 108G | ckpt \| cfg |
| InternImage-XL | ImageNet-22K | 384x384 | 88.0 | 335M | 163G | ckpt \| cfg |
| InternImage-H | Joint 427M | 640x640 | 89.6 | 1.08B | 1478G | ckpt \| cfg |
| InternImage-G | - | 512x512 | 90.1 | 3B | 2700G | ckpt \| cfg |

COCO object detection and instance segmentation

| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | ckpt \| cfg |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | ckpt \| cfg |
| InternImage-L | Cascade | 1x | 54.9 | 47.7 | 277M | 1399G | ckpt \| cfg |
| InternImage-L | Cascade | 3x | 56.1 | 48.5 | 277M | 1399G | ckpt \| cfg |
| InternImage-XL | Cascade | 1x | 55.3 | 48.1 | 387M | 1782G | ckpt \| cfg |
| InternImage-XL | Cascade | 3x | 56.2 | 48.8 | 387M | 1782G | ckpt \| cfg |

| backbone | method | box mAP (val/test) | #param | FLOPs | download |
| --- | --- | --- | --- | --- | --- |
| InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | TODO | TODO |
| InternImage-G | DINO (TTA) | 65.3 / 65.5 | 3B | TODO | TODO |

ADE20K semantic segmentation

| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
| --- | --- | --- | --- | --- | --- | --- |
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | ckpt \| cfg |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | ckpt \| cfg |
| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | ckpt \| cfg |
| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | ckpt \| cfg |
| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | ckpt \| cfg |
| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | ckpt \| cfg |
| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | ckpt \| cfg |

Model inference speed

Export classification model from PyTorch to TensorRT

Export detection model from PyTorch to TensorRT

Export segmentation model from PyTorch to TensorRT

| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| --- | --- | --- | --- | --- |
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
| InternImage-B | 224x224 | 97M | 16G | 116 |
| InternImage-L | 384x384 | 223M | 108G | 56 |
| InternImage-XL | 384x384 | 335M | 163G | 47 |
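
The FPS column can also be read as per-image latency: at batch size 1, latency in milliseconds is 1000 divided by FPS. The sketch below shows that arithmetic, assuming the reported FPS counts images per second. `fps_to_ms` is a hypothetical helper name and is not part of InternImage or MMDeploy.

```shell
# Hypothetical helper: convert a batch-1 throughput (FPS) to milliseconds per image.
# Returns a non-zero status when the FPS value is not positive.
fps_to_ms() {
  awk -v fps="$1" 'BEGIN { if (fps <= 0) exit 1; printf "%.2f\n", 1000 / fps }'
}

fps_to_ms 156   # InternImage-T: prints 6.41
fps_to_ms 47    # InternImage-XL: prints 21.28
```

These values are derived from the table only; they are not separate measurements.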

Before using mmdeploy to convert PyTorch models to TensorRT, make sure the DCNv3 custom operator has been compiled correctly. It can be installed as follows:

export MMDEPLOY_DIR=/the/root/path/of/MMDeploy

# prepare our custom ops, you can find it at InternImage/tensorrt/modulated_deform_conv_v3
cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt

# build custom ops
cd ${MMDEPLOY_DIR}
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# install the mmdeploy after building custom ops
cd ${MMDEPLOY_DIR}
pip install -e .

For more details on compiling custom operators with mmdeploy, refer to this document.
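
The cmake command above expands `${TENSORRT_DIR}` and `${CUDNN_DIR}`, which the snippet itself never sets, so a missing variable only surfaces as a configuration error. A small preflight check can catch this earlier. The function name `check_build_env` is hypothetical and is not part of MMDeploy or InternImage. It only checks that each variable names an existing directory, not that the directory holds a valid installation.

```shell
# Hypothetical preflight check: every directory variable used by the build
# steps must be set and must point to an existing directory.
check_build_env() {
  status=0
  for name in MMDEPLOY_DIR TENSORRT_DIR CUDNN_DIR; do
    # Read the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "error: $name=$value is not a directory" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running `check_build_env || exit 1` before the `cmake` step stops the build early and reports every variable that still needs to be set.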

Citation

If INTERN-2.5 helps your research, please cite our work using the BibTeX entries below.

@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}

@inproceedings{zhu2022uni,
  title={Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks},
  author={Zhu, Xizhou and Zhu, Jinguo and Li, Hao and Wu, Xiaoshi and Li, Hongsheng and Wang, Xiaohua and Dai, Jifeng},
  booktitle={CVPR},
  pages={16804--16815},
  year={2022}
}

@article{zhu2022unimoe,
  title={Uni-perceiver-moe: Learning sparse generalist models with conditional moes},
  author={Zhu, Jinguo and Zhu, Xizhou and Wang, Wenhai and Wang, Xiaohua and Li, Hongsheng and Wang, Xiaogang and Dai, Jifeng},
  journal={arXiv preprint arXiv:2206.04674},
  year={2022}
}

@article{li2022uni,
  title={Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks},
  author={Li, Hao and Zhu, Jinguo and Jiang, Xiaohu and Zhu, Xizhou and Li, Hongsheng and Yuan, Chun and Wang, Xiaohua and Qiao, Yu and Wang, Xiaogang and Wang, Wenhai and others},
  journal={arXiv preprint arXiv:2211.09808},
  year={2022}
}

@article{yang2022bevformer,
  title={BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision},
  author={Yang, Chenyu and Chen, Yuntao and Tian, Hao and Tao, Chenxin and Zhu, Xizhou and Zhang, Zhaoxiang and Huang, Gao and Li, Hongyang and Qiao, Yu and Lu, Lewei and others},
  journal={arXiv preprint arXiv:2211.10439},
  year={2022}
}

@article{su2022towards,
  title={Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information},
  author={Su, Weijie and Zhu, Xizhou and Tao, Chenxin and Lu, Lewei and Li, Bin and Huang, Gao and Qiao, Yu and Wang, Xiaogang and Zhou, Jie and Dai, Jifeng},
  journal={arXiv preprint arXiv:2211.09807},
  year={2022}
}

@inproceedings{li2022bevformer,
  title={Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers},
  author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
  booktitle={ECCV},
  pages={1--18},
  year={2022},
}
