[CVPR 2023] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

toilaluan/InternImage

InternImage

This repository is an official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

Paper | Blog in Chinese

News

  • Feb 28, 2023: InternImage is accepted to CVPR 2023!
  • Nov 18, 2022: 🚀 InternImage-XL merged into BEVFormer v2 achieves state-of-the-art performance of 63.4 NDS on nuScenes Camera Only.
  • Nov 10, 2022: 🚀🚀 InternImage-H achieves a new record 65.4 mAP on COCO detection test-dev and 62.9 mIoU on ADE20K, outperforming previous models by a large margin.

Coming soon

  • InternImage-H(1B)/G(3B)
  • Other downstream tasks.
  • TensorRT inference.
  • Classification code of the InternImage series.
  • InternImage-T/S/B/L/XL ImageNet-1k pretrained model.
  • InternImage-L/XL ImageNet-22k pretrained model.
  • InternImage-T/S/B/L/XL detection and instance segmentation model.
  • InternImage-T/S/B/L/XL semantic segmentation model.

Introduction

InternImage, initially described on arXiv, is designed as a general backbone for computer vision. It takes deformable convolution as its core operator to obtain a large effective receptive field, and it introduces adaptive spatial aggregation to reduce the strict inductive bias. This design makes it possible to learn stronger and more robust models with large-scale parameters from massive data.
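To make the idea of deformable sampling with adaptive aggregation more concrete, here is a minimal pure-Python sketch for a single output location on a single-channel feature map. It is an illustration under stated assumptions, not the repository's CUDA operator. The 3x3 base grid, the softmax normalization of the aggregation weights, and the names `bilinear_sample` and `deformable_aggregate` are all chosen here for illustration. The learned projection weights and channel grouping of a real layer are omitted.

```python
import math

# Regular 3x3 sampling grid around the output location (assumed kernel size).
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at a fractional (y, x); zero outside."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_aggregate(feature, center, offsets, mask_logits):
    """Compute one output value.

    offsets:     nine (dy, dx) shifts, one per grid point; in a real layer
                 these would be predicted from the input (assumption).
    mask_logits: nine scores turned into aggregation weights via softmax
                 (assumption made for this sketch).
    """
    cy, cx = center
    weights = softmax(mask_logits)
    out = 0.0
    for (gy, gx), (oy, ox), wgt in zip(GRID, offsets, weights):
        out += wgt * bilinear_sample(feature, cy + gy + oy, cx + gx + ox)
    return out


if __name__ == "__main__":
    feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    no_shift = [(0.0, 0.0)] * 9
    print(deformable_aggregate(feature, (1, 1), no_shift, [0.0] * 9))
```

With zero offsets and equal logits the function reduces to a plain 3x3 average. Non-zero offsets move the sampling points away from the fixed grid, and the logits reweight them per location, which is the sense in which the aggregation adapts to the input.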

Main Results on ImageNet with Pretrained Models

ImageNet-1K and ImageNet-22K Pretrained InternImage Models

name            pretrain      resolution  acc@1  #param  FLOPs  22K model  1K model
InternImage-T   ImageNet-1K   224x224     83.5   30M     5G     -          ckpt | cfg
InternImage-S   ImageNet-1K   224x224     84.2   50M     8G     -          ckpt | cfg
InternImage-B   ImageNet-1K   224x224     84.9   97M     16G    -          ckpt | cfg
InternImage-L   ImageNet-22K  384x384     87.7   223M    108G   ckpt       ckpt | cfg
InternImage-XL  ImageNet-22K  384x384     88.0   335M    163G   ckpt       ckpt | cfg

Main Results on Downstream Tasks

COCO Object Detection

backbone        method      schd  box mAP  mask mAP  #param  FLOPs  Download
InternImage-T   Mask R-CNN  1x    47.2     42.5      49M     270G   ckpt | cfg
InternImage-T   Mask R-CNN  3x    49.1     43.7      49M     270G   ckpt | cfg
InternImage-S   Mask R-CNN  1x    47.8     43.3      69M     340G   ckpt | cfg
InternImage-S   Mask R-CNN  3x    49.7     44.5      69M     340G   ckpt | cfg
InternImage-B   Mask R-CNN  1x    48.8     44.0      115M    501G   ckpt | cfg
InternImage-B   Mask R-CNN  3x    50.3     44.8      115M    501G   ckpt | cfg
InternImage-L   Cascade     1x    54.9     47.7      277M    1399G  ckpt | cfg
InternImage-L   Cascade     3x    56.1     48.5      277M    1399G  ckpt | cfg
InternImage-XL  Cascade     1x    55.3     48.1      387M    1782G  ckpt | cfg
InternImage-XL  Cascade     3x    56.2     48.8      387M    1782G  ckpt | cfg

ADE20K Semantic Segmentation

backbone        resolution  single scale  multi scale  #param  FLOPs  Download
InternImage-T   512x512     47.9          48.1         59M     944G   ckpt | cfg
InternImage-S   512x512     50.1          50.9         80M     1017G  ckpt | cfg
InternImage-B   512x512     50.8          51.3         128M    1185G  ckpt | cfg
InternImage-L   640x640     53.9          54.1         256M    2526G  ckpt | cfg
InternImage-XL  640x640     55.0          55.3         368M    3142G  ckpt | cfg
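The News section quotes ADE20K results as mIoU, so the single-scale and multi-scale columns above are presumably mIoU as well. For readers unfamiliar with the metric, the following is a minimal sketch of mean intersection-over-union on flat lists of class indices. It is a simplification, not the evaluation code used for these numbers: the function name `mean_iou` is hypothetical, classes absent from both prediction and ground truth are skipped, and ignore labels and dataset-wide accumulation are left out.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class indices,
                  one entry per pixel.
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Four pixels, two classes: one pixel of class 1 is mislabeled as class 0.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Reported benchmark values are usually this quantity multiplied by 100.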

Main Results of FPS

name            resolution  #params  FLOPs  Batch 1 FPS (TensorRT)
InternImage-T   224x224     30M      5G     156
InternImage-S   224x224     50M      8G     129
InternImage-B   224x224     97M      16G    116
InternImage-L   384x384     223M     108G   56
InternImage-XL  384x384     335M     163G   47
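Since the FPS figures are measured at batch size 1, they can be read as per-image latency by taking the reciprocal. The short sketch below does that arithmetic for the values in the table. The helper name `latency_ms` is introduced here for illustration, and the result is only an average implied by the throughput, not a separately measured latency.

```python
# Batch-1 TensorRT throughput in frames per second, copied from the table above.
FPS = {
    "InternImage-T": 156,
    "InternImage-S": 129,
    "InternImage-B": 116,
    "InternImage-L": 56,
    "InternImage-XL": 47,
}


def latency_ms(fps):
    """Average time per image in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


if __name__ == "__main__":
    for name, fps in FPS.items():
        print(f"{name}: {latency_ms(fps):.1f} ms per image")
```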

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{wang2022internimage,
  title={InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2211.05778},
  year={2022}
}
