
Vision-Language Object Detection and Visual Question Answering

This repository combines Microsoft's GLIP and Salesforce's BLIP in a single Gradio demo for text-prompt-based object detection and Visual Question Answering.

Updates

Integrated into Hugging Face Spaces 🤗


About GLIP: Grounded Language-Image Pre-training -

GLIP demonstrates strong zero-shot and few-shot transferability to various object-level recognition tasks.

The model used in this repo is GLIP-T; it was originally pre-trained on Conceptual Captions 3M and SBU captions.
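For reference, the upstream GLIP repo exposes a GLIPDemo predictor that runs zero-shot detection from a text prompt. The sketch below follows that demo API; the config and checkpoint paths are assumptions and should be adjusted to your local files.

# Illustrative sketch of zero-shot detection with GLIP-T (paths are assumptions).
import cv2
from maskrcnn_benchmark.config import cfg
from maskrcnn_benchmark.engine.predictor_glip import GLIPDemo

config_file = "configs/pretrain/glip_Swin_T_O365_GoldG.yaml"       # assumed path
weight_file = "checkpoints/glip_tiny_model_o365_goldg_cc_sbu.pth"  # assumed path

cfg.local_rank = 0
cfg.num_gpus = 1
cfg.merge_from_file(config_file)
cfg.merge_from_list(["MODEL.WEIGHT", weight_file, "MODEL.DEVICE", "cuda"])

glip_demo = GLIPDemo(cfg, min_image_size=800, confidence_threshold=0.7)

image = cv2.imread("demo.jpg")  # BGR image, as the predictor expects
# Each phrase in the prompt becomes a detection category.
result, _ = glip_demo.run_on_web_image(image, "person. bicycle. car.", 0.5)
cv2.imwrite("result.jpg", result)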


About BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation -

BLIP introduces a new model architecture that enables a wider range of downstream tasks than existing methods, along with a dataset-bootstrapping method for learning from noisy web data.
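For reference, BLIP's VQA model answers a free-form question about an image. The sketch below follows the upstream BLIP demo; the checkpoint path is an assumption (use the BLIP weight downloaded in the Checkpoints step).

# Illustrative sketch of VQA with BLIP (checkpoint path is an assumption).
import torch
from PIL import Image
from torchvision import transforms
from models.blip_vqa import blip_vqa  # module from the BLIP repo

device = "cuda" if torch.cuda.is_available() else "cpu"
image_size = 480
preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = preprocess(Image.open("demo.jpg").convert("RGB")).unsqueeze(0).to(device)

model = blip_vqa(pretrained="checkpoints/model_base_vqa_capfilt_large.pth",  # assumed filename
                 image_size=image_size, vit="base")
model.eval().to(device)

with torch.no_grad():
    answers = model(image, "where is the dog?", train=False, inference="generate")
print(answers[0])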


Installation and Setup

Environment - Due to limitations of maskrcnn_benchmark, this repo requires PyTorch 1.10 and a matching torchvision build.
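For example (the torchvision pin below is an assumption; any build matching PyTorch 1.10 should work):

pip3 install torch==1.10.0 torchvision==0.11.1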

Use requirements.txt to install the dependencies

pip3 install -r requirements.txt

Build maskrcnn_benchmark

python setup.py build develop --user

To verify a successful build, check the terminal for the message
"Finished processing dependencies for maskrcnn-benchmark==0.1"

Checkpoints

Download the pre-trained models into the checkpoints folder.


mkdir checkpoints
cd checkpoints
Model weights:

  • GLIP-T weight
  • BLIP weight
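After downloading, the checkpoints folder should contain one weight file per model. The filenames below are assumptions and may differ depending on the release you download:

checkpoints/
├── glip_tiny_model_o365_goldg_cc_sbu.pth   (GLIP-T, assumed filename)
└── model_base_vqa_capfilt_large.pth        (BLIP VQA, assumed filename)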



If you have an NVIDIA GPU with 8GB of VRAM, run the local demo using the Gradio interface

python3 app_local.py
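Under the hood, app_local.py wires both models into a single Gradio interface. The stub below only illustrates that input/output shape; predict is a hypothetical placeholder, not the repo's actual code.

# Minimal Gradio stub showing the demo's input/output shape.
# `predict` is a hypothetical placeholder; in app_local.py the GLIP
# detector and the BLIP VQA model run here instead.
import gradio as gr

def predict(image, detection_prompt, question):
    # GLIP would draw boxes for `detection_prompt`; BLIP would answer `question`.
    return image, "answer goes here"

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="numpy"),
            gr.Textbox(label="Detection prompt"),
            gr.Textbox(label="Question")],
    outputs=[gr.Image(type="numpy"), gr.Textbox(label="Answer")],
)
demo.launch()  # prints a local URL; open it in a browser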

Sample display


Once the checkpoints have loaded, click the local URL printed in the terminal to run inference.

[Demo recordings: Video I/O (with FPS overlay) and Image I/O (with FPS overlay)]

Future Work

  • Frame-based Visual Question Answering
  • Video-based Visual Question Answering
  • Per-object Visual Question Answering

Citations

@inproceedings{li2022blip,
  title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
  booktitle={ICML},
  year={2022}
}
@inproceedings{li2021grounded,
  title={Grounded Language-Image Pre-training},
  author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
  booktitle={CVPR},
  year={2022}
}
@article{zhang2022glipv2,
  title={GLIPv2: Unifying Localization and Vision-Language Understanding},
  author={Zhang, Haotian* and Zhang, Pengchuan* and Hu, Xiaowei and Chen, Yen-Chun and Li, Liunian Harold and Dai, Xiyang and Wang, Lijuan and Yuan, Lu and Hwang, Jenq-Neng and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2206.05836},
  year={2022}
}
@article{li2022elevater,
  title={ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models},
  author={Li*, Chunyuan and Liu*, Haotian and Li, Liunian Harold and Zhang, Pengchuan and Aneja, Jyoti and Yang, Jianwei and Jin, Ping and Lee, Yong Jae and Hu, Houdong and Liu, Zicheng and others},
  journal={arXiv preprint arXiv:2204.08790},
  year={2022}
}

Acknowledgement

The implementation of this work relies on resources from BLIP, GLIP, Hugging Face Transformers, and timm. We thank the original authors for open-sourcing their work.
