Image Understanding Model


This repository contains two approaches for developing a model to understand the content of screenshots.

Approach 1: Gemini + OCR

In this approach, we utilize the Gemini 1.0 Pro Vision model along with EasyOCR for content understanding. The process involves:

1. Running Optical Character Recognition (OCR) on the screenshot with EasyOCR to extract its text.
2. Prompting the Gemini model to generate a natural-language description of the screenshot content, incorporating the OCR results.
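
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the function names, the prompt wording, and the `GOOGLE_API_KEY` environment variable are assumptions, and `gemini-pro-vision` is assumed to be the identifier for Gemini 1.0 Pro Vision in the `google-generativeai` package (it may no longer be served). Third-party imports are deferred so the prompt helper can be tried without them installed.

```python
import os


def build_prompt(ocr_lines):
    """Combine OCR output with an instruction for the vision model.

    The prompt wording here is an assumption, not the original prompt.
    """
    ocr_text = "\n".join(line.strip() for line in ocr_lines if line.strip())
    return (
        "Describe what this screenshot shows.\n"
        "Text extracted by OCR (may contain recognition errors):\n"
        f"{ocr_text or '(no text detected)'}"
    )


def extract_text(image_path, languages=("en",)):
    """Step 1: run EasyOCR and return the recognized strings."""
    import easyocr  # deferred: loads detection/recognition models

    reader = easyocr.Reader(list(languages))
    # detail=0 returns only the text, without boxes or confidences
    return reader.readtext(image_path, detail=0)


def describe_screenshot(image_path):
    """Step 2: send the image and the OCR text to Gemini."""
    import google.generativeai as genai
    from PIL import Image

    # Assumption: the API key is provided through this environment variable.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-pro-vision")

    prompt = build_prompt(extract_text(image_path))
    response = model.generate_content([prompt, Image.open(image_path)])
    return response.text
```

Calling `describe_screenshot("input1.png")` would then return the generated description, where the file name is a placeholder for your own screenshot.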

Screenshot

Approach 2: Salesforce BLIP Image Captioning

The second approach involves using an open-source image captioning model from Hugging Face, specifically Salesforce/blip-image-captioning-large. The steps include:

1. Preprocessing the screenshot image.
2. Using the BLIP model to generate image captions, both conditionally and unconditionally.
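
A rough sketch of these steps, following the usage pattern on the model's Hugging Face page, might look like the code below. The helper names (`is_url`, `load_image`, `caption_image`), the `max_new_tokens` value, and the example prompt are assumptions made for illustration; the notebook may organize this differently. Conditional captioning passes a text prefix that the caption continues, while unconditional captioning passes the image alone.

```python
from urllib.parse import urlparse

MODEL_ID = "Salesforce/blip-image-captioning-large"


def is_url(source):
    """Return True when the source looks like an HTTP(S) URL."""
    return urlparse(source).scheme in ("http", "https")


def load_image(source):
    """Preprocessing: load a local path or URL as an RGB PIL image."""
    from io import BytesIO

    import requests
    from PIL import Image

    if is_url(source):
        response = requests.get(source, timeout=30)
        response.raise_for_status()
        return Image.open(BytesIO(response.content)).convert("RGB")
    return Image.open(source).convert("RGB")


def caption_image(image, prompt=None):
    """Generate a caption; pass a prompt for conditional captioning."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(MODEL_ID)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

    if prompt is None:
        # Unconditional: the model describes the image from scratch.
        inputs = processor(image, return_tensors="pt")
    else:
        # Conditional: the caption continues the given text prefix.
        inputs = processor(image, prompt, return_tensors="pt")

    output_ids = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Example usage (file name and prompt are placeholders):
# image = load_image("input1.png")
# print(caption_image(image))                            # unconditional
# print(caption_image(image, prompt="a screenshot of"))  # conditional
```

Loading the processor and model inside `caption_image` keeps the sketch short; for repeated calls it would be cheaper to load them once and reuse them.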

Screenshot

Performance Evaluation

Of the two approaches, Approach 1 (Gemini + OCR) performed better. Both approaches are nevertheless included for comparison.

Instructions for Usage

To use these approaches:

1. Ensure all necessary dependencies are installed.
2. Replace the placeholder image path or URL with the path or URL of your own screenshot.
3. Run the respective function for each approach.
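
For the first step, a small check like the one below can report which packages are missing before running either approach. The module and package names are assumptions based on the tools named in this README (EasyOCR, Gemini, BLIP via Hugging Face); the repository does not list its dependencies here, so adjust the mapping to match the notebook.

```python
import importlib.util

# Assumed mapping of importable module name -> pip package name.
ASSUMED_MODULES = {
    "easyocr": "easyocr",
    "google.generativeai": "google-generativeai",
    "transformers": "transformers",
    "torch": "torch",
    "PIL": "Pillow",
}


def missing_packages(modules=ASSUMED_MODULES):
    """Return the pip names of modules that cannot be found."""
    missing = []
    for module_name, pip_name in modules.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is absent
            # or a module has no usable spec.
            found = False
        if not found:
            missing.append(pip_name)
    return missing
```

Printing `missing_packages()` gives a list that can be passed to `pip install`; an empty list means every assumed module was found.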

For detailed instructions and examples, refer to the individual scripts or functions.

Note

Approach 1 leverages a combination of proprietary and open-source technologies, providing a comprehensive understanding of the screenshot content. Approach 2 relies solely on open-source tools, offering transparency and flexibility.
