README
This repository contains two approaches for developing a model to understand the content of screenshots.
The first approach uses the Gemini 1.0 Pro Vision model together with EasyOCR for content understanding. The process involves:
- Performing Optical Character Recognition (OCR) on the screenshot using EasyOCR to extract text.
- Using the Gemini model to generate a natural-language description of the screenshot content, incorporating the OCR results.
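The two steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (`extract_text`, `build_prompt`, `describe_screenshot`), the prompt wording, and the `gemini-pro-vision` model identifier are assumptions, and that identifier may no longer be served by the API. It assumes the `easyocr`, `google-generativeai`, and `Pillow` packages are installed.

```python
from typing import List, Sequence


def extract_text(image_path: str, languages: Sequence[str] = ("en",)) -> List[str]:
    """Run EasyOCR on the screenshot and return the recognized text lines."""
    import easyocr  # imported lazily; creating a Reader may download OCR models

    reader = easyocr.Reader(list(languages))
    # detail=0 returns only the recognized strings, without boxes or scores.
    return reader.readtext(image_path, detail=0)


def build_prompt(ocr_lines: Sequence[str]) -> str:
    """Combine the OCR output with an instruction for the vision model."""
    ocr_block = "\n".join(line.strip() for line in ocr_lines if line.strip())
    if not ocr_block:
        ocr_block = "(no text detected)"
    # The prompt wording is hypothetical; adjust it to the task at hand.
    return (
        "Describe what this screenshot shows.\n"
        "Text extracted from the screenshot by OCR:\n"
        f"{ocr_block}"
    )


def describe_screenshot(image_path: str, api_key: str) -> str:
    """Send the screenshot plus its OCR text to Gemini and return the reply."""
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=api_key)
    # Assumed model id for Gemini 1.0 Pro Vision; check the current model list.
    model = genai.GenerativeModel("gemini-pro-vision")
    prompt = build_prompt(extract_text(image_path))
    response = model.generate_content([prompt, Image.open(image_path)])
    return response.text
```

With a valid API key, `describe_screenshot("screenshot.png", api_key="YOUR_API_KEY")` would return the model's description as a string. Keeping `build_prompt` separate makes it easy to inspect exactly what OCR text is passed to the model.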
The second approach uses an open-source image captioning model from Hugging Face, specifically Salesforce/blip-image-captioning-large. The steps include:
- Preprocessing the screenshot image.
- Using the BLIP model to generate image captions, both conditionally (guided by a text prompt) and unconditionally (from the image alone).
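A rough sketch of these steps, assuming the `transformers`, `torch`, and `Pillow` packages are installed. The helper names (`load_blip`, `load_image`, `generate_caption`) and the `max_new_tokens` default are illustrative assumptions, not the repository's actual functions.

```python
MODEL_ID = "Salesforce/blip-image-captioning-large"


def load_blip(model_id: str = MODEL_ID):
    """Load the BLIP processor and captioning model from Hugging Face."""
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return processor, model


def load_image(path: str):
    """Preprocess the screenshot: open it and convert it to RGB."""
    from PIL import Image

    return Image.open(path).convert("RGB")


def generate_caption(image, processor, model, prompt=None, max_new_tokens=50):
    """Caption an image, conditionally if a prompt is given, else unconditionally."""
    if prompt:
        # Conditional captioning: the caption continues the given text prompt.
        inputs = processor(image, prompt, return_tensors="pt")
    else:
        # Unconditional captioning: the caption is generated from the image alone.
        inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

A typical call would load the model once with `processor, model = load_blip()`, then run `generate_caption(load_image("screenshot.png"), processor, model)` for an unconditional caption, or pass `prompt="a screenshot of"` for a conditional one.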
Of the two approaches, Approach 1 (Gemini with EasyOCR) performed better. However, both approaches are included for comparison purposes.
To use these approaches:
- Ensure all necessary dependencies are installed.
- Replace the placeholder image path or URL with the path or URL of your own screenshot.
- Run the respective function for each approach.
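Before running either approach, it can help to confirm that the required packages are importable. The sketch below is one way to do that; the module lists in `REQUIRED_MODULES` are an assumption based on the tools named in this README, not a definitive requirements list, and `missing_dependencies` is a hypothetical helper.

```python
import importlib.util

# Assumed import names for each approach; adjust to match the actual scripts.
REQUIRED_MODULES = {
    "approach_1": ["easyocr", "google.generativeai", "PIL"],
    "approach_2": ["transformers", "torch", "PIL"],
}


def missing_dependencies(modules):
    """Return the module names from `modules` that cannot be found."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is itself missing.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for approach, modules in REQUIRED_MODULES.items():
        missing = missing_dependencies(modules)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{approach}: {status}")
```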
For detailed instructions and examples, refer to the individual scripts or functions.
Approach 1 combines a proprietary model (Gemini) with an open-source OCR library (EasyOCR) to provide a more comprehensive understanding of the screenshot content. Approach 2 relies solely on open-source tools, offering transparency and flexibility.