Machine Learning Approach to Extracting Entity Values from Images

Project Overview

This project demonstrates the use of Optical Character Recognition (OCR) and regular expressions (regex) to extract key product attributes (such as weight, dimensions, voltage, and wattage) from images. It was built to recover product details from e-commerce images when textual descriptions are missing or incomplete.

Introduction

In e-commerce, having accurate product details such as weight, voltage, and dimensions is critical for tasks like inventory management and product listing. This project tackles the problem by extracting these values directly from product images, using a combination of OCR to capture text and regex for pattern recognition.

Dataset

The dataset consists of product images accessed via URLs, containing attributes such as:

Weight (e.g., "22 lbs", "10 kg")
Dimensions (e.g., "10x12 inches", "30x50 cm")
Voltage (e.g., "120V")
Wattage (e.g., "60W")

Each image is processed to extract these attributes using OCR and regex patterns, followed by unit standardization.

Approach

The project follows these key steps:

Data Preprocessing: The images are processed to extract textual information using Tesseract OCR.
Pattern Recognition: Regular expressions are used to detect and extract specific attribute values such as weight and dimensions.
Unit Conversion: Once values are extracted, they are converted into standardized units for consistency.
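
The pattern-recognition and unit-conversion steps could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the names `WEIGHT_RE`, `GRAMS_PER_UNIT`, and `extract_weight_grams`, the list of supported units, and the choice of grams as the standard unit are all assumptions made for the example.

```python
import re

# Hypothetical conversion table: grams per unit.
# Grams are chosen as the standard unit purely for illustration.
GRAMS_PER_UNIT = {
    "kg": 1000.0,
    "g": 1.0,
    "lb": 453.59237,
    "lbs": 453.59237,
    "oz": 28.349523125,
}

# A number (optionally with decimals) followed by a known weight unit.
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|lbs|lb|oz|g)\b", re.IGNORECASE)


def extract_weight_grams(text):
    """Return the first weight found in `text`, converted to grams, or None."""
    match = WEIGHT_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    return value * GRAMS_PER_UNIT[unit]


print(extract_weight_grams("Net weight: 10 kg"))
print(extract_weight_grams("Ships at 22 lbs"))
```

Here the input strings stand in for text returned by the OCR step. The trailing `\b` keeps a bare "g" from matching the start of a longer word.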

Key Components:

Optical Character Recognition (OCR): Tesseract is used to extract text from product images.
Regex for Feature Extraction: Regular expressions are employed to identify numeric patterns associated with product attributes like weight, dimensions, and voltage.
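
As an illustration of regex-based feature extraction, the sketch below pulls voltage and wattage values such as "120V" and "60W" out of a string of OCR text. The patterns and the function name `extract_electrical` are hypothetical and are not taken from the repository; real OCR output is noisier and would likely need looser patterns.

```python
import re

# Hypothetical patterns: a number followed by a unit symbol or unit word.
VOLTAGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:volts?|v)\b", re.IGNORECASE)
WATTAGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:watts?|w)\b", re.IGNORECASE)


def extract_electrical(text):
    """Return a dict with the first voltage and wattage found (None if absent)."""
    result = {}
    for name, pattern in (("voltage", VOLTAGE_RE), ("wattage", WATTAGE_RE)):
        match = pattern.search(text)
        result[name] = float(match.group(1)) if match else None
    return result


print(extract_electrical("Input 120V, 60W max"))
```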

Technologies Used

Python: Used for scripting, data manipulation, and regex.
Tesseract OCR: An open-source OCR engine used to extract text from product images.
Regular Expressions (Regex): For pattern-based extraction of product attributes.
Pandas: For data manipulation.
Matplotlib: For data visualization.

Experiments

Image Preprocessing

To improve OCR accuracy, image preprocessing techniques such as brightness and contrast adjustment were applied before text extraction.
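
The arithmetic behind a brightness and contrast adjustment can be sketched in plain Python. The README does not say which library or parameter values were used, so the function `adjust_brightness_contrast`, its parameters, and the mid-gray pivot of 128 are assumptions. In practice an imaging library would presumably apply an equivalent transform to the whole image before it is passed to Tesseract.

```python
def adjust_brightness_contrast(pixels, brightness=0, contrast=1.0):
    """Apply a linear brightness/contrast adjustment to 8-bit grayscale pixels.

    `pixels` is a list of rows, each a list of ints in 0..255.
    Contrast scales each value around mid-gray (128); brightness is then added.
    """
    adjusted = []
    for row in pixels:
        new_row = []
        for value in row:
            new_value = contrast * (value - 128) + 128 + brightness
            # Clamp to the valid 8-bit range and round to an integer.
            new_row.append(max(0, min(255, round(new_value))))
        adjusted.append(new_row)
    return adjusted


# A tiny low-contrast "image": values clustered around mid-gray.
image = [[118, 128, 138], [108, 128, 148]]
print(adjust_brightness_contrast(image, brightness=10, contrast=2.0))
```

Raising the contrast spreads pixel values away from mid-gray, which can make faint printed text easier for an OCR engine to separate from the background.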

Regex Optimization

Regex patterns were refined iteratively to accurately identify product attributes such as weight and dimensions across a wide range of formats (e.g., "10x12 in", "22 lbs"). Flexibility was added to handle variations and edge cases.
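
One way such flexibility might look for dimensions is sketched below: a single pattern that tolerates optional whitespace, upper or lower case, and several spellings of the unit. The pattern, the `CANONICAL_UNIT` table, and the function `extract_dimensions` are hypothetical examples, not the patterns the project ended up with.

```python
import re

# Hypothetical pattern: two numbers separated by "x", then a length unit.
# Optional whitespace and case differences are tolerated.
DIMENSION_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*(inches|inch|in|cm|mm)\b",
    re.IGNORECASE,
)

# Map unit spellings to one canonical name (an illustrative choice).
CANONICAL_UNIT = {
    "inches": "inch",
    "inch": "inch",
    "in": "inch",
    "cm": "centimetre",
    "mm": "millimetre",
}


def extract_dimensions(text):
    """Return (width, height, unit) for the first match in `text`, or None."""
    match = DIMENSION_RE.search(text)
    if match is None:
        return None
    width, height = float(match.group(1)), float(match.group(2))
    return width, height, CANONICAL_UNIT[match.group(3).lower()]


for sample in ["10x12 in", "10 x 12 inches", "Size: 30X50 CM"]:
    print(sample, "->", extract_dimensions(sample))
```

Listing the longer unit spellings first in the alternation ("inches" before "in") ensures the full word is consumed before the word-boundary check.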

Results

The combination of OCR and regex-based pattern recognition proved effective for extracting structured attributes such as weight, voltage, and dimensions from product images.

Accuracy: Approximately 85% accuracy was achieved for correctly extracting key attributes across diverse product images.
Performance: The system performs efficiently even with large datasets, making it suitable for real-time applications.

Conclusion

This project successfully demonstrated how OCR and regex can be applied to extract structured data from product images. The system efficiently extracts and standardizes values like weight, voltage, and dimensions, which are critical in e-commerce. The solution is robust and can be adapted for various product categories.

How to Run

Clone the repository:

git clone https://github.com/Mukunj-21/Image-Entity-Extractor.git
cd Image-Entity-Extractor

Run the Jupyter Notebook:

jupyter notebook "Image Entity Extractor.ipynb"
