Reddit User Image Extractor


Introduction

The Reddit User Image Extractor uses an ELT process to gather and store images from subreddits. By combining the Reddit API, SQLAlchemy, SQLite, and Amazon S3, the application provides a foundation for future machine learning and data processing tasks.

Data collection is automated on a daily schedule with GitHub Actions, so the database stays up to date with the latest images. This automation saves time and keeps the database consistent and reliable.
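As a rough sketch, a scheduled workflow along these lines could drive the daily run; the file name, cron time, entry point (main.py), and secret names are assumptions for illustration, not the repository's actual configuration:

```yaml
# .github/workflows/daily-extract.yml (hypothetical file name)
name: Daily Reddit image extraction

on:
  schedule:
    - cron: "0 6 * * *"   # once a day at 06:00 UTC (assumed schedule)
  workflow_dispatch:       # also allow manual runs

jobs:
  extract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python main.py   # entry point is an assumption
        env:
          REDDIT_CLIENT_ID: ${{ secrets.REDDIT_CLIENT_ID }}
          REDDIT_CLIENT_SECRET: ${{ secrets.REDDIT_CLIENT_SECRET }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```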

One potential application of the collected data is to develop a classification model. By training a machine learning algorithm on a labeled dataset, the model could analyze visual features to make accurate predictions. This capability could enhance various applications, such as targeted marketing or user profiling in social media platforms.

This project not only showcases the integration of various tools and technologies but also highlights the potential for further exploration in image classification and data analysis.

ELT Process

(ELT pipeline diagram)

  1. Reddit API (Image Data Collection): The initial collection of image URLs is performed through the Reddit API. Metadata, including image URLs, is stored in an SQLite database, where a column named img_url holds each image's address.

  2. Python Application (URL Extraction and Storage): A Python script processes the extracted URLs and stores the initial data in an SQLite database hosted on Amazon S3. This database serves as a central repository for the image data, allowing easy updates and labeling (a sketch of steps 1 and 2 follows this list).

  3. DBeaver (Data Labeling): In DBeaver, each image is reviewed and labeled manually. For images that no longer exist, img_url is set to NULL, which prevents unnecessary downloads later. After labeling, a Python script uploads the updated SQLite file back to S3 with the new labels.

  4. Google Colab (Image Download): In Colab, the image download process begins by reading the updated SQLite database and filtering for valid URLs (non-NULL values). The script then downloads the images as JPG files and stores them on Google Drive, ensuring that only accessible images are considered for training and inference (see the Colab sketch at the end of this section).

  5. Inference in Colab (Gender Classification): During inference, the model reads the images stored on Google Drive and pairs each one with its respective label (M or F) by matching the image id with the database records. This association lets the images be classified and checked against the pre-assigned labels in the SQLite database.
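
As a rough illustration of steps 1 and 2, the sketch below collects direct image links with PRAW, stores them through SQLAlchemy in a local SQLite file, and uploads that file to S3 with boto3. Only the img_url column comes from the description above; the subreddit, bucket, file, table, and label-column names are assumptions, not the project's actual values:

```python
import os

import boto3
import praw
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

DB_FILE = "images.db"        # local SQLite file name (assumption)
BUCKET = "reddit-images-db"  # S3 bucket name (assumption)
SUBREDDIT = "pics"           # example subreddit (assumption)

Base = declarative_base()


class Image(Base):
    __tablename__ = "images"  # table name is an assumption
    id = Column(Integer, primary_key=True)
    img_url = Column(String)  # set to NULL later for dead links (step 3)
    label = Column(String)    # filled in manually in DBeaver: 'M' or 'F'


def collect(limit: int = 50) -> None:
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit-images-extractor sketch",
    )
    engine = create_engine(f"sqlite:///{DB_FILE}")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        for post in reddit.subreddit(SUBREDDIT).new(limit=limit):
            # keep only direct links to image files
            if post.url.lower().endswith((".jpg", ".jpeg", ".png")):
                session.add(Image(img_url=post.url))
        session.commit()
    # upload the SQLite file so S3 acts as the central repository
    boto3.client("s3").upload_file(DB_FILE, BUCKET, DB_FILE)


if __name__ == "__main__":
    collect()
```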

This two-step validation process — first in DBeaver and later in Colab — ensures that only relevant and accessible image URLs are processed and labeled, optimizing the dataset and model performance.
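
For steps 4 and 5, a minimal Colab-side sketch could look like the following, assuming Google Drive is already mounted at /content/drive and reusing the hypothetical names from the sketch above:

```python
import pathlib
import sqlite3

import boto3
import requests

DB_FILE = "images.db"        # same hypothetical names as in the sketch above
BUCKET = "reddit-images-db"
DRIVE_DIR = pathlib.Path("/content/drive/MyDrive/reddit_images")


def download_labeled_images() -> dict[int, str]:
    """Download every reachable image and return an id -> label mapping."""
    boto3.client("s3").download_file(BUCKET, DB_FILE, DB_FILE)
    DRIVE_DIR.mkdir(parents=True, exist_ok=True)

    conn = sqlite3.connect(DB_FILE)
    # NULL img_url marks images that no longer exist, so skip them (step 4)
    rows = conn.execute(
        "SELECT id, img_url, label FROM images WHERE img_url IS NOT NULL"
    ).fetchall()
    conn.close()

    labels: dict[int, str] = {}
    for img_id, url, label in rows:
        resp = requests.get(url, timeout=30)
        if resp.status_code == 200:
            # name each file after its database id so inference (step 5)
            # can match the image back to its 'M'/'F' label
            (DRIVE_DIR / f"{img_id}.jpg").write_bytes(resp.content)
            labels[img_id] = label
    return labels
```

Naming each file after its database id is what makes the step-5 match between Drive images and SQLite labels a simple lookup.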

Tools

Python · Jupyter · SQLite · SQLAlchemy · Amazon S3
