The Reddit User Image Extractor harnesses an ELT process to gather and store images from the selected subreddits. By combining the Reddit API, SQLAlchemy, SQLite, and Amazon S3, the application provides a foundation for future machine learning and data processing tasks.
The data collection process is automated daily with GitHub Actions, keeping the database up to date with the latest images; a sketch of such a workflow follows. This automation saves time and makes the database more consistent and reliable.
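Because the pipeline itself is written in Python, the scheduling layer reduces to a small workflow file. As a rough illustration only, a GitHub Actions workflow along these lines could trigger a daily run; the workflow name, the `collect.py` entry point, and the secret names are placeholders, not the project's actual configuration:

```yaml
# Hypothetical daily collection workflow (placeholder names throughout).
name: daily-image-collection
on:
  schedule:
    - cron: "0 6 * * *"    # run once per day at 06:00 UTC
  workflow_dispatch:        # also allow manual runs
jobs:
  collect:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python collect.py   # hypothetical collection script
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```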
One potential application of the collected data is a classification model. Trained on the labeled dataset, such a model could analyze visual features to make accurate predictions, a capability that could support applications such as targeted marketing or user profiling on social media platforms.
This project not only showcases the integration of various tools and technologies but also highlights the potential for further exploration in image classification and data analysis.
- Reddit API (Image Data Collection): The initial collection of image URLs is performed through the Reddit API. Metadata, including image URLs, is stored in an SQLite database, which includes a column named `img_url` to hold the image addresses. A sketch of this collection step appears after the list.
- Python Application (URL Extraction and Storage): A Python script processes the extracted URLs and stores the initial data in an SQLite database hosted on Amazon S3. This database serves as a central repository for the image data, allowing easy updates and labeling.
- DBeaver (Data Labeling): In DBeaver, I review and label each image. For images that no longer exist, the URL can be left as `NULL`, which prevents unnecessary downloads. After labeling, a Python script updates the SQLite file on S3 with the new labels (the S3 sync sketch after the list shows how).
- Google Colab (Image Download): In Colab, the image download process begins by reading the updated SQLite database and filtering for valid URLs (non-`NULL` values). The script then downloads the images as JPG files and stores them on Google Drive, ensuring that only accessible images are considered for training and inference. A download sketch also follows the list.
- Inference in Colab (Gender Classification): During the inference phase, the model uses the images stored in Google Drive and associates each image with its respective label (`M` or `F`) by matching the image `id` with the database records. This association allows the model to classify the images based on the pre-assigned labels in the SQLite database; the final sketch below shows this matching step.
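To make the pipeline concrete, here is a minimal sketch of the collection and storage steps, assuming PRAW as the Reddit API client and SQLAlchemy for persistence. The `images` table name and the `reddit_id` and `label` columns are assumptions; only the `img_url` column is named in the text above:

```python
import praw
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Image(Base):
    __tablename__ = "images"                 # hypothetical table name

    id = Column(Integer, primary_key=True)
    reddit_id = Column(String, unique=True)  # assumed: the Reddit post id
    img_url = Column(String)                 # column named in the article
    label = Column(String, nullable=True)    # filled in later via DBeaver

engine = create_engine("sqlite:///images.db")
Base.metadata.create_all(engine)

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",              # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-image-extractor/0.1",
)

with Session(engine) as session:
    for post in reddit.subreddit("pics").new(limit=100):  # example subreddit
        # Keep only direct image links and skip posts already recorded.
        if not post.url.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        if session.query(Image).filter_by(reddit_id=post.id).first() is None:
            session.add(Image(reddit_id=post.id, img_url=post.url))
    session.commit()
```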
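Because the SQLite file lives on Amazon S3, each script needs to pull the database before reading it and push it back after writing new URLs or labels. A sketch with boto3, using a placeholder bucket name and object key, might look like this:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-extractor-bucket"  # placeholder bucket name
KEY = "images.db"                     # placeholder object key

def pull_db(local_path: str = "images.db") -> None:
    """Download the current database before reading or labeling."""
    s3.download_file(BUCKET, KEY, local_path)

def push_db(local_path: str = "images.db") -> None:
    """Upload the database after new URLs or labels are written."""
    s3.upload_file(local_path, BUCKET, KEY)
```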
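The Colab download step can then filter out the `NULL` URLs left during labeling and save each image under its database `id`. This sketch assumes the same hypothetical `images` table and a placeholder Drive folder:

```python
import sqlite3
from pathlib import Path

import requests
from google.colab import drive

drive.mount("/content/drive")
out_dir = Path("/content/drive/MyDrive/reddit_images")  # placeholder folder
out_dir.mkdir(parents=True, exist_ok=True)

# The database is assumed to have been pulled from S3 beforehand.
conn = sqlite3.connect("images.db")
rows = conn.execute(
    "SELECT id, img_url FROM images WHERE img_url IS NOT NULL"
).fetchall()
conn.close()

for row_id, url in rows:
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        # Save each image under its database id for later matching.
        (out_dir / f"{row_id}.jpg").write_bytes(resp.content)
    except requests.RequestException:
        print(f"Skipping unreachable image {row_id}: {url}")
```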
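Finally, the inference step can rebuild the image-to-label association by matching each file's stem (the `id` it was saved under) against the `M`/`F` labels in the database. Again, the `label` column, folder path, and filename scheme are assumptions from the sketches above:

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect("images.db")
labels = dict(
    conn.execute("SELECT id, label FROM images WHERE label IN ('M', 'F')")
)
conn.close()

# Pair each downloaded file with its label via the <id>.jpg naming scheme.
dataset = []
for path in Path("/content/drive/MyDrive/reddit_images").glob("*.jpg"):
    row_id = int(path.stem)
    if row_id in labels:
        dataset.append((path, labels[row_id]))

print(f"{len(dataset)} labeled images ready for training and inference")
```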
This two-step validation process — first in DBeaver and later in Colab — ensures that only relevant and accessible image URLs are processed and labeled, optimizing the dataset and model performance.