Jieyi1114/Forum-Message-Clustering-and-Real-Time-Data-Processing

This project is a system for clustering Reddit forum messages using web scraping, data preprocessing, and clustering algorithms. It extracts messages from the forum, converts them into vector representations with Doc2Vec, and applies clustering algorithms to group similar messages based on keywords. The system also automates data collection, processing, and storage at a fixed, user-specified interval, so the database stays close to real time. A command-line interface lets users enter keywords or a message to find the closest matching cluster, and visualizations display the contents of each cluster.
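The closest-cluster lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes each cluster is summarized by a centroid vector and that the query has already been converted to a vector (in the project this would presumably come from the Doc2Vec model). The function names `cosine_similarity` and `closest_cluster`, the use of cosine similarity, and the 3-dimensional toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_cluster(query_vector, centroids):
    """Return the id of the cluster whose centroid is most similar to the query."""
    return max(
        centroids,
        key=lambda cluster_id: cosine_similarity(query_vector, centroids[cluster_id]),
    )


# Hypothetical centroids; real ones would be computed from the clustered message vectors.
centroids = {
    0: [0.9, 0.1, 0.0],
    1: [0.0, 0.8, 0.6],
}

# Hypothetical query vector standing in for an embedded keyword or message.
query = [0.1, 0.7, 0.5]
print(closest_cluster(query, centroids))
```

Cosine similarity is used here because it ignores vector length and compares direction only; a Euclidean distance to each centroid would be an equally reasonable choice depending on the clustering algorithm.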

To run this project, you need to:

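- Install the required libraries: gensim (which provides Doc2Vec), scikit-learn (imported as sklearn), matplotlib, praw, pytesseract, nltk, mysql-connector-python (imported as mysql.connector), and sqlalchemy. urllib is part of the Python standard library and does not need to be installed.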

- Run main.py to collect the data, then clean, cluster, visualize, and store it in the database.
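For readers curious how a fixed-interval run of these steps might be arranged, here is a minimal sketch using only the standard library. It is an assumption about the general shape, not a description of what main.py does: `run_periodically` and `pipeline` are hypothetical names, and the placeholder job only prints the stage names instead of scraping or writing to a database.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, aiming for one run every interval_seconds.

    max_runs limits the number of runs (None means run forever).
    sleep is injectable so the loop can be exercised without real waiting.
    Returns the number of completed runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Subtract the time the job took so runs stay roughly on schedule.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval_seconds - elapsed))
    return runs


def pipeline():
    # Placeholder for the real stages: collect, clean, cluster, visualize, store.
    print("collect -> clean -> cluster -> visualize -> store")


if __name__ == "__main__":
    # A tiny interval keeps the demo fast; a real deployment would use minutes or hours.
    run_periodically(pipeline, interval_seconds=0.01, max_runs=2)
```

Passing the interval as a parameter mirrors the user-specified interval mentioned in the project description, and the `max_runs` argument exists only to make the loop easy to stop in a demonstration.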

*This project was completed with Winnie Cai and Min Sang Yoo.
