GitHub - Jieyi1114/Forum-Message-Clustering-and-Real-Time-Data-Processing

This project developed a system for clustering Reddit forum messages using web scraping, data preprocessing, and clustering algorithms. Its key components extract messages from online forums, convert them into vector representations with Doc2Vec, and apply clustering algorithms to group similar messages based on keywords. The system also automates data collection, processing, and storage at fixed, user-specified intervals, so the database stays up to date in near real time. A command-line interface lets users enter keywords or messages to find the closest matching cluster, with visualizations that display cluster contents.
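The closest-cluster lookup described above can be sketched as a nearest-centroid search by cosine similarity. This is a minimal illustration, not the repository's code: the function names are hypothetical, the vectors are toy 2-D values standing in for Doc2Vec embeddings, and the project may use a different distance measure or lookup strategy.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def closest_cluster(query_vec, clusters):
    """Return (cluster_id, similarity) for the centroid most similar to query_vec.

    `clusters` maps a cluster id to the list of message vectors in that cluster.
    (Hypothetical layout, assumed for illustration only.)
    """
    best_id, best_score = None, -2.0  # cosine similarity is never below -1
    for cluster_id, members in clusters.items():
        score = cosine_similarity(query_vec, centroid(members))
        if score > best_score:
            best_id, best_score = cluster_id, score
    return best_id, best_score


# Toy vectors; in the project these would come from the Doc2Vec model.
TOY_CLUSTERS = {
    0: [[1.0, 0.0], [0.8, 0.2]],
    1: [[0.0, 1.0], [0.2, 0.8]],
}
print(closest_cluster([0.9, 0.1], TOY_CLUSTERS))
```

In a real run, the query vector would be inferred from the user's keywords or message with the same trained model that produced the cluster vectors, so that both live in the same vector space.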

To run this project, you need to:

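- Install the libraries: doc2vec, gensim, sklearn (published on PyPI as scikit-learn), matplotlib, praw, pytesseract, nltk, mysql.connector (published on PyPI as mysql-connector-python), and sqlalchemy. urllib is part of the Python standard library and needs no separate install.

- Run main.py to collect, clean, cluster, and visualize the data and store it in the database.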
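Running the collect, clean, cluster, and store steps at a fixed, user-specified interval might look like the sketch below. It assumes a single callable that performs one full pipeline pass; `run_at_fixed_interval` and its parameters are hypothetical names chosen for illustration, and main.py may schedule its runs differently.

```python
import time


def run_at_fixed_interval(job, interval_seconds, max_runs=None,
                          sleep=time.sleep, clock=time.monotonic):
    """Call `job` repeatedly, starting each run `interval_seconds` after the previous start.

    `max_runs=None` loops until interrupted. `sleep` and `clock` are injectable
    so the loop can be exercised without actually waiting.
    """
    runs = 0
    next_start = clock()
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Schedule from the planned start time so a slow run does not shift later runs.
        next_start += interval_seconds
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)
    return runs


def pipeline_pass():
    # Placeholder for one collect -> clean -> cluster -> store cycle (assumed, not from the repo).
    print("pipeline pass complete")


# Two passes with a zero-second interval, so the demo finishes immediately.
run_at_fixed_interval(pipeline_pass, interval_seconds=0, max_runs=2)
```

Scheduling from the planned start time rather than sleeping a fixed amount after each run keeps the interval steady even when a pass takes a while to finish.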

*This project was completed with Winnie Cai and Min Sang Yoo.

Repository files (5 commits):

- README.md
- TextCleanUp.py
- db_creation.py
- doc2vec.py
- fetch_and_clean_data.py
- main.py
- praw_config.py
