A comprehensive data pipeline that collects content from multiple sources (Reddit, News APIs), processes it with natural language processing (NLP) techniques, and runs sentiment analysis to classify each post or article as positive, negative, or neutral.
- Multiple data sources:
  - Reddit posts from configurable subreddits
  - News articles from NewsAPI.org
- Text preprocessing pipeline with NLTK (a short sketch follows this list)
- Sentiment analysis using machine learning (Logistic Regression with TF-IDF)
- Data storage in SQLite database
- Interactive visualization dashboard
- Continuous processing and classification
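The preprocessing and classification features above are not spelled out in this README, so here is a minimal sketch of a typical NLTK cleaning pass (lowercase, strip URLs and punctuation, drop stop words, lemmatize). The `clean_text` function and the exact resource list are assumptions for illustration, not necessarily what `src/data/preprocessor.py` does.

```python
# Sketch of an NLTK preprocessing pass (assumed flow; see src/data/preprocessor.py
# for what the project actually does).
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads, roughly what `python app.py --setup` is expected to fetch.
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, drop URLs and punctuation, remove stop words, lemmatize."""
    text = re.sub(r"http\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = word_tokenize(text)
    return " ".join(
        LEMMATIZER.lemmatize(t) for t in tokens
        if t not in STOPWORDS and len(t) > 2
    )

print(clean_text("NLTK makes cleaning Reddit posts easy! https://example.com"))
```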
```
/SentimentAnalysisPipeline/
├── config/                       # Configuration files
├── data/                         # Data storage
│   └── raw/                      # Raw downloaded datasets for model training
├── models/                       # ML model storage
│   └── saved_models/             # Saved trained models
├── src/                          # Source code
│   ├── data/                     # Data collection and processing
│   │   ├── reddit_collector.py   # Reddit data collector
│   │   ├── news_collector.py     # News API collector
│   │   ├── database.py           # Database operations
│   │   └── preprocessor.py       # Text preprocessing
│   ├── models/                   # Model training and classification
│   │   ├── trainer.py            # Model training
│   │   └── classifier.py         # Sentiment classification
│   └── visualization/            # Visualization and dashboard
│       ├── dashboard.py          # Interactive web dashboard
│       └── plots.py              # Data visualization
├── app.py                        # Main application
└── requirements.txt              # Dependencies
```
- Python 3.7+
- Reddit API credentials (Client ID, Client Secret)
- NewsAPI.org API key
- Clone the repository:

  ```bash
  git clone https://github.com/AriachAmine/SentimentAnalysisPipeline.git
  cd SentimentAnalysisPipeline
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download required NLTK data:

  ```bash
  python app.py --setup
  ```
- Open `config/config.yaml` and update it with your API credentials:

  ```yaml
  # Reddit API Configuration
  reddit_api:
    client_id: "YOUR_REDDIT_CLIENT_ID"
    client_secret: "YOUR_REDDIT_CLIENT_SECRET"
    user_agent: "Sentiment Analysis App by /u/YOUR_USERNAME"

  # News API Configuration
  news_api:
    api_key: "YOUR_NEWS_API_KEY"
  ```
- Configure data collection parameters:

  ```yaml
  collection:
    keywords: ["technology", "AI", "machinelearning"]             # For news queries
    subreddits: ["technology", "artificial", "MachineLearning"]   # Reddit sources
    languages: ["en"]
    interval_seconds: 60
  ```
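Putting the two configuration sections together, the sketch below shows one way they could be loaded and used for a single collection pass. The PRAW and NewsAPI calls are the real client APIs, but `collect_once` is a hypothetical helper and the project's own collectors in `src/data/` may be organized quite differently.

```python
# Hypothetical glue code showing how config.yaml could drive one collection pass.
# The real reddit_collector.py / news_collector.py may differ.
import praw
import requests
import yaml

with open("config/config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Read-only Reddit client built from the credentials section.
reddit = praw.Reddit(
    client_id=config["reddit_api"]["client_id"],
    client_secret=config["reddit_api"]["client_secret"],
    user_agent=config["reddit_api"]["user_agent"],
)

def collect_once():
    items = []

    # Recent posts from each configured subreddit, via PRAW.
    for name in config["collection"]["subreddits"]:
        for post in reddit.subreddit(name).new(limit=25):
            items.append(("reddit", post.title))

    # One NewsAPI /v2/everything query per keyword.
    for keyword in config["collection"]["keywords"]:
        resp = requests.get(
            "https://newsapi.org/v2/everything",
            params={
                "q": keyword,
                "language": config["collection"]["languages"][0],
                "apiKey": config["news_api"]["api_key"],
            },
            timeout=10,
        )
        for article in resp.json().get("articles", []):
            items.append(("news", article["title"]))

    return items
```

In the running application, `interval_seconds` presumably controls how often a pass like this repeats.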
The simplest way to run the full application is:
python app.py
This will:
- Start Reddit and News API collectors
- Process collected content
- Classify sentiment (if a model has been trained)
- Launch the visualization dashboard
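Steps two to four of that list amount to running each collected text through preprocessing, the trained classifier, and the SQLite store. Below is a compressed, hypothetical version of that step; the model path, database file, and table layout are guesses, and the real logic is split across `classifier.py` and `database.py`.

```python
# Compressed, hypothetical classify-and-store step; file and table names are guesses.
import sqlite3

import joblib

# Assumed location of the trained TF-IDF + Logistic Regression pipeline.
model = joblib.load("models/saved_models/sentiment_model.joblib")

conn = sqlite3.connect("data/sentiment.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS posts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source TEXT,        -- 'reddit' or 'news'
        text TEXT,
        sentiment TEXT,     -- 'positive', 'negative', or 'neutral'
        collected_at TEXT
    )
    """
)

def classify_and_store(source: str, text: str) -> str:
    # In the real pipeline the NLTK preprocessing step would normally run first.
    label = model.predict([text])[0]
    conn.execute(
        "INSERT INTO posts (source, text, sentiment, collected_at) "
        "VALUES (?, ?, ?, datetime('now'))",
        (source, text, label),
    )
    conn.commit()
    return label

print(classify_and_store("reddit", "This new GPU is amazing"))
```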
Before classification can work, you need to train a sentiment model:
- Download a sentiment dataset like Sentiment140 from Kaggle
- Place it in the `data/raw` directory
- Run the training command: `python app.py --train`
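Behind that command sits the Logistic Regression + TF-IDF model from the feature list. A condensed sketch of such a training run is shown below; the Sentiment140 file name, its column layout, and the saved-model path are assumptions and may not match what `trainer.py` actually does.

```python
# Condensed training sketch; file names and model path are assumptions.
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Sentiment140 ships as a headerless CSV: target, id, date, flag, user, text,
# with target 0 = negative and 4 = positive.
cols = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv(
    "data/raw/training.1600000.processed.noemoticon.csv",
    encoding="latin-1",
    names=cols,
)
df["label"] = df["target"].map({0: "negative", 4: "positive"})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

joblib.dump(model, "models/saved_models/sentiment_model.joblib")
```

Note that Sentiment140's training file only contains positive and negative labels; how the pipeline derives the neutral class (for example, from prediction probabilities) is up to `trainer.py` and `classifier.py`.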
To run just the visualization dashboard:
python app.py --dashboard-only
The application supports several command line options:
- `--train`: Train the sentiment model
- `--dashboard-only`: Run only the dashboard
- `--setup`: Download required NLTK data
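These flags suggest an entry point roughly like the following; this is a plausible sketch of the argument handling, not the actual `app.py`.

```python
# Plausible argument handling for app.py; the print calls stand in for the
# real entry points (setup, training, dashboard, full pipeline).
import argparse

def main():
    parser = argparse.ArgumentParser(description="Sentiment Analysis Pipeline")
    parser.add_argument("--train", action="store_true", help="Train the sentiment model")
    parser.add_argument("--dashboard-only", action="store_true", help="Run only the dashboard")
    parser.add_argument("--setup", action="store_true", help="Download required NLTK data")
    args = parser.parse_args()

    if args.setup:
        print("would download NLTK data")
    elif args.train:
        print("would train the sentiment model")
    elif args.dashboard_only:
        print("would launch only the dashboard")
    else:
        print("would start collectors, classification, and the dashboard")

if __name__ == "__main__":
    main()
```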
The dashboard is accessible at http://localhost:5000 and includes:
- Sentiment distribution pie chart
- Sentiment trend line chart over time
- Recent content with sentiment classification
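The dashboard is served with Flask (see the acknowledgements below). A stripped-down sketch of an endpoint that could feed the distribution chart, reusing the hypothetical `posts` table from the storage sketch above, might look like this; the route and database path are assumptions, not `dashboard.py`'s actual API.

```python
# Stripped-down sketch of a data endpoint for the dashboard; routes and paths
# are assumptions and may not match dashboard.py.
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/sentiment-distribution")
def sentiment_distribution():
    conn = sqlite3.connect("data/sentiment.db")
    rows = conn.execute(
        "SELECT sentiment, COUNT(*) FROM posts GROUP BY sentiment"
    ).fetchall()
    conn.close()
    return jsonify(dict(rows))  # e.g. {"negative": 17, "neutral": 9, "positive": 42}

if __name__ == "__main__":
    app.run(port=5000)
```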
This project is licensed under the MIT License.
- PRAW for Reddit API interaction
- NewsAPI for news content
- NLTK for natural language processing
- scikit-learn for machine learning
- Flask for the web dashboard