A comprehensive search engine for analyzing and exploring opinions about electric vehicles from news sources worldwide.
The EV Opinion Search Engine is a specialized tool for collecting, analyzing, and searching public opinions about electric vehicles. It combines web crawling, information retrieval, sentiment analysis, and topic modeling to provide a complete solution for understanding consumer perceptions of EVs.
Key features:
- Crawls and collects EV-related articles from news sources via News API
- Indexes content for fast and efficient searching
- Analyzes sentiment (positive, negative, neutral)
- Identifies topics and entities mentioned in opinions
- Provides a user-friendly web interface with data visualizations
SC4021/
│
├── config/ # Configuration files
│ ├── app_config.py # Application settings
│ └── solr_schema.xml # Solr schema definition
│
├── crawler/ # Data collection module
│ ├── newsapi_crawler.py # News API data collector
│ └── data_cleaner.py # Text cleaning utilities
│
├── indexing/ # Search indexing module
│ ├── solr_indexer.py # Solr indexing functionality
│ └── search_utils.py # Search utility functions
│
├── classification/ # Opinion analysis module
│ ├── classifier.py # Sentiment and topic analysis
│ └── evaluation.py # Evaluation and annotation tools
│
├── web/ # Web interface
│ ├── app.py # Flask application
│ ├── templates/ # HTML templates
│ └── static/ # Static assets (CSS, JS)
│
├── scripts/ # Command-line scripts
│ ├── run_crawler.py # Run the crawler
│ ├── run_indexer.py # Run the indexer
│ ├── run_evaluation.py # Run evaluation tools
│ └── run_webapp.py # Run the web application
│
├── data/ # Data directory (not in version control)
│ ├── raw/ # Raw crawled data
│ ├── processed/ # Processed data
│ └── evaluation/ # Evaluation datasets
│
└── requirements.txt # Python dependencies
- Python 3.8 or higher
- Apache Solr 8.11 or higher
- News API key (free tier available)
- Clone the repository:
  git clone https://github.com/alaneel/SC4021.git
  cd SC4021
- Create and activate a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Download required models:
  python -m spacy download en_core_web_sm
  python -m nltk.downloader punkt
- Set up environment variables:
  # News API credentials (required for crawler)
  export NEWSAPI_API_KEY="your_api_key"
  # Solr configuration (optional, defaults provided)
  export SOLR_URL="http://localhost:8983/solr"
  export SOLR_COLLECTION="ev_opinions"
- Set up Solr:
  - Download and install Apache Solr from https://solr.apache.org/
  - Create a collection using the schema provided in config/solr_schema.xml
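As an illustration of how the environment variables above might be consumed, the following sketch reads them with fallbacks. It is not taken from config/app_config.py: the `Settings` class and `load_settings` function are hypothetical names, and the fallback values simply reuse the example values shown in the setup step, on the assumption that they match the project's defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; not the project's actual config class."""

    newsapi_api_key: Optional[str]
    solr_url: str
    solr_collection: str

    @property
    def solr_collection_url(self) -> str:
        # Join base URL and collection name without doubling the slash.
        return f"{self.solr_url.rstrip('/')}/{self.solr_collection}"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to assumed defaults."""
    return Settings(
        # Required by the crawler only; None means "not configured".
        newsapi_api_key=env.get("NEWSAPI_API_KEY"),
        # Assumed defaults: the example values from the setup instructions.
        solr_url=env.get("SOLR_URL", "http://localhost:8983/solr"),
        solr_collection=env.get("SOLR_COLLECTION", "ev_opinions"),
    )


settings = load_settings({})  # empty mapping: every value falls back
print(settings.solr_collection_url)
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.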
To crawl EV opinions from news sources:
python scripts/run_crawler.py --limit 100 --preprocess
Options:
- --queries: Specify search queries (default: from config)
- --limit: Maximum articles per query (default: 100)
- --days: Number of days to look back (default: 30)
- --language: Language of articles (default: en)
- --preprocess: Apply text preprocessing to clean the data
- --output: Specify output file name
- --api-key: Specify News API Key (or use environment variable)
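The options listed above map naturally onto an `argparse` declaration. The sketch below is one way such a command line could be declared; it is not the contents of scripts/run_crawler.py. The flag names and defaults come from the list above, while the choice of `nargs="+"` for `--queries` and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the crawler options described in this README (illustrative only)."""
    parser = argparse.ArgumentParser(description="Crawl EV-related news articles")
    # Assumption: several queries can be passed after a single --queries flag.
    parser.add_argument("--queries", nargs="+", default=None,
                        help="search queries (default: taken from config)")
    parser.add_argument("--limit", type=int, default=100,
                        help="maximum articles per query")
    parser.add_argument("--days", type=int, default=30,
                        help="number of days to look back")
    parser.add_argument("--language", default="en",
                        help="language of articles")
    parser.add_argument("--preprocess", action="store_true",
                        help="apply text preprocessing to clean the data")
    parser.add_argument("--output", default=None,
                        help="output file name")
    parser.add_argument("--api-key", default=None,
                        help="News API key (otherwise read from the environment)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--limit", "50", "--preprocess"])
print(args.limit, args.days, args.preprocess)
```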
To index the collected data to Solr:
python scripts/run_indexer.py --latest --classify
Options:
- --input: Specify input CSV file
- --latest: Use the most recent data file
- --classify: Run sentiment analysis and topic modeling before indexing
- --clear: Clear existing index before indexing
- --optimize: Optimize the index after indexing
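To make the `--latest` option concrete, here is a minimal sketch of picking the most recently modified CSV file in a directory. How the indexer actually defines "most recent" is not documented here; using the file modification time, and the `latest_csv` helper itself, are assumptions.

```python
from pathlib import Path
from typing import Optional


def latest_csv(directory: Path) -> Optional[Path]:
    """Return the most recently modified *.csv in `directory`, or None if there is none."""
    candidates = list(directory.glob("*.csv"))
    if not candidates:
        return None
    # Assumption: "most recent" means the newest modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

For example, `latest_csv(Path("data/raw"))` would return the newest crawl output if the crawler writes its CSV files there, which is itself an assumption based on the directory layout above.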
To create an annotation dataset:
python scripts/run_evaluation.py annotate --latest --samples 1000
To evaluate the classifier on annotated data:
python scripts/run_evaluation.py evaluate --input data/evaluation/ev_opinions_annotation.csv
To train the classifier models:
python scripts/run_evaluation.py train --input data/evaluation/ev_opinions_annotation.csv
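Evaluating a three-class sentiment classifier against annotated data usually comes down to per-class precision, recall, and F1. This README does not state which metrics run_evaluation.py reports, so the sketch below is a generic illustration; the `per_class_scores` function and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("positive", "negative", "neutral")


def per_class_scores(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and F1 for each sentiment label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong for this item
            fn[g] += 1  # gold label was missed for this item

    scores = {}
    for label in LABELS:
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up labels, purely to exercise the function.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
for label, s in per_class_scores(gold, pred).items():
    print(label, s)
```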
To run the web interface:
python scripts/run_webapp.py
Then open your browser and navigate to http://localhost:5000.
The classification module implements a hybrid approach combining:
- Sentiment Analysis: Uses a fine-tuned DistilBERT model to classify opinions as positive, negative, or neutral
- Topic Modeling: Uses Latent Dirichlet Allocation (LDA) to discover hidden topics in the corpus
- Entity Recognition: Identifies EV-related entities such as brands, models, and components
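To give a feel for the entity recognition step, the sketch below tags EV-related terms with a plain lookup table and regular expressions. This is a deliberately simplified stand-in, not the method used in classification/classifier.py; the `EV_LEXICON` contents and the `find_entities` function are illustrative assumptions.

```python
import re
from typing import Dict, List, Tuple

# Illustrative lexicon only; the real module may use a different vocabulary or a trained model.
EV_LEXICON: Dict[str, List[str]] = {
    "brand": ["Tesla", "BYD", "Nissan"],
    "model": ["Model 3", "Leaf"],
    "component": ["battery", "charger", "motor"],
}


def find_entities(text: str) -> List[Tuple[str, str]]:
    """Return (label, term) pairs in order of appearance, matching whole words case-insensitively."""
    found = []
    for label, terms in EV_LEXICON.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), label, term))
    found.sort()  # order by position in the text
    return [(label, term) for _, label, term in found]


print(find_entities("The Tesla Model 3 battery lasts longer than the Nissan Leaf's."))
```

A lookup table like this misses inflected forms such as "batteries", which is one reason a statistical NER model is often preferred in practice.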
The web interface provides:
- Full-text search with faceted navigation
- Sentiment distribution visualization
- Opinion timeline charts
- Interactive word clouds
- Filtering by sentiment, date, source, topics, and entities
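Faceted navigation and filtering of this kind are typically expressed through Solr's standard query parameters (`q`, `fq`, `facet`, `facet.field`). The sketch below builds such a request URL without sending it. The field names `sentiment` and `source` are assumptions about the schema, and `build_search_params` and `build_search_url` are hypothetical helpers rather than functions from indexing/search_utils.py.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode


def _quote(value: str) -> str:
    """Wrap a filter value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_search_params(text: str,
                        sentiment: Optional[str] = None,
                        source: Optional[str] = None,
                        rows: int = 10) -> List[Tuple[str, str]]:
    """Build Solr query parameters with facet counts and optional filters."""
    params = [
        ("q", text or "*:*"),          # match everything when no text is given
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", "sentiment"),  # assumed field name
        ("facet.field", "source"),     # assumed field name
    ]
    if sentiment:
        params.append(("fq", f"sentiment:{_quote(sentiment)}"))
    if source:
        params.append(("fq", f"source:{_quote(source)}"))
    return params


def build_search_url(collection_url: str, params: List[Tuple[str, str]]) -> str:
    """Attach URL-encoded parameters to the collection's /select handler."""
    return f"{collection_url.rstrip('/')}/select?{urlencode(params)}"


params = build_search_params("charging", sentiment="negative")
print(build_search_url("http://localhost:8983/solr/ev_opinions", params))
```

Using a list of pairs rather than a dict allows repeated keys such as `facet.field` and `fq`, which Solr accepts multiple times in one request.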
When using the News API for data collection, be aware of the following:
- Rate Limits: The News API free tier allows 100 requests per day with up to 100 results per request, which puts the upper bound at about 10,000 articles per day.
- Data Timeframe: Free tier access limits searches to articles published in the last month only.
- Search Limitations: Some filtering capabilities (such as filtering by source domain or language) may be limited compared to the paid tier.
- Attribution Requirements: When displaying news content, proper attribution to the source is required according to the News API terms of service.
The crawler includes automatic rate limit handling and will save intermediate results to prevent data loss if limits are reached.
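The free-tier figures quoted above translate into a simple request budget. The sketch below works through that arithmetic; the constants repeat the limits stated in this README and may not reflect current News API terms, and the helper names are hypothetical.

```python
import math

# Limits as quoted in this README; check the News API terms for current values.
DAILY_REQUESTS = 100
RESULTS_PER_REQUEST = 100


def max_articles_per_day(requests: int = DAILY_REQUESTS,
                         per_request: int = RESULTS_PER_REQUEST) -> int:
    """Upper bound on articles per day if every request returns a full page."""
    return requests * per_request


def requests_needed(num_queries: int, limit_per_query: int,
                    page_size: int = RESULTS_PER_REQUEST) -> int:
    """Requests required when each query is paged in chunks of `page_size`."""
    return num_queries * math.ceil(limit_per_query / page_size)


def fits_daily_budget(num_queries: int, limit_per_query: int) -> bool:
    """True if the planned crawl stays within the daily request allowance."""
    return requests_needed(num_queries, limit_per_query) <= DAILY_REQUESTS


print(max_articles_per_day())    # 100 * 100
print(requests_needed(3, 250))   # 3 queries, 3 pages each
print(fits_daily_budget(120, 100))
```

With the default `--limit 100`, each query costs one request, so the number of distinct queries is the main thing to watch.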
To add a new data source, create a new crawler in the crawler module following the pattern established by newsapi_crawler.py.
To add new classification capabilities:
- Extend the EVOpinionClassifier class in classification/classifier.py
- Add evaluation methods in classification/evaluation.py
- Update the indexing script to include the new attributes
- This project uses the Transformers library by Hugging Face for sentiment analysis
- Topic modeling is implemented using Gensim
- The web interface is built with Flask and Bootstrap
- Data is collected through News API