EV Opinion Search Engine

A comprehensive search engine for analyzing and exploring opinions about electric vehicles from news sources worldwide.

Overview

The EV Opinion Search Engine is a specialized tool for collecting, analyzing, and searching public opinions about electric vehicles. It combines web crawling, information retrieval, sentiment analysis, and topic modeling to provide a complete solution for understanding consumer perceptions of EVs.

Key features:

Crawls and collects EV-related articles from news sources via News API
Indexes content for fast and efficient searching
Analyzes sentiment (positive, negative, neutral)
Identifies topics and entities mentioned in opinions
Provides a user-friendly web interface with data visualizations

Project Structure

SC4021/
│
├── config/               # Configuration files
│   ├── app_config.py     # Application settings
│   └── solr_schema.xml   # Solr schema definition
│
├── crawler/              # Data collection module
│   ├── newsapi_crawler.py # News API data collector
│   └── data_cleaner.py   # Text cleaning utilities
│
├── indexing/             # Search indexing module
│   ├── solr_indexer.py   # Solr indexing functionality
│   └── search_utils.py   # Search utility functions
│
├── classification/       # Opinion analysis module
│   ├── classifier.py     # Sentiment and topic analysis
│   └── evaluation.py     # Evaluation and annotation tools
│
├── web/                  # Web interface
│   ├── app.py            # Flask application
│   ├── templates/        # HTML templates
│   └── static/           # Static assets (CSS, JS)
│
├── scripts/              # Command-line scripts
│   ├── run_crawler.py    # Run the crawler
│   ├── run_indexer.py    # Run the indexer
│   ├── run_evaluation.py # Run evaluation tools
│   └── run_webapp.py     # Run the web application
│
├── data/                 # Data directory (not in version control)
│   ├── raw/              # Raw crawled data
│   ├── processed/        # Processed data
│   └── evaluation/       # Evaluation datasets
│
└── requirements.txt      # Python dependencies

Installation

Prerequisites

Python 3.8 or higher
Apache Solr 8.11 or higher
News API key (free tier available)

Setup

Clone the repository:

git clone https://github.com/alaneel/SC4021.git
cd SC4021

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Download required models:

python -m spacy download en_core_web_sm
python -m nltk.downloader punkt

Set up environment variables:

# News API credentials (required for crawler)
export NEWSAPI_API_KEY="your_api_key"

# Solr configuration (optional, defaults provided)
export SOLR_URL="http://localhost:8983/solr"
export SOLR_COLLECTION="ev_opinions"

Set up Solr:
- Download and install Apache Solr from https://solr.apache.org/
- Create a collection using the schema provided in config/solr_schema.xml

Usage

Data Collection

To crawl EV opinions from news sources:

python scripts/run_crawler.py --limit 100 --preprocess

Options:

--queries: Specify search queries (default: from config)
--limit: Maximum articles per query (default: 100)
--days: Number of days to look back (default: 30)
--language: Language of articles (default: en)
--preprocess: Apply text preprocessing to clean the data
--output: Specify output file name
--api-key: Specify News API Key (or use environment variable)

Indexing

To index the collected data to Solr:

python scripts/run_indexer.py --latest --classify

Options:

--input: Specify input CSV file
--latest: Use the most recent data file
--classify: Run sentiment analysis and topic modeling before indexing
--clear: Clear existing index before indexing
--optimize: Optimize the index after indexing

Classification and Evaluation

To create an annotation dataset:

python scripts/run_evaluation.py annotate --latest --samples 1000

To evaluate the classifier on annotated data:

python scripts/run_evaluation.py evaluate --input data/evaluation/ev_opinions_annotation.csv

To train the classifier models:

python scripts/run_evaluation.py train --input data/evaluation/ev_opinions_annotation.csv

Web Application

To run the web interface:

python scripts/run_webapp.py

Then open your browser and navigate to http://localhost:5000.

Classification Approach

The classification module implements a hybrid approach combining:

Sentiment Analysis: Uses a fine-tuned DistilBERT model to classify opinions as positive, negative, or neutral
Topic Modeling: Uses Latent Dirichlet Allocation (LDA) to discover hidden topics in the corpus
Entity Recognition: Identifies EV-related entities such as brands, models, and components

Web Interface Features

The web interface provides:

Full-text search with faceted navigation
Sentiment distribution visualization
Opinion timeline charts
Interactive word clouds
Filtering by sentiment, date, source, topics, and entities

News API Usage and Limits

When using the News API for data collection, be aware of the following:

Rate Limits: News API free tier allows 100 requests per day with up to 100 results per request, which is sufficient for collecting around 10,000 articles daily.
Data Timeframe: Free tier access limits searches to articles published in the last month only.
Search Limitations: Some filtering capabilities (like by source domain or language) may be limited compared to the paid tier.
Attribution Requirements: When displaying news content, proper attribution to the source is required according to News API terms of service.

The crawler includes automatic rate limit handling and will save intermediate results to prevent data loss if limits are reached.

Development

Adding New Data Sources

To add a new data source, create a new crawler in the crawler module following the pattern established by newsapi_crawler.py.

Extending Classification

To add new classification capabilities:

Extend the EVOpinionClassifier class in classification/classifier.py
Add evaluation methods in classification/evaluation.py
Update the indexing script to include the new attributes

License

MIT License

Acknowledgements

This project uses the Transformers library by Hugging Face for sentiment analysis
Topic modeling is implemented using Gensim
Web interface built with Flask and Bootstrap
News API for data collection

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.idea		.idea
classification		classification
config		config
crawler		crawler
indexing		indexing
scripts		scripts
tests		tests
web		web
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EV Opinion Search Engine

Overview

Project Structure

Installation

Prerequisites

Setup

Usage

Data Collection

Indexing

Classification and Evaluation

Web Application

Classification Approach

Web Interface Features

News API Usage and Limits

Development

Adding New Data Sources

Extending Classification

License

Acknowledgements

About

Releases

Packages

Languages

License

Alaneel/SC4021

Folders and files

Latest commit

History

Repository files navigation

EV Opinion Search Engine

Overview

Project Structure

Installation

Prerequisites

Setup

Usage

Data Collection

Indexing

Classification and Evaluation

Web Application

Classification Approach

Web Interface Features

News API Usage and Limits

Development

Adding New Data Sources

Extending Classification

License

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages