NEWS Project

A web scraping project built with Scrapy to collect news articles from Pakistani news websites (Dawn and Tribune), automated using Prefect.

Features

Scrapes news articles from Dawn and Tribune
Redis-based URL deduplication with 24-hour expiry
JSON export of scraped articles
Detailed logging system
Daily statistics tracking
Automated workflow using Prefect
Scheduled article scraping
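
The Redis-based deduplication with a 24-hour expiry could look roughly like the sketch below. It is not taken from the project's code: the key prefix `seen_url:` and the helper names `url_key` and `is_new_url` are hypothetical. It relies only on the fact that redis-py's `Redis.set` accepts `nx=True` (write only if the key is absent) and `ex=<seconds>` (expire the key).

```python
import hashlib

DEDUP_TTL_SECONDS = 24 * 60 * 60  # 24-hour expiry, matching the feature list


def url_key(url: str) -> str:
    """Build a fixed-length key from a URL (the 'seen_url:' prefix is hypothetical)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"seen_url:{digest}"


def is_new_url(client, url: str, ttl: int = DEDUP_TTL_SECONDS) -> bool:
    """Return True the first time a URL is seen within the TTL window.

    `client` is assumed to behave like redis.Redis: SET with nx=True only
    writes when the key is absent, and ex=ttl makes the key expire.
    """
    # redis-py returns True when the key was written and None when it already existed.
    return bool(client.set(url_key(url), 1, nx=True, ex=ttl))
```

With the Redis container from the Installation section running, a client could be created with `redis.Redis(host="localhost", port=6379)` and passed as `client`. A spider would then skip any URL for which `is_new_url` returns `False`.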

Requirements

Python 3.7+
Redis
Docker (recommended) or WSL2 for Windows users
Prefect

Installation

# Clone repository
git clone https://github.com/waleedkaimkhani/NEWS_Project.git
cd NEWS_Project

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Start Redis
docker run --name redis -p 6379:6379 -d redis

# Start Prefect server
prefect server start

Usage

Manual Spider Execution

scrapy crawl dawn_latest
scrapy crawl tribune_latest

Run Parallel Spiders

To run multiple spiders in parallel, use the parallel_scrape.py script:

python news_scrapper/parallel_scrape.py
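
The contents of parallel_scrape.py are not shown here, so the following is only a sketch of one way to run both spiders at the same time: each `scrapy crawl` command is launched as a subprocess from a thread pool. The function names and the `runner` parameter are assumptions made for illustration; the actual script may use a different approach, such as Scrapy's own `CrawlerProcess`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Spider names taken from the manual commands above.
SPIDERS = ["dawn_latest", "tribune_latest"]


def build_command(spider: str) -> List[str]:
    """Return the command line that crawls a single spider."""
    return ["scrapy", "crawl", spider]


def run_spider(spider: str, runner: Callable = subprocess.run) -> int:
    """Run one spider and return its exit code (0 means success)."""
    completed = runner(build_command(spider), check=False)
    return completed.returncode


def run_all(spiders: List[str] = SPIDERS, runner: Callable = subprocess.run) -> Dict[str, int]:
    """Run every spider concurrently, one worker thread per spider."""
    with ThreadPoolExecutor(max_workers=len(spiders)) as pool:
        codes = pool.map(lambda name: run_spider(name, runner), spiders)
        return dict(zip(spiders, codes))
```

Calling `run_all()` from the project root (where scrapy.cfg lives) returns a mapping from spider name to exit code. Separate processes keep each crawl isolated, at the cost of starting one Python interpreter per spider.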

Automated Pipeline

# Start Prefect agent
prefect agent start -q default

# Deploy workflow
python deployment.py

# View runs in Prefect UI
http://localhost:4200

Prefect Pipeline

The project uses Prefect for workflow automation:

Scheduled scraping every 12 hours
Parallel execution of spiders
Error handling and retries
Email notifications for failures
Monitoring through Prefect UI

Output

Articles saved as JSON in data/ directory
Logs stored in logs/ directory
Statistics saved in stats/ directory
Pipeline runs visible in Prefect UI
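
As an illustration of how the data/ and stats/ outputs might be produced, here is a minimal sketch of a Scrapy item pipeline. It is not the project's pipelines.py: the class name `JsonExportPipeline`, the `<spider>_<date>.json` file naming, and the fields in the stats file are all assumptions. It uses only the standard pipeline hooks `open_spider`, `process_item`, and `close_spider`.

```python
import json
from datetime import date
from pathlib import Path


class JsonExportPipeline:
    """Hypothetical pipeline: buffers items, then writes a JSON file and a stats file."""

    def __init__(self, data_dir="data", stats_dir="stats"):
        self.data_dir = Path(data_dir)
        self.stats_dir = Path(stats_dir)
        self.items = []

    def open_spider(self, spider):
        # Make sure both output directories exist before the crawl starts.
        self.data_dir.mkdir(parents=True, exist_ok=True)
        self.stats_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        stamp = date.today().isoformat()
        # Assumed naming scheme: one file per spider per day.
        articles_path = self.data_dir / f"{spider.name}_{stamp}.json"
        articles_path.write_text(
            json.dumps(self.items, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        stats = {"spider": spider.name, "date": stamp, "articles": len(self.items)}
        stats_path = self.stats_dir / f"{spider.name}_{stamp}.json"
        stats_path.write_text(json.dumps(stats), encoding="utf-8")
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting in settings.py. Buffering every item in memory is acceptable for a few hundred articles per run; a larger crawl would more likely write items out incrementally.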

Project Structure


NEWS_Project/
├── news_scrapper/               # Main module for web scraping
│   ├── spiders/                 # Spiders for scraping websites
│   ├── items.py                 # Defines data models for scraped items
│   ├── pipelines.py             # Data processing pipelines
│   ├── settings.py              # Scrapy project settings
│   ├── middlewares.py           # Middleware for custom behaviors
│   └── parallel_scrape.py       # Script to run spiders in parallel
├── logs/                        # Directory for log files
├── data/                        # Directory where JSON files are stored
├── stats/                       # Directory for stats (e.g., number of articles scraped)
├── news_pipeline.py             # Prefect pipeline script to run scrapers and store data in PostgreSQL
├── deployment.py                # Prefect flow scheduling script
├── requirements.txt             # Python dependencies
└── scrapy.cfg                   # Scrapy configuration file

Future Enhancements

Sentiment analysis of scraped articles.
Bias and propaganda detection using NLP models.
Integration with a dashboard to visualize trends in news sentiment.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Scrapy for the web scraping framework.
Local news websites for providing data.
Prefect for open-source workflow orchestration.
PostgreSQL for the open-source relational database.

Feel free to contribute or raise issues to improve this project!
