Skip to content

Latest commit

 

History

History
137 lines (98 loc) · 3.53 KB

README.md

File metadata and controls

137 lines (98 loc) · 3.53 KB

NEWS Project

A web scraping project built with Scrapy to collect news articles from Pakistani news websites (Dawn and Tribune), automated using Prefect.

Features

  • Scrapes news articles from websites like Dawn and Tribune.
  • Redis-based URL deduplication with 24-hour expiry
  • JSON export of scraped articles
  • Detailed logging system
  • Daily statistics tracking
  • Automated workflow using Prefect
  • Scheduled article scraping

Requirements

  • Python 3.7+
  • Redis
  • Docker (recommended) or WSL2 for Windows users
  • Prefect

Installation

# Clone repository
git clone https://github.com/waleedkaimkhani/NEWS_Project.git
cd NEWS_Project

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Start Redis
docker run --name redis -p 6379:6379 -d redis

# Start Prefect server
prefect server start

Usage

Manual Spider Execution

scrapy crawl dawn_latest
scrapy crawl tribune_latest

Run Parallel Spiders

To run multiple spiders in parallel, use the parallel_scrape.py script:

python news_scrapper/parallel_scrape.py

Automated Pipeline

# Start Prefect agent
prefect agent start -q default

# Deploy workflow
python deployment.py

# View runs in Prefect UI
http://localhost:4200

Prefect Pipeline

The project uses Prefect for workflow automation:

  • Scheduled scraping every 12 hours
  • Parallel execution of spiders
  • Error handling and retries
  • Email notifications for failures
  • Monitoring through Prefect UI

Output

  • Articles saved as JSON in data/ directory
  • Logs stored in logs/ directory
  • Statistics saved in stats/ directory
  • Pipeline runs visible in Prefect UI

Project Structure


## Project Structure

Project Structure: NEWS_Project

NEWS_Project/
├── news_scrapper/               # Main module for web scraping
│   ├── spiders/                 # Spiders for scraping websites
│   ├── items.py                 # Defines data models for scraped items
│   ├── pipelines.py             # Data processing pipelines
│   ├── settings.py              # Scrapy project settings
│   ├── middlewares.py           # Middleware for custom behaviors
│   ├── parallel_scrape.py       # Script to run spiders in parallel
├── logs/                        # Directory for log files
├── data/                        # Directory where JSON files are stored
├── stats/                       # Directory for stats (e.g., number of articles scraped)
├── news_pipeline.py             # Prefect pipeline script to run scrapers and store data in PostgreSQL
├── deployment.py                # Prefect flow scheduling script
├── requirements.txt             # Python dependencies
├── scrapy.cfg                   # Scrapy configuration file

Future Enhancements

  • Sentiment analysis of scraped articles.
  • Bias and propaganda detection using NLP models.
  • Integration with a dashboard to visualize trends in news sentiment.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Scrapy for the web scraping framework.
  • Local news websites for providing data.
  • prefect for open source workflow orchectration
  • postgress for open source relational DB

Feel free to contribute or raise issues to improve this project!