A web scraper for BBC News articles, built with Scrapy. This is a small personal project; please do not use it for commercial purposes.
NOTE: Requires virtualenv and virtualenvwrapper.
- Clone this repository and install Scrapy:
$ git clone https://github.com/Ocre42/news-scraper.git
$ cd news-scraper/Scraper
$ pip install Scrapy
- Identify yourself by setting USER_AGENT in the Scrapy settings
- Make sure ROBOTSTXT_OBEY is True
- You can adjust DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED; the default delay should be 1 second per download
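The settings named above normally live in a Scrapy project's settings.py. The excerpt below is a minimal sketch, not the repository's actual file: it assumes the file sits in the inner Scraper directory, and the user agent name and contact URL are placeholders to replace with your own.

```python
# Scraper/settings.py (illustrative excerpt; values are assumptions)

# Identify yourself to the site. The name and contact URL are placeholders.
USER_AGENT = "news-scraper (+https://example.com/contact)"

# Respect the rules published in the target site's robots.txt.
ROBOTSTXT_OBEY = True

# Base delay, in seconds, between requests to the same site.
DOWNLOAD_DELAY = 1

# Let Scrapy adapt the delay to how quickly the server responds.
# With AutoThrottle on, DOWNLOAD_DELAY acts as the minimum delay.
AUTOTHROTTLE_ENABLED = True
```

Raising DOWNLOAD_DELAY makes the crawl slower and gentler on the server; lowering it is not recommended.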
- Run the spider:
$ cd Scraper
$ scrapy crawl news
- Or save the results to a file, for example as JSON:
$ scrapy crawl news -o results.json
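A file written with -o results.json holds a JSON array with one object per scraped item. As a rough sketch of reading it back in Python (the helper name load_items is hypothetical, and no particular item fields are assumed, since they depend on the spider):

```python
import json
from pathlib import Path


def load_items(path="results.json"):
    """Return the list of scraped items stored in a Scrapy JSON feed file."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

For example, len(load_items()) gives the number of articles scraped in that run.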
Enjoy and crawl responsibly!