This project scrapes data from various news feeds.
-
Install development packages:
- On Fedora/RHEL:
# yum install libxslt-devel python-devel
- On Ubuntu:
# apt-get install libxml2-dev libxslt1-dev python-dev
-
Install pip and virtualenv:
# yum/apt-get install python-pip; pip install virtualenv
-
Go to the project root and run:
$ virtualenv venv; source venv/bin/activate
-
Install requirements:
$ pip install -r requirements.txt
-
Install scrapy (0.23 preferred) and scrapyd (see http://scrapyd.readthedocs.org/en/latest/install.html).
-
Ensure an Elasticsearch instance is running on port 9200.
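A quick way to check that it is reachable (assuming the default local setup):
$ curl http://localhost:9200/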
-
Ensure 'scrapyd' is running:
$ sudo service scrapyd start
- Note: refer to the execution steps below if you get an error here.
- If you haven't deployed the project yet, go to the project root and run:
$ scrapyd-deploy
- If you aren't running scrapyd as a service and just ran the above command, you'll get an error like this:
Deploying to project "qrator" in http://localhost:6800/addversion.json
Deploy failed: <urlopen error [Errno 111] Connection refused>
-
To avoid this, open two terminals:
-
In one of them, run:
$ scrapyd
-
In the second one:
$ scrapyd-deploy
-
scrapyd-deploy works directly because the deploy target is already defined in the scrapy.cfg file. Otherwise, you would have to run:
$ scrapyd-deploy default -p qrator
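For reference, the corresponding section in scrapy.cfg looks roughly like this (a sketch based on the default scrapyd URL and the project name used above):

[deploy]
url = http://localhost:6800/
project = qrator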
-
For various scheduling/crawl commands, check
scheduler/
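scrapyd also exposes an HTTP API for scheduling crawls directly; a minimal sketch, using a hypothetical spider name 'hackernews' (the actual spider names live in the project):
$ curl http://localhost:6800/schedule.json -d project=qrator -d spider=hackernews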
-
News sources include major outlets such as DiscoverMag, HN, NYTimes, HBR, TechCrunch, and so on.
-
http://localhost:6800/ -> Default scrapyd URL
-
http://doc.scrapy.org/en/latest/topics/architecture.html -> Scrapy architecture
-
Refer to TODO.md for current project status.
-
While mapping fields from crawled data in the spiders, try to keep the field names unique: synonymous source fields should map to a single canonical item field. For example:
- pubDate and updated carry the same meaning as 'published'
- summary is the same as description
- and so on. Refer to the current items.py; a sketch follows below.
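A minimal sketch of what such an item could look like (ArticleItem and the field list are illustrative; the actual items.py is authoritative):

from scrapy.item import Item, Field

class ArticleItem(Item):
    # One canonical field per concept: pubDate/updated map to
    # 'published', summary maps to 'description'.
    title = Field()
    link = Field()
    published = Field()
    description = Field()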