
Write spider generators #32

Open · 3 tasks
ojongerius opened this issue Oct 29, 2017 · 0 comments


For (most) spiders, the only things that need doing are:

  • Define the SITE_NAME, SITE_URL and COUNTRY.
  • Create a class with a logical name that does not exist yet. Set the allowed domains and the start URL in the parse() function, and define XPath queries to scrape the data in parse_job_page and parse_org_page (see the sketch below).
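
For illustration, here is a minimal sketch of what such a spider might look like, assuming the project's spiders are Scrapy spiders; the class name, domain, URLs and XPath expressions are placeholders, not taken from an existing spider:

```python
import scrapy

SITE_NAME = "A volunteering site"
SITE_URL = "www.example.com"
COUNTRY = "US"


class AVolunteeringSiteSpider(scrapy.Spider):
    name = "a_volunteering_site"
    allowed_domains = ["example.com"]  # placeholder domain
    start_urls = ["http://www.example.com/jobs"]  # placeholder start URL

    def parse(self, response):
        # Follow each job link on the listing page to its detail page.
        for href in response.xpath('//a[@class="job"]/@href').getall():
            yield response.follow(href, callback=self.parse_job_page)

    def parse_job_page(self, response):
        # XPath queries that scrape the job data.
        yield {
            "title": response.xpath("//h1/text()").get(),
            "description": response.xpath('//div[@class="description"]//text()').getall(),
        }
```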

There are a few ways we could skin this cat:

A script that generates a spider based on input, which would behave something like this:

> generate_spider.py --country US --site-name "A volunteering site" --site-url "www.example.com"
Have generated a spider for the US with site name "A volunteering site" and site url "www.example.com" in spiders/a_volunteering_site.py

If you want to configure your spider using JSON, create a file called spiders/a_volunteering_site.json containing CSS and/or XPath expressions to parse your website.
Alternatively, you can add these expressions in the code yourself.

Once you are happy with the results, add the generated Python code and JSON config and raise a PR.
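
A minimal sketch of what generate_spider.py could look like; the flag names match the example invocation above, while the template body and output path are assumptions:

```python
import argparse
import pathlib

# Hypothetical spider template; the real one would include allowed_domains,
# a start URL, and stub parse_job_page/parse_org_page methods.
TEMPLATE = '''import scrapy


class {class_name}(scrapy.Spider):
    """Spider for {site_name} ({site_url}), country: {country}."""
    name = "{slug}"
    # TODO: set allowed_domains and the start URL, and add XPath queries
    # in parse_job_page and parse_org_page.
'''


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--country", required=True)
    parser.add_argument("--site-name", required=True)
    parser.add_argument("--site-url", required=True)
    args = parser.parse_args()

    # Derive a module name and class name from the site name.
    slug = args.site_name.lower().replace(" ", "_")
    class_name = "".join(part.capitalize() for part in slug.split("_")) + "Spider"

    path = pathlib.Path("spiders") / f"{slug}.py"
    path.parent.mkdir(exist_ok=True)
    path.write_text(TEMPLATE.format(class_name=class_name, slug=slug,
                                    site_name=args.site_name,
                                    site_url=args.site_url,
                                    country=args.country))
    print(f'Have generated a spider for the {args.country} with site name '
          f'"{args.site_name}" and site url "{args.site_url}" in {path}')


if __name__ == "__main__":
    main()
```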

Another approach could be to have one (or more) generic spiders that run depending on the arguments passed, so runs would look like this:

> generic_spider.py --config a_volunteering_site.json

Or, have a mode that looks for all JSON files in a config directory and runs a spider for each, like so:

> generic_spider.py --config-dir configs --skip a_spider_still_in_development
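
A minimal sketch of what the config-driven generic spider could look like; the JSON keys (allowed_domains, start_url, job_link_xpath, job_xpaths) are assumptions, not an agreed format:

```python
import json

import scrapy


class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, config=None, *args, **kwargs):
        # e.g. scrapy crawl generic -a config=configs/a_volunteering_site.json
        super().__init__(*args, **kwargs)
        with open(config) as f:
            self.cfg = json.load(f)
        self.allowed_domains = self.cfg["allowed_domains"]
        self.start_urls = [self.cfg["start_url"]]

    def parse(self, response):
        # Follow each job link named in the config to its detail page.
        for href in response.xpath(self.cfg["job_link_xpath"]).getall():
            yield response.follow(href, callback=self.parse_job_page)

    def parse_job_page(self, response):
        # Apply each configured XPath expression to the job page.
        yield {field: response.xpath(xpath).get()
               for field, xpath in self.cfg["job_xpaths"].items()}
```

The --config-dir mode would then just be a loop that loads every JSON file in the directory (minus the --skip list) and starts one such spider per config.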
| Option | Upside | Downside |
| --- | --- | --- |
| Code generation | Generated code makes it easy to customise or override behaviour. So far most logic has been in the pipeline, and the spiders fall into just two categories: web scraping or API "scraping". | Harder to maintain. |
| Code abstraction | Abstracting into one or two spiders will be easier to maintain, and it removes the need to know how to write Python, as you will just be configuring a spider. | Potentially harder to onboard new contributors. |

- [ ] Decide on abstraction vs generation
- [ ] Write a POC
- [ ] Merge code and update CONTRIBUTING to reflect the changes