
Write spider generators #32

Open · 3 tasks
ojongerius opened this issue Oct 29, 2017 · 0 comments


For (most) spiders, the only things that need doing are:

  • Define the SITE_NAME, SITE_URL and COUNTRY.
  • Create a class with a logical name that does not exist yet. Set the allowed domains and the start URL in the parse() function, and define XPath queries to scrape the data in parse_job_page and parse_org_page (see the sketch below).
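
For illustration, here is a minimal sketch of what such a spider might look like, assuming the project's spiders are Scrapy spiders; the class name, domain, URLs and XPath expressions are placeholders, not taken from an existing spider:

```python
import scrapy

SITE_NAME = "A volunteering site"
SITE_URL = "www.example.com"
COUNTRY = "US"


class AVolunteeringSiteSpider(scrapy.Spider):
    name = "a_volunteering_site"
    allowed_domains = ["example.com"]  # placeholder domain
    start_urls = ["http://www.example.com/jobs"]  # placeholder start URL

    def parse(self, response):
        # Follow each job link on the listing page to its detail page.
        for href in response.xpath('//a[@class="job"]/@href').getall():
            yield response.follow(href, callback=self.parse_job_page)

    def parse_job_page(self, response):
        # XPath queries that scrape the job data.
        yield {
            "title": response.xpath("//h1/text()").get(),
            "description": response.xpath('//div[@class="description"]//text()').getall(),
        }
```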

There are a few ways we could skin this cat:

A script that generates a spider based on input, which would behave something like this:

> generate_spider.py --country US --site-name "A volunteering site" --site-url "www.example.com"
Have generated a spider for the US with site name "A volunteering site" and site url "www.example.com" in spiders/a_volunteering_site.py

If you want to configure your spider using JSON, create a file called spiders/a_volunteering_site.json containing CSS and/or XPath expressions to parse your website.
Alternatively, you can add these expressions in the code yourself.

Once you are happy with the results, add the generated Python code and JSON config and raise a PR.
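
A minimal sketch of what generate_spider.py could look like; the flag names match the example invocation above, while the template body and output path are assumptions:

```python
import argparse
import pathlib

# Hypothetical spider template; the real one would include allowed_domains,
# a start URL, and stub parse_job_page/parse_org_page methods.
TEMPLATE = '''import scrapy


class {class_name}(scrapy.Spider):
    """Spider for {site_name} ({site_url}), country: {country}."""
    name = "{slug}"
    # TODO: set allowed_domains and the start URL, and add XPath queries
    # in parse_job_page and parse_org_page.
'''


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--country", required=True)
    parser.add_argument("--site-name", required=True)
    parser.add_argument("--site-url", required=True)
    args = parser.parse_args()

    # Derive a module name and class name from the site name.
    slug = args.site_name.lower().replace(" ", "_")
    class_name = "".join(part.capitalize() for part in slug.split("_")) + "Spider"

    path = pathlib.Path("spiders") / f"{slug}.py"
    path.parent.mkdir(exist_ok=True)
    path.write_text(TEMPLATE.format(class_name=class_name, slug=slug,
                                    site_name=args.site_name,
                                    site_url=args.site_url,
                                    country=args.country))
    print(f'Have generated a spider for the {args.country} with site name '
          f'"{args.site_name}" and site url "{args.site_url}" in {path}')


if __name__ == "__main__":
    main()
```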

Another approach could be to have one (or more) generic spiders that run depending on the arguments passed, so runs would look like this:

> generic_spider.py --config a_volunteering_site.json

Or, have a mode that looks for all JSON files in a config directory and runs a spider for each, like so:

> generic_spider.py --config-dir configs --skip a_spider_still_in_development
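
A minimal sketch of what the config-driven generic spider could look like; the JSON keys (allowed_domains, start_url, job_link_xpath, job_xpaths) are assumptions, not an agreed format:

```python
import json

import scrapy


class GenericSpider(scrapy.Spider):
    name = "generic"

    def __init__(self, config=None, *args, **kwargs):
        # e.g. scrapy crawl generic -a config=configs/a_volunteering_site.json
        super().__init__(*args, **kwargs)
        with open(config) as f:
            self.cfg = json.load(f)
        self.allowed_domains = self.cfg["allowed_domains"]
        self.start_urls = [self.cfg["start_url"]]

    def parse(self, response):
        # Follow each job link named in the config to its detail page.
        for href in response.xpath(self.cfg["job_link_xpath"]).getall():
            yield response.follow(href, callback=self.parse_job_page)

    def parse_job_page(self, response):
        # Apply each configured XPath expression to the job page.
        yield {field: response.xpath(xpath).get()
               for field, xpath in self.cfg["job_xpaths"].items()}
```

The --config-dir mode would then just be a loop that loads every JSON file in the directory (minus the --skip list) and starts one such spider per config.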
| Option | Upside | Downside |
| --- | --- | --- |
| Code generation | Generated code makes it easy to customise or override behaviour. So far most logic has been in the pipeline, and the spiders fall into just two categories: web scraping or API "scraping". | Harder to maintain. |
| Code abstraction | Abstracting into one or two spiders will be easier to maintain, and it removes the need to know how to write Python, as you will just be configuring a spider. | Potentially harder to onboard new contributors. |

- [ ] Decide on abstraction vs generation
- [ ] Write a POC
- [ ] Merge code and update CONTRIBUTING to reflect the changes