For (most) spiders, the only things that need doing are (see the sketch after this list):
- Define the SITE_NAME, SITE_URL and COUNTRY.
- Create a spider class with a logical name that doesn't already exist. Set the allowed domains, the start URL, and the parse() function.
- Define XPath queries to scrape the data in parse_job_page and parse_org_page.
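As a rough illustration, here is a minimal sketch of such a spider, assuming a standard Scrapy project. The class name, URLs and XPath expressions are all placeholders, not the project's actual conventions:

```python
import scrapy


class AVolunteeringSiteSpider(scrapy.Spider):
    """Illustrative skeleton; every value below is a placeholder."""

    name = "a_volunteering_site"
    SITE_NAME = "A volunteering site"
    SITE_URL = "https://www.example.com"
    COUNTRY = "US"
    allowed_domains = ["www.example.com"]
    start_urls = ["https://www.example.com/volunteer-jobs"]

    def parse(self, response):
        # Follow each job listing through to its detail page.
        for href in response.xpath("//a[contains(@class, 'job')]/@href").getall():
            yield response.follow(href, callback=self.parse_job_page)

    def parse_job_page(self, response):
        # XPath queries that pull the job fields out of the page.
        yield {
            "title": response.xpath("//h1/text()").get(),
            "url": response.url,
            "country": self.COUNTRY,
        }

    def parse_org_page(self, response):
        # Same idea for the organisation page.
        yield {"org_name": response.xpath("//h2/text()").get()}
```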
There are a few ways we could skin this cat:
A script that generates a spider based on input, which would behave something like so:
```
> generate_spider.py --country US --site-name "A volunteering site" --site-url "www.example.com"
Generated a spider for the US with site name "A volunteering site" and site url "www.example.com" in spiders/a_volunteering_site.py
```
If you want to configure your spider using JSON, create a file called spiders/a_volunteering_site.json containing CSS and/or XPath expressions to parse your website.
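That config might look something like this. A sketch only; the field names and expressions are illustrative, not a fixed schema:

```json
{
  "job_page": {
    "title": "//h1/text()",
    "description": "//div[@class='description']//text()"
  },
  "org_page": {
    "name": "//h2[@class='org-name']/text()"
  }
}
```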
Alternatively, you can add these expressions directly in the code yourself.
Once you are happy with the results, add the generated Python code and JSON config and raise a PR.
Another approach could be to have one (or more) generic spiders that run depending on the arguments passed, so a run would look like so:
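For example (a sketch; the spider name and config argument are assumptions, though `scrapy crawl -a` is the standard way to pass arguments to a Scrapy spider):

```
> scrapy crawl generic_spider -a config=spiders/a_volunteering_site.json
```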
Or, have a mode that looks for all JSON files in a config folder and runs a spider for each, like so:
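A sketch of what that mode might look like; run_spiders.py and its flag are hypothetical:

```
> run_spiders.py --config-dir spiders/
Running spider for spiders/a_volunteering_site.json
```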
Code generation

Generated code will be easy to customise and to override behaviour in. So far most of the logic has lived in the pipeline, and the spiders fall into just two categories: web scraping or API "scraping".
Harder to maintain.
Code abstraction
Abstracting into one or two spiders will be easier to maintain, and it removes the need to know how to write Python, as you would just be configuring a spider.
Potentially harder to onboard new contributors.
- [ ] Decide on abstraction vs generation
- [ ] Write POC
- [ ] Merge code and update CONTRIBUTING to reflect the changes