Harvesting metadata from ESDAC

ESDAC is a Drupal website with dedicated sections for datasets, maps and documents. This folder contains two scripts which together bring the ESDAC records into SWR.

fetch.py

Fetches the HTML pages and stores them in the PostgreSQL database.

For datasets, the first 5 list pages are collected; the relevant page links are scraped from each listing, and then each link is fetched.
For maps (EUDASM) and documents, there are no child pages, so the metadata is scraped directly from the list page.
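
The link-scraping step for datasets could look roughly like the sketch below. It is not the code in fetch.py: the names LinkCollector and dataset_links, the "/content/" path prefix and the example.org base URL are assumptions made for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def dataset_links(listing_html, base_url, prefix="/content/"):
    """Return the unique links on a list page whose path starts with `prefix`.

    The prefix is a hypothetical way to tell dataset pages apart from
    navigation links; the real site layout may differ.
    """
    collector = LinkCollector(base_url)
    collector.feed(listing_html)
    wanted = [l for l in collector.links if urlparse(l).path.startswith(prefix)]
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(wanted))
```

Each returned link would then be fetched and its HTML stored, while for maps and documents the list page itself is the thing to store.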

parse.py

The parse script reads the stored HTML from the database and parses the content into Dublin Core metadata, which is written back to the database.
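
As an illustration of the HTML to Dublin Core mapping, here is a minimal sketch using only the standard library. The choice of source fields (the page title and the description meta tag), the "dc:" key names and the helper names PageFields and to_dublin_core are assumptions; parse.py may read different elements and emit more fields.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Pulls the <title> text and the description <meta> tag out of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def to_dublin_core(html, url):
    """Map a fetched page to a small Dublin Core style record (assumed keys)."""
    fields = PageFields()
    fields.feed(html)
    fields.close()
    return {
        "dc:identifier": url,
        "dc:title": fields.title.strip(),
        "dc:description": fields.description.strip(),
    }
```

The resulting dictionary stands in for the record that would be placed back into the database.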

Resume parameter

The harvest process should resume where it left off last time. This mechanism is controlled by the environment variable HV_RESUME (default: true). URLs requested in previous runs are loaded from the database, and each URL about to be requested is checked against this list, so that pages fetched earlier are not requested again.
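
The resume check can be sketched as follows. Only the variable name HV_RESUME and its default come from this README; the function names, the set of values treated as "off" and the in-memory list standing in for the database query are assumptions.

```python
import os


def resume_enabled(env=None):
    """Read HV_RESUME; anything other than an explicit 'off' value means resume.

    Which strings count as 'off' is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("HV_RESUME", "true").strip().lower()
    return value not in ("false", "0", "no")


def urls_to_fetch(candidates, already_fetched, resume=True):
    """Drop URLs that were requested in an earlier run when resuming."""
    if not resume:
        return list(candidates)
    seen = set(already_fetched)  # set lookup keeps the check cheap
    return [url for url in candidates if url not in seen]
```

For example, urls_to_fetch(["a", "b", "c"], ["b"]) keeps "a" and "c", while passing resume=False returns every candidate.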
