Harvesting metadata from ESDAC

ESDAC is a Drupal website with dedicated sections for datasets, maps and documents. This folder contains two scripts which together bring the ESDAC records into SWR.

fetch.py

Fetches the HTML pages and stores them in the PostgreSQL database.

  • For datasets, the first 5 list pages are collected and the links to the individual dataset pages are scraped from each listing. Each linked page is then fetched.
  • For maps (EUDASM) and documents, there are no child pages, so the metadata is scraped directly from the list page.
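The link-scraping step for datasets can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the code in fetch.py: the `LinkCollector` class, the `extract_detail_links` function, the `/content/` path prefix and the sample markup are all assumptions made for the example, and the real ESDAC listing pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given path prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links under the assumed prefix, and skip duplicates.
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)


def extract_detail_links(listing_html, base_url, prefix):
    """Return absolute, de-duplicated detail-page URLs found in one listing page."""
    collector = LinkCollector(prefix)
    collector.feed(listing_html)
    return [urljoin(base_url, href) for href in collector.links]


# Hypothetical listing fragment; the real ESDAC markup may differ.
sample = (
    '<ul>'
    '<li><a href="/content/soil-a">A</a></li>'
    '<li><a href="/about">About</a></li>'
    '<li><a href="/content/soil-a">A again</a></li>'
    '</ul>'
)
links = extract_detail_links(sample, "https://example.org", "/content/")
```

Each URL returned this way would then be fetched and its HTML stored, while for maps and documents the listing HTML itself would be stored for parsing.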

parse.py

The parse script reads the stored HTML from the database, parses it into Dublin Core metadata and writes the result back to the database.
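As a rough illustration of what parsing HTML into Dublin Core can look like, the sketch below maps a page title and a description meta tag to `dc:title` and `dc:description`. The field mapping, the `DublinCoreExtractor` class and the `html_to_dublin_core` function are assumptions for this example; the selectors and elements that parse.py actually uses are not described here.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Pull a few Dublin Core-style fields out of a stored HTML page."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.record["dc:description"] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.record["dc:title"] = self.record.get("dc:title", "") + data


def html_to_dublin_core(html, url):
    """Return a dict of Dublin Core-style fields for one page (assumed mapping)."""
    extractor = DublinCoreExtractor()
    extractor.feed(html)
    extractor.close()
    record = {key: value.strip() for key, value in extractor.record.items()}
    record["dc:identifier"] = url  # use the source URL as the identifier
    return record


# Hypothetical page; real ESDAC pages carry more fields than this.
page = (
    '<html><head><title> Soil map </title>'
    '<meta name="description" content="A test record"></head>'
    '<body></body></html>'
)
record = html_to_dublin_core(page, "https://example.org/content/soil-map")
```

A dictionary like this could then be serialised and written back to the database row that holds the source HTML.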

Resume parameter

The harvest process should resume where it left off last time. This mechanism is controlled by the environment variable HV_RESUME (default: true). URLs requested in previous runs are read from the database, and each URL about to be requested is checked against this list so that pages fetched earlier are not requested again.
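The resume check could look roughly like the following sketch. Only the variable name HV_RESUME and its default of true come from this README; the function names, the set of strings treated as false, and the use of a plain list in place of the database query are assumptions for illustration.

```python
import os


def resume_enabled(environ=os.environ):
    """Read HV_RESUME; resuming is on unless it is explicitly switched off."""
    value = environ.get("HV_RESUME", "true")
    # Which strings count as "off" is an assumption of this sketch.
    return value.strip().lower() not in ("false", "0", "no")


def urls_to_fetch(candidates, already_fetched, resume):
    """Drop URLs seen in earlier runs when resuming; otherwise keep them all."""
    if not resume:
        return list(candidates)
    seen = set(already_fetched)  # in practice loaded from the database
    return [url for url in candidates if url not in seen]


pending = urls_to_fetch(
    ["https://example.org/a", "https://example.org/b"],
    ["https://example.org/a"],
    resume=True,
)
```

Holding the previously requested URLs in a set keeps each membership check cheap even when many pages have already been harvested.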