This repository has been archived by the owner on Oct 10, 2023. It is now read-only.

frdr-dfdr/frdr_harvest

********** WARNING **********

THIS REPOSITORY IS NO LONGER MAINTAINED

USE AT YOUR OWN RISK

Please see: Lunaris Harvester

********** WARNING **********

FRDR Harvester

This is a repository crawler which outputs gmeta.json files for indexing by Globus. It currently supports harvesting with the following standards and APIs:

Configuration

Configuration is split into two files:

  • conf/harvester.conf controls the operation of the indexer.
  • conf/repos.json lists the repositories to be crawled, structured as shown below. For documentation of the repos.json properties, see repos_json_properties.md.
{
    "repos": [
        {
            "name": "FRDR",
            "type": "oai",
            "url": "https://frdr-dfdr.ca/oai/request",
            "homepage_url": "https://frdr-dfdr.ca/",
            "thumbnail": "https://frdr-dfdr.ca/discover/img/sources/frdr_80x80.png",
            "set": "col_storagegroup1",
            "metadataprefix": "frdr",
            "repo_refresh_days": 0,
            "enabled": true
        },
        {
            "name": "UBC Dataverse",
            "type": "dataverse",
            "url": "https://dataverse.scholarsportal.info/api/dataverses/%id%/contents",
            "homepage_url": "https://dataverse.scholarsportal.info/dataverse/ubc",
            "thumbnail": "https://frdr-dfdr.ca/discover/img/sources/ubccrest_80x80.png",
            "set": 71618,
            "enabled": true
        }
    ]
}
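
As a rough illustration of how a repos.json file with this shape might be consumed, the Python sketch below parses a trimmed copy of the example and keeps only the entries marked as enabled. The helper names (enabled_repos, resolve_url) are hypothetical, and substituting the "set" value for the %id% placeholder is an assumption about how the URL template is meant to be filled, not something confirmed by the harvester's code.

```python
import json

# Trimmed inline copy of the example above; in practice the text would be
# read from conf/repos.json.
REPOS_JSON = """
{
    "repos": [
        {
            "name": "FRDR",
            "type": "oai",
            "url": "https://frdr-dfdr.ca/oai/request",
            "set": "col_storagegroup1",
            "metadataprefix": "frdr",
            "enabled": true
        },
        {
            "name": "UBC Dataverse",
            "type": "dataverse",
            "url": "https://dataverse.scholarsportal.info/api/dataverses/%id%/contents",
            "set": 71618,
            "enabled": true
        }
    ]
}
"""


def enabled_repos(config_text):
    """Return the repository entries whose "enabled" flag is true."""
    config = json.loads(config_text)
    return [repo for repo in config.get("repos", []) if repo.get("enabled", False)]


def resolve_url(repo):
    """Hypothetical helper: fill the %id% placeholder with the "set" value.

    This substitution is an assumption made for illustration only.
    URLs without the placeholder are returned unchanged.
    """
    return repo["url"].replace("%id%", str(repo.get("set", "")))


if __name__ == "__main__":
    for repo in enabled_repos(REPOS_JSON):
        print(repo["name"], repo["type"], resolve_url(repo))
```

Treating a missing "enabled" key as false is a choice made for this sketch; repos_json_properties.md is the authority on the real defaults.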

You can run the crawler directly with harvester.py. It makes a single pass: it crawls all of the target domains, exports metadata, and exits.

You can also run it with --onlyharvest to skip the metadata export stage, or with --onlyexport to skip the crawling stage. The --export-format option selects one of two export formats: dataverse or gmeta. Use --only-new-records to export only the records that have changed since the last run.
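
For scheduled or scripted runs, a small wrapper can assemble the command line from the options described above. The following is a minimal sketch: the function name build_harvester_command and its keyword arguments are hypothetical, and passing the export format as a separate argument after --export-format is an assumption about how harvester.py parses its options.

```python
import sys

# The two export formats named in this README.
EXPORT_FORMATS = ("dataverse", "gmeta")


def build_harvester_command(only_harvest=False, only_export=False,
                            export_format=None, only_new_records=False):
    """Build an argument list for running harvester.py (hypothetical helper)."""
    if export_format is not None and export_format not in EXPORT_FORMATS:
        raise ValueError("unsupported export format: %r" % (export_format,))

    command = [sys.executable, "harvester.py"]
    if only_harvest:
        command.append("--onlyharvest")      # skip the metadata export stage
    if only_export:
        command.append("--onlyexport")       # skip the crawling stage
    if export_format is not None:
        # Assumption: the value is accepted as a separate argument.
        command.extend(["--export-format", export_format])
    if only_new_records:
        command.append("--only-new-records")
    return command


if __name__ == "__main__":
    # Print the command rather than running it.
    print(build_harvester_command(only_export=True, export_format="gmeta"))
```

The resulting list could be passed to subprocess.run(command, check=True) from the repository root, so that a failed run raises an error instead of passing silently.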

Supported database types are "sqlite" and "postgres"; the psycopg2 library is required for postgres support.
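
To show why psycopg2 is needed only for postgres, here is a minimal sketch of a connection factory that switches on the database type. The function name open_database and the "dbname" parameter for sqlite are hypothetical. The harvester's actual connection code and the setting names in conf/harvester.conf may differ.

```python
import sqlite3


def open_database(db_type, **params):
    """Open a DB-API connection for the given database type (illustrative)."""
    if db_type == "sqlite":
        # Hypothetical "dbname" parameter; defaults to an in-memory database.
        return sqlite3.connect(params.get("dbname", ":memory:"))
    if db_type == "postgres":
        # Imported lazily so that sqlite-only setups do not need psycopg2.
        import psycopg2
        # psycopg2.connect accepts keywords such as dbname, user, password,
        # host and port.
        return psycopg2.connect(**params)
    raise ValueError("unsupported database type: %r" % (db_type,))


if __name__ == "__main__":
    connection = open_database("sqlite")
    print(connection.execute("SELECT 1").fetchone())
    connection.close()
```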