
Custom Harvester Creation


Creating a custom harvester using scrAPI tools

Many harvesters for the SHARE project are written for providers with an OAI-PMH endpoint and can be written very quickly by creating an instance of an OAI harvester class. However, many other providers have a custom data output that requires a more custom implementation.

Here's how to create a custom harvester using tools provided within [scrapi](https://github.com/chrisseto/scrapi). For more information about scrapi, see the GitHub repo.

Here's what a typical custom harvester looks like:

consumer.py

## Consumer for the CrossRef metadata service
from __future__ import unicode_literals

import json
import requests
from datetime import date, timedelta

from dateutil.parser import parse
from nameparser import HumanName

from scrapi.linter import lint
from scrapi.linter.document import RawDocument, NormalizedDocument

NAME = 'crossref'

DEFAULT_ENCODING = 'UTF-8'

record_encoding = None


def copy_to_unicode(element):

    encoding = record_encoding or DEFAULT_ENCODING
    element = ''.join(element)
    if isinstance(element, unicode):
        return element
    else:
        return unicode(element, encoding=encoding)


def consume(days_back=0):
    base_url = 'http://api.crossref.org/v1/works?filter=from-pub-date:{},until-pub-date:{}&rows=1000'
    start_date = date.today() - timedelta(days_back)
    url = base_url.format(str(start_date), str(date.today()))
    print(url)
    data = requests.get(url)
    doc = data.json()

    records = doc['message']['items']

    doc_list = []
    for record in records:
        doc_id = record['DOI'] or record['URL']
        doc_list.append(RawDocument({
            'doc': json.dumps(record),
            'source': NAME,
            'docID': doc_id,
            'filetype': 'xml'
        }))

    return doc_list


def get_contributors(doc):
    contributor_list = []
    contributor_dict_list = doc.get('author') or []
    for entry in contributor_dict_list:
        # Build the full name from the CrossRef 'given'/'family' fields,
        # then let HumanName split it back into its parts
        full_name = '{} {}'.format(entry.get('given', ''), entry.get('family', ''))
        name = HumanName(full_name)
        contributor = {
            'prefix': name.title,
            'given': name.first,
            'middle': name.middle,
            'family': name.last,
            'suffix': name.suffix,
            'email': '',
            'ORCID': entry.get('ORCID') or ''
        }
        contributor_list.append(contributor)

    return contributor_list


def get_ids(doc, raw_doc):
    ids = {}
    ids['url'] = doc.get('URL')
    ids['doi'] = doc.get('DOI')
    ids['serviceID'] = raw_doc.get('docID')
    return ids


def get_properties(doc):
    properties = {
        'published-in': {
            'journalTitle': doc.get('container-title'),
            'volume': doc.get('volume'),
            'issue': doc.get('issue')
        },
        'publisher': doc.get('publisher'),
        'type': doc.get('type'),
        'ISSN': doc.get('ISSN'),
        'ISBN': doc.get('ISBN'),
        'member': doc.get('member'),
        'score': doc.get('score'),
        'issued': doc.get('issued'),
        'deposited': doc.get('deposited'),
        'indexed': doc.get('indexed'),
        'page': doc.get('page'),
        'issue': doc.get('issue'),
        'volume': doc.get('volume'),
        'referenceCount': doc.get('reference-count'),
        'updatePolicy': doc.get('update-policy'),
        'depositedTimestamp': (doc.get('deposited') or {}).get('timestamp')
    }
    return properties


def get_tags(doc):
    tags = (doc.get('subject') or []) + (doc.get('container-title') or [])
    return [tag.lower() for tag in tags]


def get_date_updated(doc):
    issued_date_parts = doc['issued'].get('date-parts') or []
    date_string = ' '.join([str(part) for part in issued_date_parts[0]])
    isodateupdated = parse(date_string).isoformat()
    return copy_to_unicode(isodateupdated)


def normalize(raw_doc):
    doc_str = raw_doc.get('doc')
    doc = json.loads(doc_str)

    normalized_dict = {
        'title': (doc.get('title') or [''])[0],
        'contributors': get_contributors(doc),
        'properties': get_properties(doc),
        'description': (doc.get('subtitle') or [''])[0],
        'id': get_ids(doc, raw_doc),
        'source': NAME,
        'dateUpdated': get_date_updated(doc),
        'tags': get_tags(doc)
    }

    return NormalizedDocument(normalized_dict)


if __name__ == '__main__':
    print(lint(consume, normalize))

scrapi
Before you can import the scrapi linter objects, you'll need to install scrapi first. You can do so with pip, using the command shown below.
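
From within your harvester's virtual environment:

pip install git+http://github.com/chrisseto/scrapi.git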

consume()
The consume function takes an optional argument for the number of days back you'd like your harvester to gather results. We built this in so that the harvester can be run at different frequencies, reaching further back in time if need be.

Your consume function should return a list of RawDocument objects, which consist of the individual result, the source name, a unique doc id, and the filetype.
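
As a rough development check (not part of the harvester itself), you can call consume directly and inspect one of the RawDocument objects; the .get() calls below use the same field names set in the example consume function above:

raw_docs = consume(days_back=3)
print(len(raw_docs))

# RawDocument supports dictionary-style access to the fields set in consume()
first = raw_docs[0]
print(first.get('docID'))
print(first.get('source'))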

Smaller helper functions
Each harvester will have smaller helper functions that grab different parts of the information (a usage sketch follows this list). These functions include:

  • get_ids()
  • get_date_created()
  • get_date_updated()
  • get_tags()
  • get_contributors()
  • get_properties()
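
To see what these helpers produce, you can call them on a hand-built record shaped like the CrossRef items handled above. A minimal sketch (the sample values are made up for illustration):

sample_record = {
    'DOI': '10.1234/example.doi',
    'URL': 'http://dx.doi.org/10.1234/example.doi',
    'author': [{'given': 'Jane', 'family': 'Doe'}],
    'subject': ['Biology'],
    'container-title': ['Example Journal'],
    'issued': {'date-parts': [[2015, 1, 26]]}
}

print(get_contributors(sample_record))  # one contributor dict for Jane Doe
print(get_tags(sample_record))          # ['biology', 'example journal']
print(get_date_updated(sample_record))  # ISO 8601 string for 2015-01-26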

normalize()
Your normalize function will be passed a raw document, along with a timestamp from when the file was actually consumed.

Your normalize function should return a NormalizedDocument, constructed from a dictionary with fields that conform to the scrAPI schema.

Use the smaller helper functions to populate the dictionary that will become the normalized document.
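
For a quick end-to-end check during development (lint, described next, does a stricter version of this against the schema), you can normalize everything a short consume run returns:

raw_docs = consume(days_back=1)
normalized_docs = [normalize(doc) for doc in raw_docs]
print(len(normalized_docs))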

lint()
The lint function will check the output of your consume and normalize functions, ensuring that they output documents of the correct type, in a format that conforms to the current version of the scrAPI schema. Please run lint on your harvester before submitting a pull request to scrapi. Lint will return error messages letting you know which parts of your harvester should be altered. If everything looks good with your harvester output, the lint function will let you know!

__init__.py

Along with your main consumer.py Python script, you should also include an __init__.py file with the following contents.

__init__.py

from consumer import consume, normalize

config.json

Along with your harvester, you should also include a JSON configuration file. For now, this file should include the days you'd like your harvester to run, along with the hour and minute it should run. It should also include the long name of the harvester (the way you'd like it to appear in scrAPI search results), the short name (all one word, in lowercase), the GitHub URL where your harvester lives, and the file format of your raw docs.

{
    "days": "mon-sun",
    "hour": 23,
    "minute": 59,
    "longName": "CrossRef",
    "shortName": "crossref",
    "url": "https://github.com/erinspace/CrossRef.git",
    "fileFormat": "xml"
}

requirements.txt

You should also include a requirements.txt file containing the Python libraries needed to run your harvester. You can generate it from within your harvester's virtual environment with pip freeze > requirements.txt. Remove the entry for scrapi from this file; you'll add it to dev-requirements.txt in the next section.

lxml==3.4.0
requests==2.4.1
nameparser==0.3.3
python-dateutil==2.2

dev-requirements.txt

Here, you'll include development requirements for your harvester, including the scrapi linting tools. Make sure the entry for scrapi reads git+http://github.com/chrisseto/scrapi.git.

git+http://github.com/chrisseto/scrapi.git
-r requirements.txt
