This repo contains documentation and scripts for how the M.E. Grenander Department of Special Collections & Archives connects ArchivesSpace, ArcLight, and Hyrax and keeps everything in sync. It contains:

- Documentation for uploading digital objects in Hyrax using existing archival description
- Overnight export and indexing scripts that update data between each service

Updated documentation for this repo is on our documentation site.
## Uploading Digital Objects to Hyrax
- Go to Hyrax and log in, or create an account and request upload access.
- Let Greg know when you create an account and return when you have upload permissions.
- Once you have upload permissions, go to ArcLight and find the record that represents the digital object you want to upload. From the URI, copy the long string of letters and numbers right after “aspace_”. This is the unique ArchivesSpace ID for that record.
- Note that the collection ID is in the URI as well.
- In your Dashboard, select “Works” on the left side menu
- Select the “Add new work” button on the right side
- For most cases, select “Digital Archival Objects” and then the “Create Work” button.
- In the “Descriptions” tab, enter only the ArchivesSpace ID and the collection number.
- Select the “Load Record” button to pull additional metadata from ArcLight (this is done client-side with JavaScript).
- Add any additional metadata. “Resource Type” and “Rights Statement” are required, while the “Additional fields” are not.
- In the “Files” tab, browse and upload any files represented by the Arclight record. These can be PDFs, Office documents (doc, docx, ppt, xlsx, etc.), or any image file.
- Select the Visibility of the work on the right side, and Save the work.
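The ID-copying step above can be sketched in Python; the URL below is a hypothetical ArcLight component URI, shown only to illustrate the pattern:

```python
import re

def aspace_id_from_url(url: str) -> str:
    """Return the ArchivesSpace ID embedded in an ArcLight component URI.

    ArcLight component URLs end in a fragment like 'aspace_<hexstring>';
    the part after 'aspace_' is the unique ArchivesSpace ID.
    """
    match = re.search(r"aspace_([0-9a-f]+)", url)
    if match is None:
        raise ValueError(f"no aspace_ ID found in {url!r}")
    return match.group(1)

# hypothetical ArcLight URL for illustration
url = "https://archives.example.edu/catalog/apap101aspace_8a9fe1d58e50ddc5ab54b4c51d7bc225"
print(aspace_id_from_url(url))  # -> 8a9fe1d58e50ddc5ab54b4c51d7bc225
```

The collection ID (`apap101` here) sits just before the `aspace_` fragment, which is why the steps above have you note it from the same URI.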
## Overnight Export and Indexing Scripts

- Each night, `exportPublicData.py` uses ArchivesSnake to query ArchivesSpace for resources updated since the last run.
- For collections with the complete set of DACS-minimum elements, it exports EAD 2002 files; for collections with only abstracts and extents, it saves them to pipe-delimited CSVs.
- It also builds a CSV of local subjects and collection IDs.
- All of this data is pushed to GitHub.
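A minimal sketch of that nightly query with ArchivesSnake; the base URL, credentials, and repository number are placeholders, not this repo's actual configuration:

```python
from datetime import datetime, timezone

def modified_since_param(last_run_iso: str) -> int:
    """Convert a saved last-run time (ISO 8601, treated as UTC) to the
    Unix timestamp the ArchivesSpace API expects for modified_since."""
    dt = datetime.fromisoformat(last_run_iso).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def fetch_updated_resource_ids(last_run_iso: str) -> list:
    """Ask ArchivesSpace for the IDs of resources modified since the last run."""
    from asnake.client import ASnakeClient  # ArchivesSnake

    client = ASnakeClient(baseurl="http://localhost:8089",     # placeholder
                          username="admin", password="admin")  # placeholder
    client.authorize()
    response = client.get("repositories/2/resources",
                          params={"all_ids": True,
                                  "modified_since": modified_since_param(last_run_iso)})
    return response.json()
```

Each returned ID can then be fetched individually to decide whether the description is complete enough for EAD or falls back to the CSV export.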
- `exportPublicData.py` runs `staticPages.py` when it's finished, which builds static browse pages for all collections, including a complete A-Z list, alpha lists for each collecting area, and pages for each local subject.
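As an illustration of the A-Z list build (a sketch, not `staticPages.py`'s actual code), grouping collection titles by first letter might look like:

```python
from itertools import groupby

def az_groups(titles):
    """Group collection titles by first letter for an A-Z browse list."""
    ordered = sorted(titles, key=str.upper)  # case-insensitive sort
    return {letter: list(group)
            for letter, group in groupby(ordered, key=lambda t: t[0].upper())}

print(az_groups(["beta papers", "Alpha records", "apple files"]))
```

Each letter's group becomes one section of the static browse page.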
- Later, collection data is updated with `git pull`, and `indexNewEAD.sh` indexes EAD files updated in the past day (using `find -mtime -1`) into the ArcLight Solr instance.
- There are also additional indexing shell scripts for ad hoc updates:
  - `indexAllEAD.sh` reindexes all EAD files
  - `indexOneEAD.sh` indexes one EAD by collection ID (`./indexOneEAD.sh apap101`)
  - `indexOneNDPA.sh` indexes one NDPA EAD file; this is necessary because NDPA files share the same collection ID prefixes
  - `indexNewNoLog.sh` indexes one EAD file, but logs to stdout instead of a log file
  - `indexOneURL.sh` indexes via a URL instead of from disk (not actively used)
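The `find -mtime -1` selection can be mirrored in Python (a sketch; the directory path is whatever holds the exported EAD files):

```python
import time
from pathlib import Path

def ead_modified_within(root, days=1.0):
    """Return EAD XML files under `root` modified within the last `days`
    days -- the same selection indexNewEAD.sh makes with `find -mtime -1`."""
    cutoff = time.time() - days * 86400
    return sorted(p for p in Path(root).rglob("*.xml")
                  if p.stat().st_mtime >= cutoff)
```

Each matching file would then be posted to the ArcLight Solr instance by the indexer.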
- Finally, `processNewUploads.py` queries the Hyrax Solr index for new uploads that are connected to ArchivesSpace ref_ids but do not yet have accession numbers.
- It downloads the new binaries and metadata and creates basic Archival Information Packages (AIPs) using bagit-python.
- It then uses ArchivesSnake to add a new digital object record in ArchivesSpace that links to the object in Hyrax.
- Last, it adds a new accession ID in Hyrax.
- (Also check out Noah Huffman's talk, which probably does this better [Direct Link].)
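The Solr query for those new uploads might be built like this; the field names are illustrative placeholders, not Hyrax's actual schema:

```python
def new_upload_query_params(rows=100):
    """Solr parameters for works that carry an ArchivesSpace ref_id but no
    accession number yet. Field names here are placeholders for whatever
    the local Hyrax Solr schema actually uses."""
    return {
        "q": "*:*",
        "fq": [
            "aspace_refid_tesim:[* TO *]",        # linked to ArchivesSpace
            "-accession_number_tesim:[* TO *]",   # but not yet accessioned
        ],
        "fl": "id,aspace_refid_tesim,title_tesim",
        "rows": rows,
        "wt": "json",
    }
```

Works matching both filters are the ones that still need an AIP, a digital object record, and an accession ID.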
- A simple library that converts POSIX timestamps and ISO 8601 dates to DACS-compliant display dates. `exportPublicData.py` uses this to make dates for the static browse pages.
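A simplified sketch of that conversion (not the library's actual code), covering the common ISO 8601 forms:

```python
from datetime import datetime

def iso_to_dacs(iso_date):
    """Render an ISO 8601 date (YYYY, YYYY-MM, or YYYY-MM-DD) as a
    DACS-style display date, e.g. "1999-07-04" -> "1999 July 4"."""
    parts = iso_date.split("-")
    if len(parts) == 1:                    # year only
        return parts[0]
    if len(parts) == 2:                    # year and month
        return datetime.strptime(iso_date, "%Y-%m").strftime("%Y %B")
    dt = datetime.strptime(iso_date, "%Y-%m-%d")
    return f"{dt.year} {dt.strftime('%B')} {dt.day}"
```

A full implementation would also handle POSIX timestamps and date ranges, which this sketch omits.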
- Queries the Bing background image API each night to display new background images for ArchivesSpace and Find-It, just for fun.
Example crontab:

```
# get new image from Bing
0 2 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/image_a_day.py 1>> /media/SPE/indexing-logs/image_a_day.log 2>&1 && pyenv deactivate
# export data from ASpace
0 0 * * * source /home/user/.bashrc; pyenv activate aspaceExport && python /opt/lib/ArchivesSpace-ArcLight-Workflow/exportPublicData.py 1>> /media/SPE/indexing-logs/export.log 2>&1 && pyenv deactivate
# pull new EADs from GitHub
30 0 * * * echo "$(date) $line git pull" >> /media/SPE/indexing-logs/git.log && git --git-dir=/opt/lib/collections/.git --work-tree=/opt/lib/collections pull 1>> /media/SPE/indexing-logs/git.log 2>&1
# Index modified apap collections
5 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "apap"
# Index modified ua collections
15 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ua"
# Index modified ndpa collections
25 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ndpa"
# Index modified ger collections
35 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "ger"
# Index modified mss collections
45 1 * * * /opt/lib/ArchivesSpace-ArcLight-Workflow/indexNewEAD.sh "mss"
# Download new Hyrax uploads and create new ASpace digital objects
0 2 * * * source /home/user/.bashrc; pyenv activate processNewUploads && python /opt/lib/ArchivesSpace-ArcLight-Workflow/processNewUploads.py 1>> /media/SPE/indexing-logs/processNewUploads.log 2>&1 && pyenv deactivate
```