VuDL is the ingest and management platform that powers Villanova University's Digital Library.
This project provides an interface (driven by a JSON API) for managing review and pagination of jobs prior to their import into a digital repository, as well as additional tools for managing that repository after it has been populated. It is intended to be used in combination with Fedora Commons and VuFind®, among other dependencies (see below for further details).
For data ingestion purposes, this software assumes a two-tiered set of directories, in which the top tier represents categories and the second tier represents jobs.
Each category folder is expected to contain a batch-params.ini file that controls parameters for the category. For example:
[collection]
; PID of holding area object in Fedora:
destination = vudl:5
[ocr]
; Should we perform OCR on the jobs in this category?
ocr = 'true'
[pdf]
; Should we generate a PDF if none is already in the folder?
generate = 'true'
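To make the structure of this file concrete, here is a minimal sketch of how such a file could be read into sections and key/value pairs. The `parseIni` helper and the `IniData` type are hypothetical names introduced for illustration; this is not VuDL's actual parser, and it only handles the simple syntax shown above (section headers, `;` comments, and optionally quoted values).

```typescript
// Hypothetical helper for illustration only; VuDL's real INI handling may differ.
type IniData = Record<string, Record<string, string>>;

function parseIni(text: string): IniData {
  const result: IniData = {};
  let section = ""; // keys seen before any [section] header land under ""
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines and ";" comments.
    if (line === "" || line.startsWith(";")) continue;
    const header = /^\[(.+)\]$/.exec(line);
    if (header) {
      section = (header[1] ?? "").trim();
      result[section] = result[section] ?? {};
      continue;
    }
    const eq = line.indexOf("=");
    if (eq === -1) continue; // ignore anything that is not "key = value"
    const key = line.slice(0, eq).trim();
    // Strip one pair of matching surrounding quotes, so 'true' becomes true.
    const value = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, "$2");
    const target = result[section] ?? (result[section] = {});
    target[key] = value;
  }
  return result;
}

// The sample file from above, reproduced as a string.
const sampleParams = [
  "[collection]",
  "; PID of holding area object in Fedora:",
  "destination = vudl:5",
  "[ocr]",
  "; Should we perform OCR on the jobs in this category?",
  "ocr = 'true'",
  "[pdf]",
  "; Should we generate a PDF if none is already in the folder?",
  "generate = 'true'",
].join("\n");

const params = parseIni(sampleParams);
const ocrEnabled = params["ocr"]?.["ocr"] === "true";
```

Note that the values are strings: a setting such as `ocr = 'true'` has to be compared against the text `true` rather than treated as a boolean.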
For image-based jobs, each job folder is expected to contain TIFF images of a multi-page item. For example:
    /usr/local/holding
        /category1
            /batch-params.ini
            /job1
                0001.TIF
                0002.TIF
The software provides functionality for automatically generating JPEG derivatives of these TIFFs as well as assigning labels to the pages within the jobs. When all of this work has been completed, the finished data can be published to a repository. (Currently, this is designed for a Fedora 6-based repository.)
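The two-tier category/job layout described above can be sketched as a small grouping function. The `groupByCategoryAndJob` helper and the `HoldingTree` type are hypothetical names used only for illustration, and the sketch works on a plain list of path strings rather than reading the file system; it assumes `/`-separated paths.

```typescript
// Hypothetical helper: groups file paths under a holding area into
// category -> job -> files. Not part of VuDL itself.
type HoldingTree = Record<string, Record<string, string[]>>;

function groupByCategoryAndJob(root: string, paths: string[]): HoldingTree {
  const tree: HoldingTree = {};
  const prefix = root.endsWith("/") ? root : root + "/";
  for (const path of paths) {
    if (!path.startsWith(prefix)) continue; // outside the holding area
    const parts = path.slice(prefix.length).split("/");
    // Only category/job/file paths are job content; batch-params.ini lives
    // at the category level (two parts) and is skipped here.
    if (parts.length !== 3) continue;
    const [category, job, file] = parts as [string, string, string];
    const jobs = tree[category] ?? (tree[category] = {});
    const files = jobs[job] ?? (jobs[job] = []);
    files.push(file);
  }
  return tree;
}

const holdingTree = groupByCategoryAndJob("/usr/local/holding", [
  "/usr/local/holding/category1/batch-params.ini",
  "/usr/local/holding/category1/job1/0001.TIF",
  "/usr/local/holding/category1/job1/0002.TIF",
]);
```

For the example layout, `holdingTree` contains one category (`category1`) with one job (`job1`) holding the two TIFF files.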
Job folders can also include PDF files (for document-based jobs), FLAC files (for audio-based jobs) or AVI/MKV/MOV/MP4 files (for video-based jobs). Videos may optionally be accompanied by .txt or .vtt files containing transcripts -- the filenames just need to match (e.g. myVideo.vtt and myVideo.mp4).
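As a rough illustration of the file types listed above, the following sketch classifies files by extension and pairs each video with transcripts that share its base name. The names `classify`, `findTranscripts`, and `MediaKind` are hypothetical, and two details are assumptions rather than documented behavior: extensions are compared case-insensitively, and base names must match exactly (case-sensitively).

```typescript
// Hypothetical classification sketch; VuDL's own detection logic may differ.
type MediaKind = "image" | "document" | "audio" | "video" | "transcript" | "other";

const KIND_BY_EXTENSION: Record<string, MediaKind | undefined> = {
  tif: "image",
  tiff: "image", // assumption: .tiff treated the same as .tif
  pdf: "document",
  flac: "audio",
  avi: "video",
  mkv: "video",
  mov: "video",
  mp4: "video",
  txt: "transcript",
  vtt: "transcript",
};

// Split "myVideo.mp4" into base "myVideo" and lower-cased extension "mp4".
function splitName(filename: string): { base: string; ext: string } {
  const dot = filename.lastIndexOf(".");
  if (dot <= 0) return { base: filename, ext: "" };
  return { base: filename.slice(0, dot), ext: filename.slice(dot + 1).toLowerCase() };
}

function classify(filename: string): MediaKind {
  return KIND_BY_EXTENSION[splitName(filename).ext] ?? "other";
}

// Map each video file to the transcript files that share its base name.
function findTranscripts(files: string[]): Record<string, string[]> {
  const pairs: Record<string, string[]> = {};
  for (const video of files.filter((f) => classify(f) === "video")) {
    const base = splitName(video).base;
    pairs[video] = files.filter(
      (f) => classify(f) === "transcript" && splitName(f).base === base
    );
  }
  return pairs;
}

const transcriptPairs = findTranscripts([
  "myVideo.mp4",
  "myVideo.vtt",
  "other.txt",
  "0001.TIF",
]);
```

Here `myVideo.vtt` is paired with `myVideo.mp4`, while `other.txt` is left unpaired because no video shares its base name.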
This is a complex system which uses a large number of tools to manage a digital repository.
The software is written in Node.js, but requires quite a few external tools to accomplish its goals. You should install the external tools first, then the JavaScript dependencies.
This software is designed to run on multiple operating systems; however, Ubuntu (or another Debian flavor) tends to be the quickest and easiest to set up because easy-to-install packages are available for most of the external dependencies.
- Cantaloupe Image Server (or another IIIF image server) - optional, but required when using VuFind® (see below, and also setup notes).
- Fedora Commons - required for storing repository content
- Fedora Commons Camel Toolbox - used for sending messages from Fedora to VuDL to enable indexing, etc.
- FFmpeg - required for audio/video processing
- FITS - required for file characterization
- ImageMagick - required by textcleaner (see below)
- OCRmyPDF - required for OCR enhancement of PDFs
- Redis - required to support queue features
- Relational Database (SQLite by default, or MySQL/MariaDB by configuration) - required for user session persistence and PID generation
- Solr - required for searching/indexing content; it is recommended that you use the instance bundled with VuFind® (see below)
- tesseract-ocr - required for OCR of image files
- textcleaner - required for cleanup of image files prior to OCR
- Tika - required for text extraction from document files
- VuFind® - strongly recommended as the public front-end for the repository
- Node.js (developed and tested with v15)
- NPM (`npm install -g npm`)
- Execute `npm install` to install root node dependencies
- Execute `npm run setup` to install node dependencies in subdirectories
Copy api/vudl.ini.dist to api/vudl.ini, and configure it using a text editor.
This configuration file allows you to specify where files will be stored during ingest/processing, as well as the paths to the various external tools required by this package.
In two separate terminals or panes, run:
- `npm run api:watch` to run TypeScript for the API code
- `npm run dev` to fire up all dev servers
After a few moments, a new tab should automatically open in your browser pointing to `localhost:3000`. Refresh until the app appears.
Script | Description |
---|---|
api | run node api server |
api:build | build api TypeScript |
api:dev | run api server, restart server on changes |
api:format | format only the api code with Prettier |
api:lint | lint only the api code |
api:saml:metadata | output SAML SP metadata to share with an IdP |
api:setup | install npm dependencies in api/ |
client | run react-scripts server |
client:build | build React code for production |
client:format | format only the client code with Prettier |
client:lint | lint only the client code |
client:setup | install npm dependencies in client/ |
client:snapshots | update snapshots used by test suite |
client:test | run client unit tests with test coverage |
client:testWatch | run client unit tests while watching source folders |
queue | start job queue worker (call with -- [queuename] to specify a non-default queue to monitor) |
queue:dev | start job queue worker, restart on changes |
ingest | add all published jobs to the ingest queue |
backend | start both api and worker queue servers |
backend:dev | start both api and worker queue servers, restart on changes |
build | build entire project for production |
dev | run api, client, and queue dev servers (auto-restart) |
format | format all code with Prettier |
lint | report lint errors in all code |
setup | install subdirectory npm dependencies |
start | run api, client, and queue servers (production) |
test | run both client and api tests |
watch | alias for api:watch |
Some useful command-line utilities (which can be run using `node [filename]`) are found in the api/scripts directory.
Script | Description |
---|---|
export-combined-solr-cache.js | If Solr caching is enabled, this exports it to XML files. |
generate-master-md.js | Queue a job to generate the MASTER-MD stream for the specified PID. |
generate-pdfs-from-list.js | Queue PDF generation jobs from the specified PID list file. |
index-pid-list.js | Queue index jobs from the specified PID list file. |
ingest.js | Queue ingest jobs for content published in the Paginator. |
purge-deleted-children-of-pid.js | Purge all deleted children of the specified PID. |
purge-trash.js | Purge all deleted objects that are children of the configured trash_pid. |
reindex-from-solr-cache.js | Rebuild the Solr index from the Solr cache. |
send-notification.js | Add a job to the notify queue. |