This project maintains the full text datasets provided to researchers and the HathiTrust Research Center.
```
git clone https://github.com/hathitrust/datasets
cd datasets
docker-compose build
docker-compose run test bundle install
docker-compose run test
```
The datasets consist of the volume fulltexts whose rights permit inclusion in, and distribution by, the HathiTrust research datasets. Each volume is uniquely identified by a prefix and a number, e.g. `pre.01234567891011` or `exp.11223344556677`.
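The identifier format can be illustrated with a minimal sketch; the `parse_volume_id` helper is hypothetical and only reflects the dot-separated form of the sample identifiers above:

```ruby
# Split a volume identifier like "pre.01234567891011" into its
# prefix and number (illustrative helper, not the project's API).
def parse_volume_id(id)
  prefix, number = id.split(".", 2)
  { prefix: prefix, number: number }
end

parse_volume_id("exp.11223344556677")
# => { prefix: "exp", number: "11223344556677" }
```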
The entire corpus of fulltexts available for research is called `ht_text`. This is the superset of available volumes, and it is used directly only by the HathiTrust Research Center.
There are subsets of volumes corresponding to the specific rights attributed to those volumes. For example, volumes with the rights attribute "public domain world" are in the subset `ht_text_pd_world`. The subsets are:
- ht_text_pd
- ht_text_pd_open_access
- ht_text_pd_world
- ht_text_pd_world_open_access
The zip files containing the data reside in `ht_text` (the superset). The subsets mirror sections of the `ht_text` pairtree, with the final directory being a symlink to the corresponding directory in `ht_text`.
```
ls -l /datasets/ht_text_pd_world/obj/exp/pairtree_root/11/22/33/44/55/66/77
11223344556677 -> /datasets/ht_text/ht_text_pd_world/obj/exp/pairtree_root/11/22/33/44/55/66/77
```
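The mapping from a volume number to its pairtree directory can be sketched as follows; this is illustrative only (the project likely uses a pairtree library rather than a hand-rolled helper), and `pairtree_path` is a hypothetical name:

```ruby
# Map a volume number to its pairtree directory by splitting the
# number into two-character segments, matching the example path above.
# The final directory named with the full number is the symlink
# into ht_text.
def pairtree_path(root, prefix, number)
  segments = number.scan(/.{1,2}/)  # "11223344556677" -> ["11", "22", ..., "77"]
  File.join(root, prefix, "pairtree_root", *segments)
end

pairtree_path("/datasets/ht_text_pd_world/obj", "exp", "11223344556677")
# => "/datasets/ht_text_pd_world/obj/exp/pairtree_root/11/22/33/44/55/66/77"
```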
Before beginning a new run, the queue of jobs must be empty; to be empty, each job must have completed successfully, and failed jobs are re-queued. This prevents race conditions from multiple changes to the same volume.
There are two kinds of changes to the HathiTrust volumes that the research datasets need to incorporate:

- Rights: updates to the copyright determination or access rights, queried from the aptly named rights table.
- Content: updates to the OCR text, queried from the re-ingest feed table.
The list of changes is filtered into queues. There is a queue for each subset and a queue for the content changes.
For each volume in a queue, a job is scheduled to apply the changes to the filesystem.
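The routing of changes into queues can be sketched roughly as below; the `route_changes` helper, the change-record shape, and the queue names are all assumptions for illustration, not the project's actual classes:

```ruby
# Route changed volumes into one queue per subset plus a content
# queue (illustrative sketch; the real filtering logic may differ).
SUBSETS = %w[ht_text_pd ht_text_pd_open_access
             ht_text_pd_world ht_text_pd_world_open_access].freeze

def route_changes(changes)
  queues = Hash.new { |h, k| h[k] = [] }
  changes.each do |change|
    case change[:kind]
    when :rights
      # A rights change is queued for each subset it affects.
      (change[:subsets] & SUBSETS).each { |s| queues[s] << change[:id] }
    when :content
      queues["content"] << change[:id]
    end
  end
  queues
end
```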
Deployed via a private ArgoCD control repository.
This creates a set of workers for handling dataset jobs, as well as a set of cron jobs to generate the full dataset inventory, fetch metadata, queue jobs for updating the dataset, and compile and process the logs generated by the workers.
Atomic filesystem moves. This remains to be tested.
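One common way to make such an update atomic is to stage the new files in a directory on the same filesystem and rename it into place; this is a sketch under that assumption (the `atomic_replace` helper is hypothetical, not the project's tested code):

```ruby
require "fileutils"
require "tmpdir"

# Write new content into a staging directory on the same filesystem,
# then swap it into place. File.rename is atomic on POSIX filesystems
# when source and target share a filesystem.
def atomic_replace(target_dir, staging_parent)
  staging = Dir.mktmpdir("volume-", staging_parent)
  yield staging                                       # caller writes new files here
  backup = "#{target_dir}.old"
  File.rename(target_dir, backup) if Dir.exist?(target_dir)
  File.rename(staging, target_dir)                    # atomic swap into place
  FileUtils.rm_rf(backup)                             # discard the old copy
end
```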