This project is a Python script that Archive-It partners can use to download their WARC files and associated metadata through Archive-It's Web Archiving Systems API (WASAPI) and Partner API.
The code was developed as part of a Professional Experience project at the UBC iSchool for use by UBC Library Digital Initiatives, with the goal of digitally preserving WARC files captured using Archive-It.
Because the files will be preserved in Archivematica, the script organizes downloads in the following Submission Information Package (SIP) structure:
- ARCHIVEIT_COLLECTION-<collection number>_JOB-<crawl ID>
  - metadata
    - submissionDocumentation
      - <host-list csv>: list of host names and summary data from hosts report
      - <mimetype-list csv>: list of mimetypes and summary data from file types report
      - <seed-list csv>: list of seed URLs and summary data from seed report
  - objects
    - <WARC file(s)>
    - metadata
Each package contains one crawl's WARC files and administrative metadata. At present, descriptive metadata is not downloaded by this script.
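As a rough illustration of the package layout described above, the following sketch creates an empty folder skeleton for one crawl. It is not taken from warc_downloader.py: the function name `build_sip_skeleton` and its parameters are hypothetical, and the nesting (a `metadata/submissionDocumentation` folder and an `objects/metadata` folder inside the package) is an assumption about how the listing is meant to be read.

```python
from pathlib import Path


def build_sip_skeleton(base_dir, collection, crawl_id):
    """Create an empty package folder for one crawl and return its path.

    Hypothetical helper: the name, parameters, and nesting are assumptions
    made for illustration, not the script's actual implementation.
    """
    package = Path(base_dir) / f"ARCHIVEIT_COLLECTION-{collection}_JOB-{crawl_id}"

    # Report CSVs (hosts, mimetypes, seeds) are assumed to go here.
    (package / "metadata" / "submissionDocumentation").mkdir(parents=True, exist_ok=True)

    # WARC files are assumed to go directly in objects/, next to objects/metadata.
    (package / "objects" / "metadata").mkdir(parents=True, exist_ok=True)

    return package
```

Calling `build_sip_skeleton(".", 1234, 5678)` would create a folder named `ARCHIVEIT_COLLECTION-1234_JOB-5678` in the current directory; the collection and crawl numbers here are made up.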
Filename | Description |
---|---|
warc_downloader.py | Main script |
Pipfile | Pipfile containing dependencies |
credentials.env | Example file – edit with your Archive-It credentials |
- Clone or download this repository
- Run `pipenv install` within the project folder
- Edit credentials.env, replacing sampleUsername and samplePassword with your Archive-It credentials
- Run `pipenv run python warc_downloader.py`
- Follow the prompts provided:
Prompt | Notes |
---|---|
Enter collection number: | Enter the collection number from which to download WARC files. |
Would you like to narrow further by date? Enter y or n: | y to provide a date range for which WARC files to download, n to proceed with current results. If a collection has > 100 files, the initial query will only return 100 files, and you will be required to narrow the results by date. |
Enter a start date (YYYY-MM-DD): | Enter the earliest date for which to retrieve WARC files. |
Enter an end date (YYYY-MM-DD): | Enter the latest date for which to retrieve WARC files. Note that the end date is not inclusive. For example, to get all files from 2019, use start date 2019-01-01 and end date 2020-01-01. |
Download files? Enter y or n: | y to download files, n to exit. |
- As the files download, watch for any output in red text. The script reports file corruption (an MD5 checksum that did not match) and missing metadata files.
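To make the checksum step concrete, here is a minimal sketch of how a downloaded file's MD5 digest can be compared against an expected value. It is an illustration only, not the code in warc_downloader.py: the function names `md5_of_file` and `checksum_matches` are hypothetical, and where the expected checksum comes from is left to the caller.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for large WARC files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string.

    Hypothetical helper; the comparison ignores case and surrounding
    whitespace in the expected value.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file for which `checksum_matches` returns False would be a candidate for re-downloading.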