Scenario: you have a single (but probably multi-page) PDF, rather than a set of image files. You would like to have a IIIF Manifest for this PDF, so it can be loaded into IIIF Viewers, annotated as IIIF Canvases, and so on.
One way of doing this would be to allow the DLCS to accept a PDF as input, and have it extract the images for the pages and provide them as independent image services.
At the moment, you POST a hydra:Collection of Images to the queue, or PUT a single Image to its location.
Current example of a hydra:Collection POSTed to the DLCS queue (here just containing one Image):
{
"@context": "http://www.w3.org/ns/hydra/context.jsonld",
"@type": "Collection",
"member": [
{
"id": "b28802068_0008.jp2",
"space": 6,
"origin": "https://s3-eu-west-1.amazonaws.com/bucketname/key-path/b28802068_0008.jp2",
"string1": "b28802068",
"string2": "",
"string3": "",
"number1": 0,
"number2": 8,
"number3": 0,
"roles": [],
"duration": 0,
"family": "I",
"mediaType": "image/jp2",
"text": "https://api.wellcomecollection.org/text/alto/b28802068/b28802068_0008.jp2",
"textType": "alto",
"maxUnauthorised": -1
}
]
}
The response body is an accepted Batch - the batch's images haven't been processed yet, but they are in the queue and will be processed in time. This could be a long time; it depends entirely on the number of images in the queue and the resources available to the DLCS to process them. Crucially, enqueuing is a very lightweight operation, so bursts of activity at ingest don't overwhelm the DLCS; it is able to spread the more intensive work over as much time as is needed to do it.
{
"@context": "https://api.dlcs.io/contexts/Batch.jsonld",
"@id": "https://api.dlcs.io/customers/2/queue/batches/761372",
"@type": "vocab:Batch",
"errorImages": "https://api.dlcs.io/customers/2/queue/batches/761372/errorImages",
"images": "https://api.dlcs.io/customers/2/queue/batches/761372/images",
"completedImages": "https://api.dlcs.io/customers/2/queue/batches/761372/completedImages",
"test": "https://api.dlcs.io/customers/2/queue/batches/761372/test",
"submitted": "2021-02-17T15:21:12.0443523+00:00",
"count": 36,
"completed": 0,
"errors": 0,
"finished": "0001-01-01T00:00:00",
"superseded": false
}
The images property in the above links to a collection of the images in the batch.
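A client that polls the Batch resource needs some way to decide when the work is finished. The following is a minimal sketch, assuming (this is not stated above) that a batch is done once completed plus errors reaches count; the function name is hypothetical.

```python
def batch_progress(batch: dict) -> tuple[int, bool]:
    """Return (outstanding, done) for a Batch resource.

    Assumption: a batch is finished once completed + errors reaches count.
    """
    outstanding = batch["count"] - batch["completed"] - batch["errors"]
    return outstanding, outstanding <= 0


# Values taken from the example Batch above.
batch = {"count": 36, "completed": 0, "errors": 0}
print(batch_progress(batch))  # (36, False)
```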
The new requirement is that a PDF becomes a set of individual image services.
A PDF here is a bit like a latent Collection. A hydra:Collection can have any domain class as members. This suggests we can introduce a new class to the API of the DLCS, stepping back a bit from the specifics of PDF:
Composite |
---|
id |
space |
origin |
string1 |
string2 |
string3 |
number1 |
number2 |
number3 |
roles |
duration |
family |
mediaType |
text |
textType |
maxUnauthorised |
originFormat |
incrementSeed |
Submitting a Composite to the queue is telling the DLCS "unpack the sequence of images inside the resource, and create a DLCS image for each one, assigning the properties provided to each DLCS image". However, one of those properties - id - cannot be the same for each image, and it's likely that at least one of the metadata fields (string1, string2, number1, etc) will need to vary across the images. We achieve this with format strings and increments, explained below.
The Composite class has the same properties as Image (asset), plus a couple of extra ones.
Until now, the only permitted @type of member of a hydra:Collection submitted to the DLCS queue has been Image; this has allowed us to omit the type in the first example.
From now on, however, the type should be supplied (although we can assume Image if it is not provided).
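The defaulting rule could be sketched as follows. This is only an illustration: the type names vocab:Image and vocab:Composite follow the examples in this document, while the function names and the decision to reject unknown types are assumptions.

```python
DEFAULT_MEMBER_TYPE = "vocab:Image"  # assumed when a member carries no @type


def member_type(member: dict) -> str:
    """Return the member's @type, falling back to Image."""
    return member.get("@type", DEFAULT_MEMBER_TYPE)


def split_members(collection: dict) -> tuple[list, list]:
    """Separate a hydra:Collection's members into (images, composites)."""
    images, composites = [], []
    for member in collection.get("member", []):
        kind = member_type(member)
        if kind == "vocab:Image":
            images.append(member)
        elif kind == "vocab:Composite":
            composites.append(member)
        else:
            # Assumption: anything else is rejected rather than guessed at.
            raise ValueError(f"unsupported member type: {kind}")
    return images, composites
```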
The extra properties are originFormat and incrementSeed.
originFormat tells the DLCS what the source file is. Initially, the only permitted value for this property is application/pdf; no others are supported (but they could be later).
incrementSeed affects how the DLCS assigns properties to the images it will create. It works in tandem with format strings provided in the id and metadata fields. This must be an integer, for now. We can look to add other ways of formatting later.
The id property of a Composite MUST contain a format string. Metadata fields (string1, string2, number1, etc) may optionally contain format strings. For any property containing a format string, the DLCS will assign a value for that property obtained by combining the format string with the incrementSeed.
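These constraints lend themselves to a small validation step before a Composite is accepted. The sketch below uses Python's standard string.Formatter to detect replacement fields; the function names and error messages are hypothetical, and a malformed format string (for example an unclosed brace) would raise ValueError rather than be reported.

```python
from string import Formatter


def has_format_field(value) -> bool:
    """True if value is a string with at least one replacement field."""
    if not isinstance(value, str):
        return False
    # Formatter.parse yields (literal, field_name, spec, conversion);
    # field_name is None for purely literal segments.
    return any(field is not None for _, field, _, _ in Formatter().parse(value))


def validate_composite(composite: dict) -> list[str]:
    """Return a list of problems; an empty list means the Composite looks valid."""
    problems = []
    if not has_format_field(composite.get("id")):
        problems.append("id must contain a format string")
    if composite.get("originFormat") != "application/pdf":
        problems.append("originFormat must be application/pdf")
    seed = composite.get("incrementSeed")
    if isinstance(seed, bool) or not isinstance(seed, int):
        problems.append("incrementSeed must be an integer")
    return problems
```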
Note: most DLCS classes have both id and @id properties. The @id property is the fully qualified URL of the resource, whereas the id property is the path component of that fully qualified resource that you are providing to make it unique within its customer and space. Keeping these separate allows domain names to change, or path syntax to change; it also allows you to submit assets to the DLCS without worrying about fully qualified identifiers for them.
An example:
POST /queue
{
"@context": "http://www.w3.org/ns/hydra/context.jsonld",
"@type": "Collection",
"member": [
{
"@type": "vocab:Composite",
"id": "my-pdf-{:04d}",
"space": 6,
"origin": "https://s3-eu-west-1.amazonaws.com/bucketname/key-path/my-pdf.pdf",
"string1": "my-id-{:04d}",
"string2": "",
"string3": "",
"number1": "0",
"number2": "{:03d}",
"number3": "0",
"roles": [],
"family": "I",
"text": "https://example.org/text/alto/my-pdf/my-pdf-{:04d}.xml",
"textType": "alto",
"maxUnauthorised": -1,
"originFormat": "application/pdf",
"incrementSeed": 0
}
]
}
Here, the number{1-3} fields are all strings, too. They will be interpreted as numbers unless they contain an identifiable format string, such as {:03}. This syntax is borrowed from Python (see https://docs.python.org/3/tutorial/inputoutput.html).
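As an illustration of how a Composite might be expanded into per-page images, here is a minimal Python sketch. It assumes the counter starts at incrementSeed and advances by one per page, and that number fields are converted to integers after formatting; the function name and the two field lists are hypothetical. A width of four is used so that the ids take the form my-pdf-0000.

```python
# Fields whose values may carry a format string (taken from the examples here).
FORMATTABLE = ("id", "string1", "string2", "string3",
               "number1", "number2", "number3", "text")
# Properties that exist only on the Composite and are not copied to the images.
COMPOSITE_ONLY = ("@type", "originFormat", "incrementSeed")


def expand_composite(composite: dict, page_count: int) -> list[dict]:
    """Return one image dict per page of the composite."""
    seed = composite["incrementSeed"]
    images = []
    for page in range(page_count):
        counter = seed + page  # assumption: the counter advances by one per page
        image = {"@type": "vocab:Image"}
        for key, value in composite.items():
            if key in COMPOSITE_ONLY:
                continue
            if key in FORMATTABLE and isinstance(value, str):
                # str.format leaves strings without replacement fields unchanged
                # (apart from turning doubled braces into single ones).
                value = value.format(counter)
                if key.startswith("number"):
                    value = int(value)
            image[key] = value
        images.append(image)
    return images


composite = {
    "@type": "vocab:Composite",
    "id": "my-pdf-{:04d}",
    "space": 6,
    "string1": "my-id-{:04d}",
    "number1": "0",
    "number2": "{:03d}",
    "originFormat": "application/pdf",
    "incrementSeed": 0,
}
for image in expand_composite(composite, page_count=3):
    print(image["id"], image["string1"], image["number1"], image["number2"])
# my-pdf-0000 my-id-0000 0 0
# my-pdf-0001 my-id-0001 0 1
# my-pdf-0002 my-id-0002 0 2
```

The origin is copied through untouched in this sketch; in practice the DLCS would point each image at its own rasterized file.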
Adopting these formats now seems slightly overkill, but it gives us complete flexibility to extend the formatting mechanism in future in response to emerging use cases.
If the value must itself contain a brace character ({ or }), these should be escaped through the use of double-bracing, i.e. {{ and }} will render as single braces { and } respectively.
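In Python terms, the escaping behaves as shown in this short sketch. The escape_braces helper is hypothetical; it is simply one way a client could prepare a literal value before appending a format field.

```python
def escape_braces(literal: str) -> str:
    """Double every brace so the text survives str.format unchanged."""
    return literal.replace("{", "{{").replace("}", "}}")


template = "{{draft}}-page-{:03d}"
print(template.format(7))  # {draft}-page-007

# Building a template from an arbitrary literal plus a format field.
template = escape_braces("a{b}") + "-{:02d}"
print(template.format(3))  # a{b}-03
```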
To avoid overly complicating the POST /queue call, any metadata values that cannot be expressed via format strings can be amended by making a PATCH /customers/{customer}/spaces/{spaceId}/images/{imageId} request to update individual images after they have been ingested.
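A follow-up PATCH could be prepared along these lines. This sketch only builds the request with the standard library and does not send it; the API root, the JSON content type, the shape of the request body and the absence of authentication are all assumptions.

```python
import json
from urllib.parse import quote
from urllib.request import Request

API_ROOT = "https://api.dlcs.io"  # assumed base URL


def build_image_patch(customer: int, space: int, image_id: str, changes: dict) -> Request:
    """Build (but do not send) a PATCH request for a single image."""
    url = (f"{API_ROOT}/customers/{customer}/spaces/{space}"
           f"/images/{quote(image_id, safe='')}")
    body = json.dumps(changes).encode("utf-8")
    return Request(url, data=body, method="PATCH",
                   headers={"Content-Type": "application/json"})


request = build_image_patch(2, 6, "my-pdf-0001", {"string2": "title page"})
print(request.get_method(), request.full_url)
# The request could then be sent with urllib.request.urlopen(request),
# once whatever credentials the API requires have been added.
```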
The process of retrieving a PDF from its origin, rasterizing its pages into individual images, pushing each image to a DLCS-managed storage location, and generating the request to POST to the DLCS API to process those images is potentially expensive and thus long-running. It is not reasonable - and is contrary to good API design - to expect a client to wait for this process to complete before the request completes.
As a result, if the above example is POSTed, it should return almost immediately an HTTP 202 Accepted response, complete with a JSON response body describing the processing status of each PDF to be processed:
{
"id": "https://ch.dlcs.io/collections/84d0955c-3573-4582-af57-3805a273685a",
"members": [
{
"id": "https://ch.dlcs.io/collections/84d0955c-3573-4582-af57-3805a273685a/members/81a572ff-5622-44f4-b22b-bc3a1074544d",
"status": "PENDING",
"created": "2021-11-26T13:23:36.772426Z",
"last_updated": "2021-11-26T13:23:36.772426Z"
},
{
"id": "https://ch.dlcs.io/collections/84d0955c-3573-4582-af57-3805a273685a/members/d24aa8a8-0ea5-45d7-9d96-ded7f836ae77",
"status": "PENDING",
"created": "2021-11-26T13:23:36.772426Z",
"last_updated": "2021-11-26T13:23:36.787751Z"
}
]
}
The client can then continue to query the URI provided in the top level id field, and will receive a 200 OK response with a JSON response body describing the current status of each PDF contained within the original request:
{
"id": "https://ch.dlcs.io/collections/84d0955c-3573-4582-af57-3805a273685a",
"members": [
{
"id": "https://ch.dlcs.io/collections/84d0955c-3573-4582-af57-3805a273685a/members/81a572ff-5622-44f4-b22b-bc3a1074544d",
"status": "COMPLETED",
"created": "2021-11-26T13:15:18.849333Z",
"last_updated": "2021-11-26T13:23:36.772426Z",
"image_count": 1,
"dlcs_uris": [
"https://api.dlcs.digirati.io/customers/17/queue/batches/570439"
]
},
{
"id": "https://ch.dlcs.io/collections/84d0955c-3573-4582-af57-3805a273685a/members/d24aa8a8-0ea5-45d7-9d96-ded7f836ae77",
"status": "FETCHING_ORIGIN",
"created": "2021-11-26T13:23:11.398093Z",
"last_updated": "2021-11-26T13:23:36.787751Z"
}
]
}
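A client-side polling loop for the status document might look like the sketch below. The fetch callable is injected so the loop stays independent of any HTTP library; the name of the error status ("ERROR"), the polling interval and the attempt limit are assumptions, since only PENDING, FETCHING_ORIGIN and COMPLETED appear in the examples.

```python
import time

# "COMPLETED" appears in the examples; "ERROR" is an assumed name for failures.
TERMINAL_STATUSES = {"COMPLETED", "ERROR"}


def poll_until_done(fetch, url, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch(url) until every member has reached a terminal status.

    fetch is any callable that returns the parsed JSON status document.
    """
    for _ in range(max_attempts):
        document = fetch(url)
        if all(m["status"] in TERMINAL_STATUSES for m in document["members"]):
            return document
        sleep(interval)
    raise TimeoutError(f"{url} still processing after {max_attempts} attempts")
```

Injecting fetch and sleep also makes the loop easy to exercise in tests without network access or real delays.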
An individual PDF can be queried directly using the id provided for that member, and in addition to the JSON response body specific to that PDF, the caller will receive one of the following response codes:
Status | HTTP Code | Headers | Body | Notes |
---|---|---|---|---|
Processing | 200 OK | None | None | Indicates that the backend is still processing / rasterizing the PDF ingestion request, or that processing has completed. |
Errored | 422 Unprocessable Entity | None | { "Error": "Description" } | An error occurred during the processing / rasterization of the PDF. The response body contains more details. |
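The table above could be turned into a small client-side helper such as the following sketch. The function name and the returned labels are hypothetical, and treating any other status code as unexpected is an assumption.

```python
def interpret_member_response(status_code: int, body):
    """Map a per-PDF response onto (outcome, error_description)."""
    if status_code == 200:
        # Still processing, or finished successfully.
        return ("ok", None)
    if status_code == 422:
        # The body carries the error description under the "Error" key.
        return ("errored", (body or {}).get("Error"))
    raise ValueError(f"unexpected status code {status_code}")
```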
Once rasterized, an individual PDF can be split into multiple batches before being submitted to the DLCS for ingestion. The dlcs_uris property contains an array of URIs representing each of the batches created for that PDF.
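Gathering the batch URIs from a status document is then straightforward. A minimal sketch, with a hypothetical function name, that only looks at members reported as COMPLETED:

```python
def completed_batch_uris(status_document: dict) -> list[str]:
    """Collect the DLCS batch URIs of every completed member, in order."""
    uris = []
    for member in status_document.get("members", []):
        if member.get("status") == "COMPLETED":
            uris.extend(member.get("dlcs_uris", []))
    return uris


status_document = {
    "members": [
        {"status": "COMPLETED",
         "dlcs_uris": ["https://api.dlcs.digirati.io/customers/17/queue/batches/570439"]},
        {"status": "FETCHING_ORIGIN"},
    ]
}
print(completed_batch_uris(status_document))
# ['https://api.dlcs.digirati.io/customers/17/queue/batches/570439']
```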
Assuming that my-pdf.pdf is a 3-page PDF, once the PDF processing / rasterization has completed the client is provided with one or more URIs for the DLCS batches that were created for the ingestion of the rasterized images. Retrieving the images of such a batch would return a hydra:Collection again, and it would look something like this:
{
"@context": "http://www.w3.org/ns/hydra/context.jsonld",
"@id": "https://api.dlcs.io/customers/2/queue/batches/761372",
"@type": "Collection",
"member": [
{
"@type": "vocab:Image",
"@id": "https://api.dlcs.io/customers/2/spaces/6/images/my-pdf-0000",
"id": "my-pdf-0000",
"service": "https://dlcs.io/iiif-img/2/6/my-pdf-0000",
"space": 6,
"origin": "https://s3-eu-west-1.amazonaws.com/dlcs-internal-bucket/key-path/my-pdf.pdf/my-pdf-0000.jp2",
"string1": "my-id-0000",
"string2": "",
"string3": "",
"number1": 0,
"number2": 0,
"number3": 0,
"roles": [],
"family": "I",
"text": "https://example.org/text/alto/my-pdf/my-pdf-0000.xml",
"textType": "alto",
"maxUnauthorised": -1
},
{
"@type": "vocab:Image",
"@id": "https://api.dlcs.io/customers/2/spaces/6/images/my-pdf-0001",
"id": "my-pdf-0001",
"service": "https://dlcs.io/iiif-img/2/6/my-pdf-0001",
"space": 6,
"origin": "https://s3-eu-west-1.amazonaws.com/dlcs-internal-bucket/key-path/my-pdf.pdf/my-pdf-0001.jp2",
"string1": "my-id-0001",
"string2": "",
"string3": "",
"number1": 0,
"number2": 1,
"number3": 0,
"roles": [],
"family": "I",
"text": "https://example.org/text/alto/my-pdf/my-pdf-0001.xml",
"textType": "alto",
"maxUnauthorised": -1
},
{
"@type": "vocab:Image",
"@id": "https://api.dlcs.io/customers/2/spaces/6/images/my-pdf-0002",
"id": "my-pdf-0002",
"service": "https://dlcs.io/iiif-img/2/6/my-pdf-0002",
"space": 6,
"origin": "https://s3-eu-west-1.amazonaws.com/dlcs-internal-bucket/key-path/my-pdf.pdf/my-pdf-0002.jp2",
"string1": "my-id-0002",
"string2": "",
"string3": "",
"number1": 0,
"number2": 2,
"number3": 0,
"roles": [],
"family": "I",
"text": "https://example.org/text/alto/my-pdf/my-pdf-0002.xml",
"textType": "alto",
"maxUnauthorised": -1
}
]
}
Cantaloupe can provide an image service for any page of a PDF. While it would mean switching to Cantaloupe as the image server, the PDF would NOT be broken up into images; it would be kept intact. However, the DLCS would need to know about the PDF forever, not just at ingest time (the PDF would no longer be just a wrapper to get images into the system).
This approach is attractive but might be too big a step right now.
The Deliverator API is currently being re-implemented in Protagonist, but this work will take a while to complete. We could require an alternate handler for submissions of PDFs to the queue, e.g., /pdfqueue or pdf.dlcs.xxx/queue, and have our standalone Python service process this API endpoint, unpack the PDF, and register the PDF's images using the regular, existing API.