PDF QUality Assessment for Digitisation batches

What is pdfquad?

Pdfquad is a simple tool for automated quality assessment of PDF documents in digitisation batches against a user-defined technical profile. It uses PyMuPDF to parse the PDF file structure and extract some relevant properties. Properties of embedded images are extracted using Pillow.

These properties are serialized to a simple XML structure, which is then evaluated against Schematron rules that define the expected/required technical characteristics.

Installation

Install the software with the pip package manager:

pip install pdfquad

Then run pdfquad once:

pdfquad

Depending on your system, pdfquad will create a folder named pdfquad in one of the following locations:

For Linux, it will use the location defined by environment variable $XDG_CONFIG_HOME. If this variable is not set, it will use the .config directory in the user's home folder (e.g. /home/johan/.config/pdfquad). Note that the .config directory is hidden by default.
For Windows, it will use the AppData\Local folder (e.g. C:\Users\johan\AppData\Local\pdfquad).

The folder contains two subdirectories named profiles and schemas, which are explained in the "Profiles" and "Schemas" sections below.
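
The location rules above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not pdfquad's actual code: the function name guess_config_dir and its parameters are hypothetical, and platforms other than Linux and Windows are simply treated like Linux here.

```python
import os
import sys
from pathlib import Path


def guess_config_dir(platform, env, home):
    """Return the folder where pdfquad is expected to keep profiles and schemas.

    Hypothetical helper: platform is a sys.platform-style string, env is a
    mapping of environment variables, home is the user's home folder.
    """
    if platform.startswith("win"):
        # Windows: AppData\Local inside the user's home folder
        return Path(home) / "AppData" / "Local" / "pdfquad"
    # Linux: $XDG_CONFIG_HOME if it is set, otherwise the hidden .config folder
    base = env.get("XDG_CONFIG_HOME") or Path(home) / ".config"
    return Path(base) / "pdfquad"


if __name__ == "__main__":
    print(guess_config_dir(sys.platform, os.environ, Path.home()))
```

Running the list command remains the authoritative way to find out which directories pdfquad really uses on your system.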

Command-line syntax

The general syntax of pdfquad is:

usage: pdfquad [-h] [--version] {process,list,copyps} ...

Pdfquad has three sub-commands:

Command	Description
process	Process a batch.
list	List available profiles and schemas.
copyps	Copy default profiles and schemas to user directory.

process command

Run pdfquad with the process command to process a batch. The syntax is:

usage: pdfquad process [-h] [--maxpdfs MAXPDFS] [--prefixout PREFIXOUT]
                       [--outdir OUTDIR] [--verbose]
                       profile batchDir

The process command expects the following positional arguments:

Argument	Description
profile	This defines the validation profile. Note that any file paths entered here will be ignored, as Pdfquad only accepts profiles from the profiles directory. You can just enter the file name without the path. Use the list command to list all available profiles.
batchDir	This defines the batch directory that will be analyzed.

In addition, the following optional arguments are available:

Argument	Description
--maxpdfs, -x	This defines the maximum number of PDFs that are reported in each output XML file (default: 10).
--prefixout, -p	This defines a text prefix on which the names of the output files are based (default: "pq").
--outdir, -o	This defines the directory where output is written (default: current working directory from which pdfquad is launched).
--verbose, -b	This tells pdfquad to report Schematron output in verbose format.

In the simplest case, we can call pdfquad with the profile and the batch directory as the only arguments:

pdfquad process dbnl-fulltext.xml ./mybatch

Pdfquad will now recursively traverse all directories and files inside the "mybatch" directory, and analyse all PDF files (based on a file extension match).
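
As a rough illustration of this kind of traversal, the sketch below walks a directory tree and selects files by extension. It is not taken from the pdfquad source: the function name find_pdfs is hypothetical, and the case-insensitive comparison is an assumption about how the extension match works.

```python
import os


def find_pdfs(batch_dir):
    """Yield the paths of all files below batch_dir that have a .pdf extension."""
    for root, _dirs, files in os.walk(batch_dir):
        for name in sorted(files):
            # Assumption: the extension match ignores case (.pdf and .PDF)
            if os.path.splitext(name)[1].lower() == ".pdf":
                yield os.path.join(root, name)


if __name__ == "__main__":
    for path in find_pdfs("./mybatch"):
        print(path)
```

Note that a match on the file extension says nothing about the file's content, so a non-PDF file with a .pdf extension would still be picked up.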

list command

Run pdfquad with the list command to get a list of the available profiles and schemas, as well as their locations. For example:

pdfquad list

Results in:

Available profiles (directory /home/johan/.config/pdfquad/profiles):
  - dbnl-fulltext.xml
Available schemas (directory /home/johan/.config/pdfquad/schemas):
  - pdf-dbnl-85.sch
  - pdf-dbnl-50.sch

copyps command

If you run pdfquad with the copyps command, it will copy the default profiles and schemas that are included in the installation over to your user directory.

Warning: any changes you made to the default profiles or schemas will be lost after this operation, so proceed with caution! If you want to keep any of these files, just make a copy and save them under a different name before running the copyps command.

Profiles

A profile is an XML file that defines how a digitisation batch is evaluated. It is made up of one or more schema elements, each of which links a file or directory naming pattern to a Schematron file. Here's an example:

<?xml version="1.0"?>

<profile>

<schema type="parentDirName" match="endswith" pattern="pi-85">pdf-dbnl-85.sch</schema>
<schema type="parentDirName" match="endswith" pattern="pi-50">pdf-dbnl-50.sch</schema>

</profile>

Here we see two schema elements. Each element refers to a Schematron file (explained in the next section). The values of the type, match and pattern attributes define how this file is linked to file or directory names inside the batch:

If type is "fileName", the matching is based on the name of the PDF itself. If type is "parentDirName", the matching is based on the name of the direct parent directory of the PDF.
The match attribute defines whether the pattern must match the file or directory name exactly ("is") or only partially ("startswith", "endswith", "contains").
The pattern attribute defines a text string that is used for the match.

In the example above, the profile says that if a PDF has a direct parent directory whose name ends with "pi-85", pdfquad should use Schematron file "pdf-dbnl-85.sch". If the directory name ends with "pi-50", it should use "pdf-dbnl-50.sch".
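
The matching logic described above can be sketched with Python's standard library. This is an illustration under stated assumptions, not pdfquad's implementation: the names name_matches and select_schema are hypothetical, the first matching schema element is assumed to win, and for "fileName" the full file name including its extension is assumed to be compared.

```python
import os
import xml.etree.ElementTree as ET

PROFILE = """<?xml version="1.0"?>
<profile>
<schema type="parentDirName" match="endswith" pattern="pi-85">pdf-dbnl-85.sch</schema>
<schema type="parentDirName" match="endswith" pattern="pi-50">pdf-dbnl-50.sch</schema>
</profile>
"""


def name_matches(name, match, pattern):
    """Apply one of the four documented match modes to a name."""
    if match == "is":
        return name == pattern
    if match == "startswith":
        return name.startswith(pattern)
    if match == "endswith":
        return name.endswith(pattern)
    if match == "contains":
        return pattern in name
    return False


def select_schema(profile_xml, pdf_path):
    """Return the Schematron file name for pdf_path, or None if nothing matches."""
    file_name = os.path.basename(pdf_path)
    parent_name = os.path.basename(os.path.dirname(pdf_path))
    for schema in ET.fromstring(profile_xml).iter("schema"):
        if schema.get("type") == "fileName":
            name = file_name
        elif schema.get("type") == "parentDirName":
            name = parent_name
        else:
            continue  # unknown type: skipped in this sketch
        if name_matches(name, schema.get("match"), schema.get("pattern")):
            return schema.text.strip()  # assumption: first match wins
    return None


if __name__ == "__main__":
    print(select_schema(PROFILE, "mybatch/anbe001lexi02/300dpi-85/anbe001lexi02_01.pdf"))
```

A PDF whose parent directory matches none of the patterns gets no schema in this sketch, which corresponds to the "no matching schema was found" case mentioned for the summary file.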

Available profiles

Currently the following profiles are included:

Profile	Description
dbnl-fulltext.xml	Profile for DBNL full-text digitisation batches.

Schemas

Schemas contain the Schematron rules on which the quality assessment is based. Some background information about this type of rule-based validation can be found in this blog post. Currently the following schemas are included:

pdf-dbnl-85.sch

This is a schema for production master PDFs with images in JPEG format that are compressed at 85% quality. It includes the following checks:

Check	Value
Thumbnails	Document does not open with thumbnails
File attachments	Document does not contain file attachments
Digital signatures	Document does not contain digital signatures
JavaScript	Document does not contain JavaScript
Open password	Document is not protected with open password
Exceptions, PDF	Parsing at PDF level did not result in any exceptions
PDF version	1.7
Encryption	Document does not use encryption
Annotations	Document does not contain WaterMark, Screen, Movie, 3D, Sound, FileAttachment, Link, Ink, Popup, Widget, Polygon, Text, FreeText or SVG annotations
Optional Content	Document does not contain any optional content layers
Images per page	Each page contains exactly 1 image
Watermarks	Document does not contain watermarks
ICC profile	Each image contains an ICC profile, which is either defined as a PDF object, or embedded in the image stream
Width, height	Image XObject dictionary values and image stream values are identical
Bits per component	Image XObject dictionary values and image stream values are identical
Filter value of Image XObject dictionary	DCTDecode
Image stream format	JPEG
Image stream resolution (ppi)	Within range [299, 301]
Image stream color components	3
Image stream JPEG Quality	Within range [83, 87]
Exceptions, stream	Parsing of the image streams did not result in any exceptions

pdf-dbnl-50.sch

This is a schema for small access PDFs with images in JPEG format that are compressed at 50% quality. It includes the following checks:

Check	Value
Thumbnails	Document does not open with thumbnails
File attachments	Document does not contain file attachments
Digital signatures	Document does not contain digital signatures
JavaScript	Document does not contain JavaScript
Open password	Document is not protected with open password
Exceptions, PDF	Parsing at PDF level did not result in any exceptions
PDF version	1.7
Encryption	Document does not use encryption
Annotations	Document does not contain WaterMark, Screen, Movie, 3D, Sound, FileAttachment, Link, Ink, Popup, Widget, Polygon, Text, FreeText or SVG annotations
Optional Content	Document does not contain any optional content groups
Images per page	Each page contains exactly 1 image
Watermarks	Document does not contain watermarks
ICC profile	Each image contains an ICC profile, which is either defined as a PDF object, or embedded in the image stream
Width, height	Image XObject dictionary values and image stream values are identical
Bits per component	Image XObject dictionary values and image stream values are identical
Filter value of Image XObject dictionary	DCTDecode
Image stream format	JPEG
Image stream resolution (ppi)	Within range [299, 301]
Image stream color components	3
Image stream JPEG Quality	Within range [48, 52]
Exceptions, stream	Parsing of the image streams did not result in any exceptions

Output

Pdfquad reports the following output:

Comprehensive output file (XML)

Pdfquad generates one or more comprehensive output files in XML format. For each PDF, these contain all extracted properties, as well as the Schematron report and the assessment status. Here's an example file.

Since these files can get really large, Pdfquad splits the results across multiple output files, using the following naming convention:

pq_mybatch_001.xml
pq_mybatch_002.xml
etcetera

By default, Pdfquad limits the number of reported PDFs for each output file to 10, after which it creates a new file. This behaviour can be changed by using the --maxpdfs (alias -x) option. For example, the command below will limit the number of PDFs per output file to 1 (so each PDF will have its own dedicated output file):

pdfquad process dbnl-fulltext.xml ./mybatch -x 1
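
To estimate how many output files a batch will produce, the naming convention and the --maxpdfs limit can be combined as in the sketch below. The function name output_file_names is hypothetical, and the three-digit zero-padded counter is inferred from the example file names above rather than from the pdfquad source.

```python
def output_file_names(prefix, batch_name, n_pdfs, max_pdfs=10):
    """Return the expected output file names for a batch of n_pdfs PDFs."""
    # Ceiling division: a partly filled last file still counts as a file
    n_files = -(-n_pdfs // max_pdfs)
    # Assumption: counter starts at 1 and is zero-padded to three digits
    return ["{}_{}_{:03d}.xml".format(prefix, batch_name, i)
            for i in range(1, n_files + 1)]


if __name__ == "__main__":
    for name in output_file_names("pq", "mybatch", 25):
        print(name)
```

With the default limit of 10, a batch of 25 PDFs would give three files in this sketch; with -x 1 it would give 25.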

Summary file (CSV)

This is a comma-delimited text file with, for each PDF, the following columns:

Column	Description
file	Full path to the PDF file.
validationSuccess	Flag with value True if Schematron validation was successful, and False if not. A value False indicates that the file could not be validated (e.g. because no matching schema was found, or the validation resulted in an unexpected exception).
validationOutcome	The outcome of the Schematron validation/assessment. Value is Pass if file passed all tests, and Fail otherwise. Note that it is automatically set to Fail if the Schematron validation was unsuccessful (i.e. "validationSuccess" is False)
noPages	The number of pages in the document.
fileOut	Corresponding comprehensive output file with full output for this PDF.

Here's an example:

file,validationSuccess,validationOutcome,noPages,fileOut
/home/johan/pdfquad-test/mybatch/20241106/anbe001lexi02/300dpi-85/anbe001lexi02_01.pdf,True,Pass,1528,/home/johan/pdfquad-test/pq_mybatch_001.xml
/home/johan/pdfquad-test/mybatch/20241106/anbe001lexi02/300dpi-50/anbe001lexi02_01.pdf,True,Fail,1528,/home/johan/pdfquad-test/pq_mybatch_001.xml
/home/johan/pdfquad-test/mybatch/20241106/brin003196603/300dpi-85/brin003196603_01.pdf,True,Fail,1260,/home/johan/pdfquad-test/pq_mybatch_001.xml
/home/johan/pdfquad-test/mybatch/20241106/brin003196603/300dpi-50/brin003196603_01.pdf,True,Fail,1260,/home/johan/pdfquad-test/pq_mybatch_001.xml
/home/johan/pdfquad-test/mybatch/20241105/_deu002201201/300dpi-85/_deu002201201_01.pdf,True,Fail,297,/home/johan/pdfquad-test/pq_mybatch_001.xml
/home/johan/pdfquad-test/mybatch/20241105/_deu002201201/300dpi-50/_deu002201201_01.pdf,True,Fail,297,/home/johan/pdfquad-test/pq_mybatch_001.xml
/home/johan/pdfquad-test/mybatch/20241105/_boe012192401/300dpi-85/_boe012192401_01.pdf,True,Pass,346,/home/johan/pdfquad-test/pq_mybatch_001.xml
/home/johan/pdfquad-test/mybatch/20241105/_boe012192401/300dpi-50/_boe012192401_01.pdf,True,Fail,346,/home/johan/pdfquad-test/pq_mybatch_001.xml
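
Because the summary is plain CSV, it is easy to post-process. The sketch below uses Python's csv module to list every PDF that did not pass; the function name failed_files is hypothetical, and the sample text reuses two rows from the example above.

```python
import csv
import io

SAMPLE = (
    "file,validationSuccess,validationOutcome,noPages,fileOut\n"
    "/home/johan/pdfquad-test/mybatch/20241106/anbe001lexi02/300dpi-85/anbe001lexi02_01.pdf,"
    "True,Pass,1528,/home/johan/pdfquad-test/pq_mybatch_001.xml\n"
    "/home/johan/pdfquad-test/mybatch/20241106/anbe001lexi02/300dpi-50/anbe001lexi02_01.pdf,"
    "True,Fail,1528,/home/johan/pdfquad-test/pq_mybatch_001.xml\n"
)


def failed_files(csv_text):
    """Return the paths of all PDFs whose validationOutcome is not Pass."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["file"] for row in reader if row["validationOutcome"] != "Pass"]


if __name__ == "__main__":
    for path in failed_files(SAMPLE):
        print(path)
```

For a real summary file, read its contents into a string first and pass that to the function. The fileOut column then tells you which comprehensive output file holds the details for each failed PDF.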

Licensing

Pdfquad is released under the Apache License, Version 2.0.

Useful links

Schematron
