pdf_scout

This CLI tool automatically generates PDF bookmarks (also known as an 'outline' or a 'table of contents') for computer-generated PDF documents.

You can install it via pip, run it against a PDF, and uninstall it when you no longer need it:

pip install --user pdf_scout
pdf_scout ./my_document.pdf

pip uninstall pdf_scout

This project is a work in progress and will likely only generate suitable bookmarks for documents that conform to the following requirements:

Single column of text (not multiple columns)
Font size of header text > font size of body text
Header text is justified or left-aligned
Paragraph spacing for headers > body text paragraph spacing
Consistent left margins on every page
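To illustrate why these requirements matter, here is a minimal sketch of a header heuristic built on them. It is not pdf_scout's actual implementation: the `Line` type, the `guess_headers` function, the `tolerance` parameter, and the sample data are all hypothetical, and the sketch ignores paragraph spacing entirely. It treats the font size that covers the most characters as the body size, treats the most common left edge as the margin, and flags larger, left-aligned lines as headers.

```python
from collections import Counter
from typing import NamedTuple


class Line(NamedTuple):
    """Hypothetical record for one line of text extracted from a PDF."""
    text: str
    font_size: float  # in points
    left: float       # x-coordinate of the line's left edge, in points


def guess_headers(lines: list[Line], tolerance: float = 0.5) -> list[Line]:
    """Return the lines that look like headers under the requirements above."""
    if not lines:
        return []

    # Treat the font size that covers the most characters as the body size.
    chars_per_size: Counter = Counter()
    for line in lines:
        chars_per_size[round(line.font_size, 1)] += len(line.text)
    body_size = chars_per_size.most_common(1)[0][0]

    # Treat the most common left edge as the page's left margin.
    left_margin = Counter(round(line.left) for line in lines).most_common(1)[0][0]

    return [
        line
        for line in lines
        if line.font_size > body_size + tolerance          # larger than body text
        and abs(line.left - left_margin) <= tolerance * 2  # starts at the margin
    ]


# Made-up sample data: two headers, three body lines, one centred caption.
sample = [
    Line("Introduction", 14.0, 72.0),
    Line("This judgment concerns a dispute over a contract.", 11.0, 72.0),
    Line("The facts are set out in the paragraphs that follow.", 11.0, 72.0),
    Line("Background", 14.0, 72.0),
    Line("The parties entered into an agreement in 2019.", 11.0, 72.0),
    Line("(centred caption)", 14.0, 200.0),
]
headers = guess_headers(sample)
```

In this sample the centred caption is rejected even though it uses the larger font, because it does not start at the left margin. A multi-column layout or inconsistent margins would break the margin estimate, which is one way to see why such documents are unlikely to work well.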

Supported document types

pdf_scout expressly seeks to support the following classes of documents:

Singapore State Court and Supreme Court Judgments (unreported)
Singapore Law Reports
~~OpenDoc-generated PDFs, such as the State Court Practice Directions 2021 and the Supreme Court Practice Directions 2021~~ – OpenDoc has been deprecated by GovTech

It may support other types of documents as well. If a particular class of document isn't supported or does not work well, please open an issue and I will consider adding support for it.

Development

This project manages its dependencies using poetry and supports only Python ^3.9 (that is, 3.9 or later, but below 4.0). After installing poetry and entering the project folder, run the following to install the dependencies:

poetry install

To open a virtualenv in the project folder with the dependencies, run:

poetry shell

To run a script directly, run:

poetry run python ./pdf_scout/app.py <INPUT_FILE_PATH>

To debug using VSCode, start the script under debugpy, which listens on port 5678 and waits for the debugger to attach:

python -m debugpy --listen 0.0.0.0:5678 --wait-for-client ./pdf_scout/app.py

Tests

There are snapshot tests. Input PDFs are not provided at the moment, so you will have to populate the /pdf folder manually using the relevant sources (you may want to consider using Clerkent to download the unreported versions of judgments). The first command below runs the tests; the second updates the stored snapshots:

poetry run pytest
poetry run pytest --snapshot-update

Static type-checking

poetry run mypy pdf_scout/app.py

Tips

Processing a large PDF can take some time, so to iterate faster when debugging certain behaviour, extract the problematic part of the PDF as a separate file.
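One possible way to do the extraction is sketched below. It assumes the third-party pypdf library is installed (an assumption, not something this project documents), and the `parse_page_range` and `extract_pages` helpers are hypothetical names introduced for illustration. Any PDF tool that can split pages would serve equally well.

```python
from pathlib import Path


def parse_page_range(spec: str) -> range:
    """Turn a 1-based, inclusive range such as "12-15" (or "7") into 0-based indices."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return range(start - 1, end)


def extract_pages(src: Path, dest: Path, spec: str) -> None:
    """Copy the pages named by `spec` from `src` into a new PDF at `dest`."""
    # Imported here so the range parsing above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_range(spec):
        writer.add_page(reader.pages[index])
    with dest.open("wb") as output:
        writer.write(output)
```

For example, calling `extract_pages(Path("./my_document.pdf"), Path("./excerpt.pdf"), "12-15")` would write pages 12 to 15 to a new file (the name excerpt.pdf is illustrative), which can then be passed to pdf_scout in place of the full document.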
