Skip to content

Latest commit

 

History

History
83 lines (53 loc) · 2.69 KB

README.md

File metadata and controls

83 lines (53 loc) · 2.69 KB

pdf_scout

PyPI PyPI - Python Version PyPI - License

This CLI tool automatically generates PDF bookmarks (also known as an 'outline' or a 'table of contents') for computer-generated PDF documents.

You can install it globally via pip:

pip install --user pdf_scout
pdf_scout ./my_document.pdf

pip uninstall pdf_scout

screenshot

This project is a work in progress and will likely only generate suitable bookmarks for documents that conform to the following requirements:

  • Single column of text (not multiple columns)
  • Font size of header text > font size of body text
  • Header text is justified or left-aligned
  • Paragraph spacing for headers > body text paragraph spacing
  • Consistent left margins on every page

Supported document types

pdf_scout expressly seeks to supports the following classes of documents:

It may support other types of documents as well. If a particular class of document isn't supported or does not work well, please open an issue and I will consider adding support for it.

Development

This project manages its dependencies using poetry and is only supported for Python ^3.9. After installing poetry and entering the project folder, run the following to install the dependencies:

poetry install

To open a virtualenv in the project folder with the dependencies, run:

poetry shell

To run a script directly, run:

poetry run python ./pdf_scout/app.py <INPUT_FILE_PATH>

Debugging using VSCode:

python -m debugpy --listen 0.0.0.0:5678 --wait-for-client ./pdf_scout/app.py

Tests

There are snapshot tests. Input PDFs are not provided at the moment, so you will have to populate the /pdf folder manually using the relevant sources (you may want to consider using Clerkent to download the unreported versions of judgments):

poetry run pytest
poetry run pytest --snapshot-update

Static type-checking

poetry run mypy pdf_scout/app.py

Tips

  • Processing a large PDF can take some time, so to iterate faster when debugging certain behaviour, extract the problematic part of the PDF as a separate file