ENH: Add command to extract annotated pages #97

wolfram77 · 2025-02-05T03:27:22Z

Hello pdfly contributors, I hope you're doing well. My advisor has repeatedly asked for a way to filter annotated pages from a PDF (thesis). I managed to find a solution using pymupdf, but having a CLI tool for this would be helpful. Any suggestions on how to integrate this feature into pdfly?

Lucas-C · 2025-02-05T08:34:30Z

Hi @wolfram77 👋

To determine whether pdfly could include this feature, could you please provide a few more details?

Would this feature only extract pages with annotations from an existing PDF?
How do you suggest adding this to pdfly? With a new extract-pages-with-annotations subcommand?

wolfram77 · 2025-02-05T11:37:11Z

Hello @Lucas-C

Thanks for considering my request.

Yes, extracting only pages with annotations from a PDF would be useful, as it would help my guide and other professors filter out the pages they have commented on, especially when reviewing theses.

As for adding this to pdfly, perhaps a shorter annotated-pages command would work, with an optional output file - allowing the output to be something like input_file_annotated.pdf.
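
To make the optional output file concrete, here is a minimal sketch of how the default name could be derived from the input path. The helper name resolve_output_path is hypothetical and not part of pdfly; it only illustrates the input_file_annotated.pdf convention suggested above.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(input_pdf: Path, output_pdf: Optional[Path] = None) -> Path:
    """Return the explicit output path, or derive '<stem>_annotated<suffix>'.

    Hypothetical helper: the name and behaviour are assumptions for
    illustration, not an existing pdfly function.
    """
    if output_pdf is not None:
        # The user passed an explicit output file: use it unchanged.
        return output_pdf
    # Otherwise write next to the input, e.g. thesis.pdf -> thesis_annotated.pdf
    return input_pdf.with_name(f"{input_pdf.stem}_annotated{input_pdf.suffix}")
```

With this sketch, calling resolve_output_path(Path("thesis.pdf")) would give a path named thesis_annotated.pdf in the same directory as the input.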

Lucas-C · 2025-02-05T12:25:49Z

Alright.

I think that keeping the extract- prefix may be better, to be explicit about what this subcommand does, and for consistency with the existing extract- commands.

Would you like to submit a PR to implement this feature? 🙂

You will find some documentation on how to detect annotations using pypdf2 here:
https://pypdf2.readthedocs.io/en/3.x/user/reading-pdf-annotations.html

wolfram77 · 2025-02-06T05:19:30Z

I tried the following, but it seems to include far more pages than expected.

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(input_pdf))
    writer = PdfWriter()
    # Copy only the pages that have an /Annots entry
    for page in reader.pages:
        if "/Annots" in page:
            writer.add_page(page)
    # Save the output PDF
    writer.write(output_pdf)

Lucas-C · 2025-02-06T07:57:36Z

That's a good start 🙂 👍

There are many kinds of PDF annotations:

file attachments
Text / FreeText / Ink / Highlight text comments
PDF signatures
actions, triggered when opening the document or when clicking in an area...
even links (internal or external) are annotations!

In order to distinguish between those, you will have to check their /Subtype (the /Type entry, when present, is /Annot for every annotation).

For more information, you can check the PDF specs: https://developer.adobe.com/document-services/docs/assets/5b15559b96303194340b99820d3a70fa/PDF_ISO_32000-2.pdf
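
As a rough illustration of that filtering, here is a minimal sketch that keeps a page only if at least one of its annotations has a reviewer-style subtype. The set REVIEW_SUBTYPES and the function names has_review_annotation and extract_annotated_pages are assumptions made for this example, not existing pdfly or pypdf names; which subtypes count as "comments" is a judgment call that may need adjusting.

```python
# Assumption: these subtypes are treated as reviewer comments.
# /Link, /Widget, /Popup, /FileAttachment, etc. are deliberately left out.
REVIEW_SUBTYPES = frozenset(
    {"/Text", "/FreeText", "/Ink", "/Highlight", "/Underline", "/Squiggly", "/StrikeOut"}
)


def has_review_annotation(page, subtypes=REVIEW_SUBTYPES):
    """Return True if the page has at least one annotation whose /Subtype is in subtypes."""
    if "/Annots" not in page:
        return False
    for annot in page["/Annots"]:
        # pypdf usually stores annotations as indirect references; resolve them.
        obj = annot.get_object() if hasattr(annot, "get_object") else annot
        subtype = str(obj["/Subtype"]) if "/Subtype" in obj else ""
        if subtype in subtypes:
            return True
    return False


def extract_annotated_pages(input_pdf, output_pdf):
    """Copy the pages with review annotations to output_pdf; return how many were kept."""
    # Imported here so the helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(input_pdf))
    writer = PdfWriter()
    kept = 0
    for page in reader.pages:
        if has_review_annotation(page):
            writer.add_page(page)
            kept += 1
    writer.write(str(output_pdf))
    return kept
```

Compared with testing only for the presence of /Annots, this skips pages whose only annotations are links or form widgets, which is a likely reason for the extra pages in the earlier attempt.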

wolfram77 mentioned this issue Feb 6, 2025

ENH: With extract-annotated-pages command #98

Open
