ENH: Add command to extract annotated pages #97

Open
wolfram77 opened this issue Feb 5, 2025 · 5 comments

@wolfram77

Hello pdfly contributors, I hope you're doing well. My advisor has repeatedly asked for a way to extract only the annotated pages from a PDF (a thesis). I managed to find a solution using pymupdf, but having a CLI tool for this would be helpful. Any suggestions on how to integrate this feature into pdfly?

@Lucas-C
Member

Lucas-C commented Feb 5, 2025

Hi @wolfram77 👋

In order to determine if pdfly could include this feature, could you please provide a few more details?

  • would this feature only extract pages with annotations from an existing PDF?
  • how do you suggest adding this to pdfly? With a new extract-pages-with-annotations subcommand?

@wolfram77
Author

Hello @Lucas-C

Thanks for considering my request.

Yes, extracting only pages with annotations from a PDF would be useful, as it would help my guide and other professors filter out the pages they have commented on, especially when reviewing theses.

As for adding this to pdfly, perhaps a shorter annotated-pages command would work, with an optional output file - allowing the output to be something like input_file_annotated.pdf.
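
The default output name suggested above could be derived along these lines. This is only a sketch: `default_output_path` is a hypothetical helper, not an existing pdfly function, and the `_annotated` suffix simply follows the `input_file_annotated.pdf` example.

```python
from pathlib import Path


def default_output_path(input_pdf):
    """Build '<stem>_annotated<suffix>' next to the input file (hypothetical helper)."""
    input_pdf = Path(input_pdf)
    # Keep the directory and extension, append the suffix to the file stem.
    return input_pdf.with_name(f"{input_pdf.stem}_annotated{input_pdf.suffix}")
```

The subcommand would use this path only when no explicit output file is given.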

@Lucas-C
Member

Lucas-C commented Feb 5, 2025

Alright.

I think keeping the extract- prefix would be better: it makes explicit what this subcommand does, and it stays consistent with the existing extract- commands.

Would you like to submit a PR to implement this feature? 🙂

You will find some documentation on how to detect annotations using pypdf2 here:
https://pypdf2.readthedocs.io/en/3.x/user/reading-pdf-annotations.html

@wolfram77
Author

I tried the following, but it seems to include far more pages than expected.

    # Assumed import; the original snippet did not show it
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(input_pdf))
    writer = PdfWriter()
    # Copy only the pages with annotations
    for page in reader.pages:
        if "/Annots" in page:
            writer.add_page(page)
    # Save the output PDF
    writer.write(output_pdf)
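
One way to see why so many pages match might be to tally which kinds of annotations the document actually contains. The sketch below assumes pypdf page objects (dictionary-like, with annotation entries that may be indirect references); `tally_annotation_subtypes` is a hypothetical helper name.

```python
from collections import Counter


def tally_annotation_subtypes(pages):
    """Count annotation subtypes across an iterable of pypdf page objects."""
    counts = Counter()
    for page in pages:
        annots = page.get("/Annots")
        if annots is None:
            continue
        # /Annots and each of its entries may be indirect references.
        for annot in annots.get_object():
            subtype = annot.get_object().get("/Subtype")
            counts[str(subtype)] += 1
    return counts
```

Calling it as `tally_annotation_subtypes(PdfReader("thesis.pdf").pages)` (file name assumed) gives a per-subtype count. A large count for a subtype such as `/Link` would explain the extra pages.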

@Lucas-C
Member

Lucas-C commented Feb 6, 2025

That's a good start 🙂 👍

There are many kinds of PDF annotations:

  • file attachments
  • Text / FreeText / Ink / Highlight text comments
  • PDF signatures
  • actions, triggered when opening the document or when clicking in an area...
  • even links (internal or external) are annotations!

In order to distinguish between those, you will have to check their /Subtype (the /Type entry is /Annot for every annotation).

For more information, you can check the PDF specs: https://developer.adobe.com/document-services/docs/assets/5b15559b96303194340b99820d3a70fa/PDF_ISO_32000-2.pdf
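
A minimal sketch of that filtering, assuming pypdf is the library in use. The set of subtypes counted as reviewer comments is an assumption to adjust, and the function names are hypothetical, not part of pdfly.

```python
# Subtypes treated as reviewer comments (an assumption; adjust as needed).
REVIEW_SUBTYPES = {
    "/Text", "/FreeText", "/Ink", "/Highlight",
    "/Underline", "/Squiggly", "/StrikeOut",
}


def is_review_annotation(subtype):
    """Return True if the annotation subtype counts as a reviewer comment."""
    return subtype is not None and str(subtype) in REVIEW_SUBTYPES


def page_has_review_annotations(page):
    """Check a pypdf page for at least one annotation with a review subtype."""
    annots = page.get("/Annots")
    if annots is None:
        return False
    # /Annots and each of its entries may be indirect references.
    for annot in annots.get_object():
        if is_review_annotation(annot.get_object().get("/Subtype")):
            return True
    return False


def extract_annotated_pages(input_pdf, output_pdf):
    """Copy only pages carrying review annotations; return how many were kept."""
    from pypdf import PdfReader, PdfWriter  # lazy import; requires pypdf

    reader = PdfReader(str(input_pdf))
    writer = PdfWriter()
    kept = 0
    for page in reader.pages:
        if page_has_review_annotations(page):
            writer.add_page(page)
            kept += 1
    writer.write(str(output_pdf))
    return kept
```

Pages whose only annotations are links or form widgets are skipped, which should bring the page count closer to what a reviewer actually commented on.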
