page.links return all links with same xref, is it something possible ?? #3563

Closed
flint-company-backup opened this issue Jun 9, 2024 · 6 comments
Labels
not a bug not a bug / user error / unable to reproduce

Comments

@flint-company-backup

I'm very surprised: when I analyze a PDF and try to get all of its links, I get a list of dicts in which every link has the same "xref".
Is there a way to delete these links even though they all have the same xref?
Thanks

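    import pymupdf

    doc = pymupdf.open("input.pdf")  # placeholder path, not from the original post
    text, links = "", []  # accumulators for page text and link URIs
    for page in doc.pages():
        print(f"links : {page.get_links()}")
        text += page.get_text().lower()
        links.extend(x["uri"].lower() for x in page.links(kinds=[pymupdf.LINK_URI]))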
=>
    Links : [
    {'kind': 2, 'xref': 0, 'from': Rect(248.3167724609375, 174.28570556640625, 279.57891845703125, 183.88568115234375), 'uri': 'https://XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(410.7126770019531, 37.824951171875, 496.1625671386719, 46.2249755859375), 'uri': 'mailto:XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(410.7126770019531, 55.4649658203125, 456.3170166015625, 63.86492919921875), 'uri': 'https://XXXXXs/'},
    {'kind': 2, 'xref': 0, 'from': Rect(238.4034881591797, 699.2244873046875, 260.1038818359375, 708.824462890625), 'uri': 'https://XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(267.0658264160156, 699.2244873046875, 284.9347229003906, 708.824462890625), 'uri': 'XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(291.89666748046875, 699.2244873046875, 336.628173828125, 708.824462890625), 'uri': 'https://XXXXXsahier-des-charge'},
    {'kind': 2, 'xref': 0, 'from': Rect(343.5901184082031, 699.2244873046875, 412.5794982910156, 708.824462890625), 'uri': 'https://XXXXXsonvertir'},
    {'kind': 2, 'xref': 0, 'from': Rect(419.54144287109375, 699.2244873046875, 479.58209228515625, 708.824462890625), 'uri': 'https://XXXXXs/'}]

@JorjMcKie
Collaborator

Please provide a reproducing example.
So far your post leads to nothing actionable.

@flint-company-backup
Author

> Please provide a reproducing example. So far your post leads to nothing actionable.

That is hard to do, since the PDF is a real person's résumé and contains personal data. Is there a way to work around this so that I can still provide an example?

@JorjMcKie
Collaborator

JorjMcKie commented Jun 10, 2024 via email

@JorjMcKie
Collaborator

JorjMcKie commented Jun 10, 2024

The example PDF shared with me violates the specifications for links / annotations:

Instead of giving indirect references as it should, it provides all the links directly (as inline dictionaries) in the /Annots array.
In other words, it should look like /Annots [4711 0 R 4712 0 R ...]. Instead we find:

/Annots [ <<
        /Type /Annot
        /Subtype /Link
        /Rect [ 248.31678 596.1143 279.57893 605.7143 ]
        /Border [ 0 0 0 ]
        /A <<
          /Type /Action
          /S /URI
          /URI (https://alexialabbe.fr/#projects)
        >>
      >> <<
        /Type /Annot
        /Subtype /Link
        /Rect [ 238.40349 71.17554 260.10389 80.775539 ]
        /Border [ 0 0 0 ]
        /A <<
          /Type /Action
          /S /URI
          /URI (https://blog.codein.fr/guide-rgpd-les-pratiques-essentielles-pour-assurer-la-conformite-de-votre-site-web)
        >>
      >> 
... ]
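
To check whether a page is affected, one option is to look at the raw /Annots value that the low-level API returns. The sketch below is only a heuristic and rests on assumptions: `has_inline_annots` and `report` are hypothetical helper names, the file path passed to `report` is up to the caller, and the test simply looks for `<<` (the start of an inline dictionary) in the array text returned by `doc.xref_get_key(page.xref, "Annots")`.

```python
def has_inline_annots(kind, value):
    """Heuristic: True if the /Annots array text contains inline dictionaries.

    `kind` and `value` are the two strings returned by
    doc.xref_get_key(page.xref, "Annots"). A conforming page should give
    something like ("array", "[4711 0 R 4712 0 R]").
    If /Annots is itself an indirect reference (kind == "xref"), this
    sketch does not follow it and reports False.
    """
    return kind == "array" and "<<" in value


def report(path):
    """Print, per page, the /Annots type and whether it looks inline."""
    import pymupdf  # imported lazily so the helper above has no dependencies

    doc = pymupdf.open(path)
    for page in doc:
        kind, value = doc.xref_get_key(page.xref, "Annots")
        print(page.number, kind, has_inline_annots(kind, value))
```

Calling `report("input.pdf")` on a suspect file (the name is a placeholder) should flag the same pages whose links come back with xref 0.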

Because of this, PyMuPDF does recognize the links but cannot assign an xref to them (hence xref=0).
In this situation you cannot update or delete the links with the normal PyMuPDF API (delete_link etc.).
But you can edit the page's object definition source with the low-level API and delete the whole /Annots array.
This removes everything (!!!) the array holds: links, annotations and form fields that may be on the page.

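# 5 = page xref; replace the page's whole /Annots array with null
doc.xref_set_key(5, "Annots", "null")

# show the page's object definition after the change
print(doc.xref_object(5))
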
<<
  /Type /Page
  /Parent 1 0 R
  /MediaBox [ 0 0 540 780 ]
  /Contents 134 0 R
  /Resources <<
    /ExtGState <<
      /Alpha0 10 0 R
      /Alpha1 11 0 R
    >>
    /Font <<
      /Font4 14 0 R
      /Font11 21 0 R
      /Font12 22 0 R
      /Font5 15 0 R
    >>
  >>
  /Annots null
  /Group <<
    /S /Transparency
    /CS /DeviceRGB
  >>
>>

All links are gone!
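
The same idea can be applied to a whole document without hard-coding the page xref. This is a minimal sketch under assumptions, not an official recipe: `strip_inline_annots` and `clean_file` are hypothetical helper names, the file paths are placeholders, and a page is treated as affected as soon as any of its links reports xref 0. As noted above, setting /Annots to null also drops every other annotation and form field on that page.

```python
def strip_inline_annots(doc):
    """Set /Annots to null on each page that has links without an xref.

    Returns the list of affected page numbers. `doc` is expected to behave
    like a PyMuPDF Document (iterable over pages, with xref_set_key).
    """
    cleaned = []
    for page in doc:
        # links defined inline in /Annots come back with xref == 0
        if any(link["xref"] == 0 for link in page.get_links()):
            doc.xref_set_key(page.xref, "Annots", "null")
            cleaned.append(page.number)
    return cleaned


def clean_file(src, dst):
    """Open `src`, strip affected pages, and save a compacted copy to `dst`."""
    import pymupdf  # imported lazily; only needed when working on real files

    doc = pymupdf.open(src)
    pages = strip_inline_annots(doc)
    doc.ez_save(dst)
    return pages
```

For example, `clean_file("input.pdf", "output.pdf")` would return the numbers of the pages whose /Annots entry was removed.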

@JorjMcKie
Collaborator

BTW the example page looks exactly the same, but all hot areas are gone.
Also, the file size (when saving via ez_save()) goes down to 44KB (was 1 MB before).

@JorjMcKie JorjMcKie added not a bug not a bug / user error / unable to reproduce and removed example required Waiting for information labels Jun 10, 2024
@flint-company-backup
Author

Thanks Jorj !!
