page.links return all links with same xref, is it something possible ?? #3563

flint-company-backup · 2024-06-09T16:42:04Z

I'm very surprised: I am analyzing a PDF and trying to get all of its links, and I get a list of dicts in which every link has the same "xref".
Is there a way to delete these links even though they all have the same xref?
Thanks

    import pymupdf

    doc = pymupdf.open("input.pdf")  # placeholder file name
    text, links = "", []
    for page in doc.pages():
        print(f"links : {page.get_links()}")
        text += page.get_text().lower()
        links.extend([x['uri'].lower() for x in page.links(kinds=[pymupdf.LINK_URI])])
=>
Links : [
    {'kind': 2, 'xref': 0, 'from': Rect(248.3167724609375, 174.28570556640625, 279.57891845703125, 183.88568115234375), 'uri': 'https://XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(410.7126770019531, 37.824951171875, 496.1625671386719, 46.2249755859375), 'uri': 'mailto:XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(410.7126770019531, 55.4649658203125, 456.3170166015625, 63.86492919921875), 'uri': 'https://XXXXXs/'},
    {'kind': 2, 'xref': 0, 'from': Rect(238.4034881591797, 699.2244873046875, 260.1038818359375, 708.824462890625), 'uri': 'https://XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(267.0658264160156, 699.2244873046875, 284.9347229003906, 708.824462890625), 'uri': 'XXXXXs'},
    {'kind': 2, 'xref': 0, 'from': Rect(291.89666748046875, 699.2244873046875, 336.628173828125, 708.824462890625), 'uri': 'https://XXXXXsahier-des-charge'},
    {'kind': 2, 'xref': 0, 'from': Rect(343.5901184082031, 699.2244873046875, 412.5794982910156, 708.824462890625), 'uri': 'https://XXXXXsonvertir'},
    {'kind': 2, 'xref': 0, 'from': Rect(419.54144287109375, 699.2244873046875, 479.58209228515625, 708.824462890625), 'uri': 'https://XXXXXs/'},
]

JorjMcKie · 2024-06-09T21:42:53Z

Please provide a reproducing example.
So far your post leads to nothing actionable.

flint-company-backup · 2024-06-10T07:02:51Z

> Please provide a reproducing example. So far your post leads to nothing actionable.

That is hard to do, since the PDF is a real person's résumé and contains personal data. Is there a workaround that would let me provide the example?

JorjMcKie · 2024-06-10T10:38:12Z

Your PDF obviously has a problem which we should intercept and handle in a better way. So, no: we need a reproducer to confirm that we guessed the right cause. But you can use my private email for the submission so it won't be exposed to the public. Otherwise this post will never become a bug report.


JorjMcKie · 2024-06-10T13:17:00Z

The example PDF shared with me violates the specifications for links / annotations:

Instead of giving indirect references, as it should, it embeds all the link objects directly in the /Annots array.
In other words, it should look like /Annots [4711 0 R 4712 0 R ...]. Instead we find:

/Annots [ <<
        /Type /Annot
        /Subtype /Link
        /Rect [ 248.31678 596.1143 279.57893 605.7143 ]
        /Border [ 0 0 0 ]
        /A <<
          /Type /Action
          /S /URI
          /URI (https://alexialabbe.fr/#projects)
        >>
      >> <<
        /Type /Annot
        /Subtype /Link
        /Rect [ 238.40349 71.17554 260.10389 80.775539 ]
        /Border [ 0 0 0 ]
        /A <<
          /Type /Action
          /S /URI
          /URI (https://blog.codein.fr/guide-rgpd-les-pratiques-essentielles-pour-assurer-la-conformite-de-votre-site-web)
        >>
      >> 
... ]
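
To check whether one of your own files has the same layout, a small sketch like the following may help. It relies on Document.xref_get_key(), which returns a (type, value) pair for a key of a PDF object. The helper name has_inline_annots is hypothetical (not part of PyMuPDF), and the test for "<<" is only a heuristic under the assumption that a well-formed /Annots array contains nothing but indirect references.

```python
def has_inline_annots(doc, page_xref):
    """Return True if the page's /Annots array embeds annotation dictionaries.

    Hypothetical helper: `doc` is assumed to be an open pymupdf.Document
    and `page_xref` the xref of a page object (page.xref).
    """
    # xref_get_key() returns a (type, value) pair; value is PDF source text.
    kind, value = doc.xref_get_key(page_xref, "Annots")
    # A well-formed array looks like "[4711 0 R 4712 0 R]". Finding "<<"
    # inside it means at least one dictionary is stored inline.
    return kind == "array" and "<<" in value
```

With an open document this could be called as has_inline_annots(doc, page.xref) for each page. If /Annots is itself an indirect reference to an array, the sketch reports False and that array would need to be inspected separately.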

So pymupdf does recognize the links, but cannot assign an xref to them (hence xref=0).
In this situation you cannot update or delete links with the normal PyMuPDF API (delete_link etc.) - no way.
But you can edit the page's object definition source using the low-level API, for instance by deleting the whole /Annots array.
This will remove everything (!!!): links, annotations and fields that may be on the page.

doc.xref_set_key(5, "Annots", "null")  # 5 = page xref; sets /Annots to null

print(doc.xref_object(5))  # show the page's updated object definition
            
<<
  /Type /Page
  /Parent 1 0 R
  /MediaBox [ 0 0 540 780 ]
  /Contents 134 0 R
  /Resources <<
    /ExtGState <<
      /Alpha0 10 0 R
      /Alpha1 11 0 R
    >>
    /Font <<
      /Font4 14 0 R
      /Font11 21 0 R
      /Font12 22 0 R
      /Font5 15 0 R
    >>
  >>
  /Annots null
  /Group <<
    /S /Transparency
    /CS /DeviceRGB
  >>
>>

All links are gone!
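
The same idea can be applied to a whole document instead of one hard-coded xref. The following is only a sketch under assumptions: the function name strip_unaddressable_links is hypothetical, and it treats any page that reports a link with xref 0 as affected. Like the one-liner above, it removes links, annotations and fields on those pages, not just the links.

```python
def strip_unaddressable_links(doc):
    """Set /Annots to null on every page that reports links with xref 0.

    Hypothetical helper: `doc` is assumed to be an open pymupdf.Document.
    Returns the numbers of the pages that were modified.
    """
    changed = []
    for page in doc:
        # Links stored inline in /Annots are reported with xref == 0.
        if any(link["xref"] == 0 for link in page.get_links()):
            # Low-level edit of the page object; this drops links,
            # annotations and form fields of the page alike.
            doc.xref_set_key(page.xref, "Annots", "null")
            changed.append(page.number)
    return changed
```

A possible usage is to open the file with pymupdf.open(), call strip_unaddressable_links(doc), and then save with doc.ez_save() under a new name. Page objects that are already loaded may still hold their old links, so it is probably safest to reload the pages or reopen the saved file before checking the result.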

JorjMcKie · 2024-06-10T13:22:10Z

BTW the example page looks exactly the same, but all hot areas are gone.
Also, the file size (when saving via ez_save()) goes down to 44KB (was 1 MB before).

flint-company-backup · 2024-06-11T09:16:58Z

Thanks Jorj !!

JorjMcKie added example required Waiting for information labels Jun 9, 2024

JorjMcKie added not a bug not a bug / user error / unable to reproduce and removed example required Waiting for information labels Jun 10, 2024

JorjMcKie closed this as completed Jun 10, 2024
