page.search_for not working for all text-part in pdf-document? #3838

Rapid1898-code · 2024-09-03T18:30:12Z

Hello, I have the attached source file test.pdf and am trying to replace every occurrence of
["A1", "A2", "B1", "B2", "P1", "I1", "C1", "C2"]
in the PDF document using the following code;
the output goes to out.pdf.

As you can see in the output, page.search_for does not find the entries
B1, C2, P1, I1.

Why are these strings not found, while the others (A1, A2, B2, C1) are?

Code:

import fitz

def adjust_matrix(font, bbox, text, fontsize):
    """Compute matrix performing a horizontal scale.

    Args:
        font: Font object
        bbox: bbox of the text to fill
        text: text
        fontsize: fontsize to use
    Returns:
        Horizontal scaling matrix
    """
    tl = font.text_length(text, fontsize=fontsize)
    width = bbox[2] - bbox[0]
    scale = width / tl
    return fitz.Matrix(scale, 1)


def get_fontlist(page):
    """Make a dictionary for existing fonts."""
    flist = {}
    for f in page.get_fonts():
        # extract xref, full font name and reference name
        xref, fullname, refname = f[0], f[3], f[4]
        # subset fonts have a "+" in string position 7
        fullname = fullname[7:] if "+" in fullname else fullname
        ff = doc.extract_font(xref)  # extract font buffer
        font = fitz.Font(fontbuffer=ff[-1])
        # store reference name and the Font object
        flist[fullname] = (refname, font)
    return flist

doc = fitz.open("test.pdf")
pageNumbers = len(doc)
searchList = ["A1", "A2", "B1", "B2", "P1", "I1", "C1", "C2"]

for pageNr in range(pageNumbers):
  print(f"Working for page number {pageNr}")
  page = doc[pageNr]
  fontfilename = "c:/Temp/ArimoBold.ttf"
  fontname = "F0"

  # make a dictionary of fonts used on this page
  fontlist = get_fontlist(page)
  # print(fontlist)
  # exit()
  # we intend to replace every occurrence of each search item by "YES"

  for searchItem in searchList:
    print(f"Working for {searchItem}")
      
    bboxes = page.search_for(searchItem)

    print(bboxes)
    input("Press!")

    spans = []  # store occurrences here
    for bbox in bboxes:  # extract full text meta info for each occurrence
      for b in page.get_text("dict", clip=bbox)["blocks"]:
        for l in b["lines"]:
          spans.extend(l["spans"])

    # now redact away the spans containing the search item
    for s in spans:
      page.add_redact_annot(s["bbox"])

    # page._apply_redactions(images=0, graphics=0)
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # don't touch images

    # now insert new text in emptied bboxes
    for s in spans:
      point = fitz.Point(s["origin"])  # insertion point
      text = s["text"]  # original text of the span
      fsize = s["size"]  # fontsize
      font = s["font"]  # font name in PDF

      # convert the original color to PDF format (unused: green is used below for demo purposes)
      color = fitz.sRGB_to_pdf(s["color"])  # re-use original color
      # replace old by new text
      text = text.replace(searchItem, "YES")

      # choose right font for output:
      # there often exists ambiguity WRT "-" instead of spaces etc. so we
      # make a second try when encountering problems
      try:
          fontname, font_obj = fontlist[font]
      except KeyError:
          fontname, font_obj = fontlist[font.replace("-", " ")]

      # IMPORTANT: matrix to stretch or shrink new text horizontally
      matrix = adjust_matrix(font_obj, s["bbox"], text, fsize)

      page.insert_font(fontname="F0", fontfile=fontfilename)        
      page.insert_text(
        point,
        text,
        fontname="F0",
        # fontsize=fsize,
        fontsize=14,
        color=(0, 205/255, 0),
        morph=(point, matrix),  # this will shrink / stretch horizontally
      )

doc.subset_fonts()  # not needed in this version: reusing existing font subsets
doc.ez_save("out.pdf")

out.pdf
test.pdf

JorjMcKie · 2024-09-04T10:27:52Z

Maintainer

Weird!
It looks like a problem beyond your responsibility. This file contains so-called structure tree information (StructTreeRoot), which lets advanced PDFs present their content based on higher-level structure information.
If you had tried to extract the text, you would have seen that the result is very different from what PDF viewers show:

print(page.get_text())
Compliant
Scan Results for West School District
June 11, 2024
June 11, 2024
Compliant
The Chemical Hygiene Plan from West School isn’t OSHA compliant. 
Your document is 82.35% aligned wi
th OSHA standards. Let’s change that!
46
We  found 46 issues in your document
A1
A2
Is the goal of the Chemical Hygiene Plan described? 
Is the Chemical Hygiene Plan reviewed annually?
Yes
B. Chemical Hygiene Personnel
No
B2
Is the goal for Chemical Hygiene Personnel described? 
Yes
This audit is subjected to terms of service. 
No
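
A quick way to see which search items are affected is to compare them against the extracted text. The following is only a minimal sketch and not part of PyMuPDF: `missing_terms` is a hypothetical helper name, and its plain case-insensitive substring test merely approximates the matching done by `page.search_for`.

```python
def missing_terms(extracted_text, terms):
    """Return the terms that do not occur in the extracted text.

    Hypothetical helper: uses a plain case-insensitive substring test,
    which only approximates the matching done by page.search_for.
    """
    haystack = extracted_text.lower()
    return [t for t in terms if t.lower() not in haystack]


# A few lines taken from the extraction output shown above
sample = "A1\nA2\nYes\nB. Chemical Hygiene Personnel\nNo\nB2\n"
print(missing_terms(sample, ["A1", "A2", "B1", "B2"]))  # prints ['B1']
```

With a real page, the result of `page.get_text()` would be passed as `extracted_text`.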

If you remove this "structure" information using the low-level API, everything behaves as expected.

cat = doc.pdf_catalog()
doc.xref_set_key(cat, "StructTreeRoot", "null")  # remove the structure info

# now proceed as normal
page = doc[...]
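
The two lines above could be wrapped so that the catalog is only touched when a structure tree is actually present. This is a sketch under assumptions: `drop_struct_tree` is a hypothetical helper name, and `doc` is expected to be an open PyMuPDF Document. It relies on `xref_get_key` returning the type `"null"` for a key that is missing.

```python
def drop_struct_tree(doc):
    """Set /StructTreeRoot in the PDF catalog to null if it is present.

    Hypothetical helper; `doc` is expected to be an open PyMuPDF Document.
    Returns True if an entry was removed, False if there was none.
    """
    cat = doc.pdf_catalog()  # xref number of the PDF catalog
    key_type, _ = doc.xref_get_key(cat, "StructTreeRoot")
    if key_type == "null":  # key is missing or already null
        return False
    doc.xref_set_key(cat, "StructTreeRoot", "null")  # remove the structure info
    return True
```

It would be called once after `fitz.open(...)` and before the first `page.search_for`. The change affects the document in memory and ends up in the output file only when the document is saved.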

JorjMcKie · 2024-09-04T10:38:41Z

Maintainer

I submitted a bug report to MuPDF's bug tracker: https://bugs.ghostscript.com/show_bug.cgi?id=708005

Rapid1898-code · 2024-09-04T12:57:13Z

Author

Works great again - thx a lot!
