Horizontal spaces / Tabs in a line result in text being read as two lines | TEXT_PRESERVE_WHITESPACE not working as intended #2810

mikejokic · 2023-11-15T15:43:47Z

Describe the bug (mandatory)

When a line in my text contains a tab, it is being converted to a new line character and read as two lines. Using flags = 2 or TEXT_PRESERVE_WHITESPACE is not resolving the issue.

To Reproduce (mandatory)

test.pdf

import fitz

doc = fitz.open("/test.pdf")
for page in doc:
    # Extract plain text, asking MuPDF to keep whitespace as stored in the PDF
    print(page.get_text(option="text", flags=fitz.TEXT_PRESERVE_WHITESPACE))

output
Test
this is a test

Expected behavior (optional)

Print the text line by line and keep the whitespace in between:
Test this is a test

JorjMcKie · 2023-11-15T16:28:28Z

Your text on the page contains no tab! The CLI command mutool trace test.pdf > test.xml shows that there are no white spaces at all:
test.zip
The text extraction logic established by MuPDF tries to collect text into spans as long as the properties are the same, the baseline is sufficiently close, and the inter-word distance is not too large.
Whenever these conditions are not met, a new line is started.
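
If the goal is to get "Test this is a test" back as one visual row, one possible workaround is to regroup words by their vertical position after extraction. The following is only a minimal sketch, not MuPDF's own algorithm: rows_from_words and the y_tol tolerance are hypothetical names, and the input is assumed to have the tuple layout returned by page.get_text("words"), i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no). The sample coordinates are made up.

```python
def rows_from_words(words, y_tol=3.0):
    """Group word tuples into visual rows by their bottom coordinate (y1).

    Assumed tuple layout (as from page.get_text("words")):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    y_tol is a hypothetical tolerance in points; tune it per document.
    """
    rows = []  # list of (reference_y1, [word tuples])
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if rows and abs(w[3] - rows[-1][0]) <= y_tol:
            rows[-1][1].append(w)      # close enough: same visual row
        else:
            rows.append((w[3], [w]))   # otherwise start a new row
    # Within each row, order words left to right and join with one space.
    return [
        " ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
        for _, ws in rows
    ]


# Made-up sample: "Test" and "this is a test" share a baseline but are
# far apart horizontally, so they arrive as two extraction lines (0 and 1).
sample = [
    (72.0, 70.0, 95.0, 84.0, "Test", 0, 0, 0),
    (180.0, 70.0, 195.0, 84.0, "this", 0, 1, 0),
    (198.0, 70.0, 205.0, 84.0, "is", 0, 1, 1),
    (208.0, 70.0, 214.0, 84.0, "a", 0, 1, 2),
    (217.0, 70.0, 235.0, 84.0, "test", 0, 1, 3),
]
print(rows_from_words(sample))
```

With a real document, the list would come from words = page.get_text("words") instead of the sample. Note that this joins words with a single space and does not reproduce the original horizontal gap.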

JorjMcKie · 2023-11-15T16:38:07Z

Of the font used, Calibri, the following subset is present:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
6 beginbfrange
<21><21><0054>  ==> "T" (chr(0x54))
<22><22><0065>  ==> "e"
<23><24><0073>  ==> "s", "t"
<25><25><0020>  ==> " "
<26><27><0068>  ==> "h", "i"
<28><28><0061>  ==> "a"
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

No tab character, and no white space other than the ordinary space (0x0020)!
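
The bfrange entries above can be expanded mechanically to double-check which characters the font subset can produce. This is a small illustrative sketch; expand_bfrange is a hypothetical helper, and it only handles the simple <lo><hi><start> form shown in this CMap, not the array form that the PDF specification also allows.

```python
import re

# The six bfrange entries from the CMap above, without the "==>" annotations.
BFRANGE = """\
<21><21><0054>
<22><22><0065>
<23><24><0073>
<25><25><0020>
<26><27><0068>
<28><28><0061>
"""


def expand_bfrange(text):
    """Map each character code to its Unicode character.

    Handles only entries of the form <lo><hi><start>, where consecutive
    codes lo..hi map to consecutive code points beginning at start.
    """
    mapping = {}
    triple = r"<([0-9A-Fa-f]+)><([0-9A-Fa-f]+)><([0-9A-Fa-f]+)>"
    for lo, hi, start in re.findall(triple, text):
        lo, hi, start = int(lo, 16), int(hi, 16), int(start, 16)
        for offset in range(hi - lo + 1):
            mapping[lo + offset] = chr(start + offset)
    return mapping


cmap = expand_bfrange(BFRANGE)
for code in sorted(cmap):
    print(f"0x{code:02X} -> {cmap[code]!r}")
print("tab present:", "\t" in cmap.values())
```

The eight codes map to T, e, s, t, space, h, i and a; no code maps to a tab (0x0009).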

mikejokic · 2023-11-15T17:44:33Z

Thank you for the quick response. The findings are surprising, as I authored this PDF by inserting a tab. Regardless, I want to work within the logic established by MuPDF.

Through your other comments on GitHub, I found the gettext command of the fitz module and am now using it to preserve the layout by writing a .txt file. However, I want to retain page number information, e.g. the word "test" was mentioned on pages 1, 5 and 10. How can I do this? The text file has no delimiters marking the end of a page.
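
One way to approach the page-number question, sketched under an assumption that should be verified against the actual output file: as far as I recall, the gettext command ends each page with a form feed character ("\f") unless the -noformfeed option is given. If that holds, the text file can be split on "\f" to recover page numbers. The helper name pages_containing and the sample string are hypothetical.

```python
import re


def pages_containing(text, word, page_sep="\f"):
    """Return 1-based page numbers whose text contains `word`.

    Matching is case-insensitive and on whole words only.
    page_sep is the assumed page delimiter (form feed by default).
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    pages = text.split(page_sep)
    return [
        number
        for number, page_text in enumerate(pages, start=1)
        if pattern.search(page_text)
    ]


# Made-up three-page sample, each page terminated by a form feed.
sample = "Test this is a test\fnothing here\fanother TEST page\f"
print(pages_containing(sample, "test"))
```

Alternatively, the search could be done without the intermediate text file by looping over the document in Python and checking each page's extracted text; there, page.number is 0-based, so add 1 for a human-readable page number.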

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) Nov 15, 2023

JorjMcKie closed this as completed Nov 15, 2023
