Horizontal spaces / Tabs in a line result in text being read as two lines | TEXT_PRESERVE_WHITESPACE not working as intended #2810

Closed
mikejokic opened this issue Nov 15, 2023 · 3 comments
Labels
not a bug not a bug / user error / unable to reproduce


mikejokic commented Nov 15, 2023

Please provide all mandatory information!

Describe the bug (mandatory)

When a line of my text contains a tab, the tab is converted to a newline character and the line is read as two lines. Passing flags = 2 (TEXT_PRESERVE_WHITESPACE) does not resolve the issue.

To Reproduce (mandatory)

test.pdf

import fitz  # PyMuPDF

doc = fitz.open("/test.pdf")
for page in doc:
    # Extract plain text, asking MuPDF to keep white space as stored in the page
    print(page.get_text("text", flags=fitz.TEXT_PRESERVE_WHITESPACE))

output
Test
this is a test

Expected behavior (optional)

Print the text line by line and keep the whitespace in between:
Test this is a test

@JorjMcKie JorjMcKie added the not a bug not a bug / user error / unable to reproduce label Nov 15, 2023
@JorjMcKie
Collaborator

Your text on the page contains no tab! The CLI command mutool trace test.pdf > test.xml shows that there is no white space at all:
test.zip
The text extraction logic established by MuPDF tries to collect text into spans for as long as the properties stay the same, the baseline is sufficiently close, and the inter-word distance is not too large.
Whenever these conditions are not met, a new line is started.
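
If the goal is to get "Test this is a test" back as one line despite the wide gap, one possible workaround is to regroup the extracted words by baseline yourself. The following is only a sketch: the helper name words_to_lines and the 3-point tolerance are assumptions, and the tuples follow the (x0, y0, x1, y1, text, block_no, line_no, word_no) layout returned by page.get_text("words").

```python
def words_to_lines(words, tolerance=3.0):
    """Join word tuples whose bottom edges (y1) lie within `tolerance` points.

    `words` uses the layout of PyMuPDF's page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    The tolerance value is an assumption; tune it for your documents.
    """
    # Sort top-to-bottom first, then left-to-right.
    ordered = sorted(words, key=lambda w: (w[3], w[0]))
    lines, current, current_y = [], [], None
    for w in ordered:
        # Start a new line once the bottom edge moves too far from the line's first word.
        if current and abs(w[3] - current_y) > tolerance:
            lines.append(current)
            current = []
        if not current:
            current_y = w[3]
        current.append(w)
    if current:
        lines.append(current)
    # Within each line, order words by their left edge and join with single spaces.
    return [" ".join(w[4] for w in sorted(line, key=lambda w: w[0])) for line in lines]


def page_lines(path):
    """Yield (page_number, lines) for each page of the PDF at `path`."""
    import fitz  # PyMuPDF, imported lazily so words_to_lines works without it

    with fitz.open(path) as doc:
        for number, page in enumerate(doc, start=1):
            yield number, words_to_lines(page.get_text("words"))
```

Note that this replaces the original horizontal gap with a single space; it does not restore a tab, since none is stored in the page.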

@JorjMcKie
Collaborator

Of the font used, Calibri, only the following subset is present:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
6 beginbfrange
<21><21><0054>  ==> "T" (chr(0x54))
<22><22><0065>  ==> "e"
<23><24><0073>  ==> "s", "t"
<25><25><0020>  ==> " "
<26><27><0068>  ==> "h", "i"
<28><28><0061>  ==> "a"
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

No tab character, and no white space other than the ordinary space (0x0020)!
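
The bfrange entries above can be expanded mechanically to see which Unicode characters the embedded subset is able to produce. A minimal sketch, assuming only the six ranges quoted in this comment (the helper name expand_bfrange is hypothetical and this is not a full CMap parser):

```python
import re

# The six bfrange entries quoted above: <first code><last code><first Unicode value>
BFRANGE = """
<21><21><0054>
<22><22><0065>
<23><24><0073>
<25><25><0020>
<26><27><0068>
<28><28><0061>
"""


def expand_bfrange(text):
    """Map each character code in the ranges to its Unicode character."""
    mapping = {}
    pattern = r"<([0-9A-Fa-f]+)><([0-9A-Fa-f]+)><([0-9A-Fa-f]+)>"
    for lo, hi, start in re.findall(pattern, text):
        lo, hi, start = int(lo, 16), int(hi, 16), int(start, 16)
        # Consecutive codes map to consecutive Unicode values.
        for offset in range(hi - lo + 1):
            mapping[lo + offset] = chr(start + offset)
    return mapping


glyphs = expand_bfrange(BFRANGE)
chars = sorted(glyphs.values())
print(chars)          # [' ', 'T', 'a', 'e', 'h', 'i', 's', 't']
print("\t" in chars)  # False
```

The only white-space character the subset maps to is the ordinary space, so a tab cannot come out of text drawn with this font.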

@mikejokic
Author

mikejokic commented Nov 15, 2023

Thank you for the quick response. The findings are surprising, as I authored this PDF by inserting a tab. Regardless, I want to work within the logic established by MuPDF.

Through your other comments on GitHub, I found the fitz module's gettext command and am now using it to preserve the layout, writing the output to a .txt file. However, I also want to retain page number information, for example: the word "test" is mentioned on pages 1, 5 and 10. How can I do this? The text file has no delimiters marking the end of a page.
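
One way to keep the page numbers is to skip the intermediate text file and extract the text page by page, so the page index is always at hand. This is only a sketch of that idea, not an answer from the maintainers: the helper names pages_containing and extract_page_texts are hypothetical, and the search is a simple case-insensitive substring match.

```python
def pages_containing(page_texts, word):
    """Return the 1-based page numbers whose text contains `word`.

    The match is a case-insensitive substring test, so "test" also
    matches "testing"; use a regular expression if whole words are needed.
    """
    needle = word.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.lower()
    ]


def extract_page_texts(path):
    """Return one text string per page of the PDF at `path`."""
    import fitz  # PyMuPDF, imported lazily so pages_containing works without it

    with fitz.open(path) as doc:
        return [page.get_text("text") for page in doc]
```

With the file from this issue, the call would look like pages_containing(extract_page_texts("/test.pdf"), "test").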
