The positional information extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method has a deviation. #2796

Closed
1339503169 opened this issue Nov 10, 2023 · 4 comments


@1339503169

Please provide all mandatory information!

Describe the bug (mandatory)

The positional information extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method has a deviation.

To Reproduce (mandatory)

words_test.pdf
image
image

PyMuPDF version is 1.23.5

The code below reproduces the bug:

    import fitz  # PyMuPDF

    document = fitz.open('data/word_test.pdf')
    page = document.load_page(0)
    words = page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES)
    for word in words:
        # Each word tuple starts with its bounding box: (x0, y0, x1, y1, ...)
        rect = fitz.Rect(word[0], word[1], word[2], word[3])
        color = (0, 1, 0)
        page.draw_rect(rect, color)
    document.save('word_test_new.pdf')

The text boxes extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) contain some abnormal blocks that seem much larger than I anticipated. Is there room for optimization that I might be missing?
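To make "abnormal" measurable, one option is to compare each word's bounding-box height against the median height on the page. The following is only a minimal sketch: the helper name `flag_oversized_words`, the `max_ratio` threshold, and the sample tuples are hypothetical and are not taken from words_test.pdf. It assumes the tuple layout returned by `page.get_text('words')`, i.e. `(x0, y0, x1, y1, text, block_no, line_no, word_no)`.

```python
from statistics import median


def flag_oversized_words(words, max_ratio=3.0):
    """Return word tuples whose bbox height exceeds max_ratio times the median height.

    `words` uses the layout of page.get_text('words'):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    `max_ratio` is an arbitrary threshold chosen for illustration.
    """
    heights = [w[3] - w[1] for w in words]
    if not heights:
        return []
    typical = median(heights)
    if typical <= 0:
        return []
    return [w for w, h in zip(words, heights) if h > max_ratio * typical]


# Hypothetical sample data, not extracted from the attached PDF.
sample = [
    (10, 10, 40, 22, "normal", 0, 0, 0),
    (45, 10, 80, 22, "words", 0, 0, 1),
    (85, 10, 120, 22, "here", 0, 0, 2),
    (10, 30, 40, 130, "tall", 0, 1, 0),
]
oversized = flag_oversized_words(sample)
print([w[4] for w in oversized])
```

Passing the real `words` list from the reproduction script into the helper would list the boxes that stand out, which may make it easier to describe the deviation precisely in a bug report.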

@JorjMcKie
Collaborator

This file has an illegal font specification: it uses "Identity-H" encoding for a non-embedded font (SimSun).
This seems to cause confusion w.r.t. character sizes (causing the extremely high bboxes).
All this is outside the control of PyMuPDF and has to be looked at by the MuPDF experts.
Do you want to submit a bug there? https://bugs.ghostscript.com/enter_bug.cgi
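For anyone who wants to check their own files for the condition described above, the font list of a page can be inspected with `page.get_fonts()`. This is a rough sketch, not an official diagnostic: the helper names `find_suspect_fonts` and `report_suspect_fonts` are hypothetical, and it assumes that an extension value of `"n/a"` in the font tuple means no font file is embedded.

```python
def find_suspect_fonts(font_entries):
    """Return (basefont, encoding) pairs that use Identity-H without an embedded font file.

    Each entry follows the tuple layout of PyMuPDF's page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    Assumption: ext == "n/a" means no font file can be extracted,
    which is treated here as "not embedded".
    """
    suspects = []
    for entry in font_entries:
        ext, basefont, encoding = entry[1], entry[3], entry[5]
        if encoding == "Identity-H" and ext == "n/a":
            suspects.append((basefont, encoding))
    return suspects


def report_suspect_fonts(pdf_path, page_number=0):
    """Open a PDF with PyMuPDF and list suspect fonts on one page."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    with fitz.open(pdf_path) as document:
        page = document.load_page(page_number)
        return find_suspect_fonts(page.get_fonts())


# Hypothetical font entries for illustration only.
example_fonts = [
    (5, "n/a", "Type0", "SimSun", "F1", "Identity-H"),
    (8, "ttf", "TrueType", "ABCDEF+Arial", "F2", "WinAnsiEncoding"),
]
print(find_suspect_fonts(example_fonts))
```

Calling `report_suspect_fonts('data/word_test.pdf')` on the file from the report should show whether SimSun is flagged; the output is worth attaching to a MuPDF bug submission.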

@1339503169
Author

Alternatively, I tried converting the PDF to an image, but the rendered image does not look consistent with the original PDF. This is the image I rendered from this PDF. Is there some setting I have not set?
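For reference, a page can be rendered with an explicit resolution instead of the default. `page.get_pixmap()` renders at 72 dpi unless a zoom matrix (or `dpi`) is given, so a default render can look coarser than the PDF does in a viewer. Whether that explains the difference seen here is not confirmed. The sketch below uses hypothetical helper names (`zoom_factors`, `render_page`) and an arbitrary default of 150 dpi.

```python
def zoom_factors(dpi):
    """Return (x, y) zoom factors for a target DPI; PDF user space is 72 units per inch."""
    zoom = dpi / 72
    return zoom, zoom


def render_page(pdf_path, png_path, page_number=0, dpi=150):
    """Render one page to a PNG file at the requested resolution."""
    import fitz  # PyMuPDF; imported lazily so zoom_factors works without it

    with fitz.open(pdf_path) as document:
        page = document.load_page(page_number)
        zoom_x, zoom_y = zoom_factors(dpi)
        # alpha=False gives an opaque (white) background instead of transparency
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom_x, zoom_y), alpha=False)
        pix.save(png_path)
        return pix.width, pix.height
```

For example, `render_page('data/word_test.pdf', 'word_test.png', dpi=300)` would write a higher-resolution image that can be compared side by side with the original page.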
image

@JorjMcKie
Collaborator

JorjMcKie commented Nov 14, 2023

I see no difference - where are the deviations?

@JorjMcKie
Collaborator

Closed because no response was received for an extended period of time.
