This file has an illegal font specification in that it uses "Identity-H" encoding for a non-embedded font (SimSun).
This seems to cause confusion with respect to character sizes (which leads to the extremely tall bboxes).
All of this is outside the control of PyMuPDF and has to be looked at by the MuPDF experts.
Do you want to submit a bug there? https://bugs.ghostscript.com/enter_bug.cgi
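The font diagnosis above can be checked on other files with a short script. The following is only a sketch: `suspicious_fonts` and `report` are hypothetical helper names, and treating the extension value `"n/a"` from `Page.get_fonts()` as "no embedded font file" is an assumption based on how PyMuPDF documents that field, not a guaranteed test.

```python
def suspicious_fonts(font_entries):
    """Return (xref, basefont, encoding) for fonts that look non-embedded
    but declare Identity-H encoding.

    font_entries is expected in the shape returned by Page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    """
    flagged = []
    for entry in font_entries:
        xref, ext, _ftype, basefont, _name, encoding = entry[:6]
        # Assumption: ext == "n/a" means no extractable (embedded) font file.
        if encoding == "Identity-H" and ext == "n/a":
            flagged.append((xref, basefont, encoding))
    return flagged


def report(path, page_number=0):
    """Open a PDF and list the suspicious fonts used on one page."""
    import fitz  # PyMuPDF, imported lazily so suspicious_fonts() has no dependency

    doc = fitz.open(path)
    try:
        return suspicious_fonts(doc[page_number].get_fonts())
    finally:
        doc.close()
```

Calling `report('data/word_test.pdf')` on the file from this issue would be expected to list the SimSun font if the diagnosis holds; an empty list would suggest the heuristic does not apply to that file.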
Separately, when I convert the PDF to an image, the rendered image does not look consistent with the original PDF. This is the image I rendered from this PDF. Is there some setting I am missing?
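For reference, the one rendering setting that is commonly adjusted is the resolution. Below is a minimal sketch of rendering a page at a chosen DPI; `zoom_for_dpi` and `render_page` are hypothetical helper names and the default of 150 DPI is arbitrary. A higher resolution only sharpens the output. If the difference comes from the non-embedded font being substituted, this setting would not be expected to change it.

```python
def zoom_for_dpi(dpi):
    """PDF user space uses 72 units per inch, so zoom = dpi / 72."""
    return dpi / 72.0


def render_page(pdf_path, png_path, page_number=0, dpi=150):
    """Render one page of a PDF to a PNG file at the requested resolution."""
    import fitz  # PyMuPDF, imported lazily so zoom_for_dpi() has no dependency

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(pdf_path)
    try:
        page = doc.load_page(page_number)
        # Scale both axes by the same factor to keep the aspect ratio.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(png_path)
    finally:
        doc.close()
```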
Describe the bug (mandatory)
The word positions returned by Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) deviate from the actual text: some bounding boxes are much larger than the words they enclose.
To Reproduce (mandatory)
words_test.pdf
PyMuPDF version: 1.23.5
The code below reproduces the bug:
```python
import fitz  # PyMuPDF

document = fitz.open('data/word_test.pdf')
page = document.load_page(0)

# Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no).
words = page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES)

for word in words:
    rect = fitz.Rect(word[0], word[1], word[2], word[3])
    color = (0, 1, 0)  # green outline
    page.draw_rect(rect, color)

document.save('word_test_new.pdf')
```
Some of the text boxes extracted through Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) are abnormal: they are much larger than I anticipated. Is there a setting or optimization that I might be missing?
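Until the underlying cause is resolved, one possible stopgap is to separate the abnormal boxes from the rest by comparing each box height with the median height on the page. This is a heuristic sketch, not a fix: `split_by_height` is a hypothetical helper, the `max_ratio` threshold of 3.0 is an arbitrary assumption, and the approach breaks down if most boxes on the page are affected.

```python
from statistics import median


def split_by_height(words, max_ratio=3.0):
    """Split 'words' tuples into (normal, oversized) by bounding-box height.

    Each tuple starts with (x0, y0, x1, y1, ...), as returned by
    Page.get_text('words'). A box counts as oversized when its height
    exceeds max_ratio times the median height of all boxes.
    """
    if not words:
        return [], []
    heights = [w[3] - w[1] for w in words]
    limit = max_ratio * median(heights)
    normal = [w for w, h in zip(words, heights) if h <= limit]
    oversized = [w for w, h in zip(words, heights) if h > limit]
    return normal, oversized
```

With the reproduction script above, `normal, oversized = split_by_height(words)` could be used to draw only the `normal` boxes, or to draw the `oversized` ones in a different color for inspection.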