
The positional information extracted through the Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) method has a deviation. #2796

Closed
@1339503169

Description


Please provide all mandatory information!

Describe the bug (mandatory)

The word positions returned by Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) deviate from the actual text positions: some of the returned bounding boxes are much larger than the words they enclose.

To Reproduce (mandatory)

words_test.pdf

PyMuPDF version: 1.23.5

The code below reproduces the bug:

import fitz  # PyMuPDF

document = fitz.open('data/word_test.pdf')
page = document.load_page(0)

# Each entry is (x0, y0, x1, y1, text, block_no, line_no, word_no)
words = page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES)

# Outline every extracted word box in green
for word in words:
    rect = fitz.Rect(word[0], word[1], word[2], word[3])
    color = (0, 1, 0)
    page.draw_rect(rect, color)

document.save('word_test_new.pdf')

Some of the word boxes returned by Page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) are abnormally large, much larger than I expected. Is there an option or adjustment I am missing that would tighten them?
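To make "abnormally large" easier to pin down, here is a rough sketch that lists the suspicious boxes instead of relying on the drawn rectangles alone. It assumes the usual 'words' tuple layout (x0, y0, x1, y1, text, block_no, line_no, word_no). The helper name find_oversized_words and the factor of 3 are my own illustrative choices, not part of PyMuPDF, and the sample tuples are made up rather than taken from the attached PDF.

```python
from statistics import median


def find_oversized_words(words, factor=3.0):
    """Return word tuples whose width per character is far above the median.

    `words` is a list of tuples shaped like the output of
    page.get_text('words'): (x0, y0, x1, y1, text, block_no, line_no, word_no).
    `factor` is an arbitrary threshold chosen for illustration.
    """
    # Width per character for every non-empty word
    per_char = [(w[2] - w[0]) / len(w[4]) for w in words if w[4]]
    if not per_char:
        return []
    typical = median(per_char)
    return [
        w for w in words
        if w[4] and (w[2] - w[0]) / len(w[4]) > factor * typical
    ]


# Made-up sample data: two normal boxes and one that is far too wide
sample = [
    (10, 10, 40, 20, "Hello", 0, 0, 0),
    (45, 10, 75, 20, "world", 0, 0, 1),
    (10, 30, 200, 40, "wide", 0, 1, 0),
]

for w in find_oversized_words(sample):
    print(w[4], w[:4])
```

With the real file, the list returned by page.get_text('words', flags=fitz.TEXT_INHIBIT_SPACES) could be passed in place of sample, and the same call repeated without the flag to see whether the oversized boxes depend on it.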
