fix issue 964 #965

Open
wants to merge 3 commits into
base: develop

Conversation


@jnhyperion jnhyperion commented Aug 10, 2023

I found that this issue is caused by some blank chars that overlap the non-blank chars following them.
The simple solution is to remove these overlapping blank chars.

fix: #964
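
To illustrate the idea described above, here is a minimal standalone sketch, not the code from this PR. It assumes chars are plain dicts with `text`, `x0`, `x1`, `top`, and `bottom` keys (the bounding-box keys pdfplumber uses for its char objects); the function names `overlaps` and `drop_overlapping_blanks` and the `tolerance` parameter are hypothetical.

```python
from typing import Any, Dict, List

Char = Dict[str, Any]


def overlaps(a: Char, b: Char, tolerance: float = 0.0) -> bool:
    """Return True if the two bounding boxes intersect by more than
    `tolerance` on both the horizontal and the vertical axis."""
    x_overlap = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    y_overlap = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    return x_overlap > tolerance and y_overlap > tolerance


def drop_overlapping_blanks(chars: List[Char], tolerance: float = 0.0) -> List[Char]:
    """Hypothetical helper: remove whitespace chars whose box overlaps
    any non-whitespace char; all other chars are kept in order."""
    solid = [c for c in chars if not c["text"].isspace()]
    kept = []
    for c in chars:
        if c["text"].isspace() and any(overlaps(c, s, tolerance) for s in solid):
            continue  # blank sits on top of a visible glyph, so drop it
        kept.append(c)
    return kept


# Made-up sample data: the first blank overlaps "b", the second is a real gap.
chars = [
    {"text": "a", "x0": 0, "x1": 5, "top": 0, "bottom": 10},
    {"text": " ", "x0": 5, "x1": 8, "top": 0, "bottom": 10},
    {"text": "b", "x0": 6, "x1": 11, "top": 0, "bottom": 10},
    {"text": " ", "x0": 11, "x1": 14, "top": 0, "bottom": 10},
    {"text": "c", "x0": 14, "x1": 19, "top": 0, "bottom": 10},
]

cleaned = drop_overlapping_blanks(chars)
print("".join(c["text"] for c in cleaned))
```

Boxes that merely touch at an edge are not treated as overlapping here, so genuine word gaps survive. This is a quadratic scan; a real implementation would probably want to compare only neighbouring chars.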

@jsvine
Owner

jsvine commented Aug 16, 2023

Thanks for this proposal, @jnhyperion. I think this particular change isn't quite right for the library, as it's quite specific to a particular (and relatively uncommon) edge case. I find that changes like those might fix the handling of some PDFs, but risk causing problems for others, as there's such a wide variety of PDFs. But perhaps we can think of a more general feature that would still help for your use case, such as a simple `.extract_text(ignore_whitespace=True)` parameter or `Page.remove_whitespace(..., only_overlapping=True)` method (in a similar spirit to `Page.dedupe_chars(...)`).

Added `page.remove_whitespace(only_overlapping=False, ...)`
@jnhyperion
Author

You're right. I added a new method, `Page.remove_whitespace`.

@jnhyperion jnhyperion changed the base branch from stable to develop August 28, 2023 02:05

Successfully merging this pull request may close these issues.

extracted word is broken
2 participants