Add option to extract line-based bounding boxes from pdfminer. #874

Open. bsowell wants to merge 1 commit into main.

bsowell (Contributor) commented Oct 4, 2024

We have been using pdfminer's layout detection to group text into boxes. This can cause problems, especially in table extraction, when the boxes don't line up with table cells or with what the DETR model detects. This change adds an object_type parameter to PdfMinerExtractor that can be set to "boxes" (the current behavior) or "lines", which groups characters into lines but does not group them further.
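To make the difference between the two settings concrete, here is a minimal sketch that uses plain stand-in classes instead of pdfminer's layout objects. `Line`, `Box`, and `extract_objects` are hypothetical names chosen for illustration; they are not the code in this PR.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Line:
    """Hypothetical stand-in for one line of text and its bounding box."""
    text: str
    bbox: BBox


@dataclass
class Box:
    """Hypothetical stand-in for a group of lines produced by layout detection."""
    lines: List[Line]

    @property
    def bbox(self) -> BBox:
        # A box's bounding box is the union of its lines' bounding boxes.
        return (
            min(line.bbox[0] for line in self.lines),
            min(line.bbox[1] for line in self.lines),
            max(line.bbox[2] for line in self.lines),
            max(line.bbox[3] for line in self.lines),
        )

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


def extract_objects(boxes: List[Box], object_type: str = "boxes") -> List[Tuple[BBox, str]]:
    """Return (bbox, text) pairs at box granularity or at line granularity."""
    if object_type == "boxes":
        return [(box.bbox, box.text) for box in boxes]
    if object_type == "lines":
        return [(line.bbox, line.text) for box in boxes for line in box.lines]
    raise ValueError(f"unsupported object_type: {object_type!r}")


# Two cells from adjacent table rows that layout detection merged into one box.
merged = Box([
    Line("Revenue", (10, 100, 60, 110)),
    Line("Expenses", (10, 85, 65, 95)),
])

as_boxes = extract_objects([merged], "boxes")  # one element spanning both rows
as_lines = extract_objects([merged], "lines")  # one element per row
```

With "boxes", the two rows come back as a single element whose bounding box covers both cells; with "lines", each row keeps its own bounding box, which is easier to match against detected table cells.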

To avoid an explosion of options, we introduce a "text_extractor_options" dict as a parameter and refactor the TextExtractor class hierarchy a bit to support it.
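One way such an options dict could be threaded through to the extractor is sketched below. This is an assumption about the general pattern, not the PR's actual code: the class bodies and the `make_text_extractor` helper are hypothetical, and only the names `TextExtractor`, `PdfMinerExtractor`, `object_type`, and `text_extractor_options` come from the description above.

```python
from typing import Any, Dict, Optional


class TextExtractor:
    """Hypothetical base class; the real hierarchy in the PR may differ."""


class PdfMinerExtractor(TextExtractor):
    """Hypothetical extractor that validates its own options."""

    VALID_OBJECT_TYPES = ("boxes", "lines")

    def __init__(self, object_type: str = "boxes") -> None:
        if object_type not in self.VALID_OBJECT_TYPES:
            raise ValueError(
                f"object_type must be one of {self.VALID_OBJECT_TYPES}, got {object_type!r}"
            )
        self.object_type = object_type


def make_text_extractor(
    text_extractor_options: Optional[Dict[str, Any]] = None,
) -> TextExtractor:
    """Hypothetical helper: forward the options dict as keyword arguments.

    Callers pass a single dict, so adding a new extractor option later does
    not require a new parameter on every caller in between.
    """
    options = dict(text_extractor_options or {})  # copy to avoid mutating the caller's dict
    return PdfMinerExtractor(**options)


default_extractor = make_text_extractor()
line_extractor = make_text_extractor({"object_type": "lines"})
```

Omitting the dict keeps the current "boxes" behavior, so existing callers would be unaffected under this sketch.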
