Add option to extract line-based bounding boxes from pdfminer. #874

bsowell · 2024-10-04T16:22:53Z

We have been using pdfminer's layout detection to group text into boxes. This can cause problems, especially for table extraction, when the boxes don't line up with table cells or with what the DETR model detects. This change adds an object_type parameter to PdfMinerExtractor. It can be set to "boxes" (the current behavior) or "lines", which groups characters into lines but does not group them further.

To avoid an explosion of options, we introduce a "text_extractor_options" dict as a parameter and refactor the TextExtractor class hierarchy a bit to support it.
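
For readers who want a concrete picture of the two modes, here is a minimal, self-contained sketch. It does not use pdfminer or the actual PdfMinerExtractor code: the `Line` and `Box` classes, the `extract_objects` function, and the default of "boxes" when the option is absent are all assumptions made for illustration. Only the `object_type` key and its two values come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Line:
    """Hypothetical stand-in for one line of text found by layout analysis."""
    text: str
    bbox: BBox


@dataclass
class Box:
    """Hypothetical stand-in for a text box: a group of one or more lines."""
    lines: List[Line] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)

    @property
    def bbox(self) -> BBox:
        # Union of the line bounding boxes; assumes the box has at least one line.
        x0s, y0s, x1s, y1s = zip(*(line.bbox for line in self.lines))
        return (min(x0s), min(y0s), max(x1s), max(y1s))


VALID_OBJECT_TYPES = ("boxes", "lines")


def extract_objects(
    layout: List[Box],
    text_extractor_options: Optional[Dict[str, Any]] = None,
) -> List[Tuple[str, BBox]]:
    """Return (text, bbox) pairs at the granularity chosen by object_type."""
    options = text_extractor_options or {}
    # Assumption: "boxes" is the default, since it is described as the current behavior.
    object_type = options.get("object_type", "boxes")
    if object_type not in VALID_OBJECT_TYPES:
        raise ValueError(
            f"object_type must be one of {VALID_OBJECT_TYPES}, got {object_type!r}"
        )
    if object_type == "boxes":
        # One entry per box: lines are merged into a single region.
        return [(box.text, box.bbox) for box in layout]
    # One entry per line: characters are grouped into lines and no further.
    return [(line.text, line.bbox) for box in layout for line in box.lines]


# Two stacked lines that layout analysis grouped into one box.
layout = [
    Box(lines=[
        Line("Name", (10.0, 100.0, 40.0, 110.0)),
        Line("Alice", (10.0, 85.0, 45.0, 95.0)),
    ])
]

as_boxes = extract_objects(layout, {"object_type": "boxes"})
as_lines = extract_objects(layout, {"object_type": "lines"})
```

In this toy layout, "boxes" yields a single region spanning both lines, while "lines" yields one region per line, which is the granularity that is easier to match against table cells.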

bsowell requested a review from karanataryn October 4, 2024 16:22
