Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade document extraction #187

Closed
wants to merge 16 commits into from
Closed

Upgrade document extraction #187

wants to merge 16 commits into from

Conversation

flooie
Copy link
Collaborator

@flooie flooie commented Apr 29, 2024

@mlissner

This PR is meant to improve the extraction of text from PDFs by using a few
additional simple rules to decide if text is extracted appropriately.

Those rules include

  • Identifying any widget/free text annotations (previously this could lead to incorrect representation of the document)
  • Images larger than 10% of the page. Meant to exclude tiny images - or images of lines.
  • Gibberish text from weird text embedding or missing fonts
  • Documents with less than 10 words on average per page.

Err strings were added for each of these reasons, which should be used in checking if OCR is needed on the CL side.
Previously we would/could identify documents as needing OCR - but also returning the text none the less - so that pages could be missed and CL wouldnt be aware that it might want to OCR the document.

Additionally, an optional flag has been added skip-margins as a boolean that can be used to
crop out the 1 inch margins that are required for court opinions as well as skewed stamp text we see in some
courts. This is meant to get the text to represent the text of the opinion.

Tests were updated for the PDF changes and different possible difficult Pdfs were included.

Finally, a change to LXML and html Cleaning was addressed by adding `lxml_html_clean'.

Beacuse of changes to LXML
Add pdfplumber as main tool for extracting text
from a PDF - and add a strip margin flag to
enable cropping out text in the margins
and removing skewed text
Added and fixed tests
Modified one test pdf to better reflect the test
@flooie flooie requested a review from mlissner April 30, 2024 13:46
Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I made a few comments, but none that I think is too crazy. My one remaining doubt is what the output looks like compared to the old output. Can you provide some examples of normal, good, bad, ugly, etc so we can see the improvement here?

I also worry about if we move to striping margins by default that that will cause trouble down the road when we remove more than we want, like, for example, on a scanned document where the scan is off center or something. Maybe it's safer to remove the margin at the top and left and leave the bottom and right?

doctor/lib/utils.py Outdated Show resolved Hide resolved
requirements.txt Show resolved Hide resolved
doctor/lib/utils.py Outdated Show resolved Hide resolved
doctor/lib/utils.py Outdated Show resolved Hide resolved
doctor/lib/utils.py Outdated Show resolved Hide resolved
doctor/lib/utils.py Outdated Show resolved Hide resolved
doctor/lib/utils.py Outdated Show resolved Hide resolved
return "\n".join(page_content)


def ocr_needed(path: str, content: str, page_count: int) -> [bool, Any]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the checks in this function in roughly performant order, such the fast checks come first, and difficult ones come later? Might be a good idea?

doctor/tasks.py Show resolved Hide resolved
doctor/tasks.py Outdated Show resolved Hide resolved
@flooie
Copy link
Collaborator Author

flooie commented May 1, 2024

Cool. I made a few comments, but none that I think is too crazy. My one remaining doubt is what the output looks like compared to the old output. Can you provide some examples of normal, good, bad, ugly, etc so we can see the improvement here?

I also worry about if we move to striping margins by default that that will cause trouble down the road when we remove more than we want, like, for example, on a scanned document where the scan is off center or something. Maybe it's safer to remove the margin at the top and left and leave the bottom and right?

I have to go back thru the rest of your comments and I will provide some sample output but I wanted to address a few things.

  1. strip_margins is set to false by default.
  2. strip_margins only applies to good PDFs that can be extracted with OCR. As you rightly point out we don't want to strip or crop out the margins in a scan because the margins could include actual content in an image. And it only works for the content extracted in PDF plumber.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, Bill. I think it's pretty good. Nice and tidy on the whole.

I think the main thing I'd like to see are more comments practically everywhere. There are a lot of finicky things in here that will be really hard to work on in the future unless it's commented ad nasuem.

Other than that, the other missing piece is a few lines in the changelog. We should make sure to do that too.

doctor/views.py Outdated Show resolved Hide resolved
doctor/tasks.py Show resolved Hide resolved
doctor/lib/ocr_utils.py Outdated Show resolved Hide resolved
doctor/tasks.py Outdated Show resolved Hide resolved
doctor/lib/ocr_utils.py Outdated Show resolved Hide resolved
doctor/lib/ocr_utils.py Outdated Show resolved Hide resolved
doctor/lib/ocr_utils.py Outdated Show resolved Hide resolved
doctor/lib/ocr_utils.py Outdated Show resolved Hide resolved
doctor/lib/ocr_utils.py Outdated Show resolved Hide resolved
doctor/lib/ocr_utils.py Outdated Show resolved Hide resolved
@mlissner
Copy link
Member

I'm chatting with a customer now that values doctor for its high-speed text extraction. Could we keep pdftotext in this PR, and have a v2 text extractor that has all your improvements?

@flooie
Copy link
Collaborator Author

flooie commented May 28, 2024

I heavily simplified the code and created a NEW pr for it. or am - so im closing this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants