Upgrade document extraction #187

flooie · 2024-04-29T19:27:36Z

This PR is meant to improve the extraction of text from PDFs by using a few
additional simple rules to decide if text is extracted appropriately.

Those rules include:

- Identifying any widget or free-text annotations (previously these could lead to an incorrect representation of the document).
- Images larger than 10% of the page. This is meant to exclude tiny images or images of lines.
- Gibberish text from unusual text embedding or missing fonts.
- Documents with fewer than 10 words per page on average.

Error strings were added for each of these reasons; they should be used on the CL side to check whether OCR is needed.
Previously we could identify a document as needing OCR but still return its text nonetheless, so pages could be missed and CL wouldn't be aware that it might want to OCR the document.
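
The rules above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `PageStats`, `ocr_needed_reason`, the error strings, and the gibberish heuristic are all hypothetical. Only the two thresholds (10% image area, 10 words per page) come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the PR description.
MAX_IMAGE_FRACTION = 0.10
MIN_AVG_WORDS_PER_PAGE = 10


@dataclass
class PageStats:
    """Hypothetical per-page summary; real code would read these from the PDF."""

    text: str
    has_widget_or_freetext_annotation: bool = False
    largest_image_fraction: float = 0.0  # image area divided by page area


def looks_like_gibberish(text: str) -> bool:
    """Crude stand-in heuristic: most non-space characters are not alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    readable = sum(c.isalnum() for c in chars)
    return readable / len(chars) < 0.5


def ocr_needed_reason(pages: List[PageStats]) -> Optional[str]:
    """Return a (hypothetical) error string if OCR looks necessary, else None."""
    if not pages:
        return "no pages"
    if any(p.has_widget_or_freetext_annotation for p in pages):
        return "annotation found"
    if any(p.largest_image_fraction > MAX_IMAGE_FRACTION for p in pages):
        return "large image found"
    if any(looks_like_gibberish(p.text) for p in pages):
        return "gibberish text"
    total_words = sum(len(p.text.split()) for p in pages)
    if total_words / len(pages) < MIN_AVG_WORDS_PER_PAGE:
        return "too few words"
    return None
```

Returning a reason string rather than a bare boolean is what would let the caller decide whether to OCR instead of silently receiving partial text.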

Additionally, an optional boolean flag, skip-margins, has been added. It can be used to
crop out the 1 inch margins that are required for court opinions, as well as the skewed stamp text we see in some
courts. This is meant to make the extracted text represent the text of the opinion.
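
As a rough sketch of the margin idea, assuming words arrive as dictionaries with `x0`, `x1`, `top`, `bottom`, and `upright` keys (similar to the word dictionaries pdfplumber produces), filtering could look like this. The function name `strip_margin_words` is hypothetical, and the PR itself may crop the page rather than filter words.

```python
from typing import Dict, List

POINTS_PER_INCH = 72  # PDF user-space units per inch


def strip_margin_words(
    words: List[Dict],
    page_width: float,
    page_height: float,
    margin: float = 1 * POINTS_PER_INCH,
) -> List[Dict]:
    """Keep only upright words that lie entirely inside the margin box."""
    kept = []
    for w in words:
        inside = (
            w["x0"] >= margin
            and w["x1"] <= page_width - margin
            and w["top"] >= margin
            and w["bottom"] <= page_height - margin
        )
        # "upright" is assumed to flag non-rotated text; skewed stamps fail it.
        if inside and w.get("upright", True):
            kept.append(w)
    return kept
```

On a US Letter page (612 x 792 points), a header word at `top=20` would be dropped, while body text starting at `top=100` would be kept.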

Tests were updated for the PDF changes, and several difficult PDFs were added as test cases.

Finally, a change to LXML and HTML cleaning was addressed by adding `lxml_html_clean`.

mlissner

Cool. I made a few comments, but none that I think is too crazy. My one remaining doubt is what the output looks like compared to the old output. Can you provide some examples of normal, good, bad, ugly, etc so we can see the improvement here?

I also worry that if we move to stripping margins by default, it will cause trouble down the road when we remove more than we want, for example on a scanned document where the scan is off center. Maybe it's safer to remove the margin at the top and left and leave the bottom and right?

mlissner · 2024-05-01T00:00:25Z

doctor/lib/utils.py

+    return "\n".join(page_content)
+
+
+def ocr_needed(path: str, content: str, page_count: int) -> [bool, Any]:


Are the checks in this function ordered roughly by cost, such that the fast checks come first and the expensive ones come later? Might be a good idea?
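
One way to express that ordering, as a sketch only: give each check a rough cost and run them cheapest first, stopping at the first failure. The helper `first_failure` and the cost numbers are hypothetical and are not part of the PR.

```python
from typing import Callable, List, Optional, Tuple

# Each check returns an error string when it fails, or None when it passes.
Check = Callable[[], Optional[str]]


def first_failure(checks: List[Tuple[int, Check]]) -> Optional[str]:
    """Run checks from cheapest to most expensive and stop at the first hit."""
    for _cost, check in sorted(checks, key=lambda pair: pair[0]):
        reason = check()
        if reason is not None:
            return reason
    return None
```

With this shape, a cheap word-count check that fails means the expensive image or annotation scan never runs for that document.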

flooie · 2024-05-01T15:34:31Z

> Cool. I made a few comments, but none that I think is too crazy. My one remaining doubt is what the output looks like compared to the old output. Can you provide some examples of normal, good, bad, ugly, etc so we can see the improvement here?
>
> I also worry that if we move to stripping margins by default, it will cause trouble down the road when we remove more than we want, for example on a scanned document where the scan is off center. Maybe it's safer to remove the margin at the top and left and leave the bottom and right?

I have to go back through the rest of your comments, and I will provide some sample output, but I wanted to address a few things first.

- strip_margins is set to false by default.
- strip_margins only applies to good PDFs that can be extracted without OCR. As you rightly point out, we don't want to strip or crop out the margins in a scan, because the margins could include actual content in an image. It also only works for content extracted with pdfplumber.

mlissner

Thanks for this, Bill. I think it's pretty good. Nice and tidy on the whole.

I think the main thing I'd like to see is more comments practically everywhere. There are a lot of finicky things in here that will be really hard to work on in the future unless they're commented ad nauseam.

Other than that, the other missing piece is a few lines in the changelog. We should make sure to do that too.

mlissner · 2024-05-20T20:30:45Z

I'm chatting with a customer now who values doctor for its high-speed text extraction. Could we keep pdftotext in this PR, and have a v2 text extractor that has all your improvements?

flooie · 2024-05-28T21:39:11Z

I heavily simplified the code and created (or am creating) a new PR for it, so I'm closing this PR.

flooie added 4 commits April 29, 2024 13:38

feat(req): Add lxml_html_clean

12bfaa3

Because of changes to LXML

feat(pdf): Add strip margin flag for PDF extraction

a91bc95

Add pdfplumber as main tool for extracting text from a PDF - and add a strip margin flag to enable cropping out text in the margins and removing skewed text

tests(extraction): Add and fix tests

0260927

Added and fixed tests. Modified one test PDF to better reflect the test.

chore(lint): Fix lint

5cb3d1c

flooie requested a review from mlissner April 30, 2024 13:46

mlissner reviewed May 1, 2024

flooie and others added 9 commits May 14, 2024 17:20

feat(ocr_utils): Move all ocr utils to new file

8c87680

feat(tasks): Update extract from pdf

5ee15b5

Change extract from pdf to drop ocr available flag

feat(utils): Move new/old utils to ocr

9d52634

feat(views): Update views to use new PDF extraction method

233a615

feat(tests): Update tests

c070cb2

[pre-commit.ci] auto fixes from pre-commit.com hooks

e3855f0

for more information, see https://pre-commit.ci

feat(ocr_utils): Update docstrings and typing

6dd78f1

feat(ocr_utils): Caption Adjustments

6c0fef0

[pre-commit.ci] auto fixes from pre-commit.com hooks

1e6a0c1

for more information, see https://pre-commit.ci

mlissner mentioned this pull request May 15, 2024

update django to pick up CVEs #188

Merged

mlissner requested changes May 16, 2024

flooie added 3 commits May 16, 2024 11:22

feat(text_extraction): Rename file

3a7666d

feat(views): Remove extra page count call

ad55b20

feat(tasks): Rename imports

06d26d0

flooie closed this May 28, 2024

mlissner mentioned this pull request May 29, 2024

Add new recap PDF extraction endpoint #190

Merged
