fix(text_extraction): Explain get_word
Add explanation for the get_word func and
why we chose mysterious parameters
flooie committed May 29, 2024
1 parent bd3aace commit c246bef
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions doctor/lib/text_extraction.py
@@ -270,6 +270,28 @@ def insert_whitespace(content: str, word: dict, prev: dict) -> str:
def get_word(word_dict: dict, width: float, strip_margin: bool) -> str:
"""Append word to content

This function determines whether a word should be added to the page content
and adds the word.

Tesseract provides confidence values for its OCR output. We use those
confidence values to decide whether a word is good OCR output, a likely
artifact that should be excluded, or bad OCR that is not an artifact.

If a word has zero confidence or starts on the leftmost edge of the page,
we return it as an empty string. It is likely an artifact.

If a word has a confidence below 40 (a number that usually equates to 3 to 5
standard deviations from the confidences found in other words) and is
entirely in the margin of the page, it is likely an artifact as well.

If the confidence is below 5 for a very short word or a very long word,
the word is likely part of the document, but we cannot tell what it says,
so we return a square box to indicate that. This is often caused by stamps
or lines in case captions.

Finally, if a low-confidence word starts in the right margin, it is likely
bad OCR that is multiple standard deviations away, so we return the word as
empty squares.

:param word_dict: The word object from tesseract
:param width: The width of the document
:param strip_margin: Whether to strip the margin
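
The function body is collapsed in this view, so the following is only a minimal sketch of the rules the docstring describes, not the committed implementation. The name classify_word, the dictionary keys (text, conf, left, width), the one-inch margin on an 8.5-inch-wide page, the length cut-offs of 3 and 20 characters, reading "the margin" in the second rule as the left margin, and applying strip_margin only to that rule are all assumptions made for illustration. Only the thresholds 0, 5 and 40 come from the docstring.

```python
# Hypothetical sketch of the docstring's rules; not the code from the commit.
PLACEHOLDER = "□"

# Thresholds named in the docstring.
LOW_CONF = 40
VERY_LOW_CONF = 5

# Assumed cut-offs for "very short" and "very long" words.
SHORT_LEN = 3
LONG_LEN = 20


def classify_word(word_dict: dict, width: float, strip_margin: bool) -> str:
    """Return the word, an empty string, or placeholder squares."""
    text = word_dict["text"]
    conf = word_dict["conf"]
    left = word_dict["left"]
    right = left + word_dict["width"]

    # Assumed geometry: 8.5-inch-wide page with one-inch margins.
    margin = width / 8.5
    left_margin = margin
    right_margin = width - margin

    # Zero confidence, or starting on the leftmost edge: likely an artifact.
    if conf == 0 or left == 0:
        return ""

    # Low confidence and entirely inside the (assumed left) margin: artifact.
    if strip_margin and conf < LOW_CONF and right < left_margin:
        return ""

    # Very low confidence on a very short or very long word: probably real
    # content that cannot be read, so mark it with a single square.
    if conf < VERY_LOW_CONF and (len(text) <= SHORT_LEN or len(text) > LONG_LEN):
        return PLACEHOLDER

    # Low confidence and starting in the right margin: one square per character.
    if conf < LOW_CONF and left > right_margin:
        return PLACEHOLDER * len(text)

    return text
```

With a page width of 850 pixels the assumed margins sit at 100 and 750 pixels, so a word such as {"text": "hello", "conf": 30, "left": 800, "width": 40} falls under the last rule and comes back as five squares, while a confident word in the body of the page is returned unchanged.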