Character confidence threshold #3860

plutasnyy · 2025-01-06T19:05:45Z

This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.

Some notes:

- I intentionally disabled it by default; I think some low score (like 0.9-0.95 for Tesseract) could be a safe choice, though.
- I wanted to use character bboxes and combine them into a word bbox later. However, a bug in Tesseract returns incorrect character bboxes in some specific scenarios (unit tests caught it 🥳). More in the comment in the code.
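As a rough illustration of the filtering idea described above, here is a minimal sketch. The `Char` type and `filter_word` helper are hypothetical names made up for this example, not code from the PR, and confidences are assumed to be normalized to the 0.0-1.0 range.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical per-character OCR prediction (not a type from the PR)."""

    text: str
    confidence: float  # assumed to be normalized to the 0.0-1.0 range


def filter_word(chars: list[Char], threshold: float = 0.0) -> str:
    """Keep only characters whose confidence is at least `threshold`."""
    return "".join(c.text for c in chars if c.confidence >= threshold)


word = [Char("H", 0.99), Char("e", 0.97), Char("|", 0.41), Char("y", 0.96)]
unfiltered = filter_word(word)     # threshold 0.0 keeps every character
filtered = filter_word(word, 0.9)  # drops the low-confidence "|"
```

With the default threshold of 0.0 nothing is removed, which matches the "disabled by default" behavior mentioned in the notes.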

MaksOpp

LGTM!

MaksOpp · 2025-01-08T15:23:40Z

unstructured/partition/utils/config.py

+    @property
+    def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> int:
+        """Tesseract predictions with confidence below this threshold are ignored"""
+        return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)


I wonder, maybe we'd like to have some really low default threshold, e.g. 0.1, just to filter out complete garbage chars?

badGarnet · Jan 10, 2025

I am ok with 0; the default behavior is no filter at all, so this PR should just keep that for now. We can use follow-ups to change this value.
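To make the "0 means no filter, change it later" point concrete, here is a small sketch of an environment-backed setting. Only the property name and the `_get_float` call appear in the diff; the `EnvConfig` class and the body of `_get_float` are assumptions written for this example. The property is annotated as `float` here because it returns a float.

```python
import os


class EnvConfig:
    """Hypothetical stand-in for the config class touched by the PR."""

    def _get_float(self, var: str, default_value: float) -> float:
        # Assumed behavior: fall back to the default when the variable is unset or empty.
        value = os.environ.get(var, "")
        return float(value) if value else default_value

    @property
    def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> float:
        """Tesseract predictions with confidence below this threshold are ignored."""
        return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)


config = EnvConfig()

# Unset: the default of 0.0 keeps every character (no filtering).
os.environ.pop("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", None)
default_threshold = config.TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD

# Opting in to a low threshold, such as the 0.1 suggested in the review.
os.environ["TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD"] = "0.1"
low_threshold = config.TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD
```

Because the value is read on each access, a follow-up could change the default in one place without touching callers.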

MaksOpp · 2025-01-08T15:25:13Z

unstructured/partition/utils/ocr_models/tesseract_ocr.py

+        image: np.ndarray,
+        lang: str = "eng",
+        config: str = "",
+        character_confidence_threshold: float = 0.5,


Here we are adding some default, so maybe let's also keep it in config?

I see below we again have 0.5 as a default in hocr_to_dataframe, so either way, I would unify those
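One possible way to unify the defaults, sketched under assumptions: every function takes `None` as its default and resolves it against a single shared value. The names `DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD`, `resolve_threshold`, `keep_confident_chars`, and `chars_to_text` are hypothetical and only illustrate the pattern; they are not the PR's functions.

```python
from typing import Optional

# Hypothetical single source of truth; in the PR this role would be played by the config value.
DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD = 0.0


def resolve_threshold(character_confidence_threshold: Optional[float] = None) -> float:
    """Return the explicit threshold if one was passed, otherwise the shared default."""
    if character_confidence_threshold is None:
        return DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD
    return character_confidence_threshold


def keep_confident_chars(chars, character_confidence_threshold: Optional[float] = None):
    """Filter (char, confidence) pairs; no literal default is repeated here."""
    threshold = resolve_threshold(character_confidence_threshold)
    return [(c, conf) for c, conf in chars if conf >= threshold]


def chars_to_text(chars, character_confidence_threshold: Optional[float] = None) -> str:
    """Outer entry point; it forwards the argument instead of re-declaring a default."""
    kept = keep_confident_chars(chars, character_confidence_threshold)
    return "".join(c for c, _ in kept)


sample = [("a", 0.2), ("b", 0.9)]
text_default = chars_to_text(sample)        # shared default applies
text_strict = chars_to_text(sample, 0.5)    # explicit override
```

With this shape, two signatures can no longer drift apart (for example 0.5 in one place and 0.0 in another), since neither carries a numeric literal.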

badGarnet · 2025-01-10T17:26:54Z

unstructured/partition/utils/ocr_models/tesseract_ocr.py

+        ocr_df = self.hocr_to_dataframe(hocr, character_confidence_threshold)
+        return ocr_df
+
+    def hocr_to_dataframe(


What's the compute performance of this code? We were essentially relying on Tesseract's internal C++ code to parse results, but here we do it in Python.

plutasnyy · Jan 13, 2025

I have not analyzed this. We simply iterate over ~300 words, so I am not sure there is any risk of a significant slowdown. What do you think?
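One way to get a rough number for this question is to time a pure-Python parse of a page-sized input. The sketch below is not the PR's implementation: it uses the standard library's `xml.etree.ElementTree` on synthetic hOCR-like markup, and it assumes character confidences are reported as `x_conf` on a 0-100 scale and normalized to 0-1. The data and the `parse_words` helper are hypothetical.

```python
import time
import xml.etree.ElementTree as ET

# Synthetic hOCR-like fragment: 300 words of 5 characters each (made-up data).
CHAR = "<span class='ocrx_cinfo' title='x_bboxes 0 0 10 20; x_conf 97.5'>a</span>"
WORD = "<span class='ocrx_word' title='bbox 0 0 50 20; x_wconf 96'>" + CHAR * 5 + "</span>"
HOCR = "<div>" + WORD * 300 + "</div>"


def parse_words(hocr: str, threshold: float) -> list[str]:
    """Rebuild each word from the characters that pass the confidence threshold."""
    words = []
    for word in ET.fromstring(hocr).iter("span"):
        if word.get("class") != "ocrx_word":
            continue
        kept = []
        for char in word.iter("span"):
            if char.get("class") != "ocrx_cinfo":
                continue
            # Assumption: x_conf is on a 0-100 scale, so normalize it to 0-1.
            raw_conf = char.get("title").split("x_conf")[1].split(";")[0]
            if float(raw_conf) / 100.0 >= threshold:
                kept.append(char.text or "")
        words.append("".join(kept))
    return words


start = time.perf_counter()
words = parse_words(HOCR, threshold=0.9)
elapsed = time.perf_counter() - start
print(f"parsed {len(words)} words in {elapsed * 1000:.2f} ms")
```

The absolute timing depends on the machine and on the parser actually used, so this only indicates an order of magnitude relative to the OCR call itself.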

badGarnet · 2025-01-10T17:29:37Z

unstructured/partition/utils/ocr_models/tesseract_ocr.py

+                        "width": right - left,
+                        "height": bottom - top,


Small nit on performance: we can create the df using the bbox first, then use vector ops to compute width and height (and overwrite the data for right and bottom).
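The suggestion above could look roughly like the following pandas sketch. The rows and column names are assumptions chosen for illustration; the PR's actual dataframe layout may differ.

```python
import pandas as pd

# Hypothetical word boxes as (left, top, right, bottom, text) tuples.
rows = [(10, 20, 110, 50, "Hello"), (120, 20, 200, 52, "world")]

# Build the frame from the raw bbox values first, with no per-row arithmetic.
df = pd.DataFrame(rows, columns=["left", "top", "right", "bottom", "text"])

# Vectorized: compute width and height for all rows at once,
# overwriting the right/bottom columns and then renaming them.
df["right"] = df["right"] - df["left"]
df["bottom"] = df["bottom"] - df["top"]
df = df.rename(columns={"right": "width", "bottom": "height"})
```

This moves the subtraction out of the Python loop that builds each row and into two column-wise operations.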

plutasnyy added 12 commits January 3, 2025 13:45

add pobs

a8dd7b8

upadte

9e31ebc

feat: Add character level confidence thresholds

c0f2768

add psm

052ae50

Fix config name

4b54d8a

Update

6fcd3f4

Merge remote-tracking branch 'origin/main' into character-confidence-…

c157a66

…threshold

Update config

137678f

Remove unused zoom

c25039f

Remove unused zoom

3bff8ae

Use word bboxes instead of character bboxees

c1e9b8e

Do not return None

0e44926

plutasnyy marked this pull request as ready for review January 8, 2025 10:39

plutasnyy requested review from badGarnet and MaksOpp January 8, 2025 10:39

plutasnyy added 3 commits January 8, 2025 11:56

Fix empty df scenario

2d9054d

fix unit test

a61aa85

Fix unittests

1611a61

MaksOpp approved these changes Jan 8, 2025

plutasnyy added 3 commits January 8, 2025 16:46

Set default threshold

cee5440

Merge remote-tracking branch 'origin/main' into character-confidence-…

b2dd3fe

…threshold

Set default threshold to 0

c5b6570

plutasnyy added this pull request to the merge queue Jan 9, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 9, 2025

plutasnyy added this pull request to the merge queue Jan 9, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 9, 2025

plutasnyy added this pull request to the merge queue Jan 10, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 10, 2025

badGarnet reviewed Jan 10, 2025

Refactor

013a351

plutasnyy added this pull request to the merge queue Jan 13, 2025

Merged via the queue into main with commit 8685905 Jan 13, 2025
41 checks passed

plutasnyy deleted the character-confidence-threshold branch January 13, 2025 14:03
