Character confidence threshold #3860

Merged
merged 19 commits into from
Jan 13, 2025

Conversation

Contributor

@plutasnyy plutasnyy commented Jan 6, 2025

This change adds the ability to filter out characters predicted by Tesseract with low confidence scores.

Some notes:

  • I intentionally disabled it by default; I think a low score (like 0.9-0.95 for Tesseract) could be a safe choice, though.
  • I wanted to use character bboxes and combine them into a word bbox later. However, in some specific scenarios a bug in Tesseract returns incorrect character bboxes (unit tests caught it 🥳). More details are in a comment in the code.
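
A minimal sketch of the filtering idea described above, not the code in this PR. The `filter_chars` helper and the `(character, confidence)` tuples are hypothetical; confidences are assumed to be on a 0-1 scale, and a threshold of 0.0 is assumed to mean "keep everything", matching the disabled-by-default behavior.

```python
from typing import List, Tuple


def filter_chars(chars: List[Tuple[str, float]], threshold: float = 0.0) -> str:
    """Join the characters whose confidence is at least `threshold`.

    With the default of 0.0 every character is kept, so the filter is
    effectively disabled unless a caller opts in.
    """
    return "".join(ch for ch, conf in chars if conf >= threshold)


# Hypothetical per-character predictions for one word.
word = [("H", 0.99), ("e", 0.97), ("l", 0.42), ("l", 0.96), ("o", 0.98)]

print(filter_chars(word))       # Hello (no filtering)
print(filter_chars(word, 0.9))  # Helo  (the 0.42 character is dropped)
```

Dropping characters changes the word text but says nothing about the word bbox, which is why the bbox is taken from the word level here rather than rebuilt from characters.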

@plutasnyy plutasnyy marked this pull request as ready for review January 8, 2025 10:39
@plutasnyy plutasnyy requested review from badGarnet and MaksOpp January 8, 2025 10:39
Contributor

@MaksOpp MaksOpp left a comment

LGTM!

@property
def TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD(self) -> float:
    """Tesseract predictions with confidence below this threshold are ignored"""
    return self._get_float("TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD", 0.0)
Contributor

I wonder whether we'd like a really low default threshold, e.g. 0.1, just to filter out complete garbage characters?

Collaborator

I am OK with 0; the default behavior is no filtering at all, so this PR should keep that for now. We can change this value in follow-ups.
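
For illustration, a float setting that falls back to a default of 0.0 could be read from the environment roughly as below. This is a sketch under assumptions: `get_float` is a hypothetical stand-in and is not claimed to match the project's `_get_float` implementation.

```python
import os


def get_float(var: str, default: float) -> float:
    """Read `var` from the environment as a float, or return `default`.

    Hypothetical helper: unset or empty variables fall back to the default.
    """
    raw = os.environ.get(var)
    if raw is None or raw == "":
        return default
    return float(raw)


name = "TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD"

# Unset: the default of 0.0 applies, so nothing is filtered.
os.environ.pop(name, None)
print(get_float(name, 0.0))

# Opting in to a low threshold, as suggested in the review.
os.environ[name] = "0.1"
print(get_float(name, 0.0))
```

Keeping the default at 0.0 means existing users see no behavior change until they set the variable themselves.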

image: np.ndarray,
lang: str = "eng",
config: str = "",
character_confidence_threshold: float = 0.5,
Contributor

Here we are adding a default of 0.5, so maybe let's also keep it in the config?

Contributor

I see that below we again have 0.5 as a default in hocr_to_dataframe, so either way, I would unify those.
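
One way to unify the defaults, sketched with hypothetical names: both functions accept `None` and resolve it against a single shared value. The `*_stub` functions and `resolve_threshold` are illustrative only and do not reproduce the PR's signatures.

```python
from typing import Optional

# Hypothetical single source of truth; 0.0 means "no filtering".
DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD = 0.0


def resolve_threshold(value: Optional[float] = None) -> float:
    """Return the caller's value, or the shared default when none is given."""
    return DEFAULT_CHARACTER_CONFIDENCE_THRESHOLD if value is None else value


def hocr_to_dataframe_stub(character_confidence_threshold: Optional[float] = None) -> float:
    # Stand-in for the parser: only reports which threshold it would apply.
    return resolve_threshold(character_confidence_threshold)


def get_layout_stub(character_confidence_threshold: Optional[float] = None) -> float:
    # Stand-in for the OCR entry point: resolve once, then pass the value down.
    threshold = resolve_threshold(character_confidence_threshold)
    return hocr_to_dataframe_stub(threshold)


print(get_layout_stub())     # shared default
print(get_layout_stub(0.5))  # explicit override
```

With this shape, the literal default lives in one place, so the entry point and the parser cannot drift apart.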

@plutasnyy plutasnyy added this pull request to the merge queue Jan 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 9, 2025
@plutasnyy plutasnyy added this pull request to the merge queue Jan 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 9, 2025
@plutasnyy plutasnyy added this pull request to the merge queue Jan 10, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 10, 2025
ocr_df = self.hocr_to_dataframe(hocr, character_confidence_threshold)
return ocr_df

def hocr_to_dataframe(
Collaborator

What's the compute performance of this code? We were essentially relying on Tesseract's internal C++ code to parse the results, but here we do it in Python.

Contributor Author

I have not analyzed this. We simply iterate over ~300 words, so I doubt there is any risk of a significant slowdown. What do you think?
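
A rough way to check the cost of the Python-side loop is to time it on synthetic data. This sketch assumes a page of about 300 words with six characters each, and it measures only the per-character filtering loop, not hOCR parsing itself, so it gives a lower bound rather than a full benchmark.

```python
import time

# Synthetic stand-in for one page: ~300 words of 6 characters each,
# every character carrying a made-up confidence of 0.95.
words = [[("a", 0.95)] * 6 for _ in range(300)]

start = time.perf_counter()
kept = ["".join(ch for ch, conf in word if conf >= 0.5) for word in words]
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"filtered {len(kept)} words in {elapsed_ms:.3f} ms")
```

For a fuller picture, the same timing could wrap the real parsing call on a representative document and be compared against the previous code path.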

Comment on lines 130 to 131
"width": right - left,
"height": bottom - top,
Collaborator

Small nit on performance: we can create the df from the bboxes first, then use vector ops to compute width and height (overwriting the data for right and bottom).

Contributor Author

done
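
The vectorized approach suggested in this thread might look roughly like the sketch below. The bbox values are made up, and the column layout is an assumption: the right and bottom coordinates are loaded into the `width` and `height` columns and then overwritten in place.

```python
import pandas as pd

# Hypothetical word bounding boxes as (left, top, right, bottom).
bboxes = [(10, 20, 50, 40), (60, 20, 120, 45)]

# Load right/bottom directly into the columns that will hold width/height.
df = pd.DataFrame(bboxes, columns=["left", "top", "width", "height"])

# Vectorized overwrite: width = right - left, height = bottom - top.
df["width"] = df["width"] - df["left"]
df["height"] = df["height"] - df["top"]

print(df)
```

This replaces a per-row subtraction in Python with two column-wise operations over the whole frame.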

@plutasnyy plutasnyy added this pull request to the merge queue Jan 13, 2025
Merged via the queue into main with commit 8685905 Jan 13, 2025
41 checks passed
@plutasnyy plutasnyy deleted the character-confidence-threshold branch January 13, 2025 14:03
3 participants