feat(ocr): add ocr #2254

Draft
wants to merge 10 commits into
base: develop

perfectra1n
Member

No description provided.

@perfectra1n perfectra1n force-pushed the feat/add-ocr-capabilities branch from 74c2453 to f135622 Compare June 10, 2025 20:22
@perfectra1n perfectra1n marked this pull request as ready for review June 10, 2025 22:46
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jun 10, 2025
@perfectra1n
Member Author

GREEN CHECKS!

Comment on lines +14 to +25
CREATE TABLE IF NOT EXISTS ocr_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id TEXT NOT NULL,
    entity_type TEXT NOT NULL DEFAULT 'note',
    extracted_text TEXT NOT NULL,
    confidence REAL NOT NULL,
    language TEXT NOT NULL DEFAULT 'eng',
    extracted_at TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(entity_id, entity_type)
);
Contributor

@eliandoran eliandoran Jun 11, 2025

Instead of storing the results in a separate table (that is not synced yet), how about simply storing the results in an attachment?

This:

  • Simplifies the table structure.
  • Lets us filter for them easily via a custom role.
  • Allows additional functionality, such as viewing the attachment to review how well the OCR went.
  • Reuses the blobs structure that is already shared by notes and attachments.

The only problem is with image-attachments, since we can't have attachments for attachments.

Maybe we need to think a bit deeper, to see whether this can be reused for things other than OCR. One way would be to have "different representations" for blobs: their binary data (e.g. images, files), plus a textual representation that can be used for OCR, LLMs, etc.
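
To make the "different representations" idea a bit more concrete, here is a minimal sketch of what such a structure might look like. Everything in it (`Representation`, `BlobRecord`, `getText`, `withText`, the `source` values) is hypothetical and invented for illustration; it is not the existing blobs schema and not part of this PR.

```typescript
// Hypothetical shapes: none of these names come from the PR or the codebase.
type TextSource = "ocr" | "llm" | "native";

interface BinaryRepresentation {
    kind: "binary";
    mime: string;
    data: Uint8Array;
}

interface TextRepresentation {
    kind: "text";
    source: TextSource;
    text: string;
    confidence?: number;
}

type Representation = BinaryRepresentation | TextRepresentation;

interface BlobRecord {
    blobId: string;
    representations: Representation[];
}

/** Returns the stored text for the given source, or undefined if there is none. */
function getText(blob: BlobRecord, source: TextSource): string | undefined {
    for (const rep of blob.representations) {
        if (rep.kind === "text" && rep.source === source) {
            return rep.text;
        }
    }
    return undefined;
}

/** Adds or replaces the text representation for a source; binary data is left untouched. */
function withText(blob: BlobRecord, source: TextSource, text: string, confidence?: number): BlobRecord {
    const others = blob.representations.filter(
        (rep) => !(rep.kind === "text" && rep.source === source)
    );
    const textRep: TextRepresentation = { kind: "text", source, text };
    if (confidence !== undefined) {
        textRep.confidence = confidence;
    }
    // Return a new record instead of mutating the input.
    return { ...blob, representations: [...others, textRep] };
}
```

With a shape like this, an image attachment would not need an "attachment for an attachment": the OCR text would sit next to the binary data on the same blob, and re-running OCR would replace the previous text rather than accumulate copies.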

Member Author

@perfectra1n perfectra1n Jun 15, 2025

You definitely have the "whole picture" in mind - so do you think it's best to bite the bullet and create some new table/object that can be used/reused for objects (or even a more obscure "object", whatever that may be in the future) such as these, or do we just use something like a "sibling attachment" for now as you suggested?

CREATE INDEX IF NOT EXISTS idx_ocr_results_confidence
ON ocr_results (confidence);

-- Create full-text search index for extracted text
Contributor

It's interesting but I suppose this creates additional complexity. We don't yet have full-text search for note content.

So I think we should try to find a solution (doesn't have to be in the same PR) that allows full text search on the entire blobs table.

{ name: "ocrEnabled", value: "false", isSynced: true },
{ name: "ocrLanguage", value: "eng", isSynced: true },
{ name: "ocrAutoProcessImages", value: "true", isSynced: true },
{ name: "ocrMinConfidence", value: "0.2", isSynced: true },
Contributor

We need to increase the minimum confidence since there are quite a few false positives:

[image]

See OCR for this image:
[image]

aa _— =
ANI Eop i
Nar TR A Th ey
\ $08 Lt A SRT.
Ne. XO am >
’ rr Se LR YO - 1
AE NV Noobs
‘a ELS Je a
PAN AT
<! on ‘P ~ -~ gd pA
| A r ; > . - “
\Ke He : Led
Se ( ‘ i " ve
i» = d \ © Le (
Bur La ¢ NET NI
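
For reference, a minimal sketch of how a minimum-confidence gate could reject output like the above. The `OcrResult` shape and both helper names are hypothetical, and the sketch assumes the engine's confidence has been normalised to the same 0..1 scale as the `ocrMinConfidence` option; the PR may do this differently.

```typescript
// Hypothetical result shape; `confidence` is assumed to be on a 0..1 scale.
interface OcrResult {
    text: string;
    confidence: number;
}

/** Parses a stored option string such as "0.2"; falls back if it is not a number in [0, 1]. */
function parseMinConfidence(raw: string, fallback: number): number {
    const value = Number.parseFloat(raw);
    return Number.isFinite(value) && value >= 0 && value <= 1 ? value : fallback;
}

/** Keeps a result only if it has non-blank text and meets the threshold. */
function isUsable(result: OcrResult, minConfidence: number): boolean {
    return result.text.trim().length > 0 && result.confidence >= minConfidence;
}
```

With illustrative numbers (not measured from the image above): a result with confidence 0.25 passes the current default of "0.2" but would be dropped at a threshold such as 0.6.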

Member Author

@perfectra1n perfectra1n Jun 11, 2025

LOL
"The letters, I can see them - I swear they're there"

@eliandoran eliandoran marked this pull request as draft June 11, 2025 21:15
@eliandoran
Contributor

@zadam, it would be interesting to get your opinion on this.

2 participants