feat(ocr): add ocr #2254
base: develop
…onderful tesseract.js path issues
74c2453
to
f135622
GREEN CHECKS!
CREATE TABLE IF NOT EXISTS ocr_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id TEXT NOT NULL,
    entity_type TEXT NOT NULL DEFAULT 'note',
    extracted_text TEXT NOT NULL,
    confidence REAL NOT NULL,
    language TEXT NOT NULL DEFAULT 'eng',
    extracted_at TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(entity_id, entity_type)
);
Instead of storing the results in a separate table (that is not synced yet), how about simply storing the results in an attachment?
This:
- Simplifies the table structure.
- Lets us easily filter for them by giving them a custom `role`.
- Allows additional functionality, such as viewing the attachment to review how well the OCR went.
- Lets us use the same `blobs` structure that is shared with notes and attachments.
The only problem is with image attachments, since we can't have attachments for attachments.
Maybe we have to think a bit deeper, to see if it can be reused for things other than OCR. Perhaps one way would be to have "different representations" for blobs: their binary data (e.g. images, files), but also a textual representation that can be used not only for images but also for OCR, LLMs, etc.
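To make the "sibling attachment" idea above concrete, here is a minimal sketch. Everything in it is hypothetical: the `AttachmentSketch` shape, the `ocr_text` role name, and both helper functions are illustrative assumptions, not the project's actual entities or API.

```typescript
// Hypothetical attachment shape for illustration only;
// not the project's real entity definition.
interface AttachmentSketch {
    attachmentId: string;
    ownerId: string;   // ID of the note that owns the attachment
    role: string;      // e.g. "image", "file", or the assumed "ocr_text"
    mime: string;
    title: string;
    content: string;   // would live in the shared blobs storage
}

// Assumed role name used to tell OCR output apart from other attachments.
const OCR_ROLE = "ocr_text";

// Wrap extracted text in a plain-text attachment owned by the same note.
function buildOcrAttachment(
    attachmentId: string,
    ownerId: string,
    sourceTitle: string,
    extractedText: string
): AttachmentSketch {
    return {
        attachmentId,
        ownerId,
        role: OCR_ROLE,
        mime: "text/plain",
        title: `OCR: ${sourceTitle}`,
        content: extractedText
    };
}

// Filtering by role is all that is needed to find OCR output for a note.
function findOcrAttachments(all: AttachmentSketch[], ownerId: string): AttachmentSketch[] {
    return all.filter((a) => a.ownerId === ownerId && a.role === OCR_ROLE);
}
```

Because the OCR text is an ordinary attachment in this sketch, it could be opened and reviewed like any other one. It does not address the image-attachment case raised above, where the source is itself an attachment.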
You definitely have the "whole picture" in mind - so do you think it's best to bite the bullet and create some new table/object that can be used/reused for objects (or even a more obscure "object", whatever that may be in the future) such as these, or do we just use something like a "sibling attachment" for now as you suggested?
CREATE INDEX IF NOT EXISTS idx_ocr_results_confidence
    ON ocr_results (confidence);

-- Create full-text search index for extracted text
It's interesting, but I suppose this creates additional complexity. We don't yet have full-text search for note content.
So I think we should try to find a solution (doesn't have to be in the same PR) that allows full-text search on the entire `blobs` table.
{ name: "ocrEnabled", value: "false", isSynced: true },
{ name: "ocrLanguage", value: "eng", isSynced: true },
{ name: "ocrAutoProcessImages", value: "true", isSynced: true },
{ name: "ocrMinConfidence", value: "0.2", isSynced: true },
We need to increase the minimum confidence since there are quite a few false positives:
aa _— =
ANI Eop i
Nar TR A Th ey
\ $08 Lt A SRT.
Ne. XO am >
’ rr Se LR YO - 1
AE NV Noobs
‘a ELS Je a
PAN AT
<! on ‘P ~ -~ gd pA
| A r ; > . - “
\Ke He : Led
Se ( ‘ i " ve
i» = d \ © Le (
Bur La ¢ NET NI
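As a rough illustration of how a higher threshold would filter output like the sample above, here is a small sketch. It assumes confidence is normalized to the 0..1 range (as the `"0.2"` default suggests); the `OptionSketch` and `OcrResultSketch` types and both functions are hypothetical, and the confidence numbers in the usage note are made up for the example.

```typescript
// Hypothetical shapes; options are stored as strings, as in the diff above.
interface OptionSketch {
    name: string;
    value: string;
}

interface OcrResultSketch {
    text: string;
    confidence: number; // assumed to be normalized to 0..1
}

// Read ocrMinConfidence from the options, falling back to the current
// default (0.2) when the value is missing, non-numeric, or out of range.
function parseMinConfidence(options: OptionSketch[], fallback = 0.2): number {
    const raw = options.find((o) => o.name === "ocrMinConfidence")?.value;
    const parsed = raw === undefined ? NaN : Number.parseFloat(raw);
    return Number.isFinite(parsed) && parsed >= 0 && parsed <= 1 ? parsed : fallback;
}

// Keep a result only if it has non-blank text and meets the threshold.
function acceptOcrResult(result: OcrResultSketch, minConfidence: number): boolean {
    return result.text.trim().length > 0 && result.confidence >= minConfidence;
}
```

With an illustrative confidence of 0.25, a noisy result would pass the current 0.2 default but be rejected at, say, 0.6. Which value to pick would need to be decided against real false positives.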
LOL
"The letters, I can see them - I swear they're there"
@zadam, it would be interesting to get your opinion on this.