This repository has been archived by the owner on Nov 9, 2024. It is now read-only.

Configurable Tika ocr strategies for PDFs #100

Open
wombat94 opened this issue Mar 30, 2020 · 0 comments
Labels
area/tika type/enhancement New feature or request

Comments

@wombat94

OCR of PDFs in Tika can take a long time. This is unnecessary if the PDF has already been OCRed and carries a text layer.

I would like an option in the lodestone front end to choose the OCR strategy that Tika uses.

Ideally, this would be multi-pass: the first pass uses no_ocr, and if the size of the returned text is below a threshold (perhaps 500 bytes), the document is re-processed with text_and_ocr so that it is actually recognized.
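
To make the idea concrete, here is a minimal sketch of the two-pass flow in Python. It is not lodestone's code, and every name in it (`tika_extract`, `extract_multipass`, `min_bytes`) is hypothetical. It assumes a tika-server instance at `http://localhost:9998/tika` that honours the `X-Tika-PDFOcrStrategy` request header. The fallback strategy string defaults to `ocr_and_text`, which is my assumption for the spelling Tika accepts for the combined mode referred to above as text_and_ocr; check it against the Tika version in use.

```python
import urllib.request
from typing import Callable, Tuple

# An extractor takes (pdf_bytes, ocr_strategy) and returns the extracted text.
Extractor = Callable[[bytes, str], str]


def tika_extract(pdf: bytes, strategy: str,
                 url: str = "http://localhost:9998/tika") -> str:
    """Send a PDF to tika-server and return its plain-text output.

    Assumption: the server reads the OCR strategy from the
    X-Tika-PDFOcrStrategy header; the URL is a placeholder.
    """
    req = urllib.request.Request(
        url,
        data=pdf,
        method="PUT",
        headers={
            "Content-Type": "application/pdf",
            "Accept": "text/plain",
            "X-Tika-PDFOcrStrategy": strategy,
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_multipass(pdf: bytes,
                      extract: Extractor = tika_extract,
                      min_bytes: int = 500,
                      first: str = "no_ocr",
                      fallback: str = "ocr_and_text") -> Tuple[str, str]:
    """Try the cheap pass first; fall back to OCR only for sparse output.

    Returns (text, strategy_used).
    """
    text = extract(pdf, first)
    # Measure the text after trimming whitespace, so a PDF that yields
    # only blank lines still triggers the OCR pass.
    if len(text.strip().encode("utf-8")) >= min_bytes:
        return text, first
    return extract(pdf, fallback), fallback
```

Passing the extractor in as a parameter keeps the threshold logic testable without a running Tika server, and `min_bytes` would be the natural value to expose as a setting.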

@dskaggs dskaggs added type/enhancement New feature or request area/tika labels Feb 3, 2021