This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
Run the prepdocs.sh script on PDF files that contain images.
Text from the embedded images gets indexed.
Any log messages given by the failure
n/a
Expected/desired behavior
I'd like to have a way to disable OCR of the images embedded in PDF files. Our use case is application and training documentation that includes screenshots of application screens showing random/example data, and we don't want that data in the index.
OS and Version?
Linux Ubuntu
Versions
2024-08-23
I asked the Document Intelligence team, and they say the API has no option for disabling OCR. You would need to update the parsing code to ignore the text parsed from images. I haven't written much Document Intelligence parsing code myself, but I'm hoping the results come back in a way that lets you distinguish image text from the other text.
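As a rough illustration of what "ignoring the text parsed from images" could look like, here is a minimal sketch. It assumes the analysis result gives you the page text as one string plus, for each image, an (offset, length) span marking where that image's OCR text sits in the string. Whether your API version actually reports image regions this way is something to verify. `Span` and `strip_spans` are hypothetical names used for illustration; they are not part of the repo or the SDK.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for an (offset, length) range into the page text."""
    offset: int
    length: int


def strip_spans(content: str, spans: list[Span]) -> str:
    """Return content with every character covered by any span removed.

    Spans may arrive unsorted and may overlap; both cases are handled.
    """
    kept = []
    cursor = 0  # index of the first character not yet processed
    for span in sorted(spans, key=lambda s: s.offset):
        start = max(span.offset, cursor)
        end = span.offset + span.length
        if start > cursor:
            # Keep the text between the previous removed range and this one.
            kept.append(content[cursor:start])
        cursor = max(cursor, end)
    kept.append(content[cursor:])  # keep whatever follows the last span
    return "".join(kept)


# Example: the middle line stands in for text OCR'd from a screenshot.
page_text = "Click Save.\nName: John Doe\nThen close the dialog."
image_text = "Name: John Doe\n"
image_spans = [Span(page_text.index(image_text), len(image_text))]
cleaned = strip_spans(page_text, image_spans)
```

In the real parser the spans would come from the analysis result rather than from `str.index`, and the filtering would need to run before the page text is split into chunks for indexing.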