How to disable OCR in prepdocs script? #1958

Open
egor-yudkin opened this issue Sep 6, 2024 · 1 comment
Comments

@egor-yudkin

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

  1. Run the prepdocs.sh script on some PDF files that contain images
  2. Text from the embedded images gets indexed

Any log messages given by the failure

n/a

Expected/desired behavior

I'd like a way to disable OCR of images embedded in PDF files. Our use case is application and training documentation that includes screenshots of application screens showing random or example data, and we don't want that data in the index.

OS and Version?

Linux Ubuntu

Versions

2024-08-23

@pamelafox
Collaborator

I asked the Document Intelligence team, and they say there's no option in the API for disabling OCR. You would need to update the parsing code to ignore the text parsed from images. I haven't written much Document Intelligence parsing code myself, but I'm hoping the results come back in a way that lets you distinguish image text from the other text.
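
One way the filtering described above might look is sketched here. It is only an illustration, not the repo's parsing code. It assumes the analysis result reports each figure's text as `(offset, length)` spans into the extracted content string, and that those offsets count Python string characters. Both assumptions would need checking against real output. The function name `strip_spans` and the sample data are hypothetical.

```python
from typing import Iterable


def strip_spans(content: str, spans: Iterable[tuple[int, int]]) -> str:
    """Return `content` with every (offset, length) span removed.

    Hypothetical helper: `spans` is assumed to mark the text that was
    recognized inside figures/images. Overlapping spans are handled.
    """
    kept = []
    cursor = 0  # index of the first character not yet processed
    for offset, length in sorted(spans):
        if offset > cursor:
            # Keep the text between the previous span and this one.
            kept.append(content[cursor:offset])
        # Skip past this span (never move the cursor backwards).
        cursor = max(cursor, offset + length)
    kept.append(content[cursor:])
    return "".join(kept)


# Made-up example: the middle 16 characters stand in for screenshot text.
page_text = "Intro text. SCREENSHOT DATA Outro text."
figure_spans = [(12, 16)]
cleaned = strip_spans(page_text, figure_spans)
print(cleaned)
```

If the figure spans turn out not to cover the OCR'd image text, a geometry-based check (dropping text whose bounding polygon falls inside a figure's bounding region) would be the alternative to try.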
