Partial OCR using "get_textpage_ocr" ignores image masks while extracting text #3842
Comments
Thanks for the report. It does look like you have a valid point here ... However, there is a major problem: the method only looks at the actual images on the page and knows nothing about PDF xrefs. The extraction method used does not currently return image masks and would have to be changed to also return the image mask binary. Anyway, it will take time.
Yeah, I was trying to modify that function. I didn't fully understand your alternate proposal. Currently, the pixmap is indeed prepared from the binary image data coming from the block dictionary. Do you mean that, rather than taking the binary data from the block, we should take the image data from someplace else (which in some way resolves our problem of unmasking the image)?

I have another rudimentary idea: if the number and sequence of images returned by the two functions (`page.get_text` and `page.get_images`) are the same, the xref/smask information could be matched to the block bounding boxes, as in the sketch below.
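Something like this, purely as a sketch (it stands or falls with the ordering assumption, which I haven't verified):

```python
import pymupdf  # PyMuPDF; "import fitz" also works

doc = pymupdf.open("NIST.SP.800-223.pdf")
page = doc[6]  # page 7

# Image blocks as seen by get_text("dict"), and the xref/smask info from
# get_images(); the zip below assumes both lists come in the same order.
blocks = [b for b in page.get_text("dict")["blocks"] if b["type"] == 1]
infos = page.get_images(full=True)

if len(blocks) == len(infos):
    for block, info in zip(blocks, infos):
        xref, smask = info[0], info[1]
        print(block["bbox"], "-> xref", xref, "smask", smask)
```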
No, no, I explained my point poorly. If an image is detected on the page (and is eligible with respect to its bbox size, say at least 20 x 20 or so), then do `pix = page.get_pixmap(dpi=large, clip=bbox)`.
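If I read that proposal right, something along these lines would already work as a user-side workaround (a sketch only: it assumes Tesseract is installed, and the 20 x 20 threshold and `dpi=300` are placeholder values):

```python
import pymupdf  # PyMuPDF; "import fitz" also works

doc = pymupdf.open("NIST.SP.800-223.pdf")
page = doc[6]  # page 7 (TOC)

for block in page.get_text("dict")["blocks"]:
    if block["type"] != 1:                   # image blocks only
        continue
    bbox = pymupdf.Rect(block["bbox"])
    if bbox.width < 20 or bbox.height < 20:  # skip tiny images
        continue
    # Render the page clip at high resolution: the renderer applies
    # masks/smasks itself, so the result is not black.
    pix = page.get_pixmap(dpi=300, clip=bbox)
    # OCR the rendered clip with Tesseract and read the recognized text back.
    ocr_doc = pymupdf.open(stream=pix.pdfocr_tobytes(language="eng"), filetype="pdf")
    print(ocr_doc[0].get_text())
```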
Description of the bug
I have a PDF document from which I want to extract text.
PDF - https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.pdf
For extracting the text on page 7 (the TOC), I used `get_textpage_ocr` with the `full` argument set to `False`, since the page contains both digital text and text represented by images. However, the output of this function contains only the digital text and not the OCRed text from the image parts.

While looking into the code of the `get_textpage_ocr` function (in utils.py), I see that it iterates over each `block`, identifies the image blocks using the `type=1` filter, extracts the image from the block, builds a Pixmap, and passes it to the OCR component. However, this code does not consider the image mask. I think that is why the extracted image is a masked image (which visually looks completely black), and why Tesseract is not able to extract any text from those image parts.
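To illustrate, here is a minimal sketch that mimics what `get_textpage_ocr` does internally with the block's raw image bytes (file name and page index are just my test setup); the saved images come out black:

```python
import pymupdf  # PyMuPDF; "import fitz" also works

doc = pymupdf.open("NIST.SP.800-223.pdf")
page = doc[6]  # page 7 (TOC)

for i, block in enumerate(page.get_text("dict")["blocks"]):
    if block["type"] == 1:                    # image block
        pix = pymupdf.Pixmap(block["image"])  # raw image bytes, mask not applied
        pix.save(f"block-{i}.png")            # renders as a black rectangle
```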
Further, I'm aware that there is a `Page.get_images()` function, which returns the `xref` and `smask` of each image; these can later be used to unmask the images with code along the lines of the snippet below. Using this method, I am able to get an image with readable text, unlike the black image that is extracted internally within the `get_textpage_ocr` function.
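Roughly like this (a sketch of the usual xref/smask recombination; output file names are arbitrary):

```python
import pymupdf  # PyMuPDF; "import fitz" also works

doc = pymupdf.open("NIST.SP.800-223.pdf")
page = doc[6]  # page 7 (TOC)

for img in page.get_images(full=True):
    xref, smask = img[0], img[1]
    base = pymupdf.Pixmap(doc.extract_image(xref)["image"])  # black on its own
    if smask > 0:                                            # a soft mask exists
        mask = pymupdf.Pixmap(doc.extract_image(smask)["image"])
        base = pymupdf.Pixmap(base, mask)                    # re-combine image and mask
    base.save(f"unmasked-{xref}.png")                        # text is readable now
```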
Can we update the `page.get_text` function (which is called inside `get_textpage_ocr`) to keep both the `image` and `smask` values in the block dictionary, or at least the `xref` of the image, so that one can look up the `smask` via the `xref`?

I can't use `page.get_images` in my application, since I also need the bounding box coordinates, which are only provided in the block dictionary returned by `page.get_text`.

Any ideas on how to resolve this issue?
Let me know if you need any more information to replicate this issue.
How to reproduce the bug
PyMuPDF version
1.24.10
Operating system
Windows
Python version
3.10