Description of the bug
I have a PDF document from which I want to extract text.
PDF: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.pdf
To extract the text on page 7 (the TOC), I used `get_textpage_ocr` with the `full` argument set to `False`, since the page contains both digital text and text rendered as an image. However, the output of this function contains only the digital text component and none of the OCRed text from the image parts.
Looking into the code of the `get_textpage_ocr` function (in `utils.py`), I see that it iterates over the blocks, identifies the image blocks via the `type == 1` filter, extracts the image from each block, builds a `Pixmap`, and passes it to the OCR component:
```python
tpage = page.get_textpage(flags=flags)
for block in page.get_text("dict", flags=pymupdf.TEXT_PRESERVE_IMAGES)["blocks"]:
    if block["type"] != 1:  # only look at images
        continue
    bbox = pymupdf.Rect(block["bbox"])
    if bbox.width <= 3 or bbox.height <= 3:  # ignore tiny stuff
        continue
    exception_types = (RuntimeError, mupdf.FzErrorBase)
    if pymupdf.mupdf_version_tuple < (1, 24):
        exception_types = RuntimeError
    try:
        pix = pymupdf.Pixmap(block["image"])  # get image pixmap
        if pix.n - pix.alpha != 3:  # we need to convert this to RGB!
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
        if pix.alpha:  # must remove alpha channel
            pix = pymupdf.Pixmap(pix, 0)
        imgdoc = pymupdf.Document(
            "pdf",
            pix.pdfocr_tobytes(language=language, tessdata=tessdata),
        )  # pdf with OCRed page
        imgpage = imgdoc.load_page(0)  # read image as a page
        pix = None
        # compute matrix to transform coordinates back to that of 'page'
        imgrect = imgpage.rect  # page size of image PDF
        shrink = pymupdf.Matrix(1 / imgrect.width, 1 / imgrect.height)
        mat = shrink * block["transform"]
        imgpage.extend_textpage(tpage, flags=0, matrix=mat)
        imgdoc.close()
```
However, this code does not take the image mask into account. I believe this is why the extracted image is a masked image (which visually looks completely black), and why Tesseract is unable to extract any text from those image parts.
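For illustration, here is a minimal snippet (using the same page of the linked PDF) that rebuilds the pixmap exactly the way the library does and saves it to disk; the resulting PNGs come out completely black:

```python
import pymupdf

doc = pymupdf.open("NIST.SP.800-223.pdf")
page = doc[6]  # page 7 (TOC)

blocks = page.get_text("dict", flags=pymupdf.TEXT_PRESERVE_IMAGES)["blocks"]
for i, block in enumerate(blocks):
    if block["type"] != 1:  # image blocks only, same filter as in utils.py
        continue
    pix = pymupdf.Pixmap(block["image"])  # same construction as get_textpage_ocr uses
    pix.save(f"block_{i}.png")  # all black for this page
```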
Further, I'm aware that there is a `Page.get_images()` function which returns the `xref` and `smask`, which can then be used to unmask the images with the code below:
```python
pix1 = pymupdf.Pixmap(doc.extract_image(xref)["image"])   # (1) pixmap of image w/o alpha
mask = pymupdf.Pixmap(doc.extract_image(smask)["image"])  # (2) mask pixmap
pix = pymupdf.Pixmap(pix1, mask)                          # (3) copy of pix1, image mask added
```
Using this method, I am able to get an image with readable text (unlike the black image that is extracted internally by the `get_textpage_ocr` function).
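Concretely, this is a minimal sketch of how I obtained the `xref` and `smask` values for this page; it assumes (as documented) that they are the first two entries of each item returned by `page.get_images(full=True)`:

```python
import pymupdf

doc = pymupdf.open("NIST.SP.800-223.pdf")
page = doc[6]

for img in page.get_images(full=True):
    xref, smask = img[0], img[1]
    if smask == 0:  # no soft mask attached to this image
        continue
    pix1 = pymupdf.Pixmap(doc.extract_image(xref)["image"])   # base image, looks black
    mask = pymupdf.Pixmap(doc.extract_image(smask)["image"])  # the soft mask
    pix = pymupdf.Pixmap(pix1, mask)                          # unmasked image, text readable
    pix.save(f"unmasked_{xref}.png")
```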
Could the `page.get_text` function (which is called inside `get_textpage_ocr`) be updated to keep both the `image` and the `smask` values in the block dictionary, or at least the `xref` of the image, so that the `smask` can be looked up via the `xref`?
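To make the request concrete, an image block could for instance look like the dictionary below; the `xref`/`smask` keys are purely hypothetical (they do not exist in PyMuPDF today) and just illustrate what would be enough for my use case:

```python
# hypothetical block layout -- 'xref' and 'smask' are only a suggestion
block = {
    "type": 1,
    "bbox": (70.9, 150.2, 540.0, 320.5),                  # made-up coordinates
    "transform": (469.1, 0.0, 0.0, 170.3, 70.9, 150.2),   # made-up matrix
    "image": b"<image bytes, as today>",
    "xref": 123,   # suggested: xref of the image object (0 for inline images)
    "smask": 124,  # suggested: xref of its soft mask (0 if there is none)
}
```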
I can't use `page.get_images` in my application, since I also need the bounding box coordinates, which are only provided in the block dictionary returned by `page.get_text`.
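For completeness, the closest thing to a workaround I can think of is to recover the display rectangles per `xref` with `Page.get_image_rects()` (assuming I understand that method correctly) and then redo the unmasking/OCR outside the library, but that effectively re-implements `get_textpage_ocr`. A rough sketch:

```python
import pymupdf

doc = pymupdf.open("NIST.SP.800-223.pdf")
page = doc[6]

image_bboxes = {}  # xref -> rectangles where that image is shown on the page
for img in page.get_images(full=True):
    xref = img[0]
    image_bboxes[xref] = page.get_image_rects(xref)

# each xref could then be unmasked as above and OCRed rectangle by rectangle
print(image_bboxes)
```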
Any ideas on how to resolve this issue?
Let me know if you need any more information to replicate this issue.
How to reproduce the bug
- Download the PDF file: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-223.pdf
- Extract the text of page 7 of this PDF using partial OCR with the code below:
```python
import pymupdf

doc = pymupdf.open("resources/NIST.SP.800-223.pdf")
page = doc[6]  # page 7, since the index starts from zero
partial_tp = page.get_textpage_ocr(flags=0, full=False)
text_p_ocr = page.get_text(textpage=partial_tp)
print(text_p_ocr)
```
PyMuPDF version
1.24.10
Operating system
Windows
Python version
3.10