Blank image overlay #130

de-code · 2021-09-18T14:11:32Z

For the example 471433v1 (from the bioRxiv 10k training dataset), one figure is extracted as two images: the true figure image and a blank overlay image with exactly the same coordinates.

In a PDF viewer the document is rendered fine.

PDF

PDFAlto XML (extract)

<Illustration ID="p13_i1" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-4.png" TYPE="image"/>
<Illustration ID="p13_s5902" HPOS="0.2800" VPOS="72.0000" WIDTH="595.200" HEIGHT="554.440" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-13.svg" TYPE="svg"/>

In this case image-3.png is the true figure image, whereas image-4.png appears to be blank / white (not transparent).

I am not sure how to interpret the order.
Due to the order I would think image-4.png is on top of image-3.png. And maybe it is missing transparency?
Alternatively, is the order meant to be the opposite?
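
To spot such cases in the XML extract above, a minimal sketch (not part of pdfalto) that groups `Illustration` elements sharing an identical bounding box could look like the following. The `<Page>` wrapper and the function name `find_stacked_illustrations` are assumptions for illustration only; real pdfalto output is namespaced, which the local-name comparison is meant to tolerate.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The three Illustration elements from the extract, wrapped in a
# <Page> element (an assumption, only there to make the snippet parseable).
ALTO_EXTRACT = """<Page>
<Illustration ID="p13_i1" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-4.png" TYPE="image"/>
<Illustration ID="p13_s5902" HPOS="0.2800" VPOS="72.0000" WIDTH="595.200" HEIGHT="554.440" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-13.svg" TYPE="svg"/>
</Page>"""


def find_stacked_illustrations(xml_text):
    """Return lists of FILEIDs (in document order) that share one bounding box."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for elem in root.iter():
        # Compare the local name so namespaced and plain tags both match.
        if elem.tag.rsplit("}", 1)[-1] != "Illustration":
            continue
        # Exact float equality is enough here because the coordinates in the
        # extract are identical strings; a tolerance may be needed elsewhere.
        box = tuple(float(elem.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        groups[box].append(elem.get("FILEID"))
    return [files for files in groups.values() if len(files) > 1]


for files in find_stacked_illustrations(ALTO_EXTRACT):
    print("same coordinates:", files)
```

This only reports which files overlap and in which order they appear in the XML; it does not say which one is drawn on top.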


kermitt2 · 2021-12-03T11:36:30Z

Hi Daniel,

For the white image, it's probably the same as what I raised here: kermitt2/grobid#826

I think these are the "Soft-Mask" images of the PDF specifications (11.6.5.3 Soft-Mask Images, page 347).
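
As a rough illustration of why such a mask can look blank when written out as a regular image: a soft mask carries per-pixel alpha values rather than colour, so a fully opaque mask dumped as grayscale would be all white. The sketch below uses plain Python lists and assumes 8-bit grayscale samples where 255 means opaque; `composite` is a hypothetical helper, not xpdf or pdfalto code, and whether image-4.png really is such a mask is still a guess.

```python
def composite(image, soft_mask, backdrop):
    """Blend per pixel: result = alpha * image + (1 - alpha) * backdrop,
    with alpha = mask_sample / 255 (8-bit samples assumed)."""
    out = []
    for img_row, mask_row, back_row in zip(image, soft_mask, backdrop):
        out.append([
            round((m * i + (255 - m) * b) / 255)
            for i, m, b in zip(img_row, mask_row, back_row)
        ])
    return out


figure = [[10, 200], [90, 30]]          # toy 2x2 grayscale "figure"
page = [[255, 255], [255, 255]]         # white page behind it
opaque_mask = [[255, 255], [255, 255]]  # looks plain white if saved as an image

# With a fully opaque mask the figure is drawn unchanged.
print(composite(figure, opaque_mask, page) == figure)
```

Under that reading, the mask would not be a layer drawn on top of the figure but a companion image that controls how the figure is blended into the page.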

Currently they are treated as usual images, but they are typed/distinguished in xpdf via the dictionary, and there are distinct ImageOutputDev methods for them. So we could probably mark these images with an attribute in the ALTO file, or in the file name by defining some pattern, and/or add a command line parameter to output them or not. Do you have a preference?

de-code · 2021-12-03T12:22:53Z

Hi Patrice,

Thank you for getting back on it and explaining the issue.

Personally I would prefer the first option you mentioned, to add markup / attribute to the ALTO XML output. (not sure if there is already something in the schema that seems appropriate)

Reflecting it in the filename or adding a command line argument could then be an optional extra, but it should be easy to post-process based on the XML, depending on the use case. (Who knows, the masking image could be useful.)
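
For instance, if the mask images were marked with an attribute, post-processing could be a few lines. This is only a sketch: the attribute name `MASK` and its value `"true"` are hypothetical placeholders (nothing like it exists in pdfalto's output or has been agreed on here), and the sample XML is made up to match the extract from the first comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute name and value; not part of the ALTO schema
# or of current pdfalto output.
MASK_ATTR = "MASK"

# Made-up sample: the two overlapping illustrations, the second one marked.
SAMPLE = """<Page>
<Illustration ID="p13_i1" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" FILEID="471433v1.lxml_data/image-4.png" TYPE="image" MASK="true"/>
</Page>"""


def drop_mask_illustrations(xml_text):
    """Return the FILEIDs of Illustration elements not marked as masks."""
    root = ET.fromstring(xml_text)
    kept = []
    for elem in root.iter():
        # Compare the local name so namespaced and plain tags both match.
        if elem.tag.rsplit("}", 1)[-1] != "Illustration":
            continue
        if elem.get(MASK_ATTR) == "true":
            continue  # skip the mask, keep everything else
        kept.append(elem.get("FILEID"))
    return kept


print(drop_mask_illustrations(SAMPLE))
```

Inverting the condition would collect the mask images instead, in case they turn out to be useful.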
