Blank image overlay #130

Open
de-code opened this issue Sep 18, 2021 · 2 comments
Comments

@de-code
Contributor

de-code commented Sep 18, 2021

For the example 471433v1 (from bioRxiv 10k training dataset), there is an image that is extracted with a blank overlay, with the same coordinates as the true image figure.

In a PDF viewer the document is rendered fine.

PDF: [image]

PDFAlto XML (extract)
<Illustration ID="p13_i1" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-4.png" TYPE="image"/>
<Illustration ID="p13_s5902" HPOS="0.2800" VPOS="72.0000" WIDTH="595.200" HEIGHT="554.440" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-13.svg" TYPE="svg"/>

In this case image-3.png is the true figure image, whereas image-4.png appears to be blank / white (not transparent).

I am not sure how to interpret the order.
Given the order, I would think image-4.png is drawn on top of image-3.png, and maybe it is missing transparency?
Alternatively, is the order meant to be the opposite?
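
To spot other cases like this one, a small script can list `Illustration` elements that share exactly the same bounding box. This is only a sketch: the function name `find_stacked_illustrations` and the `<Page>` wrapper in the sample are assumptions for illustration, and it only reports coincident boxes in document order. It does not say which image is drawn on top.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample built from the extract above; the <Page> wrapper is an assumption
# added only so that the fragment parses as a single XML document.
SAMPLE = """<Page>
<Illustration ID="p13_i1" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-4.png" TYPE="image"/>
<Illustration ID="p13_s5902" HPOS="0.2800" VPOS="72.0000" WIDTH="595.200" HEIGHT="554.440" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-13.svg" TYPE="svg"/>
</Page>"""

BOX_KEYS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def find_stacked_illustrations(alto_xml: str) -> list[list[str]]:
    """Return groups of FILEID values whose Illustration boxes coincide exactly."""
    root = ET.fromstring(alto_xml)
    groups = defaultdict(list)
    for elem in root.iter():
        # Compare the local tag name so a namespaced ALTO file also matches.
        if elem.tag.rsplit("}", 1)[-1] != "Illustration":
            continue
        values = [elem.get(key) for key in BOX_KEYS]
        if any(value is None for value in values):
            continue  # skip elements without a complete bounding box
        box = tuple(float(value) for value in values)
        groups[box].append(elem.get("FILEID", ""))
    # Keep only boxes used by more than one illustration, in document order.
    return [file_ids for file_ids in groups.values() if len(file_ids) > 1]


stacked = find_stacked_illustrations(SAMPLE)
print(stacked)
```

For the extract above this groups image-3.png with image-4.png, and leaves the SVG alone because its box differs.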

@kermitt2
Owner

kermitt2 commented Dec 3, 2021

Hi Daniel,

For the white image, it's probably the same as what I raised here: kermitt2/grobid#826

I think these are the "Soft-Mask" images of the PDF specification (11.6.5.3 Soft-Mask Images, page 347).

Currently they are treated as usual images, but they are typed/distinguished in xpdf via the dictionary, and there are distinct ImageOutputDev methods for them. So we could probably mark these images with an attribute in the ALTO file or in the file name by defining some pattern, and/or add a command-line parameter to output them or not. Do you have a preference?

@de-code
Contributor Author

de-code commented Dec 3, 2021

Hi Patrice,

Thank you for getting back on it and explaining the issue.

Personally, I would prefer the first option you mentioned: adding markup / an attribute to the ALTO XML output. (I am not sure whether there is already something appropriate in the schema.)

Reflecting it in the filename or adding a command-line argument could then be an optional extra. But it should be easy to post-process based on the XML, depending on the use case. (Who knows, the masking image could be useful.)
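
As a rough sketch of what such post-processing could look like, assuming the mask images were flagged with an attribute: the attribute name `SOFTMASK`, the function `select_illustrations`, and the sample XML are all hypothetical. Nothing here reflects an existing pdfalto option or ALTO schema element.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute name, used only for illustration; it is not part of
# pdfalto's current output or of the ALTO schema.
MASK_ATTR = "SOFTMASK"

# Hypothetical output in which the blank overlay is flagged as a soft mask.
FLAGGED_SAMPLE = """<Page>
<Illustration ID="p13_i1" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" FILEID="471433v1.lxml_data/image-4.png" TYPE="image" SOFTMASK="true"/>
</Page>"""


def select_illustrations(alto_xml: str, keep_masks: bool = False) -> list[str]:
    """Return FILEID values, skipping flagged mask images unless keep_masks is set."""
    root = ET.fromstring(alto_xml)
    selected = []
    for elem in root.iter():
        # Compare the local tag name so a namespaced ALTO file also matches.
        if elem.tag.rsplit("}", 1)[-1] != "Illustration":
            continue
        is_mask = elem.get(MASK_ATTR, "false").lower() == "true"
        if is_mask and not keep_masks:
            continue
        selected.append(elem.get("FILEID", ""))
    return selected


figures_only = select_illustrations(FLAGGED_SAMPLE)
with_masks = select_illustrations(FLAGGED_SAMPLE, keep_masks=True)
print(figures_only)
print(with_masks)
```

Keeping the decision in the consumer (via `keep_masks`) leaves the mask images available for use cases that need them.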
