Blank image overlay #130
Comments
Hi Daniel, For the white image, it's probably the same as what I raised here: kermitt2/grobid#826. I think these are the "Soft-Mask" images of the PDF specification (11.6.5.3 Soft-Mask Images, page 347). Currently they are treated as usual images, but they are typed/distinguished in xpdf via the dictionary and there are distinct ImageOutputDev methods for them. So we could probably mark these images with an attribute in the ALTO file or with the file name by defining some pattern, and/or add a parameter in the command line to output them or not, if you have a preference?
Hi Patrice, Thank you for getting back on it and explaining the issue. Personally I would prefer the first option you mentioned, to add markup / an attribute to the ALTO XML output (not sure if there is already something in the schema that seems appropriate). Then reflecting it in the filename or adding a command line argument could be an optional extra. But it should be easy to post-process based on the XML, depending on the use case. (Who knows, the masking image could be useful.)
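To illustrate the kind of post-processing meant here, a minimal Python sketch follows. It assumes the soft-mask images end up marked with an attribute on the ALTO `Illustration` element. The attribute name `TYPE`, the value `softmask`, the `FILEID` attribute and the sample XML are all hypothetical placeholders for illustration; nothing in this thread fixes what the actual markup would be.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker: the real attribute name/value has not been decided.
MASK_ATTR = "TYPE"
MASK_VALUE = "softmask"

# Made-up sample, only to show the shape of the input.
SAMPLE_MARKED_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page ID="Page1"><PrintSpace>
    <Illustration ID="p1_x1" FILEID="image-3.png"/>
    <Illustration ID="p1_x2" FILEID="image-4.png" TYPE="softmask"/>
  </PrintSpace></Page></Layout>
</alto>"""


def split_illustrations(alto_xml):
    """Return (regular, masks): file ids of unmarked and marked illustrations."""
    root = ET.fromstring(alto_xml)
    regular, masks = [], []
    for el in root.iter():
        # Compare on the local tag name so any ALTO namespace version works.
        if el.tag.rsplit("}", 1)[-1] != "Illustration":
            continue
        target = masks if el.get(MASK_ATTR) == MASK_VALUE else regular
        target.append(el.get("FILEID"))
    return regular, masks


regular, masks = split_illustrations(SAMPLE_MARKED_ALTO)
print("figures:", regular)
print("soft masks:", masks)
```

With a marker like this, a consumer can drop the masks or keep them, without needing a command line switch.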
For the example 471433v1 (from bioRxiv 10k training dataset), there is an image that is extracted with a blank overlay, with the same coordinates as the true image figure. In a PDF viewer the document is rendered fine.
PDF
PDFAlto XML (extract)
In this case image-3.png is the true figure image, whereas image-4.png appears to be blank / white (not transparent). I am not sure how to interpret the order.
Due to the order I would think image-4.png is on top of image-3.png. And maybe it is missing transparency? Alternatively, is the order meant to be the opposite?
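Until the output distinguishes such images, one possible workaround is to look for illustrations that share the same bounding box on a page. The Python sketch below does that. It assumes `Illustration` elements carry `HPOS`, `VPOS`, `WIDTH`, `HEIGHT` and a `FILEID` attribute; the sample XML and its coordinate values are invented for illustration and are not taken from 471433v1.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample: two illustrations share a box, a third one does not.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page ID="Page1"><PrintSpace>
    <Illustration ID="p1_x1" HPOS="72.0" VPOS="100.0" WIDTH="400.0" HEIGHT="300.0" FILEID="image-3.png"/>
    <Illustration ID="p1_x2" HPOS="72.0" VPOS="100.0" WIDTH="400.0" HEIGHT="300.0" FILEID="image-4.png"/>
    <Illustration ID="p1_x3" HPOS="72.0" VPOS="450.0" WIDTH="200.0" HEIGHT="150.0" FILEID="image-5.png"/>
  </PrintSpace></Page></Layout>
</alto>"""

BOX_KEYS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def local_name(tag):
    """Strip the XML namespace so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def find_stacked_illustrations(alto_xml):
    """Return lists of file ids that share one bounding box on the same page."""
    root = ET.fromstring(alto_xml)
    groups = defaultdict(list)
    pages = [el for el in root.iter() if local_name(el.tag) == "Page"]
    for page_no, page in enumerate(pages):
        for el in page.iter():
            if local_name(el.tag) != "Illustration":
                continue
            # Round to one decimal to tolerate small float differences.
            box = tuple(round(float(el.get(k, "0")), 1) for k in BOX_KEYS)
            groups[(page_no, box)].append(el.get("FILEID"))
    # Keep only boxes occupied by more than one image, in document order.
    return [files for files in groups.values() if len(files) > 1]


for files in find_stacked_illustrations(SAMPLE_ALTO):
    print("same coordinates:", files)
```

This only flags candidates. It cannot tell which of the stacked images is the mask, so checking the pixel content would still be a separate step.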