For some PDFs fulltext document does not contain header document #1236
Comments
Hi @loopdeloop76, indeed the first page is missing from the
This happens because Grobid is not really trained on this kind of document, so improving the result would require a few examples as training data.
@lfoppiano Thank you very much for looking into this!
It depends on the share of non-standard scientific articles. Assuming they are around 10%, I think it would be simpler for you to make an additional request to Grobid for the bibliographic metadata when the main fields (let's say title and authors) are missing.
Thanks @lfoppiano!
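The fallback suggested above could be sketched roughly as follows. This is a minimal illustration, not part of Grobid itself: the helper names `header_is_missing`, `parse_with_fallback`, and `make_grobid_fetcher` are hypothetical, the rule "title or author names absent from the TEI header" is an assumed reading of "missing", and the HTTP part assumes the third-party `requests` package plus the `processFulltextDocument` and `processHeaderDocument` endpoints of the Grobid REST API.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def header_is_missing(tei_xml):
    """Return True when the TEI header lacks a non-empty title or any author name.

    Assumption: "missing" means either field is absent or blank.
    """
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    has_title = title is not None and bool("".join(title.itertext()).strip())
    # Restrict the search to sourceDesc so authors of cited references are ignored.
    names = root.findall(".//tei:sourceDesc//tei:persName", TEI_NS)
    has_author = any("".join(n.itertext()).strip() for n in names)
    return not (has_title and has_author)


def parse_with_fallback(pdf_bytes, fetch_fulltext, fetch_header):
    """Request the full text first; request the header only when it is needed."""
    fulltext = fetch_fulltext(pdf_bytes)
    header = fetch_header(pdf_bytes) if header_is_missing(fulltext) else None
    return fulltext, header


def make_grobid_fetcher(base_url, endpoint):
    """Hypothetical helper: build a callable that posts a PDF to one Grobid endpoint."""
    import requests  # third-party dependency, assumed to be installed

    def fetch(pdf_bytes):
        response = requests.post(
            f"{base_url.rstrip('/')}/api/{endpoint}",
            files={"input": ("paper.pdf", pdf_bytes, "application/pdf")},
            headers={"Accept": "application/xml"},  # ask for TEI XML
            timeout=120,
        )
        response.raise_for_status()
        return response.content

    return fetch
```

With a Grobid server reachable at, say, `http://localhost:8070` (an assumed address), the two fetchers would be `make_grobid_fetcher(url, "processFulltextDocument")` and `make_grobid_fetcher(url, "processHeaderDocument")`. The second request is only sent for documents whose full-text result has no usable header, so the extra cost stays proportional to the share of non-standard articles.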
Operating System and architecture (arm64, amd64, x86, etc.)
No response
What is your Java version
No response
Log and information
No response
Further information
When using the web service (e.g. https://kermitt2-grobid.hf.space/) with `TEI` -> `Process Fulltext Document`, the first page is missing from the parsed result for some PDFs. When I parse the same PDF using `Process Header Document`, I do get a response containing the content of the first page. As far as I understand, `Process Fulltext Document` should contain the header document.

Here is an example PDF:
paper1.pdf
Here is the result of `Process Header Document`:

And here is the beginning of the result of `Process Fulltext Document`:
The second result does not include the abstract, while the first one does. It makes no difference whether `Consolidate header` is enabled or not.