For some PDFs fulltext document does not contain header document #1236

loopdeloop76 · 2025-01-19T13:54:19Z

Operating System and architecture (arm64, amd64, x86, etc.)

No response

What is your Java version

No response

Log and information

No response

Further information

When using the web service (e.g. https://kermitt2-grobid.hf.space/) with TEI -> Process Fulltext Document, the first page is missing from the parsed result for some PDFs. When I parse the same PDF with Process Header Document, the response does contain the content of the first page. As far as I understand, the Process Fulltext Document result should include everything the Process Header Document result contains.

Here is an example PDF:

paper1.pdf

Here is the result of Process Header Document:

<TEI
    xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
    <teiHeader xml:lang="en">
        <fileDesc>
            <titleStmt>
                <title level="a" type="main"/>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic></analytic>
                    <monogr>
                        <imprint>
                            <date/>
                        </imprint>
                    </monogr>
                    <idno type="MD5">BBE748E17EC45A5BEA5F9352C47C3F83</idno>
                </biblStruct>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <appInfo>
                <application version="0.8.1" ident="GROBID" when="2025-01-19T13:39+0000">
                    <desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
                    <ref target="https://github.com/kermitt2/grobid"/>
                </application>
            </appInfo>
        </encodingDesc>
        <profileDesc>
            <abstract>
                <p>1. Accelerating innovation The starting point Research and innovation (R&amp;I) are the main drivers of productivity and people's well-being [see Figure 1]. Innovation generates positive externalities, with new technologies serving as stepping stones for further innovation. This creates cumulative positive spillovers that justify a role for government intervention to promote research and innovation. R&amp;I will be critical for financing Europe's welfare system as the EU population ages and its labour force shrinks. The importance of R&amp;I for productivity growth will increase in the future as a result of the accelerating pace of global innovation during the past decades.</p>
            </abstract>
        </profileDesc>
    </teiHeader>
    <text xml:lang="en"></text>
</TEI>

And here is the beginning of the result of `Process Fulltext Document`:

<?xml version="1.0" encoding="UTF-8"?>
<TEI
    xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
    <teiHeader xml:lang="en">
        <fileDesc>
            <titleStmt>
                <title level="a" type="main"/>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic></analytic>
                    <monogr>
                        <imprint>
                            <date/>
                        </imprint>
                    </monogr>
                    <idno type="MD5">BBE748E17EC45A5BEA5F9352C47C3F83</idno>
                </biblStruct>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <appInfo>
                <application version="0.8.1" ident="GROBID" when="2025-01-19T13:51+0000">
                    <desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
                    <ref target="https://github.com/kermitt2/grobid"/>
                </application>
            </appInfo>
        </encodingDesc>
        <profileDesc>
            <abstract/>
        </profileDesc>
    </teiHeader>
    <text xml:lang="en">
        <body>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head>F I G U R E 1</head>
                <p>The impact of research and innovation Note: Left: business expenditure in R&amp;D (BERD) measured in percentage of gross domestic product (GDP) 2020 and labour productivity 2021 based on Eurostat. Right: Where-to-Be-Born Index by ...

The second result does not include the abstract, while the first one does. It makes no difference whether Consolidate header is enabled or not.


lfoppiano · 2025-01-28T10:53:33Z

Hi @loopdeloop76, indeed the first page is missing from the processFulltextDocument output. I believe this is because the first page is classified as a cover page, and cover pages are usually ignored. Cover pages are pages added by the publisher or the repository (e.g. ResearchGate, HAL) that duplicate the main bibliographical information.

processHeaderDocument uses more rules and, in this case, gives a better result.

This happens because Grobid is not really trained on this kind of document, so improving the result would require a few examples as training data.

loopdeloop76 · 2025-01-31T13:01:56Z

@lfoppiano Thank you very much for looking into this!
We use Grobid in an app for scientists, so most of the uploaded PDFs are papers. However, some are not, and it would be great if we could still extract as much text and metadata as possible.
Would you suggest always making an extra request to processHeaderDocument if processFulltextDocument does not return any metadata?
Or would it be better to fall back to another PDF parser entirely if the PDF is not a paper?

lfoppiano · 2025-02-02T21:28:26Z

It depends on the share of non-standard scientific articles. Assuming they are around 10%, I think it would be simpler for you to make an additional request to Grobid for the bibliographic metadata when the main fields (let's say title and authors) are missing.
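A minimal sketch of that fallback, in Python, might look like the following. It is only an illustration, not part of Grobid: `header_is_empty`, `extract_with_fallback`, and the two `fetch_*` callables are hypothetical names, and the callables are assumed to wrap your own HTTP calls to processFulltextDocument and processHeaderDocument and to return the TEI response as a string. The check also treats an abstract as usable metadata, which is an assumption based on the example above, where the header result has an abstract but no title or authors.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# TEI namespace used in the Grobid responses shown in this issue.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _has_text(element: Optional[ET.Element]) -> bool:
    """True if the element exists and contains any non-whitespace text."""
    return element is not None and "".join(element.itertext()).strip() != ""


def header_is_empty(tei_xml: str) -> bool:
    """Return True when the teiHeader has no title, no authors and no abstract.

    Assumption: "missing metadata" means all three are empty; adjust the
    condition if you only care about title and authors.
    """
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)
    if header is None:
        return True
    title = header.find(".//tei:titleStmt/tei:title", TEI_NS)
    authors = header.findall(".//tei:analytic/tei:author", TEI_NS)
    abstract = header.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    found = (
        _has_text(title)
        or any(_has_text(author) for author in authors)
        or _has_text(abstract)
    )
    return not found


def extract_with_fallback(
    pdf_path: str,
    fetch_fulltext: Callable[[str], str],  # hypothetical: calls processFulltextDocument
    fetch_header: Callable[[str], str],    # hypothetical: calls processHeaderDocument
) -> dict:
    """Request the full text first; request the header only if metadata is missing."""
    fulltext = fetch_fulltext(pdf_path)
    header = fetch_header(pdf_path) if header_is_empty(fulltext) else None
    return {"fulltext": fulltext, "header": header}
```

With this shape, the second request is only made for the documents whose full-text result comes back without metadata, so the extra cost stays proportional to the share of non-standard PDFs.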

loopdeloop76 · 2025-02-02T21:43:46Z

Thanks @lfoppiano !

lfoppiano added the "error cases" label (Some error/test case for future improvements) on Jan 28, 2025

lfoppiano added the "question" label (There's no such thing as a stupid question) and removed the "error cases" label (Some error/test case for future improvements) on Feb 2, 2025
