For some PDFs fulltext document does not contain header document #1236

Open
loopdeloop76 opened this issue Jan 19, 2025 · 4 comments
Labels
question There's no such thing as a stupid question

Comments

@loopdeloop76

loopdeloop76 commented Jan 19, 2025

Operating System and architecture (arm64, amd64, x86, etc.)

No response

What is your Java version

No response

Log and information

No response

Further information

When using the web service (e.g. https://kermitt2-grobid.hf.space/) with TEI -> Process Fulltext Document, the first page is missing from the parsed result for some PDFs. When I parse the same PDF using Process Header Document, I do get a response containing the content of the first page. As far as I understand, the Process Fulltext Document result should include the header content.

Here is an example PDF:

paper1.pdf

Here is the result of Process Header Document:

<TEI
    xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
    <teiHeader xml:lang="en">
        <fileDesc>
            <titleStmt>
                <title level="a" type="main"/>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic></analytic>
                    <monogr>
                        <imprint>
                            <date/>
                        </imprint>
                    </monogr>
                    <idno type="MD5">BBE748E17EC45A5BEA5F9352C47C3F83</idno>
                </biblStruct>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <appInfo>
                <application version="0.8.1" ident="GROBID" when="2025-01-19T13:39+0000">
                    <desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
                    <ref target="https://github.com/kermitt2/grobid"/>
                </application>
            </appInfo>
        </encodingDesc>
        <profileDesc>
            <abstract>
                <p>1. Accelerating innovation The starting point Research and innovation (R&amp;I) are the main drivers of productivity and people's well-being [see Figure 1]. Innovation generates positive externalities, with new technologies serving as stepping stones for further innovation. This creates cumulative positive spillovers that justify a role for government intervention to promote research and innovation. R&amp;I will be critical for financing Europe's welfare system as the EU population ages and its labour force shrinks. The importance of R&amp;I for productivity growth will increase in the future as a result of the accelerating pace of global innovation during the past decades.</p>
            </abstract>
        </profileDesc>
    </teiHeader>
    <text xml:lang="en"></text>
</TEI>

And here is the beginning of the result of `Process Fulltext Document`:

<?xml version="1.0" encoding="UTF-8"?>
<TEI
    xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve" xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd">
    <teiHeader xml:lang="en">
        <fileDesc>
            <titleStmt>
                <title level="a" type="main"/>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic></analytic>
                    <monogr>
                        <imprint>
                            <date/>
                        </imprint>
                    </monogr>
                    <idno type="MD5">BBE748E17EC45A5BEA5F9352C47C3F83</idno>
                </biblStruct>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <appInfo>
                <application version="0.8.1" ident="GROBID" when="2025-01-19T13:51+0000">
                    <desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
                    <ref target="https://github.com/kermitt2/grobid"/>
                </application>
            </appInfo>
        </encodingDesc>
        <profileDesc>
            <abstract/>
        </profileDesc>
    </teiHeader>
    <text xml:lang="en">
        <body>
            <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head>F I G U R E 1</head>
                <p>The impact of research and innovation Note: Left: business expenditure in R&amp;D (BERD) measured in percentage of gross domestic product (GDP) 2020 and labour productivity 2021 based on Eurostat. Right: Where-to-Be-Born Index by ...

The second result does not include the abstract, while the first one does. It makes no difference whether Consolidate header is enabled or not.

@lfoppiano
Collaborator

Hi @loopdeloop76, indeed the first page is missing from the processFulltextDocument result. I believe this is because the first page is classified as a cover page, and cover pages are usually ignored. Cover pages are pages added by the publisher or the repository (e.g. ResearchGate, HAL) that duplicate the main bibliographical information.

processHeaderDocument uses more rules and in this case gives a better result.

This happens because Grobid is not really trained on this kind of document, so improving the result would require a few examples as training data.

@lfoppiano lfoppiano added the error cases Some error/test case for future improvements label Jan 28, 2025
@loopdeloop76
Author

@lfoppiano Thank you very much for looking into this!
We use Grobid in an app for scientists, so most of the uploaded PDFs are papers. However, some are not, and it would be great if we could still extract as much text and metadata as possible.
Would you suggest always making an extra request to processHeaderDocument if processFulltextDocument does not return any metadata?
Or would it be better to fall back to another PDF parser entirely if the PDF is not a paper?

@lfoppiano
Collaborator

lfoppiano commented Feb 2, 2025

It depends on the share of non-standard scientific articles. Assuming they are around 10%, I think it would be simpler for you to make an additional request to Grobid for the bibliographic metadata when the main fields (let's say title and authors) are missing.
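
For illustration, here is a minimal Python sketch of that fallback. It is not taken from the Grobid code base. The helper names (`header_metadata_missing`, `post_pdf`, `extract_with_fallback`) are hypothetical, and so are the base URL `http://localhost:8070`, the `Accept: application/xml` header, and the choice to treat "title empty or no authors" as "missing"; adjust them to your deployment. The check only looks inside `teiHeader`, so authors of cited references in the body are not mistaken for the document's authors.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Tuple, Union

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
Tei = Union[str, bytes]


def header_metadata_missing(tei: Tei) -> bool:
    """Return True when the TEI header has no title text or no authors.

    Assumption: "missing" means an empty title OR no <author> elements.
    """
    root = ET.fromstring(tei)
    header = root.find("tei:teiHeader", TEI_NS)
    if header is None:
        return True
    title = header.find(".//tei:titleStmt/tei:title", TEI_NS)
    has_title = title is not None and "".join(title.itertext()).strip() != ""
    # Search only inside teiHeader so authors of cited references are ignored.
    has_authors = bool(header.findall(".//tei:analytic/tei:author", TEI_NS))
    return not (has_title and has_authors)


def post_pdf(endpoint: str, pdf_bytes: bytes,
             base_url: str = "http://localhost:8070") -> bytes:
    """Send a PDF to a Grobid endpoint and return the raw response body.

    Assumptions: base_url points at your Grobid service, and the third-party
    `requests` package is installed (imported lazily here).
    """
    import requests

    response = requests.post(
        f"{base_url}/api/{endpoint}",
        files={"input": ("document.pdf", pdf_bytes, "application/pdf")},
        headers={"Accept": "application/xml"},  # assumption: request TEI XML
        timeout=120,
    )
    response.raise_for_status()
    return response.content


def extract_with_fallback(
    pdf_bytes: bytes,
    post: Callable[[str, bytes], Tei] = post_pdf,
) -> Tuple[Tei, Optional[Tei]]:
    """Run processFulltextDocument; call processHeaderDocument only if needed."""
    fulltext = post("processFulltextDocument", pdf_bytes)
    header = None
    if header_metadata_missing(fulltext):
        header = post("processHeaderDocument", pdf_bytes)
    return fulltext, header
```

With this shape, the extra request is made only for the documents whose full-text result lacks the main fields. Note that for the example PDF in this issue the header result also has an empty title and no authors, so if the abstract matters to you, the check could additionally test whether `profileDesc/abstract` is empty.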

@lfoppiano lfoppiano added question There's no such thing as a stupid question and removed error cases Some error/test case for future improvements labels Feb 2, 2025
@loopdeloop76
Author

Thanks @lfoppiano !
