Implemented arXivId Parsing for PDF with arXivId #12335

Open
wants to merge 11 commits into
base: main

Conversation

ar-rana
Contributor

@ar-rana ar-rana commented Dec 24, 2024

Used the 'parse' method in ArXivIdentifier to get the arXivId and added a test case for it using the link given in the issue.
Closes #12000
Closes https://github.com/koppor/jabref/issues/47

Mandatory checks

  • I own the copyright of the code submitted and I licence it under the MIT license
  • Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Contributor

@github-actions github-actions bot left a comment

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

In case of issues with the import order, double check that you activated Auto Import.
You can trigger fixing imports by pressing Ctrl+Alt+O to trigger Optimize Imports.

@ar-rana ar-rana marked this pull request as draft December 24, 2024 15:59
@ar-rana
Contributor Author

ar-rana commented Dec 24, 2024

@koppor could you please review this PR and suggest changes where I am wrong.

@Siedlerchr
Member

Thanks for your contribution. Please fix the Checkstyle issues first.

Contributor

@github-actions github-actions bot left a comment

Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun, check the results, commit, and push.

You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".

@ar-rana
Contributor Author

ar-rana commented Dec 25, 2024

Hello maintainers, could you please review this PR? I have fixed the previous issues.
The extra changes shown here appear because I merged the latest changes from upstream. Please ignore them; they will not be in the actual PR.

@ar-rana ar-rana marked this pull request as ready for review December 25, 2024 13:20
@Siedlerchr
Member

Codewise looks good to me. You have accidentally committed the csl-styles submodules; see https://devdocs.jabref.org/code-howtos/faq.html#the-problem for a solution.

@Siedlerchr Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 25, 2024
@InAnYan
Collaborator

InAnYan commented Dec 25, 2024

Hmm, I tried to import these papers:

https://arxiv.org/abs/2406.14319
https://arxiv.org/abs/2412.06769

But JabRef didn't import them properly, and the arXiv ID was null every time.

@ar-rana
Contributor Author

ar-rana commented Dec 27, 2024

I have changed the getArXivId method to this in my local branch:

private String getArXivId(String arXivId) {
    if (arXivId == null) {
        arXivId = ArXivIdentifier.parse(curString).map(ArXivIdentifier::asString).orElse(null);
        if (arXivId != null) {
            if (curString.length() > arXivId.length() + 7) {
                curString = curString.substring(arXivId.length() + 7);
                extractYear();
            }
            return arXivId;
        }
    }
    return arXivId;
}

I also modified the ArXivIdentifier.parse method slightly, changing the identifier at line 41 to: String identifier = value.split(" ")[0];

This fixes the null arXiv problem that @InAnYan was encountering when externally importing the papers, but for some reason it still does not import the titles correctly, except for the paper that was mentioned in the issue.

Screenshot 2024-12-27 195014

Contributor

@github-actions github-actions bot left a comment

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

@InAnYan
Collaborator

InAnYan commented Dec 28, 2024

Hi, ar-rana! Thanks for working on this PR!

As the holidays are coming up, maintainers may be busy over the next few weeks, so don't worry if our feedback is delayed.

@@ -609,11 +609,15 @@ private String getDoi(String doi) {

private String getArXivId(String arXivId) {
if (arXivId == null) {
arXivId = ArXivIdentifier.parse(curString).map(ArXivIdentifier::asString).orElse(null);
String arXiv = curString.split(" ")[0];

Member

This could lead to a null pointer if there is no whitespace and you access the index

Contributor Author

@ar-rana ar-rana Jan 2, 2025

Hello Siedlerchr, I tested this method with empty and non-empty strings, with and without whitespace, and it would only give a null pointer if curString is null, which does not seem to be the case here. So should I make a change here or leave it, as getDoi also works the same way?
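
For reference, the edge cases under discussion can be checked directly. The sketch below relies only on the documented behavior of `String.split`; `firstToken` is a hypothetical helper, not a JabRef method. A line without any space is safe, because `split` returns the whole string as its single element. A line consisting only of a space, however, yields an empty array, so indexing `[0]` throws an `ArrayIndexOutOfBoundsException` rather than a `NullPointerException`.

```java
class SplitPitfall {
    /** Hypothetical helper: first space-separated token, or "" if there is none. */
    static String firstToken(String line) {
        if (line == null) {
            return "";
        }
        String[] parts = line.split(" ");
        // split discards trailing empty strings, so " " produces a zero-length array
        return parts.length == 0 ? "" : parts[0];
    }

    public static void main(String[] args) {
        System.out.println("abc".split(" ").length); // no space: one element
        System.out.println("".split(" ").length);    // empty string: one element ("")
        System.out.println(" ".split(" ").length);   // only a space: zero elements
        System.out.println(firstToken("arXiv:2408.06224v1 [cs.SE] 12 Aug 2024"));
    }
}
```

Guarding on the array length keeps the caller free of index checks; whether the PDF importer can ever see a whitespace-only `curString` is not established in this thread.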

String arXiv = curString.split(" ")[0];
arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null);
if (arXivId != null) {
if (curString.length() > arXivId.length() + 7) {

Member

What does 7 stand for here? If possible, define a constant with a good name for the value 7 in this context.

Also, consider reducing the level of nesting in the code by returning early.

Contributor Author

7 is for the 'arxiv:' prefix in the arXiv string.

If possible, define a constant with a good name for the value 7 in this context.

Also, consider reducing the level of nesting in the code by returning early.

sure, I will work on that.

Member

In that case, ARXIV_PREFIX_LENGTH = "arxiv:".length() would be a good name.
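
A minimal sketch of the suggested constant, using the sample line from this PR's test. `PrefixDemo` and `remainderAfterId` are hypothetical names; JabRef's actual parsing lives in `ArXivIdentifier`. Note that `"arxiv:".length()` evaluates to 6, not 7. If the original literal was also meant to skip the separator after the identifier (an assumption), that character has to be handled separately, as `trim()` does here.

```java
class PrefixDemo {
    // Length of the "arxiv:" marker, computed rather than hard-coded (6 characters).
    private static final int ARXIV_PREFIX_LENGTH = "arxiv:".length();

    /** Returns the text following "arXiv:<id>" on the line, or "" if nothing follows. */
    static String remainderAfterId(String line, String id) {
        int offset = ARXIV_PREFIX_LENGTH + id.length();
        if (line.length() <= offset) {
            return "";
        }
        // trim() removes the separator between the identifier and the rest of the line
        return line.substring(offset).trim();
    }

    public static void main(String[] args) {
        String line = "arXiv:2408.06224v1 [cs.SE] 12 Aug 2024";
        System.out.println(remainderAfterId(line, "2408.06224v1"));
    }
}
```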

@@ -608,18 +608,16 @@ private String getDoi(String doi) {
}

private String getArXivId(String arXivId) {
final int ARXIV_PREFIX_LENGTH = "arxiv:".length();

Member

Let's define it as static and move it outside the method to avoid reallocating the same string object every time the method is called.

private static final ...

Contributor Author

Should I move it to the ArXivIdentifier class instead? I think it might be better suited there.

Member

In general (and in OOP more specifically), we want to keep the data as close as possible to the code that uses it. If the constant is used there, then okay; otherwise, let's keep it in the class where it is used.

Contributor Author

sure, thank you

proceedToNextNonEmptyLine();
}
return arXivId;
if (arXivId != null && curString.length() > arXivId.length() + ARXIV_PREFIX_LENGTH) {

Member

This is not what I meant by reduced nesting. Please read this: https://szymonkrajewski.pl/why-should-you-return-early/

Contributor Author

private String getArXivId(String arXivId) {
    if (arXivId != null) {
        return arXivId;
    }

    String arXiv = curString.split(" ")[0];
    arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null);

    if (arXivId == null || curString.length() < arXivId.length() + ARXIV_PREFIX_LENGTH) {
        return arXivId;
    }
    // The arxiv string also contains the year
    curString = curString.substring(arXivId.length() + ARXIV_PREFIX_LENGTH);
    extractYear();
    curString = "";
    proceedToNextNonEmptyLine();

    return arXivId;
}

This is what I wrote based on my understanding of the article. Please share your feedback.

Member

Yes, great, it looks better!

.withField((StandardField.KEYWORDS), "Test Automation Artificial Intelligence AI-assisted Test Automation Grey Literature Automated Test Generation Self-Healing Test Scripts");

String firstPageContent = """
arXiv:2408.06224v1 [cs.SE] 12 Aug 2024

Member

Is it guaranteed that the arXiv id will be on the first line? If not we need to add at least two more tests.

  • ArXiv id in a middle line (the line begins with the id)
  • ArXiv ID in a middle line (the ID is somewhere in the middle of that line)

You can use JUnit 5 Parameterized Tests to reduce verbosity.
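
The three placements (first line, start of a middle line, inside a middle line) could be exercised with a table-driven check along these lines. This is a sketch only: the regex is a simplified pattern for new-style identifiers, `findArXivId` is a hypothetical stand-in rather than JabRef's `ArXivIdentifier.parse`, and the sample pages other than the first line are made up. In the real test suite, the same cases would map onto a JUnit 5 `@ParameterizedTest`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArXivIdPositions {
    // Simplified: "arXiv:" followed by YYMM.NNNN or YYMM.NNNNN and an optional version suffix.
    private static final Pattern ARXIV_ID =
            Pattern.compile("arXiv:(\\d{4}\\.\\d{4,5}(v\\d+)?)", Pattern.CASE_INSENSITIVE);

    /** Hypothetical helper: returns the first arXiv ID found anywhere in the text. */
    static Optional<String> findArXivId(String text) {
        Matcher matcher = ARXIV_ID.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] pages = {
            "arXiv:2408.06224v1 [cs.SE] 12 Aug 2024\nSome Title\nSome Author", // first line
            "Some Title\narXiv:2408.06224v1 [cs.SE] 12 Aug 2024\nSome Author", // start of a middle line
            "Some Title\nPreprint, arXiv:2408.06224v1 [cs.SE]\nSome Author",   // inside a middle line
        };
        for (String page : pages) {
            System.out.println(findArXivId(page).orElse("<none>"));
        }
    }
}
```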

Contributor Author

Hello houssem, I checked multiple arXiv papers (10), and all of them had their arXiv string on the last line. Here you can see it at the top (sorry for the oversight, I will fix that), but moving it to the last line also passes the test. Since all these papers have the same format, I think the arXiv string would mostly be at the end.

// year is a class variable as the method extractYear() uses it;
String publisher = null;

EntryType type = StandardEntryType.InProceedings;
if (curString.length() > 4) {
// special case: possibly conference as first line on the page
arXivId = getArXivId(null);

Member

This is an example of why comments are to be avoided as much as possible: this line compiles fine and runs fine, but the comment above it is misleading. Please move the comment to its correct location.

Contributor Author

sure

@InAnYan
Collaborator

InAnYan commented Jan 13, 2025

Great! I've imported an arXiv paper that I had on my disk, and it successfully extracted the ID into eprint!

If you want to continue working on this feature, you could try to improve it further:

Ideally, I, as a user of JabRef, would like to import a paper in PDF, and JabRef should correctly determine all the necessary metadata. Your PR works at the PDF importer level: you found the arXiv ID in the PDF content and included it in the eprint field. This is great! And it's the first important step in improving PDF import.

Unfortunately, PDF import is not ideal (and it's a hard task on its own). Now, it would be cool if, after finding the eprint, another fetcher were called to extract all metadata from the arXiv service based on that ID.

Collaborator

@InAnYan InAnYan left a comment

Great! Just some small changes. The issue with abbrev and csl-styles should be resolved too

@@ -557,19 +561,25 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
if (doi != null) {
entry.setField(StandardField.DOI, doi);
}
if (arXivId != null) {
entry.setField(StandardField.EPRINT, arXivId);

Collaborator

@koppor, @Siedlerchr should there be additional fields like archivePrefix and primaryClass like in https://tex.stackexchange.com/a/3848/313784?

if (year != null && !"0000".equals(year)) {
entry.setField(StandardField.YEAR, year);
} else if (arXivId != null) {
year = "20" + arXivId.substring(0, 2);

Collaborator

I think, we should not set it.

Here is an example of an eprint on arXiv that dates to 1997: https://arxiv.org/abs/hep-ph/9707270

Collaborator

Still, this if clause should be deleted, because if I import the paper I mentioned before, it would probably set the year to 2097, which is not right.
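
To illustrate why prefixing `20` is unsafe: arXiv identifiers encode the submission month as `YYMM`, and old-style identifiers such as `hep-ph/9707270` date from the 1990s and early 2000s. The sketch below shows one possible way to expand the two-digit year. `yearFromArXivId` is a hypothetical helper, the patterns are simplified, and the cutoff at 91 assumes that no identifier predates arXiv's 1991 start. It is not part of this PR, which drops the fallback altogether.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArXivYear {
    // Old style: archive/YYMMNNN (e.g. hep-ph/9707270), optionally with a subject class (math.GT/0309136).
    private static final Pattern OLD_STYLE =
            Pattern.compile("^[a-z-]+(\\.[A-Z]{2})?/(\\d{2})\\d{5}(v\\d+)?$");
    // New style: YYMM.NNNN or YYMM.NNNNN (e.g. 2408.06224v1).
    private static final Pattern NEW_STYLE =
            Pattern.compile("^(\\d{2})\\d{2}\\.\\d{4,5}(v\\d+)?$");

    /** Hypothetical helper: four-digit year encoded in the ID, or empty if the ID is not recognized. */
    static Optional<Integer> yearFromArXivId(String id) {
        Matcher oldStyle = OLD_STYLE.matcher(id);
        if (oldStyle.matches()) {
            int yy = Integer.parseInt(oldStyle.group(2));
            // Assumption: two-digit years from 91 upward belong to the 1990s.
            return Optional.of(yy >= 91 ? 1900 + yy : 2000 + yy);
        }
        Matcher newStyle = NEW_STYLE.matcher(id);
        if (newStyle.matches()) {
            return Optional.of(2000 + Integer.parseInt(newStyle.group(1)));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(yearFromArXivId("hep-ph/9707270"));
        System.out.println(yearFromArXivId("2408.06224v1"));
        System.out.println(yearFromArXivId("not-an-id"));
    }
}
```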

Contributor Author

sure, will do that

@@ -531,6 +538,10 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
}
}

if (arXivId != null && arXivId.contains(year)) {

Collaborator

Is there a special reason why year should be set to null?

Contributor Author

Sometimes extractYear() would pick up a number from the arXivId, which is why year was being set to null, so that it would be corrected later on lines 586-589. Now that we are removing that if clause, I will remove this as well.

@ar-rana
Contributor Author

ar-rana commented Jan 13, 2025

Great! I've imported an arXiv paper that I had on my disk, and it successfully extracted the ID into eprint!

If you want to continue working on this feature, you could try to improve it further:

Ideally, I, as a user of JabRef, would like to import a paper in PDF, and JabRef should correctly determine all the necessary metadata. Your PR works at the PDF importer level: you found the arXiv ID in the PDF content and included it in the eprint field. This is great! And it's the first important step in improving PDF import.

Unfortunately, PDF import is not ideal (and it's a hard task on its own). Now, it would be cool if, after finding the eprint, another fetcher were called to extract all metadata from the arXiv service based on that ID.

Sounds interesting, I would love to work on this.

* upstream/main: (29 commits)
  Bump org.glassfish.jersey.containers:jersey-container-grizzly2-http (JabRef#12384)
  Bump src/main/resources/csl-styles from `080516e` to `6bae16d` (JabRef#12387)
  Bump src/main/resources/csl-locales from `96d704d` to `9914965` (JabRef#12386)
  Bump buildres/abbrv.jabref.org from `93a2cad` to `e74e6eb` (JabRef#12385)
  Bump org.openrewrite.rewrite from 6.29.3 to 7.0.0 (JabRef#12383)
  Bump org.glassfish.jersey.core:jersey-server from 3.1.9 to 3.1.10 (JabRef#12381)
  Bump org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-grizzly2 (JabRef#12380)
  fix linux build not updated
  Refactor the isUnwantedText (JabRef#12369)
  Searching for entries with empty field (JabRef#12376)
  Downgrade Ubuntu (JabRef#12375)
  Downgrade Ubuntu
  Fix main table sorting (JabRef#12371)
  fix bib and pdf name (JabRef#12366)
  use v4 instead of master
  Update abbrv.jabref.org (JabRef#12365)
  Bump buildres/abbrv.jabref.org from `78e1b08` to `c202741` (JabRef#12363)
  Bump org.beryx.jlink from 3.1.0-rc-1 to 3.1.1 (JabRef#12362)
  Bump tech.units:indriya from 2.2.1 to 2.2.2 (JabRef#12361)
  Bump com.dlsc.gemsfx:gemsfx from 2.80.0 to 2.81.0 (JabRef#12360)
  ...

# Conflicts:
#	buildres/abbrv.jabref.org
#	src/main/resources/csl-styles
@ar-rana
Contributor Author

ar-rana commented Jan 15, 2025

@InAnYan I was working on the arXiv metadata method, and this is what I have written so far:

    private Optional<BibEntry> getMetadataFromArxiv(String arXivId) {
        PreferencesFactory preferencesFactory = new PreferencesFactory();
        ArXivFetcher arXivFetcher = new ArXivFetcher(preferencesFactory.provide().getImportFormatPreferences());

        Optional<BibEntry> bibEntry = null;
        try {
            bibEntry = arXivFetcher.performSearchById(arXivId);
        } catch (FetcherException exception) {
            ParserResult.fromError(exception);
        }

        return bibEntry;
    }

This method returns an Optional<BibEntry> using ArXivFetcher, which can be returned directly by the getEntryFromPDFContent method. I just wanted to know whether I should proceed with this implementation, because the other approaches return only a few fields rather than all the metadata for the arXivId. (I will also have to significantly modify the test case here.)
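
One detail worth flagging in a method shaped like this: initializing an `Optional` variable to `null` means callers receive `null` when the fetch throws, which defeats the purpose of `Optional`. A minimal illustration of returning `Optional.empty()` instead follows. `IdLookup` and `LookupException` are hypothetical stand-ins for `ArXivFetcher.performSearchById` and `FetcherException`, used so that the example compiles without JabRef on the classpath; the entry is modeled as a plain `String`.

```java
import java.util.Optional;

class SafeLookup {
    /** Hypothetical stand-in for a checked fetcher exception. */
    static class LookupException extends Exception {
        LookupException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for a fetcher that resolves an ID to an entry. */
    interface IdLookup {
        Optional<String> performSearchById(String id) throws LookupException;
    }

    /** Never returns null: a failed lookup is reported and mapped to Optional.empty(). */
    static Optional<String> fetchMetadata(IdLookup lookup, String arXivId) {
        try {
            return lookup.performSearchById(arXivId);
        } catch (LookupException e) {
            System.err.println("Lookup failed for " + arXivId + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        IdLookup working = id -> Optional.of("entry for " + id);
        IdLookup failing = id -> {
            throw new LookupException("service unavailable");
        };
        System.out.println(fetchMetadata(working, "2408.06224").isPresent());
        System.out.println(fetchMetadata(failing, "2408.06224").isPresent());
    }
}
```

With this shape, the importer could fall back to the fields it parsed from the PDF whenever the result is empty, without a null check.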

Labels
status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Development

Successfully merging this pull request may close these issues.

Importing a PDF with arXiv Id should fetch the arXiv information using the arXivFetcher
4 participants