Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implimented arXivId Parsing for PDF with arXivId #12335

Merged
merged 22 commits into from
Jan 20, 2025
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -372,9 +372,9 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
EntryType type = StandardEntryType.InProceedings;
if (curString.length() > 4) {
// special case: possibly conference as first line on the page
arXivId = getArXivId(null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example why comments are to be avoided as much as possible, this line compiles fine, runs fine but the comment above it is misguiding. Please move comment to its correct location.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

extractYear();
doi = getDoi(null);
arXivId = getArXivId(arXivId);
if (curString.contains("Conference")) {
fillCurStringWithNonEmptyLines();
conference = curString;
@@ -391,7 +391,7 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS
}
}
// sometimes ArXiv ID is read before title
getArXivId(null);
arXivId = getArXivId(arXivId);
// start: title
fillCurStringWithNonEmptyLines();
title = streamlineTitle(curString);
@@ -608,18 +608,16 @@ private String getDoi(String doi) {
}

private String getArXivId(String arXivId) {
final int ARXIV_PREFIX_LENGTH = "arxiv:".length();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's define it as static and move it outside the method to avoid reallocating the same string object everytime the method is called.

private static final ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should i move it to the ArXivIdentifier class instead, i think it might be more suited there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general (and in OOP more specifically) we want to keep the data as close as possible to the code that uses it, if the constant is used there then okay otherwise let's keep it in the class where it is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, thank you

if (arXivId == null) {
String arXiv = curString.split(" ")[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could lead to a null pointer if there is no whitespace and you access the index

Copy link
Contributor Author

@ar-rana ar-rana Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Siedlerchr, I tested this method with empty/non-empty strings with/without whitespaces and it would only give a null pointer if the curString is null which does not seem to be the case here. So should I add a change here or leave it, as getDoi also work the same way

arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null);
if (arXivId != null) {
if (curString.length() > arXivId.length() + 7) {
// The arxiv string also contains the year
curString = curString.substring(arXivId.length() + 7);
extractYear();
curString = "";
proceedToNextNonEmptyLine();
}
return arXivId;
if (arXivId != null && curString.length() > arXivId.length() + ARXIV_PREFIX_LENGTH) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not what I meant by reduced nesting, Please read this: https://szymonkrajewski.pl/why-should-you-return-early/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private String getArXivId(String arXivId) {
        if (arXivId != null) {
            return arXivId;
        }

        String arXiv = curString.split(" ")[0];
        arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null);

        if (arXivId == null || curString.length() < arXivId.length() + ARXIV_PREFIX_LENGTH) {
            return arXivId;
        }
        // The arxiv string also contains the year
        curString = curString.substring(arXivId.length() + ARXIV_PREFIX_LENGTH);
        extractYear();
        curString = "";
        proceedToNextNonEmptyLine();

        return arXivId;
    }

This is what I wrote based on my understanding of the article, please share you feedback

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great, it looks better!

// The arxiv string also contains the year
curString = curString.substring(arXivId.length() + ARXIV_PREFIX_LENGTH);
extractYear();
curString = "";
proceedToNextNonEmptyLine();
}
}
return arXivId;