-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implimented arXivId Parsing for PDF with arXivId #12335
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.
In case of issues with the import order, double check that you activated Auto Import.
You can trigger fixing imports by pressing Ctrl+Alt+O to trigger Optimize Imports.
@koppor could you please review this PR and suggest changes where I am wrong. |
Thanks for your contribution. Please fix the check style issues first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use OpenRewrite to ensure "modern" Java coding practices.
The issues found can be automatically fixed.
Please execute the gradle task rewriteRun
, check the results, commit, and push.
You can check the detailed error output by navigating to your pull request, selecting the tab "Checks", section "Tests" (on the left), subsection "OpenRewrite".
Hello maintainers, could you please review this PR, I have fixed the previous issues. |
Codewise looks good to me. You have accidentally commited the csl styles submodules, see https://devdocs.jabref.org/code-howtos/faq.html#the-problem for a solution |
Hmm, I tried to import these papers: https://arxiv.org/abs/2406.14319 But JabRef didn't import it properly, and |
I have changed the
and modified the this fixes the null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
Hi, ar-rana! Thanks for working on this PR! As holidays come soon, maintainers could be too busy these weeks, so don't worry if we give feedback too late |
@@ -609,11 +609,15 @@ private String getDoi(String doi) { | |||
|
|||
private String getArXivId(String arXivId) { | |||
if (arXivId == null) { | |||
arXivId = ArXivIdentifier.parse(curString).map(ArXivIdentifier::asString).orElse(null); | |||
String arXiv = curString.split(" ")[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could lead to a null pointer if there is no whitespace and you access the index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Siedlerchr, I tested this method with empty/non-empty strings with/without whitespaces and it would only give a null pointer if the curString
is null which does not seem to be the case here. So should I add a change here or leave it, as getDoi
also work the same way
String arXiv = curString.split(" ")[0]; | ||
arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null); | ||
if (arXivId != null) { | ||
if (curString.length() > arXivId.length() + 7) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does 7 stand for here? If possible, define a constant with a good name for the value 7 in this context.
Also, consider reducing the level of nesting in the code by returning early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7 is for the 'arxiv:'
prefix in the arxiv string,
If possible, define a constant with a good name for the value 7 in this context.
Also, consider reducing the level of nesting in the code by returning early.
sure, I will work on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, ARXIV_PREFIX_LENGTH = "arxiv:".length()
would be a good name.
@@ -608,18 +608,16 @@ private String getDoi(String doi) { | |||
} | |||
|
|||
private String getArXivId(String arXivId) { | |||
final int ARXIV_PREFIX_LENGTH = "arxiv:".length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's define it as static and move it outside the method to avoid reallocating the same string object everytime the method is called.
private static final ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should i move it to the ArXivIdentifier
class instead, i think it might be more suited there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general (and in OOP more specifically) we want to keep the data as close as possible to the code that uses it, if the constant is used there then okay otherwise let's keep it in the class where it is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, thank you
proceedToNextNonEmptyLine(); | ||
} | ||
return arXivId; | ||
if (arXivId != null && curString.length() > arXivId.length() + ARXIV_PREFIX_LENGTH) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not what I meant by reduced nesting, Please read this: https://szymonkrajewski.pl/why-should-you-return-early/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private String getArXivId(String arXivId) {
if (arXivId != null) {
return arXivId;
}
String arXiv = curString.split(" ")[0];
arXivId = ArXivIdentifier.parse(arXiv).map(ArXivIdentifier::asString).orElse(null);
if (arXivId == null || curString.length() < arXivId.length() + ARXIV_PREFIX_LENGTH) {
return arXivId;
}
// The arxiv string also contains the year
curString = curString.substring(arXivId.length() + ARXIV_PREFIX_LENGTH);
extractYear();
curString = "";
proceedToNextNonEmptyLine();
return arXivId;
}
This is what I wrote based on my understanding of the article, please share you feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, great, it looks better!
.withField((StandardField.KEYWORDS), "Test Automation Artificial Intelligence AI-assisted Test Automation Grey Literature Automated Test Generation Self-Healing Test Scripts"); | ||
|
||
String firstPageContent = """ | ||
arXiv:2408.06224v1 [cs.SE] 12 Aug 2024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it guaranteed that the arXiv id will be on the first line? If not we need to add at least two more tests.
- ArXiv id in a middle line (the line begins with the id)
- ArXiv ID in a middle line (the ID is somewhere in the middle of that line)
You can use JUnit 5 Parameterized Tests to reduce verbosity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hello houssem, I checked multiple arXiv papers(10) and all of them had their arxiv string at the last line, here you can see it at the top(sorry for the oversight, will fix that) but moving it at the last also passes the test, and since all these papers have the same format i think the arxiv string would mostly be at the end
// year is a class variable as the method extractYear() uses it; | ||
String publisher = null; | ||
|
||
EntryType type = StandardEntryType.InProceedings; | ||
if (curString.length() > 4) { | ||
// special case: possibly conference as first line on the page | ||
arXivId = getArXivId(null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example why comments are to be avoided as much as possible, this line compiles fine, runs fine but the comment above it is misguiding. Please move comment to its correct location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
Great! I've imported some arXiv paper that I had on my disk, and it had successfully extracted the ID into If you want to continue working on this feature, you could try to improve it further: Ideally, I, as a user of JabRef, would like to import a paper in PDF, and JabRef should correctly determine all the necessary metadata. Your PR works on PDF importers level: you found arXiv ID in PDF content and included in Unfortunately, PDF import is not ideal (and it's a hard task on its own). Now, it would be cool if after finding the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Just some small changes. The issue with abbrev
and csl-styles
should be resolved too
@@ -557,19 +561,25 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS | |||
if (doi != null) { | |||
entry.setField(StandardField.DOI, doi); | |||
} | |||
if (arXivId != null) { | |||
entry.setField(StandardField.EPRINT, arXivId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@koppor, @Siedlerchr should there be additional fields like archivePrefix
and primaryClass
like in https://tex.stackexchange.com/a/3848/313784?
if (year != null && !"0000".equals(year)) { | ||
entry.setField(StandardField.YEAR, year); | ||
} else if (arXivId != null) { | ||
year = "20" + arXivId.substring(0, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we should not set it.
Here is an example of eprint on arXiv that dates to 1995: https://arxiv.org/abs/hep-ph/9707270
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, this if clause should be deleted, as if I import the paper I mentioned before, it would probably set year to 2095, which is not right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will do that
@@ -531,6 +538,10 @@ Optional<BibEntry> getEntryFromPDFContent(String firstpageContents, String lineS | |||
} | |||
} | |||
|
|||
if (arXivId != null && arXivId.contains(year)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a special reason, why year
should be nulled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sometimes the extractyear()
would pick up number from the arXivId and that is why it was being set to null so later on line 586-589
it would be corrected but now that we are removing that if clause, I will remove this as well
sounds interesting, would love work on this |
* upstream/main: (29 commits) Bump org.glassfish.jersey.containers:jersey-container-grizzly2-http (JabRef#12384) Bump src/main/resources/csl-styles from `080516e` to `6bae16d` (JabRef#12387) Bump src/main/resources/csl-locales from `96d704d` to `9914965` (JabRef#12386) Bump buildres/abbrv.jabref.org from `93a2cad` to `e74e6eb` (JabRef#12385) Bump org.openrewrite.rewrite from 6.29.3 to 7.0.0 (JabRef#12383) Bump org.glassfish.jersey.core:jersey-server from 3.1.9 to 3.1.10 (JabRef#12381) Bump org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-grizzly2 (JabRef#12380) fix linux build not updated Refactor the isUnwantedText (JabRef#12369) Searching for entries with empty field (JabRef#12376) Downgrade Ubuntu (JabRef#12375) Downgrade Ubuntu Fix main table sorting (JabRef#12371) fix bib and pdf name (JabRef#12366) use v4 instead of master Update abbrv.jabref.org (JabRef#12365) Bump buildres/abbrv.jabref.org from `78e1b08` to `c202741` (JabRef#12363) Bump org.beryx.jlink from 3.1.0-rc-1 to 3.1.1 (JabRef#12362) Bump tech.units:indriya from 2.2.1 to 2.2.2 (JabRef#12361) Bump com.dlsc.gemsfx:gemsfx from 2.80.0 to 2.81.0 (JabRef#12360) ... # Conflicts: # buildres/abbrv.jabref.org # src/main/resources/csl-styles
@InAnYan I was working on the Arxiv Metadata method and this is what I have written so far,
this method returns an |
Used the 'parse' method in ArXivIdentifier to get arXivId and added a testcase for the same using the link give in the issue.
Closes #12000
Closes https://github.com/koppor/jabref/issues/47".
Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if change is visible to the user)