feat: use author identifiers in import API #10110

Open: wants to merge 18 commits into master

pidgezero-one (Contributor) commented Dec 3, 2024

This should be squash merged

Corresponding model update PR: internetarchive/openlibrary-client#419

This strictly expands the import schema.
It is not a breaking change.
Import records that don't include author IDs will continue to work as they currently do.

Closes #9448
Closes #9411

Technical

  • Adds support for author identifiers in import records.
    • If an author in a book's import record has an "ol_id" property, the import API will attempt to find the existing author with that OL ID.
    • If the author has no ol_id, or has an ol_id that doesn't match any existing author, and the author has a "remote_ids" property, the import API will attempt to find the existing author that matches the most remote IDs in the record.
      • Q: Should specifying an ol_id that doesn't exist in our DB be an error that rejects the import?
    • If the author doesn't include or match any of the above, it will continue to be matched on name and birth/death date, which is how the import API already operates in production.
  • Wikisource script updates:
    • Fixes incorrect birth/death date parsing.
    • Books with no identified author, title, or publish date will not be included in the jsonl output.
    • The name formatting helper function is only used when the author's name comes specifically from Wikisource and not from Wikidata.
      • Most of the time, WS import records produced by this script will strictly use author info from WD. However, not every WD item corresponding to a WS book is properly linked to an author. In those cases, the script falls back to getting author information from the WS API response instead. WS data is highly unstructured, so the name formatter is used only in those cases.
    • Moved dependencies specific to the wikisource script into a separate requirements.txt file that is intended to be installed only temporarily, since they're not required for OL to run. Instructions are included for how to run the script with this consideration.
    • Adds author identifiers to its output records, since it uses the Wikidata API, which includes OL IDs and most other remote_ids.
      • WS was the easiest source for me to use for generating records that had enough information to test these additions with. Nothing in the updated author matching logic is actually specific to WS, except for the next bullet point:
    • Wikisource records are exempt from being rejected for having a 1900 publish date. I don't know whether this is a good idea and am seeking feedback on it.
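
For reviewers, the author-matching order described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the `Author` class, `match_author`, and the in-memory `known_authors` list are hypothetical stand-ins for the real lookups, the name/date fallback is reduced to exact equality, and how ties between equally good remote ID matches are broken is left unspecified (here the first candidate wins).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Author:
    """Hypothetical stand-in for an existing author record."""
    ol_id: str
    name: str
    birth_date: str | None = None
    death_date: str | None = None
    remote_ids: dict[str, str] = field(default_factory=dict)


def count_remote_id_matches(candidate: Author, incoming: dict[str, str]) -> int:
    """Count how many incoming remote IDs agree with the candidate's remote IDs."""
    return sum(
        1 for key, value in incoming.items()
        if candidate.remote_ids.get(key) == value
    )


def matches_by_name_and_dates(candidate: Author, record_author: dict) -> bool:
    """Simplified fallback (assumption): exact name, plus any dates the record supplies."""
    if candidate.name != record_author.get("name"):
        return False
    for date_field in ("birth_date", "death_date"):
        incoming = record_author.get(date_field)
        if incoming and getattr(candidate, date_field) != incoming:
            return False
    return True


def match_author(record_author: dict, known_authors: list[Author]) -> Author | None:
    # Step 1: an explicit OL ID wins if it points at an existing author.
    ol_id = record_author.get("ol_id")
    if ol_id:
        for candidate in known_authors:
            if candidate.ol_id == ol_id:
                return candidate

    # Step 2: otherwise pick the author sharing the most remote IDs with the record.
    incoming = record_author.get("remote_ids") or {}
    if incoming:
        best = max(
            known_authors,
            key=lambda a: count_remote_id_matches(a, incoming),
            default=None,
        )
        if best is not None and count_remote_id_matches(best, incoming) > 0:
            return best

    # Step 3: fall back to name and birth/death date matching.
    for candidate in known_authors:
        if matches_by_name_and_dates(candidate, record_author):
            return candidate
    return None
```

In this sketch an unknown ol_id silently falls through to the remote ID step, which is the behaviour the open question above is asking about.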

Issues:

  • Importing books succeeds, and matching authors are found and used as expected. However, navigating to the author's page from the new book's page does not show the new book on the author's page.

Testing

I put the entire output of the wikisource script into /import/batch/new.
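
To give a sense of what such a batch looks like, here is a minimal sketch of producing JSONL records that carry author identifiers. The "ol_id" and "remote_ids" property names come from the description above; the helper names (`is_importable`, `to_jsonl`), the other field names, and all values are illustrative assumptions rather than actual script output. The filter assumes a record is skipped when any one of author, title, or publish date is missing.

```python
import json

# Fields a record must have to be written out (assumed names).
REQUIRED_FIELDS = ("title", "authors", "publish_date")


def is_importable(record: dict) -> bool:
    """Keep only records with a non-empty title, author list, and publish date."""
    return all(record.get(name) for name in REQUIRED_FIELDS)


def to_jsonl(records: list[dict]) -> str:
    """Serialize the importable records as one JSON object per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False)
        for record in records
        if is_importable(record)
    )


records = [
    {
        "title": "Example Book",
        "publish_date": "1923",
        "authors": [
            {
                "name": "Jane Doe",
                "ol_id": "OL1A",  # placeholder OL ID
                "remote_ids": {"wikidata": "Q1"},  # placeholder remote ID
            }
        ],
    },
    # Dropped from the output: no identified author.
    {"title": "No Author", "publish_date": "1901", "authors": []},
]

print(to_jsonl(records))
```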

Stakeholders

@cdrini @Freso

Attribution Disclaimer: By proposing this pull request, I affirm to have made a best-effort and exercised my discretion to make sure relevant sections of this code which substantially leverage code suggestions, code generation, or code snippets from sources (e.g. Stack Overflow, GitHub) have been annotated with basic attribution so reviewers & contributors may have confidence and access to the correct context to evaluate and use this code.

pidgezero-one marked this pull request as ready for review February 11, 2025 17:40
pidgezero-one changed the title from "[WIP] feat: use author known identifiers in import API" to "feat: use author identifiers in import API" Feb 11, 2025