Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch covers importing IA endpoint #7616

Open
cdrini opened this issue Mar 9, 2023 · 13 comments · May be fixed by #10370
Open

Switch covers importing IA endpoint #7616

cdrini opened this issue Mar 9, 2023 · 13 comments · May be fixed by #10370
Assignees
Labels
Good First Issue Easy issue. Good for newcomers. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed]

Comments

@cdrini
Copy link
Collaborator

cdrini commented Mar 9, 2023

Switch covers import endpoint to be eg https://archive.org/services/img/aldosfantastical0000frie/full/pct:600/0/default.jpg .

This appears to more accurately choose between cover and title page. Avoiding blank covers, but using them when necessary.

Some evaluation would be useful to see if this is indeed a better switch.

def get_cover_url(item_id):
"""Gets the URL of the archive.org item's title (or cover) page."""
base_url = f'{IA_BASE_URL}/download/{item_id}/page/'
title_response = requests.head(base_url + 'title.jpg', allow_redirects=True)
if title_response.status_code == 404:
return base_url + 'cover.jpg'
return base_url + 'title.jpg'

Stakeholders

@scottbarnes @hornc

@cdrini cdrini added Priority: 3 Issues that we can consider at our leisure. [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] labels Mar 9, 2023
@cdrini cdrini changed the title Switch covers importing IA endpoing Switch covers importing IA endpoint Mar 9, 2023
@scottbarnes
Copy link
Collaborator

scottbarnes commented Mar 27, 2023

The short of this is that the endpoint used to display covers on IA searches seems substantially better than the current endpoint, though the resolution is lower on the proposed newer endpoint.

I made a Google Colab that enables a very simplistic cover comparison between the current and proposed 'new' cover endpoints for around 25 books that were on the import list of IA items that appeared book-like, but lacked a MARC record, and those that appeared book-like and had a MARC record: https://colab.research.google.com/drive/18AT-Hdu7j9dyrVaSR3VAEPTceC9qVpBp#scrollTo=yS1E0hh00KFT

The script simply iterates through the list of ocaids and for each one fetches the cover via the current end point, and the 'new' one.

The script takes about two minutes to run (for some reason...), but the results should be pretty easy to scroll through and analyze. Each image is labeled at the top as to which endpoint it comes from.

The 'new' endpoint seems to have either the same covers, or 'better' ones. However, the covers an the proposed 'new' endpoint are usually a lower resolution. I didn't resize any of the image output so that it's easier to see what is fetched, even if it makes it a bit annoying to view.

@cdrini
@hornc

@hornc
Copy link
Collaborator

hornc commented Mar 27, 2023

@cdrini @scottbarnes I think there's a deeper problem here than just switching the endpoints.

My understanding is? / was that when a book was scanned, it's best representative "title" page was marked -- if the title appeared clearly on the cover, that's what would be chosen, if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at title.jpg

It's possible that something has changed in the scanning process. I see archive.org items have a bookplateleaf page number, which seems related, but I can't see how it relates to any of the result in the collab.

The bestlovedpoems0000henr example shows what the current system is trying to protect against: it shows an internal page instead of the cloth cover. The 'new' method just shows blank cloth.

operationjanus0000cros shows the current system fetching the correct image cover without pulling up an internal title page, which isn't needed.

vtenitvoikhsnovr0000hoop seems like it has bad data for title.jpg... and many of the other examples seem to have unnecessary marked up internal title pages when their cover should do.

My feeling is the OL code is currently doing the right thing, but archive.org has data issues with how some title / covers are marked up.

  • Checking when the items that have bad results using the current scheme were scanned recently or a long time ago will confirm whether OL is in line with current scanning practice.

If recent scanned items are inconsistent, or consistently showing title pages when we'd expect covers, we'll need to feed back to the scanning process.

If recent scans are consistent and there's a different current way of picking the best 'title' image, we'll need to know what that is.

@scottbarnes
Copy link
Collaborator

Ah, I see what you are saying about the cloth cover issue, @hornc. For some reason I wasn't seeing the whole list, even though I created the thing. :)

I will look at trying to get some more recent data for further discussion.

@cdrini
Copy link
Collaborator Author

cdrini commented Feb 1, 2024

My understanding is? / was that when a book was scanned, it's best representative "title" page was marked -- if the title appeared clearly on the cover, that's what would be chosen, if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at title.jpg

That appears to no longer be the case; now it appears that title.jpg is always available regardless of the cover, and set to the title page. Note the code snippet above defaults to /title.jpg! It only uses cover.jpg as a fallback if title.jpg 404s. So effectively this is always returning title pages!

Here is a comparison of the three endpoints, based on the most recent 100 importbot edit OCAIDs. It seems like default.jpg has been identical to covers.jpg in these cases, but I think that might just be cause the items appear to be a little old. I know Jude recently added new logic for cloth cover detection, but not sure which endpoint uses that.

Comparison

tmp

@hornc
Copy link
Collaborator

hornc commented Feb 1, 2024

@cdrini
The feature / user story here as I see it is:

In order to get a visual indication of what the book is, (as a library patron) I want to see a book image with the title, author, and cover artwork (if any).

This means for most modern style books we want cover_url, but for cloth bound books and similar, we need title_url, since looking at the cloth does not satisfy the above.

AFAICT default_url is always the same image as cover_url?

There used to be either a manual process or automated smarts to figure this out correctly, and it looks like it's no longer working. archive.org is probably going to have the same issue -- maybe there is another way to determine cloth bound and pick the correct and useful image?

@cdrini cdrini added Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed] and removed Priority: 3 Issues that we can consider at our leisure. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] labels Dec 10, 2024
@cdrini
Copy link
Collaborator Author

cdrini commented Dec 10, 2024

AFAICT default_url is always the same image as cover_url?

I'm not sure, but default.jpg seems to do what you're describing and avoids cloth covers; see americanpeople0000unse in Scott's colab.

Let's roll with switching! The images are slightly smaller, but still huge :P

@mekarpeles mekarpeles added Good First Issue Easy issue. Good for newcomers. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] labels Jan 13, 2025
@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 15, 2025

@mekarpeles I am willing to work on this issue.

Additionally, would it be a good practice to add some metadata (some sort of flag?) to covers where we have used the IA covers? Even going further to discern between those using the cover.jpg and title.jpg. It might come handy in debugging in case of some failure much later in time and could also find some usecase in our stats.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 15, 2025
@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Jan 15, 2025
@scottbarnes
Copy link
Collaborator

Thanks for offering to work on this, @Spaarsh. I went ahead and assigned this to you. As for whether we should store some additional metadata, the goal of being able to tell which cover import endpoint an IA import used is a good idea, though in this case I think we can just look at the import date if we need to, and compare it to the date of this change.

@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 19, 2025

Should I include the original endpoint as a fallback in case of a 404 like this?

def get_cover_url(item_id):
    """Gets the URL of the archive.org item's title (or cover) page."""
    base_url = f'{IA_BASE_URL}/services/img/{item_id}/full/pct:600/0/'
    cover_response = requests.head(base_url + 'default.jpg', allow_redirects=True)
    if cover_response.status_code == 404:
        return get_fallback_cover_url(item_id)
    return base_url + 'default.jpg'


def get_fallback_cover_url(item_id):
    """Gets the URL of the archive.org item's title (or cover) page."""
    base_url = f'{IA_BASE_URL}/download/{item_id}/page/'
    title_response = requests.head(base_url + 'title.jpg', allow_redirects=True)
    if title_response.status_code == 404:
        return base_url + 'cover.jpg'
    return base_url + 'title.jpg'

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 20, 2025
@scottbarnes
Copy link
Collaborator

That seems like a good idea. I say go for it.

@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Jan 20, 2025
@Spaarsh Spaarsh linked a pull request Jan 20, 2025 that will close this issue
@Spaarsh
Copy link
Contributor

Spaarsh commented Jan 20, 2025

@cdrini @scottbarnes @mekarpeles while working on this issue, I came across this code where we are still using the old endpoint for getting the covers in the /plugins/upstream/models.py. Are there any specific reasons for not switching the endpoint here as well?

def get_ia_cover(self, itemid, size):
image_sizes = {"S": (116, 58), "M": (180, 360), "L": (500, 500)}
w, h = image_sizes[size.upper()]
return f"https://archive.org/download/{itemid}/page/cover_w{w}_h{h}.jpg"

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 21, 2025
@scottbarnes
Copy link
Collaborator

@Spaarsh, I'm very sorry for how long it took to respond. It could very well be that it makes sense to update that as well, but for the purpose so this specific issue, let's keep the scope as currently defined. If you're interested, I think it would make sense to open a new issue to explore whether we want to change get_ia_cover() as well.

@scottbarnes scottbarnes removed the Needs: Response Issues which require feedback from lead label Feb 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good First Issue Easy issue. Good for newcomers. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 2 Important, as time permits. [managed] Type: Bug Something isn't working. [managed]
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@hornc @mekarpeles @cdrini @scottbarnes @Spaarsh and others