Switch covers importing IA endpoint #7616

cdrini · 2023-03-09T01:29:41Z

Switch covers import endpoint to be eg https://archive.org/services/img/aldosfantastical0000frie/full/pct:600/0/default.jpg .

This appears to more accurately choose between cover and title page. Avoiding blank covers, but using them when necessary.

Some evaluation would be useful to see if this is indeed a better switch.

openlibrary/openlibrary/core/ia.py

Lines 117 to 123 in 2a36c54

    
           def get_cover_url(item_id): 
        
               """Gets the URL of the archive.org item's title (or cover) page.""" 
        
               base_url = f'{IA_BASE_URL}/download/{item_id}/page/' 
        
               title_response = requests.head(base_url + 'title.jpg', allow_redirects=True) 
        
               if title_response.status_code == 404: 
        
                   return base_url + 'cover.jpg' 
        
               return base_url + 'title.jpg'

Stakeholders

@scottbarnes @hornc

scottbarnes · 2023-03-27T03:31:30Z

The short of this is that the endpoint used to display covers on IA searches seems substantially better than the current endpoint, though the resolution is lower on the proposed newer endpoint.

I made a Google Colab that enables a very simplistic cover comparison between the current and proposed 'new' cover endpoints for around 25 books that were on the import list of IA items that appeared book-like, but lacked a MARC record, and those that appeared book-like and had a MARC record: https://colab.research.google.com/drive/18AT-Hdu7j9dyrVaSR3VAEPTceC9qVpBp#scrollTo=yS1E0hh00KFT

The script simply iterates through the list of ocaids and for each one fetches the cover via the current end point, and the 'new' one.

The script takes about two minutes to run (for some reason...), but the results should be pretty easy to scroll through and analyze. Each image is labeled at the top as to which endpoint it comes from.

The 'new' endpoint seems to have either the same covers, or 'better' ones. However, the covers an the proposed 'new' endpoint are usually a lower resolution. I didn't resize any of the image output so that it's easier to see what is fetched, even if it makes it a bit annoying to view.

@cdrini
@hornc

hornc · 2023-03-27T04:30:55Z

@cdrini @scottbarnes I think there's a deeper problem here than just switching the endpoints.

My understanding is? / was that when a book was scanned, it's best representative "title" page was marked -- if the title appeared clearly on the cover, that's what would be chosen, if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at title.jpg

It's possible that something has changed in the scanning process. I see archive.org items have a bookplateleaf page number, which seems related, but I can't see how it relates to any of the result in the collab.

The bestlovedpoems0000henr example shows what the current system is trying to protect against: it shows an internal page instead of the cloth cover. The 'new' method just shows blank cloth.

operationjanus0000cros shows the current system fetching the correct image cover without pulling up an internal title page, which isn't needed.

vtenitvoikhsnovr0000hoop seems like it has bad data for title.jpg... and many of the other examples seem to have unnecessary marked up internal title pages when their cover should do.

My feeling is the OL code is currently doing the right thing, but archive.org has data issues with how some title / covers are marked up.

Checking when the items that have bad results using the current scheme were scanned recently or a long time ago will confirm whether OL is in line with current scanning practice.

If recent scanned items are inconsistent, or consistently showing title pages when we'd expect covers, we'll need to feed back to the scanning process.

If recent scans are consistent and there's a different current way of picking the best 'title' image, we'll need to know what that is.

scottbarnes · 2023-03-27T04:52:13Z

Ah, I see what you are saying about the cloth cover issue, @hornc. For some reason I wasn't seeing the whole list, even though I created the thing. :)

I will look at trying to get some more recent data for further discussion.

cdrini · 2024-02-01T22:50:19Z

My understanding is? / was that when a book was scanned, it's best representative "title" page was marked -- if the title appeared clearly on the cover, that's what would be chosen, if there was an internal "title page", as is often the case with older cloth or leather bound books, that's what would be at title.jpg

That appears to no longer be the case; now it appears that title.jpg is always available regardless of the cover, and set to the title page. Note the code snippet above defaults to /title.jpg! It only uses cover.jpg as a fallback if title.jpg 404s. So effectively this is always returning title pages!

Here is a comparison of the three endpoints, based on the most recent 100 importbot edit OCAIDs. It seems like default.jpg has been identical to covers.jpg in these cases, but I think that might just be cause the items appear to be a little old. I know Jude recently added new logic for cloth cover detection, but not sure which endpoint uses that.

Comparison

hornc · 2024-02-01T23:10:59Z

@cdrini
The feature / user story here as I see it is:

In order to get a visual indication of what the book is, (as a library patron) I want to see a book image with the title, author, and cover artwork (if any).

This means for most modern style books we want cover_url, but for cloth bound books and similar, we need title_url, since looking at the cloth does not satisfy the above.

AFAICT default_url is always the same image as cover_url?

There used to be either a manual process or automated smarts to figure this out correctly, and it looks like it's no longer working. archive.org is probably going to have the same issue -- maybe there is another way to determine cloth bound and pick the correct and useful image?

cdrini · 2024-12-10T18:43:44Z

AFAICT default_url is always the same image as cover_url?

I'm not sure, but default.jpg seems to do what you're describing and avoids cloth covers; see americanpeople0000unse in Scott's colab.

Let's roll with switching! The images are slightly smaller, but still huge :P

cdrini · 2024-12-30T19:36:30Z

Spaarsh · 2025-01-15T03:21:02Z

@mekarpeles I am willing to work on this issue.

Additionally, would it be a good practice to add some metadata (some sort of flag?) to covers where we have used the IA covers? Even going further to discern between those using the cover.jpg and title.jpg. It might come handy in debugging in case of some failure much later in time and could also find some usecase in our stats.

scottbarnes · 2025-01-15T15:09:07Z

Thanks for offering to work on this, @Spaarsh. I went ahead and assigned this to you. As for whether we should store some additional metadata, the goal of being able to tell which cover import endpoint an IA import used is a good idea, though in this case I think we can just look at the import date if we need to, and compare it to the date of this change.

Spaarsh · 2025-01-19T14:01:26Z

Should I include the original endpoint as a fallback in case of a 404 like this?

def get_cover_url(item_id):
    """Gets the URL of the archive.org item's title (or cover) page."""
    base_url = f'{IA_BASE_URL}/services/img/{item_id}/full/pct:600/0/'
    cover_response = requests.head(base_url + 'default.jpg', allow_redirects=True)
    if cover_response.status_code == 404:
        return get_fallback_cover_url(item_id)
    return base_url + 'default.jpg'


def get_fallback_cover_url(item_id):
    """Gets the URL of the archive.org item's title (or cover) page."""
    base_url = f'{IA_BASE_URL}/download/{item_id}/page/'
    title_response = requests.head(base_url + 'title.jpg', allow_redirects=True)
    if title_response.status_code == 404:
        return base_url + 'cover.jpg'
    return base_url + 'title.jpg'

scottbarnes · 2025-01-20T16:06:14Z

That seems like a good idea. I say go for it.

Spaarsh · 2025-01-20T19:58:19Z

@cdrini @scottbarnes @mekarpeles while working on this issue, I came across this code where we are still using the old endpoint for getting the covers in the /plugins/upstream/models.py. Are there any specific reasons for not switching the endpoint here as well?

openlibrary/openlibrary/plugins/upstream/models.py

Lines 80 to 83 in 8fc3f86

    
           def get_ia_cover(self, itemid, size): 
        
               image_sizes = {"S": (116, 58), "M": (180, 360), "L": (500, 500)} 
        
               w, h = image_sizes[size.upper()] 
        
               return f"https://archive.org/download/{itemid}/page/cover_w{w}_h{h}.jpg"

scottbarnes · 2025-02-10T18:52:35Z

@Spaarsh, I'm very sorry for how long it took to respond. It could very well be that it makes sense to update that as well, but for the purpose so this specific issue, let's keep the scope as currently defined. If you're interested, I think it would make sense to open a new issue to explore whether we want to change get_ia_cover() as well.

cdrini mentioned this issue Mar 9, 2023

Use more IA data when (re)importing via /api/imports/ia (and overwrite v1 promise items) #7490

Closed

cdrini changed the title ~~Switch covers importing IA endpoing~~ Switch covers importing IA endpoint Mar 9, 2023

cdrini assigned scottbarnes Mar 9, 2023

cdrini added this to the Sprint 2025-01 milestone Dec 30, 2024

cdrini mentioned this issue Dec 30, 2024

Covers for IA-linked items not importing correctly #6866

Closed

mekarpeles unassigned scottbarnes Jan 13, 2025

mekarpeles added Good First Issue Easy issue. Good for newcomers. [managed] Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] labels Jan 13, 2025

github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 15, 2025

scottbarnes assigned Spaarsh Jan 15, 2025

scottbarnes removed the Needs: Response Issues which require feedback from lead label Jan 15, 2025

github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 20, 2025

scottbarnes removed the Needs: Response Issues which require feedback from lead label Jan 20, 2025

Spaarsh linked a pull request Jan 20, 2025 that will close this issue

Switched cover importing IA endpoint #10370

Open

github-actions bot added the Needs: Response Issues which require feedback from lead label Jan 21, 2025

mekarpeles modified the milestones: Sprint 2025-01, Sprint 2025-02 Jan 30, 2025

scottbarnes removed the Needs: Response Issues which require feedback from lead label Feb 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch covers importing IA endpoint #7616

Switch covers importing IA endpoint #7616

cdrini commented Mar 9, 2023 •

edited

Loading

scottbarnes commented Mar 27, 2023 •

edited

Loading

hornc commented Mar 27, 2023

scottbarnes commented Mar 27, 2023

cdrini commented Feb 1, 2024 •

edited

Loading

hornc commented Feb 1, 2024

cdrini commented Dec 10, 2024 •

edited

Loading

cdrini commented Dec 30, 2024

Spaarsh commented Jan 15, 2025

scottbarnes commented Jan 15, 2025

Spaarsh commented Jan 19, 2025

scottbarnes commented Jan 20, 2025

Spaarsh commented Jan 20, 2025

scottbarnes commented Feb 10, 2025

Switch covers importing IA endpoint #7616

Switch covers importing IA endpoint #7616

Comments

cdrini commented Mar 9, 2023 • edited Loading

Stakeholders

scottbarnes commented Mar 27, 2023 • edited Loading

hornc commented Mar 27, 2023

scottbarnes commented Mar 27, 2023

cdrini commented Feb 1, 2024 • edited Loading

hornc commented Feb 1, 2024

cdrini commented Dec 10, 2024 • edited Loading

cdrini commented Dec 30, 2024

Spaarsh commented Jan 15, 2025

scottbarnes commented Jan 15, 2025

Spaarsh commented Jan 19, 2025

scottbarnes commented Jan 20, 2025

Spaarsh commented Jan 20, 2025

scottbarnes commented Feb 10, 2025

cdrini commented Mar 9, 2023 •

edited

Loading

scottbarnes commented Mar 27, 2023 •

edited

Loading

cdrini commented Feb 1, 2024 •

edited

Loading

cdrini commented Dec 10, 2024 •

edited

Loading