Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Command for fetching HathiTrust Images #694

Merged
merged 14 commits into from
Nov 15, 2024
Merged

Conversation

laurejt
Copy link
Contributor

@laurejt laurejt commented Nov 6, 2024

ref: #663

New command for fetching HathiTrust images ("full-size" and thumbnails).

Notes:

  • This currently depends on the develop branch of ppa-nlp
  • The width of "full-size" pages is currently set to 800

Questions:

  • Is a width of 800pixels the "right" size?
  • I'm having some trouble with displaying two stacked progress bars. Any ideas?
  • Should we use a different crawl delay?
  • Thoughts on logging progress?

@laurejt laurejt requested a review from rlskoeser November 6, 2024 14:44
@laurejt laurejt self-assigned this Nov 6, 2024
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 72 lines in your changes missing coverage. Please review.

Project coverage is 94.35%. Comparing base (dc30aa9) to head (4475a70).
Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #694      +/-   ##
===========================================
- Coverage    95.12%   94.35%   -0.78%     
===========================================
  Files          139      141       +2     
  Lines         7573     7815     +242     
  Branches         7        8       +1     
===========================================
+ Hits          7204     7374     +170     
- Misses         369      441      +72     

Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic, and the script seems to be working really well.

I think there's a bit more work to do to polish it up and improve the output and reporting, but it seems like the basic functionality is pretty much working as we want it to.

It would be nice to add handling for SIGINT / ctrl-c so you can stop the script but have it end gracefully, finish downloading images for the current volume and report on what ws done, then stop. If you're interested in adding that, let me know, I should be able to find an example where I've done something similar.

ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved

# Fetch thumbnail if file does not exist
page_thumbnail = thumbnail_dir / image_name
if not page_thumbnail.is_file():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about a volume-level check to skip an entire volume if the expected number of images are present, but glad to see you have this check.

ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
@laurejt laurejt requested a review from rlskoeser November 14, 2024 16:20
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, let's get this version merged !

Comment on lines +159 to +161
elif response.status_code == 503:
logger.debug("Received 503 status code. Throttling may have occurred")
return success
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Have you run into any other error codes in your testing? It would be good to log other errors, but if they haven't come up yet then let's leave it.

Copy link
Contributor Author

@laurejt laurejt Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't actually run into any errors.

ppa/archive/management/commands/hathi_images.py Outdated Show resolved Hide resolved
logger = logging.getLogger(__name__)


class DownloadStats:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems complicated but I see the value of it for wrangling multiple counts and displaying them nicely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this is pretty much just a wrapped tuple of Counters with a pretty print.

@laurejt laurejt merged commit 6309142 into develop Nov 15, 2024
9 of 12 checks passed
@laurejt laurejt deleted the feature/fetch-hathi-images branch November 15, 2024 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants