Revise local ocr handling for gale page indexing #686

Merged
rlskoeser merged 7 commits into develop from feature/optimize-gale-localocr on Nov 4, 2024

Conversation


@rlskoeser rlskoeser commented Oct 28, 2024

related to #673

see Princeton-CDH/ppa-nlp#107 for the script that converts the text files into the json format used here

updates since draft version:

  • added error handling (adapted from previous implementation)
  • updated/adapted documentation
  • updated unit tests

@rlskoeser rlskoeser marked this pull request as draft October 28, 2024 16:52
@rlskoeser rlskoeser requested a review from laurejt October 28, 2024 16:56

laurejt commented Oct 28, 2024

Overall, this seems like a reasonable approach to me.

  • Consider using pathlib instead of glob and os.path
  • I'm not sure I understand the difference between .json and .jsonl here. In both cases you may need to worry about missing pages.
  • Since the ocr code is all in ppa-nlp, it seems reasonable to also include the ocr-to-json script. In the future this script might also do additional text transformations.
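
The pathlib suggestion above could look something like the following. This is only a sketch: the helper name `find_local_ocr_file` and the assumed layout (one `<item_id>.json` file per volume, possibly in nested subdirectories of a base directory) are hypothetical and not taken from gale.py.

```python
from pathlib import Path


def find_local_ocr_file(base_dir, item_id):
    """Return the path to the local OCR JSON file for item_id, or None.

    Hypothetical helper: assumes files are named <item_id>.json and may
    sit in nested subdirectories of base_dir.
    """
    # Path.rglob searches recursively, replacing a combination of
    # os.path.join and glob.glob(..., recursive=True)
    matches = sorted(Path(base_dir).rglob(f"{item_id}.json"))
    return matches[0] if matches else None
```

Returning `None` for a missing file leaves it to the caller to decide whether to log a warning or fall back to the Gale API text.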

ppa/archive/gale.py Outdated
Comment on lines 239 to 233
ocr_text = local_ocr_text.get(page_number)
# TODO: set local ocr tag here
if not ocr_text and page_number not in local_ocr_text:
    # if ocr text is unset and page number is not present,
    # try getting the ocr from the gale api result
    logger.warning(f"No local OCR for {item_id} {page_number}")
    ocr_text = page.get("ocrText")  # some pages have no text
    logger.warning(f'Local OCR not found for {item_id} {page_number}')
Contributor

Seems like this would be better as an if/else clause, since the only time the Gale text should be fetched is when the page number is not in local_ocr_text

Contributor Author

ah, good point... there are two cases here, aren't there? we could have the page number in our json file and have no text, or it might not be present at all (e.g., we missed it somehow)

Contributor Author

Leaving this comment to remind you to confirm that I've done what you were suggesting.

I got a code complexity alert for this function - was able to reduce complexity slightly. If you see any easy wins, please suggest them.
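
For reference, the if/else structure discussed in this thread might be sketched as below. The function name `get_page_ocr` and its signature are hypothetical; only `local_ocr_text`, `page`, `item_id`, `page_number`, and the `ocrText` key come from the snippet under review, and the merged code in gale.py may differ.

```python
import logging

logger = logging.getLogger(__name__)


def get_page_ocr(item_id, page_number, page, local_ocr_text):
    """Return OCR text for one page, preferring local OCR.

    local_ocr_text: dict mapping page number to text for this volume
    page: dict for this page from the Gale API response
    """
    if page_number in local_ocr_text:
        # page is present in the local OCR; its text may be empty
        return local_ocr_text[page_number]
    else:
        # page is missing from the local OCR: warn and fall back to Gale
        logger.warning(f"No local OCR for {item_id} {page_number}")
        return page.get("ocrText")  # some pages have no text
```

Branching on membership rather than on the truthiness of the text keeps the two cases apart: a page present with empty text is returned as-is, and only a page absent from the local data triggers the Gale fallback.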

@rlskoeser
Contributor Author

@laurejt thanks for the quick review and useful feedback, I'll keep your recommendations in mind when I circle back to this.

jsonl vs json: I was thinking json lines so we could avoid having the whole file in memory, but that would mean we have to sort the pages before we generate the lines file (or when we load it), and then as we iterate we'd have to check if we were on the correct page - so logic is more complicated. I think these volumes are small enough that we can get by with one file per volume and just have the whole text in memory so we can look up contents by page number.
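
A rough sketch of the one-file-per-volume approach described above, assuming each file holds a single JSON object keyed by page number (the function name and the exact file shape are assumptions, not confirmed by this thread):

```python
import json
from pathlib import Path


def load_local_ocr(json_path):
    """Load one volume's OCR as a dict of page number -> text.

    Assumes the file holds a single JSON object keyed by page number,
    e.g. {"0001": "first page text", "0002": ""}.
    """
    with Path(json_path).open(encoding="utf-8") as ocr_file:
        return json.load(ocr_file)
```

With the whole volume in memory, a page lookup is a plain dictionary access such as `load_local_ocr(path).get("0002")`, with no need to sort pages or track position as a line-by-line JSONL reader would.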

@rlskoeser rlskoeser changed the title Preliminary rewrite for indexing gale with local ocr content Revise local ocr handling for gale page indexing Oct 29, 2024
@rlskoeser rlskoeser requested a review from laurejt October 29, 2024 20:33
@rlskoeser rlskoeser marked this pull request as ready for review October 29, 2024 20:49
@rlskoeser rlskoeser temporarily deployed to staging October 30, 2024 12:51 Inactive
ppa/archive/gale.py Outdated
Comment on lines 244 to 245
if ocr_text:
    tags = ["local_ocr"]
Contributor

Should this be set if the resulting OCR for the page is blank?

Contributor Author

nope, good catch

Contributor Author

wait, maybe I misunderstood your question - that's what the logic is now, do you disagree with it?

Contributor

@laurejt laurejt Oct 31, 2024

It seems like the tag should be set if the page's local ocr text is blank, because the empty string is a judgment of the local OCR (and not the Gale OCR)
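
One way to express that tagging rule is sketched below. The helper name `local_ocr_tags` is hypothetical, and the merged implementation may differ; only the `local_ocr` tag value and `local_ocr_text` come from the code under review.

```python
def local_ocr_tags(page_number, local_ocr_text):
    """Return index tags for a page based on where its OCR came from."""
    # Tag by presence of the page in the local OCR data, not by the
    # truthiness of its text, so blank local pages are still tagged.
    if page_number in local_ocr_text:
        return ["local_ocr"]
    return []
```

Under this rule a page whose local OCR is an empty string still gets the `local_ocr` tag, while a page absent from the local data gets no tag.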

rlskoeser and others added 2 commits October 31, 2024 12:09
@rlskoeser rlskoeser merged commit f04916b into develop Nov 4, 2024
10 checks passed
@rlskoeser rlskoeser deleted the feature/optimize-gale-localocr branch November 4, 2024 16:29

codecov bot commented Nov 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.42%. Comparing base (44bd84d) to head (4635418).
Report is 215 commits behind head on develop.

❗ There is a different number of reports uploaded between BASE (44bd84d) and HEAD (4635418).

HEAD has 1 upload less than BASE

Flag         BASE (44bd84d)   HEAD (4635418)
javascript   3                2
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #686      +/-   ##
===========================================
- Coverage   100.00%   94.42%   -5.58%     
===========================================
  Files            5      138     +133     
  Lines           78     7581    +7503     
  Branches         8        8              
===========================================
+ Hits            78     7158    +7080     
- Misses           0      423     +423     
