Revise local ocr handling for gale page indexing #686
Overall, this seems like a reasonable approach to me.
ppa/archive/gale.py
ocr_text = local_ocr_text.get(page_number)
# TODO: set local ocr tag here
if not ocr_text and page_number not in local_ocr_text:
    # if ocr text is unset and page number is not present,
    # try getting the ocr from the gale api result
    logger.warning(f"No local OCR for {item_id} {page_number}")
    ocr_text = page.get("ocrText")  # some pages have no text
    logger.warning(f'Local OCR not found for {item_id} {page_number}')
Seems like this would be better as an if/else clause, since the only time the Gale text should be fetched is when the page number is not in local_ocr_text
ah, good point... there are two cases here, aren't there? we could have the page number in our json file and have no text, or it might not be present at all (e.g., we missed it somehow)
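A minimal sketch of those two cases, assuming `local_ocr_text` is a dict keyed by page number and `page` is the page record from the Gale API result. The helper name `get_page_ocr` and its signature are hypothetical, for illustration only, and are not code from this pull request:

```python
import logging

logger = logging.getLogger(__name__)


def get_page_ocr(local_ocr_text, page, item_id, page_number):
    """Return OCR text for one page, preferring local OCR.

    Hypothetical helper: the name and signature are assumptions
    for illustration, not code from this pull request.
    """
    if page_number in local_ocr_text:
        # case 1: page is present in the local file; its text may be empty
        return local_ocr_text[page_number]
    # case 2: page is missing from the local file; fall back to Gale
    logger.warning(f"No local OCR for {item_id} {page_number}")
    return page.get("ocrText")  # some pages have no text
```

Testing membership first means an empty string in the local file is returned as-is, and the Gale text is only fetched when the page number is absent.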
Leaving this comment to remind you to confirm that I've done what you were suggesting.
I got a code complexity alert for this function - was able to reduce complexity slightly. If you see any easy wins, please suggest them.
@laurejt thanks for the quick review and useful feedback, I'll keep your recommendations in mind when I circle back to this.
jsonl vs json: I was thinking JSON Lines so we could avoid having the whole file in memory, but that would mean we have to sort the pages before we generate the lines file (or when we load it), and then as we iterate we'd have to check if we were on the correct page, so the logic is more complicated. I think these volumes are small enough that we can get by with one file per volume and just have the whole text in memory so we can look up contents by page number.
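As a rough sketch of the one-file-per-volume option: the function name `load_local_ocr` is hypothetical, and the layout (a single JSON object mapping page numbers to text) is an assumption, since the actual format comes from the conversion script and may differ:

```python
import json
from pathlib import Path


def load_local_ocr(path):
    """Load one volume's OCR into a dict keyed by page number.

    Assumes the file holds a single JSON object mapping page
    numbers to text; the real file layout may differ.
    """
    with Path(path).open(encoding="utf-8") as ocr_file:
        return json.load(ocr_file)
```

With the whole volume in memory, a page lookup is a plain dict access and no page ordering is needed, which is the simplification over JSON Lines described above.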
ppa/archive/gale.py
if ocr_text:
    tags = ["local_ocr"]
Should this be set if the resulting OCR for the page is blank?
nope, good catch
wait, maybe I misunderstood your question - that's what the logic is now, do you disagree with it?
It seems like the tag should be set if the page's local ocr text is blank, because the empty string is a judgment of the local OCR (and not the Gale OCR)
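One way to express that, sketched under the assumption that `local_ocr_text` is a dict keyed by page number. The helper name `page_tags` is hypothetical and not from the PR:

```python
def page_tags(local_ocr_text, page_number):
    """Return index tags for a page (hypothetical helper)."""
    # tag on presence in the local file, not on truthiness of the text,
    # so a deliberately blank local page is still marked as local OCR
    if page_number in local_ocr_text:
        return ["local_ocr"]
    return []
```

Compared with `if ocr_text:`, this keeps the tag on pages whose local OCR is an empty string, and leaves it off only when the text came from Gale.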
Co-authored-by: Laure Thompson <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## develop #686 +/- ##
===========================================
- Coverage 100.00% 94.42% -5.58%
===========================================
Files 5 138 +133
Lines 78 7581 +7503
Branches 8 8
===========================================
+ Hits 78 7158 +7080
- Misses 0 423 +423
related to #673
see Princeton-CDH/ppa-nlp#107 for the script that converts the text files into the JSON format used here
updates since draft version: