Revise local ocr handling for gale page indexing #686

rlskoeser · 2024-10-28T16:52:21Z

related to #673

See Princeton-CDH/ppa-nlp#107 for the script that converts the text files into the JSON format used here.

updates since draft version:

added error handling (adapted from previous implementation)
updated/adapted documentation
updated unit tests

ppa/archive/gale.py

laurejt · 2024-10-28T17:17:09Z

Overall, this seems like a reasonable approach to me.

Consider using pathlib instead of glob and os.path
I'm not sure I understand the difference between using .json and .jsonl here. In both cases you may need to worry about missing pages.
Since the ocr code is all in ppa-nlp, it seems reasonable to also include the ocr-to-json script. In the future this script might also do additional text transformations.
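
To illustrate the pathlib suggestion, here is a minimal sketch of locating a volume's local OCR file without glob or os.path. The helper name `find_local_ocr`, the directory layout, and the `<item_id>.json` naming scheme are assumptions for illustration, not the code in this PR.

```python
from pathlib import Path


def find_local_ocr(ocr_dir, item_id):
    """Return the first file named <item_id>.json under ocr_dir, or None.

    Hypothetical helper: the naming scheme and layout are assumptions.
    """
    # Path.rglob searches recursively, much like
    # glob.glob(os.path.join(ocr_dir, "**", name), recursive=True)
    matches = sorted(Path(ocr_dir).rglob(f"{item_id}.json"))
    return matches[0] if matches else None
```

Returning a `Path` also means the caller can use `path.open()` or `path.read_text()` directly instead of joining strings.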

ppa/archive/gale.py

laurejt · 2024-10-28T17:21:31Z

ppa/archive/gale.py

+            ocr_text = local_ocr_text.get(page_number)
+            # TODO: set local ocr tag here
+            if not ocr_text and page_number not in local_ocr_text:
+                # if ocr text is unset and page number is not present,
+                # try getting the ocr from the gale api result
+                logger.warning(f"No local OCR for {item_id} {page_number}")
                ocr_text = page.get("ocrText")  # some pages have no text
-                logger.warning(f'Local OCR not found for {item_id} {page_number}')


Seems like this would be better as an if/else clause, since the only time the Gale text should be fetched is when the page number is not in local_ocr_text

ah, good point... there are two cases here, aren't there? we could have the page number in our json file and have no text, or it might not be present at all (e.g., we missed it somehow)
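
A rough sketch of the if/else shape being discussed, to make the two cases concrete. The function name `select_ocr_text` and its signature are hypothetical; only `local_ocr_text` and the `ocrText` key come from the diff above.

```python
def select_ocr_text(local_ocr_text, page_number, gale_page):
    """Return (text, from_local) for a single page.

    local_ocr_text: dict mapping page number -> local OCR text
    gale_page: page dict from the Gale API result
    """
    if page_number in local_ocr_text:
        # page is present in the local OCR; its text may legitimately be blank
        return local_ocr_text[page_number], True
    # page is missing from the local OCR entirely: fall back to the Gale text
    # (some pages have no text, so this may be None)
    return gale_page.get("ocrText"), False
```

With this shape a blank local page stays blank, and the Gale text is fetched only when the page number is absent from the local data.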

Leaving this comment to remind you to confirm that I've done what you were suggesting.

I got a code complexity alert for this function - was able to reduce complexity slightly. If you see any easy wins, please suggest them.

rlskoeser · 2024-10-28T17:32:51Z

@laurejt thanks for the quick review and useful feedback, I'll keep your recommendations in mind when I circle back to this.

jsonl vs json: I was thinking json lines so we could avoid having the whole file in memory, but that would mean we have to sort the pages before we generate the lines file (or when we load it), and then as we iterate we'd have to check if we were on the correct page - so logic is more complicated. I think these volumes are small enough that we can get by with one file per volume and just have the whole text in memory so we can look up contents by page number.
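
A minimal sketch of the one-file-per-volume approach, assuming each file is a single JSON object that maps page numbers (as strings) to OCR text. That format, and the helper name `load_local_ocr`, are assumptions; the actual output of the ppa-nlp script may differ.

```python
import json


def load_local_ocr(path):
    """Load one volume's OCR text into a dict keyed by page number.

    Assumes the file holds one JSON object: {"<page number>": "<text>", ...}
    """
    with open(path, encoding="utf-8") as ocr_file:
        return json.load(ocr_file)
```

Once loaded, page lookup is a plain dict access such as `local_ocr_text.get(page_number)`, with no need to sort pages or track position as a line-by-line reader would.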

ppa/archive/gale.py

laurejt · 2024-10-31T14:40:14Z

ppa/archive/gale.py

+                if ocr_text:
+                    tags = ["local_ocr"]


Should this be set if the resulting OCR for the page is blank?

nope, good catch

wait, maybe I misunderstood your question - that's what the logic is now, do you disagree with it?

It seems like the tag should be set if the page's local ocr text is blank, because the empty string is a judgment of the local OCR (and not the Gale OCR)
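
One way to express that rule, as a sketch only: test for membership instead of truthiness so a blank local page is still tagged. The helper name `local_ocr_tags` is hypothetical; the `local_ocr` tag value comes from the diff above.

```python
def local_ocr_tags(local_ocr_text, page_number):
    """Return the tags for a page based on whether local OCR covers it."""
    # membership test, not `if ocr_text:`, so an empty string still counts
    # as a local OCR result and gets the tag
    return ["local_ocr"] if page_number in local_ocr_text else []
```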

ppa/archive/tests/test_gale.py

Co-authored-by: Laure Thompson

ppa/archive/gale.py

codecov · 2024-11-04T16:30:30Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.42%. Comparing base (44bd84d) to head (4635418).
Report is 215 commits behind head on develop.

❗ There is a different number of reports uploaded between BASE (44bd84d) and HEAD (4635418).

HEAD has 1 upload less than BASE

Flag        BASE (44bd84d)  HEAD (4635418)
javascript  3               2

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #686      +/-   ##
===========================================
- Coverage   100.00%   94.42%   -5.58%     
===========================================
  Files            5      138     +133     
  Lines           78     7581    +7503     
  Branches         8        8              
===========================================
+ Hits            78     7158    +7080     
- Misses           0      423     +423

Preliminary rewrite for indexing gale with local ocr content

ac57aeb

rlskoeser marked this pull request as draft October 28, 2024 16:52

github-advanced-security bot found potential problems Oct 28, 2024

ppa/archive/gale.py Fixed

rlskoeser requested a review from laurejt October 28, 2024 16:56

laurejt reviewed Oct 28, 2024

ppa/archive/gale.py Outdated

laurejt reviewed Oct 28, 2024

Add error handling to revised local ocr logic and update unit tests

5cd0861

rlskoeser changed the title ~~Preliminary rewrite for indexing gale with local ocr content~~ Revise local ocr handling for gale page indexing Oct 29, 2024

github-advanced-security bot found potential problems Oct 29, 2024

ppa/archive/gale.py Fixed

Improve if/else logic slightly and reduce complexity

07aaa04

rlskoeser requested a review from laurejt October 29, 2024 20:33

rlskoeser marked this pull request as ready for review October 29, 2024 20:49

rlskoeser temporarily deployed to staging October 30, 2024 12:51 Inactive

laurejt reviewed Oct 31, 2024

rlskoeser and others added 2 commits October 31, 2024 12:09

Update ppa/archive/tests/test_gale.py

4e90400

Co-authored-by: Laure Thompson

Update ppa/archive/gale.py

8fe7241

Co-authored-by: Laure Thompson

github-advanced-security bot found potential problems Oct 31, 2024

ppa/archive/gale.py Dismissed

rlskoeser added 2 commits November 4, 2024 11:18

Set local ocr tag for blank ocr content, per @laurejt feedback

7722bb4

Simplify conditional logic for gale local ocr

4635418

laurejt approved these changes Nov 4, 2024

rlskoeser merged commit f04916b into develop Nov 4, 2024
10 checks passed

rlskoeser deleted the feature/optimize-gale-localocr branch November 4, 2024 16:29
