Publication thread summer/fall 2024 OCR documents #1

Open
16 of 37 tasks
ctschroeder opened this issue Aug 12, 2024 · 12 comments

@ctschroeder
Member

ctschroeder commented Aug 12, 2024

Do not close this issue until all checkboxes below are complete or have been rescheduled:

List of corpora:

In Processed OCR folder (needs sentence splitting + full automatic NLP processing, like the Bible corpora)

  • Budge material (3 docs)
    • chapter divisions added/checked
    • metadata updated
  • Giron Legendes (11 docs)
    • chapter divisions added/checked
    • metadata updated
  • Lacau Apocrypha - acts.pilate (2 docs)
    • chapter divisions added/checked
    • metadata updated
  • Lacau Apocrypha - unknown gospel (2 docs)
    • chapter divisions added/checked
    • metadata updated
  • Sobhy Helias (4 docs)
    • chapter divisions added/checked
    • metadata updated

In GitDox

  • mercurius (formerly acacius.caeasaria)
    • needs entities & identities
    • needs translation span
    • @amir-zeldes the NLP tool didn't process ⲫⲁⲅⲓⲟⲥ, ⲛⲫⲁⲅⲓⲟⲥ, et al. correctly even though they were tokenized correctly
  • apocalypse.paul (2)
    • corpus name needed
    • other metadata updated
    • possible error in data: the translation on p. 1043 begins with folio 24a, but the OCR Coptic begins in the middle of folio 6a (p. 533); perhaps move to later
  • mercurius (2)
  • pscyril.alexandria
    • On Mary still in XML mode (auto tagging?)
  • pscyril.jerusalem
    • on the cross
      • needs corpus name
      • metadata updated
      • chapter & verse need to be updated in spreadsheet based on open tags in XML
    • on Mary
      • needs corpus name
      • metadata updated
      • chapter & verse need to be updated in spreadsheet based on open tags in XML
  • psepiphanius
  • pschrysostom
    • still in XML mode (auto tagging?)
  • pscelestinus
  • pstimothy.alex
  • psote.psoi
  • timothy.discourse
@amir-zeldes
Member

Do we have chapter splits for the OCR data somewhere? We can do versification using the automatic sentencer for now, but we don't really have a tool for predicting chapters.

@ctschroeder
Member Author

@amir-zeldes I am manually adding chapter divisions where there are none currently

In Processed OCR folder, everything in Budge directory should be ready.

@ctschroeder
Member Author

@amir-zeldes as you can see we have a lot of docs. I may not be able to get them all ready for October.

@amir-zeldes
Member

> I may not be able to get them all ready for October

No rush at all, it looks like we have plenty! Just one question though - I thought the ones in GD were the priority rather than the ones in the repo. Should we de-prioritize some of the GD ones or are you still planning to release all/some of those?

@ctschroeder
Member Author

I will get to the GitDox ones as soon as I am done with the Helias collection. I thought these would be easier (no idea why) and would give you something to test for the automatic process

@ctschroeder
Member Author

OK, Helias is ready @amir-zeldes. It took longer than expected for various reasons. One thing: there is a .DS_Store file in there that needs to be deleted.
Moving to the GitDox files next (probably tomorrow)
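
For reference, stray .DS_Store files can be cleared from a local checkout with a few lines of Python. This is only a sketch: the function name `remove_ds_store` is made up for illustration, and the root path you pass in is whatever directory holds your copy of the repo.

```python
from pathlib import Path


def remove_ds_store(root):
    """Delete every .DS_Store file under root and return the removed paths.

    Hypothetical helper, not part of any existing tooling.
    """
    removed = []
    # Collect matches first so the directory tree is not modified mid-walk.
    for path in sorted(Path(root).rglob(".DS_Store")):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `remove_ds_store(".")` from the top of the checkout would remove the files at any depth and return the list of what was deleted, so it can be reviewed before committing.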

@ctschroeder
Member Author

@amir-zeldes Do all of these need translation spans? (The ones in GitDox? The ones in GitHub?)

@amir-zeldes note above: the NLP tool didn't process ⲫⲁⲅⲓⲟⲥ, ⲛⲫⲁⲅⲓⲟⲥ, et al. correctly even though they were tokenized correctly in at least one Mercurius doc. The spreadsheet has "'warn:empty_norm" in a bunch of cells where that word is.

Actually, now that I look, that warning also appears elsewhere in encomium.mercurius, in places where I really don't understand why it's there.

Do I need to manually fix all those?

@ctschroeder
Member Author

I think the warnings that are not about phagios come from lines that begin with a pipe where the previous line ends with an underscore. I tried to find those (sometimes they prevented NLP altogether), but I guess I missed a bunch
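
A quick way to hunt for that pattern, as a rough sketch: scan the text line by line and report any line that starts with a pipe directly after a line ending in an underscore. The function name `find_split_groups` is hypothetical, and the check assumes the text is available as a plain list of lines.

```python
def find_split_groups(lines):
    """Return 1-based numbers of lines that start with '|' right after a line ending in '_'.

    Hypothetical helper; surrounding whitespace is ignored when checking.
    """
    hits = []
    for i in range(1, len(lines)):
        prev = lines[i - 1].rstrip()
        cur = lines[i].lstrip()
        if prev.endswith("_") and cur.startswith("|"):
            hits.append(i + 1)
    return hits
```

Running it over `text.splitlines()` gives the line numbers to check by hand.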

@amir-zeldes
Member

Hi Carrie - I've got a bunch of e-mails coming on some of these topics but quick answers:

  • .DS_Store: no worries, I think we need another clean repo anyway (reasons in an upcoming e-mail), so you can ignore this for now
  • If things are getting published from the OCR repo, they don't need verses/translation, just chapters; scripts will add the rest. The ones in GitDox are a legacy pipeline. I guess we can auto-add translations as a one-off, but there's no trivial automatic way to do it if they are in spreadsheet mode (if they're in XML, it can be done by adding at least one chapter tag; the NLP button then auto-splits and numbers within each chapter)
  • I'm trying to make the warn:empty_norm issue impossible to trip, but it's still happening. I can take a look; can you tell me which docs?

More to come!

@amir-zeldes
Member

OK, RE: mercurius, I've cleaned up the XML in doc3 and put it into a spreadsheet, and also added translation spans in all 3 docs. The warn issue was coming from groups with a leading/trailing pipe, e.g. _|ⲁ|ϥ|ⲥⲱⲧⲙ, _ⲉ|_. If metadata looks OK to you I can publish from this state in GD, entities/identities will be added automatically (though this should be reflected in the manually edited metadata in GD). I reassigned these to you in status metadata.
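
For anyone checking other docs for the same problem, here is a minimal sketch of the test. It assumes, based only on the examples above, that underscores separate groups and pipes separate units inside a group; the function name `groups_with_stray_pipes` is made up for illustration and is not part of the NLP tools.

```python
def groups_with_stray_pipes(text):
    """Return the groups in text that begin or end with a pipe.

    Assumption: '_' delimits groups and '|' delimits units within a group,
    so a leading or trailing '|' leaves an empty unit in that group.
    """
    bad = []
    for group in text.split("_"):
        group = group.strip()
        if group and (group.startswith("|") or group.endswith("|")):
            bad.append(group)
    return bad
```

Applied to a string built from the two examples, `groups_with_stray_pipes("_|ⲁ|ϥ|ⲥⲱⲧⲙ_ⲉ|_")` flags both groups, while a clean string such as `"ⲁ|ϥ|ⲥⲱⲧⲙ_ⲉ"` returns an empty list.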

@ctschroeder
Member Author

ctschroeder commented Oct 23, 2024 via email

@ctschroeder
Member Author

@amir-zeldes Acts of Pilate (which is in https://github.com/CopticScriptorium/auto-corpora) is ready
