Publication thread summer/fall 2024 OCR documents #1
Comments
Do we have chapter splits for the OCR data somewhere? We can do versification using the automatic sentencer for now, but we don't really have a tool for predicting chapters. |
@amir-zeldes I am manually adding chapter divisions where there are none currently. In the Processed OCR folder, everything in the Budge directory should be ready. |
@amir-zeldes as you can see we have a lot of docs. I may not be able to get them all ready for October. |
No rush at all, it looks like we have plenty! Just one question though - I thought the ones in GD were the priority rather than the ones in the repo. Should we de-prioritize some of the GD ones or are you still planning to release all/some of those? |
I will get to the GitDox ones as soon as I am done with the Helias collection. I thought these would be easier (no idea why) and would give you something to test for the automatic process |
ok Helias is ready @amir-zeldes. It took longer than expected for various reasons. One thing -- there is a .DS_Store file in there that needs to be deleted. |
@amir-zeldes Do all of these need translation spans? (The ones in GitDox? The ones in GitHub?) @amir-zeldes note above: the NLP tool didn't NLP ⲫⲁⲅⲓⲟⲥ and ⲛⲫⲁⲅⲓⲟⲥ et al. correctly even though they were tokenized correctly in at least one Mercurius doc. The spreadsheet has "'warn:empty_norm" in a bunch of cells where that word is. Actually, now that I look, that warning appears elsewhere in encomium.mercurius in places where I really don't understand why it's there. Do I need to manually fix all those? |
I think the warnings that are not about phagios come from lines that begin with a pipe where the previous line ends with an underscore. I tried to find those (sometimes they prevented NLP altogether), but I guess I missed a bunch |
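A rough way to hunt for the remaining cases described above (a line that begins with a pipe right after a line that ends with an underscore) is a small script like the one below. This is only a sketch: the function name `find_pipe_after_underscore` and the sample lines are made up for illustration, and it assumes the text is plain lines with `|` and `_` as the only relevant delimiters.

```python
def find_pipe_after_underscore(lines):
    """Return 1-based numbers of lines that start with '|'
    immediately after a line that ends with '_'."""
    hits = []
    for i in range(1, len(lines)):
        prev = lines[i - 1].rstrip()  # ignore trailing whitespace/newline
        curr = lines[i].lstrip()      # ignore leading whitespace
        if prev.endswith("_") and curr.startswith("|"):
            hits.append(i + 1)
    return hits


# Hypothetical sample lines, not taken from any actual document.
sample = [
    "ⲁ|ϥ|ⲥⲱⲧⲙ_",
    "|ⲛϭⲓ_ⲡ|ⲣⲱⲙⲉ",
    "ⲁⲩⲱ_ⲡⲉϫⲁ|ϥ",
]
print(find_pipe_after_underscore(sample))
```

To run it over a real file, something like `find_pipe_after_underscore(open(path, encoding="utf-8").read().splitlines())` should work, where `path` is whatever file is being checked.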
Hi Carrie - I've got a bunch of e-mails coming on some of these topics but quick answers:
More to come! |
OK, RE: mercurius, I've cleaned up the XML in doc3 and put it into a spreadsheet, and also added translation spans in all 3 docs. The warn issue was coming from groups with a leading/trailing pipe, e.g. |
I’m working on Mercurius — there is something wonky with the manuscript metadata. More soon
@amir-zeldes acts of pilate (which is in https://github.com/CopticScriptorium/auto-corpora) is ready |
Do not close this issue until all checkboxes below are complete or have been rescheduled:
List of corpora:
In Processed OCR folder (needs sentence splitting+full automatic NLP processing like the bible corpora)
In GitDox
- [ ] corpus name needed
- [ ] other metadata updated
- possibly error in data -- translation on p. 1043 begins with folio 24a but OCR Coptic begins in the middle of folio 6a p. 533; perhaps move to later
- still in XML mode (auto tagging?)