Publication Thread for Spring 2019 #22

Closed
24 of 30 tasks
ctschroeder opened this issue Sep 13, 2018 · 45 comments

@ctschroeder
Member

ctschroeder commented Sep 13, 2018

March 7, 2019 due date. Corpora will be listed after Fall 2018 publications #19

  • add macron/underdot order validation to GitDox rules and revalidate the following corpora
  • AP (@cluckmarq and Marina Ghaly)
  • Some Kinds of People Sift Dirt (@cluckmarq)
    • editorial review by @ctschroeder
    • GF 113-128 fragment needs vid_n layer; waiting on DB for info
    • update corpus metadata
    • check prepub
  • More Johannes canons (@eplatte & @ctschroeder)
    • editorial review by @ctschroeder & @eplatte
    • check prepub
    • update corpus metadata
    • add brackets to new johannes docs missed from the converter [done through FA 65-80]
  • Besa letters (from @amir-zeldes & @somiyagawa; need review by @ctschroeder)
    • On Vigilance & Exhortations ready; frg 3 another time
    • needs permission from Heike Behlmer for translation
    • needs to be broken into documents
    • needs to align translation
    • needs metadata
    • check prepub
    • update corpus metadata
    • diplomatic viz looks ekthetic on numbered lines
For later:
  • Shenoute Canons 6?
  • God Says Through Those Who Are His (@bkrawiec)
    • editorial review by TBD
  • possibly Treebank corpus (more Mark from Liz Davidson)
    Alin has contacted us about more material and there's someone who wants to do G Philip
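
For the macron/underdot order validation item in the task list above, here is a minimal sketch of what such a check could look like. It is not the GitDox rule itself. It assumes the macron is stored as U+0304 (the data may use U+0305 instead) and that the required order is underdot first, then macron, which matches Unicode canonical ordering. The thread does not state the required order, so both marks are parameters.

```python
COMBINING_MACRON = "\u0304"     # assumption: may be U+0305 (overline) in the actual data
COMBINING_DOT_BELOW = "\u0323"


def find_misordered_marks(text, first=COMBINING_DOT_BELOW, second=COMBINING_MACRON):
    """Return the indices where `second` is immediately followed by `first`.

    With the defaults this flags "macron then underdot", on the assumption
    that the desired order is underdot first.
    """
    return [
        i
        for i in range(len(text) - 1)
        if text[i] == second and text[i + 1] == first
    ]
```

Swapping the `first` and `second` arguments checks for the opposite convention.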
@ctschroeder ctschroeder added this to the Winter2019 milestone Sep 13, 2018
@ctschroeder ctschroeder self-assigned this Sep 13, 2018
@ctschroeder ctschroeder changed the title from "Publication Thread for Winter 2019" to "Publication Thread for Spring 2019" Dec 15, 2018
@ctschroeder
Member Author

Decision: Will publish Johannes corpus right away, before editorial review, to help catch formatting, TEI, and segmentation problems.

@ctschroeder
Member Author

@eplatte @cluckmarq Hi it's March 7. I'm running behind because I'm sick AGAIN. I will be working on Johannes Canons this weekend and they will be ready Monday. So if you all need until Monday please take the time. Christie: will you have more Dirt? I have not gotten to any of those fragments but can squeeze some in this weekend (probably segmentation:checked, auto everything else). Please let me know!
@amir-zeldes Heike has had the flu and hasn't finished Besa 2-3 but is plugging away. LMK if you still want to publish 1. We could also probably publish 2-3 without translation if you want. Again LMK.

Please reply to this thread or shoot me an email if you have any other questions/concerns about this publication cycle.

@cluckmarq
Member

@ctschroeder Sorry you are sick! :/ I am hoping to have a hunk of Dirt (GF113-128) by the end of the day (at the latest before lunch tomorrow). Almost done checking tokenization, but am solely working on this today.

@ctschroeder
Member Author

that's awesome @cluckmarq! Thanks!

@eplatte
Member

eplatte commented Mar 8, 2019

I hope you're feeling better soon, Carrie! I'm just finishing up my sections of Johannes (making changes from openrefine, adding versification), but I'm not going to be done tonight. I'm hoping to get everything done tomorrow, but they'll definitely be ready by Monday!

@amir-zeldes
Member

Oh no, I'm sorry to hear you're sick too Carrie. Hope you feel better soon. For Besa 2, honestly it's so short we could include a translation of our own, I could ask one of the Coptic-speaking students here if they want to take it on too. I'd try not to mix translated+not in this release, it's not worth the headache of keeping track of the difference for 1-2 fragments IMO.

Thanks everyone!

@cluckmarq
Member

@ctschroeder just an fyi. I am still working on correcting Dirt in ether (about 15% of way through text). Orthographic variants in the manuscript have translated into NLP misinterpreting a ton of things. But I continue to work on it, and hope I'll have made it through by Monday.

@ctschroeder
Member Author

ctschroeder commented Mar 9, 2019 via email

@cluckmarq
Member

@ctschroeder about 500 lines left to review in ether. i will try to finish up tonight after i put kids to bed. so close....

@cluckmarq
Member

@ctschroeder done! i've assigned dirt (gf 113-128) to you for review. if it's not you reviewing, just let me know so i can assign properly. i am sure i've missed a few things. the main issue is that gf uses ϭ for ⲑ in several places. but, i think i've caught most of them. and there was one place i couldn't figure out what was going on grammatically: i'm not sure what the grouping currently in line 3261 is.

@ctschroeder
Member Author

ctschroeder commented Mar 12, 2019 via email

@bkrawiec
Member

@cluckmarq @ctschroeder Since I wasn't publishing I missed this discussion. It's not that GF "uses ϭ for ⲑ in several places." That's a known factor in the process of scraping the data--when Amir changes David's Word transcription into what we use, that letter consistently gets altered. I usually just search for the letter and change it. Sorry to be late on this!

@amir-zeldes
Member

Huh, somehow this wasn't on my radar, but looking at the scraper script I was able to find the problem - if we ever have more of this kind of data, it shouldn't happen again. Sorry about that!
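
For files that were already scraped, the "search for the letter and change it" step described above could be supported by a small listing script. This is only a sketch for manual review, not the scraper or the fix made to it. The function name and the line-based input are hypothetical. It reports occurrences and changes nothing, because ϭ is also a legitimate letter and a blanket replacement would corrupt correct text.

```python
def find_letter(lines, letter="ϭ", width=10):
    """List every occurrence of `letter` with surrounding context.

    Returns (line_number, column, context) tuples; line numbers are 1-based,
    columns are 0-based. Nothing is modified.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        start = line.find(letter)
        while start != -1:
            context = line[max(0, start - width): start + len(letter) + width]
            hits.append((lineno, start, context))
            start = line.find(letter, start + 1)
    return hits
```

Each hit can then be checked against the manuscript reading before deciding whether ⲑ was intended.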

@ctschroeder
Member Author

Hi @amir-zeldes. There are a bunch of files from 1 Cor & Mark plus single files in a22, victor, abraham, and fox that are marked "review" in GitDox. Are these all treebanking files? I am assuming we are not publishing them in this go-around. Thanks so much!

@amir-zeldes
Member

We can if we want to, or we can wait for next time, but either way they do not need to be checked (even if there is a stray error somewhere, they should be much more error free than any of the other datasets we release)

@ctschroeder
Member Author

Ok yes. I will check the version # and dates for the texts in corpora we are publishing and leave the rest for another time. Can you do me a favor and check the annotation metadata to be sure the right people are credited? Thanks so much!!

@ctschroeder
Member Author

Hi @amir-zeldes. I am done looking over the AP and Besa docs! Could you please check the annotators for any of those that were tree-banked and then put them on the private ANNIS instance?
Also FYI: I added chapter/verse versification so these docs keep up with our data model. HOWEVER for Besa, this means they don't all validate now, because the validation rule is translation=verse; Kuhn's verses are long, multisentence. So for the old Besa letters with short translation spans, this mismatch makes them invalid. We can either ignore, change validation rules, or move the translation around. Let me know what you think!

@ctschroeder
Member Author

@cluckmarq I'm almost done with your Dirt files! Looks good. I'm making a couple of lemmatization and normalization changes with some odd spellings, but I don't anticipate major questions for you. Thanks!

@amir-zeldes
Member

OK, Liz has been added to annotation of AP1-4, 27-36, since she treebanked them. Besa treebanking was all me, so no need to add.

Before putting the current versions in ANNIS, I'm noticing some of the AP have verse instead of verse_n, and I just discovered online that some corpora have verse (Victor), and some have verse_n (Pseudo-theophilus)... Which one do we want it to be? I should adjust the vis to look for what we decide on.

@amir-zeldes
Member

RE verse!=trans: it's OK as long as trans never covers multiple verses (opposite is OK, and already the case, compare: http://data.copticscriptorium.org/texts/besa_letters/to_aphthonia/norm)
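
The constraint can be sketched as a containment check: every translation span must fall inside a single verse span, while a verse may hold several translation spans. This is an illustration only, not the GitDox validation rule. It assumes spans are available as inclusive (start, end) token offsets, and the function name is hypothetical.

```python
def trans_spans_crossing_verses(verses, translations):
    """Return the translation spans that are not contained in any one verse.

    Both arguments are lists of (start, end) token offsets, inclusive.
    An empty result means the layers satisfy the constraint.
    """
    bad = []
    for t_start, t_end in translations:
        inside_one_verse = any(
            v_start <= t_start and t_end <= v_end for v_start, v_end in verses
        )
        if not inside_one_verse:
            bad.append((t_start, t_end))
    return bad
```

With verses (0, 9) and (10, 19), translation spans (0, 4) and (5, 9) pass because both sit inside the first verse, while (8, 12) is reported because it crosses the verse boundary.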

@amir-zeldes
Member

OK, Besa is converted and visible to developers as besa.letters_test in ANNIS (log in and toggle visible corpora from 'scriptorium' to 'all')

@ctschroeder
Member Author

Hi @amir-zeldes. Thanks for putting up Besa. I'll check it soon. In the meantime can you put Dirt on the private ANNIS? One doc has trouble validating the lang column; it keeps saying some empty cells don't conform. I've tried everything -- deleting contents, adding valid contents, hitting return, doing this for the whole doc, validating (it validates), and then deleting those contents. But in the end the empty cells still get flagged.

@amir-zeldes
Member

OK, I actually just re-uploaded Besa because the translation spans were very large and I wanted them sentence-wise for eventual treebanking.

I also got the dirt spreadsheets to validate - there were all sorts of weird hidden values under the existing merged spans, I'm not sure how they got there. One way to get rid of them seems to be to merge the cell above them into them, then unmerge.

The problem I have with dirt now is that GF113-128 is very large - about double NBFB. I know they are contiguous, but can we break the pages into two documents? I'd say GF122 could be a good spot - close to the middle and starts a new sentence. I would re-number the chapters then though, so we have a new chapter in GF122. Does that sound OK? If so I can make the partition myself, just let me know.

@ctschroeder
Member Author

Hi @amir-zeldes!
Re Besa: did you change the verse numbering? Those numbers are Kuhn's and we are trying to keep to canonical numbering. I did notice the long spans but didn't change them for that reason.

Thanks for fixing Dirt!

Re Dirt GF 113-128: please do not change the chapter divisions. Those are David's divisions; I realize versification is ultimately arbitrary or subjective, but I would like to keep the chapter/paragraph divisions of the donating editor. As to where to divide, I would suggest GF 121 to begin a new document, because that's a new folio. It's not a new sentence but it is a new bound group and a new word. I would like to ask @cluckmarq and @bkrawiec what they prefer. Divide at GF 122 (a verso page) because it begins a new sentence or GF 121 because it begins a new folio (recto).

Also, when you get a chance can you put Johannes canons (anything "to publish" OR "review" - should be 8 documents) on the private ANNIS? Not all the metadata is there and not all have vid's but the spreadsheets should be valid and we should be able to see really wonky things to edit. Thanks so much!!

@ctschroeder
Member Author

@amir-zeldes I talked to Christie about a couple things incl GF. She and I both prefer breaking at GF 121. I know you prefer GF 122 (a new sentence) bc of treebanking and entities. Do we really need to break it into two? Any possibility we can keep it one doc?

@amir-zeldes
Member

No, no problem breaking at 121 - it will make a weird sentence boundary, but it's negligible in the context of the treebank (we have some fragmentary sentences anyway). Would you like me to break it there?

I think a long document will be a hassle in all sorts of contexts in the future, so I prefer to have some limit to document lengths. For readers it may also be more convenient to be able to scroll to metadata etc. more quickly, and splitting into two seems like a very minor change.

@amir-zeldes
Member

RE Besa, Kuhn's divisions are indicated in p_n, so those stay unchanged. If you look at the 'verses' visualization you'll see it's fine. The only thing that changes is the extent of the highlighted region with floating translation when you hover over a part of a Kuhn paragraph. The analytic vis also looks much better this way, so I don't see a downside (plus I needed those spans for treebanking)

@ctschroeder
Member Author

ctschroeder commented May 16, 2019 via email

@amir-zeldes
Member

OK I just looked at GF, but 121 is not flush with the beginning of the chapter, so what do you want to do about the chapter span? Will GF 121 begin a new chapter (and the last chapter of GF120 is just two groups), or do you want the same chapter number attested in two documents? Technically nothing prevents that, but it does seem a little confusing.

@amir-zeldes
Member

RE Besa - I didn't change chapter_n, vid_n etc., only the English translation. The rest lines up with Kuhn, and the translation spans are nested within Kuhn spans.

@ctschroeder
Member Author

For GF 121/122: pls keep the chapter and verse numbers as they are. Break them across the docs. I may need to renumber -- I am in email convo with David about chapter numbering right now -- but the chapter spans will stay the same. It's fine if they break across docs. Happens all the time.

@amir-zeldes
Member

OK, split documents are up. I set the status of the big one to to_delete; feel free to remove it if the split looks good

@ctschroeder
Member Author

thanks @amir-zeldes. I will look at these all tonight or tomorrow. Had a big push reviewing Beth's johannes docs this weekend.

@eplatte
Member

eplatte commented May 22, 2019

OK I'm done reviewing Carrie's Johannes docs. Thanks for your patience! I went through all with Open Refine. There is one section of FA143-158 that I couldn't figure out, in line 445 and lines 453-457. I think these are parallel expressions with ⲥⲟⲩⲛ (ⲥⲟⲟⲩⲛ), but I'm not sure what the verbs might be.
I also noticed that on the main corpus page the metadata value for license is showing as invalid, though it doesn't come up with the metadata validation on each document. I'm sure I've used the wrong quotation marks. @amir-zeldes is there a way we could fix all five documents at once?

@amir-zeldes
Member

I had a look, it's not just quotes, which should be single here, but also the angle brackets, which should be escaped. So instead of:

<a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0 Unported</a>

It should be:

&lt;a href='https://creativecommons.org/licenses/by-sa/3.0/'&gt;CC BY-SA 3.0 Unported&lt;/a&gt;

I fixed it in the database.
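
For reference, the same transformation can be reproduced with Python's standard library. This is a sketch of the conversion shown above, not the database fix that was actually applied, and the function name is hypothetical.

```python
import html


def escape_license_value(value):
    """Switch double quotes to single quotes, then escape &, < and >."""
    single_quoted = value.replace('"', "'")
    # quote=False leaves the single quotes alone; only &, < and > are escaped.
    return html.escape(single_quoted, quote=False)


raw = '<a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0 Unported</a>'
print(escape_license_value(raw))
```

Running it on the original license value prints the escaped form given above.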

@ctschroeder
Member Author

@amir-zeldes I'm logged into ANNIS at https://corpling.uis.georgetown.edu/annis/scriptorium and I don't see any of the new documents. shenoute.dirt has only one doc, the ap corpus doesn't have any of the new sayings, etc.

@amir-zeldes
Member

You need to log in and toggle visible corpora from 'scriptorium' to 'all', since these are not in the white list of corpora to display in scriptorium. Besa and Dirt are in, but I haven't converted AP yet, I think it still has some validation errors in GitDox. Do all AP documents already have verses?

@ctschroeder
Member Author

No, I only added them to new/modified ones. We can ignore that if you want. I have been trying to keep up with them as we publish. Re validation errors in AP, I think I mentioned upthread some that I couldn’t figure out.

Also Johannes Canons were ok’d for prepublication as well

@ctschroeder
Member Author

I will check on Besa and Dirt tonight or Sunday. Thx for the tip on finding them!

@ctschroeder
Member Author

@amir-zeldes some prepub notes:

  • AP: the only files that don't validate fail for lack of verse_n; I checked every one. EXCEPT 53, which is giving me span errors that I can't fix for the life of me
  • Besa: the diplomatic viz are making every 5th line (the lines with the numbers) look ekthetic. Can this be fixed?
  • Dirt: I still do not see a shenoute.dirt corpus with more than one document anywhere. Can you please tell me the exact name of the corpus with the new docs?
  • Johannes: still need those to check. If you give the prepub corpus a new name please tell me the exact name. Thanks.

@amir-zeldes
Member

OK, I added johannes.canons_test and I reset permissions on shenoute.dirt_test (those are the corpus names). Can you check again? It might have been a permissions issue. If you can see besa.letters_test you should be able to see those two as well.

I also had to rename pb_n to pb_xml_id in some Johannes documents, and remove the TEI span. The pb_n seems to follow a different format though, so unless that's intentional, they should probably be renamed to FA143 etc. (not just a number)

I'll take a look at Besa vis and AP next - do we want to release them without verse_n?

@ctschroeder
Member Author

@amir-zeldes re besa and ap: I don't have time to add verses to all the AP so yes, release w/o verse_n in all of them. Besa should already have verse_n in all docs, no?

@amir-zeldes
Member

Yes, I think Besa is good to go.

@amir-zeldes
Member

AP053 is fixed

@ctschroeder
Member Author

Released May 31


5 participants