Publication Thread for Spring 2019 #22

ctschroeder · 2018-09-13T15:15:49Z

ctschroeder · 2019-02-15T17:18:25Z

Decision: Will publish the Johannes corpus right away, before editorial review, to help catch formatting, TEI, and segmentation problems.

ctschroeder · 2019-03-07T17:36:14Z

@eplatte @cluckmarq Hi it's March 7. I'm running behind because I'm sick AGAIN. I will be working on Johannes Canons this weekend and they will be ready Monday. So if you all need until Monday please take the time. Christie: will you have more Dirt? I have not gotten to any of those fragments but can squeeze some in this weekend (probably segmentation:checked, auto everything else). Please let me know!
@amir-zeldes Heike has had the flu and hasn't finished Besa 2-3 but is plugging away. LMK if you still want to publish 1. We could also probably publish 2-3 without translation if you want. Again LMK.

Please reply to this thread or shoot me an email if you have any other questions/concerns about this publication cycle.

cluckmarq · 2019-03-07T17:47:21Z

@ctschroeder Sorry you are sick! :/ I am hoping to have a hunk of Dirt (GF113-128) by the end of the day (at the latest before lunch tomorrow). Almost done checking tokenization, but am solely working on this today.

ctschroeder · 2019-03-07T17:59:47Z

that's awesome @cluckmarq! Thanks!

eplatte · 2019-03-08T05:09:39Z

I hope you're feeling better soon, Carrie! I'm just finishing up my sections of Johannes (making changes from openrefine, adding versification), but I'm not going to be done tonight. I'm hoping to get everything done tomorrow, but they'll definitely be ready by Monday!

amir-zeldes · 2019-03-08T19:20:50Z

Oh no, I'm sorry to hear you're sick too Carrie. Hope you feel better soon. For Besa 2, honestly it's so short we could include a translation of our own, I could ask one of the Coptic-speaking students here if they want to take it on too. I'd try not to mix translated+not in this release, it's not worth the headache of keeping track of the difference for 1-2 fragments IMO.

Thanks everyone!

cluckmarq · 2019-03-08T20:13:17Z

@ctschroeder just an fyi. I am still working on correcting Dirt in ether (about 15% of way through text). Orthographic variants in the manuscript have translated into NLP misinterpreting a ton of things. But I continue to work on it, and hope I'll have made it through by Monday.

ctschroeder · 2019-03-09T02:55:53Z

Ok!

cluckmarq · 2019-03-11T20:47:49Z

@ctschroeder about 500 lines left to review in ether. i will try to finish up tonight after i put kids to bed. so close....

cluckmarq · 2019-03-12T00:47:36Z

@ctschroeder done! i've assigned dirt (gf 113-128) to you for review. if it's not you reviewing, just let me know so i can assign properly. i am sure i've missed a few things. the main issue is that gf uses ϭ for ⲑ in several places. but, i think i've caught most of them. and there was one place i couldn't figure out what was going on grammatically: i'm not sure what the grouping currently in line 3261 is.

ctschroeder · 2019-03-12T02:25:02Z

Ok thanks. I'll get to this at the end of the week or early next week, Christie!

bkrawiec · 2019-04-18T16:01:04Z

@cluckmarq @ctschroeder Since I wasn't publishing I missed this discussion. It's not that GF "uses ϭ for ⲑ in several places." That's a known factor in the process of scraping the data--when Amir changes David's Word transcription into what we use, that letter consistently gets altered. I usually just search for the letter and change it. Sorry to be late on this!
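A minimal sketch of that search step, assuming the transcription is available as a list of plain-text lines. The function name and the sample input are hypothetical, and the code only flags each ϭ for manual review rather than replacing it, since ϭ is also a legitimate letter and not every occurrence is a scraping error.

```python
SHIMA = "ϭ"  # the letter that replaces ⲑ during scraping

def flag_shima(lines):
    """Return (line_number, column, line) for every ϭ so an editor can review it."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, char in enumerate(line, start=1):
            if char == SHIMA:
                hits.append((line_no, col, line))
    return hits


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the corpus.
    for line_no, col, line in flag_shima(["abϭd", "xyz"]):
        print(f"line {line_no}, column {col}: {line}")
```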

amir-zeldes · 2019-04-24T21:12:59Z

Huh, somehow this wasn't on my radar, but looking at the scraper script I was able to find the problem - if we ever have more of this kind of data, it shouldn't happen again. Sorry about that!

ctschroeder · 2019-05-01T04:49:08Z

Hi @amir-zeldes. There are a bunch of files from 1 Cor & Mark plus single files in a22, victor, abraham, and fox that are marked "review" in GitDox. Are these all treebanking files? I am assuming we are not publishing them in this go-around. Thanks so much!

amir-zeldes · 2019-05-02T14:09:43Z

We can if we want to, or we can wait for next time, but either way they do not need to be checked (even if there is a stray error somewhere, they should be much more error free than any of the other datasets we release)

ctschroeder · 2019-05-02T14:34:41Z

Ok yes. I will check the version # and dates for the texts in corpora we are publishing and leave the rest for another time. Can you do me a favor and check the annotation metadata to be sure the right people are credited? Thanks so much!!

ctschroeder · 2019-05-03T19:28:10Z

Hi @amir-zeldes. I am done looking over the AP and Besa docs! Could you please check the annotators for any of those that were tree-banked and then put them on the private ANNIS instance?
Also FYI: I added chapter/verse versification so these docs keep up with our data model. HOWEVER for Besa, this means they don't all validate now, because the validation rule is translation=verse; Kuhn's verses are long, multisentence. So for the old Besa letters with short translation spans, this mismatch makes them invalid. We can either ignore, change validation rules, or move the translation around. Let me know what you think!

ctschroeder · 2019-05-07T04:59:38Z

@cluckmarq I'm almost done with your Dirt files! Looks good. I'm making a couple of lemmatization and normalization changes with some odd spellings, but I don't anticipate major questions for you. Thanks!

amir-zeldes · 2019-05-07T17:42:50Z

OK, Liz has been added to annotation of AP1-4, 27-36, since she treebanked them. Besa treebanking was all me, so no need to add.

Before putting the current versions in ANNIS, I'm noticing some of the AP have verse instead of verse_n, and I just discovered online that some corpora have verse (Victor), and some have verse_n (Pseudo-theophilus)... Which one do we want it to be? I should adjust the vis to look for what we decide on.

amir-zeldes · 2019-05-07T17:57:59Z

RE verse!=trans: it's OK as long as trans never covers multiple verses (opposite is OK, and already the case, compare: http://data.copticscriptorium.org/texts/besa_letters/to_aphthonia/norm)
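As a rough illustration of that rule (not the actual GitDox validation), the check could look like the sketch below. It assumes each span is an inclusive (start_row, end_row) pair and that verse spans do not overlap; the function name and the data layout are hypothetical.

```python
def trans_crossing_verses(verse_spans, trans_spans):
    """Return the translation spans that do not fit inside a single verse span.

    Spans are inclusive (start_row, end_row) pairs. A verse may hold several
    translation spans, but a translation span may not cover several verses.
    """
    offenders = []
    for t_start, t_end in trans_spans:
        contained = any(
            v_start <= t_start and t_end <= v_end
            for v_start, v_end in verse_spans
        )
        if not contained:
            offenders.append((t_start, t_end))
    return offenders


if __name__ == "__main__":
    verses = [(1, 10), (11, 20)]
    # Two short translations inside verse 1 are fine; (8, 12) crosses a boundary.
    print(trans_crossing_verses(verses, [(1, 4), (5, 10), (8, 12)]))
```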

amir-zeldes · 2019-05-10T20:47:14Z

OK, Besa is converted and visible to developers as besa.letters_test in ANNIS (log in and toggle visible corpora from 'scriptorium' to 'all')

ctschroeder · 2019-05-14T04:37:31Z

Hi @amir-zeldes. Thanks for putting up Besa. I'll check it soon. In the meantime can you put Dirt on the private ANNIS? One doc has trouble validating the lang column; it keeps saying some empty cells don't conform. I've tried everything -- deleting contents, adding valid contents, hitting return, doing this for the whole doc, validating (it validates), and then deleting those contents. But in the end the empty cells still get flagged.

amir-zeldes · 2019-05-15T21:37:07Z

OK, I actually just re-uploaded Besa because the translation spans were very large and I wanted them sentence-wise for eventual treebanking.

I also got the dirt spreadsheets to validate - there were all sorts of weird hidden values under the existing merged spans, I'm not sure how they got there. One way to get rid of them seems to be to merge the cell above them into them, then unmerge.

The problem I have with dirt now is that GF113-128 is very large - about double NBFB. I know they are contiguous, but can we break the pages into two documents? I'd say GF122 could be a good spot - close to the middle and starts a new sentence. I would re-number the chapters then though, so we have a new chapter in GF122. Does that sound OK? If so I can make the partition myself, just let me know.

ctschroeder · 2019-05-15T21:48:43Z

Hi @amir-zeldes!
Re Besa: did you change the verse numbering? Those numbers are Kuhn's and we are trying to keep to canonical numbering. I did notice the long spans but didn't change them for that reason.

Thanks for fixing Dirt!

Re Dirt GF 113-128: please do not change the chapter divisions. Those are David's divisions; I realize versification is ultimately arbitrary or subjective, but I would like to keep the chapter/paragraph divisions of the donating editor. As to where to divide, I would suggest GF 121 to begin a new document, because that's a new folio. It's not a new sentence but it is a new bound group and a new word. I would like to ask @cluckmarq and @bkrawiec what they prefer: divide at GF 122 (a verso page) because it begins a new sentence, or at GF 121 because it begins a new folio (recto)?

Also, when you get a chance can you put Johannes canons (anything "to publish" OR "review" - should be 8 documents) on the private ANNIS? Not all the metadata is there and not all have vid's but the spreadsheets should be valid and we should be able to see really wonky things to edit. Thanks so much!!

ctschroeder · 2019-05-16T17:02:21Z

@amir-zeldes I talked to Christie about a couple things incl GF. She and I both prefer breaking at GF 121. I know you prefer GF 122 (a new sentence) bc of treebanking and entities. Do we really need to break it into two? Any possibility we can keep it one doc?

amir-zeldes · 2019-05-16T17:45:17Z

No, no problem breaking at 121 - it will make a weird sentence boundary, but it's negligible in the context of the treebank (we have some fragmentary sentences anyway). Would you like me to break it there?

I think a long document will be a hassle in all sorts of contexts in the future, so I prefer to have some limit to document lengths. For readers it may also be more convenient to be able to scroll to metadata etc. more quickly, and splitting into two seems like a very minor change.

amir-zeldes · 2019-05-16T17:46:49Z

RE Besa, Kuhn's divisions are indicated in p_n, so those stay unchanged. If you look at the 'verses' visualization you'll see it's fine. The only thing that changes is the extent of the highlighted region with floating translation when you hover over a part of a Kuhn paragraph. The analytic vis also looks much better this way, so I don't see a downside (plus I needed those spans for treebanking)

ctschroeder · 2019-05-16T18:40:24Z

Re GF: great, so yes, you can split at the beginning of 121. Can you mark them Review in GitDox so I don't forget to check the metadata etc.? Re Besa: let me take a look. I'm more worried about chapter_n, verse_n, and vid_n all keeping Kuhn's numbers. p_n is easy peasy. I'll get back to you when I check.

amir-zeldes · 2019-05-16T19:30:00Z

OK I just looked at GF, but 121 is not flush with the beginning of the chapter, so what do you want to do about the chapter span? Will GF 121 begin a new chapter (and the last chapter of GF120 is just two groups), or do you want the same chapter number attested in two documents? Technically nothing prevents that, but it does seem a little confusing.

amir-zeldes · 2019-05-16T19:30:37Z

RE Besa - I didn't change chapter_n, vid_n etc., only the English translation. The rest lines up with Kuhn, and translated is nested within Kuhn spans.

ctschroeder · 2019-05-16T22:40:16Z

For GF 121/122: pls keep the chapter and verse numbers as they are. Break them across the docs. I may need to renumber -- I am in email convo with David about chapter numbering right now -- but the chapter spans will stay the same. It's fine if they break across docs. Happens all the time.
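A small sketch of what such a split amounts to, assuming each spreadsheet row is a dict with a page identifier and a chapter number. The field names, the helper, and the sample chapter numbers are hypothetical; the point is only that the chapter value is left untouched, so the same chapter can appear at the end of one document and the start of the next.

```python
def split_at_page(rows, page_id):
    """Split rows into two documents at the first row belonging to page_id."""
    for index, row in enumerate(rows):
        if row["page"] == page_id:
            return rows[:index], rows[index:]
    raise ValueError(f"page {page_id!r} not found")


if __name__ == "__main__":
    # Hypothetical rows; chapter numbers are made up for illustration.
    rows = [
        {"page": "GF120", "chapter": "5"},
        {"page": "GF120", "chapter": "6"},
        {"page": "GF121", "chapter": "6"},
        {"page": "GF121", "chapter": "7"},
    ]
    first, second = split_at_page(rows, "GF121")
    # Chapter 6 is attested in both documents; nothing is renumbered.
    print(first[-1]["chapter"], second[0]["chapter"])
```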

amir-zeldes · 2019-05-18T13:45:43Z

OK, split documents are up, I updated the big one to a status to_delete, feel free to remove if the split looks good

ctschroeder · 2019-05-20T17:57:51Z

thanks @amir-zeldes. I will look at these all tonight or tomorrow. Had a big push reviewing Beth's johannes docs this weekend.

eplatte · 2019-05-22T15:09:51Z

OK I'm done reviewing Carrie's Johannes docs. Thanks for your patience! I went through all with Open Refine. There is one section of FA143-158 that I couldn't figure out, in line 445 and lines 453-457. I think these are parallel expressions with ⲥⲟⲩⲛ (ⲥⲟⲟⲩⲛ), but I'm not sure what the verbs might be.
I also noticed that on the main corpus page the metadata value for license is showing as invalid, though it doesn't come up in the metadata validation on each document. I'm sure I've used the wrong quotation marks. @amir-zeldes is there a way we could fix all five documents at once?

amir-zeldes · 2019-05-23T15:03:16Z

I had a look, it's not just quotes, which should be single here, but also the angle brackets, which should be escaped. So instead of:

<a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0 Unported</a>

It should be:

&lt;a href='https://creativecommons.org/licenses/by-sa/3.0/'&gt;CC BY-SA 3.0 Unported&lt;/a&gt;

I fixed it in the database.
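For anyone who needs to prepare such a value by hand, here is a minimal sketch of the two changes described above (single quotes, escaped angle brackets). It uses Python's standard `html.escape`; the helper name is hypothetical, and this is not the script that was run against the database.

```python
import html

def escape_license_value(raw: str) -> str:
    """Switch double quotes to single quotes, then escape &, < and >."""
    single_quoted = raw.replace('"', "'")
    # quote=False leaves the single quotes themselves unescaped.
    return html.escape(single_quoted, quote=False)


if __name__ == "__main__":
    link = '<a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0 Unported</a>'
    print(escape_license_value(link))
```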

ctschroeder · 2019-05-23T22:52:38Z

@amir-zeldes I'm logged into ANNIS at https://corpling.uis.georgetown.edu/annis/scriptorium and I don't see any of the new documents. shenoute.dirt has only one doc, the ap corpus doesn't have any of the new sayings, etc.

amir-zeldes · 2019-05-24T20:12:47Z

You need to log in and toggle visible corpora from 'scriptorium' to 'all', since these are not in the white list of corpora to display in scriptorium. Besa and Dirt are in, but I haven't converted AP yet, I think it still has some validation errors in GitDox. Do all AP documents already have verses?

ctschroeder · 2019-05-24T21:04:39Z

No, I only added them to new/modified ones. We can ignore it if you want. I have been trying to keep up with them as we publish. Re validation errors in AP: I think I mentioned upthread some that I couldn't figure out.

Also Johannes Canons were ok’d for prepublication as well

ctschroeder · 2019-05-24T21:05:36Z

I will check on Besa and Dirt tonight or Sunday. Thx for the tip on finding them!

ctschroeder · 2019-05-26T18:03:05Z

@amir-zeldes some prepub notes:

AP: the only files that don't validate fail for lack of verse_n (I checked every one), except 53, which is giving me span errors that I can't fix for the life of me.
Besa: the diplomatic viz are making every 5th line (the lines with the numbers) look ekthetic. Can this be fixed?
Dirt: I still do not see a shenoute.dirt corpus with more than one document anywhere. Can you please tell me the exact name of the corpus with the new docs?
Johannes: I still need those to check. If you give the prepub corpus a new name, please tell me the exact name. Thanks.

amir-zeldes · 2019-05-26T22:14:04Z

OK, I added johannes.canons_test and I reset permissions on shenoute.dirt_test (those are the corpus names). Can you check again? It might have been a permissions issue. If you can see besa.letters_test you should be able to see those two as well.

I also had to rename pb_n to pb_xml_id in some Johannes documents, and remove the TEI span. The pb_n seems to follow a different format though, so unless that's intentional, they should probably be renamed to FA143 etc. (not just a number)

I'll take a look at Besa vis and AP next - do we want to release them without verse_n?

ctschroeder · 2019-05-27T16:06:01Z

@amir-zeldes re besa and ap: I don't have time to add verses to all the AP so yes, release w/o verse_n in all of them. Besa should already have verse_n in all docs, no?

amir-zeldes · 2019-05-28T16:17:17Z

Yes, I think Besa is good to go.

amir-zeldes · 2019-05-29T00:34:04Z

AP053 is fixed

ctschroeder · 2019-06-11T17:03:53Z

Released May 31

ctschroeder added the publish label Sep 13, 2018

ctschroeder added this to the Winter2019 milestone Sep 13, 2018

ctschroeder self-assigned this Sep 13, 2018

ctschroeder changed the title ~~Publication Thread for Winter 2019~~ Publication Thread for Spring 2019 Dec 15, 2018

ctschroeder assigned amir-zeldes, bkrawiec, eplatte and cluckmarq Dec 15, 2018

ctschroeder mentioned this issue Jan 9, 2019

Publication thread for Fall 2018 #19

Closed

10 tasks

ctschroeder closed this as completed Jun 11, 2019
