Publication thread for Fall 2018 #19

Closed
10 tasks done
ctschroeder opened this issue Jun 12, 2018 · 72 comments

@ctschroeder
Member

ctschroeder commented Jun 12, 2018

- More Johannes canons? (@eplatte)
- Some Kinds of People Sift Dirt (@cluckmarq)
- God Says Through Those Who Are His (@bkrawiec)

  • Mark (from gold/manual treebanking)
  • 1 Cor (from gold/manual treebanking)
  • Victor (from gold/manual treebanking)
  • AOF (@amir-zeldes treebanked section, new translation for a section)
  • A22 fragment(s) (@amir-zeldes treebanked data)

For automated corpora, we will note in the metadata that tokenization and annotations are fully automated.

Possible:
- AP (if Marina has new AP)
- Besa (@amir-zeldes @somiyagawa) (Besa from two main codices) MOVED to #22
  - needs permission from Heike Behlmer for translation
  - needs to be broken into documents
  - translation needs to be aligned (scraping the translation text will take a few days; the alignment must be done manually)
  - needs metadata

Before publication when checking metadata:

@ctschroeder ctschroeder self-assigned this Jun 12, 2018
@ctschroeder ctschroeder added this to the Fall2018 milestone Jun 12, 2018
@amir-zeldes
Member

RE Besa's letters - there are way more than two - all of MONB.BA and BB, so almost all of the existing letters. We've only processed two so far though, and I'm not sure if we'll run into problems with later ones.

@eplatte
Member

eplatte commented Jun 12, 2018

Yes, I plan to do some more Johannes. I may also have something from Budge from my Coptic reading group at Reed.

@ctschroeder
Member Author

I'm looking at items in Gitdox for publication. Looks like in addition to treebanked material in Mark, 1 Cor, Victor, A22, and AOF we also have Eagerness docs. Is that correct, @amir-zeldes ? Are these newly treebanked Eagerness docs?

We also have a doc from Not Because a Fox barks. Are we republishing this corpus?

Last, there look to be some validation issues. I will go through them and let you know if I have any problems or questions.

(I am skipping the AP, since we will have new AP to publish in the Winter.)

@amir-zeldes
Member

Eagerness has no treebanked documents, so any edits are presumably sporadically noticed errors (probably no more than a handful). If there are no new documents, maybe we should hold off on Eagerness until there is new material - I think more new documents are still coming, right?

Similarly, NBFB may have some tiny corrections, but otherwise nothing new really. I may hold off on it until we are closer to 'one click publication'. The rest have considerable changes due to treebanking and should be re-imported; they are much better quality now.

@bkrawiec
Member

bkrawiec commented Oct 21, 2018 via email

@amir-zeldes
Member

That's good to know, thanks! We could try to squeeze them in, but as I wrote above, the changes are probably minimal, so maybe we should wait until we treebank some of Eagerness.

Another question about Mark/1 Cor - I see failed validations due to missing 'p' annotations. Do we want to require p? If so, in what units? p mainly serves to segment the normalized view for convenience, but for Bible chapters the verses already do a good job of that, so maybe we can drop this requirement for the Bible?

@ctschroeder
Member Author

Hi. I was making a list of things to go over as I reviewed the corpora for publication, and "p" was on the list. It also relates to our decision in DC to minimize the number of visualizations and visualization names. I think we can change the validation to p | vid_n. I am adding vid_n (the CTS URNs at the verse level) to all corpora as they are re-published. The visualizations break the text at p or at v, right? v is the verse number written as a number, and vid_n is the URN for the verse (same span as v). I would prefer the validation to be p | vid_n as a reminder to add those CTS URNs.

@ctschroeder
Member Author

I'm making a list of things that are coming up, that I'll post when I'm done. But two big ones:

  • for 1 Cor and Mark, only parts of the corpora are treebanked - 1 Cor 1-6 and Mark 1-7, correct? Those are the only ones marked as "to_review". So we will have partially treebanked corpora?

  • for corpora being republished due to treebanking (in the list above): are treebanking, tagging, and segmentation gold/checked/auto? Some are listed as treebanking: auto.

@amir-zeldes
Member

Only chapters 1-6 of Mark are treebanked at the moment, same as 1 Cor. The metadata should show everything as gold for the treebanked chapters; for the rest, pos/seg should be 'checked' and parsing 'auto'. The tag/seg/parse metadata is document-wise, so mixed corpora should not be a problem.
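For illustration, a minimal sketch in Python of what that document-wise annotation-status metadata could look like, plus a quick consistency check that treebanked documents are marked fully gold. The document names and the treebanked set here are made up; the real list is on the treebank page.

```python
# Hypothetical per-document annotation-status metadata, following the scheme
# described above: everything gold for treebanked chapters; pos/seg "checked"
# and parsing "auto" for the rest. Document names are placeholders.
doc_status = {
    "sahidica.mark_01": {"tagging": "gold", "segmentation": "gold", "parsing": "gold"},
    "sahidica.mark_08": {"tagging": "checked", "segmentation": "checked", "parsing": "auto"},
}

# Documents assumed to be treebanked (the real inventory lives elsewhere).
treebanked = {"sahidica.mark_01"}

# Flag any treebanked document whose metadata is not fully gold.
for doc, status in doc_status.items():
    if doc in treebanked and set(status.values()) != {"gold"}:
        print(f"{doc}: treebanked but metadata not all gold -> {status}")
```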

@amir-zeldes
Member

Re p-annotations: I went ahead and made the p check corpus-dependent. There is currently no way to make one validation check for either/or, so we'd need a separate rule to require vid_n in the corpora where that's relevant (all corpora?).
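Since an either/or rule isn't expressible in the validator itself, a check like that would have to run outside GitDox over exported documents. A minimal sketch, assuming we already have each document's list of span-annotation layer names (the function and the layer inventories below are hypothetical, not GitDox's API):

```python
def check_p_or_vid_n(doc_name, layers):
    """Pass if the document has a 'p' layer or verse-level 'vid_n' URNs."""
    if "p" in layers or "vid_n" in layers:
        return True
    print(f"{doc_name}: has neither p nor vid_n spans")
    return False

# Example calls with made-up layer inventories:
check_p_or_vid_n("sahidica.mark_02", ["verse_n", "vid_n", "orig_group"])  # passes via vid_n
check_p_or_vid_n("johannes.canons_05", ["orig_group", "norm_group"])      # reports missing spans
```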

@ctschroeder
Member Author

Can you go in and edit the parsing/tagging/segmentation metadata? There are a number of docs marked “review” or “to publish”, and there is currently no way for me to tell which ones in each corpus are treebanked (since the parsing data is elsewhere). I realize a mixed corpus is fine technically with respect to treebanking, but from an annotating/curating point of view, having me make those edits to the metadata is asking for trouble, because some corpora have both treebanked and non-treebanked docs up for review or publication. I think you need to go in and make the changes since there are mixed treebanked corpora.

@ctschroeder
Member Author

I will for sure check the rest of the metadata and add corpus metadata to the corpora that lack it (like 1Cor).

@amir-zeldes
Member

I can do the automation metadata for Sahidica. It's pretty well documented, though; the list of treebanked documents is in the table here:

http://copticscriptorium.org/treebank.html

@ctschroeder
Member Author

Generally the process is most effective and accurate when each annotator adds/edits metadata as they annotate. Otherwise there is a lot of back and forth with the person doing review, or something gets missed. There's no effective way for the person conducting the final editorial review to keep in their head which metadata might change and which might not for each publication thread. The person doing the review (not always me) needs to be able to look at the metadata for obvious errors, like typos or missing fields, but other than version number/date isn't expected to go through each existing field and ask whether the data needs to be changed. I will go ahead and reassign docs back to you for checking the parsing/tagging/segmentation metadata before publication. Thanks!

@amir-zeldes
Member

ⲞⲔ, 1Cor and Mark should be good to go from the NLP metadata perspective. I also corrected any validation errors that are automatically caught, so they're all green, but I'm not sure if there's something we wanted but haven't added a validation for yet.

@amir-zeldes
Member

? I'm not sure I understand the preceding comment - I have no metadata changes to make that I'm aware of. I'm happy to keep NLP metadata up to date as we treebank in the future, but these are fields that didn't exist when the treebanking happened. Sahidica is now up to date.

@ctschroeder
Member Author

If I have any questions about the other mixed corpora besides the Sahidica ones, I'll let you know.

@ctschroeder
Member Author

Thanks for editing the Sahidica ones!

@ctschroeder
Member Author

@amir-zeldes can you tell me who has been treebanking (and then correcting tagging/segmentation) for the AOF, Victor, A22, Mark, 1 Cor texts? I will add their names to the corpus and document metadata. Thanks!

@amir-zeldes
Member

amir-zeldes commented Oct 26, 2018

The new Mark + 1 Cor material is Mitchell. A22 is me and Liz. AOF is just me; Victor is me, Mitchell, and the four Israelis listed here: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/blob/dev/README.md (under acknowledgments)

@ctschroeder
Member Author

ctschroeder commented Oct 29, 2018

Thank you! 1Cor is ready (I also added corpus metadata to GitDox) except for these questions:

  • There were a couple of problems with spans in the layers in two of the 1 Cor docs (cells not merged in the spreadsheet in the _group layers). I merged the cells. Do you need to know which cells I modified for the treebanking data, or is the treebank data unaffected by the spans?

A couple of questions about Mark:

  • Chapter 7 is also marked for review. Do you want to republish or wait? The commit log in GitHub shows one commit, but I do not see any changes when the file is diffed. Could there be saved-but-uncommitted changes in GitDox?
  • Chapter 9 was listed as “to_publish” but, like 7, the reason is unclear; it hasn’t been reviewed for this cycle, and there are no committed changes to the file in GitHub since it was imported into GitDox. (Again, possibly saved-but-uncommitted changes in GitDox?) I changed its status to “review”. Do you want to republish or wait?

@ctschroeder
Member Author

ctschroeder commented Nov 26, 2018 via email

@ctschroeder
Member Author

AOF is done

@ctschroeder
Member Author

Here are the URNs that need addressing. We should either put something in the 404 or somehow redirect. It is probably easier to list them on the 404 page for now; I think any redirect is complex with this application.
urn-table.xlsx

@ctschroeder
Member Author

Also @amir-zeldes can you check the parsing/segmentation/tagging metadata for the files that were treebanked? Not sure I got them right.

  • The grids and metadata validate. TEI export does not always validate on the "to_publish" files; the validation report references line numbers that I can't see.

@ctschroeder
Member Author

Mark is as ready as it will be. @amir-zeldes I think we're good to go.

@amir-zeldes
Member

OK, I went over the AOF exports; they all check out now. The way to see those line numbers is to do a TEI export from the editor, download the file, and find that line number in the file. Usually I look for some nearby word or translation and check the grid instead of trying to figure out the XML, since the error is usually apparent in the grid too.
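If it helps, that lookup can be scripted. A small sketch (the file name and line number below are placeholders) that prints the reported line with a little context, so there is a nearby word to search for in the grid:

```python
# Print the line reported by the validator from a downloaded TEI export,
# plus two lines of context on either side. File name and line number are
# placeholders, not actual values from the report.
line_no = 1234
with open("AOF_export.tei.xml", encoding="utf-8") as f:
    lines = f.readlines()
for i in range(max(0, line_no - 3), min(len(lines), line_no + 2)):
    marker = ">>" if i == line_no - 1 else "  "
    print(f"{marker} {i + 1:>6}: {lines[i].rstrip()}")
```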

XL still needs versification, it seems - did you say you wanted to just follow the translations, or do something else? Once that checks out I think we really are good to go!

Oh, and one more thing, is Besa/Vigilance included?

@ctschroeder
Member Author

ctschroeder commented Dec 6, 2018 via email

@amir-zeldes
Member

AOF can't be published without the verse/vid fields filled in, based on the new schema. I'm happy to put 'undetermined' there, but it does seem odd... If they're another work, then we should consider putting them in another corpus one of these days, since for other works our corpus objects logically correspond to the works, not the codices.

I'm fine waiting on Besa; in the meantime I'll start processing the other corpora so we can get the release published. As for 'review', all that status really means is 'was edited', so maybe we should start calling it something else ('review' sounds like the whole document needs to be reviewed, when really only some tiny changes occurred).

@ctschroeder
Member Author

I can break up XL, but it's one doc in the treebanking corpus so I didn't want to mess with it.
Re GitDox: I hear what you're saying. The thing is, I think more status categories might get confusing, and often there are a lot of details, which are better documented in the GitHub issues. Sometimes even if only a minor correction is made, the full document does get updated before publication, because there is a GitHub issue noting corrections, data model enhancements, etc., that are slated to be added with the next publication of a document. So any documents slated for (re)publication really do all need to be reviewed. Part of the documented workflow on the wiki is that review editors (usually Beth or me) know to look at the GitHub issue for the publication cycle and the GitHub issues for the relevant corpus during review.

Additionally, the changes to a document often take place over long spans of time (e.g., someone makes corrections, a few months later someone adds metadata for new fields we've started using, a few months later someone treebanks it, and then it's finally published). It's hard to create status categories that capture all of that nuance. I think marking documents for "review" and posting the details on the GitHub issues works better, because of the narrative and volume capacity of the issues.

If we make any enhancement, I would suggest adding a link to our GitHub issues in GitDox. But I hesitate to change the status fields, and I also wouldn't want to add another field, because toggling between additional information in GitDox and the GitHub issues will get confusing. I'd prefer keeping all the details on the documents in one place, GitHub, where I can see commit histories and issues together.

@amir-zeldes
Member

amir-zeldes commented Dec 7, 2018

Yeah, I understand - I'm not suggesting we change the statuses right now, but I think we should have a conversation about this again sometime, possibly in parallel with or just before training the new DH specialist.

I just fixed XL: it's 'undetermined' for the URNs and 'x' for the verse number, which should work for conversion and validation for now. AOF validates now, so we are good to go, leaving Besa aside.

@ctschroeder
Member Author

Ok that all sounds good, thanks!!

@amir-zeldes
Member

OK, everything is up on ANNIS now, excluding new Besa for the moment. Please take a look and if all looks well I can set up the ingest and push to GitHub as well.

@ctschroeder
Member Author

Thanks so much, @amir-zeldes. I will look over this tonight and tomorrow. In the meantime, what do you think about the issue of the urns? (#19 (comment))

@amir-zeldes
Member

I had a look at the Excel table - these all seem to be pure URNs, not URLs, so I'm not sure what you mean by 404 above. If they're URLs, we can probably set up the server's Apache to intercept and 404 them, but if we are at the URN level, this is something the repo software needs to handle, no? Does it have any existing functionality to handle redirects or some kind of 404-like scenario?

@ctschroeder
Member Author

What happens is the web application either 1) takes the URN someone types into a box and spits out the corresponding URL, which is operationalized (maybe the wrong word here?) as a list of documents that contain that URN; or 2) takes the URL data.copticscriptorium.org/URN, generated by someone clicking on that link somewhere else or typing it into their browser bar, and likewise generates a page listing the documents that contain that URN.
So, either typing the URN into the box or plugging our data URL plus the URN into your browser will cause the web page to produce an error. I am just now realizing it's not a 404 - it's a "There are no results that match your search" (see here for an example: http://data.copticscriptorium.org/urn:cts:copticLit:besa.aphthonia.monbya ), so my idea was tragically flawed from the start.
I'm not sure how the application handles redirects. I don't really know the code (she says, as if she were a programmer...!). I can search the code for the "no results" message and see if it's in there somewhere. If not, it might be in a database setting.

@ctschroeder
Member Author

Oh hey, the "no results" message is in the index file: https://github.com/CopticScriptorium/cts/blob/master/coptic/templates/index.html
We could modify that message to say: if you're looking for these abc files, go to xyz instead.

I really do not know how the repo would manage a redirect.
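For what it's worth, one lightweight option would be to keep a small mapping of retired URNs and mention the new location in that "no results" message. This is only a sketch: the mapping, the helper function, and the replacement URN shown are all made up for illustration, not the cts application's actual code.

```python
# Hypothetical mapping of retired URNs to wherever those texts now live.
RETIRED_URNS = {
    "urn:cts:copticLit:besa.aphthonia.monbya": "urn:cts:copticLit:besa.aphthonia",  # made-up target
}

def no_results_message(urn):
    """Build the 'no results' text, pointing at a replacement URN if one is known."""
    target = RETIRED_URNS.get(urn)
    if target:
        return f"There are no results for {urn}; this text is now available under {target}."
    return "There are no results that match your search."

print(no_results_message("urn:cts:copticLit:besa.aphthonia.monbya"))
```

The same mapping could later drive a real redirect if the application ever grows that functionality.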

@ctschroeder
Member Author

ctschroeder commented Dec 12, 2018

Meanwhile, these documents look great in ANNIS. Just a couple things.

I made small edits to the following files, so they need to be redone:

  • AOF YA 525-30
  • AOF ZH fragment
  • A22 corpus metadata (not sure where that is stored upon commit to GitHub? I edited the YB 307-320 file)

Also:

  • the Mark corpus still needs to be converted and added (see the earlier comment in this thread (#19) and the checklist at the top).
  • Document names for the Victor and A22 corpora are wrong. I worry there is something wonky with the GitDox conversion. The filenames need to be changed, and I made an issue in the GitDox GitHub repository.
  • The A22 and AOF visualizations should now be the versified one, not the old normalized one.

@amir-zeldes
Member

OK, the SNP bug should be resolved. Mark has also been updated, and we have fresh versions of AOF and A22 as well. I've spot checked them, but please take a look as well.

@amir-zeldes
Member

Regarding the redirect: yes, if it were a URL-based system they could be intercepted at the Apache level, before the app ever sees the request, but the way it's been built, this would require making some code changes. Maybe we should put some renovations to the repo on the agenda for next semester. I think we should talk about prioritization again at some point in January.

@ctschroeder
Member Author

ctschroeder commented Dec 13, 2018 via email

@amir-zeldes
Member

Mmm... I guess you could, sure. Ultimately I'd like a better solution for this, but it might take some time, so this could be a good band-aid.

@ctschroeder
Member Author

ctschroeder commented Dec 14, 2018 via email

@ctschroeder
Member Author

ctschroeder commented Jan 9, 2019

OK, the release is basically done except for a few behind-the-curtain actions:

  • Post TEI, PAULA, relANNIS, tt SGML to GitHub corpora repository (@amir-zeldes)
  • Create release (@ctschroeder or @amir-zeldes)
  • Update [geographic RDF](http://wiki.copticscriptorium.org/doku.php?id=checklist_for_publishing_corpora#update_pelagios_rdf) (@ctschroeder)

@ctschroeder
Member Author

@amir-zeldes is there a way for an admin to batch change status "to_publish" to "published" for all docs with that status? Thanks!

@ctschroeder
Member Author

(@amir-zeldes also I switched AP 18 and 26 from "to_publish" to "review." I see from GitDox commits they have been treebanked. "to_publish" indicates everything is ready, including updated metadata for version # and date; these will still need some metadata changes. Thx!!!)

@amir-zeldes
Member

I can change values in the DB with a SQL statement to change statuses en masse, though maybe this would be a good feature to have built in. If you need me to do something like that, let me know; in the meantime I'll open an issue.
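For reference, the en-masse change could look something like the sketch below. It assumes GitDox sits on a SQLite database with a `docs` table and a `status` column; both names are assumptions about the schema, not its documented layout.

```python
import sqlite3

# Assumed database path and schema; adjust to GitDox's actual layout.
conn = sqlite3.connect("gitdox.db")
with conn:  # commits on success, rolls back on error
    cur = conn.execute(
        "UPDATE docs SET status = ? WHERE status = ?",
        ("published", "to_publish"),
    )
    print(f"{cur.rowcount} documents switched from to_publish to published")
conn.close()
```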

@amir-zeldes
Member

gucorpling/gitdox#123

@ctschroeder
Member Author

OK, I think it's not a high priority if you're willing to do it yourself. So in the meantime, could you please switch everything in GitDox that we published to "published"? It should be everything currently labeled to_publish. Thanks!

@amir-zeldes
Member

Done!
