Publication thread for Fall 2018 #19

Closed
10 tasks done
ctschroeder opened this issue Jun 12, 2018 · 72 comments

@ctschroeder
Member

ctschroeder commented Jun 12, 2018

- More Johannes canons? (@eplatte)
- Some Kinds of People Sift Dirt (@cluckmarq)
- God Says Through Those Who Are His (@bkrawiec)

  • Mark (from gold/manual treebanking)
  • 1 Cor (from gold/manual treebanking)
  • Victor (from gold/manual treebanking)
  • AOF (@amir-zeldes treebanked section, new translation for a section)
  • A22 fragment(s) (@amir-zeldes treebanked data)

For automated corpora, we will note in the metadata that tokenization and annotations are fully automated.

Possible:
- AP (if Marina has new AP)
- Besa (@amir-zeldes @somiyagawa) (Besa from two main codices) MOVED to #22
  - needs permission from Heike Behlmer for translation
  - needs to be broken into documents
  - translation needs to be aligned (scraping the translation text will take a few days; the alignment must be done manually)
  - needs metadata

Before publication when checking metadata:

@ctschroeder ctschroeder self-assigned this Jun 12, 2018
@ctschroeder ctschroeder added this to the Fall2018 milestone Jun 12, 2018
@amir-zeldes
Member

RE Besa's letters - there are way more than two - all of MONB.BA and BB, so almost all of the existing letters. We've only processed two so far though, and I'm not sure if we'll run into problems with later ones.

@eplatte
Member

eplatte commented Jun 12, 2018

Yes, I plan to do some more Johannes. I may also have something from Budge from my Coptic reading group at Reed.

@ctschroeder
Member Author

I'm looking at items in Gitdox for publication. Looks like in addition to treebanked material in Mark, 1 Cor, Victor, A22, and AOF we also have Eagerness docs. Is that correct, @amir-zeldes ? Are these newly treebanked Eagerness docs?

We also have a doc from Not Because a Fox barks. Are we republishing this corpus?

Last, there look to be some validation issues. I will go through them and let you know if I have any problems or questions.

(I am skipping the AP, since we will have new AP to publish in the Winter.)

@amir-zeldes
Member

Eagerness has no treebanked documents, so any edits are presumably sporadically noticed errors (probably no more than a handful). If there are no new documents, maybe we should hold off on Eagerness until there is new material - I think more new documents are still coming, right?

Similarly, NBFB may have some tiny corrections, but otherwise nothing new really. I may hold off on it until we are closer to 'one click publication'. The rest have considerable changes due to treebanking and should be re-imported; they are much better quality now.

@bkrawiec
Member

bkrawiec commented Oct 21, 2018 via email

@amir-zeldes
Member

That's good to know, thanks! We could try to squeeze them in, but as I wrote above, the changes are probably minimal, so maybe we should wait until we treebank some of Eagerness.

Another question about Mark/1 Cor - I see failed validations due to missing 'p' annotations. Do we want to require p? If so, in what units? p mainly serves to segment the normalized view for convenience, but for Bible chapters the verses already do a good job of that, so maybe we can drop this requirement for the Bible?

@ctschroeder
Member Author

Hi. I was making a list of things to go over as I reviewed the corpora for publication, and "p" was on the list. It also relates to our decision in DC to minimize the number of visualizations and visualization names. I think we can change the validation to p | vid_n. I am adding vid_n (the CTS URNs at the verse level) to all corpora as they are re-published. The visualizations break the text at p or at v, right? v is the verse number written as a number, and vid_n is the URN for the verse (same span as v). I would prefer the validation to be p | vid_n as a reminder to add those CTS URNs.

@ctschroeder
Member Author

I'm making a list of things that are coming up, that I'll post when I'm done. But two big ones:

  • for 1 Cor and Mark, only parts of the corpora are treebanked - 1 Cor 1-6 and Mark 1-7, correct? Those are the only ones marked as "to_review". So we will have partially treebanked corpora?

  • for corpora being republished due to treebanking (in the list above): are treebanking, tagging, and segmentation gold/checked/auto? Some are listed as treebanking: auto.

@amir-zeldes
Member

Only chapters 1-6 of Mark are treebanked at the moment, same as 1 Cor. The metadata should show everything as gold for the treebanked chapters; for the rest, pos/seg should be 'checked' and parsing 'auto'. The tag/seg/parse metadata is document-wise, so mixed corpora should not be a problem.
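For illustration, a minimal sketch in Python of what that document-wise annotation-status metadata could look like, plus a quick consistency check that treebanked documents are marked fully gold. The document names and the treebanked set here are made up; the real list is on the treebank page.

```python
# Hypothetical per-document annotation-status metadata, following the scheme
# described above: everything gold for treebanked chapters; pos/seg "checked"
# and parsing "auto" for the rest. Document names are placeholders.
doc_status = {
    "sahidica.mark_01": {"tagging": "gold", "segmentation": "gold", "parsing": "gold"},
    "sahidica.mark_08": {"tagging": "checked", "segmentation": "checked", "parsing": "auto"},
}

# Documents assumed to be treebanked (the real inventory lives elsewhere).
treebanked = {"sahidica.mark_01"}

# Flag any treebanked document whose metadata is not fully gold.
for doc, status in doc_status.items():
    if doc in treebanked and set(status.values()) != {"gold"}:
        print(f"{doc}: treebanked but metadata not all gold -> {status}")
```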

@amir-zeldes
Member

Re p-annotations: I went ahead and made the p check corpus-dependent. There is currently no way to make one validation check for either/or, so we'd need a separate rule to require vid_n in the corpora where that's relevant (all corpora?).
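Since an either/or rule isn't expressible in the validator itself, a check like that would have to run outside GitDox over exported documents. A minimal sketch, assuming we already have each document's list of span-annotation layer names (the function and the layer inventories below are hypothetical, not GitDox's API):

```python
def check_p_or_vid_n(doc_name, layers):
    """Pass if the document has a 'p' layer or verse-level 'vid_n' URNs."""
    if "p" in layers or "vid_n" in layers:
        return True
    print(f"{doc_name}: has neither p nor vid_n spans")
    return False

# Example calls with made-up layer inventories:
check_p_or_vid_n("sahidica.mark_02", ["verse_n", "vid_n", "orig_group"])  # passes via vid_n
check_p_or_vid_n("johannes.canons_05", ["orig_group", "norm_group"])      # reports missing spans
```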

@ctschroeder
Member Author

Can you go in and edit the parsing/tagging/segmentation metadata? There are a number of docs marked “review” or “to publish”, and there is currently no way for me to tell which ones in each corpus are treebanked (since the parsing data is elsewhere). I realize a mixed corpus is fine technically with respect to treebanking, but from an annotating/curating point of view, having me make those edits to the metadata is asking for trouble, because some corpora have both treebanked and non-treebanked docs up for review or publication. I think you need to go in and make the changes since there are mixed treebanked corpora.

@ctschroeder
Member Author

I will for sure check the rest of the metadata and add corpus metadata to the corpora that lack it (like 1Cor).

@amir-zeldes
Member

I can do the automation metadata for Sahidica. It's pretty well documented, though; the list of treebanked documents is in the table here:

http://copticscriptorium.org/treebank.html

@ctschroeder
Member Author

Generally the process is most effective and accurate when each annotator adds/edits metadata as they annotate. Otherwise there is a lot of back and forth with the person doing review, or something gets missed. There's no effective way for the person conducting the final editorial review to keep in their head which metadata might change and which might not for each publication thread. The person doing the review (not always me) needs to be able to look at the metadata for obvious errors, like typos or missing fields, but other than version number/date isn't expected to go through each existing field and ask whether the data needs to be changed. I will go ahead and reassign docs back to you for checking the parsing/tagging/segmentation metadata before publication. Thanks!

@amir-zeldes
Member

ⲞⲔ, 1Cor and Mark should be good to go from the NLP metadata perspective. I also corrected any validation errors that are automatically caught, so they're all green, but I'm not sure if there's something we wanted but haven't added a validation for yet.

@amir-zeldes
Member

? I'm not sure I understand the preceding comment - I have no metadata changes to make that I'm aware of. I'm happy to keep NLP metadata up to date as we treebank in the future, but these are fields that didn't exist when the treebanking happened. Sahidica is now up to date.

@ctschroeder
Member Author

If I have any questions about the other mixed corpora besides the Sahidica ones, I'll let you know.

@ctschroeder
Member Author

Thanks for editing the Sahidica ones!

@ctschroeder
Member Author

@amir-zeldes can you tell me who has been treebanking (and then correcting tagging/segmentation) for the AOF, Victor, A22, Mark, 1 Cor texts? I will add their names to the corpus and document metadata. Thanks!

@amir-zeldes
Member

amir-zeldes commented Oct 26, 2018

The new Mark + 1 Cor material is Mitchell. A22 is me and Liz. AOF is just me; Victor is me, Mitchell, and the four Israelis listed here: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/blob/dev/README.md (under acknowledgments)

@ctschroeder
Member Author

ctschroeder commented Oct 29, 2018

Thank you! 1Cor is ready (I also added corpus metadata to GitDox) except for these questions:

  • There were a couple of problems with spans in the layers in two of the 1 Cor docs (cells not merged in the spreadsheet in the _group layers). I merged the cells. Do you need to know which cells I modified for the treebanking data, or is the treebank data unaffected by the spans?

A couple of questions about Mark:

  • Chapter 7 is also marked for review. Do you want to republish or wait? The commit log in GitHub shows one commit, but I do not see any changes when the file is diffed. Could there be saved-but-uncommitted changes in GitDox?
  • Chapter 9 was listed as “to_publish” but, like 7, the reason is unclear; it hasn’t been reviewed for this cycle, and there are no committed changes to the file in GitHub since it was imported into GitDox. (Again, possibly saved-but-uncommitted changes in GitDox?) I changed its status to “review”. Do you want to republish or wait?

@ctschroeder
Member Author

ctschroeder commented Nov 26, 2018 via email

@ctschroeder
Member Author

AOF is done

@ctschroeder
Member Author

Here are the URNs that need addressing. We should either put something in the 404 or somehow redirect. It is probably easier to list them on the 404 page for now; I think any redirect is complex with this application.
urn-table.xlsx

@ctschroeder
Member Author

Also @amir-zeldes can you check the parsing/segmentation/tagging metadata for the files that were treebanked? Not sure I got them right.

  • The grids and metadata validate. TEI export does not always validate on the "to_publish" files; the validation report references line numbers that I can't see.

@ctschroeder
Member Author

Mark is as ready as it will be. @amir-zeldes I think we're good to go.

@amir-zeldes
Member

OK, I went over the AOF exports; they all check out now. The way to see those line numbers is to do a TEI export from the editor, download the file, and find that line number in the file. Usually I look for some nearby word or translation and check the grid instead of trying to figure out the XML, since the error is usually apparent in the grid too.
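If it helps, that lookup can be scripted. A small sketch (the file name and line number below are placeholders) that prints the reported line with a little context, so there is a nearby word to search for in the grid:

```python
# Print the line reported by the validator from a downloaded TEI export,
# plus two lines of context on either side. File name and line number are
# placeholders, not actual values from the report.
line_no = 1234
with open("AOF_export.tei.xml", encoding="utf-8") as f:
    lines = f.readlines()
for i in range(max(0, line_no - 3), min(len(lines), line_no + 2)):
    marker = ">>" if i == line_no - 1 else "  "
    print(f"{marker} {i + 1:>6}: {lines[i].rstrip()}")
```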

XL still needs versification, it seems - did you say you wanted to just follow the translations, or do something else? Once that checks out I think we really are good to go!

Oh, and one more thing, is Besa/Vigilance included?

@ctschroeder
Member Author

ctschroeder commented Dec 6, 2018 via email

@amir-zeldes
Member

AOF can't be published without the verse/vid fields filled in, based on the new schema. I'm happy to put 'undetermined' there, but it does seem odd... If they're another work, then we should consider putting them in another corpus one of these days, since for other works our corpus objects logically correspond to the works, not the codices.

I'm fine waiting on Besa; in the meantime I'll start processing the other corpora so we can get the release published. As for 'review', all that status really means is 'was edited', so maybe we should start calling it something else ('review' sounds like the whole document needs to be reviewed, when really only some tiny changes occurred).

@ctschroeder
Member Author

I can break up XL, but it's one doc in the treebanking corpus so I didn't want to mess with it.
Re GitDox: I hear what you're saying. The thing is, I think more status categories might get confusing, and often there are a lot of details, which are better documented in the GitHub issues. Sometimes even if only a minor correction is made, the full document does get updated before publication, because there is a GitHub issue noting corrections, data model enhancements, etc., that are slated to be added with the next publication of a document. So any documents slated for (re)publication really do all need to be reviewed. Part of the documented workflow on the wiki is that review editors (usually Beth or me) know to look at the GitHub issue for the publication cycle and the GitHub issues for the relevant corpus during review.

Additionally, the changes to a document often take place over long spans of time (e.g., someone makes corrections, a few months later someone adds metadata for new fields we've started using, a few months later someone treebanks it, and then it's finally published). It's hard to create status categories that capture all of that nuance. I think marking documents for "review" and posting the details on the GitHub issues works better, because of the narrative and volume capacity of the issues.

If we make any enhancement, I would suggest adding a link to our GitHub issues in GitDox. But I hesitate to change the status fields, and I also wouldn't want to add another field, because toggling between additional information in GitDox and the GitHub issues will get confusing. I'd prefer keeping all the details on the documents in one place, GitHub, where I can see commit histories and issues together.

@amir-zeldes
Member

amir-zeldes commented Dec 7, 2018

Yeah, I understand - I'm not suggesting we change the statuses right now, but I think we should have a conversation about this again sometime, possibly in parallel with or just before training the new DH specialist.

I just fixed XL: it's 'undetermined' for the URNs and 'x' for the verse number, which should work for conversion and validation for now. AOF validates now, so we are good to go, leaving Besa aside.

@ctschroeder
Member Author

Ok that all sounds good, thanks!!

@amir-zeldes
Member

OK, everything is up on ANNIS now, excluding new Besa for the moment. Please take a look and if all looks well I can set up the ingest and push to GitHub as well.

@ctschroeder
Member Author

Thanks so much, @amir-zeldes. I will look over this tonight and tomorrow. In the meantime, what do you think about the issue of the urns? (#19 (comment))

@amir-zeldes
Member

I had a look at the Excel table - these all seem to be pure URNs, not URLs, so I'm not sure what you mean by 404 above. If they're URLs, we can probably set up the server's Apache to intercept and 404 them, but if we are at the URN level, this is something the repo software needs to handle, no? Does it have any existing functionality to handle redirects or some kind of 404-like scenario?

@ctschroeder
Member Author

What happens is the web application either 1) takes the URN someone types into a box and spits out the corresponding URL, which is operationalized (maybe the wrong word here?) as a list of documents that contain that URN; or 2) takes the URL data.copticscriptorium.org/URN, generated by someone clicking on that link somewhere else or typing it into their browser bar, and likewise generates a page listing the documents that contain that URN.
So, either typing the URN into the box or plugging our data URL plus the URN into your browser will cause the web page to produce an error. I am just now realizing it's not a 404 - it's a "There are no results that match your search" (see here for an example: http://data.copticscriptorium.org/urn:cts:copticLit:besa.aphthonia.monbya ), so my idea was tragically flawed from the start.
I'm not sure how the application handles redirects. I don't really know the code (she says, as if she were a programmer...!). I can search the code for the "no results" message and see if it's in there somewhere. If not, it might be in a database setting.

@ctschroeder
Member Author

Oh hey, the "no results" message is in the index file: https://github.com/CopticScriptorium/cts/blob/master/coptic/templates/index.html
We could modify that message to say: if you're looking for these abc files, go to xyz instead.

I really do not know how the repo would manage a redirect.
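For what it's worth, one lightweight option would be to keep a small mapping of retired URNs and mention the new location in that "no results" message. This is only a sketch: the mapping, the helper function, and the replacement URN shown are all made up for illustration, not the cts application's actual code.

```python
# Hypothetical mapping of retired URNs to wherever those texts now live.
RETIRED_URNS = {
    "urn:cts:copticLit:besa.aphthonia.monbya": "urn:cts:copticLit:besa.aphthonia",  # made-up target
}

def no_results_message(urn):
    """Build the 'no results' text, pointing at a replacement URN if one is known."""
    target = RETIRED_URNS.get(urn)
    if target:
        return f"There are no results for {urn}; this text is now available under {target}."
    return "There are no results that match your search."

print(no_results_message("urn:cts:copticLit:besa.aphthonia.monbya"))
```

The same mapping could later drive a real redirect if the application ever grows that functionality.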

@ctschroeder
Member Author

ctschroeder commented Dec 12, 2018

Meanwhile, these documents look great in ANNIS. Just a couple things.

I made small edits to the following files, so they need to be redone:

  • AOF YA 525-30
  • AOF ZH fragment
  • A22 corpus metadata (not sure where that is stored upon commit to GitHub? I edited the YB 307-320 file)

Also:

  • the Mark corpus still needs to be converted and added (see the earlier comment in this thread (#19) and the checklist at the top).
  • Document names for the Victor and A22 corpora are wrong. I worry there is something wonky with the GitDox conversion. The filenames need to be changed, and I made an issue in the GitDox GitHub repository.
  • The A22 and AOF visualizations should now be the versified one, not the old normalized one.

@amir-zeldes
Member

OK, the SNP bug should be resolved. Mark has also been updated, and we have fresh versions of AOF and A22 as well. I've spot checked them, but please take a look as well.

@amir-zeldes
Member

Regarding the redirect: yes, if it were a URL-based system they could be intercepted at the Apache level, before the app ever sees the request, but the way it's been built, this would require making some code changes. Maybe we should put some renovations to the repo on the agenda for next semester. I think we should talk about prioritization again at some point in January.

@ctschroeder
Member Author

ctschroeder commented Dec 13, 2018 via email

@amir-zeldes
Member

Mmm... I guess you could, sure. Ultimately I'd like a better solution for this, but it might take some time, so this could be a good band-aid.

@ctschroeder
Member Author

ctschroeder commented Dec 14, 2018 via email

@ctschroeder
Member Author

ctschroeder commented Jan 9, 2019

OK, the release is basically done except for a few behind-the-curtain actions:

  • Post TEI, PAULA, relANNIS, tt SGML to GitHub corpora repository (@amir-zeldes)
  • Create release (@ctschroeder or @amir-zeldes)
  • Update [geographic RDF](http://wiki.copticscriptorium.org/doku.php?id=checklist_for_publishing_corpora#update_pelagios_rdf) (@ctschroeder)

@ctschroeder
Member Author

@amir-zeldes is there a way for an admin to batch change status "to_publish" to "published" for all docs with that status? Thanks!

@ctschroeder
Member Author

(@amir-zeldes also I switched AP 18 and 26 from "to_publish" to "review." I see from GitDox commits they have been treebanked. "to_publish" indicates everything is ready, including updated metadata for version # and date; these will still need some metadata changes. Thx!!!)

@amir-zeldes
Member

I can change values in the DB with a SQL statement to change statuses en masse, though maybe this would be a good feature to have built in. If you need me to do something like that, let me know; in the meantime I'll open an issue.
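For reference, the en-masse change could look something like the sketch below. It assumes GitDox sits on a SQLite database with a `docs` table and a `status` column; both names are assumptions about the schema, not its documented layout.

```python
import sqlite3

# Assumed database path and schema; adjust to GitDox's actual layout.
conn = sqlite3.connect("gitdox.db")
with conn:  # commits on success, rolls back on error
    cur = conn.execute(
        "UPDATE docs SET status = ? WHERE status = ?",
        ("published", "to_publish"),
    )
    print(f"{cur.rowcount} documents switched from to_publish to published")
conn.close()
```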

@amir-zeldes
Member

gucorpling/gitdox#123

@ctschroeder
Member Author

OK, I think it's not a high priority if you're willing to do it yourself. So in the meantime, could you please switch everything in GitDox that we published to "published"? It should be everything currently labeled to_publish. Thanks!

@amir-zeldes
Member

Done!
