Releases: CopticScriptorium/corpora
Spring 2019 corpus release
The Spring 2019 Coptic Scriptorium corpus release includes several new documents:
- several more sayings in the Coptic Apophthegmata Patrum (edited & annotated by @MarinaGh)
- additional fragments of Shenoute's sermon Some Kinds of People Sift Dirt (edited & annotated by @cluckmarq, editions provided by David Brakke)
- Besa's letter On Vigilance (edited and annotated by @somiyagawa and others)
- several more fragments of the monastic canons of Apa Johannes (annotated by @ctschroeder & @eplatte, digital edition provided by Diliana Atanassova)
All documents have metadata for word segmentation, tagging, and parsing to indicate whether those annotations are machine annotations only (automatic), checked for accuracy by an expert in Coptic (checked), or closely reviewed for accuracy, usually as a result of manual parsing (gold).
version 2.7.0
date 31 May 2019
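The three annotation statuses described above (automatic, checked, gold) form a natural ordering, which makes it easy to filter documents by how closely a given layer has been reviewed. The following is a minimal sketch only: the metadata keys `segmentation`, `tagging`, and `parsing`, the document titles, and the helper names are assumptions for illustration, not the actual field names used in the corpora.

```python
from enum import IntEnum


class ReviewLevel(IntEnum):
    """Annotation statuses from the release notes, ordered by amount of review."""
    AUTOMATIC = 0  # machine annotation only
    CHECKED = 1    # checked for accuracy by an expert in Coptic
    GOLD = 2       # closely reviewed, usually as a result of manual parsing


def parse_level(value: str) -> ReviewLevel:
    """Map a metadata value such as 'gold' to a ReviewLevel."""
    return ReviewLevel[value.strip().upper()]


def meets_minimum(doc_meta: dict, layer: str, minimum: ReviewLevel) -> bool:
    """Return True if the given annotation layer is reviewed at least to `minimum`."""
    return parse_level(doc_meta[layer]) >= minimum


# Hypothetical metadata records; keys and titles are illustrative only.
docs = [
    {"title": "doc_a", "segmentation": "gold", "tagging": "gold", "parsing": "gold"},
    {"title": "doc_b", "segmentation": "checked", "tagging": "checked", "parsing": "automatic"},
]

# Keep only documents whose parsing layer has been reviewed by a person.
reviewed_parsing = [
    d["title"] for d in docs if meets_minimum(d, "parsing", ReviewLevel.CHECKED)
]
print(reviewed_parsing)
```

Using an ordered enumeration rather than raw strings means "at least checked" can be expressed as a single comparison.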
November 2018 release
Corpora release for fall/winter 2018: version 2.6.0 of Coptic SCRIPTORIUM's corpora. This release includes:
- Expanded Coptic Old Testament text provided by our partners at the Digital Edition of the Coptic Old Testament Project in Göttingen
- More gold-standard treebanked texts (selections from 1 Corinthians, G Mark, Shenoute's Abraham Our Father, Shenoute's Acephalous Work 22, Martyrdom of Victor); see our treebank corpus
- Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22 to bring them in line with our current data models
- New metadata fields to indicate whether documents have been machine annotated only or whether an editor has reviewed the machine annotations
April 2018, version 2.5.0
This release contains new text data contributed by Alin Suciu and Diliana Atanassova as part of the KELLIA project, as well as transcriptions and annotations from various Coptic SCRIPTORIUM project participants. New data in this release includes excerpts from:
- The Canons of Apa Johannes (2,024 words)
- Pseudo-Theophilus On the Cross and The Thief (4,543 words)
- additional Apophthegmata Patrum, bringing the total released to 75 apophthegms (9,413 words)
All texts are also linked word-by-word to the Coptic Dictionary Online (https://corpling.uis.georgetown.edu/coptic-dictionary/).
All corpora now also contain syntactic annotations derived from our tree-banking project. These annotations can be searched using the "func" annotation and visualized as treebanks.
We would like to thank the annotators and translators, without whose work the corpora would not be online. We thank the NEH and DFG for the necessary funding.
April-June 2018, version 2.5.0
This release contains new text data contributed by Alin Suciu and Diliana Atanassova as part of the KELLIA project, as well as transcriptions and annotations from various Coptic SCRIPTORIUM project participants. New data in this release includes excerpts from:
- The Canons of Apa Johannes (2,024 words)
- Pseudo-Theophilus On the Cross and The Thief (4,543 words)
- additional Apophthegmata Patrum, bringing the total released to 75 apophthegms (9,413 words)
All texts are also linked word-by-word to the Coptic Dictionary Online (https://corpling.uis.georgetown.edu/coptic-dictionary/).
All corpora now also contain syntactic annotations derived from our tree-banking project. These annotations can be searched using the "func" annotation and visualized as treebanks.
We would like to thank the annotators and translators, without whose work the corpora would not be online. We thank the NEH and DFG for the necessary funding.
Note: In June 2018 this release was modified by adding the new versions of the automatically processed Sahidic New Testament and Sahidic Coptic Old Testament corpora.
November 2017, version 2.4.0
This release contains new data contributed by Alin Suciu, David Brakke and Diliana Atanassova, as well as out-of-copyright edition material contributed by the Marcion project. New data in this release includes excerpts from:
- The Martyrdom of Saint Victor the General (2033 tokens)
- The Canons of Apa Johannes (438 tokens)
- Pseudo-Theophilus On the Cross and The Thief (2814 tokens)
- Shenoute, Some Kinds of People Sift Dirt (888 tokens)
- 11 additional Apophthegmata Patrum, bringing the total released to 63 apophthegms (7077 tokens)
All texts are also linked to the Coptic Dictionary Online (https://corpling.uis.georgetown.edu/coptic-dictionary/), which has been updated with frequency information including these texts. We would like to thank the annotators and translators of these data sets, several of whom are new to the project, without whose work the corpora would not be online:
Alexander Turtureanu, Alin Suciu, Amir Zeldes, Caroline T. Schroeder, Christine Luckritz Marquis, Dana Robinson, David Brakke, David Sriboonreuang, Diliana Atanassova, Elizabeth Davidson, Elizabeth Platte, Gianna Zipp, J. Gregory Given, Janet Timbie, Jennifer Quigley, Laura Slaughter, Lauren McDermott, Marina Ghaly, Mitchell Abrams, Paul Lufter, Rebecca Krawiec, Saskia Franck and Tobias Paul
Edited in 2018 to add that this release also contains the newly released Coptic Old Testament corpus (which has a June 2017 version date).
April 2017, version 2.3.1
Release V2.3.1
- New versions of apophthegmata.patrum and shenoute.eagerness (I See Your Eagerness)
- Apophthegmata corpus now contains 52 apophthegms, over 7,900 tokens
- Eagerness now contains 17 manuscript parts, over 18,000 tokens
- Numerous corrections and improvements to consistency
December 2016, version 2.2
This corpus release includes new or revised documents for:
- 1 Corinthians: machine and manual annotations; new documents are chapters 13-16; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
- Mark: machine and manual annotations; edits to already published chapters include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
- Not Because a Fox Barks (Shenoute): machine and manual annotations; edits to already published document include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
- Besa letters: machine and manual annotations; edits to already published documents include corrections and modifications to lemmas, normalization, part of speech, and/or tokenization to conform to evolving guidelines
All other documents in our corpora are unchanged from the last release.
New metadata and corpus feature: We are beginning to add to our documents a metadata field called "order", which will allow us to present documents in a logical order for browsing or reading. We've implemented it in the Besa letters corpus and will roll it out for other corpora in the future. When you filter for that corpus, our Document Retrieval web application (data.copticscriptorium.org) now lists its documents in the order in which they appear in the manuscript tradition. Thus, users who wish to read or browse the documents in that order can do so easily.
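As an illustration of how an "order" field can drive a reading-order listing, here is a minimal sketch. It assumes the value is stored as a string holding an integer and that documents without the field should be listed last; both are assumptions, as are the sample titles and the helper name `sort_by_order`.

```python
def sort_by_order(docs: list[dict]) -> list[dict]:
    """Sort documents by their 'order' metadata, placing unordered ones last."""

    def key(doc: dict) -> tuple:
        raw = doc.get("order")
        if raw is None:
            # No order assigned yet: sort after all ordered documents, by title.
            return (1, 0, doc["title"])
        # Convert to int so that "10" sorts after "2" rather than before it.
        return (0, int(raw), doc["title"])

    return sorted(docs, key=key)


# Hypothetical records; titles and values are illustrative only.
docs = [
    {"title": "letter_c", "order": "10"},
    {"title": "letter_a", "order": "2"},
    {"title": "letter_b"},
]

titles_in_order = [d["title"] for d in sort_by_order(docs)]
print(titles_in_order)
```

The numeric conversion matters: sorting the raw strings would place "10" before "2".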
Version control: We have set the version number on our document metadata, corpus metadata (in ANNIS), and release information (in GitHub) all to match. Version numbers and dates are revised only when a document is revised. So if no documents in our AP corpus have been revised and republished, and no new documents for that corpus have been published, then the version number on the documents and corpus does not change. Only new and newly edited documents (and their corpora) will have version 2.2.0 and date 08 December 2016 in their metadata.
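The versioning rule above can be sketched as a small function: only documents that are new or revised in a release receive the release's version and date, and everything else keeps its existing values. The record layout, the titles, the older version and date values, and the function name `apply_release` are hypothetical and shown for illustration only.

```python
def apply_release(docs: list[dict], revised_titles: set[str],
                  version: str, date: str) -> list[dict]:
    """Return copies of `docs`, stamping only revised documents with the new version."""
    updated = []
    for doc in docs:
        if doc["title"] in revised_titles:
            # New or newly edited document: takes the release version and date.
            updated.append({**doc, "version": version, "date": date})
        else:
            # Unchanged document: metadata is carried over as-is.
            updated.append(dict(doc))
    return updated


# Hypothetical records; the earlier version and date are placeholders.
docs = [
    {"title": "doc_revised", "version": "2.1.0", "date": "01 January 2016"},
    {"title": "doc_untouched", "version": "2.1.0", "date": "01 January 2016"},
]

released = apply_release(docs, {"doc_revised"}, "2.2.0", "08 December 2016")
print(released)
```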
July 2016
Added new documents to the following corpora:
- shenoute.eagerness (10 documents, 10,030 tokens)
- apophthegmata.patrum (36 documents, 4,217 tokens)
Edited in 2018 to add: also contains May 2016 release of the Sahidica NT corpus.
December 2015
Added a document to the a22 corpus, added lemmatization to the corpus, and added metadata linking each document to the previous and next documents.
October 2015
Includes all 16 chapters of Mark, with lemmatization.