EngDraCor

Note: This corpus is still in beta status.

The English Drama Corpus (EngDraCor) provides TEI documents generated from a selection of dramatic works in the EarlyPrint.org collection. The selection is maintained in the engdracor-sources repository.

The following modifications to the original documents have been made:

markup for words (<tei:w>) and punctuation (<tei:pc>) has been removed
Wikidata IDs for plays have been added (see index.xml)
Wikidata IDs for authors have been added (see authors.xml)
speakers have been identified for selected plays and a list of characters is being added (see JSON files in meta/speakers)
IDs and ID references reused in the DraCor documents have been sanitized

Corpus selection

According to its editors, the EarlyPrint corpus contains 853 dramatic works. In this initial phase, we set aside the 363 texts whose original markup lacks speaker identification via who attributes. From the remaining texts, we filtered out 73 items which:

are not dramatic texts, but rather poems (like Shakespeare's The Rape of Lucrece), court entertainments, or masques (like Dekker's Arches of Triumph)
are collections of multiple plays (like Ben Jonson's Complete Works)
are sections of plays (like the first part of Dekker's The Honest Whore), or are incomplete

The remaining 433 plays constitute the first version of EngDraCor.

Updating the corpus

Prerequisites

The XSLT workflow depends on the following tools

XSLT Transformation

To update the entire corpus from the sources, run the ep2dracor script like this (assuming you have cloned the engdracor-sources repo to the same parent directory as engdracor):

./ep2dracor ../engdracor-sources/xml/*.xml

You can also update individual files, for instance:

./ep2dracor ../engdracor-sources/xml/A17872.xml
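
When only a handful of plays have changed, the single-file invocation above can be wrapped in a loop. The following is a minimal sketch and not part of the repository: the function name update_plays and the EP2DRACOR variable are hypothetical, and it assumes that source files are named after their EarlyPrint ID (as in A17872.xml) and that ep2dracor can be called with one file at a time.

```shell
# Hypothetical helper, not part of the repo: run ep2dracor for selected EarlyPrint IDs.
# EP2DRACOR may be set to override the script path; the default is ./ep2dracor.
update_plays() {
  src_dir=$1
  shift
  status=0
  for id in "$@"; do
    file="$src_dir/$id.xml"
    if [ -f "$file" ]; then
      # transform one source document; remember failures but keep going
      "${EP2DRACOR:-./ep2dracor}" "$file" || status=1
    else
      echo "missing source file: $file" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (A17872 is the ID from the example above):
# update_plays ../engdracor-sources/xml A17872
```

The function returns a non-zero status if any source file is missing or any transformation fails, so it can be used in scripts that should stop on error.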

Adding new plays to the repo

add original documents to the engdracor-sources repo, see https://github.com/dracor-org/engdracor-sources#how-to-add-or-remove-plays
add entries in index.xml providing a unique DraCor ID, a slug and, if available, a Wikidata ID
run the XSLT transformation, e.g. ./ep2dracor ../engdracor-sources/xml/*.xml

Tooling

For scripting or reporting purposes you may want to obtain a simple list of plays included in the corpus. There is a stylesheet to generate such lists from the index.xml file.

# convert index.xml to CSV
saxon -s:index.xml -xsl:list.xsl
# list all DraCor IDs
saxon -s:index.xml -xsl:list.xsl type=id
# list all DraCor slugs
saxon -s:index.xml -xsl:list.xsl type=slug
# list all original EarlyPrint IDs
saxon -s:index.xml -xsl:list.xsl type=sourceid
# list only "vanilla selection"
saxon -s:index.xml -xsl:list.xsl type=slug vanilla=yes
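
For reporting, the slug list can be piped into further shell tools. The sketch below reads slugs from standard input and prints those that have no matching TEI file. The function name missing_tei is hypothetical, and the file layout tei/<slug>.xml is an assumption about this repository rather than something documented above.

```shell
# Hypothetical helper: print every slug read from stdin that has no <dir>/<slug>.xml.
# Assumption: TEI files are stored as tei/<slug>.xml.
missing_tei() {
  tei_dir=${1:-tei}
  while IFS= read -r slug; do
    # skip blank lines
    [ -n "$slug" ] || continue
    # report the slug only if the expected file does not exist
    [ -f "$tei_dir/$slug.xml" ] || printf '%s\n' "$slug"
  done
}

# Usage:
# saxon -s:index.xml -xsl:list.xsl type=slug | missing_tei tei
```

An empty output would indicate that every slug in index.xml has a corresponding file under the assumed layout.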

License

The EngDraCor TEI files are licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
