Note: This corpus is still in beta status.
The English Drama Corpus (EngDraCor) provides TEI documents which have been generated from a selection of dramatic works out of the EarlyPrint.org collection. This selection is being maintained in the engdracor-sources repository.
The following modifications to the original documents have been made:
- markup for words (<tei:w>) and punctuation (<tei:pc>) has been removed
- Wikidata IDs for plays have been added (see index.xml)
- Wikidata IDs for authors have been added (see authors.xml)
- speakers have been identified for selected plays and a list of characters is being added (see JSON files in meta/speakers)
- IDs and ID references reused in the DraCor documents have been sanitized
The number of dramatic works in the EarlyPrint corpus, as provided by its editors, amounted to 853 texts.
In this initial phase, we set aside the 363 texts that lack speaker identification with who attributes in the original markup.
From the remaining texts, we proceed to filter out 73 items which:
- are not dramatic texts, but rather poems (like Shakespeare's The Rape of Lucrece), court entertainments, or masques (like Dekker's Arches of Triumph)
- are collections of multiple plays (like Ben Jonson's Complete Works)
- are sections of plays (like the first part of Dekker's The Honest Whore) or are incomplete
The remaining 433 plays constitute the first version of EngDraCor.
The XSLT workflow depends on the following tools
To update the entire corpus from the sources, run the ep2dracor script like this (assuming you have cloned the engdracor-sources repo to the same parent directory as engdracor):
./ep2dracor ../engdracor-sources/xml/*.xml
You can also update individual files, for instance:
./ep2dracor ../engdracor-sources/xml/A17872.xml
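If you need to update a handful of plays rather than one file or the whole corpus, a small wrapper around the script can help. The following is only a sketch: the function name update_plays and the variables SOURCES_DIR and EP2DRACOR are hypothetical and not part of this repository, and the default paths assume the directory layout described above.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# Assumption: engdracor-sources is cloned next to engdracor and the
# ep2dracor script sits in the current directory.
SOURCES_DIR="${SOURCES_DIR:-../engdracor-sources/xml}"
EP2DRACOR="${EP2DRACOR:-./ep2dracor}"

# Run the transformation for each EarlyPrint ID given as an argument.
# IDs without a matching source file are reported on stderr and skipped.
update_plays() {
  for id in "$@"; do
    src="$SOURCES_DIR/$id.xml"
    if [ -f "$src" ]; then
      "$EP2DRACOR" "$src"
    else
      echo "skipping $id: $src not found" >&2
    fi
  done
}

# Example call (pass further EarlyPrint IDs as additional arguments):
#   update_plays A17872
```

Checking for the source file first keeps a mistyped ID from reaching the transformation; whether ep2dracor itself reports missing files is not something this sketch relies on.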
- add original documents to the engdracor-sources repo, see https://github.com/dracor-org/engdracor-sources#how-to-add-or-remove-plays
- add entries in index.xml providing a unique DraCor ID, a slug and if available a Wikidata ID
- run the XSLT transformation, e.g.
./ep2dracor ../engdracor-sources/xml/*.xml
For scripting or reporting purposes you may want to obtain a simple list of plays included in the corpus. There is a stylesheet to generate such lists from the index.xml file.
# convert index.xml to CSV
saxon -s:index.xml -xsl:list.xsl
# list all DraCor IDs
saxon -s:index.xml -xsl:list.xsl type=id
# list all DraCor slugs
saxon -s:index.xml -xsl:list.xsl type=slug
# list all original EarlyPrint IDs
saxon -s:index.xml -xsl:list.xsl type=sourceid
# list only "vanilla selection"
saxon -s:index.xml -xsl:list.xsl type=slug vanilla=yes
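As an illustration of the scripting use mentioned above, the slug list could be piped into a small check that reports plays without a generated TEI file. This is a sketch under two assumptions that the text above does not confirm: that the stylesheet prints one slug per line for type=slug, and that each play is stored as <slug>.xml in a directory named tei. The function name missing_tei and the variable TEI_DIR are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# Assumption: TEI files are named <slug>.xml and live in TEI_DIR.
TEI_DIR="${TEI_DIR:-tei}"

# Read slugs from stdin (one per line, blank lines ignored) and print
# every slug for which no corresponding TEI file exists.
missing_tei() {
  while IFS= read -r slug; do
    [ -n "$slug" ] || continue
    [ -f "$TEI_DIR/$slug.xml" ] || printf '%s\n' "$slug"
  done
}

# Example call, assuming one slug per output line:
#   saxon -s:index.xml -xsl:list.xsl type=slug | missing_tei
```

If the stylesheet output has a different shape, for example a header line or several columns, the input would need to be trimmed before it reaches the function.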
The EngDraCor TEI files are licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).