Based on the extensive work of Paul Fièvre, we have been working on a DraCor-ready version of Théâtre Classique. FreDraCor is intended to be a valid TEI P5 resource.
By now, the 1940 files from the source have been structurally cleaned and
converted. All documents are valid against the TEI-All schema. We copied
some information from the castList to the particDesc section and tried to
preserve as much as possible.
Besides the fact that all texts are out of copyright, the files (including intellectual work represented as markup) are – according to the source files – licensed under a CC BY-NC-SA 4.0 licence.
The corpus can be explored at dracor.org/fre.
… we suggest the following:
- French Drama Corpus (FreDraCor): A TEI P5 Version of Paul Fièvre's "Théâtre Classique" Corpus. Edited by Carsten Milling, Frank Fischer and Mathias Göbel. Hosted on GitHub, 2021–. https://github.com/dracor-org/fredracor
Here are, among others, the most significant modifications performed on the original documents:
- add TEI namespace
- add XML declaration
- replace the teiHeader with a DraCor-specific version while preserving as much as possible of original content
- refine licence statement using current version 4.0 of the given licence and adding URL
- add a particDesc
- transform the @xml:id and @who attributes into proper IDs and ID references
- transform numeral @id into @n on tei:s and tei:l
- replace @id with @corresp at castItem/role
- upper-case tei:l/@part
- remove instances of tei:l/@syll (commented)
- remove unknown attributes from tei:role (commented)
- rename docDate/@value to docDate/@when
- transform addresse element to tei:opener/tei:salute
- transform signature element to tei:signed
- remove empty @type
- add written and print dates where available
- adjust case of character names (#14)
- normalize author names
- add Wikidata IDs for authors and plays (work in progress)
For comprehensive insight into our changes see both the adjustments made on
the dracor branch of the theatre-classique repository and the tc2dracor.xq
transformation script.
Each FreDraCor play is given a DraCor ID (e.g. fre000784). These IDs are
mapped to the Théâtre Classique documents in ids.xml. When a new play from
Théâtre Classique is added to the corpus, a new ID needs to be assigned and
added to ids.xml.
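As an illustration of the ID scheme, the following Python sketch derives the next free ID from a list of existing ones. It assumes that every ID is the prefix fre followed by six zero-padded digits, which is inferred only from the example above. The helper name next_dracor_id is hypothetical and not part of the repository, and the way IDs are actually assigned may differ.

```python
import re

# Assumed ID shape: "fre" plus six digits, inferred from the example fre000784.
ID_PATTERN = re.compile(r"^fre(\d{6})$")


def next_dracor_id(existing_ids):
    """Return the ID that follows the highest number among existing_ids."""
    numbers = []
    for dracor_id in existing_ids:
        match = ID_PATTERN.match(dracor_id)
        if match is None:
            raise ValueError(f"unexpected ID format: {dracor_id!r}")
        numbers.append(int(match.group(1)))
    # Start at 1 when no IDs exist yet (an assumption made for this sketch).
    next_number = max(numbers, default=0) + 1
    return f"fre{next_number:06d}"


print(next_dracor_id(["fre000783", "fre000784"]))  # fre000785
```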
To check the current validation status of the corpus against the tei_all
schema, run ./validate from the root of the repo. (You will need to have
Jing installed for this to work.)
In fact, this script can be used to validate any directory of TEI documents. Just pass the directory as the first argument. For instance, if you have gerdracor checked out next to fredracor, try:
./validate ../gerdracor/tei
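The internals of the validate script are not shown here. As a rough equivalent, the Python sketch below builds and runs a single Jing call for all XML files in a directory. It assumes a jing executable on the PATH and a local RELAX NG copy of the schema; the file name tei_all.rng and both function names are hypothetical.

```python
import subprocess
from pathlib import Path


def jing_command(schema, directory):
    """Build a Jing call that validates every .xml file in directory."""
    files = sorted(str(path) for path in Path(directory).glob("*.xml"))
    # Jing's command line takes the schema first, then the documents.
    return ["jing", str(schema), *files]


def validate(schema, directory):
    """Run Jing and return True if it exits without reporting errors."""
    command = jing_command(schema, directory)
    if len(command) == 2:  # no documents found, nothing to validate
        return True
    return subprocess.run(command).returncode == 0
```

Calling validate("tei_all.rng", "tei") would then report whether all documents in the tei directory pass, under the assumptions stated above.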
For building the FreDraCor documents from the Théâtre Classique sources, a scripted workflow has been set up that processes the original files with an XQuery transformation. To speed up the process, multiple eXist-db instances can be started in parallel using either Podman or Docker. These are the main steps of this workflow:
- start one or more pods (or containers) running eXist-db
- load the transformation XQuery tc2dracor.xq and auxiliary files (authors.xml, ids.xml) to the database(s)
- process each source file by posting it to the transformation XQuery and storing the output to the tei directory
- stop and remove all pods (or containers)
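How the script distributes the source files among several database instances is not described here. One simple possibility, shown purely as an assumption, is to deal the files out round-robin; the function name split_round_robin is hypothetical.

```python
def split_round_robin(source_files, instances):
    """Deal source_files out to the given number of instances in turn."""
    if instances < 1:
        raise ValueError("at least one instance is required")
    # Slice i takes every instances-th file, starting at position i.
    return [source_files[i::instances] for i in range(instances)]


batches = split_round_robin(["a.xml", "b.xml", "c.xml", "d.xml", "e.xml"], 2)
print(batches)  # [['a.xml', 'c.xml', 'e.xml'], ['b.xml', 'd.xml']]
```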
./tc2dracor [options] SOURCE_FILE [SOURCE_FILE...]
The conversion script expects one or more source files as its arguments. These
would typically be files from the xml directory of the checked out dracor
branch of the theatre-classique repository:
./tc2dracor ../theatre-classique/xml/*.{xml,XML}
NOTE: The dracor branch of the theatre-classique repo contains corrections
and amendments to the original source files which the conversion script
relies on but which have not (yet) been adopted upstream.
NOTE: For the attribution of DraCor IDs to work, the file names of the source files need to match the ones of the original documents used in ids.xml (see DraCor IDs).
Display usage information and exit.
Number of pods or containers to start. Default: 1
As an alternative to using containers, an eXist database already running on
localhost can be used by passing its port number. With this option the sources
will be copied to the /db/tc2dracor/sources collection of this database. No
parallel processing will take place.
Directory to write the created TEI files to. Default: ./tei
By default the conversion script uses podman but falls back to docker if
podman is not available. This flag allows you to force the use of docker
even when podman is available.
The script can be run with an optional progress bar shown in the terminal.
Hint: For debugging across multiple containers, the combined log from every pod can be viewed while the conversion is running:
podman logs -f $(cat $(ls -rtd /tmp/tc2dracor-* | tail -1)/containers)
For debugging purposes the logs of all containers are also stored in a temporary
working directory after the transformation has finished. Use the -v option to
see the exact location of these files at the end of the script run.
The transformation process uses the file authors.xml to unify and enrich
author information within FreDraCor. The entries in this file provide a
canonical tei:author element for each author together with the matching
author string in the source documents (in the name elements), e.g.:
<author>
  <author xmlns="http://www.tei-c.org/ns/1.0">
    <persName>
      <forename>Charles</forename>
      <surname>Collé</surname>
    </persName>
    <idno type="isni">0000000121258527</idno>
    <idno type="wikidata">Q2404425</idno>
  </author>
  <name>Charles COLLÉ (1709-1783)</name>
  <name>COLLE, Charles</name>
  <isni>0000 0001 2125 8527</isni>
</author>
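To illustrate how such an entry can be matched against an author string from a source document, here is a minimal Python sketch. It is not the XQuery code used by the transformation. It assumes that the entries sit under a common root element, here called authors, and the function name find_canonical_author is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Assumption: entries sit under a root element, here called <authors>.
SAMPLE = """<authors>
  <author>
    <author xmlns="http://www.tei-c.org/ns/1.0">
      <persName>
        <forename>Charles</forename>
        <surname>Collé</surname>
      </persName>
      <idno type="isni">0000000121258527</idno>
      <idno type="wikidata">Q2404425</idno>
    </author>
    <name>Charles COLLÉ (1709-1783)</name>
    <name>COLLE, Charles</name>
    <isni>0000 0001 2125 8527</isni>
  </author>
</authors>"""


def find_canonical_author(root, source_name):
    """Return the tei:author whose entry lists source_name, else None."""
    for entry in root.findall("author"):
        names = [name.text for name in entry.findall("name")]
        if source_name in names:
            return entry.find(f"{TEI}author")
    return None


root = ET.fromstring(SAMPLE)
author = find_canonical_author(root, "COLLE, Charles")
print(author.find(f"{TEI}idno[@type='wikidata']").text)  # Q2404425
```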
When the transformation script discovers an author that does not yet have an
entry in authors.xml, it creates a new entry, trying to properly identify the
name parts and also looking up the Wikidata ID if the source TEI provides an
ISNI. The new entries are written to the file authors.update.xxxx.xml (where
'xxxx' is the eXist-db port number used to run the transformation). This file
should be merged manually into authors.xml.
See the list of open issues for possible future enhancements.