Home
Notes and explanations for various sub-projects.
Work carried out by Petar Soldo as a LiLa Erasmus intern at the Università Cattolica del Sacro Cuore, CIRCSE, Milan, Italy, in the summer semester of 2019/2020.
The subset as an XQuery variable:
declare variable $docs := (
  "aa-vv-supetarski.xml", "sisgor-g-prosopopeya.xml", "modr-n-navic.xml",
  "marulus-m-carmina008.xml", "sisgor-g-odae.xml", "bunic-j-de-r.xml", "tubero-comm-rhac.xml",
  "andreis-f-epist-nadasd.xml", "benesa-d_epigr03_croala5095251.croala-lat1.xml",
  "gradic-s-oratio.xml", "boskovic-r-ecl.xml", "kunic-r-hymnus-cererem.xml", "milasin-f-viator.xml"
);
- Define a subset of CroALa files and copy it to another directory: create the directory first, then run the XQuery script create-subset-from-selected-files.xq in BaseX.
- Alternatively, clone the croatiae-auctores-latini-textus repository, which already contains the subset.
- Create a database from the subset: createCroALaDBfromsubset.xq
- Create a list of words in the subset: wordlist-from-subset-db.xq
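The copying step can be sketched as follows. This is a minimal illustration and not the content of create-subset-from-selected-files.xq: both directory paths are hypothetical, the list is shortened to two of the subset files, and the source files are assumed to sit in a single directory. It relies on the File Module functions file:create-dir and file:copy as available in BaseX.

```xquery
(: Hypothetical locations; adjust them to the local checkout. :)
declare variable $source := "/path/to/croala/";
declare variable $target := "/path/to/croala-subset/";

(: Shortened list for illustration; the full subset is given in $docs above. :)
declare variable $docs := ("aa-vv-supetarski.xml", "sisgor-g-prosopopeya.xml");

(: Create the target directory first, then copy each selected file into it. :)
file:create-dir($target),
for $doc in $docs
return file:copy($source || $doc, $target || $doc)
```

A database could then be built from the target directory in a separate query, for example with db:create("croalatextussubset", $target); it has to be a separate query because db:create is an updating function.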
1. Inside the TEI/text node of the document, tokenize all text nodes, wrapping words in a w tag and punctuation in a pc tag.
2. Skip all elements with the attribute @ana="editorial".
3. Replace the original TEI/text node with the updated node.
4. Export the files into the subset-tokenized directory.
Tasks 1-3 are performed by the XQuery script subset-tokenize-w-pc.xq. Task 4 is done by the script subset-export-files.xq.
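For the export task, BaseX offers db:export, which writes all documents of a database to a directory. The line below is only a sketch of that idea and not the content of subset-export-files.xq; the database name is the one used in the update query on this page, and the relative target path is an assumption.

```xquery
(: Write every document of the database into the subset-tokenized directory. :)
(: Assumption: the path is resolved relative to the BaseX working directory. :)
db:export("croalatextussubset", "subset-tokenized")
```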
The algorithm outlined above uses a recursive function to distinguish between text() nodes and other nodes:
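declare function local:copy-nodes-filter-text($element) {
  (: editorial elements and g elements are copied unchanged :)
  if ($element[@ana="editorial" or name()="g"]) then $element
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      (: non-text children are processed recursively :)
      if (not($child/self::text())) then local:copy-nodes-filter-text($child)
      (: text nodes are split on whitespace, then into words and punctuation :)
      else for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
  }
};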
The actual tokenization is done with the following function:
declare function local:tokenize-words-pc($token) {
  (: each \w+ run becomes a w element, everything else a pc element :)
  for $part in analyze-string($token, '\w+')/*
  return
    if ($part/name()="fn:match") then element w { $part/string() }
    else element pc { $part/string() }
};
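The XQuery function analyze-string does the essential work here: it splits a string into the parts that match the regular expression and the parts that do not, and returns them in their original order as fn:match and fn:non-match elements, so words and punctuation keep their sequence.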
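To see what analyze-string returns, it can be run on a single token. The sample string below is made up for illustration:

```xquery
(: "cano," consists of one \w+ run followed by a comma. :)
analyze-string("cano,", "\w+")
```

The result is an fn:analyze-string-result element with an fn:match child containing cano, followed by an fn:non-match child containing the comma. local:tokenize-words-pc turns these into <w>cano</w> and <pc>,</pc>.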
The problem: the supplied tag is used on several levels, to mark a whole word supplied by editors or only a part of a word (beginning, middle, end). When just a part of the word is marked as supplied, tokenization will split the word into its parts.
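The effect can be reproduced on an in-memory fragment with the two functions shown earlier. The sample line is invented for illustration and does not come from the subset; the query assumes BaseX, where the elements returned by analyze-string carry the fn prefix.

```xquery
declare function local:tokenize-words-pc($token) {
  for $part in analyze-string($token, '\w+')/*
  return
    if ($part/name()="fn:match") then element w { $part/string() }
    else element pc { $part/string() }
};

declare function local:copy-nodes-filter-text($element) {
  if ($element[@ana="editorial" or name()="g"]) then $element
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if (not($child/self::text())) then local:copy-nodes-filter-text($child)
      else for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
  }
};

(: Invented sample: only the ending of "Caesar" is marked as supplied. :)
local:copy-nodes-filter-text(<l>Caes<supplied>ar</supplied> venit.</l>)
```

Apart from serialization whitespace, the result is <l><w>Caes</w><supplied><w>ar</w></supplied><w>venit</w><pc>.</pc></l>: the single word Caesar ends up in two w elements.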
The solution adopted for this project is to add a preparatory step that removes the supplied tag from the subset documents.
At the same time, we also used the @scope attribute to distinguish types of supplied text (with values "verbum" for the whole word, "incipit" for the beginning, "medium" for the middle, and "finis" for the end).
The additional encoding is described in the TEI header:
<encodingDesc>
  <tagsDecl resp="#NJ">
    <namespace name="#benesa-d_epigr03_croala5095251.croala-lat1">
      <tagUsage gi="supplied">With attribute @scope=verbum: a whole word is supplied.
        With attribute @scope=incipit: beginning of the word is supplied.
        With attribute @scope=medium: letters in the middle of the word are supplied.
        With attribute @scope=finis: end of the word is supplied.
        This description is important for word tokenization.</tagUsage>
    </namespace>
  </tagsDecl>
</encodingDesc>
To remove the supplied tag, a new function is added to the subset-tokenize-w-pc.xq XQuery script (the function is modeled on the local:copy-nodes-filter-text function described above):
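declare function local:copy-nodes-filter-supplied($element) {
  (: a supplied element is replaced by its own text nodes :)
  if ($element[name()="supplied"]) then $element/text()
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      (: non-text children are processed recursively :)
      if (not($child/self::text())) then local:copy-nodes-filter-supplied($child)
      (: text is passed through, with whitespace runs normalized to single spaces :)
      else for $c in tokenize($child, "\s+") return $c
  }
};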
The final XQuery now has two steps:
(: step 1 (inner call): remove supplied tags; step 2 (outer call): wrap words and punctuation :)
for $xml_nodeset in db:open("croalatextussubset")//*:text
return replace node $xml_nodeset
  with local:copy-nodes-filter-text(local:copy-nodes-filter-supplied($xml_nodeset))
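The two steps can be tried out without a database by applying both functions to an in-memory fragment. The sample line and its @scope value are invented for illustration; the functions are the ones shown on this page, and the query assumes BaseX, where the elements returned by analyze-string carry the fn prefix.

```xquery
declare function local:tokenize-words-pc($token) {
  for $part in analyze-string($token, '\w+')/*
  return
    if ($part/name()="fn:match") then element w { $part/string() }
    else element pc { $part/string() }
};

declare function local:copy-nodes-filter-text($element) {
  if ($element[@ana="editorial" or name()="g"]) then $element
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if (not($child/self::text())) then local:copy-nodes-filter-text($child)
      else for $c in tokenize($child, "\s+") return local:tokenize-words-pc($c)
  }
};

declare function local:copy-nodes-filter-supplied($element) {
  if ($element[name()="supplied"]) then $element/text()
  else element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if (not($child/self::text())) then local:copy-nodes-filter-supplied($child)
      else for $c in tokenize($child, "\s+") return $c
  }
};

(: Invented sample: the ending of "Caesar" is marked as supplied. :)
let $line := <l>Caes<supplied scope="finis">ar</supplied> venit.</l>
(: Step 1 removes the supplied tag, step 2 wraps words and punctuation. :)
return local:copy-nodes-filter-text(local:copy-nodes-filter-supplied($line))
```

Apart from serialization whitespace, the result is <l><w>Caesar</w><w>venit</w><pc>.</pc></l>, so the word is no longer split. Note that, as written, local:copy-nodes-filter-supplied keeps only the direct text children of supplied; any child elements inside it would be dropped.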