
09 indexing fields

Gabi Keane edited this page Sep 27, 2023 · 10 revisions

Goals

In this lesson you’ll practice creating eXist-db fields and using them in queries. You’ll also learn how to use Monex, the profiling and debugging utility that comes with eXist-db, to verify that your fields have been created correctly.

What are fields and why do we use them?

Fields in eXist-db make it possible to create, store, and index (for quick retrieval) a transformed alternative view of some of your source data, so that you can operate with either the original form, as it appears in your source XML, or the transformation. For example:

  • If the <publisher> element of a document says “The Times”, you can create an associated field that reads “Times, The”, which you could then use to sort and display a list of publishers in a culturally expected way.
  • If the @when attribute on a date contains a date in ISO format, such as 2023-08-29, you can create an associated field that reads “August 29, 2023”. This would let you use the ISO date for sorting in chronological order and the field for rendering the date in a culturally expected way.

You can query for field values, much as you can query for elements in your original source XML, and the query will use eXist-db indexing (if you’ve configured it to do so; see below) for fast retrieval. In the first example above, you might ask to retrieve the fields associated with each publisher, sort them, and render them in a list, and you would do that by asking for the fields instead of for the original, unmodified <publisher> elements. You can also query for original values and render associated fields. In the second example above, you can retrieve all dates (in original ISO format), sort them, and then render not the ISO dates, but the associated human-readable field values. Because indexing happens automatically when you install an eXist-db app, retrieving field information is as fast as retrieving indexed nodes from your original XML source documents.

You could, alternatively, create the alternative formats above on demand at query time, without using fields. For example, you could retrieve dates in ISO format, as they appear in the XML source, and transform them into a human-readable form before returning a result to your users. Similarly, you could retrieve a publisher name like “The Times” and, before returning the string to your users, move the definite article to the end, after a comma. So why use a field? The answer is that field values are created just once, during indexing, and they can then be retrieved (or used in searching) without performing any additional transformations. This detail illustrates two advantages of using fields:

  • Performing a transformation once obviously requires less total computation (less total time) than performing it every time you need to render a value.
  • End-users don’t see the time it takes to perform an indexing operation, but they do see how long it takes to get a result when they initiate a query. Performing the transformation during indexing time, when users won’t care about how long it takes, means that you don’t have to perform it at query time, and users will appreciate the quicker response time.
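The on-demand alternative described above could be sketched as follows. This is a minimal illustration rather than code from the app: the function names local:invert-article and local:readable-date are hypothetical, and the short list of English articles is an assumption.

```xquery
xquery version "3.1";

(: Hypothetical helpers, not part of the app. They perform at query time
   the transformations that a field would precompute during indexing. :)

(: Move a leading English article to the end: "The Times" -> "Times, The".
   Names without a leading article are returned unchanged. :)
declare function local:invert-article($name as xs:string) as xs:string {
    replace($name, '^(The|An|A)\s+(.+)$', '$2, $1')
};

(: Render an ISO date such as 2023-08-29 as "August 29, 2023".
   The language is set explicitly to English. :)
declare function local:readable-date($iso as xs:string) as xs:string {
    format-date(xs:date($iso), '[MNn] [D], [Y]', 'en', (), ())
};

(
    local:invert-article('The Times'),
    local:readable-date('2023-08-29')
)
```

Run on every request, these calls repeat work that a field would perform only once, at indexing time.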

Rendering latitude and longitude

In this lesson you’ll create a table that lists all places mentioned in the corpus, with table columns for the placename, the latitude and longitude, and, where applicable, a parent (containing) place. The beginning of the resulting table looks like:

[Screenshot: the beginning of the resulting table of places]

The table is built from a gazetteer (this is the standard term for a reference list of places with additional information; the TEI calls it a placeography), a TEI document that contains <place> elements structured like the following:

<place xml:id="drumhead" type="haunted_house">
  <placeName>Drumhead</placeName>
  <location>
    <geo>55.97713888888889 -4.664861111111112</geo>
  </location>
</place>

The gazetteer is located at data/aux_xml/places.xml. The <geo> element for a place contains two whitespace-separated numerical values, and the TEI follows the 1984 World Geodetic System (WGS84) in using the first value to represent the latitude and the second to represent the longitude. Both values are expressed as decimal degrees (WGS84 also permits degree/minute/second notation), and the number of digits to the right of the decimal point in the data for this project ranges from a low of 4 to a high of 15. That variation came about because we entered the values manually and our sources varied in their precision. It means that if you were to retrieve and render those values as they appear (right-aligned, as is customary with numbers), you would wind up with a ragged layout in your table. Users expect a column of numbers to be aligned at the decimal point, and you can meet that expectation, improving the appearance of the table, by retrieving the values and normalizing the widths, padding short values with zeroes and rounding long values, so that all values have the same number of digits to the right of the decimal point. The “correct” precision to use for geolocation depends on the size of the item being located (e.g., city vs. building) and on the intended use. In this lesson we standardize on five digits to the right of the decimal point, which is accurate to within approximately one meter.

Because the <geo> elements in places.xml separate latitude (first) from longitude (second) with exactly one space character, you can isolate latitude by applying substring-before(., ' ') to the <geo> and longitude by applying substring-after(., ' '). You can then use the format-number() function to normalize the values by rounding those with more than five digits to the right of the decimal point and right-padding those with fewer than five digits with zeroes. Because substring-before() and substring-after() evaluate to string values and format-number() requires that its first argument be a number, you’ll also need to use the number() function, which converts a string representation of a number to a value that XQuery recognizes as numeric. The pipeline for latitude thus looks like the following (assuming that the variable $geo holds a <geo> value from places.xml):

$geo ! substring-before(., ' ') ! number(.) ! format-number(., '0.00000')
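As a quick check, the following sketch applies that pipeline, together with the parallel one for longitude, to the <geo> value from the Drumhead example above. The variable names $lat and $long are illustrative only.

```xquery
xquery version "3.1";

(: The <geo> content of the Drumhead <place> shown earlier :)
let $geo as xs:string := '55.97713888888889 -4.664861111111112'
(: Latitude: the part before the space :)
let $lat as xs:string :=
    $geo ! substring-before(., ' ') ! number(.) ! format-number(., '0.00000')
(: Longitude: the part after the space :)
let $long as xs:string :=
    $geo ! substring-after(., ' ') ! number(.) ! format-number(., '0.00000')
return ($lat, $long)
(: returns "55.97714", "-4.66486" :)
```

Both results have exactly five digits to the right of the decimal point, so they line up when right-aligned in a table column.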

Using a field to format numbers

As we note above, you could perform this string surgery when you create the view, but then you would have to run the same operation each time a user requests the table of places. The data for this sample project is small enough that you might not notice the extra computation, but in a larger project it would impinge not only on how quickly the page was returned to the user, but also on the general responsiveness of your server. If, though, you instead create a field that holds that formatted value, the field representation will be precomputed and stored when the app is installed, and therefore available for retrieval without having to run the pipeline above on demand.

Fields are created with <field> elements in the collection.xconf file, which is where you specify indexing rules for your app. Before proceeding with this lesson, check out the documentation on the eXist-db website for Configuring database indexes and for the Full-text index (sometimes called the “Lucene full-text index”), which is the indexing resource responsible for managing fields. (Fields were introduced into eXist-db after the 2014 publication of the eXist-db book and therefore are not discussed there.) Then create the following collection.xconf file in the main directory of your app:

<collection xmlns="http://exist-db.org/collection-config/1.0"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- Configure lucene full text index -->
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
            <module uri="http://www.obdurodon.org/hoaxed" prefix="hoax" at="modules/functions.xqm"/>
            <text qname="tei:place">

                <!-- ==================================================== -->
                <!-- Format number latitude                               -->
                <!--                                                      -->
                <!-- Use the format-number function to format and store   -->
                <!-- latitude                                             -->
                <!-- ==================================================== -->
                <field
                    name="format-lat"
                    expression="tei:location/tei:geo ! substring-before(. ,' ') ! hoax:round-geo(.)"
                    />
                <!-- ==================================================== -->
                <!-- Format number longitude                              -->
                <!--                                                      -->
                <!-- Use the format-number function to format and store   -->
                <!-- longitude                                            -->
                <!-- ==================================================== -->
                <field
                    name="format-long"
                    expression="tei:location/tei:geo ! substring-after(. ,' ') ! hoax:round-geo(.)"
                    />
            </text>
        </lucene>
    </index>
</collection>

The <lucene> element is where you specify the fields you want to configure. The two <analyzer> elements are boilerplate, which is to say that you can copy them as is. The <module> element makes your own user-defined functions, which you’ll write in modules/functions.xqm, available, and we’ll say more about that below.

The <text> element in collection.xconf is where you specify which elements you’d like to index and, in this case, which elements you’d like to enrich with fields. A <field> element child of <text> creates a field associated with the element type matched by the @qname value on the <text> element, which means that this configuration file creates two fields associated with <place> elements. <field> is an empty element with two attributes:

  • @name is a user-specified name that will be used to refer to the field. For example, your “format-lat” field will contain the formatted latitude value, that is, the first part of the <geo> (before the space character) with exactly five digits to the right of the decimal point.
  • @expression is an XPath expression that computes the field value, using the element specified by the @qname attribute on the <text> parent as the current context.

The first <field> element looks like:

<field 
  name="format-lat"
  expression="tei:location/tei:geo ! substring-before(. ,' ') ! hoax:round-geo(.)"
/>

This says that eXist-db should try to create a field called “format-lat” for every <place> element. The value of that field will be computed by navigating from each <place> to all of its <location> children (the schema for places.xml allows at most one), and then from the <location> to all of its <geo> children (also at most one). If no <geo> element is found for a <place>, no “format-lat” field is created for that <place>. If a <geo> is found, eXist-db selects that value, extracts the part before the space (the latitude), and processes it with a user-defined function in the hoax: namespace called round-geo(). That function is defined in the modules/functions.xqm file specified in the <module> element, and it performs the required normalization. Because both latitude and longitude are normalized the same way, and the only difference is whether you start with the substring before or after the space character in the <geo>, the shared code has been extracted to a single function that you then invoke for both fields. This approach means that if you later decide to change the precision (that is, to normalize to some number of digits other than five), you’ll be able to make the change in one place and it will apply to both latitude and longitude.

User-defined functions

XPath comes with more than a hundred functions built in, which may sound like a lot, but it isn’t possible for a standard function library to anticipate all needs of all users. For that reason, XQuery allows users to create their own functions, built out of pieces provided by standard XPath and standard XQuery. We’ll explore user-defined functions in more detail in a later lesson, and for this lesson you’ll create a single user-defined function as a way of streamlining the creation of your geographic coordinate fields.

Now create modules/functions.xqm with the following content:

xquery version "3.1";
(:~
 : This module provides all functions imported into modules
 : in the app, both those called directly to create models
 : and views and those used by collection.xconf to create
 : facets and fields.
 :
 : @author gab_keane
 : @version 1.0
 :)
(:==========
Declare module (hoax), model (m), TEI (tei), and HTML (html) namespaces
==========:)
module namespace hoax="http://www.obdurodon.org/hoaxed";
declare namespace m = "http://www.obdurodon.org/model";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare namespace html="http://www.w3.org/1999/xhtml";

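(:~
 : Normalize a coordinate to five digits to the right of the
 : decimal point, padding with zeroes or rounding as needed.
 :)
declare function hoax:round-geo($input as xs:string) as xs:string {
    format-number(number($input), '0.00000')
};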

XQuery files are of two types: main modules, which are intended to be executed directly, and library modules, which are intended to be imported into other modules. So far all of the XQuery files you’ve created have been main modules, and functions.xqm is your first library module. Library modules must begin with a module namespace declaration, which looks like a regular namespace declaration except that it begins with the keyword module instead of the keyword declare. User-declared functions inside a module must be in a user-declared namespace, and we create a namespace called hoax: for that purpose. The one function you create for this lesson doesn’t use the model (m:), TEI (tei:), or HTML (html:) namespaces, but you’ll write other functions later that do, so you can declare all of those, as well.
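A main module gains access to the functions in a library module by importing it. The following sketch assumes a main module saved in the same modules directory as functions.xqm, so that the relative path after the keyword at resolves to that file; the path would need to be adjusted for a module stored elsewhere.

```xquery
xquery version "3.1";

(: Assumption: this main module is stored alongside functions.xqm,
   so the relative location hint "functions.xqm" resolves to it. :)
import module namespace hoax = "http://www.obdurodon.org/hoaxed"
    at "functions.xqm";

(: Call the imported function with a short latitude string :)
hoax:round-geo('55.9771')
```

The namespace URI in the import must match the one in the module namespace declaration of the library module; the prefix is only a local shorthand.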

Your user-declared function begins with the keywords declare function followed by the name of the function (including the hoax: namespace). Function declarations also include:

  • Function parameters, that is, information about the input expected. The round-geo() function expects one string as its input (the string is created by the substring-before() or substring-after() functions discussed above), and the function declaration here binds that input to a variable called $input (you can call it whatever you want) and specifies (after the keyword as) that it must be a string (atomic types in this context use the namespace prefix xs:, which is predeclared in eXist-db, so you don’t need to declare it yourself).
  • A return type. The as keyword after the parenthesized parameter list specifies the type of the output, and because format-number() is defined as returning a string, the output is declared as xs:string.

Using as with type specifications is technically optional in XQuery, but failing to specify input and output types is reckless. The reason you should always specify datatypes is that if you don’t include those specifications and then accidentally submit or create something of the wrong type, you might wind up with incorrect results without knowing it. The purpose of specifying datatypes is to “turn mistakes into errors”, that is, to require the XQuery processor to notify you if the input or output types are not what they should be.

The user-defined round-geo() function accepts the input value forwarded to it inside the @expression attribute on the <field> element, converts it from a string to a number, and then rounds or pads it to five places to the right of the decimal point. The function then returns the result as a string.
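To see the function and its type declarations at work outside the indexing configuration, you might experiment with a stand-alone copy such as the following. The name local:round-geo is used here only so that the sketch runs on its own without importing functions.xqm.

```xquery
xquery version "3.1";

(: Local copy of the lesson's function, so this sketch is self-contained :)
declare function local:round-geo($input as xs:string) as xs:string {
    format-number(number($input), '0.00000')
};

(
    local:round-geo('55.9771'),            (: too short: padded with a zero :)
    local:round-geo('-4.664861111111112')  (: too long: rounded :)
    (: local:round-geo(55.9771) would raise a type error (XPTY0004),
       because the argument is a number, not the declared xs:string :)
)
(: returns "55.97710", "-4.66486" :)
```

The commented-out call illustrates how the type declaration “turns a mistake into an error”: a number passed where a string is required is rejected instead of being silently processed.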

Reindexing

Indexing rules, including the creation of fields, are part of collection.xconf, which lives in the main directory of your app. When you install your app using the eXist-db package manager, the installation process copies the collection.xconf file to a new location (the details are available in the official documentation links above) and then performs the actual indexing. Importantly, even if you are editing your files in VS Code and synchronizing automatically with your running instance of eXist-db, that only updates the copy of collection.xconf in your main app directory. In order to have new indexing rules take effect you must also copy your revised collection.xconf file to the alternative location and perform a reindexing operation. The easiest way to do that is to rebuild the app locally (by running ant at the command line) and then reinstall it with the package manager.
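If rebuilding and reinstalling feels heavy during development, the copy-and-reindex step can also be sketched as a query run from within eXist-db (for example, in eXide). This is an alternative under stated assumptions, not the procedure this lesson relies on: it assumes that the app is installed at /db/apps/hoaXed, that the query is run by a user with dba rights, and that the configuration collection under /db/system/config already exists because the app has been installed at least once.

```xquery
xquery version "3.1";

(: Assumed app location; adjust to match your installation :)
let $app as xs:string := '/db/apps/hoaXed'
(: eXist-db keeps index configurations under /db/system/config,
   mirroring the path of the collection they configure :)
let $config as xs:string := '/db/system/config' || $app
(: Copy the edited collection.xconf to the configuration collection :)
let $stored := xmldb:store($config, 'collection.xconf',
    doc($app || '/collection.xconf'))
(: Rebuild the indexes for the app collection and report both results :)
return ($stored, xmldb:reindex($app))
```

Changes to modules/functions.xqm also require a reindex before they affect stored field values, because those values are computed only at indexing time.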

Checking your fields

Once you’ve created collection.xconf and modules/functions.xqm, rebuilt, and reinstalled your app (using the package manager), your fields will have been created and you’re ready to use them. You can verify that they have been created properly by launching the eXist-db Monex app, clicking on “Indexes” in the left sidebar, and then clicking on your app. When we do that we see:

[Screenshot: the Monex “Indexes” view for the app]

If you click on the “field” link in the right column you should see something like:

[Screenshot: the Monex view after clicking the “field” link]

Using fields

The last step for using fields is retrieving them during a query. If your XQuery were to retrieve all <place> elements as they appear in the source XML and break the <geo> descendants into latitude and longitude you would get the original values, with inconsistent numerical precision, and wind up with ragged output. Your query, then, needs to retrieve not the <geo> value, but the associated fields that you’ve already created. Because those field values were created during indexing, retrieving them is as fast as retrieving literal data from the source documents. The following modules/places.xql file does that:

xquery version "3.1";
(:=====
Declare namespaces
=====:)
declare namespace hoax = "http://www.obdurodon.org/hoaxed";
declare namespace m = "http://www.obdurodon.org/model";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare namespace html="http://www.w3.org/1999/xhtml";
(:=====
Declare global variables to path
=====:)
declare variable $exist:root as xs:string :=
    request:get-parameter("exist:root", "xmldb:exist:///db/apps");
declare variable $exist:controller as xs:string :=
    request:get-parameter("exist:controller", "/hoaXed");
declare variable $path-to-data as xs:string :=
    $exist:root || $exist:controller || '/data';

declare variable $gazeteer as document-node() :=
    doc($exist:root || $exist:controller || '/data/aux_xml/places.xml');

<m:places>{
for $entry in $gazeteer/descendant::tei:place[ft:query(., (), map{'fields':('format-lat','format-long')})]
let $place-name as xs:string+ := $entry/tei:placeName ! string()
let $parent as xs:string? := $entry/parent::tei:place/tei:placeName[1] ! string()
where $entry/tei:location/tei:geo
order by $place-name[1]
return
    <m:placeEntry>
        {$place-name !  <m:placeName>{.}</m:placeName>}
        <m:geo>
            <m:lat>{ft:field($entry, 'format-lat')}</m:lat>
            <m:long>{ft:field($entry, 'format-long')}</m:long>
        </m:geo>
        {$parent ! <m:parentPlace>{.}</m:parentPlace>}
    </m:placeEntry>
}</m:places>

After the usual housekeeping (namespaces, controller variables, path to the data we care about, which in this case is data/aux_xml/places.xml) the XQuery retrieves all of the <place> elements in the source document with:

$gazeteer/descendant::tei:place[ft:query(., (), map{'fields':('format-lat','format-long')})]

The path expression without the predicate would select all of the <place> elements as they appear in the source XML. The ft:query() function is used for full-text searching with the Lucene full-text index, and in this case you don’t want to search for specific textual content, but you nonetheless need to engage with the Lucene index in order to retrieve the field values, which were precomputed when the app was installed and indexed. The predicate means that after your path expression selects all of the <place> elements you filter them by performing a full-text query, which works as follows:

  • The first argument to ft:query() is the thing within which you’re searching (sometimes called the haystack). In this case the dot means the current context node, so you’ll apply the predicate to each <place> element selected by the path expression, one by one.
  • The second argument to ft:query() is the thing you’re searching for (sometimes called the needle). If you wanted to filter your <place> elements according to their textual content you could specify required content here, but in this case you don’t really want to search for text, and you’re pretending to perform a full-text query only as a way of gaining access to the field values. The second argument is an empty sequence (represented by a pair of parentheses), which means that all context items will match and be selected.
  • The third argument to ft:query() is where you specify fields you want to use. Since you plan to use your precomputed formatted latitude and longitude values, you ask for those fields by name, using the @name attribute value of the <field> element as specified in your collection.xconf file. The third argument to ft:query() is an XPath map, specified as map { name: value }, where the name is the name of what you want to retrieve (in this case fields, a name that eXist-db will recognize) and the value is a sequence of the fields you care about. The name and the values are strings, and therefore quoted.
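For comparison, the following sketch supplies an actual needle as the second argument, so that only places whose text matches the search term are selected, while the third argument still requests the two fields. It assumes that the app is installed at /db/apps/hoaXed and uses the placename from the Drumhead example as the search term; the variable names are illustrative.

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Assumption: the app is installed at /db/apps/hoaXed :)
let $gazetteer := doc('/db/apps/hoaXed/data/aux_xml/places.xml')
(: Haystack: each <place>; needle: the term "drumhead";
   options: the two precomputed fields to make available :)
for $place in $gazetteer/descendant::tei:place[
    ft:query(., 'drumhead', map { 'fields': ('format-lat', 'format-long') })
]
return
    ft:field($place, 'format-lat') || ' ' || ft:field($place, 'format-long')
```

Replacing the search term with an empty sequence, as in the lesson’s query, removes the filtering and leaves only the field retrieval.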

You operate over the <place> elements in your for expression by binding the variable name $entry to each <place>, one at a time. You then extract information from that individual <place> using familiar XPath expressions, except that where you create the latitude and longitude information for the model you can use the field values by invoking the ft:field() function with two arguments: the item for which you’re retrieving a field (which you’ve associated with the variable $entry) and the field you care about at the moment (which is one of the two that you previously retrieved).

Not all <place> elements in places.xml include <geo> descendants, and we decided that the table will look best if it includes only places that have geocoordinates. For that reason, this XQuery uses the where clause to include <place> elements only if they have a <geo> descendant. We also decided to sort the places by name, and because some <place> elements have more than one <placeName> child, we use a numerical predicate to specify that we want to sort by the first <placeName> child.

As mentioned above, you could, alternatively, have skipped the fields, retrieved the original <geo> string, and performed the string surgery and number formatting as you processed the element. The reason that it’s better to precompute a field is that the value has been indexed, which means that the retrieval is much faster than computing the value on demand.

Creating the view (XQuery and CSS)

Now use the following XQuery to create views/places-to-html.xql:

xquery version "3.1";
(:=====
Declare namespaces
=====:)
declare namespace hoax = "http://www.obdurodon.org/hoaxed";
declare namespace m = "http://www.obdurodon.org/model";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare namespace html="http://www.w3.org/1999/xhtml";

(:=====
The function request:get-data() is an eXist-specific XQuery
function that we use to pass data among XQuery scripts via 
the controller.
=====:)
declare variable $data as document-node() := request:get-data();

declare function local:dispatch($node as node()) as item()* {
    typeswitch($node)
        case text() return $node
        case element(m:places) return local:table($node)
        case element(m:placeEntry) return local:row($node)
        case element(m:placeName) return local:placeName($node)
        case element (m:lat) return local:cell($node)
        case element (m:long) return local:cell($node)
        case element (m:parentPlace) return local:cell($node)
        default return local:passthru($node)
};

declare function local:table($node as element(m:places)) as element(html:table){
    <html:table id="places">
        <html:tr>
            <html:th>Placename</html:th>
            <html:th>Latitude</html:th>
            <html:th>Longitude</html:th>
            <html:th>Parent place</html:th>
        </html:tr>
        {local:passthru($node)}
    </html:table>
};
declare function local:row ($node as element(m:placeEntry)) as element(html:tr){
    <html:tr>{local:passthru($node)}</html:tr>
};
declare function local:cell ($node as element()) as element(html:td){
    <html:td>{local:passthru($node)}</html:td>
};
declare function local:placeName($node as element(m:placeName)) as element(html:td)? {
    if (not($node/preceding-sibling::m:placeName))
    then 
        <html:td>{string-join($node/../m:placeName, "; ")}</html:td>
    else ()
};
declare function local:passthru($node as node()) as item()* {
    for $child in $node/node() return local:dispatch($child)
};
local:dispatch($data)

The general transformation from the model to HTML is the same as in the previous lesson: you use recursive typeswitch to transform each element in the model namespace into an appropriate element in the HTML namespace. As always, transformation to HTML presumes a knowledge of the target HTML elements and attributes, so if you aren’t familiar with HTML tables, you can learn about them in the MDN Web Docs HTML tables tutorial.

You want the latitude and longitude columns to be right-aligned, and you can control that with the following additions to your CSS:

table, tr, th, td {
    border: 1px black solid;
    border-collapse: collapse;
}

#places td:nth-child(2), #places td:nth-child(3) {
    /* Right-align second and third columns (lat, long) of places table
     * (only)
     */
    text-align: right;
}

th, td {
    padding: .25em;
}

Note that when you created the HTML <table> element as part of the view you placed an @id attribute on it with the value “places”. A CSS selector step that begins with a hash mark (#) specifies an @id value, so the selector:

#places td:nth-child(2), #places td:nth-child(3)

selects all <td> elements that are the second or third child of their parent, provided that they are also descendants of an element with an @id value of “places”. The point of this specification is that it lets you style those two columns of the table of places without affecting any other tables on your site, since this is the only table you’ll create that will have this particular @id value.

Do field values belong to the model or to the view?

Whether the representation of the geographic coordinates with uniform precision should be considered part of the model or part of the view might be debated, and in this app those representations are present in the model. The engineering reason for that decision is that field values are retrieved from the source XML when the model is constructed and the view then transforms the model without returning directly to the source XML. This means that the precision-normalized field values can be retrieved only when the model is constructed.

Things to watch out for

  • Reindexing: Fields are defined in collection.xconf and created when eXist-db performs indexing at the time that an app is installed. You create the collection.xconf file in the main directory of your app, but during installation eXist-db copies it to a different location before it performs indexing. This means that if you edit collection.xconf (for example, to create new fields) it isn’t enough to sync the changed file with your running installation of eXist-db; you also need to ensure that the new collection.xconf gets copied to that different location and that eXist-db reindexes your collection. The simplest way to do that is to rebuild the app locally (by typing ant in the main directory for your app on the local file system) and reinstall it with the eXist-db package manager.
  • Element-field relationship: The Lucene full-text index, which is where fields are defined in collection.xconf, binds the definitions to particular nodes in the source XML documents. For example, because the @qname value of the <text> element that contains your <field> elements is tei:place, the fields are associated with <place> elements. This means that the field values can be made accessible when you select <place> elements to which you apply ft:query() within a predicate but, perhaps surprisingly, not if you select <geo> or <location>. That is, the value of the field is based on information in <geo>, but the field is a property of the <place> that contains the <geo>, and not of the <geo> itself. That organization fits our understanding of the model as a sequence of places that have properties, and those properties include the precision-normalized latitude and longitude.
  • Incremental development: You did a lot of work to implement the fields: you declared and defined them in collection.xconf, you created a function in modules/functions.xqm, and you used the ft:query() function when you created the model. If you create the fields all at once and they don’t work, you may not know where the error lies. For that reason we started by exploring the XPath needed to create the precision-normalized values (path expression, functions, simple map or arrow operators) in eXide, then created them entirely within collection.xconf, and only then broke out the string and number manipulation into a user-defined function in modules/functions.xqm. We also verified in Monex that the fields were being created before we tried to retrieve and use them.