02 data analysis

Gabi Keane edited this page Jun 20, 2023 · 14 revisions

What is exploratory data analysis?

Exploratory data analysis (EDA) is the process of discovery and testing by which you can learn more about your data, discover issues, or begin noticing the patterns that will inform your research. EDA for digital editions focuses on the edition goals, so it’s a crucial moment for us to reflect on what those are, whether our encoding reflects those goals and questions, and how to construct a computational pipeline from our data to our edition.

Project management and your goals

We can revisit our research goals and non-goals now, with those informing our edition-specific goals. It can be helpful to tie your small goals into larger goals, and then those smaller goals can be broken down into tasks. Sometimes I draw this out as a hierarchical tree to illustrate the connections, which helps to ensure that all tasks are connected to research goals. You can find the research goals, non-goals, and edition goals for the laboratory edition in the “Research questions” section of our Slides.

Using XQuery for EDA

XQuery is a query language for XML data. Other programming and database query languages can be used, but are not as effective at dealing with structured text (and especially mixed content, where structured and unstructured text coexist). We’ll be building our app using XQuery in eXist-db, but first we need to explore the data we’ve added to the database to get a better sense of how it’s organized. This exploration will also serve as a quick introduction to XQuery.

In VSCode …

In VSCode, open a new file and title it explore.xql. You can put this file in the modules subdirectory, or create a scratch directory specifically for learning. You’ll notice in this lesson’s corresponding branch we include the final code we arrive at in the titles.xql file, so you can compare and troubleshoot if you experience any issues.

Namespaces

It’s likely that you’re already acquainted with namespaces from your XML encoding, but here’s a refresher from XQuery for Humanists:

Briefly put, namespaces classify XML elements and attributes as belonging to distinct markup vocabularies. In practical terms, namespaces help you to avoid ambiguity in your markup (46).

We highly recommend reading chapter 3.4, “XML Gotchas” early in your development process. While we generally use books like this for reference, the early chapters of this book are an excellent primer (or reminder) of underlying technical principles and pitfalls you might struggle with while learning XQuery.

In this tutorial, we use the following namespaces. You may notice in our commit history that some of these have changed. It can be challenging to remember and become consistent with namespaces, especially early in a project. We recommend keeping central documentation somewhere, to ensure that your choices and changes are kept up to date.

Paste this into your new XQuery file:

xquery version "3.1";
(:==========
Declare namespaces
===========:)
declare namespace hoax = "http://obdurodon.org/hoaxed";
declare namespace m = "http://www.obdurodon.org/model";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare namespace html = "http://www.w3.org/1999/xhtml";
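
To see why the declarations above matter, here is a minimal sketch that runs without the database. The tiny TEI document is a hypothetical stand-in, not project data. It shows that a path step without a prefix looks for elements in no namespace, so it misses TEI elements entirely:

```xquery
xquery version "3.1";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical miniature document, used only for illustration :)
let $doc :=
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <teiHeader>
            <fileDesc>
                <titleStmt>
                    <title>Sample title</title>
                </titleStmt>
            </fileDesc>
        </teiHeader>
    </TEI>
return (
    count($doc//title),     (: no prefix, no namespace: returns 0 :)
    count($doc//tei:title)  (: TEI namespace: returns 1 :)
)
```

If a query unexpectedly returns nothing, a missing or mistyped prefix is one of the first things worth checking.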

Global variables

Global variables help us construct a path to the data, navigating the tree of the database whether the app is installed on our personal computers, a server, or the computer of a random user you distribute it to in the future. Variables like $exist:root and $exist:controller are predefined by eXist-db, but when we declare the variables and provide a value for them to map to (in this case, a text string), it can help us construct a path to our app’s data. You can read more about this on page 200 of Siegel and Retter’s book eXist.

What are values and variables? Let’s look at XQuery for Humanists on page 86:

A value is some piece of information, and a variable represents that value, like a nickname or a placeholder for it.

In order to assign these nicknames, paste this code block below the previous one:

(:==========
Declare global variables to path
==========:)
declare variable $exist:root as xs:string := 
    request:get-parameter("exist:root", "xmldb:exist:///db/apps");
declare variable $exist:controller as xs:string := 
    request:get-parameter("exist:controller", "/hoaXed");
declare variable $path-to-data as xs:string := 
    $exist:root || $exist:controller || '/data';

These aren’t concepts you need to memorize or know well yet. The important pieces here can be summarized as:

  1. $exist:root defines the root directory of the database, so that no matter whose database we’re in, our app will work.
  2. $exist:controller defines the app name. In this case, it’s /hoaXed.
  3. $path-to-data concatenates those other variables together with the extra path step /data.
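
The concatenation in step 3 can be tried on its own. This is a minimal sketch that assumes plain strings in place of the values eXist-db supplies; $root and $controller are hypothetical local names used only here:

```xquery
xquery version "3.1";

(: Hypothetical stand-ins for $exist:root and $exist:controller :)
let $root := "xmldb:exist:///db/apps"
let $controller := "/hoaXed"
(: || is the string concatenation operator :)
return $root || $controller || "/data"
(: returns "xmldb:exist:///db/apps/hoaXed/data" :)
```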

START HERE

Next, paste in these variables:

(:==========
Declare variable
==========:)
declare variable $articles-coll as document-node()+ 
    := collection($path-to-data || '/hoax_xml');
declare variable $articles as element(tei:TEI)+ 
    := $articles-coll/tei:TEI;

We defined the first variable, $articles-coll as a collection of document nodes using the collection() function (more on functions later on), and then we defined $articles by taking another step from the document node using XPath to reach each article’s root element, which we know is tei:TEI (an element called TEI in the tei namespace, which we declared above).

The dollar sign ($) means it’s a variable. By writing as document-node() or as element() we define the type of data that should be assigned to the variable. The plus sign (+) is an occurrence indicator: it means the variable must hold one or more items of that type, so an empty result is a type error.
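
The plus sign is one of three occurrence indicators in XQuery. The sketch below uses hypothetical integer variables, not project data, to show how each one constrains the number of items a variable may hold:

```xquery
xquery version "3.1";

let $one   as xs:integer  := 1          (: no indicator: exactly one :)
let $maybe as xs:integer? := ()         (: ? means zero or one :)
let $any   as xs:integer* := ()         (: * means zero or more :)
let $some  as xs:integer+ := (1, 2, 3)  (: + means one or more :)
return (count($one), count($maybe), count($any), count($some))
(: returns 1, 0, 0, 3 :)
```

Changing $some to the empty sequence () should produce a cardinality error, which is a quick way to see what that kind of message looks like.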

The “nickname” in this case is $articles, but the word “articles” isn’t meaningful to the database. It is meaningful to us: it helps us remember what we are referring to. In its place, we could use $docs, or something less useful to us like $foo. The important thing is that we wrote it between declare variable and the := symbol. := doesn’t have an official name, so XQuery for Humanists calls it a “variable binding symbol”. In some other languages, like Python, it’s called a “walrus operator” because it looks like a walrus. It’s not technically an operator in XQuery, so in this tutorial we propose that “walrus symbol” will do.

Before moving on from this section, play around with removing and replacing elements of the syntax in titles.xql to produce different error messages. When you execute the file using the View > Command Palette tool, notice the kinds of error messages you get. By reading the error messages, you can begin to practice debugging your own code. Data type, syntax, and cardinality errors are the most common for beginners!

Introducing FLWOR

We recommend Michael Kay’s [FLWOR tutorial](https://www.stylusstudio.com/xquery-flwor.html). You should go read it before coming back here to start writing some FLWOR expressions!

Retrieve data with for loop

Let’s try returning something for every article in our group of articles:

for $article in $articles

return 1

A new way to declare a variable! Instead of using the walrus symbol and declare variable, we let the for loop define it for us. If we wanted to read it out loud, we might say, “for every item in the variable $articles, assign it to a variable called $article and then return the number 1”. Kind of a weird thing to do, right? How many “1”s are we going to get from this? Why?

Try executing this in your VSCode. Did you get what you expected?

If you got 36 “1”s, that’s right. If not, pause here and do some troubleshooting by checking against the code above, checking your connection to the database, and running the file again after each change you make.

We get 36 “1”s because there are 36 articles, or to be more precise, there are 36 items in the sequence bound to $articles, and the for loop assigns each one in turn to $article. What will happen if we return $article instead?
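
The same behavior can be observed without the database. This minimal sketch assumes a hypothetical sequence of three strings in place of the real articles:

```xquery
xquery version "3.1";

(: Hypothetical stand-in for the real $articles sequence :)
let $articles := ("first", "second", "third")
return (
    (: one 1 per item: returns 1, 1, 1 :)
    (for $article in $articles return 1),
    (: the items themselves: returns "first", "second", "third" :)
    (for $article in $articles return $article)
)
```

The loop produces one result per item, whatever the return clause says, so the number of results matches the number of items in the sequence.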

So this is a good first step: we confirmed that the correct number of documents are in the corpus, and we know how to navigate to them and fetch them using the file path.

We counted them informally, but we can also use a function to count them.

Functions

Functions take input and give output. You can read more about how functions work in “Section 4.4: XPath Functions” on page 63 of XQuery for Humanists.

Let’s comment out our for loop for now and instead use the count function. Comments in XQuery look like smiley faces, and they prevent any code inside them from being executed.

(: for $article in $articles 
return :)
fn:count($articles)

Notice how we also commented out return. You can only use return as part of a FLWOR, so your statement must include a for or a let in order for you to use return. We’re actually only using XPath to do this counting.

The fn:count() function is one we get for free with XPath. It can also be written without its namespace prefix, as count(), because the prefix is not required. You’ll notice that XQuery for Humanists uses the prefix to help differentiate between these built-in functions and the functions you will learn to define later on. The W3C publishes a full list of XPath and XQuery functions and operators.
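
As a quick check that both spellings call the same function, here is a small sketch over a hypothetical sequence of strings:

```xquery
xquery version "3.1";

(: Hypothetical data, used only for illustration :)
let $items := ("a", "b", "c")
return (fn:count($items), count($items))
(: returns 3, 3 :)
```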

Finding article titles

So far, we’ve covered variables, for loops, and functions. These XQuery features helped us answer the question: “Do we have the complete data set?”

Next, we can use XQuery to begin exploring one of our research goals: to publish the articles in a way that helps us answer research questions. Titles are going to play an important role later on in navigating these articles. Though we can see in the XML that they have unique IDs, these won’t really help human readers identify topics or articles that relate to their work.

Let’s remove the lines we were just working with, and replace them with this and execute:

for $article in $articles

return $article//tei:titleStmt/tei:title ! fn:string(.)

We can read this in human language as “for every item in the $articles variable, assign a new variable called $article and return the element called tei:title (which is a child of element tei:titleStmt, which is a descendant of $article). Then, before giving me the output, make this the input to a function called string which will remove the markup and return only a series of strings.”

Honestly that’s a mouthful, let’s zoom out a little bit.

There are 36 items in $articles. For each one you find, reach down its XML tree and fetch its tei:title element. Return the text contained in that element, without its markup.

The simple map operator (!)

The ! is the simple map operator: it evaluates the expression on its right once for each item in the sequence on its left. Let’s play around with what we wrote above to see why we might want to use this.

for $article in $articles

return fn:string($article//tei:titleStmt/tei:title)

Written this way, we get the same output. That’s great! But if we want to apply three functions to our output, we end up nesting functions, which can become confusing. Instead, we use the simple map operator and provide a . as input for our function. The . denotes the context item (the current item in the sequence), so the function receives the correct input for each item.
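
To compare the two styles side by side, here is a minimal sketch with two hypothetical title strings. It applies normalize-space() and upper-case(), both standard functions, first nested and then chained with the simple map operator:

```xquery
xquery version "3.1";

(: Hypothetical titles with stray whitespace :)
let $titles := ("  the first hoax ", " a second hoax  ")
return (
    (: nested calls are read from the inside out :)
    (for $t in $titles return upper-case(normalize-space($t))),
    (: simple map is read from left to right :)
    $titles ! normalize-space(.) ! upper-case(.)
)
(: both halves return "THE FIRST HOAX", "A SECOND HOAX" :)
```

One practical difference is that fn:string() accepts at most one item as its argument, so the nested form raises a type error if a path selects more than one element, while the simple map form calls the function once per item.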

Adding back some XML

This is an XML database, and we know that eventually we’ll want to make this into an HTML list. For now, let’s wrap it in an <m:list> element as a placeholder. At the top of the XQuery document, you can see that the m namespace maps to a URL we defined with /model in the path. We will talk more about designing and using a model in later stages, but it’s good to get in the habit of using namespaces now.

(:==========
Address each article and output one list element
==========:)
<m:list>
{for $article in $articles
return $article//tei:titleStmt/tei:title ! fn:string(.)}
</m:list>

XML is valid XQuery, so we can just write those elements into the XQuery document without any special functions. Convenient, right? But if we do that, we have to use curly braces {} to denote where the executable XQuery begins and ends. Play around with removing the braces. What kind of output will you see?
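
The role of the braces can be seen in a small sketch that does not need the database. It uses a trivial arithmetic expression rather than the title query:

```xquery
xquery version "3.1";
declare namespace m = "http://www.obdurodon.org/model";

(
    (: with braces the expression is evaluated: the element contains 2 :)
    <m:list>{1 + 1}</m:list>,
    (: without braces it is literal text: the element contains "1 + 1" :)
    <m:list>1 + 1</m:list>
)
```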

With the braces, our output looks like one element with one long list of titles inside it. This isn’t very easily addressable, or even human readable, and it would be clearer if each title had its own element. Let’s try that.

(:==========
Address each article, output one list element containing item elements
==========:)
<m:list>
{for $article in $articles
    return 
        <m:item>
            {$article//tei:titleStmt/tei:title ! fn:string(.)}
        </m:item>
}
</m:list>

Again, take a few minutes to play around with breaking this code. Read the error messages. If you don’t understand why you got a particular error, try googling it. Breaking your code on purpose is a great way to gain experience fixing it.

Using let

We can begin defining variables using let next, as we build this title list into something usable in our edition. Let’s define the strings we just made as variables, so it’s easier to wrap them in XML elements.

(:==========
Address each article, bind its title to a variable, output one list element containing title elements
==========:)
<m:list>
{for $article in $articles
    let $title as xs:string := $article//tei:titleStmt/tei:title ! fn:string(.)
    return
            <m:title>{$title}</m:title>
}
</m:list>

Now we have a formula we can copy to write legible, useful variables and return them as model elements. Let's add more:

(:==========
Address each article, output one list element containing item elements, which hold title and date elements
==========:)
<m:list>
{for $article in $articles
    let $title as xs:string := $article//tei:titleStmt/tei:title ! fn:string(.)
    let $year as xs:string := $article//tei:sourceDesc//tei:bibl//tei:date/@when ! fn:string(.)
    return 
        <m:item>
            <m:title>{$title}</m:title>
            <m:date>{$year}</m:date>
        </m:item>
}
</m:list>
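
If you want to experiment with this pattern away from the database, the sketch below runs the same FLWOR over two hypothetical inline articles. The titles, dates, and document structure are invented for illustration and are much simpler than the real files:

```xquery
xquery version "3.1";
declare namespace m = "http://www.obdurodon.org/model";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Two hypothetical articles standing in for the real collection :)
declare variable $articles as element(tei:TEI)+ := (
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <teiHeader><fileDesc>
            <titleStmt><title>First sample article</title></titleStmt>
            <sourceDesc><bibl><date when="1800"/></bibl></sourceDesc>
        </fileDesc></teiHeader>
    </TEI>,
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <teiHeader><fileDesc>
            <titleStmt><title>Second sample article</title></titleStmt>
            <sourceDesc><bibl><date when="1801"/></bibl></sourceDesc>
        </fileDesc></teiHeader>
    </TEI>
);

<m:list>{
    for $article in $articles
    (: as xs:string requires exactly one title and one date per article :)
    let $title as xs:string := $article//tei:titleStmt/tei:title ! fn:string(.)
    let $year as xs:string := $article//tei:sourceDesc//tei:bibl//tei:date/@when ! fn:string(.)
    return
        <m:item>
            <m:title>{$title}</m:title>
            <m:date>{$year}</m:date>
        </m:item>
}</m:list>
```

Adding a second title element to one of the sample articles, or deleting its date, should trigger a cardinality error on the matching let clause. That makes this a convenient place to practice reading those messages.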

Summary

With XQuery, we can begin the process of exploratory data analysis with our edition data. If you’re new to XQuery and XPath, this may be a good point to pause and become more familiar with these languages. While we continue to introduce new concepts, patterns, functions, and development practices in the next tutorials, we will not provide review at each stage. In the next steps, we begin to solidify our data model and transform it into HTML.