XPath: An Intro to Functions
Let's think about how XPath can get you just one BIG result or perhaps hundreds of results, depending on how you write it. Try this in the XPath window on any old XML file that has <p> elements for paragraphs:
//p
Say your file has 156 paragraphs: you’ll get 156 different results for //p.
But if you “wrap” that expression in a count() function, you get just one result:
count(//p)
and that is the number 156.
The (//p) inside the parentheses says: walk the whole tree, then do what I tell you, which is get me a count (just one result). There are a bunch of numerical calculations you can do with XPath like that, but count() is one we use a lot, for example to get a raw count of the number of times we marked a <placeName> or a <geo> element in Book 1 of that Forster voyage log.
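If you want to see the "many results vs. one result" idea outside the XPath window, here is a minimal Python sketch. It assumes a tiny made-up sample (not the Forster file) and uses Python's standard xml.etree.ElementTree, which only understands a small subset of XPath, so count() is imitated with len() rather than run as real XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-paragraph sample, not the Forster file.
SAMPLE = """<div>
  <p>We left <placeName>Otaheite</placeName> at dawn.</p>
  <p>The wind was fair.</p>
  <p><persName>Cook</persName> sighted <placeName>Tahiti</placeName>.</p>
</div>"""

root = ET.fromstring(SAMPLE)

# iter("p") walks the whole tree, like //p: one result per paragraph.
paragraphs = list(root.iter("p"))

# len(...) plays the role of count(//p): one number for the whole sequence.
paragraph_count = len(paragraphs)

print(paragraph_count)
```

The list is the "156 different results" situation, and the single number is what wrapping the expression in count() gives you.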
My all-time favorite XPath function is distinct-values(), and you’ll be using it in the next homework. Let’s say you marked up every time someone referred to a popular place like Tahiti, but there were a few different ways of spelling and writing Tahiti (Otaheite was one, and Kahiki is another, for example!). When you and your team got to work on marking up Tahiti references, you wanted to see all the different names used for that place. That’s where you use the distinct-values() function: it gets rid of duplicates and makes you one master list of each distinctly different way of referring to it. So say you had 230 Tahiti references, but only 5 different ways of referring to Tahiti. If you coded each of those like this in your document: <placeName ref="#Tahiti">, you could write your distinct-values() function like this:
distinct-values(//placeName[@ref="#Tahiti"])
It starts like this: //placeName[@ref="#Tahiti"], and then you wrap that expression in parentheses to walk the WHOLE TREE first, and then run the function you put in front:
distinct-values(//placeName[@ref="#Tahiti"])
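Here is a rough Python sketch of what that wrapped expression is doing. The sample XML and its spellings are made up for illustration, and since Python's standard ElementTree has no distinct-values(), the de-duplicating step is imitated with dict.fromkeys().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: three Tahiti references, two spellings, plus one other place.
SAMPLE = """<div>
  <p>We reached <placeName ref="#Tahiti">Otaheite</placeName>.</p>
  <p><placeName ref="#Tahiti">Tahiti</placeName> lay ahead.</p>
  <p>Back to <placeName ref="#Tahiti">Otaheite</placeName> from
     <placeName ref="#Hawaii">Owhyhee</placeName>.</p>
</div>"""

root = ET.fromstring(SAMPLE)

# Like //placeName[@ref="#Tahiti"]: every matching element in the whole tree.
tahiti_refs = root.findall(".//placeName[@ref='#Tahiti']")

# Like distinct-values(...): keep each spelling only once.
spellings = list(dict.fromkeys(el.text for el in tahiti_refs))

print(len(tahiti_refs))
print(spellings)
```

Three references go in, but only two distinct spellings come out, which is the same shrinking you see when 230 references boil down to 5 ways of writing Tahiti.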
Getting the hang of this? Well, here’s one more, different way to write a function, and that’s when we actually want MULTIPLE results, one for every node we land on. Sometimes I use the wildcard * (asterisk) to land me on whatever element is sitting where I tell XPath to go, because I’m trying to find out what all I used as a child or descendant or sibling of something. Let’s say I want XPath to take me to whatever elements I set as the immediate children of <p>. I’d say:
//p/*
But that just shows me the insides of the elements down in the results window, when all I want to see is JUST the element names I used. For that, we have a handy function called name(), which will get you element or attribute names. You actually need to set this at the END of your XPath expression, so that every time you stop on a p element’s child, you stop and ask its name:
//p/*/name()
Try that in our Forster voyage file and get a sense of how those two expressions are different! ALSO get a sense of why we put name() at the end of the XPath expression instead of wrapping the function around the whole thing. We don’t want to walk the whole tree first, and if you try to wrap that expression (//p/*) with name() in front, you’ll get an error message: “A sequence of more than one item is not allowed as the argument of name” or something like that. That’s because name() gets you a single individual name at each node.
Once you’ve returned those element names, wouldn’t you want to get their distinct-values()? Well, you can build up from there: wrap the whole expression, all the way out through name(), inside your function like so:
distinct-values(//p/*/name())
See how those can work together? But name() still has got to sit at the end of an expression to stop on each node and ask its name.
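To make the "local stop" vs. "wrap everything" difference concrete, here is a small Python sketch on a made-up sample. It is only an imitation: asking each child for its tag stands in for name(), and dict.fromkeys() stands in for distinct-values().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a few different child elements inside <p>.
SAMPLE = """<div>
  <p>We left <placeName>Otaheite</placeName> with <persName>Cook</persName>.</p>
  <p><persName>Forster</persName> saw <placeName>Tahiti</placeName> <geo>0 0</geo>.</p>
</div>"""

root = ET.fromstring(SAMPLE)

# Like //p/*/name(): stop on each child of each <p> and ask its name.
child_names = [child.tag for p in root.iter("p") for child in p]

# Like distinct-values(//p/*/name()): collapse the repeats into one list.
distinct_names = list(dict.fromkeys(child_names))

print(child_names)
print(distinct_names)
```

The first list has one name per child (five of them here), and the second has each name just once.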
Remember that predicates are basically filters. You say, find me all the paragraphs that have a persName in them as a descendant like this:
//p[descendant::persName]
and that will return all the paragraphs that have persName children, grandchildren, or great-grandchildren, as deep as the element nesting goes inside the paragraphs.
Now, what if we wanted to stop on <p> elements that did NOT have something? That's where we use the not() function, and we pretty much always use that one inside a predicate filter to return a boolean "not true" condition: Show me the paragraphs that do NOT have any persName descendants like this:
//p[not(descendant::persName)]
I return 446 of these in the Forster file.
Notice you can stack these predicates side by side or join them with "and" or "or": Find me the paragraphs that DON'T have persNames but DO have placeNames:
//p[not(descendant::persName)][descendant::placeName]
In the Forster file I get 231 of these interesting paragraphs, where ONLY places but no people are mentioned.
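Here is how those stacked filters might look if you imitated them in Python on a tiny made-up sample. The helper has_descendant() is a hypothetical name invented for this sketch; it stands in for the descendant:: test inside the predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: places only, neither, both, and a nested place.
SAMPLE = """<div>
  <p>We left <placeName>Otaheite</placeName> at dawn.</p>
  <p>The wind was fair.</p>
  <p><persName>Cook</persName> sighted <placeName>Tahiti</placeName>.</p>
  <p>A note on <hi><placeName>Huahine</placeName></hi>.</p>
</div>"""

root = ET.fromstring(SAMPLE)


def has_descendant(element, tag):
    """True if the element contains a descendant with this name, at any depth."""
    return element.find(f".//{tag}") is not None


# Like //p[not(descendant::persName)][descendant::placeName]:
# both filters must hold for a paragraph to be kept.
places_only = [
    p for p in root.iter("p")
    if not has_descendant(p, "persName") and has_descendant(p, "placeName")
]

for p in places_only:
    print("".join(p.itertext()))
```

Two paragraphs survive: the first one, and the last one, where the placeName is a grandchild rather than a child.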
Finally, there’s the text() node: what’s that? When you stop on an element in XPath and it has mixed content inside (text and other elements), you can actually treat the text inside as one of the children, though XPath doesn’t do that by default. You need to say, ONLY return the text() portion of this element. Take a look in your XPath window in the Forster file at the difference between:
//p/persName
and
//p/persName/text()
Do you see it? The results give you just the inside text of the element persName for that second one.
Now, try something similar with <p> elements:
//p/*
//p/text()
//p/text()[following-sibling::placeName]
That last expression gets you all the text() nodes that have a following-sibling <placeName> element.
Let’s say I want to stop on JUST the first following-sibling <placeName> of those text() nodes:
//p/text()/following-sibling::placeName[1]
Here’s a challenging one, building up from that expression:
//p/text()[following-sibling::*[1]/name() = "placeName"]
What am I doing here? Well, I stepped down into the text(), and I want to find the text() whose first following-sibling element is called "placeName". That’s tricky. Notice, I’m using a comparison operator here.
Comparison operators tell you when things are equivalent to other things, and we typically use them with the results of functions: If the first following-sibling element of a text() node has a name() equal to "placeName", I return that text() node.
OK, that last one was just an example of some surgical precision we can use with XPath! But more typically when we pull out the = sign or greater-than or less-than, we’re doing a comparison of values (usually numerical values, but sometimes to see if there’s a string of text that completely equals what we landed on).
Read about these on the awesome "XPath Functions we use the most" page: http://dh.obdurodon.org/functions.xhtml : Scroll down to the end of the page and read up on General vs. Value comparisons, and let’s take a look at a real life example:
Let’s say we’re hunting for all the <p> elements in the Forster voyage file that have <persName> elements as children or descendants. We’d write:
//p[descendant::persName]
But there are SO many of these! 558 in fact. Let’s try to whittle it down and get only the paragraphs that have MORE THAN 5 persName elements inside. Here’s where we want to use a function to get a count of the persName inside the p elements, first, and then build up our expression, so we take the count and then say the count has to be greater than 5:
//p[count(descendant::persName) gt 5]
See how that works? Now I get 95 results. Be careful of your syntax: try //p/count(descendant::persName) and take a look at the results. (When you set it at the end like that you should get a count for every <p> element!) Then make it a filter with a predicate: get me all the <p> elements THAT HAVE a count greater than or less than or equal to a certain number!
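The difference between putting count() at the end of the path and putting it inside a predicate can be imitated in Python like this. The sample is made up, pers_name_count() is a hypothetical helper named for this sketch, and the threshold is 2 instead of 5 only because the sample is so small.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: paragraphs with zero, one, and three persName elements.
SAMPLE = """<div>
  <p>The wind was fair.</p>
  <p><persName>Cook</persName> gave the order.</p>
  <p><persName>Cook</persName>, <persName>Forster</persName> and
     <persName>Sparrman</persName> went ashore.</p>
</div>"""

root = ET.fromstring(SAMPLE)


def pers_name_count(paragraph):
    """Like count(descendant::persName), evaluated from one <p>."""
    return len(paragraph.findall(".//persName"))


# Like //p/count(descendant::persName): one number per paragraph.
counts_per_paragraph = [pers_name_count(p) for p in root.iter("p")]

# Like //p[count(descendant::persName) gt 2]: the count is now a filter,
# so the results are paragraphs, not numbers.
busy_paragraphs = [p for p in root.iter("p") if pers_name_count(p) > 2]

print(counts_per_paragraph)
print(len(busy_paragraphs))
```

Set at the end, the count gives you a number for every paragraph; set inside the filter, it decides which paragraphs you get back.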
Let's say that you wanted to see all the place names in the file. The first thing you would type would be:
//placeName
Notice how you get a lot of results! 2688 of them, for the Forster file! There is no way there can be that many place names, right? That's where our good friend distinct-values() comes in handy. By wrapping //placeName with distinct-values(), the results will be cut back, this time to 931 results, with
distinct-values(//placeName)
But have we really eliminated all the duplicates? Scroll down the list of place names. See anything weird? How about the way some of the names have random odd spaces inside? Hitting "pretty print" in oXygen sometimes adds those spaces in the text nodes and, well, we don't want that, because it means sometimes a place is coded as "Isle de
France" and sometimes "Isle de France". Good news! There's an easy fix for this that we can apply right here in XPath, and it's the normalize-space() function! You might expect this to work as a global operation, like count() or distinct-values(), so that you would wrap normalize-space() around //placeName, but actually it won't work that way. It needs to make local stops just like the name() function: The function needs to stop at each placeName and check to see if it has spacing issues to correct, like this:
//placeName/normalize-space()
This will return all the place names again but without that weird spacing in some of the text. Notice that these "local stop" functions end with an empty set of parentheses with nothing inside, and that is because we set them to run after we stop on each single node result of an XPath expression. Now, we want the distinct-values() to wrap EVERYTHING!
distinct-values(//placeName/normalize-space())
Huzzah! Notice that we eliminated many of the extra duplicates caused by too many white spaces! We went from 931 to 805 results, and our list is much more regular and tidy now!
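Here is a minimal Python sketch of why tidying each name first changes the distinct count. The sample is made up, and normalize_space() is a hypothetical helper written for this sketch; it approximates XPath's normalize-space() by trimming the ends and collapsing runs of whitespace with Python's str.split().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first place name has a line break and extra spaces inside.
SAMPLE = """<div>
  <p>From <placeName>Isle de
      France</placeName> we sailed on.</p>
  <p><placeName>Isle de France</placeName> again, then <placeName>Otaheite</placeName>.</p>
</div>"""

root = ET.fromstring(SAMPLE)


def normalize_space(text):
    """Trim the ends and collapse each run of whitespace to one space."""
    return " ".join(text.split())


# Like //placeName: the string value of each element, spacing untouched.
raw_names = ["".join(el.itertext()) for el in root.iter("placeName")]

# Like distinct-values(//placeName): spacing variants count as different names.
distinct_raw = list(dict.fromkeys(raw_names))

# Like distinct-values(//placeName/normalize-space()): tidy each name first.
distinct_tidy = list(dict.fromkeys(normalize_space(name) for name in raw_names))

print(len(distinct_raw), len(distinct_tidy))
print(distinct_tidy)
```

The tidying happens once per place name (the local stop), and only then does the de-duplicating step wrap the whole lot.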