This repository has been archived by the owner on Aug 22, 2020. It is now read-only.

XPath: String Functions on Georg Forster Voyage Narrative

AF Hall edited this page Oct 17, 2018 · 35 revisions

In the Georg Forster file, let's find/do the following:

contains(., ' '): (literal text searches only) Find all the date elements that contain a literal square bracket character: [

//date[contains(.,"[")]

The contains() function always requires two arguments in the order haystack, needle (where you're looking, what you're looking for). The second argument is treated as literal characters, never as a regular expression, and is wrapped in quotes "" or ''. For regex patterns, use matches() instead.
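To see the literal-only behavior without opening the Forster file, here is a small sketch that runs on made-up string literals (the sample date text is an assumption, not a quotation from the file). It should evaluate in any XPath 2.0 or later processor and returns a sequence of three booleans:

```xpath
(: The sample string "1773 [March]" is invented for illustration. :)
(
  contains("1773 [March]", "["),       (: true: "[" is looked up as a literal character :)
  contains("1773 [March]", "\d{4}"),   (: false: the characters \d{4} do not literally appear :)
  matches("1773 [March]", "\d{4}")     (: true: matches() reads the pattern as a regex :)
)
```

Going the other way, a bare "[" is not a valid regex, so matches() needs it escaped as "\[" to find a literal square bracket.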


matches(., ' '): (regex patterns, which may also include literal text) Find all the date elements that contain 4 digits together (like a 4-digit year)

//date[matches(.,"\d{4}")]

Find all the persName elements that start with a lower-case letter (note how caret and dollar-sign work in XML nodes: ^ anchors to the very start of the node's string value and $ to its very end):

//persName[matches(.,"^[a-z]")]
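Because the anchors apply to the node's whole string value, leading white space inside an element can make a ^ test fail unexpectedly. A minimal sketch on invented strings (not taken from the Forster file) that returns four booleans:

```xpath
(: All sample strings are made up for illustration. :)
(
  matches("cook", "^[a-z]"),                     (: true :)
  matches("  cook", "^[a-z]"),                   (: false: the string starts with a space :)
  matches(normalize-space("  cook"), "^[a-z]"),  (: true: spaces are trimmed before the test :)
  matches("Cook", "^[a-z]+$")                    (: false: "C" is upper-case :)
)
```

If persName elements in the file begin with a line break or indentation, testing normalize-space(.) instead of . is one way to keep the caret meaningful.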

normalize-space(): (removes extra white space when reading nodes: it trims leading and trailing white space and collapses each internal run of white space to a single space)

//persName[matches(.,"^[a-z]")] ! normalize-space()

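or, as a final path step:

//persName[matches(.,"^[a-z]")]/normalize-space()

(Wrapping the whole path, as in normalize-space(//persName[matches(.,"^[a-z]")]), only works when the path returns at most one node. In XPath 2.0 and later it raises a type error on a longer sequence.)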

substring-before(): (retrieves just a piece that comes before a literal string of text) Find all the persName elements that contain an 's, and then return the substring-before it:

//persName[contains(.,"'s")] ! substring-before(.,"'s") ! normalize-space() => distinct-values()

substring-after(): (like the above, but retrieves a piece that comes just after a literal string)
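Since no example is given for substring-after(), here is a hedged sketch on invented strings showing how it mirrors substring-before(). It returns a sequence of three strings:

```xpath
(: Sample strings are made up for illustration, not taken from the Forster file. :)
(
  substring-before("Cook's ship", "'s"),   (: "Cook" :)
  substring-after("Cook's ship", "'s"),    (: " ship", including the leading space :)
  substring-after("Cook", "'s")            (: "": a zero-length string when the needle is absent :)
)
```

By analogy with the substring-before() expression above, the same idea applied to the file would presumably look like this:

//persName[contains(.,"'s")] ! substring-after(.,"'s") ! normalize-space()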


tokenize(): (breaks apart a string into pieces based on a regex. We often use this with a position() function to grab just the piece we want): Take the longitude readings in Forster, normalize spaces. Then tokenize on a white space of any kind, and take the SECOND token, then filter to return ONLY the tokens that hold one or more digits:

//geo[@select='lon'] ! normalize-space() ! tokenize(., "\s")[2][matches(., "\d+")]
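The reason for normalizing spaces before tokenizing is that "\s" matches one white-space character at a time, so a run of spaces produces empty tokens and shifts the positions. A small sketch on an invented reading (the real format of the geo elements is not assumed here):

```xpath
(: "149  30 W" (two spaces after 149) is a made-up sample value. :)
(
  count(tokenize("149  30 W", "\s")),                  (: 4: "149", "", "30", "W" :)
  tokenize("149  30 W", "\s")[2],                      (: "": the empty token between the two spaces :)
  tokenize(normalize-space(" 149  30 W "), "\s")[2],   (: "30" :)
  tokenize("149  30 W", "\s+")[2]                      (: "30": \s+ treats the run as one separator :)
)
```

The trailing [matches(., "\d+")] predicate then acts as a safety net: if the second token holds no digit, that reading contributes nothing to the result.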

lower-case() and upper-case(): (take a string and convert it to all lower-case or all upper-case) Upper-case all the placeName elements in Forster, after normalizing spaces:

//placeName ! normalize-space() ! upper-case(.)
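A common use of these two functions is comparing strings without regard to case. A minimal sketch on an invented place name (any string would do):

```xpath
(: "Otaheite" is used here only as a sample string. :)
(
  upper-case("Otaheite"),                           (: "OTAHEITE" :)
  lower-case("Otaheite"),                           (: "otaheite" :)
  lower-case("OTAHEITE") = lower-case("Otaheite")   (: true: both sides become "otaheite" :)
)
```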

Bundling strings together:

string-join(): (joins a sequence of strings into one string, with a separator between each pair) String together all the placeNames in the document. Maybe let's normalize the spaces, first.

string-join(//placeName ! normalize-space(), ", ")

or

//placeName ! normalize-space() => string-join(", ")
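If the same place is named many times, the joined list will repeat it. One possible refinement, sketched on a made-up sequence of names, is to pass the strings through distinct-values() first. The order returned by distinct-values() is implementation-dependent, so this sketch also calls sort() (XPath 3.1) to get a predictable result:

```xpath
(: The three names are invented sample data. :)
(
  string-join(("Otaheite", "New Zealand", "Otaheite"), ", "),
  (: "Otaheite, New Zealand, Otaheite" :)

  ("Otaheite", "New Zealand", "Otaheite") => distinct-values() => sort() => string-join(", ")
  (: "New Zealand, Otaheite" :)
)
```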

concat(): (joins individual strings in the order given; it takes as many arguments as you have pieces to put together, and each argument must be a single item, not a sequence) Patch together the first persName and the first placeName in each paragraph that has these.

//p[placeName and persName] ! concat('first place: ', placeName[1], ' first person: ', persName[1]) ! normalize-space()

or

//p[placeName and persName] ! normalize-space(concat('first place: ', placeName[1], ' first person: ', persName[1]))
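To see what concat() does with its arguments apart from the file, here is a sketch on invented strings. It also shows the || operator, which XPath 3.0 and later offer as a shorthand for the same thing:

```xpath
(: "Otaheite" and "Cook" are sample values, not results from the Forster file. :)
(
  concat("first place: ", "Otaheite", " first person: ", "Cook"),
  (: "first place: Otaheite first person: Cook" :)

  "first place: " || "Otaheite" || " first person: " || "Cook",
  (: same result as the concat() call above :)

  concat("first place: ", (), " first person: ", "Cook")
  (: "first place:  first person: Cook": an empty argument counts as a zero-length string :)
)
```

That last case is why the [placeName and persName] predicate is useful: without it, paragraphs missing one of the two elements would still produce a line, just with a blank where the name should be.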

More Reading/Reference/Examples: