Skip to content

Understanding Expressions

Yury Katkov edited this page Jun 30, 2013 · 12 revisions

Understanding expressions.

Introduction

OpenRefine supports "expressions" mostly to transform existing data or to create new data based on existing data, much like how spreadsheet software supports "formulas". There are, however, significant differences between [DocumentationForUsers#Expressions OpenRefine's expressions] and typical spreadsheet formulas.

Consider this sample data set

friend age
1. John Smith 28
2. Jane Doe 33

When you invoke the Transform command on, say, column "friend" and enter an expression, OpenRefine will go through each row in the data (matching facets and filters, if any), and evaluate that expression for that row in order to obtain a result for that row. Whereas in a spreadsheet, you would need to store a different formula in each cell of that column, in OpenRefine, you only need one single expression. And that's made possible through the use of Variables, as explained next.

When evaluated on a row in the example above, the expression can access that row and the cell in the column "friend" of that row through Variables. Think of a variable as a placeholder for something else. For example, there is a variable named "value" that is the placeholder for the cell's content. When the expression is evaluated on row 1, the variable "value" will stand for "John Smith"; when evaluated on row 2, it will stand for "Jane Doe". So, if the expression is

  value.split(" ")[1]

then for row 1, it will yield "Smith" and for row 2, "Doe". That expression splits against the space char found and takes the 2nd part.

IMPORTANT TIP: [1] is equivalent to saying "The 2nd part of an array or list" in GREL since indexing of arrays or lists in Refine actually begins with [0] or "The 1st part of an array or list".

Using Variables, a single expression yields different results for different rows.

Base Column

Note that an expression is typically based on one particular column in the data--the column whose drop-down menu is invoked. A lot of variables are created to stand for things about the cell in that "base column" of the current row on which the expression is evaluated. But there are still variables about the whole row, and through them, you can access cells in other columns.

Languages

Whereas each spreadsheet software has its own formula language, and only one language, OpenRefine is capable of supporting several languages for writing expressions. OpenRefine has its own native language called OpenRefine Expression Language (GREL), but you could also use Jython, or other languages if you install OpenRefine extensions that support them.

Where in GREL you write

  value.split(" ")[1]

in Jython you would write

  return value.split(" ")[1]

For that example the two languages are similar enough, but they don't have to be. On this documentation wiki, we will focus mostly on GREL. If you use another language, like Jython, please refer to its own documentation.

OpenRefine Expression Language (GREL)

Basics

GREL is designed to resemble Javascript. So you can expect these basic things to work, and know how they would work:

example description
value + " (approved)" concatenate two strings; whatever is in value gets converted to a string first
value + 2.239 add two numbers; if value actually holds something other than a number, this becomes a string concatenation
value.trim().length() trimming leading and trailing whitespace of value and than take the length of the result
value.substring(7, 10) take the substring of value from character index 7 up to and excluding character index 10
value.substring(13) take the substring of value from character index 13 until the end of the string

Concatenation

If you're used to Excel, note that the operator for string concatenation is not & but +.

Function syntax

In OpenRefine expression language function can be invoked using either of these 2 forms:

  • functionName(arg0, arg1, ...)
  • arg0.functionName(arg1, ...)

The second form above is a shortcut to make expressions easier to read.

It's only syntactic sugar or shorthand, such as:

dot shorthand notation full notation
value.trim().length() length(trim(value))
value.substring(7, 10) substring(value, 7, 10)
value.substring(13) substring(value, 13)

The first argument to a function can be swapped out in front of the function to formulate the dot notation. That is easier to read as the functions occur from left to right in the order of calling, rather than in the reverse order.

The dot notation can also be used to access member fields:

example description
cell.value same as just value because cell stands for the current cell
row.index index of the current row

For member fields whose names are not words (e.g., they contain spaces or other characters), use the bracket notation:

cells["First Name"] access the cell in the column called "First Name" of the current row

Brackets can also be used to get substrings and sub-arrays, and their syntax resembles Python a bit more than Javascript:

example description
value[1,3] access the substring of value starting from character index 1 up to but excluding character index 3
"internationalization"[1,3] return nt
"internationalization"[1,-2] return nternationalizati (negative indexes are counted from the end)

GREL supports branching and looping (e.g., "if" and "for") slightly differently than Javascript. To do branching and looping, you use GREL "controls", or sometimes also called "constructs", and their syntax is more like Excel's IF:

example description
if(value.length() > 10, "big string", "small string") if the length of what value stands for is great than 10 characters, then return "big string", otherwise, return "small string"
if(mod(row.index, 2) == 0, "even", "odd") if the 2 modulus of the row index is zero, then output "even", otherwise, output "odd"

Thus, the if control has this syntax

if ( test_condition , true_result , false_result )

The test_condition sub-expression is evaluated. If it yields true, then the true_result sub-expression is evaluated; otherwise, the false_result sub-expression is evaluated.

GREL doesn't support looping in the conventional meaning, as supported by for in Javascript. However, you can still process arrays of things using the various for- controls, e.g.,

forEach("Once upon a time in Mexico".split(" "), v, v.length()) return array of lengths of words, [ 4, 4, 1, 4, 2, 6 ]

The forEach control has this syntax

for ( array_subexpr , element_var_name , element_subexpr )

The array_subexpr sub-expression is evaluated to an array. For each element in that array, element_subexpr sub-expression will be evaluated with the variable named by element_var_name standing for that element. In the example above, when v.length() is evaluated for the first element--the string "Once", the variable called v stands for that string "Once", and the sub-expression v.length() evaluates to 4.

Another useful control is with that can be used to define a new variable. For instance, if we want to compute the average word length for the string "Once upon a time in Mexico", then we might start with

  forEach("Once upon a time in Mexico".split(" "), v, v.length()).sum() / "Once upon a time in Mexico".split(" ").length()

We have "Once upon a time in Mexico".split(" ") repeated twice. To shorten that expression, we can define a variable to stand in that sub-expression's place:

  with("Once upon a time in Mexico".split(" "), a, forEach(a, v, v.length()).sum() / a.length())

Thus, the with control has this syntax

with ( subexpr1 , var_name , subexpr2 )

First, the sub-expression subexpr1 is evaluated, and then a new variable called var_name is defined to stand in its place while the sub-expression subexpr2 is evaluated.

Clone this wiki locally