Skip to content
thadguidry edited this page Jan 7, 2013 · 9 revisions

Being reviewed by Thad Guidry


NOTE: This page was automatically converted from the old page at Google Code and has not been manually reviewed. Please compare it to the original , correct any errors or omissions, then remove this warning block.


Using Jython as your Expression Language

Full docs on the Jython language are at the official site http://www.jython.org.

Note: The Jython extension has been bundled with OpenRefine since 2.1. Before that it was an extension which needed to be installed separately.

You can use almost any Python (.py)(.pyc) files compatible with the bundled Jython 2.5.1 and drop them into the path. For instance, download, extract and drop in BeautifulSoup.py and use it to parse and extract HTML tags or content using Jython as your expression language in OpenRefine. Since Jython is essentially Java, you can even import Java libraries and utilize those!

  • Refine now has most of the Jsoup.org library built in for parsing and working with HTML elements and extraction

Built-in GREL Jsoup functions

Remember to restart Refine, so that new Jython/Python libraries are initialized during Butterfly's startup.

[StrippingHTML Using Jython and BeautifulSoup to handle Entity Extraction and HTML markup removal]

A few HTML parsing Python libraries to experiment with :

  1. HTMLParser (bundled with Jython in Refine)
  2. BeautifulSoup

A few XML parsing Python libraries:

  1. ElementTree (bundled with Jython in Refine)
  2. lxml will NOT work in Jython, since lxml has C bindings for CPython (regular Python), hence will not work in Refine which is Jython / Java only, and has no CPython interpreter built-in

Expressions in jython must have a return statement, e.g.,

  return value[1:-1]
  return rowIndex%2

Fields have to be accessed using the bracket operator rather than the dot operator

  return cells["col1"]["value"]

Accesing the Levenshtein distance between the reconciled value and the cell value (?):

  return cell["recon"]["features"]["nameLevenshtein"]

It seems as though you can make use of a wide range of Jython structures. To return the lower case of value (if the value is not null):

  if value is not None:
    return value.lower()
  else:
    return None
Clone this wiki locally