MinorThird Basic Concepts

linfrank edited this page Aug 14, 2012 · 2 revisions

Basic Concepts

A Bird's Eye View of MinorThird

In MinorThird, a collection of documents is stored in a TextBase. Annotations about these documents are stored in a corresponding TextLabels object. Each annotation asserts a category or property for a word, a document, or a subsequence of words (a.k.a. a Span). TextLabels store information from many sources: they might hold annotations produced by human labelers (perhaps using a GUI tool like the TextBaseEditor), annotations produced by a hand-written program, or annotations produced by a learned program. Multiple TextLabels can annotate a single TextBase, if necessary.

Annotated TextBases can be stored in many ways, so a "repository" can be configured to hold a bunch of TextLabels and their associated TextBases. TextLabels in the repository are loaded with the FancyLoader. TextLabels and TextBases can also be loaded directly with the TextBaseLoader and the TextBaseEditor.

Moderately complex annotation programs can be implemented with Mixup, a special-purpose annotation language which is part of MinorThird. Mixup can also be used to generate features for learning algorithms. A sequence of Mixup commands can be combined in a MixupProgram. The MixupDebugger is a GUI tool for testing a MixupProgram.

MinorThird contains a number of methods for learning to extract Spans from a document, or learning to classify Spans. Top-level programs for conducting learning experiments and for training, testing, and applying Annotators can be found in the edu.cmu.minorthird.ui package. The edu.cmu.minorthird.ui.Help class is a main program that, when invoked, lists the relevant main methods.

Under the hood, learning is performed using classes from inside the edu.cmu.minorthird.classify package. A ClassifierLearner learns a Classifier from a set of labeled Examples, usually stored in a Dataset. Several sequential classification algorithms are also implemented in the package edu.cmu.minorthird.classify.sequential. The classify package is independent of the edu.cmu.minorthird.text package, but linked to it by the routines in edu.cmu.minorthird.text.learn. Most importantly, the SpanFE class implements what is essentially a small feature extraction sub-language, embedded in Java, which makes it possible to easily generate a wide variety of features of a document, token, or Span. This language is even more powerful because it can base features on annotations stored in TextLabels that are associated with the Span.

Storing and Manipulating Annotated Text using the edu.cmu.minorthird.text Package

TextToken is a "token" (usually a single word in a document), plus some additional information that allows one to find out where this word/token occurred. Specifically, one can recover the string that contained the token, a shorter string identifier of this "document" string, and the character offsets of the token - i.e., where it appeared in the document string.
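The idea of a token that remembers where it came from can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `TokenSketch`, its fields, and the "runs of word characters" tokenization rule are hypothetical and are not the MinorThird `TextToken` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the idea behind TextToken; not the MinorThird class.
class TokenSketch {
    final String documentId; // short identifier of the document string
    final String document;   // the full string that contains the token
    final int lo;            // character offset where the token starts
    final int length;        // number of characters in the token

    TokenSketch(String documentId, String document, int lo, int length) {
        this.documentId = documentId;
        this.document = document;
        this.lo = lo;
        this.length = length;
    }

    // Recover the token text from the document string and the offsets.
    String text() {
        return document.substring(lo, lo + length);
    }

    // Assumed rule for this sketch: each run of word characters is one token.
    static List<TokenSketch> tokenize(String documentId, String document) {
        List<TokenSketch> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(document);
        while (m.find()) {
            tokens.add(new TokenSketch(documentId, document, m.start(), m.end() - m.start()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print each token with its document id and character offsets.
        for (TokenSketch t : tokenize("doc1", "Hello, MinorThird world")) {
            System.out.println(t.documentId + " [" + t.lo + "," + (t.lo + t.length) + ") " + t.text());
        }
    }
}
```

Storing offsets rather than a copy of the text keeps every token tied to its exact position in the document string, which is what makes the ordering and span operations described in this section possible.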

Span is a sequence of adjacent TextToken objects from the same document.

Span and TextToken are considered to be inherently ordered. If two Span or TextToken objects are from different documents, they are in lexicographical order based on the identifiers of those documents. Within a single document, TextToken objects are ordered according to their position in the document, and Span objects are ordered according to their leftmost TextToken (using the rightmost TextToken to break ties).
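The ordering rule for spans can be written as a three-level comparator. The sketch below is a minimal illustration, not MinorThird code: `SpanSketch` and its token-index fields are hypothetical, and it assumes that ties on the leftmost token are broken by putting the span with the earlier rightmost token first, a direction the text does not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SpanOrderSketch {
    // Hypothetical span: tokens loToken..hiToken (inclusive) of one document.
    static final class SpanSketch {
        final String documentId;
        final int loToken;
        final int hiToken;

        SpanSketch(String documentId, int loToken, int hiToken) {
            this.documentId = documentId;
            this.loToken = loToken;
            this.hiToken = hiToken;
        }

        @Override
        public String toString() {
            return documentId + "[" + loToken + "," + hiToken + "]";
        }
    }

    // Document id first (lexicographic), then leftmost token, then rightmost token.
    // Ascending order on the rightmost token is an assumption of this sketch.
    static final Comparator<SpanSketch> ORDER =
            Comparator.comparing((SpanSketch s) -> s.documentId)
                    .thenComparingInt(s -> s.loToken)
                    .thenComparingInt(s -> s.hiToken);

    public static void main(String[] args) {
        List<SpanSketch> spans = new ArrayList<>();
        spans.add(new SpanSketch("b", 0, 1));
        spans.add(new SpanSketch("a", 2, 3));
        spans.add(new SpanSketch("a", 2, 2));
        spans.add(new SpanSketch("a", 0, 5));
        spans.sort(ORDER);
        System.out.println(spans);
    }
}
```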

TextBase is a collection of tokenized document strings, accessible as Span objects.

TextLabels contains markup for a TextBase. This markup can consist of

  • String-valued properties of individual TextToken objects (i.e., individual occurrences in the TextBase of words).
  • String-valued properties of spans of TextToken objects in the TextBase.
  • Groupings of Span objects into types. A Span can belong to multiple types, and unlike properties, it is possible to quickly find all spans of a given type in a TextLabels, or find all spans of a given type in a specific document.

There are a couple of different varieties of TextLabels objects. A TextLabels can only be read, not modified. MonotonicTextLabels can be modified by changing attribute values, adding new attribute values, or adding a Span to a type; however, a Span cannot be removed from a type. A plain old MutableTextLabels allows spans to be removed from a type as well (i.e., it is mutable).
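The three levels of access (read-only, add-only, and fully mutable) fit naturally into a chain of interfaces, each extending the previous one. The following is a rough sketch of that design, assuming hypothetical names (`ReadOnlyLabels`, `MonotonicLabels`, `MutableLabels`, `BasicLabels`) and representing spans as plain strings; it does not reproduce the MinorThird interfaces or their method signatures.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LabelsVarietiesSketch {
    // Read-only view: markup can be queried but not changed.
    interface ReadOnlyLabels {
        Set<String> spansOfType(String type);
    }

    // Monotonic view: markup can be added, never removed.
    interface MonotonicLabels extends ReadOnlyLabels {
        void addToType(String span, String type);
    }

    // Fully mutable view: spans can also be removed from a type.
    interface MutableLabels extends MonotonicLabels {
        void removeFromType(String span, String type);
    }

    // One implementation can be handed out under any of the three views.
    static final class BasicLabels implements MutableLabels {
        private final Map<String, Set<String>> byType = new HashMap<>();

        @Override
        public Set<String> spansOfType(String type) {
            Set<String> spans = byType.getOrDefault(type, Collections.<String>emptySet());
            return Collections.unmodifiableSet(spans);
        }

        @Override
        public void addToType(String span, String type) {
            byType.computeIfAbsent(type, k -> new TreeSet<>()).add(span);
        }

        @Override
        public void removeFromType(String span, String type) {
            Set<String> spans = byType.get(type);
            if (spans != null) {
                spans.remove(span);
            }
        }
    }

    public static void main(String[] args) {
        BasicLabels labels = new BasicLabels();
        MonotonicLabels addOnly = labels;  // callers holding this view can only add
        ReadOnlyLabels readOnly = labels;  // callers holding this view can only query
        addOnly.addToType("doc1:0-1", "name");
        System.out.println(readOnly.spansOfType("name"));
        labels.removeFromType("doc1:0-1", "name");
        System.out.println(readOnly.spansOfType("name"));
    }
}
```

Passing the narrowest interface a caller needs lets the compiler enforce the restriction: code that receives the read-only view has no method through which to alter the markup.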

NestedTextLabels is an odd sort of implementation of MonotonicTextLabels. It defines two TextLabels objects, an "inner" one and an "outer" one, such that the outer one can be monotonically added to, but the inner one is never modified. Semantically, the markup in NestedTextLabels is the union of the markup in the inner and outer TextLabels objects, except that property values in the outer TextLabels "shadow" values in the inner TextLabels. This has several possible uses, for instance:

  • One can change a TextLabels object and then "back out" the changes by
      1. creating a NestedTextLabels object with an empty "outer" MonotonicTextLabels,
      2. monotonically adding to this new "outer" TextLabels, and then
      3. discarding the NestedTextLabels and reverting to the old "inner" TextLabels to undo the modifications.
  • One can easily construct and view the union of two TextLabels objects (or at least, some well-defined approximation of this) while still being able to modify either underlying TextLabels. For instance, one can construct a single TextLabels which contains the output of a MixupProgram, plus some hand-labeled "ground truth" data, while still being able to re-run the program and get new output and/or edit the "ground truth".
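The inner/outer arrangement, including shadowing of property values and the "back out" use case, can be illustrated with a small layered store. This is a sketch under stated assumptions: the `Labels` and `Nested` classes, the string encoding of spans, and the `span|property` key format are all hypothetical and are not the NestedTextLabels implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NestedLabelsSketch {
    // Hypothetical minimal labels store: one property map and one type index.
    static final class Labels {
        final Map<String, String> properties = new HashMap<>(); // "span|prop" -> value
        final Map<String, Set<String>> types = new HashMap<>(); // type -> spans

        void setProperty(String span, String prop, String value) {
            properties.put(span + "|" + prop, value);
        }

        void addToType(String span, String type) {
            types.computeIfAbsent(type, k -> new TreeSet<>()).add(span);
        }
    }

    // Layered view: all writes go to "outer"; "inner" is never modified.
    static final class Nested {
        private final Labels inner;
        private final Labels outer = new Labels();

        Nested(Labels inner) {
            this.inner = inner;
        }

        void setProperty(String span, String prop, String value) {
            outer.setProperty(span, prop, value);
        }

        void addToType(String span, String type) {
            outer.addToType(span, type);
        }

        // Outer values shadow inner values for the same span and property.
        String getProperty(String span, String prop) {
            String key = span + "|" + prop;
            String value = outer.properties.get(key);
            return value != null ? value : inner.properties.get(key);
        }

        // Type membership is the union of the inner and outer markup.
        Set<String> spansOfType(String type) {
            Set<String> result = new TreeSet<>(
                    inner.types.getOrDefault(type, Collections.<String>emptySet()));
            result.addAll(outer.types.getOrDefault(type, Collections.<String>emptySet()));
            return result;
        }
    }

    public static void main(String[] args) {
        Labels base = new Labels();
        base.setProperty("doc1:0-1", "color", "red");
        base.addToType("doc1:0-1", "name");

        Nested scratch = new Nested(base);
        scratch.setProperty("doc1:0-1", "color", "blue"); // shadows "red"
        scratch.addToType("doc1:4-5", "name");

        System.out.println(scratch.getProperty("doc1:0-1", "color"));
        System.out.println(scratch.spansOfType("name"));
        // Dropping "scratch" backs out the changes: "base" was never touched.
        System.out.println(base.properties.get("doc1:0-1|color"));
    }
}
```

Because every write lands in the outer layer, undoing a batch of changes costs nothing more than discarding the wrapper and continuing with the untouched inner object.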