MinorThird Basic Concepts
In MinorThird, a collection of documents is stored in a TextBase. Annotations about these documents are stored in a corresponding TextLabels object. Each annotation asserts a category or property for a word, a document, or a subsequence of words (a.k.a. a Span). TextLabels can store information from many sources: annotations produced by human labelers (perhaps using a GUI tool like the TextBaseEditor), annotations produced by a hand-written program, or annotations produced by a learned program. Multiple TextLabels can annotate a single TextBase, if necessary.
Annotated TextBases can be stored in many ways, so a "repository" can be configured to hold a bunch of TextLabels and their associated TextBases. TextLabels in the repository are loaded with the FancyLoader. TextLabels and TextBases can also be loaded directly with the TextBaseLoader and the TextBaseEditor.
Moderately complex annotation programs can be implemented with Mixup, a special-purpose annotation language which is part of MinorThird. Mixup can also be used to generate features for learning algorithms. A sequence of Mixup commands can be combined in a MixupProgram. The MixupDebugger is a GUI tool for testing a MixupProgram.
MinorThird contains a number of methods for learning to extract Spans from a document, or learning to classify Spans. Top-level programs for conducting learning experiments and for training, testing, and applying Annotators can be found in the edu.cmu.minorthird.ui package. The edu.cmu.minorthird.ui.Help class is a main program that, when invoked, lists the relevant main methods.
Under the hood, learning is performed using classes from the edu.cmu.minorthird.classify package. A ClassifierLearner learns a Classifier from a set of labeled Examples, usually stored in a Dataset. Several sequential classification algorithms are also implemented in the package edu.cmu.minorthird.classify.sequential. The classify package is independent of the edu.cmu.minorthird.text package, but linked to it by the routines in edu.cmu.minorthird.text.learn. Most importantly, the SpanFE package implements what is essentially a small feature extraction sub-language, embedded in Java, which makes it easy to generate a wide variety of features of a document, token, or Span. This language is even more powerful because it can base features on annotations stored in the TextLabels associated with the Span.
TextToken is a "token" (usually a single word in a document), plus some additional information that records where this token occurred. Specifically, one can recover the string that contained the token, a shorter string identifier for this "document" string, and the character offsets of the token, i.e., where it appeared in the document string.
Span is a sequence of adjacent TextToken objects from the same document.
Span and TextToken objects are considered to be inherently ordered. If two Span or TextToken objects are from different documents, they are ordered lexicographically by the identifiers of those documents. Within a single document, TextToken objects are ordered according to their position in the document, and Span objects are ordered according to their leftmost TextToken (using the rightmost TextToken to break ties).
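The ordering rule above can be sketched as a comparator. This is a minimal illustration, not MinorThird's own code: SimpleSpan and its fields are hypothetical stand-ins for Span, and sorting ties in ascending order of the rightmost token is an assumption, since the text only says the rightmost token breaks ties.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: SimpleSpan is a hypothetical stand-in, not MinorThird's Span.
class SpanOrderSketch {

    static final class SimpleSpan {
        final String documentId; // identifier of the containing document
        final int firstToken;    // index of the leftmost token
        final int lastToken;     // index of the rightmost token (inclusive)

        SimpleSpan(String documentId, int firstToken, int lastToken) {
            this.documentId = documentId;
            this.firstToken = firstToken;
            this.lastToken = lastToken;
        }

        @Override
        public String toString() {
            return documentId + "[" + firstToken + ".." + lastToken + "]";
        }
    }

    // Document identifier first (lexicographic), then leftmost token,
    // then rightmost token. Ascending tie-break order is an assumption.
    static final Comparator<SimpleSpan> ORDER =
        Comparator.comparing((SimpleSpan s) -> s.documentId)
                  .thenComparingInt(s -> s.firstToken)
                  .thenComparingInt(s -> s.lastToken);

    // Returns a sorted copy and leaves the input list untouched.
    static List<SimpleSpan> sorted(List<SimpleSpan> spans) {
        List<SimpleSpan> copy = new ArrayList<>(spans);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(
            new SimpleSpan("docB", 0, 1),
            new SimpleSpan("docA", 2, 5),
            new SimpleSpan("docA", 2, 3),
            new SimpleSpan("docA", 0, 9))));
    }
}
```

With this comparator, all spans of one document sort together, and within a document a span that starts earlier always precedes one that starts later, regardless of length.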
TextBase is a collection of tokenized document strings, accessible as Span objects.
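To make the relationship between documents, tokens, and offsets concrete, here is a small self-contained sketch. It does not use the MinorThird classes: TextBaseSketch, Token, loadDocument, and subSpan are hypothetical names, and splitting on whitespace is an assumption made for brevity rather than a description of MinorThird's tokenizer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy text base, not MinorThird's TextBase.
class TextBaseSketch {

    /** A token remembers its document, the document's id, and its offsets. */
    static final class Token {
        final String documentId; // short identifier of the document string
        final String document;   // the full string that contains the token
        final int start;         // character offset of the first character
        final int length;        // number of characters in the token

        Token(String documentId, String document, int start, int length) {
            this.documentId = documentId;
            this.document = document;
            this.start = start;
            this.length = length;
        }

        // The token text is recovered from the document string and offsets.
        String text() {
            return document.substring(start, start + length);
        }
    }

    private final Map<String, List<Token>> tokensByDocument = new LinkedHashMap<>();

    // Tokenizes on whitespace (an assumption) and records each token's offsets.
    void loadDocument(String documentId, String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(new Token(documentId, text, m.start(), m.end() - m.start()));
        }
        tokensByDocument.put(documentId, tokens);
    }

    // A "span" here is simply a run of adjacent tokens from one document.
    List<Token> subSpan(String documentId, int first, int count) {
        List<Token> all = tokensByDocument.get(documentId);
        return new ArrayList<>(all.subList(first, first + count));
    }

    public static void main(String[] args) {
        TextBaseSketch base = new TextBaseSketch();
        base.loadDocument("doc1", "the quick brown fox");
        for (Token t : base.subSpan("doc1", 1, 2)) {
            System.out.println(t.text() + " @ " + t.start);
        }
    }
}
```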
TextLabels contains markup for a TextBase. This markup can consist of:
- String-valued properties of individual TextToken objects (i.e., individual occurrences of words in the TextBase).
- String-valued properties of spans of TextToken objects in the TextBase.
- Groupings of Span objects into types. A Span can belong to multiple types, and unlike properties, it is possible to quickly find all spans of a given type in a TextLabels, or find all spans of a given type in a specific document.
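One way to picture the three kinds of markup is as two property maps plus an index from types to spans. The sketch below is only a model of the idea under that assumption: MarkupSketch and all of its methods are hypothetical and are not the TextLabels API, and spans are represented as plain string keys to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy model of the markup, not MinorThird's TextLabels.
class MarkupSketch {
    // token key or span key -> (property name -> value)
    private final Map<String, Map<String, String>> tokenProps = new HashMap<>();
    private final Map<String, Map<String, String>> spanProps = new HashMap<>();
    // type -> document id -> spans of that type; this index is what makes
    // "all spans of a type" and "all spans of a type in a document" cheap.
    private final Map<String, Map<String, List<String>>> typeIndex = new HashMap<>();

    static String tokenKey(String docId, int index) {
        return docId + "#" + index;
    }

    static String spanKey(String docId, int lo, int hi) {
        return docId + ":" + lo + "-" + hi;
    }

    void setTokenProperty(String docId, int index, String prop, String value) {
        tokenProps.computeIfAbsent(tokenKey(docId, index), k -> new HashMap<>()).put(prop, value);
    }

    String getTokenProperty(String docId, int index, String prop) {
        Map<String, String> props = tokenProps.get(tokenKey(docId, index));
        return props == null ? null : props.get(prop);
    }

    void setSpanProperty(String docId, int lo, int hi, String prop, String value) {
        spanProps.computeIfAbsent(spanKey(docId, lo, hi), k -> new HashMap<>()).put(prop, value);
    }

    String getSpanProperty(String docId, int lo, int hi, String prop) {
        Map<String, String> props = spanProps.get(spanKey(docId, lo, hi));
        return props == null ? null : props.get(prop);
    }

    // A span may be added to any number of types.
    void addToType(String docId, int lo, int hi, String type) {
        typeIndex.computeIfAbsent(type, k -> new TreeMap<>())
                 .computeIfAbsent(docId, k -> new ArrayList<>())
                 .add(spanKey(docId, lo, hi));
    }

    // All spans of a type, across documents (in document-id order).
    List<String> spansOfType(String type) {
        List<String> result = new ArrayList<>();
        Map<String, List<String>> byDoc = typeIndex.get(type);
        if (byDoc != null) {
            for (List<String> spans : byDoc.values()) {
                result.addAll(spans);
            }
        }
        return result;
    }

    // All spans of a type within one document.
    List<String> spansOfType(String type, String docId) {
        Map<String, List<String>> byDoc = typeIndex.get(type);
        if (byDoc == null || !byDoc.containsKey(docId)) {
            return Collections.emptyList();
        }
        return new ArrayList<>(byDoc.get(docId));
    }
}
```

Properties are looked up from a known token or span, while types are looked up in the other direction, from a type name to its spans, which is why only types get a dedicated index in this sketch.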
There are a few varieties of TextLabels objects. A plain TextLabels can only be read, not modified. A MonotonicTextLabels can be modified by changing property values, adding new property values, or adding a Span to a type; however, a Span cannot be removed from a type. A MutableTextLabels allows spans to be removed from a type as well (i.e., it is fully mutable).
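The three levels of access described above (read-only, add-only, and fully mutable) map naturally onto a chain of interfaces, each extending the previous one. The following is a minimal sketch of that design, with hypothetical interface and method names that are not taken from MinorThird; spans are again plain string keys.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: hypothetical interfaces, not the MinorThird ones.
class LabelVarietiesSketch {

    /** Read-only view: queries only. */
    interface ReadOnlyLabels {
        boolean hasType(String span, String type);
        String getProperty(String span, String property);
    }

    /** Add-only view: markup can grow or be overwritten, but never removed. */
    interface MonotonicLabels extends ReadOnlyLabels {
        void addToType(String span, String type);
        void setProperty(String span, String property, String value);
    }

    /** Fully mutable view: spans can also be removed from a type. */
    interface MutableLabels extends MonotonicLabels {
        void removeFromType(String span, String type);
    }

    static final class InMemoryLabels implements MutableLabels {
        private final Map<String, Set<String>> spansByType = new HashMap<>();
        private final Map<String, String> properties = new HashMap<>();

        @Override
        public boolean hasType(String span, String type) {
            Set<String> spans = spansByType.get(type);
            return spans != null && spans.contains(span);
        }

        @Override
        public String getProperty(String span, String property) {
            return properties.get(span + "|" + property);
        }

        @Override
        public void addToType(String span, String type) {
            spansByType.computeIfAbsent(type, k -> new HashSet<>()).add(span);
        }

        @Override
        public void setProperty(String span, String property, String value) {
            properties.put(span + "|" + property, value);
        }

        @Override
        public void removeFromType(String span, String type) {
            Set<String> spans = spansByType.get(type);
            if (spans != null) {
                spans.remove(span);
            }
        }
    }

    // Code that should only add markup asks for the narrower interface,
    // so the compiler prevents it from removing anything.
    static void annotate(MonotonicLabels labels) {
        labels.addToType("doc1:0-1", "person");
        labels.setProperty("doc1:0-1", "source", "hand-labeled");
    }

    public static void main(String[] args) {
        InMemoryLabels labels = new InMemoryLabels();
        annotate(labels);
        ReadOnlyLabels view = labels; // callers of 'view' cannot modify anything
        System.out.println(view.hasType("doc1:0-1", "person"));
    }
}
```

Declaring a parameter with the narrowest interface that suffices documents, and enforces at compile time, how much a piece of code is allowed to change the markup.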
NestedTextLabels is an unusual implementation of MonotonicTextLabels. It wraps two TextLabels objects, an "inner" one and an "outer" one, such that the outer one can be monotonically added to, but the inner one is never modified. Semantically, the markup in a NestedTextLabels is the union of the markup in the inner and outer TextLabels objects, except that property values in the outer TextLabels "shadow" values in the inner TextLabels. This has several possible uses, for instance:
- One can change a TextLabels object and then "back out" the changes by
  - creating a NestedTextLabels object with an empty "outer" MonotonicTextLabels,
  - monotonically adding to this new "outer" TextLabels, and then
  - discarding the NestedTextLabels and reverting to the old "inner" TextLabels to undo the modifications.
- One can easily construct and view the union of two TextLabels objects (or at least, some well-defined approximation of this) while still being able to modify either underlying TextLabels. For instance, one can construct a single TextLabels which contains the output of a MixupProgram, plus some hand-labeled "ground truth" data, while still being able to re-run the program and get new output and/or edit the "ground truth".
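The inner/outer semantics and the "back out" use case can be sketched in a few lines. This is a toy model under stated assumptions, not the NestedTextLabels implementation: Layer and Nested are hypothetical classes, spans are plain string keys, and only properties and type membership are modeled.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy model of inner/outer layering.
class NestedLabelsSketch {

    /** One layer of markup: span properties plus span-to-type membership. */
    static final class Layer {
        private final Map<String, String> properties = new HashMap<>();
        private final Set<String> typeMembers = new HashSet<>();

        String getProperty(String span, String property) {
            return properties.get(span + "|" + property);
        }

        boolean hasType(String span, String type) {
            return typeMembers.contains(type + "|" + span);
        }

        void setProperty(String span, String property, String value) {
            properties.put(span + "|" + property, value);
        }

        void addToType(String span, String type) {
            typeMembers.add(type + "|" + span);
        }
    }

    /** Union of an untouched inner layer and a writable outer layer. */
    static final class Nested {
        private final Layer inner;
        private final Layer outer = new Layer(); // starts out empty

        Nested(Layer inner) {
            this.inner = inner;
        }

        // Outer property values shadow inner ones.
        String getProperty(String span, String property) {
            String shadowing = outer.getProperty(span, property);
            return shadowing != null ? shadowing : inner.getProperty(span, property);
        }

        // Type membership is the union of both layers.
        boolean hasType(String span, String type) {
            return outer.hasType(span, type) || inner.hasType(span, type);
        }

        // All writes go to the outer layer; the inner layer is never modified.
        void setProperty(String span, String property, String value) {
            outer.setProperty(span, property, value);
        }

        void addToType(String span, String type) {
            outer.addToType(span, type);
        }
    }

    public static void main(String[] args) {
        Layer original = new Layer();
        original.setProperty("doc1:0-1", "color", "red");

        Nested scratch = new Nested(original);
        scratch.setProperty("doc1:0-1", "color", "blue");
        scratch.addToType("doc1:0-1", "candidate");

        System.out.println(scratch.getProperty("doc1:0-1", "color"));  // outer value
        System.out.println(original.getProperty("doc1:0-1", "color")); // unchanged
    }
}
```

Undoing the changes then amounts to dropping the reference to the nested object and continuing to use the original layer, which was never written to.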