What's different about MinorThird
MinorThird's toolkit of learning methods is integrated tightly with the tools for manually and programmatically annotating text. Additionally, MinorThird differs from existing NLP and learning toolkits in a number of ways:
- Unlike many NLP packages (e.g., GATE, Alembic), it combines tools for annotating and visualizing text with state-of-the-art learning methods.
- Unlike many other learning packages, it contains methods to visualize both training data and the performance of classifiers, which facilitates debugging.
- Unlike other learning packages less tightly integrated with text manipulation tools, it is possible to track and visualize the transformation of text data into machine learning data.
- Unlike many packages (including WEKA), it is open-source, and available for both commercial and research purposes.
- Unlike any open-source learning systems I know of, it is architected to support active learning and on-line learning, which should facilitate integration of learning methods into agents.
In MinorThird, a collection of documents is stored in a database called a "TextBase". Logical assertions about documents in a TextBase can be made and stored in a special "TextLabels" object. TextLabels are a type of "stand-off annotation": unlike XML markup, for instance, the annotations are completely independent of the text. This means that the text can be stored in its original form, and that many different (perhaps incompatible) types of annotations can be associated with the same TextBase.
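The idea of stand-off annotation can be illustrated with a minimal, self-contained Java sketch. The class names here (`StandOffDemo`, `SimpleTextStore`, `SimpleLabels`, `Annotation`) are hypothetical and are not MinorThird's actual API; the sketch only shows text and annotations living in separate objects, linked by character offsets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of stand-off annotation; not MinorThird's classes.
final class StandOffDemo {

    /** A typed character range [start, end) within one document. */
    static final class Annotation {
        final String docId;
        final int start;
        final int end;
        final String type;

        Annotation(String docId, int start, int end, String type) {
            this.docId = docId;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Holds the original text of each document; the text is never modified. */
    static final class SimpleTextStore {
        private final Map<String, String> docs = new HashMap<>();

        void load(String docId, String text) { docs.put(docId, text); }

        String text(String docId) { return docs.get(docId); }
    }

    /** Holds annotations separately from the text they describe. */
    static final class SimpleLabels {
        private final List<Annotation> annotations = new ArrayList<>();

        void add(String docId, int start, int end, String type) {
            annotations.add(new Annotation(docId, start, end, type));
        }

        /** Returns the text covered by every annotation of the given type. */
        List<String> covered(SimpleTextStore store, String type) {
            List<String> out = new ArrayList<>();
            for (Annotation a : annotations) {
                if (a.type.equals(type)) {
                    out.add(store.text(a.docId).substring(a.start, a.end));
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        SimpleTextStore store = new SimpleTextStore();
        store.load("doc1", "Alice Smith wrote the report.");

        // Two independent label sets over the same unmodified text.
        SimpleLabels entities = new SimpleLabels();
        entities.add("doc1", 0, 11, "person");

        SimpleLabels syntax = new SimpleLabels();
        syntax.add("doc1", 12, 17, "verb");

        System.out.println(entities.covered(store, "person")); // [Alice Smith]
        System.out.println(syntax.covered(store, "verb"));     // [wrote]
    }
}
```

Because the two label sets only refer to offsets, either one can be dropped or replaced without touching the stored text or the other set.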
Each TextLabels annotation asserts a category or property for a word, a document, or a subsequence of words. (In MinorThird, a sequence of adjacent words is called a "span".) These annotations might be produced by human labelers, by a hand-written program, or by a learned program. TextLabels might encode syntactic properties (like shallow parses or part-of-speech tags) or semantic properties (like the functional role that entities play in a sentence). TextLabels can be nested, much as variable-binding environments can be nested in a programming language, which enables sets of hypothetical or temporary labels to be added in a local scope and then discarded.
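The nesting behavior described above resembles a chain of scopes in which lookups fall through to the enclosing scope. The following is a rough sketch of that idea only; `ScopedLabels` and its methods are hypothetical names invented for illustration, and MinorThird's real nested-label classes may work differently.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of nested label scopes; not MinorThird's API.
final class ScopedLabels {
    private final ScopedLabels parent; // null for the outermost scope
    private final Set<List<Object>> facts = new HashSet<>();

    ScopedLabels(ScopedLabels parent) {
        this.parent = parent;
    }

    /** Asserts, in this scope only, that span [start, end) of a document has a type. */
    void assertSpan(String docId, int start, int end, String type) {
        facts.add(key(docId, start, end, type));
    }

    /** True if the assertion was made in this scope or in any enclosing scope. */
    boolean holds(String docId, int start, int end, String type) {
        List<Object> k = key(docId, start, end, type);
        for (ScopedLabels s = this; s != null; s = s.parent) {
            if (s.facts.contains(k)) {
                return true;
            }
        }
        return false;
    }

    private static List<Object> key(String docId, int start, int end, String type) {
        return List.<Object>of(docId, start, end, type);
    }

    public static void main(String[] args) {
        ScopedLabels base = new ScopedLabels(null);
        base.assertSpan("doc1", 0, 11, "person");

        // Temporary labels go into an inner scope layered over the base labels.
        ScopedLabels scratch = new ScopedLabels(base);
        scratch.assertSpan("doc1", 12, 17, "candidateVerb");

        System.out.println(scratch.holds("doc1", 0, 11, "person"));       // true
        System.out.println(base.holds("doc1", 12, 17, "candidateVerb"));  // false
    }
}
```

Discarding the inner scope is then just a matter of dropping the reference to `scratch`; the base labels are never altered.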
Annotated TextBases are accessed in a single uniform way, although they can be stored in one of several schemes. A MinorThird "repository" can be configured to hold a set of TextLabels and their associated TextBases.
Moderately complex hand-coded annotation programs can be implemented with a special-purpose annotation language called Mixup, which is part of MinorThird. Mixup is based on the widely used notion of cascaded finite state transducers, but includes some powerful features, including a GUI debugging environment, escape to Java, and a kind of subroutine call mechanism. Mixup can also be used to generate features for learning algorithms, and all the text-based learning tools in MinorThird are closely integrated with Mixup. For instance, feature extractors used in a learned named-entity recognition package might call a Mixup program to perform initial preprocessing of text.
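To give a feel for what "cascaded" means here, the sketch below chains two very simple passes in plain Java: the first tags each token, and the second consumes those tags to produce spans. This is not Mixup syntax and not how Mixup is implemented; the class `CascadeDemo`, its method names, and the capitalization rule are all assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical two-stage cascade; a simplification, not Mixup itself.
final class CascadeDemo {

    /** Stage 1: tag a token "cap" if it starts with an uppercase letter, else "other". */
    static List<String> tagTokens(String[] tokens) {
        List<String> tags = new ArrayList<>();
        for (String t : tokens) {
            boolean cap = !t.isEmpty() && Character.isUpperCase(t.charAt(0));
            tags.add(cap ? "cap" : "other");
        }
        return tags;
    }

    /** Stage 2: emit a token span [start, end) for each maximal run of two or more "cap" tags. */
    static List<int[]> findNameSpans(List<String> tags) {
        List<int[]> spans = new ArrayList<>();
        int runStart = -1;
        // Iterate one step past the end so a run ending at the last token is closed.
        for (int i = 0; i <= tags.size(); i++) {
            boolean cap = i < tags.size() && tags.get(i).equals("cap");
            if (cap && runStart < 0) {
                runStart = i;
            }
            if (!cap && runStart >= 0) {
                if (i - runStart >= 2) {
                    spans.add(new int[] {runStart, i});
                }
                runStart = -1;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = "yesterday Alice Smith met Bob".split(" ");
        for (int[] s : findNameSpans(tagTokens(tokens))) {
            System.out.println(String.join(" ", Arrays.copyOfRange(tokens, s[0], s[1])));
        }
        // prints "Alice Smith"
    }
}
```

The point of the cascade is that the second stage never looks at raw text, only at the labels the first stage produced, so stages can be developed and debugged one at a time.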
MinorThird contains a number of methods for learning to extract and label spans from a document, or learning to classify spans (based on their content or context within a document). A special case of classifying spans is classifying entire documents. MinorThird includes a number of state-of-the-art sequential learning methods (such as conditional random fields and discriminative methods for training hidden Markov models).
One practical difficulty in using learning techniques to solve NLP problems is that the input to learners is the result of a complex chain of transformations that begins with text and ends with very low-level representations. Verifying the correctness of this chain of derivations can be difficult. To address this problem, MinorThird also includes a number of tools for visualizing transformed data and relating it to the text from which it was derived.
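One simple way to keep low-level data traceable is to have every feature carry the offsets of the text it was computed from. The sketch below assumes that design purely for illustration; `ProvenanceDemo`, `TracedFeature`, and the `tokenLength` feature are hypothetical and do not describe how MinorThird's visualization tools are built.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of features that remember their source text; not MinorThird's API.
final class ProvenanceDemo {

    /** A feature value together with the character range [start, end) it came from. */
    static final class TracedFeature {
        final String name;
        final double value;
        final int start;
        final int end;

        TracedFeature(String name, double value, int start, int end) {
            this.name = name;
            this.value = value;
            this.start = start;
            this.end = end;
        }

        /** Recovers the original text behind this feature. */
        String source(String text) {
            return text.substring(start, end);
        }
    }

    /** Emits one "tokenLength" feature per whitespace-separated token, keeping its offsets. */
    static List<TracedFeature> extract(String text) {
        List<TracedFeature> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new TracedFeature("tokenLength", i - start, start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Alice wrote reports";
        for (TracedFeature f : extract(text)) {
            System.out.println(f.name + "=" + f.value + " from \"" + f.source(text) + "\"");
        }
    }
}
```

With offsets attached, a suspicious feature value can be mapped straight back to the words that produced it, which is the kind of check the visualization tools are meant to support.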