linfrank edited this page Aug 16, 2012 · 2 revisions

Mixup Tutorial

Mixup is a simple pattern-matching and information extraction language included with MinorThird. The name is an acronym for My Information eXtraction and Understanding Package. You can run a Mixup program in MinorThird using the UI package, which is covered in the next section. For more information on the language, see Mixup Language; the Javadoc for Mixup and MixupProgram may also be helpful.

Writing and Running Mixup Programs

MinorThird's language for manipulating text is Mixup. Some sample Mixup programs can be found on William's website under "Teaching" under "June 21,23,25, 2005". Here is a sample program sample1.mixup with commentary:

defSpanType source1 = title: ... '(' [ ... ] ')' ;

Here defSpanType source1 declares a span type named source1, and the expression to the right of the equal sign is the pattern that matches it. This line says that source1 is the text between the parentheses in the title. Here is what each part of the expression means:

  • defSpanType - keyword
  • source1 - name of the defined SpanType
  • title: - start with title and match to the pattern defined in the remainder of the expression
  • ... - anything
  • '(' - the left parenthesis token
  • [ - START of the extracted span
  • ... - anything
  • ] - END of the extracted span
  • ')' - the right parenthesis token

defSpanType source2 = description: [ !'-'+R ] '-' ... ;

This line is very similar to the line above, but contains a few new expressions:

  • ! - not this token
  • + - one or more times
  • R - extend the match as far to the right as possible
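
As a rough illustration of what these two patterns select, the sketch below mimics them in the shell with awk. It is not how MinorThird evaluates Mixup: the sample title and description are made up, and tokens are assumed to be separated by spaces, whereas MinorThird uses its own tokenizer. It also assumes that a pattern has to account for the whole title or description span, which is what the leading ... in source1 suggests.

```shell
# Hypothetical sample text, already split into space-separated tokens.
title='Stocks rally on earnings ( Reuters )'
description='Reuters - Stocks rose sharply on Monday'

# source1 = title: ... '(' [ ... ] ')'
# The last token must be ')'; print one candidate span per '(' found.
source1=$(printf '%s\n' "$title" | awk '
  $NF == ")" {
    for (i = 1; i < NF; i++) {
      if ($i != "(") continue
      span = ""
      for (j = i + 1; j < NF; j++) span = span (span == "" ? "" : " ") $j
      print span
    }
  }')

# source2 = description: [ !'-'+R ] '-' ...
# Take the tokens from the start up to, but not including, the first '-'.
source2=$(printf '%s\n' "$description" | awk '
  {
    n = 0
    while (n < NF && $(n + 1) != "-") n++
    if (n >= 1 && n < NF) {
      span = $1
      for (j = 2; j <= n; j++) span = span " " $j
      print span
    }
  }')

printf 'source1: %s\n' "$source1"
printf 'source2: %s\n' "$source2"
```

With this input both spans come out as Reuters. In this sketch, a title containing two opening parentheses would print two candidate spans, one per '('.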

To see the parameters for running a Mixup program, type:

$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -help

Now let's try running a sample Mixup program. First make sure the sample programs are in your minorthird/lib/mixup directory. Then run:

$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -showResult

The -showResult parameter displays the output graphically. Alternatively, run:

$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -gui

Press the Start Task button to run the program. When the program is done running a window like this will appear:

This window looks similar to the one that appeared when you ran ViewLabels; however, you will notice that there are now 6 span types instead of 4 since sample1.mixup defined two more span types: source1 and source2. To see what the Mixup program extracted, try going to the SpanTypes tab and highlighting source1 and source2.

sample1a.mixup demonstrates what happens if a Mixup expression contains + instead of +R. Unlike other languages, which extend patterns greedily, Mixup takes each pattern literally and backtracks as needed. To see how this works, run:

$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1a.mixup -showResult -saveAs foo.labels

Note: -saveAs FILE saves the labels to FILE in a computer-readable format.

When the window appears, highlight source2. Given that source2 is meant to be the prefix that ends before a -, you can see that the + version does not extract it correctly. Now run sample1.mixup again and see how +R, rather than just +, gives the right result.

The lessons from these two sample Mixup programs are:

  1. Use L and R prefixes when you can.
  2. Use non-determinism when you need to.

Take a look at another example, sample2.mixup. Then run:

$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample2.mixup -showResult

Now let's take a look at some annotators:

  1. Open sample3.mixup (don’t look at it yet).
  2. Run (this will take a while):
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample3.mixup -showResult
  3. Now take a look at sample3.mixup:
  • require asks for a type of annotation to be defined; it is similar to an import statement in Java.
  • Annotators are usually found in $MINORTHIRD/lib/mixup.
  • Annotators can be redefined in annotators.config, which is usually at $MINORTHIRD/config/annotators.config.
  4. When RunMixup has finished, save the computation as sample3.labels to save time later on. To do this, click the SaveAs button at the bottom middle of the top left window (you will have to scroll to get there). Note: File -> Save As does not work in this case; that is only for serializable objects.
  5. Now pick out some useful tags and save them in small-newsdir.labels:
$ perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels | grep addToType | cut -d" " -f5 | sort | uniq -c
$ perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels > small-newsdir.labels
$ java -Xmx500M edu.cmu.minorthird.ui.ViewLabels -labels small-newsdir
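
To see what the filtering step does without running MinorThird, the sketch below applies the same fifth-field test to a small made-up labels file. The label lines are hypothetical and assume the form addToType DOC START LENGTH TYPE, which is what the use of field 5 above implies. It uses awk in place of the Perl one-liner purely for illustration; awk's $5 is the same field as Perl's $F[4].

```shell
# Hypothetical label lines in the assumed form: addToType DOC START LENGTH TYPE
workdir=$(mktemp -d)
cat > "$workdir/sample3.labels" <<'EOF'
addToType doc1.txt 0 24 title
addToType doc1.txt 26 7 source1
addToType doc1.txt 40 5 token
addToType doc2.txt 0 18 title
addToType doc2.txt 3 6 NNP
EOF

# Keep lines whose fifth field matches one of the wanted type names.
awk '$5 ~ /(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' \
  "$workdir/sample3.labels" > "$workdir/small-newsdir.labels"

# Count the surviving labels per type.
grep addToType "$workdir/small-newsdir.labels" | cut -d" " -f5 | sort | uniq -c

# Report how many lines survived, then clean up.
kept=$(wc -l < "$workdir/small-newsdir.labels" | tr -d ' ')
echo "kept $kept of 5 lines"
rm -rf "$workdir"
```

Because the match is unanchored, a type such as source1 is kept by the source alternative, while the made-up token type is dropped.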

Note: MinorThird resolves -labels FOO by trying the following, in order:

  1. Look in the repository.
  2. Look for directory FOO.
  3. Look for FOO.labels for markup, and ignore in-line markup.
  4. Look for in-line markup.

The Mixup Debugger and Label Editor

Debugging Mixup gives you the ability to edit your labels and your labeling program in parallel. To see how this works, copy saved-handLabeled.labels to handLabeled.labels and do:

$ java -Xmx500M edu.cmu.minorthird.ui.DebugMixup -labels small-newsdir -edit handLabeled.labels -mixup sample5.mixup

A window that looks like this will appear (without the highlighting at first):

To highlight extracted companies (which were defined by the Mixup program), select extracted_company from the first pull down menu on the section divider. All the extracted companies will turn yellow (you may have to scroll down a little to find any). Then to view the true companies, which were defined by handLabeled.labels, select true_company from the second pull down menu. All hand labeled companies that were properly extracted by the Mixup program will turn green, all companies that were missed by the Mixup program will turn blue, and false positives will turn red (see above picture for reference).

To edit the labels, click on a document, then click the Import button at the bottom of the window. This imports all the extracted company labels. To correct these labels, step through them with the Next button and click Delete on any false positive. To add a label, highlight the span and click Add. When you are finished labeling a document, click Export. Click Save when you finish.

Some tips:

  • On the right-hand side of the center bar, replace -top- with -body- to focus the window on what you care about.
  • Replace -top- with -extracted company- and move the slider to look for extractions in context.

When you're close enough with the debugging, you might want to hand the task over to someone else to get more training data. First run the current program:

$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample5.mixup -saveAs sample5.labels

Now take the relevant part of its output, and your hand-labeling results, and merge them:

$ grep extracted_company sample5.labels > labelingTask.labels
$ cat handLabeled.labels >> labelingTask.labels
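
The merge can be tried on its own with stand-in files. In the sketch below the file contents are hypothetical and assume one addToType DOC START LENGTH TYPE line per label; only the grep and cat steps mirror the commands above.

```shell
# Hypothetical stand-ins for the program output and the hand labels.
workdir=$(mktemp -d)
cat > "$workdir/sample5.labels" <<'EOF'
addToType doc1.txt 0 24 title
addToType doc1.txt 30 9 extracted_company
addToType doc2.txt 12 4 extracted_company
EOF
cat > "$workdir/handLabeled.labels" <<'EOF'
addToType doc1.txt 30 9 true_company
EOF

# Keep only the program's extractions, then append the hand labels.
grep extracted_company "$workdir/sample5.labels" > "$workdir/labelingTask.labels"
cat "$workdir/handLabeled.labels" >> "$workdir/labelingTask.labels"

# Show the merged file and count each kind of label, then clean up.
cat "$workdir/labelingTask.labels"
extracted=$(grep -c extracted_company "$workdir/labelingTask.labels")
hand=$(grep -c true_company "$workdir/labelingTask.labels")
echo "extracted_company lines: $extracted, true_company lines: $hand"
rm -rf "$workdir"
```

The merged file holds only the two label types that the labeling tool is asked to compare; the title line from the program output is left out.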

Now run the labeling tool (which is somewhat stripped down) on the result:

$ java -Xmx500M edu.cmu.minorthird.ui.EditLabels -labels small-newsdir -edit labelingTask.labels -extractedType extracted_company -trueType true_company