Mixup Tutorial
Mixup is a simple pattern-matching and information extraction language included with MinorThird. The name is an acronym for My Information eXtraction and Understanding Package. You can run a Mixup program in MinorThird using the UI package (which will be covered in the next section). For more information on the language, see Mixup Language. It may also be helpful to look at the Javadoc for Mixup and MixupProgram.
MinorThird's language for manipulating text is Mixup. Some sample Mixup programs can be found on William's website under "Teaching" under "June 21,23,25, 2005". Here is a sample program, sample1.mixup, with commentary:
defSpanType source1 = title: ... '(' [ ... ] ')' ;
Here defSpanType source1 defines source1 as the span type given to the right of the equal sign. The expression to the right of the equal sign is the pattern that source1 must match. This line says that source1 is the text in the title between the parentheses. Here is what each part of the expression means:
- defSpanType - keyword
- source1 - name of the defined span type
- title: - start with title and match it against the pattern defined in the remainder of the expression
- ... - anything
- '(' - the left parenthesis token
- [ - START
- ... - anything
- ] - END
- ')' - the right parenthesis token
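As a rough illustration of what this pattern selects, here is a minimal Python sketch that imitates it on a title that has already been split into tokens. It is not MinorThird code: the function name match_source1 and the sample title are made up, and the sketch assumes that the pattern has to account for every token of the title, so ')' must be the last token.

```python
def match_source1(tokens):
    """Imitate  title: ... '(' [ ... ] ')'  on one tokenized title.

    Returns a list of candidate spans, each a list of tokens.
    Assumption: the pattern must cover the whole title, so the
    title has to end with ')'.
    """
    spans = []
    if not tokens or tokens[-1] != ")":
        return spans
    close = len(tokens) - 1
    # The leading ... may absorb any number of tokens, so every '('
    # that comes before the final ')' is a candidate START.
    for start in range(close):
        if tokens[start] == "(":
            spans.append(tokens[start + 1:close])
    return spans


# Hypothetical title, already tokenized.
title = ["Storm", "hits", "coast", "(", "Reuters", ")"]
print(match_source1(title))  # prints [['Reuters']]
```

A title without a closing parenthesis at the end yields no span under this reading.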
defSpanType source2 = description: [ !'-'+R ] '-' ... ;
This line is very similar to the line above, but contains a few new expressions:
- ! - not this token
- + - one or more times
- R - extend to the right
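The following Python sketch imitates this second pattern on a tokenized description. It is only an approximation under stated assumptions: the name match_source2 and the sample tokens are hypothetical, and it assumes the bracketed span must begin at the first token of the description because nothing precedes the opening bracket in the pattern.

```python
def match_source2(tokens):
    """Imitate  description: [ !'-'+R ] '-' ...  on one tokenized description.

    Returns the matched span as a list of tokens, or None if nothing matches.
    """
    end = 0
    # !'-'+R : take non-dash tokens from the start, extending to the
    # right as far as possible.
    while end < len(tokens) and tokens[end] != "-":
        end += 1
    # + requires at least one token, and a '-' token must follow the span.
    if end == 0 or end == len(tokens):
        return None
    return tokens[:end]


# Hypothetical description, already tokenized.
print(match_source2(["AP", "-", "Storm", "hits", "coast"]))  # prints ['AP']
```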
To see the parameters for running a Mixup program, type:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -help
Now let's try running a sample Mixup program. To do this, make sure the sample programs are in your minorthird/lib/mixup directory. Do:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -showResult
The -showResult parameter will graphically display the output. Or do:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -gui
Press the Start Task button to run the program. When the program is done running, a window like this will appear:
This window looks similar to the one that appeared when you ran ViewLabels; however, you will notice that there are now 6 span types instead of 4, since sample1.mixup defined two more span types: source1 and source2. To see what the Mixup program extracted, try going to the SpanTypes tab and highlighting source1 and source2.
sample1a.mixup demonstrates what happens if a Mixup expression contains + instead of +R. Unlike other languages, which extend patterns greedily, Mixup takes each pattern literally and backtracks as needed. To see how this works, run:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1a.mixup -showResult -saveAs foo.labels
Note: -saveAs FILE saves the labels to FILE in a computer-readable format.
When the window appears, highlight source2. Knowing that source2 is any prefix that ends before a -, you can see how this does not work right. Now try running sample1.mixup again and see how it does work right with +R rather than just +.
The lessons from these two sample Mixup programs are:
- Use L and R prefixes when you can.
- Use non-determinism when you need to.
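One way to picture the difference is to contrast a non-deterministic reading, which accepts every prefix that ends before some - token, with a reading that stops at the first - token. The Python sketch below does only that; it does not reproduce sample1a.mixup, and the function names and sample tokens are hypothetical.

```python
def prefixes_before_any_dash(tokens):
    """Non-deterministic reading: every non-empty prefix that ends just before some '-' token."""
    return [tokens[:i] for i, tok in enumerate(tokens) if tok == "-" and i > 0]


def prefix_before_first_dash(tokens):
    """Extended-right reading: the one non-empty prefix that stops at the first '-' token."""
    if "-" not in tokens:
        return None
    first = tokens.index("-")
    return tokens[:first] if first > 0 else None


# Hypothetical description with two '-' tokens.
tokens = ["AP", "-", "Storm", "warning", "-", "coast"]
print(prefixes_before_any_dash(tokens))  # prints [['AP'], ['AP', '-', 'Storm', 'warning']]
print(prefix_before_first_dash(tokens))  # prints ['AP']
```

With two - tokens in the text, the first reading yields two overlapping spans while the second yields one, which is the kind of over-matching to look for when highlighting source2.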
Take a look at another example, sample2.mixup. Then run:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample2.mixup -showResult
Now let's take a look at some annotators:
- Open sample3.mixup (don't look at it yet).
- Run (this will take a while):
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample3.mixup -showResult
- Now take a look at sample3.mixup:
  - require asks for a type of annotation to be defined; it is similar to an import statement in Java.
  - Annotators are usually found in $MINORTHIRD/lib/mixup.
  - Annotators can be re-defined in annotators.config, which is usually in $MINORTHIRD/config/annotators.config.
- When RunMixup is finished running, we will save the computation to save time later on. To do this, click the SaveAs button at the bottom middle of the top left window (you will have to scroll to get there). Note: File -> Save As does not work in this case; that is only for serializable objects.
- Now pick out some useful tags and save them in small-newsdir.labels:
$ perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels | grep addToType | cut -d" " -f5 | sort | uniq -c
$ perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels > small-newsdir.labels
$ java -Xmx500M edu.cmu.minorthird.ui.ViewLabels -labels small-newsdir
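For readers less familiar with Perl one-liners, the filtering step above can be sketched in Python. The sketch assumes, as the commands imply, that each label line is whitespace-separated with the span type in the fifth field; the sample lines are made up for illustration and are not taken from a real sample3.labels.

```python
import re
from collections import Counter

# Same alternation as the perl command; it is unanchored, so it matches
# any type name that merely contains one of these strings.
KEEP = re.compile(
    r"(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)"
)


def keep_line(line):
    """True if the fifth whitespace-separated field matches the KEEP pattern."""
    fields = line.split()
    return len(fields) > 4 and KEEP.search(fields[4]) is not None


def filter_labels(lines):
    """Equivalent of the perl filter: keep only lines with a wanted type."""
    return [line for line in lines if keep_line(line)]


def count_types(lines):
    """Equivalent of: perl filter | grep addToType | cut -f5 | sort | uniq -c."""
    return Counter(
        line.split()[4] for line in filter_labels(lines) if "addToType" in line
    )


# Hypothetical label lines.
sample = [
    "addToType doc1 0 5 title",
    "addToType doc1 10 3 NNP",
    "addToType doc1 10 3 VBD",
    "addToType doc2 0 7 title",
]
print(filter_labels(sample))
print(count_types(sample))
```

Because the pattern is unanchored, a type such as extracted_company would also be kept, since it contains the string extracted.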
Note: the order in which MinorThird searches for labels given -labels FOO is as follows:
- Look in the repository.
- Look for directory FOO.
- Look for FOO.labels for markup, and ignore in-line markup.
- Look for in-line markup.
Debugging Mixup gives you the ability to edit your labels and your labeling program in parallel. To see how this works, copy saved-handLabeled.labels to handLabeled.labels and do:
$ java -Xmx500M edu.cmu.minorthird.ui.DebugMixup -labels small-newsdir -edit handLabeled.labels -mixup sample5.mixup
A window that looks like this will appear (without the highlighting at first):
To highlight extracted companies (which were defined by the Mixup program), select extracted_company from the first pull-down menu on the section divider. All the extracted companies will turn yellow (you may have to scroll down a little to find any). Then, to view the true companies, which were defined by handLabeled.labels, select true_company from the second pull-down menu. All hand-labeled companies that were properly extracted by the Mixup program will turn green, all companies that were missed by the Mixup program will turn blue, and false positives will turn red (see above picture for reference).
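The color scheme corresponds to a simple comparison of two sets of spans. The Python sketch below shows that comparison under the assumption that a span counts as properly extracted only when it matches a hand-labeled span exactly; the (document, start, length) tuples and the function name color_spans are hypothetical, not MinorThird data structures.

```python
def color_spans(extracted, hand_labeled):
    """Group spans the way the debugging window colors them.

    green: hand-labeled and also extracted by the Mixup program
    blue:  hand-labeled but missed by the Mixup program
    red:   extracted but not hand-labeled (false positives)
    """
    extracted, hand_labeled = set(extracted), set(hand_labeled)
    return {
        "green": extracted & hand_labeled,
        "blue": hand_labeled - extracted,
        "red": extracted - hand_labeled,
    }


# Hypothetical spans as (document, start offset, length).
extracted_company = {("doc1", 0, 9), ("doc1", 40, 5)}
true_company = {("doc1", 0, 9), ("doc2", 12, 7)}
print(color_spans(extracted_company, true_company))
```

A large red group suggests the Mixup pattern is too loose, while a large blue group suggests it is too strict.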
To edit the labels, click on a document, and click the Import button at the bottom of the window. This will import all the extracted company labels. To correct these labels, click the Next button to move through them, and click Delete when a label is a false positive. To add a label, highlight the span and click Add. When you are finished labeling a document, click Export. Click Save when you finish.
Some tips:
- On the right-hand side of the center bar, replace -top- with -body- to focus the window on what you care about.
- Replace -top- with -extracted company- and move the slider to look for extractions-in-context.
When you're close enough with the debugging, you might want to hand the task over to someone else to get more training data. First run the current program:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample5.mixup -saveAs sample5.labels
Now take the relevant part of its output, and your hand-labeling results, and merge them:
$ grep extracted_company sample5.labels > labelingTask.labels
$ cat handLabeled.labels >> labelingTask.labels
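The two commands above amount to a filter followed by a concatenation. A minimal Python sketch of the same steps, working on lists of lines rather than files, looks like this; the function name build_labeling_task and the sample lines are hypothetical.

```python
def build_labeling_task(program_lines, hand_lines, extracted_type="extracted_company"):
    """Mimic: grep extracted_company sample5.labels, then append handLabeled.labels.

    Like grep, this keeps any line that contains the type name anywhere,
    not only in the type field.
    """
    kept = [line for line in program_lines if extracted_type in line]
    return kept + list(hand_lines)


# Hypothetical label lines standing in for the two files.
sample5 = [
    "addToType doc1 0 9 extracted_company",
    "addToType doc1 0 30 title",
]
hand_labeled = ["addToType doc1 0 9 true_company"]
for line in build_labeling_task(sample5, hand_labeled):
    print(line)
```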
Now run the labeling tool (which is somewhat stripped down) on the result:
$ java -Xmx500M edu.cmu.minorthird.ui.EditLabels -labels small-newsdir -edit labelingTask.labels -extractedType extracted_company -trueType true_company