Mixup Language
At the simplest level, Mixup is a pattern language for text spans (i.e. token sequences). Each expression is defined relative to a labeling called `TextLabels`. From the ground up:
A simple pattern component (SPC) matches a single token. The SPCs include:
- `any` matches any token.
- `eq('foo')` matches the token `foo`. This can also be abbreviated as `'foo'` (in single quotes).
- `re('regex')` matches any token whose string value matches the given regular expression (from the `java.util.regex` package). For instance, `re('^\\d+$')` matches any token consisting entirely of digits.
- `a(foo)` matches any token whose string value is in the dictionary named `foo`. Dictionaries are defined in a `TextLabels`. For example, `a(weekday)` might match one of `'sun'`, `'mon'`, `'tues'`, ..., `'sat'`.
- `foo:bar` matches any token that has been tagged as having the value `bar` for the property `foo`. For example, `pos:det` might match a determiner.
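The SPC kinds listed above can be modeled as predicates over a single token. The following is a minimal Python sketch of that idea, not Mixup's actual implementation: the token representation, the helper names (`make_token`, `spc_any`, `spc_eq`, `spc_re`, `spc_dict`, `spc_prop`), and the contents of the `weekday` dictionary are all assumptions made for illustration. It also uses Python's `re` module with search semantics, whereas Mixup uses `java.util.regex`, so regex dialect details may differ.

```python
import re

# Hypothetical stand-in for the dictionaries a TextLabels would supply.
DICTIONARIES = {"weekday": {"sun", "mon", "tues", "wed", "thurs", "fri", "sat"}}


def make_token(text, **props):
    """Model a token as its string value plus a map of tagged properties."""
    return {"text": text, "props": props}


def spc_any():
    """any: matches every token."""
    return lambda tok: True


def spc_eq(word):
    """eq('word'): matches a token whose string value equals word exactly."""
    return lambda tok: tok["text"] == word


def spc_re(pattern):
    """re('pattern'): matches a token whose string value matches the regex."""
    compiled = re.compile(pattern)
    return lambda tok: compiled.search(tok["text"]) is not None


def spc_dict(name):
    """a(name): matches a token whose string value is in the named dictionary."""
    return lambda tok: tok["text"] in DICTIONARIES[name]


def spc_prop(prop, value):
    """prop:value: matches a token tagged with that value for that property."""
    return lambda tok: tok["props"].get(prop) == value


the = make_token("the", pos="det")
print(spc_prop("pos", "det")(the))            # True
print(spc_re(r"^\d+$")(make_token("2024")))   # True
print(spc_dict("weekday")(make_token("mon"))) # True
```

Whether dictionary and `eq` lookups are case-sensitive is not stated here, so the sketch simply compares strings exactly.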
SPCs can be negated by prefixing them with a bang (`!`). A conjunction of (optionally negated) SPCs can be formed with angle brackets and commas. For instance, `<a(month),!may>` might match any of `'jan'`, `'feb'`, ..., `'april'`, `'june'`, ..., or `'december'`.
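Negation and conjunction compose naturally if each SPC is treated as a predicate on one token. The sketch below illustrates this in Python under that assumption; tokens are plain strings, the combinator names (`negate`, `conjoin`, `in_dict`, `eq`) are hypothetical, and the spelling of the entries in the `month` dictionary is invented for the example.

```python
# Hypothetical contents for the 'month' dictionary.
MONTHS = {"jan", "feb", "march", "april", "may", "june",
          "july", "aug", "sept", "oct", "nov", "december"}


def in_dict(words):
    """Models a(dict): true when the token is a member of the dictionary."""
    return lambda token: token in words


def eq(word):
    """Models 'word': true when the token equals word."""
    return lambda token: token == word


def negate(spc):
    """Models !spc: true exactly when the wrapped SPC is false."""
    return lambda token: not spc(token)


def conjoin(*spcs):
    """Models <spc1,spc2,...>: true when every component SPC is true."""
    return lambda token: all(spc(token) for spc in spcs)


# Counterpart of <a(month),!may>
month_but_not_may = conjoin(in_dict(MONTHS), negate(eq("may")))

print(month_but_not_may("june"))    # True
print(month_but_not_may("may"))     # False
print(month_but_not_may("monday"))  # False
```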
A repeated pattern component (RPC) matches a sequence of adjacent tokens. An RPC is formed by appending one of the regex-like postfix operators (`*`, `+`, `?`, or `{i,j}`, where `i` and `j` are numbers) to an SPC. The RPC `any*` can be abbreviated as `...`. An RPC matches any sequence of between `i` and `j` tokens such that every token in the sequence matches the underlying SPC; as in regular expressions, `*` means zero or more tokens, `+` one or more, and `?` zero or one. Examples:
- `a(name){1,3}` matches any sequence of 1-3 tokens in the `name` dictionary.
- `<!a(punct),!'and'>*` matches any sequence of tokens that are not in the `punct` dictionary and are not the token `and`.
- `pos:noun?` matches a one-token sequence with the `pos` property set to `noun`, or an empty sequence.
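To make the RPC definition concrete, the sketch below enumerates every span that a single RPC could match in a token list. It is a Python illustration under stated assumptions, not Mixup's matcher: spans are half-open `(start, end)` index pairs, `hi=None` stands for an unbounded upper limit, and `rpc_matches` and the `NAMES` dictionary are hypothetical.

```python
def rpc_matches(spc, lo, hi, tokens):
    """Return all (start, end) spans of length lo..hi whose tokens all satisfy spc.

    Operator mapping assumed here:
      *  -> lo=0, hi=None     +  -> lo=1, hi=None
      ?  -> lo=0, hi=1      {i,j} -> lo=i, hi=j
    """
    spans = []
    n = len(tokens)
    for start in range(n + 1):
        if lo == 0:
            spans.append((start, start))  # the empty sequence at this position
        end = start
        # Grow the span one token at a time while the SPC keeps matching.
        while end < n and spc(tokens[end]) and (hi is None or end - start < hi):
            end += 1
            if end - start >= lo:
                spans.append((start, end))
    return spans


NAMES = {"john", "smith"}  # hypothetical 'name' dictionary
tokens = ["dr", "john", "smith", "said"]

# Counterpart of a(name){1,3}
print(rpc_matches(lambda t: t in NAMES, 1, 3, tokens))
# [(1, 2), (1, 3), (2, 3)]

# Counterpart of pos:noun? with 'car' standing in for a noun-tagged token
print(rpc_matches(lambda t: t == "car", 0, 1, ["red", "car"]))
# [(0, 0), (1, 1), (1, 2), (2, 2)]
```

Note that an RPC on its own matches every qualifying span, including overlapping ones; the `L` and `R` modifiers exist to prune these.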
An RPC can also be preceded by the token `L` or followed by the token `R`. An RPC modified by `L` matches a sequence only if that sequence cannot be extended one token to the left and still match. An RPC modified by `R` is analogous: the sequence must not be extendable one token to the right. For instance:
- `pos:adj+` matches any sequence of adjectives (if that's what `pos:adj` means). However, `L pos:adj+` only matches a sequence of adjectives that does NOT have an adjective immediately to the left of it.
- `any{3,5}` matches any sequence of 3-5 tokens. However, `any{3,5} R` only matches a sequence of 3-5 tokens that can't be extended to the right; in other words, a sequence that is either exactly 5 tokens long or ends with the final token of a document.
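The effect of `L` and `R` can be sketched as a filter over the spans an unmodified RPC would match: a span survives only if the span one token wider on that side is not itself a match for the same RPC. The Python below is an illustrative model of that reading, with hypothetical helper names (`is_match`, `all_matches`, `left_maximal`, `right_maximal`) and half-open `(start, end)` spans.

```python
def is_match(spc, lo, hi, tokens, start, end):
    """True if tokens[start:end] has a length in lo..hi and every token satisfies spc."""
    length = end - start
    if length < lo or (hi is not None and length > hi):
        return False
    return all(spc(tok) for tok in tokens[start:end])


def all_matches(spc, lo, hi, tokens):
    """Every span the unmodified RPC matches."""
    n = len(tokens)
    return [(s, e) for s in range(n + 1) for e in range(s, n + 1)
            if is_match(spc, lo, hi, tokens, s, e)]


def left_maximal(spc, lo, hi, tokens):
    """L modifier: drop spans that still match when extended one token left."""
    return [(s, e) for (s, e) in all_matches(spc, lo, hi, tokens)
            if s == 0 or not is_match(spc, lo, hi, tokens, s - 1, e)]


def right_maximal(spc, lo, hi, tokens):
    """R modifier: drop spans that still match when extended one token right."""
    n = len(tokens)
    return [(s, e) for (s, e) in all_matches(spc, lo, hi, tokens)
            if e == n or not is_match(spc, lo, hi, tokens, s, e + 1)]


ADJECTIVES = {"big", "red"}  # stands in for tokens tagged pos:adj
phrase = ["a", "big", "red", "car"]

# Counterpart of L pos:adj+ : the span (2, 3) is dropped because 'big' precedes it.
print(left_maximal(lambda t: t in ADJECTIVES, 1, None, phrase))
# [(1, 2), (1, 3)]

# Counterpart of any{3,5} R on a six-token document.
print(right_maximal(lambda t: True, 3, 5, list("abcdef")))
# [(0, 5), (1, 6), (2, 6), (3, 6)]
```

The second result shows the two cases described above: a span either has the maximum length of 5 or ends at the final token.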
An RPC can also be either `@foo` or `@foo?`, where `foo` is a type. The RPC `@foo` matches a span of type `foo`. The RPC `@foo?` matches either a span of type `foo` or an empty sequence.
A Mixup pattern is a sequence of RPCs concatenated together. A Mixup pattern matches a token sequence if the sequence can be divided into consecutive segments, one per RPC and in order, such that each segment matches its RPC. Examples:
- `... ',' 'Ph' '.' 'D'` matches any token sequence ending in ", Ph.D".
- `... '(' !eq(')'){,10} ')' ...` matches any sequence containing a parenthesized expression with at most 10 tokens in it.
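Matching a whole pattern amounts to finding a way to split the token sequence so that each RPC, in order, accounts for one consecutive segment. The following Python sketch shows one way to do that with simple backtracking. It is an assumption-laden illustration, not the Mixup implementation: each RPC is represented as a hypothetical `(spc, lo, hi)` triple, and the tokenization of the example inputs is invented.

```python
def matches(pattern, tokens, pos=0):
    """True if tokens[pos:] can be split into segments matching each RPC in order."""
    if not pattern:
        return pos == len(tokens)  # every token must be accounted for
    (spc, lo, hi), rest = pattern[0], pattern[1:]
    count = 0  # number of tokens given to the first RPC so far
    while True:
        if count >= lo and matches(rest, tokens, pos + count):
            return True
        if hi is not None and count >= hi:
            return False
        if pos + count >= len(tokens) or not spc(tokens[pos + count]):
            return False
        count += 1


def eq(word):
    return lambda tok: tok == word


def one(spc):
    """An SPC with no postfix operator matches exactly one token."""
    return (spc, 1, 1)


ELLIPSIS = (lambda tok: True, 0, None)  # '...' abbreviates any*

# Counterpart of: ... ',' 'Ph' '.' 'D'
phd = [ELLIPSIS, one(eq(",")), one(eq("Ph")), one(eq(".")), one(eq("D"))]
print(matches(phd, ["Jane", "Doe", ",", "Ph", ".", "D"]))       # True
print(matches(phd, ["Jane", "Doe", ",", "Ph", ".", "D", "."]))  # False

# Counterpart of: ... '(' !eq(')'){,10} ')' ...
parens = [ELLIPSIS, one(eq("(")), (lambda tok: tok != ")", 0, 10),
          one(eq(")")), ELLIPSIS]
print(matches(parens, ["see", "(", "fig", "1", ")", "now"]))    # True
```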
Returning for a moment to the `L` and `R` operators, which say that a matched sequence can't be extended to the left or right: note that "can't be extended" can be interpreted two ways. Either (a) any extension causes that RPC to fail to match, or (b) any extension causes that RPC to fail to match or else causes some other RPC elsewhere in the Mixup pattern to fail. The implementation currently adopts interpretation (a).
Mixup is normally used for extraction, not matching. For extraction, every Mixup expression should contain matching left and right square brackets. For each `Span` that the expression is matched against, and for every possible way the expression can be matched, the subspan of tokens matching the RPCs inside the square brackets is extracted. For example,
... a(endOfSent) [ re('^[A-Z]') !a(endOfSent){3,} a(endOfSent)] ...
will extract "sentences" (actually, every sequence of at least three words between things in the `endOfSent` dictionary). And
... [any any] ...
will extract all token bi-grams.
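Extraction can be pictured as enumerating every way the pattern matches and recording the segment that falls inside the brackets. The Python sketch below models a bracketed expression as three lists of hypothetical `(spc, lo, hi)` triples (before, inside, and after the brackets). It is only an illustration: the function names are invented, and collapsing duplicate spans into a sorted list is an assumption, since the text does not say how repeated extractions of the same span are handled.

```python
def segment_ends(pattern, tokens, pos):
    """Yield every position reachable by matching all RPCs in pattern from pos."""
    if not pattern:
        yield pos
        return
    (spc, lo, hi), rest = pattern[0], pattern[1:]
    count = 0  # tokens consumed by the first RPC
    while True:
        if count >= lo:
            yield from segment_ends(rest, tokens, pos + count)
        if hi is not None and count >= hi:
            return
        if pos + count >= len(tokens) or not spc(tokens[pos + count]):
            return
        count += 1


def extract(before, inside, after, tokens):
    """Return the (start, end) spans matched by the bracketed part of a pattern."""
    found = set()
    n = len(tokens)
    for start in segment_ends(before, tokens, 0):
        for end in segment_ends(inside, tokens, start):
            # The remainder of the pattern must consume the rest of the tokens.
            if any(stop == n for stop in segment_ends(after, tokens, end)):
                found.add((start, end))
    return sorted(found)


ANY = lambda tok: True
ELLIPSIS = (ANY, 0, None)  # '...' abbreviates any*
ONE_ANY = (ANY, 1, 1)      # a bare 'any'

# Counterpart of: ... [any any] ...
tokens = ["a", "b", "c", "d"]
spans = extract([ELLIPSIS], [ONE_ANY, ONE_ANY], [ELLIPSIS], tokens)
print(spans)                             # [(0, 2), (1, 3), (2, 4)]
print([tokens[s:e] for s, e in spans])   # [['a', 'b'], ['b', 'c'], ['c', 'd']]
```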
For more examples see also Mixup Tutorial.
The `MixupProgram` class allows a series of statements to be executed, one after another, in order to modify a text labeling. Most of these statements are based on evaluating Mixup patterns, and then modifying the labels in response to those patterns.
Different types of Mixup statements are shown below:
`defDict D = W1,W2,...,Wk` adds words `W1,...,Wk` to dictionary `D`. If `Wi` is in double quotes, then `Wi` is interpreted as a file name, and all lines from that file are loaded into the dictionary.
`provide ANNOTATION_TYPE` declares that annotations of the given type are provided by this Mixup program.
`require ANNOTATION_TYPE,FILE` checks to see if annotations of the given type are present; if not, the Mixup program in `FILE` is executed. `FILE` can be wrapped in single quotes.
`defSpanType TYPE SPAN_GENERATOR` adds all spans generated by the `SPAN_GENERATOR` to `TYPE`. There are several types of `SPAN_GENERATOR`:
- `=T: EXPR` runs the Mixup expression `EXPR` on every span of type `T`, and returns all spans extracted by it.
- `=T- EXPR` runs the Mixup expression `EXPR` on every span of type `T`, and returns all spans `S` in `T` such that nothing was successfully extracted by `EXPR`.
- `=T~ re REGEX,N` runs the Java regular expression `REGEX` on the string associated with each span `S` in `T`, and returns the span associated with the Nth group in `REGEX`. If the Nth group of the regular expression `REGEX` matches something that doesn't align with token boundaries, the closest legal token span will be used instead.
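The token-alignment step of the `=T~ re REGEX,N` generator can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the tokenizer, the function names, and especially the alignment rule, which takes "closest legal token span" to mean the smallest run of tokens covering the group's characters. The actual rule used by Mixup may differ, and Python's `re` stands in for `java.util.regex`.

```python
import re


def token_offsets(text):
    """Hypothetical tokenizer: words and single punctuation marks, as (start, end) offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def group_to_token_span(text, regex, group):
    """Map the character span of a regex group onto a (first, last + 1) token span.

    Returns None when the regex does not match, the group did not participate,
    or the group overlaps no token.
    """
    match = re.search(regex, text)
    if match is None or match.group(group) is None:
        return None
    g_start, g_end = match.span(group)
    offsets = token_offsets(text)
    # Keep every token that overlaps the group's character range.
    touched = [i for i, (s, e) in enumerate(offsets) if s < g_end and e > g_start]
    if not touched:
        return None
    return (touched[0], touched[-1] + 1)


text = "grown in HeLa cells overnight"
# Group 1 is 'HeLa ce', which ends in the middle of the token 'cells'.
print(group_to_token_span(text, r"in (\w+ ce)", 1))  # (2, 4)
```

In the example, the group stops partway through `cells`, so the sketch widens the result to the tokens `HeLa cells`.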
`defSpanProp PROP:VAL SPAN_GENERATOR` is similar to `defSpanType`, but sets the property `PROP` to `VAL` for all generated spans.
`defTokenProp PROP:VAL SPAN_GENERATOR` is also similar to `defSpanType`, but sets the property `PROP` to `VAL` for all tokens contained in the generated spans.
Here's an extended example:
//=============================================================================
// Extract phrases about cells from biomedical captions.
//
// known current bugs:
// need better sentence-starting rules, not using stems
// (sentence start should be based on linguistically proper use of ":")
// need to discard things with unbalanced parens
// undesirable examples:
// "in Hela-tet Of f cells" extracts "f cells"
// "in contrast cells" extracts "in contrast cells"
// "respective cells" extracts "respective cells"
//=============================================================================
// words that might start a plural noun phrase about cells
defDict pluralStart = ,, no, with, within, from, of, the, these, all, in, on, only, for, by, to, other,
have, indicate, represent, show, and, or;
// end of a plural noun phrase about cells - ie, a plural cell-related noun
defDict pluralEnd = cells,strains,clones;
// end of a singular noun phrase about cells
defDict singEnd = cell,strain,clone;
// start of a singular noun phrase about cells
defDict singStart = ,, with, from, of, the, in, on, or, a, an, each, to, other, indicate, represent,
and, or, per;
// numbers
defDict number = one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve,
thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty;
// simplify syntax for these, since there's no good way to quote them
defDict openParen = (;
defDict closeParen = );
// 'context' is anything near a cell end. This is used to restrict search
defSpanType end =: ... [a(pluralEnd)] ... || ... [a(singEnd)] ...;
defSpanType context =: any+ [ any{15} @end any{2}] ... || [ any{,15} @end any{2}] ... ;
// the start of a sentence might have a panel label like (a) before it.
defSpanType sentStart =context: ... ['.' a(openParen) !a(closeParen){1,4} a(closeParen)] ... ;
defSpanType sentStart =context: ... ['.' ] re('^[A-Z]') ... ;
// something to ignore (not extract) that precedes a plural noun phrase
defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart)] ...;
defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart) a(number) ] ...;
defSpanType ignoredPluralStart =context: ... [stem:a(pluralStart) re('^[0-9]+$') ] ...;
defSpanType ignoredPluralStart =context: ... [@sentStart] ...;
// something to ignore (not extract) that precedes a singular noun phrase
defSpanType ignoredSingStart =context: ... [stem:a(singStart)] ...;
defSpanType ignoredSingStart =context: ... [@sentStart] ...;
// don't allow 'breaks' (commas, periods, etc) in the adjectives that qualify a
// cell-related noun.
defDict breakPunct = ,, .;
defSpanType qualifiers =context: ... [!a(breakPunct){1,8}] ...;
// finally define noun phrases as start,qualifiers,end
defSpanType cell =context: ... @ignoredPluralStart [@qualifiers a(pluralEnd)] ... ;
defSpanType cell =context: ... @ignoredSingStart [@qualifiers a(singEnd)] ... ;
// other cases seem to be like 'strain XY123' and 'strain XY123-fobar'
defSpanType cell =context: ... ['strain' re('^[A-Z]+[0-9]+$') '-' any] ... ;
defSpanType cell =context: ... ['strain' re('^[A-Z]+[0-9]+$') !'-'] ... ;