(largely completed; see GitHub issue #405)
We want to index (and search for) relations between two words, such as the dependency relation "verb X has object Y". In this case, we have two words, X and Y, and the object
relation pointing from X to Y:
X -object-> Y
Example:
Y <-o- X
Small man bites large dog.
For dependency relations, the most common case is that a word can point to multiple other words, but can only be pointed to by one other word. We don't want to limit the relations primitive to just dependency relations though, so our implementation will be able to handle any set of relations.
For dependency relations, there's a special relation called the root
relation, which has a target (dep) but not a source (head). We will need to record this special case as an option in the payload.
For dependency relations, linguists call the start of the arrow X the head of the relation and the end of the arrow Y the dependent.
However, because we don't want to limit the relations primitive to only dependency relations, and because the term head can be confusing (there's also the "head of an arrow", but that's the other end), we will use source for X and target for Y in BlackLab.
Of course, in addition to the CQL extension functions described below, we could also include syntax specifically suited for working with dependency relations, which would use the common terminology of head and dependent.
Obviously it would also be very useful to be able to index relations between groups of words. For example, in the example above, the object of the sentence is not just "man" but actually "Small man".
Our implementation should allow us to store relations between groups of words as well, although full support for this might not be in the initial version.
We might even want to support discontinuous groups of words at some point, but the initial version does not do this.
The CoNLL-U format is commonly used to encode texts with these kinds of dependency relations. We should support this format. We can use the LassySmall corpus for testing.
This CoNNL-U viewer is convenient for visualizing data in this format.
Eventually we'll look at adding support for TEI with dependency relations as well.
Indexing relations will mean changes to the index format. We will implement these changes in the integrated index format only; the classic external index format will not support them. We will simply change the integrated index format, so older integrated indexes will no longer be readable. This is okay as long as the integrated index format is still experimental.
A good way to index relations is similar to how we index spans such as <s/>
. Spans are indexed at the starting token, with the end position stored in the payload.
We could index these in the same annotation spans are already indexed in (historically called starttag
, but we will rename it to _relation
). We'd have to be careful to make sure both can easily be distinguished from one another.
Better yet, we should generalize the notion of an "inline tag" such as <s/>
we have right now to be indexed as a relation as well, i.e.:
X -starts_sentence_ending_at-> Y
Small man bites large dog.
NOTE: should there be a mechanism to exclude certain inline tag attributes (e.g. id) from being indexed in the term string? See below.
We could index relations at the source or target position X (with the term we index representing the type of relation, i.e. object
) and store the position of the opposite position (target/source) Y in the payload. Which token we should choose to store a relation depends on which direction is more commonly queried, as having to decode the payload to find the "other end" of the relation takes slightly longer. We may even choose to index at both positions, meaning we would also need to store the direction of the relation. Though we should first test if any of this actually makes a difference in practice.
To support relations between groups of words, we could of course add two length values to the payload as well. The first one would indicate the length of group X (the token the relation is indexed at) and the second one the length of group Y (the token indicated in the payload).
We should use relative positions and variable-length integers to store information in the payload, as this could save significant amounts of disk space and potentially be faster (because more dense storage makes better use of the disk cache).
We should also consider only storing certain values if they differ from the common case (e.g. if we're not dealing with word groups, don't store two 1s for the length).
Obviously, we'd need a query to find a relation, such as:
- find all
object
relations - find
object
relations where the source lemma equalsbite
- find all relations where the source lemma equals
bite
- find
object
relations where the target lemma equalsman
Of course, we also want to run more complex queries that combine several relations. There's two common ways to combine relations: two relations involving the same word, and two transitive relations. Examples:
- Find verbs where "man" is object and "dog" is subject. In other words, we're looking for a word that is the source for two specific relations. Visually:
man <-o- X -s-> dog
- Find subjects where its verb is "bite" and that verb has "man" as its object. In other words, we're looking for a word that is the target for a word that is itself the source for another relation. Visually:
man <-o- bite -s-> X
In the first case, we can search for both relations, and find matches between them based on their source positions.
In the second case, we find matches for both relations, keeping matches from the second relation where its source exactly matches the target of the first relation, then finally find the targets of those filtered matches.
In certain cases we may want to combine two relations into a single larger span, while retaining the source or target we matched them on. After doing this, the source or target need not be the first and last word of the span, so these positions need to be kept track of separately.
A simple way to implement this operation is to return a combined span that has the source and target of the first relation. The source/target from the second relation that was not matched on is lost, but users could still capture those in a group if necessary.
For the low-level syntax, capturing would be up to the user, but for the specific dependency relations syntax, we could automatically capture relevant parts.
If we generalize the above to sources/targets that are groups of words, the operations become slightly more complicated, because source and target are both spans in their own right, but the principles remain the same.
If we want to find e.g. two different adjectives applied to a word, we'll need a way to ensure that two matches are different. We could use "global capture constraints" for this. One issue is that CQL generally only allows these at the top-level of the query (hence the word global), but it seems that this is not a fundamental limitation. A query such as (A:[] "que" B:[] :: A.lemma = B.lemma) [lemma="willen"]
should be valid, as long as the constraint only references groups captured within the parentheses.
These "local constraints" combined with the ability to check if two groups are (not) equal, e.g. something like A@start > B@start
to ensure A
occurs after B
, should be enough to implement this.
Ideally, we'd like to build a tree structure of relations that can be included with the match.
This would mean joining relations into a tree structure each time we combine two relations.
Dependency relations are never cyclic, but because we want our relations support to be generic and future-proof, we should ensure cycles don't cause problems.
In addition to finding specific relations and combining them, maybe we want to find any "descendants" starting from a specific word. Getting this to perform well would probably be challenging though.
We'll provide a syntax specifically designed to easily query dependency relations (the likely most common use case for now).
For the more generic "building block" functions that these queries will rewrite to internally, see the next section below.
The dependency relations syntax could perhaps be toggleable, so it can be disabled if desired. However, if it does not interfere with other CQL queries, we can probably just enable it by default.
The dependency relation operator is an n-ary operator to match one or more dependency relations between a parent and several children. It also allows you to exclude certain relations, i.e. ensure that these don't exist.
parent
-deprel1-> child1;
-deprel2-> child2;
!-deprel3-> child3; ...
deprel
is interpreted as a regular expression. If you omit deprel
from the operator (i.e. -->
), it will match any relation type. A different way to write this is -.*->
.
Note that the NOT version of the operator is !-deprel->
; the !
here is part of the
operator symbol just like with !=
, not a separate operator. Adding a NOT clause means "there exists no relation of this type to this target".
The above operator expression is equivalent to the following functional query style:
rmatch(parent,
rel(deprel1, child1),
rel(deprel2, child2),
!rel(deprel3, child3), ...)
Similarly
parent -deprel1-> child -deprel2-> grandchild
will be equivalent to
rmatch(parent,
rel(deprel1, child,
rel(deprel2, grandchild)))
You can find root relations using this unary form of the operator:
^--> child
(you may specify a deprel
in the operator here as well, although that will generally always be root
for root relations)
The root relation operator will return the targets of these relations, as root relations have no source.
The source and target operands for the relation operators can be replaced by _
, meaning "the default value" or "don't care". These will be replaced with []*
(any n-gram), which puts no restrictions on source or target.
Examples:
# Find subjects that are nouns
_ -nsubj-> [pos="NOUN"]
# Find verbs with an object but no subject
[pos="VERB"] -obj-> _ ;
!-nsubj-> _
# Match node with children of certain types)
_ -nmod-> _ ;
-det-> _ ;
-advmod-> _
# Find two different adjectives
_ -amod-> _; -amod-> _
# Match a series of descendants of certain types, starting with a root
# (i.e. vertical paths in dependency trees)
^--> _ -nmod-> _ -case-> _
We need syntax to incorporate relation searches into Corpus Query Language. For the same reasons as explained in #396, we'll use a simple function call style for now. This will correspond closely to how relations are indexed an will be useful for any type of relations. We will provide specific syntax for relations types such as dependency relations.
TODO:
- exclude relations
- requirement that certain groups may not contain the same match (e.g. targets of two relations attached to the same source must be different, so you can e.g. find two different adjectives attached to a noun)
Below is a quick reference for these basic "building block" relations functions.
We can find a relation by type and target using:
rel(reltype = ".*", target = []*, spanMode = 'source', captureAs = '', direction = 'both')
If reltype
does not contain the substring ::
, it will be prefixed with the default relation class dep::
(for dependency relations). So .*
would match all dependency relations and .*::.*
would match relations of all types. The default relation class could be made configurable, of course.
The default value for target
matches all n-grams, although that query will obviously never actually be executed, we will just not filter on the child at all in that case.
Note that this returns the source of the relation by default, as we've already matched on the target.
spanMode
adjusts the resulting span and can take the values:
"source"
(span becomes the source of the relation)"target"
(span becomes the target of the relation - this is the default value)"full"
(span becomes the full span of the relation, including source and target)
By default rel()
returns spans that match the source of the relation. So e.g. rel('nsubj')
will find words that have a subject, and rel('nsubj', 'target')
will find the subjects themselves, and rel('nsubj', 'full')
will find spans that include both the source and target.
If captureAs
is set to a non-empty string, the relation will be captured as a group with that name. This is useful to avoid collisions if you're capturing the same relation type multiple times in a query. The same can be achieved in operator form with _ A:-det-> _
to capture these relations as A
.
direction
can be set to both
(default), root
(only root relations), forward
or backward
.
Note that if try to find sources of root relations, no matches will be returned, as root relations don't have a source. OTOH, if you find targets of root relations first, then change the span mode to source, root relations will not be thrown out but will return their targets. (NOTE: should this throw an error instead?)
NOTE: there is currently no way to filter out root relations. Do we need this?
We can also change the spanMode of the spans returned by rel()
according to what we need:
rspan(relation_matches, spanMode = "full")
This will return the same relation matches, but with the span start and end set according to the value of spanMode
(see above). The default is full
, i.e. a span covering both the source and target of the relation.
NOTE: for rspan
, there is a special extra spanMode "all"
, that will return a span covering the sources and targets of all relations matched.
We can match a tree fragment of a tree (a parent and some of its children) using rmatch
:
rmatch(clause1, clause2, ! clause3, ...)
Often this will be used with the target of the parent relation as clause1
and the source of child relations as clause2
, clause3
, etc. (some of them negated to indicate such children must not be present).
This operation is the same as a regular AND (NOT) operation with these clauses, but with the added requirement that a given relation may not be matched by more than one clause. (without the rmatch
operation, you might need many capture constraints to enforce this).
We can capture all relations occurring inside a span using rcapture
:
rcapture(query, captureAs, relationType = '.*')
This is useful for e.g. finding all dependency relations in a sentence.
Parameters:
query
is the query we're operating on.rcapture
does not affect the matches, other than adding a match info with the relations captured.captureAs
is the name of the match info to store the captured relations in.relationType
is a regular expression matching the relation types to capture. The default is.*
, i.e. all relations of the default relation class (dep
by default). Seerel()
for more details about matching relation types.
The relations are captured as a list-of-relations match info type.
Construct a relation type string that may include attributes
rtype(class, type, attributes = {})
(not yet implemented)
Below are examples of how to use the rel
function to find relations between words. Note _
indicates the default value for a parameter. Parameters at the end may be omitted; their default values will also be used.
Find all object
relations (spans will contain the source of the relation, i.e. the word that has an object):
rel('dep::object')
rel('object') # "dep::" is automatically prepended
rel('object', _, 'source')
All object
relations (spans will contain the targets, i.e. the objects):
rel('object', _, 'target')
All object
relations where the target is a noun:
rel('object', [pos = 'NOUN'], 'target')
Relations of type object
where the source word is bites
:
'bites' & rel('object')
All relations where the source word equals bites
:
'bites' & rel()
Relations of type object
where the target word equals man
:
'man' & rel('dep::object', _, 'target')
Find root relations:
rel('root')
Find all dependency relations:
rel('dep::.*')
rel('.*')
Find all relations, not just dependency relations:
rel('.*::.*')
Find sentences:
<s/>
rel(rtype('__tag', 's'), _, 'full')
rel('__tag::s'), _, 'full')
Find sentences with happy sentiment:
<s sentiment='happy' confidence='10' />
rel(rtype('__tag', 's', list('sentiment', 'happy', 'confidence', '10')), _, 'full')
rel('__tag::s\u0003\u0001confidence\u000210\u0003\u0001sentiment\u0002happy\u0003'), _, 'full')
Find words that have 'man' as their object ("find relation source where target is 'man' and type is 'object'"):
rel('object', 'man')
Find words that are the object for 'bites' ("find relation target where source is bites
and type is object
"):
rspan('bites' & rel('object'), 'target')
We can combine relation matches using the standard CQL operators.
Some examples follow.
Find words that have both a subject and an object (i.e. relations have same source):
rel('object') & rel('subject')
rmatch(_, rel('object'), rel('subject'))
Find words that have man
as their object and dog
as their subject (i.e. find X in 'man' <-O- X -S-> 'dog'
):
rel('object', 'man') & rel('subject', 'dog')
rmatch(_, rel('object', 'man') & rel('subject', 'dog'))
Find words that are the subject of a word that has 'man' as its object (i.e. find X in 'man' <-O- ? -S-> X
):
rspan(rel('subject') & rel('object', 'man'), 'target')
Note in the above that when combining two relations matches with &
, a new relations match is created that stores the information for both relations. The third rspan
parameter can be used to select the relation, but the default is the first of the two relations combined.
Find case
and nmod
relations with same source (returns that source):
rel('case') & rel('nmod')
rmatch(_, rel('case'), rel('nmod'))
Same, but return target for case
:
rspan(rel('case') & rel('nmod'), 'target')
Same, but return the full span for case
:
rspan(rel('case') & rel('nmod'), 'full')
Same, but return the full span covering both relations:
rspan(rel('case') & rel('nmod'), 'all')
Match target of one relation to source of another (return span covering all matched relations):
rel('nmod', rel('acl:relcl'), 'all')
TODO: rel() doesn't support spanMode 'all' yet, but should (matchTarget can have matched other relations)
TODO: see if we can unify capture groups and relations (WIP)
Just like in other CQL queries, we can tag parts with a group name to capture them.
V:(
rspan(S:relt('subject') &
rspan(O:relt('object'), 'source') &
[pos='VERB']
)
This would return spans including a verb and its subject and object, with the verb tagged as V
, the subject as S
and the object as O
.
As mentioned earlier, we want to unify the spans we already supported in BlackLab (i.e. XML tags like <s/>
) with the new relations. This means we need a way to encode regular spans plus their attributes as relations.
We will encode the tag name and all attribute values into a single indexed term, so we only need to store the payload once and cannot accidentally mix up attributes with those of other spans/relations (as is a problem in the classic external index format).
A potential downside of is that this could greatly increase the number of unique terms in the index, if there's multiple attributes and many different combinations of attribute values. This would consume more disk space and could slow down queries. While this doesn't seem to be a huge problem for most typical datasets, perhaps we could offer the option to exclude certain attributes from indexing.
Similarly, such attributes with many different values will slow down searching, even if we don't filter on any of the attributes. For this (common) case, we should index tags/relations with attributes twice: once with attributes and once without. We should take care to be able to distinguish between the two, e.g. using a prefix that indicates whether or not attributes are included. This also works for tags that have no attributes, as those can only ever be found by queries that don't filter on attributes.
For example, to encode a tag <s sentiment="happy" confidence="10" />
into a single term we can index these two terms in the _relation
attribute:
__tag::s\u0003
__tag::s\u0003\u0001confidence\u000210\u0003\u0001sentiment\u0002happy\u0003
For a tag where no attributes are indexed, e.g. <b>blabla</b>
, the only term we index would be:
__tag::b\u0003
So the "full relation type" in the first example is __tag::s
(consisting of relation class __tag
and relation type s
), from which the tag name s
can be decoded. We keep the tag name as part of the relation type so it's always at the start of the term, allowing us to use a fast prefix query. The full relation type is always followed by \u0003
("value end"). Note that the relation type does not need a prefix character because it is always the first part of the term.
If attributes are included in this term, they follow now, in alphabetical order. The alphabetical order is so we can construct an efficient regex to find multiple of them. Each attribute name is prefixed with \u0001
("name prefix"), and each attribute value prefixed with \u0002
("value prefix"). The term always ends with \u0003
("value end").
Inline tags don't really have a source and target word, and using the first word of the sentence and the last word (or the first word of the next sentence) creates awkward situations when trying to determine the span start and end from the source and target.
Instead, we'll opt to store a 0-length source and target for the tag. This source will start and end at the first word of the tag, and the target will start and end at the first word after the tag. This way the calculations line up.
Because we have XML-style syntax for spans, we likely won't need it, but just in case, a utility function rtype
could be provided, so that:
rtype('__tag', 's', list('sentiment', 'happy', 'confidence', '10'))
would return the regex:
__tag::s\u0003.*\u0001confidence\u000210\u0003.*\u0001sentiment\u0002happy\u0003.*
Similarly, rtype('__tag', 's')
would return the regex:
__tag::s.*\u0003
NOTE: this is beyond the scope of this task, but it's good to keep it in the back of our mind
Span nesting has never been completely accurately recorded in the index, because only the start and end tokens are stored, not the nesting level. If two spans have the same start and end, we can't tell which one is the parent of the other.
We could include the nesting hierarchy in the indexed term, e.g. <b><i>dog</i></b>
would be indexed as b_i
. In addition to giving us the correct nesting, it would also help to speed up queries like ("dog" within <i/>) within <b/>
, which would only have to search for "dog" within <b_i />
We also can't efficiently find an inline tag's ancestors or descendants; within
and contains
are relatively expensive, especially in large documents. We could (for certain relations?) store a unique relation id in the payload and have a separate index file where we can look up parent/child relations, so we can quickly walk the tree. This would help if we want to support complex XPath-like search queries.
This is a complete specification of how relations will be indexed in BlackLab. This includes the new way of indexing spans as relations as well.
Relations (and spans) are indexed in the _relation
annotation. A relation is encoded as a single term with a payload.
We have to choose where to index relations:
- always at the source, or always at the target
- always at the first position in the document (whether that's the source or target)
We'll start with the second option, which has the advantage that the associated spans don't need to be sorted to be in order of increasing start position. It is also the way inline tags are indexed right now, so no existing invariants and optimizations are messed up. This does mean that we will have to record whether the relation was indexed at the source or target.
The term indexed is a string of one of these forms:
relClass::relType\u0001
relClass::relType\u0001\u0003attr1\u0002value1\u0001\u0003attr2\u0002value2\u0001...
We call relClass::relType
the full relation type. It consists of the relation class and the relation type. The relation class distinguishes between different types of relations, e.g. __tag
for inline tags, dep
for dependency relations, etc. The relation type is used to distinguish between different relations of the same class, e.g. dep::subject
for subject relations, dep::object
for object relations, dep::nsubj
for nominal subject relations, etc.
There is always a \u0003
("value end") after the full relation type (such a closing character is useful to avoid unwanted prefix matches when using regexes).
If there are any attributes, they follow next. These are sorted alphabetically by name. An attribute is encoded as follows: a \u0001
, the attribute name, a \u0002
, the attribute value, and a \u0003
. The surrounding special characters ensure we can match any part of the attribute's name or value without unwanted matches.
The payload uses Lucene's VInt
(for non-negative numbers) and ZInt
(an implementation of variable-length quantity (VLQ)). We store a relative position for the target end to save space.
The payload for a relation consists of the following fields:
- This number is either
relTargetStart
orrelationId
, depending on theflags
byte (see below).relationId
is a unique id for this relation, which can be used to look up extra information, such as attributes, and maybe other information in the future. ForrelTargetStart
, see below. Default value:1
. flags: byte
: If0x02
is set, the relation only has a target (root relation). If0x04
is set, use a default length of 1 forsourceLength
andtargetLength
. If0x08
is set, the first number in the payload is therelationId
. The other bits are reserved for future use and must not be set. Default:0
.- If flag
0x08
was set,flags
is followed byrelTargetStart: ZInt
: relative position of the (start of the) target end. Default:1
. If flag0x08
was not set, this number will not be written here (in that case, the first number of the payload isrelTargetStart
, see above). sourceLength: VInt
: length of the source end of the relation. For a single word this would be 1; for a span of words, greater than one. For inline tags, it will be set to 0 (start and end tags are considered to be zero-length). Default:0
(normally) or1
(if flag0x04
is set)targetLength: VInt
: length of the target end of the relation. For a single word this would be 1; for a span of words, greater than one. For inline tags, this will be set to 0 (start and end tags are considered to be zero-length). Default:0
(normally) or1
(if flag0x04
is set)
Fields may be omitted from the end if they have the default value. Therefore, an empty payload means { relTargetStart: 1, flags: 0, sourceLength: 0, targetLength: 0 }
.
As another example, the payload 0x81; 0x04
would mean { relTargetStart: 1, flags: 4, sourceLength: 1, targetLength: 1 }
. Explanation: 0x81
is the VInt
encoding for 1
(the lower seven bits giving the number and the high bit set because this is the last byte of the number). The flag 0x04
is set, so the lengths default to 1
instead of 0
.
(NOTE: this feature is planned but not yet implemented)
Relations with attributes are indexed twice: once with the attributes as part of the term (so we can find instances with specific attribute values efficiently) and once without the attributes (so queries that don't filter on attributes are efficient even if there's a unique id attribute).
This causes problems when we want to group by an attribute value of a tag we matched, because we don't always have those available, depending on the query.
There could also be other things we want to look up about a relation in the future, like its ties to other relations: in the case of inline tags: parent, children, ancestors, descendants, etc. In the case of dependency relations: perhaps information about transitive connections to other relations, etc.
We want to enable this by adding a unique relation id to the payload that allows us to look up additional information in a separate file.
How this changes the relation encoding:
- we add a flag
0x10
; if set, a uniquerelationId
is stored in the payload. It would be stored afterflags
. This ensures BlackLab remains compatible with older corpora; however, corpora encoded this way are not compatible with older versions of BlackLab (as they would not be able to correctly decode the payload). relationId
is a unique number (for this annotated field), assigned when the relation is parsed and added, that can be used to look up information about the relation, such as its attributes, in separate index files.- The index files with relation info is created while writing the indexed terms with attributes to the segment (because that term contains all the information we need to create it, whereas the optimization terms obviously do not). Both versions of the term are added with the same
relationId
, of course.
When found using BlackLab, a relation has a source and target (which may be word groups or single words) as well as a "regular" span.
The span normally runs between source and target, but operations combining relations may enlarge the span so this no longer holds.
To calculate the Lucene span from a matched relation term's payload, first a few helper values:
thisStart = position the relation was indexed at
thisEnd = thisStart + thisLength
otherStart = thisStart + relOtherStart
otherEnd = otherStart + otherLength
So the Lucene span for a relation is:
[thisStart, max(thisEnd, otherEnd) )
(although it seems unlikely that we want the source and target to overlap)
(note that Lucene spans are half-open, i.e. the end position is not included in the span)