Merge pull request #54 from tarsqi/2.0.0-fixes
2.0.0 fixes
marcverhagen authored Apr 3, 2017
2 parents f2948fb + 5df9b92 commit 6facdb8
Showing 28 changed files with 418 additions and 713 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -4,7 +4,7 @@

# The settings file will be different for everyone so don't keep it in revision
# control
/settings.txt
/config.txt

# Python anydbm files created from versioned text files, code runs without these
# files (but should speed up with them)
Expand Down
34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,34 @@
# Change Log

All notable changes to this project will be documented in this file.

The format is loosely based on [Keep a Changelog](http://keepachangelog.com/). Loosely, because we do not keep separate sections within a version for additions, fixes, etcetera; instead, most logged changes start with one of Added, Changed, Deprecated, Removed, Fixed, or Security.

This project tries to adhere to [Semantic Versioning](http://semver.org/).


## Version 2.0.1 - 2017-04-03

- Added links to Tarsqi publications to the manual.
- Added use of confidence scores to LinkMerger (issue #23).
- Fixed bug where TTK created output with duplicate attributes (issue #32).
- Fixed issue with missing link identifiers (issue #38).
- Fixed bug where duplicate links were created by S2T component.
- Removed some completely out-of-date or irrelevant documentation and notes.


## Version 2.0.0 - 2017-03-27

A complete reset of the Tarsqi code. The most significant changes are:

- Massive simplification of many components.
- New and updated documentation.
- Use Mallet toolkit instead of the old classifier.
- Use stand-off annotation throughout instead of inline XML.
- Redesigned libraries.
- New test and evaluation code.


## Version 1.0 - 2007-11-15

First released version. Basically a wrapper around a series of components that had not been released individually before.
2 changes: 1 addition & 1 deletion README.md
@@ -2,4 +2,4 @@

This is the main repository for the Tarsqi Toolkit (TTK), a set of processing components for extracting temporal information from news wire texts. TTK extracts time expressions, events, subordination links and temporal links; in addition, it can ensure consistency of temporal information.

To use the Tarsqi Toolkit first either clone this repository or download the most recent release from https://github.com/tarsqi/ttk/releases, then follow the instructions in the manual at `docs/manual/index.html`, this manual is also posted on the [TimeML website](http://timeml.org/tarsqi/toolkit/manual/versions/2.0.0/manual/).
To use the Tarsqi Toolkit first either clone this repository or download the most recent release from https://github.com/tarsqi/ttk/releases, then follow the instructions in the manual at `docs/manual/index.html`. Manuals can also be browsed on the [TimeML website](http://timeml.org/tarsqi/toolkit/manual/versions/).
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
2.0.0
2.0.1
21 changes: 11 additions & 10 deletions components/blinker/main.py
@@ -20,6 +20,7 @@
VALUE = LIBRARY.timeml.VALUE
EIID = LIBRARY.timeml.EIID
TID = LIBRARY.timeml.TID
LID = LIBRARY.timeml.LID
POL = LIBRARY.timeml.POL
TLINK = LIBRARY.timeml.TLINK
RELTYPE = LIBRARY.timeml.RELTYPE
@@ -362,16 +363,6 @@ def _apply_event_anchoring_rules(self, sentence, timex, i):
return


def _add_tlink(self, reltype, id1, id2, source):
"""Add a TLINK to self.tarsqidoc."""
id1_attr = TIME_ID if id1.startswith('t') else EVENT_INSTANCE_ID
id2_attr = RELATED_TO_TIME if id2.startswith('t') else RELATED_TO_EVENT_INSTANCE
attrs = {id1_attr: id1,
id2_attr: id2,
RELTYPE: reltype,
ORIGIN: source}
self.tarsqidoc.tags.add_tag(TLINK, -1, -1, attrs)

def _apply_event_ordering_with_signal_rules(self):

"""Some more rules without using any rules, basically a placeholder
@@ -410,6 +401,16 @@ def _apply_event_ordering_with_signal_rules(self):
pass


def _add_tlink(self, reltype, id1, id2, source):
"""Add a TLINK to self.tarsqidoc."""
id1_attr = TIME_ID if id1.startswith('t') else EVENT_INSTANCE_ID
id2_attr = RELATED_TO_TIME if id2.startswith('t') else RELATED_TO_EVENT_INSTANCE
attrs = { LID: self.tarsqidoc.next_link_id(TLINK),
id1_attr: id1, id2_attr: id2,
RELTYPE: reltype, ORIGIN: source }
self.tarsqidoc.tags.add_tag(TLINK, -1, -1, attrs)


def _timex_pairs(timexes):
"""Return a list of timex pairs where the first element occurs before the
second element on the input list."""
Expand Down
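The fix for missing link identifiers relies on `self.tarsqidoc.next_link_id(TLINK)`, whose implementation is not part of the hunks shown here. The following is only a minimal sketch of how such a counter might behave, assuming a single counter shared by all link types and identifiers of the form `l1`, `l2`, and so on. The class name `LinkIdCounter`, the `issued` tally, and the `l` prefix are hypothetical and are not taken from the TTK code.

```python
import itertools


class LinkIdCounter:
    """Hypothetical stand-in for whatever backs next_link_id() in TTK."""

    def __init__(self):
        # One shared sequence, so identifiers are unique across link types.
        self._sequence = itertools.count(1)
        # Per-type tally, kept only for bookkeeping in this sketch.
        self.issued = {}

    def next_link_id(self, link_type):
        """Return a fresh identifier and record which link type asked for it."""
        self.issued[link_type] = self.issued.get(link_type, 0) + 1
        # The 'l' prefix is an assumption, not confirmed by the diff.
        return "l%d" % next(self._sequence)


counter = LinkIdCounter()
first = counter.next_link_id("TLINK")
second = counter.next_link_id("SLINK")
```

Requesting the identifier at the moment the attribute dictionary is built, as `_add_tlink` now does, means that every TLINK added to the tag repository carries an identifier from the start.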
21 changes: 13 additions & 8 deletions components/classifier/wrapper.py
@@ -16,6 +16,7 @@
TLINK = LIBRARY.timeml.TLINK
EIID = LIBRARY.timeml.EIID
TID = LIBRARY.timeml.TID
LID = LIBRARY.timeml.LID
RELTYPE = LIBRARY.timeml.RELTYPE
ORIGIN = LIBRARY.timeml.ORIGIN
EVENT_INSTANCE_ID = LIBRARY.timeml.EVENT_INSTANCE_ID
@@ -25,15 +26,16 @@


class ClassifierWrapper:

"""Wraps the maxent link classifier."""

def __init__(self, document):
self.component_name = CLASSIFIER
self.document = document
self.tarsqidoc = document # instance of TarsqiDocument
self.models = os.path.join(TTK_ROOT,
'components', 'classifier', 'models')
self.data = os.path.join(TTK_ROOT, 'data', 'tmp')
options = self.document.options
options = self.tarsqidoc.options
self.mallet = options.mallet
self.classifier = options.classifier
self.ee_model = os.path.join(self.models, options.ee_model)
@@ -48,7 +50,7 @@ def process(self):
et_vectors = os.path.join(self.data, "vectors.ET")
ee_results = ee_vectors + '.out'
et_results = et_vectors + '.out'
vectors.create_tarsqidoc_vectors(self.document, ee_vectors, et_vectors)
vectors.create_tarsqidoc_vectors(self.tarsqidoc, ee_vectors, et_vectors)
commands = [
mallet.classify_command(self.mallet, ee_vectors, self.ee_model),
mallet.classify_command(self.mallet, et_vectors, self.et_model)]
@@ -65,7 +67,7 @@ def process_future(self):
identifier is missing from the output."""
# TODO: when this is tested enough let it replace process()
(ee_vectors, et_vectors) \
= vectors.collect_tarsqidoc_vectors(self.document)
= vectors.collect_tarsqidoc_vectors(self.tarsqidoc)
mc = mallet.MalletClassifier(self.mallet)
mc.add_classifiers(self.ee_model, self.et_model)
ee_in = [str(v) for v in ee_vectors]
@@ -89,11 +91,12 @@ def _add_links(self, ee_vectors, et_vectors, ee_results, et_results):
continue
id1 = result_id.split('-')[-2]
id2 = result_id.split('-')[-1]
attrs = { RELTYPE: scores[0][1],
attrs = { LID: self.tarsqidoc.next_link_id(TLINK),
RELTYPE: scores[0][1],
ORIGIN: "%s-%.4f" % (CLASSIFIER, scores[0][0]),
_arg1_attr(id1): id1,
_arg2_attr(id2): id2 }
self.document.tags.add_tag(TLINK, -1, -1, attrs)
self.tarsqidoc.tags.add_tag(TLINK, -1, -1, attrs)

def _add_links_future(self, ee_results, et_results):
"""Insert new tlinks into the document using the results from the
@@ -106,9 +109,11 @@ def _add_links_future(self, ee_results, et_results):
id2 = result_id.split('-')[-1]
reltype = scores[0][1]
origin = "%s-%.4f" % (CLASSIFIER, scores[0][0])
attrs = { RELTYPE: reltype, ORIGIN: origin,
attrs = { LID: self.tarsqidoc.next_link_id(TLINK),
RELTYPE: reltype, ORIGIN: origin,
_arg1_attr(id1): id1, _arg2_attr(id2): id2 }
self.document.tags.add_tag(TLINK, -1, -1, attrs)
print attrs
self.tarsqidoc.tags.add_tag(TLINK, -1, -1, attrs)


def _get_vector_identifier(line):
Expand Down
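The classifier stores its confidence inside the `ORIGIN` attribute using the pattern `"%s-%.4f" % (CLASSIFIER, score)`. The sketch below shows one way that string could be built and read back without a hard-coded character offset. It assumes that the `CLASSIFIER` constant has the value `'CLASSIFIER'` and that scores are non-negative. The helper names `make_origin` and `parse_origin` are hypothetical and do not exist in TTK.

```python
CLASSIFIER = "CLASSIFIER"  # assumed value of library.tarsqi_constants.CLASSIFIER


def make_origin(component, confidence):
    """Build an origin string such as 'CLASSIFIER-0.8731'."""
    return "%s-%.4f" % (component, confidence)


def parse_origin(origin):
    """Split an origin into (component, confidence).

    The confidence is None when the text after the last hyphen is missing
    or is not a number, as for a plain 'S2T' origin."""
    component, separator, suffix = origin.rpartition("-")
    if separator:
        try:
            return component, float(suffix)
        except ValueError:
            pass
    return origin, None


origin = make_origin(CLASSIFIER, 0.87312)
component, confidence = parse_origin(origin)
```

Splitting on the last hyphen keeps the parsing independent of the length of the component name, which a fixed slice such as `origin[11:]` is not.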
73 changes: 52 additions & 21 deletions components/merging/wrapper.py
@@ -1,8 +1,6 @@
"""
Python wrapper around the merging code.
"""Python wrapper around the merging code.
Calls SputLink's ConstraintPropagator do do all the work. For now does not do
Calls SputLink's ConstraintPropagator to do all the work. For now does not do
any graph reduction at the end.
TODO:
@@ -11,15 +9,16 @@
links. We could add the disjunctive links as well. Or we could not take
inverse relations. Or we could reduce the graph to a minimal graph.
- We now give all links, but they are not ordered. For the merging routine to be
fully effective we should rank the TLINKs in terms of how likely they are to
be correct.
- We now put all links on the queue and order them in a rather simplistic way,
where we have S2T < Blinker < Classifier, and classifier links are ordered using
their classifier-assigned confidence scores. For the merging routine to be better we
should let the ranking be informed by hard evaluation data.
"""

import os

from library.tarsqi_constants import LINK_MERGER
from library.tarsqi_constants import LINK_MERGER, S2T, BLINKER, CLASSIFIER
from library.main import LIBRARY
from docmodel.document import Tag
from components.common_modules.component import TarsqiComponent
@@ -28,6 +27,15 @@

TTK_ROOT = os.environ['TTK_ROOT']

TLINK = LIBRARY.timeml.TLINK
LID = LIBRARY.timeml.LID
RELTYPE = LIBRARY.timeml.RELTYPE
ORIGIN = LIBRARY.timeml.ORIGIN
EVENT_INSTANCE_ID = LIBRARY.timeml.EVENT_INSTANCE_ID
TIME_ID = LIBRARY.timeml.TIME_ID
RELATED_TO_EVENT_INSTANCE = LIBRARY.timeml.RELATED_TO_EVENT_INSTANCE
RELATED_TO_TIME = LIBRARY.timeml.RELATED_TO_TIME


class MergerWrapper:

@@ -41,7 +49,9 @@ def process(self):
"""Run the constraint propagator on all TLINKS in the TarsqiDocument and
add resulting links to the TarsqiDocument."""
cp = ConstraintPropagator(self.tarsqidoc)
# TODO: this is where we need to order the tlinks
tlinks = self.tarsqidoc.tags.find_tags(LIBRARY.timeml.TLINK)
# use a primitive sort to order the links on how good they are
tlinks = sorted(tlinks, compare_links)
cp.queue_constraints(tlinks)
cp.propagate_constraints()
cp.reduce_graph()
@@ -63,30 +73,51 @@ def _add_constraint_to_tarsqidoc(self, edge):
id2 = edge.node2
origin = edge.constraint.source
tag_or_constraints = edge.constraint.history
attrs = {}
if isinstance(edge.constraint.history, Tag):
tag = edge.constraint.history
attrs = tag.attrs
else:
attrs[LIBRARY.timeml.RELTYPE] = translate_interval_relation(edge.constraint.relset)
attrs[LIBRARY.timeml.ORIGIN] = LINK_MERGER
attrs[tlink_arg1_attr(id1)] = id1
attrs[tlink_arg2_attr(id2)] = id2
self.tarsqidoc.tags.add_tag(LIBRARY.timeml.TLINK, -1, -1, attrs)
attrs = {
LID: self.tarsqidoc.next_link_id(TLINK),
RELTYPE: translate_interval_relation(edge.constraint.relset),
ORIGIN: LINK_MERGER,
tlink_arg1_attr(id1): id1,
tlink_arg2_attr(id2): id2}
self.tarsqidoc.tags.add_tag(TLINK, -1, -1, attrs)


def compare_links(link1, link2):
"""Compare the two links and decide which one of them is more likely to be
correct. Rather primitive for now. We consider S2T links the best, then links
derived by Blinker, then links derived by the classifier. Classifier links
themselves are ordered using their classifier-assigned confidence scores."""
o1, o2 = link1.attrs[ORIGIN], link2.attrs[ORIGIN]
if o1.startswith('S2T'):
return 0 if o2.startswith('S2T') else -1
elif o1.startswith('BLINKER'):
if o2.startswith('S2T'):
return 1
elif o2.startswith('BLINKER'):
return 0
elif o2.startswith('CLASSIFIER'):
return -1
elif o1.startswith('CLASSIFIER'):
if o2.startswith('CLASSIFIER'):
o1_confidence = float(o1[11:])
o2_confidence = float(o2[11:])
return cmp(o2_confidence, o1_confidence)
else:
return 1


def tlink_arg1_attr(identifier):
"""Return the TLINK attribute for the element linked given the identifier."""
return _arg_attr(identifier,
LIBRARY.timeml.TIME_ID,
LIBRARY.timeml.EVENT_INSTANCE_ID)
return _arg_attr(identifier, TIME_ID, EVENT_INSTANCE_ID)


def tlink_arg2_attr(identifier):
"""Return the TLINK attribute for the element linked given the identifier."""
return _arg_attr(identifier,
LIBRARY.timeml.RELATED_TO_TIME,
LIBRARY.timeml.RELATED_TO_EVENT_INSTANCE)
return _arg_attr(identifier, RELATED_TO_TIME, RELATED_TO_EVENT_INSTANCE)


def _arg_attr(identifier, attr1, attr2):
Expand Down
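`compare_links` is a Python 2 style comparison function, passed positionally to `sorted`. Python 3 removed the `cmp` argument, so the same S2T < Blinker < Classifier ordering would there have to be expressed as a key function. The sketch below is one possible rewrite, not the project's code. It works on origin strings rather than `Tag` objects, assumes the origin prefixes are the string literals used in `compare_links`, and assumes that classifier origins always carry a numeric suffix. The names `RANK` and `link_sort_key` are hypothetical.

```python
# Assumed origin prefixes, mirroring the string literals in compare_links().
RANK = {"S2T": 0, "BLINKER": 1, "CLASSIFIER": 2}


def link_sort_key(origin):
    """Return a sort key where smaller keys mean more trusted links."""
    for prefix, rank in RANK.items():
        if origin.startswith(prefix):
            confidence = 0.0
            if prefix == "CLASSIFIER":
                # 'CLASSIFIER-0.9300' -> 0.93
                confidence = float(origin[len(prefix) + 1:])
            # Negate so that a higher confidence sorts earlier.
            return (rank, -confidence)
    # Unknown origins sort last; every input gets a well-defined key.
    return (len(RANK), 0.0)


origins = ["CLASSIFIER-0.6100", "BLINKER", "CLASSIFIER-0.9300", "S2T", "LINK_MERGER"]
ordered = sorted(origins, key=link_sort_key)
```

With tags instead of strings, the call would be along the lines of `sorted(tlinks, key=lambda tag: link_sort_key(tag.attrs[ORIGIN]))`. A key function also gives every pair of links a defined order, including origins that match none of the three prefixes.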
20 changes: 10 additions & 10 deletions components/s2t/main.py
@@ -30,28 +30,28 @@ def process_doctree(self, doctree):
"""Apply all S2T rules to doctree."""
self.doctree = doctree
# For sanity we clean out the tlinks since we are adding new tlinks to
# the document.
# the document; if we don't do this we might add some links twice.
self.doctree.tlinks = []
self.docelement = self.doctree.docelement
events = self.doctree.tarsqidoc.tags.find_tags(LIBRARY.timeml.EVENT)
eventsIdx = dict([(e.attrs['eiid'], e) for e in events])
for slinktag in self.doctree.slinks:
slink = Slink(self.doctree, eventsIdx, slinktag)
slink.match_rules(self.rules)
try:
slink.match_rules(self.rules)
except:
logger.error("S2T Error when processing Slink instance")
self._add_links_to_docelement()
self._add_links_to_tarsqidoc()

def _add_links_to_docelement(self):
def _add_links_to_tarsqidoc(self):
"""Export the links from the TarsqiTree to the TagRepository instance on
the TarsqiDocument. We do this because the match code inserts into the
tarsqi tree, but we may want to revisit this and do it the same way as
Blinker, which adds directly to the TarsqiDocument."""
for tlink in self.doctree.tlinks:
self._add_link(LIBRARY.timeml.TLINK, tlink.attrs)

def _add_link(self, tagname, attrs):
"""Add the link to the TagRepository instance on the TarsqiDocument."""
logger.debug("Adding %s: %s" % (tagname, attrs))
self.doctree.tarsqidoc.tags.add_tag(tagname, -1, -1, attrs)
tagname = LIBRARY.timeml.TLINK
logger.debug("Adding %s: %s" % (tagname, tlink.attrs))
self.doctree.tarsqidoc.tags.add_tag(tagname, -1, -1, tlink.attrs)


class Slink:
Expand Down
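The new `try`/`except` around `slink.match_rules` lets S2T continue when a single Slink fails. A bare `except:` also catches `KeyboardInterrupt` and `SystemExit`, and the logged message does not say which Slink failed or why. The sketch below shows a narrower variant of the same pattern using only the standard `logging` module. The function `apply_to_all` and the logger name are hypothetical and are not part of TTK.

```python
import logging

logger = logging.getLogger("s2t_sketch")  # hypothetical logger name


def apply_to_all(items, action):
    """Run action on every item, log failures, and return how many failed."""
    failures = 0
    for item in items:
        try:
            action(item)
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("Error when processing %r", item)
            failures += 1
    return failures


processed = []


def record(item):
    """Toy action that fails on one particular item."""
    if item == "bad":
        raise ValueError("cannot process this item")
    processed.append(item)


failed = apply_to_all(["a", "bad", "b"], record)
```

Catching `Exception` keeps the loop going on ordinary errors while still letting an interrupt from the user stop the run.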
20 changes: 10 additions & 10 deletions config.sample.txt
@@ -1,15 +1,16 @@
# This is an example file with configuration options. You should make a copy of
# this file and name it settings.txt and make changes to the copy as needed.
# this file and name it config.txt and make changes to the copy as needed.

# Option can be changed here or in some cases on the command line when calling
# Options can be changed here or in some cases on the command line when calling
# the tarsqi.py script. Command line options will overwrite options specified
# here.


# The default pipeline, can be overridden with the --pipeline command line
# option
# option. LINK_MERGER is not included because it slows processing down
# considerably on large documents.

pipeline = PREPROCESSOR,GUTIME,EVITA,SLINKET,S2T,BLINKER,CLASSIFIER,LINK_MERGER
pipeline = PREPROCESSOR,GUTIME,EVITA,SLINKET,S2T,BLINKER,CLASSIFIER


# Location of perl. Change this into an absolute path if perl cannot be accessed
@@ -20,16 +21,15 @@ perl = perl

# Location of the IMS TreeTagger, can be overridden with the --treetagger
# command line option. The default is the directory where the example build
# script build/install-treetagger-osx.sh puts the TreeTagger.
# script build/install-treetagger-osx.sh installs the TreeTagger.

treetagger = build/treetagger


# Location of Mallet, this should be the directory that contains the bin
# directory. This option can be overridden by the --mallet command line
# option. The default is the directory where the example build script
# build/install-mallet-osx.sh puts Mallet.

# build/install-mallet-osx.sh installs Mallet.

mallet = build/mallet/mallet-2.0.8

@@ -66,9 +66,9 @@ loglevel = 3
trap-errors = True


# User settings. Arbitray parameters can be added here and will then be
# accessible to user-specific code. Below is just one parameter that is used by
# example code that finds the DCT in a database (see MetadataParserDB in
# User configuration settings. Arbitrary parameters can be added here and will
# then be accessible to user-specific code. Below is just one parameter that is
# used by example code that finds the DCT in a database (see MetadataParserDB in
# docmodel.metadata_parser). These parameters cannot be overruled by command
# line options.

Expand Down
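The sample file uses `name = value` lines with `#` comments, and states that command line options overwrite the values given in the file. The sketch below illustrates that precedence on a fragment of the sample configuration. It is not TTK's actual option handling: `parse_config` and `merge_options` are hypothetical helpers, and the user settings that cannot be overruled from the command line are not modelled.

```python
def parse_config(text):
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, separator, value = line.partition("=")
        if separator:
            options[name.strip()] = value.strip()
    return options


def merge_options(config_options, cli_options):
    """Return a new dictionary in which command line values win."""
    merged = dict(config_options)
    merged.update(cli_options)
    return merged


sample = """
# The default pipeline, can be overridden with the --pipeline option.
pipeline = PREPROCESSOR,GUTIME,EVITA,SLINKET,S2T,BLINKER,CLASSIFIER
perl = perl
loglevel = 3
"""

options = merge_options(parse_config(sample), {"loglevel": "4"})
pipeline = options["pipeline"].split(",")
```

Running the merger on a particular document would then be a matter of passing a pipeline that ends in `LINK_MERGER` on the command line, which takes precedence over the shorter default in the file.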