XML format
Hiroshi Noji edited this page Aug 1, 2016 · 2 revisions
Unlike StanfordCoreNLP, Jigg's XML encodes tag-specific information as attributes rather than as child elements.
For example, the following <token> in StanfordCoreNLP
<token id="1">
<word>Stanford</word>
<lemma>Stanford</lemma>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>8</CharacterOffsetEnd>
</token>
is represented in Jigg as
<token id="s0_1" form="Stanford" lemma="Stanford" CharacterOffsetBegin="0" CharacterOffsetEnd="8"/>
The main characteristics of Jigg's format are:
- Each element (e.g., token) has a unique id (e.g., s0_1) in the XML. In StanfordCoreNLP, these ids are not unique.
- Some information (e.g., the surface form) is stored under a different name (e.g., form rather than word).
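
The two differences above can be illustrated with a minimal Python sketch that flattens the StanfordCoreNLP-style token from the example into the attribute form. This is not part of Jigg itself: the helper name to_jigg_token, the RENAMED table, and the id scheme (a sentence id such as s0 joined to the token's own id with an underscore) are assumptions inferred from the single example on this page.

```python
import xml.etree.ElementTree as ET

# The StanfordCoreNLP-style token from the example above.
CORENLP_TOKEN = """<token id="1">
  <word>Stanford</word>
  <lemma>Stanford</lemma>
  <CharacterOffsetBegin>0</CharacterOffsetBegin>
  <CharacterOffsetEnd>8</CharacterOffsetEnd>
</token>"""

# Child tags that get a different name as attributes.
# Only word -> form is stated on this page; other renames are not covered here.
RENAMED = {"word": "form"}


def to_jigg_token(corenlp_token, sentence_id):
    """Hypothetical helper: turn child elements into attributes.

    Assumes the id is built as "<sentence id>_<token id>", which matches
    the s0_1 example but is not confirmed as a general rule.
    """
    attrs = {"id": f"{sentence_id}_{corenlp_token.get('id')}"}
    for child in corenlp_token:
        name = RENAMED.get(child.tag, child.tag)
        attrs[name] = (child.text or "").strip()
    return ET.Element("token", attrs)


jigg_token = to_jigg_token(ET.fromstring(CORENLP_TOKEN), "s0")
print(ET.tostring(jigg_token, encoding="unicode"))
```

Prefixing the sentence id is what keeps token ids unique across the whole document in this sketch, since the per-sentence token ids alone would repeat.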