This is a collection of classes for working with CoNLL-X and CoNLL-U treebanks.
The opennlp.ext.conll.treebank
package contains the data model classes for
CoNLL based treebanks. It provides a unified model between the data in
CoNLL-X and CoNLL-U files.
The opennlp.ext.conll.stream.sentence
package contains parsers for the
different treebank formats that provide OpenNLP ObjectStream
views over the
treebank sentences.
ConllxSentenceStream
is a CoNLL-X file parser.ConlluSentenceStream
is a CoNLL-U file parser.SentenceStream
is a helper class for constructing a given sentence stream in the specified format from a variety of sources.
The following properties are supported by the sentence streams:
Parameter | Type | Values | Default |
---|---|---|---|
upos.tagset |
enumeration |
upos , upenn |
upos |
xpos.tagset |
enumeration |
upos , upenn |
upenn |
The opennlp.ext.conll.stream.sample.SentenceSampleStream
class creates
SentenceSample
values from treebank sentences that can be used to train
a SentenceModel
.
The following properties are supported by the SentenceSampleStream
class:
Parameter | Type | Values | Default |
---|---|---|---|
sentences.per.sample |
integer |
5 |
The opennlp.ext.conll.stream.sample.TokenSampleStream
class creates
TokenSample
values from treebank sentences that can be used to train
a TokenizerModel
.
The following properties are supported by the TokenSampleStream
class:
Parameter | Type | Values | Default |
---|---|---|---|
multi.token.words |
enumeration |
split , join |
split |
The opennlp.ext.conll.stream.sample.POSSampleStream
class creates POSSample
values from treebank sentences that can be used to train a POSModel
.
The following properties are supported by the POSSampleStream
class:
Parameter | Type | Values | Default |
---|---|---|---|
multi.token.words |
enumeration |
split , join |
split |
pos.tagset |
enumeration |
See pos.tagset. | UPOS |
The opennlp.ext.conll.stream.sample.LemmaSampleStream
class creates LemmaSample
values from treebank sentences that can be used to train a LemmaModel
.
The following properties are supported by the LemmaSampleStream
class:
Parameter | Type | Values | Default |
---|---|---|---|
multi.token.words |
enumeration |
split , join |
split |
pos.tagset |
enumeration |
See pos.tagset. | UPOS |
The pos.tagset
property can be one of the following enumeration values:
UPOS
, orCPOSTAG
-- The universal/coarse part of speech tag.XPOS
, orPOSTAG
-- The language-specific part of speech tag.UPOS-PTB
, orCPOSTAG-PTB
-- The universal part of speech tag followed by the PTB (Penn TreeBank) part of speech tag separated by a hyphen, e.g.DET-DT
.