danieldk edited this page Oct 30, 2010 · 1 revision

The language model and lexicon can be created with the train utility:

$ ./citar-train corpus_train lexicon ngrams

This creates the lexicon and ngrams files. The trainer reads corpora in the Brown format: one sentence per line, with each word and its tag separated by a forward slash. You can then test the tagger with the command-line tag utility, which reads tokenized sentences from standard input and prints the most probable tag sequence:

$ echo "The cat is on the mat ." | ./tag lexicon ngrams
The/AT cat/NN is/BEZ on/IN the/AT mat/NN ./.
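
The word/tag format shown above is easy to process in a script, for instance to inspect a training corpus or to post-process tagger output. The following is a minimal Python sketch, not part of Citar; the function name parse_brown_line is hypothetical, and splitting on the last slash is an assumption made here so that words containing a slash keep their tag. Citar's own reader may handle such cases differently.

```python
def parse_brown_line(line):
    """Split one word/tag sentence into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Assumption: the tag follows the LAST slash, so a word that
        # itself contains '/' is kept intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    sentence = "The/AT cat/NN is/BEZ on/IN the/AT mat/NN ./."
    for word, tag in parse_brown_line(sentence):
        print(word, tag)
```

Run on the tagger output above, this prints one word and its tag per line. A token without any slash raises a ValueError, which is one way to catch malformed lines before training.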