NLTK Trainer
------------

NLTK Trainer exists to make training and evaluating NLTK objects as easy as possible.

Requirements
------------

You must have Python 2.6 with `argparse <http://pypi.python.org/pypi/argparse/>`_ and `NLTK <http://www.nltk.org/>`_ 2.0 installed. `NumPy <http://numpy.scipy.org/>`_, `SciPy <http://www.scipy.org/>`_, and `megam <http://www.cs.utah.edu/~hal/megam/>`_ are recommended for training Maxent classifiers.

Training Classifiers
--------------------

Example usage with the movie_reviews corpus can be found in `Training Binary Text Classifiers with NLTK Trainer <http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/>`_.

Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances::

    python train_classifier.py --instances paras --classifier NaiveBayes movie_reviews

Include bigrams as features::

    python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 movie_reviews

Minimum score threshold::

    python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3 movie_reviews

Maximum number of features::

    python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000 movie_reviews

Use the default Maxent algorithm::

    python train_classifier.py --instances paras --classifier Maxent movie_reviews

Use the MEGAM Maxent algorithm::

    python train_classifier.py --instances paras --classifier MEGAM movie_reviews

Train on files instead of paragraphs::

    python train_classifier.py --instances files --classifier MEGAM movie_reviews

Train on sentences::

    python train_classifier.py --instances sents --classifier MEGAM movie_reviews

Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling::

    python train_classifier.py --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle movie_reviews

For a complete list of usage options::

    python train_classifier.py --help

Using a Trained Classifier
--------------------------

You can use a trained classifier by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::

    >>> import nltk.data
    >>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")

Or if your classifier pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_::

    >>> import pickle
    >>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle"))

Either method will return an object that supports the `ClassifierI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.api.ClassifierI-class.html>`_.
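Since the returned object supports ``ClassifierI``, a quick sanity check is to call its ``labels()`` method to see which labels it was trained with; the exact labels depend on the categories of the training corpus::

    >>> classifier.labels()   # e.g. ['neg', 'pos'] for a movie_reviews classifier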
Once you have a ``classifier`` object, you can use it to classify word features with the ``classifier.classify(feats)`` method, which returns a label::

    >>> words = ['some', 'words', 'in', 'a', 'sentence']
    >>> feats = dict([(word, True) for word in words])
    >>> classifier.classify(feats)

If you used the ``--ngrams`` option with values greater than 1, you should include these ngrams in the dictionary using `nltk.util.ngrams(words, n) <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#ngrams>`_. For example, with ``--ngrams 2``::

    >>> from nltk.util import ngrams
    >>> words = ['some', 'words', 'in', 'a', 'sentence']
    >>> feats = dict([(word, True) for word in words + ngrams(words, 2)])
    >>> classifier.classify(feats)

The list of words you use for creating the feature dictionary should be created by `tokenizing <http://text-processing.com/demo/tokenize/>`_ the appropriate text instances: sentences, paragraphs, or files, depending on the ``--instances`` option.

Training Part of Speech Taggers
-------------------------------

The ``train_tagger.py`` script can use any corpus included with NLTK that implements a ``tagged_sents()`` method. It can also train on the ``timit`` corpus, which includes tagged sentences that are not available through the ``TimitCorpusReader``. Example usage can be found in `Training Part of Speech Taggers with NLTK Trainer <http://streamhacker.com/2011/03/21/training-part-speech-taggers-nltk-trainer/>`_.

Train the default sequential backoff tagger on the treebank corpus::

    python train_tagger.py treebank

To use a brill tagger with the default initial tagger::

    python train_tagger.py treebank --brill

To train a NaiveBayes classifier based tagger, without a sequential backoff tagger::

    python train_tagger.py treebank --sequential '' --classifier NaiveBayes

To train a unigram tagger::

    python train_tagger.py treebank --sequential u

To train on the switchboard corpus::

    python train_tagger.py switchboard

To train on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::

    python train_tagger.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

You can also restrict the files used with the ``--fileids`` option::

    python train_tagger.py conll2000 --fileids train.txt

For a complete list of usage options::

    python train_tagger.py --help

Using a Trained Tagger
----------------------

You can use a trained tagger by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::

    >>> import nltk.data
    >>> tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")

Or if your tagger pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_::

    >>> import pickle
    >>> tagger = pickle.load(open("/path/to/NAME_OF_TAGGER.pickle"))

Either method will return an object that supports the `TaggerI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.api.TaggerI-class.html>`_.
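Note that a tagger expects a list of tokens rather than a raw string, so if you are starting from plain text you should tokenize it first. A minimal sketch, assuming ``nltk.word_tokenize`` as the tokenizer (any NLTK tokenizer will do)::

    >>> from nltk import word_tokenize
    >>> words = word_tokenize("This is a test sentence.")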
Once you have a ``tagger`` object, you can use it to tag sentences (or lists of words) with the ``tagger.tag(words)`` method::

    >>> tagger.tag(['some', 'words', 'in', 'a', 'sentence'])

``tagger.tag(words)`` will return a list of 2-tuples of the form ``[(word, tag)]``.

Analyzing Tagger Coverage
-------------------------

The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger on a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.

Here's an example using the NLTK default tagger on the treebank corpus::

    python analyze_tagger_coverage.py treebank

To get detailed metrics on each tag, you can use the ``--metrics`` option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See `NLTK Default Tagger Treebank Tag Coverage <http://streamhacker.com/2011/01/24/nltk-default-tagger-treebank-tag-coverage/>`_ and `NLTK Default Tagger CoNLL2000 Tag Coverage <http://streamhacker.com/2011/01/25/nltk-default-tagger-conll2000-tag-coverage/>`_ for examples and statistics.

To analyze the coverage of a different tagger, use the ``--tagger`` option with a path to the pickled tagger::

    python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle

To analyze coverage on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::

    python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

For a complete list of usage options::

    python analyze_tagger_coverage.py --help

Analyzing a Tagged Corpus
-------------------------

The ``analyze_tagged_corpus.py`` script will show the following statistics about a tagged corpus:

* total number of words
* number of unique words
* number of tags
* number of times each tag occurs

Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.

To analyze the treebank corpus::

    python analyze_tagged_corpus.py treebank

To sort the output by tag count from highest to lowest::

    python analyze_tagged_corpus.py treebank --sort count --reverse

To see simplified tags, instead of standard tags::

    python analyze_tagged_corpus.py treebank --simplify_tags

To analyze a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::

    python analyze_tagged_corpus.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

For a complete list of usage options::

    python analyze_tagged_corpus.py --help

Training IOB Chunkers
---------------------

The ``train_chunker.py`` script can use any corpus included with NLTK that implements a ``chunked_sents()`` method.
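If you are not sure whether a corpus qualifies, you can check for ``chunked_sents()`` from the Python interpreter; a minimal sketch using the ``treebank_chunk`` corpus (which must be downloaded into ``nltk_data``)::

    >>> from nltk.corpus import treebank_chunk
    >>> treebank_chunk.chunked_sents()[0]   # a Tree of chunked, tagged words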
Train the default sequential backoff tagger based chunker on the treebank_chunk corpus::

    python train_chunker.py treebank_chunk

To train a NaiveBayes classifier based chunker::

    python train_chunker.py treebank_chunk --classifier NaiveBayes

To train on the conll2000 corpus::

    python train_chunker.py conll2000

To train on a custom corpus, whose fileids end in ".pos", using a `ChunkedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.chunked.ChunkedCorpusReader-class.html>`_::

    python train_chunker.py /path/to/corpus --reader nltk.corpus.reader.chunked.ChunkedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

You can also restrict the files used with the ``--fileids`` option::

    python train_chunker.py conll2000 --fileids train.txt

For a complete list of usage options::

    python train_chunker.py --help
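As with classifiers and taggers, the trained chunker is saved as a pickle file that can be loaded with ``nltk.data.load`` or ``pickle.load``. A minimal usage sketch, assuming the pickle sits under a ``chunkers`` subdirectory of ``nltk_data`` (the file name is a placeholder; adjust the path to wherever your pickle actually lives) and that the object follows NLTK's ``ChunkParserI`` interface, whose ``parse()`` method takes a POS-tagged sentence and returns a chunk ``Tree``::

    >>> import nltk.data
    >>> chunker = nltk.data.load("chunkers/NAME_OF_CHUNKER.pickle")
    >>> chunker.parse([('the', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')])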