NLTK Trainer
------------

NLTK Trainer exists to make training and evaluating NLTK objects as easy as possible.

Requirements
------------

You must have Python 2.6 with `argparse <http://pypi.python.org/pypi/argparse/>`_ and `NLTK <http://www.nltk.org/>`_ 2.0 installed. `NumPy <http://numpy.scipy.org/>`_, `SciPy <http://www.scipy.org/>`_, and `megam <http://www.cs.utah.edu/~hal/megam/>`_ are recommended for training Maxent classifiers.

Training Classifiers
--------------------

Example usage with the movie_reviews corpus can be found in `Training Binary Text Classifiers with NLTK Trainer <http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/>`_.

Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances::

    python train_classifier.py --instances paras --classifier NaiveBayes movie_reviews

Include bigrams as features::

    python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 movie_reviews

Minimum score threshold::

    python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3 movie_reviews

Maximum number of features::

    python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000 movie_reviews

Use the default Maxent algorithm::

    python train_classifier.py --instances paras --classifier Maxent movie_reviews

Use the MEGAM Maxent algorithm::

    python train_classifier.py --instances paras --classifier MEGAM movie_reviews

Train on files instead of paragraphs::

    python train_classifier.py --instances files --classifier MEGAM movie_reviews

Train on sentences::

    python train_classifier.py --instances sents --classifier MEGAM movie_reviews

Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling::

    python train_classifier.py --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle movie_reviews

For a complete list of usage options::

    python train_classifier.py --help

Using a Trained Classifier
--------------------------

You can use a trained classifier by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::

    >>> import nltk.data
    >>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")

Or if your classifier pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_::

    >>> import pickle
    >>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle"))

Either method will return an object that supports the `ClassifierI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.api.ClassifierI-class.html>`_.
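Since the returned object supports ``ClassifierI``, a quick sanity check is to call its ``labels()`` method to see which labels it was trained with; the exact labels depend on the categories of the training corpus::

    >>> classifier.labels()   # e.g. ['neg', 'pos'] for a movie_reviews classifier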
Once you have a ``classifier`` object, you can use it to classify word features with the ``classifier.classify(feats)`` method, which returns a label::

    >>> words = ['some', 'words', 'in', 'a', 'sentence']
    >>> feats = dict([(word, True) for word in words])
    >>> classifier.classify(feats)

If you used the ``--ngrams`` option with values greater than 1, you should include these ngrams in the dictionary using `nltk.util.ngrams(words, n) <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#ngrams>`_. For example, with ``--ngrams 2``::

    >>> from nltk.util import ngrams
    >>> words = ['some', 'words', 'in', 'a', 'sentence']
    >>> feats = dict([(word, True) for word in words + ngrams(words, 2)])
    >>> classifier.classify(feats)

The list of words you use for creating the feature dictionary should be created by `tokenizing <http://text-processing.com/demo/tokenize/>`_ the appropriate text instances: sentences, paragraphs, or files, depending on the ``--instances`` option.

Training Part of Speech Taggers
-------------------------------

The ``train_tagger.py`` script can use any corpus included with NLTK that implements a ``tagged_sents()`` method. It can also train on the ``timit`` corpus, which includes tagged sentences that are not available through the ``TimitCorpusReader``. Example usage can be found in `Training Part of Speech Taggers with NLTK Trainer <http://streamhacker.com/2011/03/21/training-part-speech-taggers-nltk-trainer/>`_.

Train the default sequential backoff tagger on the treebank corpus::

    python train_tagger.py treebank

To use a brill tagger with the default initial tagger::

    python train_tagger.py treebank --brill

To train a NaiveBayes classifier based tagger, without a sequential backoff tagger::

    python train_tagger.py treebank --sequential '' --classifier NaiveBayes

To train a unigram tagger::

    python train_tagger.py treebank --sequential u

To train on the switchboard corpus::

    python train_tagger.py switchboard

To train on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::

    python train_tagger.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

You can also restrict the files used with the ``--fileids`` option::

    python train_tagger.py conll2000 --fileids train.txt

For a complete list of usage options::

    python train_tagger.py --help

Using a Trained Tagger
----------------------

You can use a trained tagger by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::

    >>> import nltk.data
    >>> tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")

Or if your tagger pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_::

    >>> import pickle
    >>> tagger = pickle.load(open("/path/to/NAME_OF_TAGGER.pickle"))

Either method will return an object that supports the `TaggerI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.api.TaggerI-class.html>`_.
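Note that a tagger expects a list of tokens rather than a raw string, so if you are starting from plain text you should tokenize it first. A minimal sketch, assuming ``nltk.word_tokenize`` as the tokenizer (any NLTK tokenizer will do)::

    >>> from nltk import word_tokenize
    >>> words = word_tokenize("This is a test sentence.")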
Once you have a ``tagger`` object, you can use it to tag sentences (or lists of words) with the ``tagger.tag(words)`` method::

    >>> tagger.tag(['some', 'words', 'in', 'a', 'sentence'])

``tagger.tag(words)`` will return a list of 2-tuples of the form ``[(word, tag)]``.

Analyzing Tagger Coverage
-------------------------

The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger on a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.

Here's an example using the NLTK default tagger on the treebank corpus::

    python analyze_tagger_coverage.py treebank

To get detailed metrics on each tag, you can use the ``--metrics`` option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See `NLTK Default Tagger Treebank Tag Coverage <http://streamhacker.com/2011/01/24/nltk-default-tagger-treebank-tag-coverage/>`_ and `NLTK Default Tagger CoNLL2000 Tag Coverage <http://streamhacker.com/2011/01/25/nltk-default-tagger-conll2000-tag-coverage/>`_ for examples and statistics.

To analyze the coverage of a different tagger, use the ``--tagger`` option with a path to the pickled tagger::

    python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle

To analyze coverage on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::

    python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

For a complete list of usage options::

    python analyze_tagger_coverage.py --help

Analyzing a Tagged Corpus
-------------------------

The ``analyze_tagged_corpus.py`` script will show the following statistics about a tagged corpus:

* total number of words
* number of unique words
* number of tags
* number of times each tag occurs

Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.

To analyze the treebank corpus::

    python analyze_tagged_corpus.py treebank

To sort the output by tag count from highest to lowest::

    python analyze_tagged_corpus.py treebank --sort count --reverse

To see simplified tags, instead of standard tags::

    python analyze_tagged_corpus.py treebank --simplify_tags

To analyze a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::

    python analyze_tagged_corpus.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

For a complete list of usage options::

    python analyze_tagged_corpus.py --help

Training IOB Chunkers
---------------------

The ``train_chunker.py`` script can use any corpus included with NLTK that implements a ``chunked_sents()`` method.
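If you are not sure whether a corpus qualifies, you can check for ``chunked_sents()`` from the Python interpreter; a minimal sketch using the ``treebank_chunk`` corpus (which must be downloaded into ``nltk_data``)::

    >>> from nltk.corpus import treebank_chunk
    >>> treebank_chunk.chunked_sents()[0]   # a Tree of chunked, tagged words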
Train the default sequential backoff tagger based chunker on the treebank_chunk corpus::

    python train_chunker.py treebank_chunk

To train a NaiveBayes classifier based chunker::

    python train_chunker.py treebank_chunk --classifier NaiveBayes

To train on the conll2000 corpus::

    python train_chunker.py conll2000

To train on a custom corpus, whose fileids end in ".pos", using a `ChunkedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.chunked.ChunkedCorpusReader-class.html>`_::

    python train_chunker.py /path/to/corpus --reader nltk.corpus.reader.chunked.ChunkedCorpusReader --fileids '.+\.pos'

The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.

You can also restrict the files used with the ``--fileids`` option::

    python train_chunker.py conll2000 --fileids train.txt

For a complete list of usage options::

    python train_chunker.py --help
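As with classifiers and taggers, the trained chunker is saved as a pickle file that can be loaded with ``nltk.data.load`` or ``pickle.load``. A minimal usage sketch, assuming the pickle sits under a ``chunkers`` subdirectory of ``nltk_data`` (the file name is a placeholder; adjust the path to wherever your pickle actually lives) and that the object follows NLTK's ``ChunkParserI`` interface, whose ``parse()`` method takes a POS-tagged sentence and returns a chunk ``Tree``::

    >>> import nltk.data
    >>> chunker = nltk.data.load("chunkers/NAME_OF_CHUNKER.pickle")
    >>> chunker.parse([('the', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')])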