A project for CS585 - Introduction to Natural Language Processing
Instructor: Brendan T. O'Connor
Trains a Naive Bayes classifier to predict the sentiment of a movie review (positive or negative). The assignment code has been cleaned up and streamlined to make it easier to read and use. As a result, the complete solution to the assignment is not included here, only the parts I judged most relevant to share.
tokenize_doc
train
report_statistics_after_training
__init__: Added feature_extractor member that defaults to tokenize_doc
tokenize_and_update_model: Switched to use feature_extractor member rather than tokenize_doc
tokenize_doc_stopwords
tokenize_doc_stopwords_custom
tokenize_doc_stopwords_and_stemming
update_model
p_word_given_label
log_likelihood
p_word_given_label_and_psuedocount
log_likelihood
log_prior
unnormalized_log_posterior
classify
likelihood_ratio
evaluate_classifier_accuracy
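To show how the pieces named above could fit together, here is a minimal sketch of a bag-of-words Naive Bayes model with a pluggable feature extractor and add-pseudocount smoothing. It is not the code from nb_sentiment_classify.py: the class name TinyNaiveBayes, the helper simple_tokenize, and all method signatures are assumptions made for illustration, and the real implementation may differ.

```python
import math
from collections import Counter, defaultdict


def simple_tokenize(doc):
    """Hypothetical default extractor: lowercase, split on whitespace, count words."""
    return Counter(doc.lower().split())


class TinyNaiveBayes:
    """Illustrative stand-in, not the NaiveBayes class from nb_sentiment_classify.py."""

    def __init__(self, feature_extractor=simple_tokenize):
        # Any callable mapping a document string to a {word: count} mapping works.
        self.feature_extractor = feature_extractor
        self.vocab = set()
        self.doc_counts = Counter()              # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.total_words = Counter()             # label -> total word tokens

    def update_model(self, doc, label):
        """Add one labeled document to the counts."""
        bow = self.feature_extractor(doc)
        self.doc_counts[label] += 1
        for word, count in bow.items():
            self.vocab.add(word)
            self.word_counts[label][word] += count
            self.total_words[label] += count

    def p_word_given_label(self, word, label, pseudocount):
        """Smoothed P(word | label); pseudocount must be > 0 to avoid log(0) later."""
        numerator = self.word_counts[label][word] + pseudocount
        denominator = self.total_words[label] + pseudocount * len(self.vocab)
        return numerator / denominator

    def log_likelihood(self, bow, label, pseudocount):
        """Sum of count * log P(word | label) over the words in the document."""
        return sum(
            count * math.log(self.p_word_given_label(word, label, pseudocount))
            for word, count in bow.items()
        )

    def log_prior(self, label):
        """Log of the fraction of training documents that carry this label."""
        return math.log(self.doc_counts[label] / sum(self.doc_counts.values()))

    def classify(self, doc, pseudocount):
        """Return the label with the highest unnormalized log posterior."""
        bow = self.feature_extractor(doc)
        scores = {
            label: self.log_prior(label) + self.log_likelihood(bow, label, pseudocount)
            for label in self.doc_counts
        }
        return max(scores, key=scores.get)


# Toy training data, invented for this example only.
nb = TinyNaiveBayes()
for text, label in [
    ("great fun film", "pos"),
    ("wonderful great acting", "pos"),
    ("boring awful film", "neg"),
    ("awful dull plot", "neg"),
]:
    nb.update_model(text, label)

print(nb.classify("great acting", pseudocount=1))
```

Passing a different callable as feature_extractor (for example, one that drops stopwords or stems tokens) changes the features without touching the counting or scoring logic, which is presumably the motivation for making it a member.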
To train a Naive Bayes classifier on the large_movie_review_dataset data using a feature extractor that stems words and removes both standard and custom stopwords:
python nb_sentiment_classify.py
This command trains the model with every pseudocount from 1 to 25 (inclusive), creates a graph of pseudocount vs. accuracy, and returns the best pseudocount along with its accuracy.
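The sweep described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the script's actual code: sweep_pseudocounts and toy_accuracy are hypothetical names, the accuracy curve is invented so the example runs without the dataset, and plotting is left out.

```python
def sweep_pseudocounts(evaluate, pseudocounts=range(1, 26)):
    """Score every pseudocount and return (best_pseudocount, best_accuracy, results).

    `evaluate` is any callable that maps a pseudocount to an accuracy.
    On ties, the smallest pseudocount wins because max() keeps the first maximum.
    """
    results = [(pc, evaluate(pc)) for pc in pseudocounts]
    best_pseudocount, best_accuracy = max(results, key=lambda pair: pair[1])
    return best_pseudocount, best_accuracy, results


def toy_accuracy(pseudocount):
    """Made-up accuracy curve that peaks at pseudocount 7 (illustration only)."""
    return 0.80 - 0.001 * (pseudocount - 7) ** 2


best_pc, best_acc, results = sweep_pseudocounts(toy_accuracy)
print(best_pc, best_acc)
```

With the real model, evaluate could be a trained classifier's evaluate_classifier_accuracy method, assuming that method returns the accuracy as a number. The results list holds the (pseudocount, accuracy) pairs needed to draw the graph.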
from nb_sentiment_classify import NaiveBayes
# Initialize model with default feature extractor
nb = NaiveBayes()
# Train model on large_movie_review_dataset
nb.train_model()
# Evaluate accuracy given a pseudocount (1 used in this example)
nb.evaluate_classifier_accuracy(1)