Yelp Rating Prediction Based on Text Review Sentiment Analysis
Yelp releases a dataset of business data, review text, user data, check-in data, tips, and photos. Using this data, we can take a random sample of reviews and train classifiers to determine the sentiment of each review, with the goal of predicting ratings from review text alone. This is done by cleaning the dataset and applying the bag-of-words model. Once the data is preprocessed, the models can be trained with standard Python libraries and evaluated on how well they determine the sentiment of each review. Sentiment analysis over a full set of reviews then allows us to gauge public opinion of a business and predict its rating.
- Access Yelp dataset JSON
- Isolate review JSON data
- Remove unnecessary JSON data to reduce file size
- Preprocess data
- Remove punctuation
- Remove spaces
- Remove minor words ("the", "a", "an", etc.)
- Remove low sentiment words
- Convert corpus to lowercase to avoid redundancy
- Convert corpus to bag-of-words vector format
- Convert reviews into a countable vector
- Split Yelp dataset into a smaller training dataset
- Train multiple models
- Multinomial Naive Bayes
- Linear SVM
- Linear SVC
- Logistic Regression
- Random Forest
- Artificial Neural Network
- Test and evaluate models
- Use review sentiment models to predict rating
- Python (Jupyter Notebook)
- CoreNLP
- Glob
- JSON Simple
- Keras
- Matplotlib
- NLTK
- Numpy
- OS
- Pandas
- Random
- Seaborn
- Scikit-Learn
- Scipy
- String
- Time
- WordCloud
- yelp_academic_dataset_review.json
In order to perform sentiment analysis on the Yelp dataset, it is important to first look at the dataset's structure and determine what methods need to be applied to clean it and enhance its utility. The Yelp review data has the structure shown below:
| business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|
| 9yKzy9PA... | 2011-01-26 | fWKvX8... | 5 | My wife took me... | review | rLtl... | 2 | 5 | 0 |
| ZRJwVLyz... | 2011-07-27 | IjZ33s... | 5 | I have no idea... | review | 0a2K... | 0 | 0 | 0 |
| 6oRAC4uy... | 2012-06-14 | IESLBz... | 4 | love the gyro... | review | 0hT2... | 0 | 1 | 0 |
For the most part the organization and utility are favorable; however, in order to perform some comparisons it is important to determine the length of the 'text' column and to preprocess that data. This allows me to take a bag-of-words approach to the problem at hand.
The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text (such as a review) is represented as the bag (multiset) of its words, disregarding grammar, stopwords, and even word order but keeping multiplicity.
As shown in the Yelp dataset's fifth column, 'text', the review text is presented as a single string of words. This presents an issue for sentiment analysis, since at this point there is no mechanism to determine whether a string of words is inherently positive or negative. To address this I took the bag-of-words approach, which requires each review string to be separated into a list of individual words.
My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.
['wife', 'took', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'made', 'sitting', 'outside', 'overlooking', 'grounds', 'absolute', 'pleasure']
Using my text_process function this is easily performed.
import string
from nltk.corpus import stopwords

def text_process(text, weak_sentiment_word_list):
    # instantiate word_list array
    word_list = []
    # parse characters of text and remove punctuation
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # parse word by word, convert to lowercase, and remove
    # stopwords and weak sentiment words
    for word in nopunc.split():
        word = word.lower()
        if word not in stopwords.words('english'):
            if word not in weak_sentiment_word_list:
                word_list.append(word)
    return word_list
The function takes a string and a list as arguments and parses the string character by character, removing any punctuation it encounters. It then parses the string word by word, removing any stopwords or weak sentiment words. Removing stopwords is important because they enhance readability but do not add to the sentiment of a phrase, so they waste processing time and can reduce the accuracy of future models. Likewise, removing words without strong sentiment attached to them avoids the same waste of processing time and loss of accuracy.
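As a quick check, running the sample review above through the function (with an empty weak-sentiment list, since that list is built in a later pass) reproduces the token list shown earlier:

```python
sample = ("My wife took me here on my birthday for breakfast and it was excellent. "
          "The weather was perfect which made sitting outside overlooking their "
          "grounds an absolute pleasure.")

# no weak sentiment words are known yet on the first pass
print(text_process(sample, []))
# ['wife', 'took', 'birthday', 'breakfast', 'excellent', 'weather', ...]
```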
def clean_dataset(yelp):
    # instantiate a weak_sentiment_list array
    weak_sentiment_list = []
    # retype text to string
    yelp['text'] = yelp['text'].astype(str)
    # create length and tokenized columns
    yelp['length'] = yelp['text'].apply(len)
    yelp['tokenized'] = yelp.apply(lambda row: text_process(row['text'],
                                                            weak_sentiment_list), axis=1)
    # generate a weak sentiment word list and apply it to the tokenized column
    weak_sentiment_list = generate_weak_sentiment_list(yelp)
    yelp['tokenized'] = yelp.apply(lambda row: text_process(row['text'],
                                                            weak_sentiment_list), axis=1)
    return yelp, weak_sentiment_list
This preprocessing takes place during the text cleaning cycle, performed by the clean_dataset function. It first prepares the text column by re-typing it as a string, and creates a new column, length, to store the length of each review text for analysis and comparison in the next step. Then text_process is called and the resulting vector of words is saved in another new column, 'tokenized'. After this, generate_weak_sentiment_list is called to build a list of words with high counts that appear in both the strongly positive and strongly negative cases. Once the weak sentiment list is created, text_process is called a second time to remove all remaining unnecessary words.
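The generate_weak_sentiment_list function is not reproduced here; below is a minimal sketch of the idea it implements, counting word frequencies in the 1-star and 5-star reviews and keeping the words that rank highly in both. The top_n cutoff is illustrative, not the value used in the project.

```python
from collections import Counter

def generate_weak_sentiment_list(yelp, top_n=250):
    # Sketch: words that rank highly in BOTH 1-star and 5-star reviews carry
    # little sentiment, so they are flagged for removal on the second pass.
    negative_counts = Counter(word for tokens in yelp[yelp['stars'] == 1]['tokenized']
                              for word in tokens)
    positive_counts = Counter(word for tokens in yelp[yelp['stars'] == 5]['tokenized']
                              for word in tokens)
    # keep only the words near the top of both rankings
    top_negative = {word for word, _ in negative_counts.most_common(top_n)}
    top_positive = {word for word, _ in positive_counts.most_common(top_n)}
    return list(top_negative & top_positive)
```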
This produces a clean and processed dataset, as shown below.
From the start of this project I intended to compare the 'stars' column to the 'tokenized' column. However, it is important to determine whether there are any other strong correlations between variables in the dataset. My initial approach was to relate the 'stars' column to the newly created 'length' column, mainly to show whether there is a relationship between the length of a review and its star rating.
I then wanted to determine if there was a clear bias in star count in the entire dataset.
From this histogram it is apparent that there is a well-defined bias towards the upper end of the scale; the vast majority of reviews are 4 or 5 stars. This tells us there is a strong likelihood that our models will have a bias towards positive sentiment over negative sentiment, which will have to be corrected.
import pandas as pd

def normalize_dataset(yelp):
    # instantiate yelp_normalized array
    yelp_normalized = []
    # create datasets separated by star rating
    yelp_1 = yelp[(yelp['stars'] == 1)]
    yelp_2 = yelp[(yelp['stars'] == 2)]
    yelp_3 = yelp[(yelp['stars'] == 3)]
    yelp_4 = yelp[(yelp['stars'] == 4)]
    yelp_5 = yelp[(yelp['stars'] == 5)]
    # determine the lowest count among the datasets
    limiting_factor = min([len(yelp_1), len(yelp_2), len(yelp_3), len(yelp_4), len(yelp_5)])
    # sample that count from each rating and concatenate into one dataset
    yelp_normalized.append(yelp_1.sample(limiting_factor))
    yelp_normalized.append(yelp_2.sample(limiting_factor))
    yelp_normalized.append(yelp_3.sample(limiting_factor))
    yelp_normalized.append(yelp_4.sample(limiting_factor))
    yelp_normalized.append(yelp_5.sample(limiting_factor))
    return pd.concat(yelp_normalized)
The method I chose to correct this bias is a form of normalization over the entire set of Yelp data, performed by the normalize_dataset function, which finds the star rating with the lowest count and then samples that many reviews from each rating. This guarantees an equal distribution of all star ratings, eliminating the bias towards high ratings.
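A quick sanity check (hypothetical snippet, not from the original notebook) confirms the distribution is flat after normalization:

```python
yelp = normalize_dataset(yelp)
# every star rating should now appear the same number of times
print(yelp['stars'].value_counts())
```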
After establishing from the graphs that there is likely no strong connection between review length and star rating, I decided to determine the correlations between all variables in the dataset. I first took the mean of the dataset grouped by stars in order to prepare it for correlation.
This shows that some of the variables in the dataset might be correlated with each other.
These correlations are more apparent in the pairwise plot below, which shows a strong linear relationship between the 'cool' and 'useful' columns, and to a lesser degree between the 'cool' and 'funny' columns and between the 'useful' and 'funny' columns. These are correlations I might want to look into in the future.
All three of these columns plausibly have some degree of correlation with each other, so the data matches intuition. The pairwise plot, however, does not show the correlation between the 'stars' column and the others well, as the 'stars' column holds discrete values rather than continuous points, so its correlation plots are striated.
Since I am taking a bag-of-words approach to the review text, I decided to generate two word clouds to give a graphical representation of each word's sentiment. I generated these by taking the bag of words for all 1-star reviews and all 5-star reviews, as representations of strongly negative and strongly positive reviews respectively. These graphs also indicate that there are a few non-stopwords that have very little effect on the sentiment of a review.
This analysis led me to perform a secondary cleaning to remove words with similar rank in both sets, producing much cleaner word clouds.
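The word clouds themselves can be generated with the WordCloud and Matplotlib libraries; below is a rough sketch, assuming each class's tokenized reviews are joined back into one large string. The plot_word_cloud helper is illustrative and not part of the original notebook.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_word_cloud(yelp, stars, title):
    # join every tokenized review of the given star rating into one string
    words = ' '.join(word for tokens in yelp[yelp['stars'] == stars]['tokenized']
                     for word in tokens)
    cloud = WordCloud(width=800, height=400, background_color='white').generate(words)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

plot_word_cloud(yelp, 1, '1-Star Reviews')
plot_word_cloud(yelp, 5, '5-Star Reviews')
```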
In order to train my models I first needed to prepare my data, so I duplicated my dataset into two separate classes, boundary and complete. This was performed by running the create_class function on the yelp dataset.
- Boundary: Dataset containing only the data of 1 star reviews and 5 star reviews
- Complete: Dataset containing the data of all reviews
Doing this allowed me to test and train all my models in both environments, showing definitively that, for sentiment analysis, it is more efficient and accurate to train on edge cases. This makes logical sense, as 1 and 5 star reviews are, ideally, written only by people with strongly positive or strongly negative opinions.
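The create_class function is not listed above; below is a minimal sketch of what it needs to produce, given the two classes defined here (its internals are assumed):

```python
def create_class(yelp):
    # boundary: only the strongly negative (1-star) and strongly positive (5-star) reviews
    yelp_boundary = yelp[(yelp['stars'] == 1) | (yelp['stars'] == 5)]
    # complete: every review, regardless of rating
    yelp_complete = yelp.copy()
    return yelp_boundary, yelp_complete
```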
Once I had two separate full datasets, I needed to split each into training and test sets. Before this I needed to generate X and y, the dataframes of the 'text' and 'stars' columns, using the generate_X_y function.
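generate_X_y simply pulls those two columns out of a dataset; a sketch of its assumed behavior:

```python
def generate_X_y(yelp):
    # X holds the review text, y holds the star rating used as the label
    return yelp['text'], yelp['stars']

X_complete, y_complete = generate_X_y(yelp_complete)
X_boundary, y_boundary = generate_X_y(yelp_boundary)
```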
With two classes of X dataframes this data then needed to be transformed by the bow_transformer function.
from sklearn.feature_extraction.text import CountVectorizer

def bow_transformer(X):
    # vectorize words in X with unigrams and bigrams and a feature ceiling of 450,000
    # the feature ceiling was set to extend the boundary case without conflicts
    bow_transformer = CountVectorizer(ngram_range=(1, 2), max_features=450000).fit(X)
    # transform the text into count vectors
    X = bow_transformer.transform(X)
    return X
This function takes a dataframe of review text and fits a count vectorizer to it. Once the vocabulary is built, the text is transformed by the fitted bow_transformer, returning the same data organized as count vectors on which models can be trained.
The final step of generating training and test sets is to run the new X and y through Scikit-Learn's train_test_split function. This generates X_complete_train, X_complete_test, y_complete_train, y_complete_test and X_boundary_train, X_boundary_test, y_boundary_train, y_boundary_test for each class respectively.
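For reference, the split looks roughly like this (the 30% test fraction and random_state are illustrative, not necessarily the values used in the project):

```python
from sklearn.model_selection import train_test_split

X_complete_train, X_complete_test, y_complete_train, y_complete_test = train_test_split(
    X_complete, y_complete, test_size=0.3, random_state=101)
X_boundary_train, X_boundary_test, y_boundary_train, y_boundary_test = train_test_split(
    X_boundary, y_boundary, test_size=0.3, random_state=101)
```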
Using Scikit-Learn's MultinomialNB class (from sklearn.naive_bayes), I generated this classifier and used its built-in functions to train the model.
# create and train multinomial naive bayes classifier
classifier_nb_complete = MultinomialNB()
classifier_nb_complete, prediction_nb_complete, time_train_nb_complete, time_predict_nb_complete, score_nb_complete, confusion_matrix_nb_complete = classifier_train(classifier_nb_complete, X_complete_train, y_complete_train, X_complete_test, y_complete_test)
# evaluate and plot the confusion matrix
classifier_string_nb_complete = 'Multinomial Naive Bayes Complete'
classifier_result(confusion_matrix_nb_complete, score_nb_complete, classifier_string_nb_complete)
# evaluate and plot the roc curve
fpr_nb_complete, tpr_nb_complete, thresholds_nb_complete, roc_auc_nb_complete = classifier_roc(prediction_nb_complete, y_complete_test, classifier_string_nb_complete)
Using classifier_train I was able to monitor the time it took to train and predict, so that I could include efficiency as a mechanism for determining the ideal classifier, while classifier_result and classifier_roc were used to generate the confusion matrix and ROC curve for this classifier.
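The classifier_train helper is not reproduced in this write-up; below is a condensed sketch of what it is described as doing (timing the fit and predict steps and collecting the evaluation artifacts), with internals assumed but the signature matching the six values unpacked above:

```python
import time
import numpy as np
from sklearn import metrics

def classifier_train(classifier, X_train, y_train, X_test, y_test):
    # time the training step
    t0 = time.time()
    classifier.fit(X_train, y_train)
    time_train = time.time() - t0
    # time the prediction step
    t0 = time.time()
    prediction = classifier.predict(X_test)
    time_predict = time.time() - t0
    # accuracy: fraction of predictions matching the true star ratings
    score = np.mean(prediction == y_test.values)
    confusion_matrix = metrics.confusion_matrix(y_test, prediction)
    return classifier, prediction, time_train, time_predict, score, confusion_matrix
```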
The implementation of this classifier is based on libsvm, according to Scikit-Learn's svm library. I used this library along with SVC and a linear kernel to generate this classifier, and then used its built-in functions to train the model.
# create and train linear svm classifier
classifier_ln_complete = svm.SVC(kernel='linear')
classifier_ln_complete, prediction_ln_complete, time_train_ln_complete, time_predict_ln_complete, score_ln_complete, confusion_matrix_ln_complete = classifier_train(classifier_ln_complete, X_complete_train, y_complete_train, X_complete_test, y_complete_test)
# evaluate and plot the confusion matrix
classifier_string_ln_complete = 'Linear SVM Complete'
classifier_result(confusion_matrix_ln_complete, score_ln_complete, classifier_string_ln_complete)
# evaluate and plot the roc curve
fpr_ln_complete, tpr_ln_complete, thresholds_ln_complete, roc_auc_ln_complete = classifier_roc(prediction_ln_complete, y_complete_test, classifier_string_ln_complete)
Using classifier_train I was able to monitor the time it took to train and predict, so that I could include efficiency as a mechanism for determining the ideal classifier, while classifier_result and classifier_roc were used to generate the confusion matrix and ROC curve for this classifier.
LinearSVC is similar to SVC with kernel='linear', but is implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. I used this library along with the LinearSVC class to generate this classifier, and then used its built-in functions to train the model.
# create and train linear svc classifier
classifier_lnlib_complete = svm.LinearSVC()
classifier_lnlib_complete, prediction_lnlib_complete, time_train_lnlib_complete, time_predict_lnlib_complete, score_lnlib_complete, confusion_matrix_lnlib_complete = classifier_train(classifier_lnlib_complete, X_complete_train, y_complete_train, X_complete_test, y_complete_test)
# evaluate and plot the confusion matrix
classifier_string_lnlib_complete = 'Linear SVC Complete'
classifier_result(confusion_matrix_lnlib_complete, score_lnlib_complete, classifier_string_lnlib_complete)
# evaluate and plot the roc curve
fpr_lnlib_complete, tpr_lnlib_complete, thresholds_lnlib_complete, roc_auc_lnlib_complete = classifier_roc(prediction_lnlib_complete, y_complete_test, classifier_string_lnlib_complete)
Using classifier_train I was able to monitor the time it took to train and predict, so that I could include efficiency as a mechanism for determining the ideal classifier, while classifier_result and classifier_roc were used to generate the confusion matrix and ROC curve for this classifier.
Using Scikit-Learn's LogisticRegression class, I generated this classifier and used its built-in functions to train the model.
# create and train logistic regression classifier
classifier_lr_complete = LogisticRegression()
classifier_lr_complete, prediction_lr_complete, time_train_lr_complete, time_predict_lr_complete, score_lr_complete, confusion_matrix_lr_complete = classifier_train(classifier_lr_complete, X_complete_train, y_complete_train, X_complete_test, y_complete_test)
# evaluate and plot the confusion matrix
classifier_string_lr_complete = 'Logistic Regression Complete'
classifier_result(confusion_matrix_lr_complete, score_lr_complete, classifier_string_lr_complete)
# evaluate and plot the roc curve
fpr_lr_complete, tpr_lr_complete, thresholds_lr_complete, roc_auc_lr_complete = classifier_roc(prediction_lr_complete, y_complete_test, classifier_string_lr_complete)
Using classifier_train I was able to monitor the time it took to train and predict, so that I could include efficiency as a mechanism for determining the ideal classifier, while classifier_result and classifier_roc were used to generate the confusion matrix and ROC curve for this classifier.
Using Scikit-Learn's RandomForestClassifier, I generated a random forest classifier to train the model. I chose 100 n_estimators as an arbitrary initial test condition; in the future I plan on performing a true test to optimize n_estimators, but for this run I simply selected a value to run the classifier. I also chose to enable bootstrap and oob_score: bootstrapping trains each tree on a random sample of the training data drawn with replacement, which helps decorrelate the trees and reduce overfitting, and the out-of-bag score estimate gives a baseline to compare the results against.
# create and train random forest classifier with 100 estimators and bootstrapping enabled
classifier_rfc_complete = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True, random_state=0)
classifier_rfc_complete, prediction_rfc_complete, time_train_rfc_complete, time_predict_rfc_complete, score_rfc_complete, confusion_matrix_rfc_complete = classifier_train(classifier_rfc_complete, X_complete_train, y_complete_train, X_complete_test, y_complete_test)
print('\n')
print('Out-of-bag score estimate: {0:.3f}'.format(classifier_rfc_complete.oob_score_))
# evaluate and plot the confusion matrix
classifier_string_rfc_complete = 'Random Forest Classifier Complete'
classifier_result(confusion_matrix_rfc_complete, score_rfc_complete, classifier_string_rfc_complete)
# evaluate and plot the roc curve
fpr_rfc_complete, tpr_rfc_complete, thresholds_rfc_complete, roc_auc_rfc_complete = classifier_roc(prediction_rfc_complete, y_complete_test, classifier_string_rfc_complete)
Using classifier_train I was able to monitor the time it took to train and predict, so that I could include efficiency as a mechanism for determining the ideal classifier, while classifier_result and classifier_roc were used to generate the confusion matrix and ROC curve for this classifier.
Using Keras's Sequential model along with its dense and activation layers, I was able to train this classifier. I chose to run only 2 epochs, since I was running locally and trying to keep training time down on my local machine. I also chose to design this neural network with 2 dense layers, 2 activation layers, and a dropout layer.
The layers were chosen this way because the first dense layer weights each word and is activated with a rectified linear unit (ReLU) function. ReLU has been shown to be computationally efficient, which is pivotal when dealing with large numbers of words. I then applied dropout with a rate of 0.5 to avoid overfitting. The second dense layer maps to the number of classes and is activated with a softmax function; softmax can represent a categorical distribution, which is exactly what we are training this model to produce.
# Set classes, input shape, batch size, and epochs
num_classes = 6  # star labels 1-5 are used directly as class indices, so index 0 is unused
max_words = X_complete_train.shape[1]
batch_size = 32
epochs = 2
# create classifier and define network layers
classifier_nn_complete = Sequential()
classifier_nn_complete.add(Dense(32, input_shape=(max_words,)))
classifier_nn_complete.add(Activation('relu'))
classifier_nn_complete.add(Dropout(0.5))
classifier_nn_complete.add(Dense(num_classes))
classifier_nn_complete.add(Activation('softmax'))
# compile neural network
classifier_nn_complete.compile(loss='sparse_categorical_crossentropy',
                               optimizer='adam',
                               metrics=['accuracy'])
# train and evaluate neural network
classifier_nn_complete, prediction_nn_complete, time_train_nn_complete, time_predict_nn_complete, score_nn_complete, accuracy_nn_complete, confusion_matrix_nn_complete = classifier_train_nn(classifier_nn_complete, X_complete_train, y_complete_train, X_complete_test, y_complete_test, batch_size, epochs)
# evaluate and plot confusion matrix
classifier_string_nn_complete = 'Neural Network Classifier Complete'
classifier_result(confusion_matrix_nn_complete, accuracy_nn_complete, classifier_string_nn_complete)
# evaluate and plot the roc curve
fpr_nn_complete, tpr_nn_complete, thresholds_nn_complete, roc_auc_nn_complete = classifier_roc(prediction_nn_complete, y_complete_test, classifier_string_nn_complete)
Using classifier_train_nn I was able to monitor the time it took to train and predict, so that I could include efficiency as a mechanism for determining the ideal classifier, while classifier_result and classifier_roc were used to generate the confusion matrix and ROC curve for this classifier.
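classifier_train_nn follows the same pattern as classifier_train but uses the Keras API; below is a rough sketch of its assumed shape, matching the seven values unpacked above (the internals are inferred, not the original implementation):

```python
import time
from sklearn import metrics

def classifier_train_nn(classifier, X_train, y_train, X_test, y_test, batch_size, epochs):
    # time the Keras training loop
    # (depending on the Keras version, sparse X may need to be densified first)
    t0 = time.time()
    classifier.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1)
    time_train = time.time() - t0
    # Keras separates class prediction from probability output
    t0 = time.time()
    prediction = classifier.predict_classes(X_test, verbose=1)
    time_predict = time.time() - t0
    # evaluate() returns the loss (score) and the accuracy metric
    score, accuracy = classifier.evaluate(X_test, y_test, batch_size=batch_size, verbose=1)
    confusion_matrix = metrics.confusion_matrix(y_test, prediction)
    return classifier, prediction, time_train, time_predict, score, accuracy, confusion_matrix
```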
After training the classifier in both the complete case and the boundary case, I began extending the boundary case to predict in the complete case. This was done by utilizing the classifier_evaluate function.
import time
import numpy as np
from sklearn import metrics
from sklearn.metrics import classification_report

def classifier_evaluate(classifier, X_test, y_test):
    # set start time and use the classifier to predict the test values
    t0 = time.time()
    # generate the predictions and the probabilities attached to the predictions
    prediction = classifier.predict(X_test)
    prediction_probability = classifier.predict_proba(X_test)
    # parse the predictions and find the difference between the two class probabilities
    for i in range(prediction.size):
        prediction_delta = prediction_probability[i][0] - prediction_probability[i][1]
        # based on the difference value set the star prediction
        if prediction_delta > 0.75:
            prediction[i] = 1
        elif prediction_delta > 0.25:
            prediction[i] = 2
        elif prediction_delta > -0.25:
            prediction[i] = 3
        elif prediction_delta > -0.75:
            prediction[i] = 4
        else:
            prediction[i] = 5
    # set completion time
    t1 = time.time()
    # determine the prediction time
    time_predict = t1 - t0
    # evaluate accuracy of the classifier based on the test case
    score = (prediction.size - np.count_nonzero(prediction - y_test.values)) / prediction.size
    print('Score: {0}'.format(score))
    print('\n')
    # evaluate the confusion matrix based on the predictions generated by the classifier
    confusion_matrix = metrics.confusion_matrix(y_test, prediction)
    print('Confusion Matrix: \n {0}'.format(confusion_matrix))
    print('\n')
    print('Prediction time: {0:.3f}s'.format(time_predict))
    print(classification_report(y_test, prediction))
    return classifier, prediction, time_predict, score, confusion_matrix
This function operates much like the classifier_train function; however, it takes into account the probability attached to each prediction. The predict_proba function returns an array of arrays, each containing a probability for each class. For the boundary case each probability array contains two values: the probability of the review being classified as a 1 and the probability of it being a 5.
# generate the predictions and the probabilities attached to the predictions
prediction = classifier.predict_classes(X_test, verbose=1)
prediction_probability = classifier.predict(X_test)
A similar function to classifier_train was created for the artificial neural network approach, classifier_train_nn. This is necessary because Keras does not operate exactly like sklearn; as the snippet above shows, class predictions and class probabilities come from separate calls.
Once the probabilities were found I computed the difference between the two. This value represents how close a prediction is to either extreme: the difference spans from -1 (a confident 5) up to 1 (a confident 1). I chose to break the classes up into equal intervals so as not to bias the results more than necessary.
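For example, with hypothetical probabilities for three boundary-case reviews (columns ordered as classifier.classes_, i.e. [1, 5]):

```python
import numpy as np

# invented predict_proba output for three reviews
prediction_probability = np.array([[0.97, 0.03],   # delta =  0.94 -> 1 star
                                   [0.55, 0.45],   # delta =  0.10 -> 3 stars
                                   [0.08, 0.92]])  # delta = -0.84 -> 5 stars
delta = prediction_probability[:, 0] - prediction_probability[:, 1]
```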
Once all the predictions were evaluated, the score (accuracy) had to be measured. This was done by counting the correct predictions and dividing by the total prediction count.
Between the complete and boundary models, it appears that the complete models have significantly lower accuracy along with longer training and prediction times compared to the boundary models. This holds true for every classifier thus far.
Analyzing the confusion matrix, it can be seen that the model is biased towards low and mid valued star ratings (1 and 3 stars). The model had an accuracy score of about 51.62%, which is not bad for the complete case but very low compared to the boundary case.
This model was able to train in 0.244 seconds and predict over its test set in 0.045 seconds. Both of these metrics are impressively fast, but speed matters less if a model cannot predict accurately.
The ROC curve and its corresponding AUC (area under the curve) show that this model has an AUC of 0.88, telling us it may be more accurate than previously expected. While the accuracy is sub-par, the model is predicting much better than random guessing (20% for five classes).
In accord with the confusion matrix, it is also not ideal that the only star ratings predicted at above 53% are the 1 and 3 star cases. This is a bias that may need to be explored further.
Analyzing the confusion matrix, it can be shown that the model is also slightly biased towards low valued star ratings, 1 star ratings.
With an accuracy score of 92.96%, this model appears to be on the lower end of the boundary accuracies. Despite this subpar accuracy relative to the other boundary models, its training time of 0.035 seconds was by far the lowest. Its prediction time, 0.013 seconds, is also remarkably low, though not the lowest of all the trials.
The ROC curve and its corresponding AUC (area under the curve) appear to be rendering incorrectly. With this high an accuracy, a strong ROC curve with a high AUC would be expected. This does not appear to be the case, indicating that something is wrong with how my ROC curves are generated in all trials.
Analyzing the confusion matrix, it is very clear that this model has a very strong bias towards predicting the 5-star case. This, paired with an accuracy of 19.03%, leads me to believe this method of extrapolating from 2 classes into 5 classes is not sufficient.
From the confusion matrix it can be seen that linear SVM classification has a bias towards the extreme ratings, 1 and 5 star ratings. This tells us that this model has difficulties determining the sentiment of more neutral reviews. This result is not unexpected, but is still noteworthy.
From the confusion matrix we can see that this model was slightly more accurate than the multinomial naive bayes model. However, only by about 3%. An accuracy of 53.94% is an improvement, but in order to be able to use the complete set for sentiment analysis the model accuracy needs to be able to compete with the boundary accuracy.
Beyond having a low accuracy this model took 283.518 seconds to train and required 75.695 seconds to predict the whole set. This is very slow when compared to the previous model.
This confusion matrix shows us that there is no clear bias towards positivity or negativity. Linear SVM classification in the boundary case appears to be a strong candidate with no clear bias and an accuracy of 94.66%. This is the most accurate model we have seen so far, but its downside is the training time required for this model, 9.282 seconds. With respect to the boundary case this is significantly longer than multinomial naive bayes, and it also has a slow prediction time of 3.298 seconds.
Analyzing the confusion matrix, it is very clear that this model has a very strong bias towards predicting the 3-star case. This, along with an accuracy of 19.35%, leads me to believe this method of extrapolating from 2 classes into 5 classes is also not sufficient for this classifier.
Similar to linear SVM classification, SVC has a bias towards the extreme ratings (1 and 5 stars). This model also has an accuracy of 52.54%, again showing that the SVM and SVC classifiers do not appear to be remarkably different for this project.
This model had a training time of 27.419 seconds and a prediction time of 0.031 seconds, differentiating it from SVM; however, with such low accuracy, these low training and prediction times are not as valuable.
This model was able to achieve an accuracy of 94.53%, placing it in the stronger model category. This model also does not appear to have a bias toward positivity or negativity, which is always favorable.
It was also able to train in 0.389 seconds and predict in a staggering 0.003 seconds. This makes linear SVC one of the most efficient unbiased classifiers in the entire set.
Analyzing the confusion matrix, it is very clear that this model has a very strong bias towards predicting the 3-star case. This, along with an accuracy of 19.57%, leads me to believe this method of extrapolating from 2 classes into 5 classes is also not sufficient for this classifier.
This model was able to predict with an accuracy of 55.12%, making it one of the most accurate complete models trained in this experiment. Like most of the other complete models, it has a clear bias in the confusion matrix, which shows a strong bias towards extreme values (1 and 5 star ratings).
This model trained in 28.939 seconds and had a prediction time of 0.038 seconds.
This classifier was able to predict with an accuracy of about 94.9%. This is not statistically impressive with respect to the other boundary classifiers. It also appears to have a very slight bias towards negative ratings over positive ratings.
The model does possess some strong characteristics, such as very fast training and prediction times of 1.064 seconds and 0.003 seconds respectively.
Analyzing the confusion matrix, it is apparent that this model has a very strong bias towards predicting the 3-star case. This, along with an accuracy of 19.46%, leads me to believe this method of extrapolating from 2 classes into 5 classes is also not sufficient for this classifier.
The random forest classifier was unable to produce an accurate model while requiring a notably long training time. Its accuracy is 51.86% despite requiring 2088.583 seconds to train. This may be due to n_estimators being arbitrarily set to 100; after finding the ideal n_estimators count, this model will hopefully be more accurate while taking less time to train.
Once trained the prediction time was 2.364 seconds, which is not a poor time with regards to the other complete models.
The confusion matrix for this model also indicates a strong bias towards the extremes, with an even stronger bias towards negative ratings. This may be an artifact of the classifier, or it could be resolved when n_estimators is properly set.
Again, this type of model does not appear to be very successful, with a boundary accuracy of about 92.18%. Similarly to the complete model, this model took a long time to run, 113.029 seconds, without producing a strong model.
Strangely enough, this model also has a noticeable bias towards negative ratings, potentially indicating that this bias is intrinsic to the model itself.
Once trained, this model took only 0.570 seconds to run its predictions over the entire set, yielding a decent prediction time.
Analyzing the confusion matrix, it also has a very strong bias towards predicting the 3-star case. This, along with an accuracy of 19.51%, leads me to believe this method of extrapolating from 2 classes into 5 classes is also not sufficient for this classifier.
Artificial neural networks were able to be tailored to the problem at hand more specifically than any other model, and their accuracy appears to reflect that. This model's accuracy was roughly 57.83%, the highest of the entire complete model set.
The model also has some bias towards certain ratings (4, 5, and 1 stars). It is not a strong bias, but it is noticeable; it may be avoidable by increasing or decreasing the dropout layer's rate.
The issue with neural networks is that they take some time to train. This one required 759.116 seconds to train and 62.661 seconds to predict, meaning it is fairly slow, but not the absolute slowest.
Just as the artificial neural network was the most accurate complete model, the boundary case performed with an accuracy higher than any other model: 95.68%, with no clear bias in the confusion matrix.
However, with a training time of 125.511 seconds, this model is slightly slower than the random forest classifier, and with a slow prediction time of about 8.725 seconds it is far from efficient.
Analyzing the confusion matrix, it also has a very strong bias towards predicting the 3 and 4 star cases. This, along with an accuracy of 19.35%, leads me to believe this method of extrapolating from 2 classes into 5 classes is not sufficient for this or any classifier.
From these graphs we can see that there is only a slight variance in ROC area under the curves, which have been shown to correlate to the accuracy of the model. However we can see that there is a lot of variability with regard to the training times and prediction times of different models.
When only considering accuracy and training time, multinomial naive bayes may appear to be the most efficient model. However, I would be tempted to discount it solely because of its intrinsic bias towards low ratings.
Personally I would venture to say that regardless of the training time and prediction times neural network classification is the most accurate while yielding the least amount of bias.
Similarly to the complete case, an argument could be made for multinomial naive bayes; however, in this case I would argue that the model is not accurate enough to be considered efficient. There is a contender with a relatively short training time, no noticeable bias, and high accuracy: linear SVC.
Again, if we do not care about prediction and training times, the neural network achieves the highest accuracy. However, I believe a case can also be made for linear SVC.
In the extended boundary case I attempted to extend the boundary models from 2 classes to all 5 classes. This attempt was unsuccessful, as no classifier achieved an accuracy higher than the 20% baseline of random guessing (1 out of 5).
The only notable result from this trial is that extending into all 5 classes increases prediction time, as the predictions have to be reworked slightly.
In the future I believe the boundary case may be successfully extended into all 5 classes, potentially by assigning each rating based on more metrics than just the probability difference between the 1-star and 5-star classes. It may also be useful to implement a secondary classification method that takes the two probabilities as inputs and classifies 1 to 5 based on those values.