Supervised-learning-using-Word-Vectors

Data Set and Problem Statement

I analyzed a large Kaggle dataset of IMDB movie reviews to predict whether a review is positive or negative. The data is split into training and testing sets. I fit a Random Forest classifier on "Bag of Words" features, then used the trained model to predict the sentiment labels of the reviews in the test set. Along the way I learned how word vectors can play an important role in supervised prediction on a text corpus.

Methodology

Pre-processing: cleaning, lowercasing, tokenization, stopword removal
Feature Extraction: Bag of Words (CountVectorizer)
Classification: Random Forest
Dimension Reduction: feature importances
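
The first two steps can be sketched in plain Python to show what they produce. This is a minimal illustration using only the standard library, not the project's Spark code: the stopword list, the HTML-stripping rule, and the helper names `preprocess`, `build_vocabulary`, and `vectorize` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def preprocess(review):
    """Clean, lowercase, tokenize, and drop stopwords from one review."""
    text = re.sub(r"<[^>]+>", " ", review)   # assumption: reviews may contain HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    tokens = text.lower().split()            # lowercase, then split on whitespace
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(token_lists, max_features):
    """Map the max_features most frequent tokens to column indices."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {token: i for i, (token, _) in enumerate(ranked[:max_features])}


def vectorize(tokens, vocabulary):
    """Turn one token list into a bag-of-words count vector."""
    vector = [0] * len(vocabulary)
    for t in tokens:
        if t in vocabulary:
            vector[vocabulary[t]] += 1
    return vector


# Made-up reviews, for illustration only.
reviews = [
    "This movie was great, great fun!<br />",
    "The plot was dull and the acting was dull.",
]
token_lists = [preprocess(r) for r in reviews]
vocabulary = build_vocabulary(token_lists, max_features=5)
vectors = [vectorize(t, vocabulary) for t in token_lists]
```

Each review becomes a fixed-length vector of word counts, which is the kind of input a Random Forest classifier can be trained on. Words outside the vocabulary (here, anything past the five most frequent tokens) are simply ignored.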

Result

Testing the model with 25,000 features yields about 85.3% accuracy. The figure below shows the precision, recall, and F1-score values on the test data.

The diagram below shows the confusion matrix obtained.
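
For readers who want to relate a confusion matrix to the reported scores, here is a small sketch of how accuracy, precision, recall, and F1 follow from the four cells of a binary confusion matrix. The function name `classification_metrics` and the counts passed to it are made up for illustration; they are not the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives, with "positive review" as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only, not taken from the project.
metrics = classification_metrics(tp=45, fp=15, fn=5, tn=35)
```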

After reducing dimensionality by bringing the feature count down to 20,000, the model yields about 86% accuracy. The figure below shows the precision, recall, and F1-score values on the test data.

The diagram below shows the confusion matrix obtained.
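
One plausible way to go from a larger feature set to a smaller one using feature importances is to rank the columns by importance and keep only the top k. The sketch below assumes that approach; the helper names `select_top_features` and `reduce_columns`, the importance scores, and the sample vectors are hypothetical, and the project's actual selection rule may differ.

```python
def select_top_features(importances, k):
    """Return the indices of the k most important features, in column order."""
    # Rank by descending importance; break ties by the lower column index.
    ranked = sorted(range(len(importances)), key=lambda i: (-importances[i], i))
    return sorted(ranked[:k])


def reduce_columns(vectors, keep):
    """Keep only the selected columns of each feature vector."""
    return [[row[i] for i in keep] for row in vectors]


# Made-up importance scores, such as a trained Random Forest might report.
importances = [0.05, 0.40, 0.00, 0.30, 0.25]
keep = select_top_features(importances, k=3)

# Made-up bag-of-words vectors with five columns each.
vectors = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0]]
reduced = reduce_columns(vectors, keep)
```

The classifier is then retrained on the reduced vectors, so columns that contributed little to the first model no longer add noise.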

How to run

Run the project by submitting the script to Spark:

spark-submit ClassficationRandomForest.py labeledTrainData.tsv
