I analyzed a large Kaggle dataset of IMDB movie reviews to predict whether a review is positive or negative. The data is split into training and testing sets. I fit a Random Forest classifier on "Bag of Words" features and then used the trained model to predict the sentiment label of the reviews in the test set. Through this project I came to understand how learning word vectors can play an important role in supervised prediction on a text corpus.
- Pre-processing: cleaning, lowercasing, tokenization, stopword removal
- Feature Extraction: Bag of Words (CountVectorizer)
- Classification: Random Forest
- Dimension Reduction: feature importances
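The pre-processing and Bag of Words steps listed above can be sketched in plain Python. This is only an illustration of the idea, not the project's Spark code: the stopword list, the helper names (`preprocess`, `build_vocabulary`, `to_count_vector`), and the two sample reviews are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only;
# the list used in the project is not given in this README.
STOPWORDS = {"a", "an", "the", "is", "was", "this", "it", "and", "of", "to"}


def preprocess(review):
    """Clean, lowercase, tokenize, and drop stopwords from one review."""
    text = re.sub(r"<[^>]+>", " ", review)   # strip HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    tokens = text.lower().split()            # lowercase, split on whitespace
    return [t for t in tokens if t not in STOPWORDS]


def build_vocabulary(token_lists, max_features):
    """Map the max_features most frequent tokens to column indices."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: i for i, word in enumerate(sorted(kept))}


def to_count_vector(tokens, vocab):
    """Turn one tokenized review into a vector of term counts."""
    vector = [0] * len(vocab)
    for t in tokens:
        idx = vocab.get(t)
        if idx is not None:
            vector[idx] += 1
    return vector


# Made-up sample reviews, not taken from the dataset.
reviews = [
    "This movie was great!<br />Great acting.",
    "The plot was boring and the acting was bad.",
]
token_lists = [preprocess(r) for r in reviews]
vocab = build_vocabulary(token_lists, max_features=25000)
vectors = [to_count_vector(tokens, vocab) for tokens in token_lists]
```

Each review becomes one row of term counts, and those rows are what a classifier such as a Random Forest is trained on. Capping the vocabulary at the most frequent terms is how the feature count stays bounded.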
With 25,000 Bag of Words features, the model reaches about 85.3% accuracy on the test data. The figure below shows the precision, recall, and F1-score on the test data.
The diagram below shows the confusion matrix obtained.
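For reference, precision, recall, F1-score, and accuracy all follow directly from the four cells of a binary confusion matrix. The sketch below shows that arithmetic; the function name `binary_metrics` and the counts passed to it are made up for illustration and are not the project's results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive standard metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only (not taken from the project's confusion matrix).
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```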
After using the feature importances to reduce the feature count to 20,000, the model reaches about 86% accuracy. The figure below shows the precision, recall, and F1-score on the test data.
The diagram below shows the confusion matrix obtained.
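One simple way to reduce dimensions with feature importances is to keep only the columns with the highest importance scores. This README does not say exactly how the 20,000 features were chosen, so the top-k rule below is an assumption, as are the helper names and the toy importance values.

```python
def select_top_features(importances, k):
    """Return the indices of the k most important features, in column order."""
    ranked = sorted(
        range(len(importances)),
        key=lambda i: importances[i],
        reverse=True,
    )
    return sorted(ranked[:k])


def reduce_vector(vector, kept_indices):
    """Keep only the selected columns of one feature vector."""
    return [vector[i] for i in kept_indices]


# Toy importance scores for six features (hypothetical values).
importances = [0.30, 0.00, 0.05, 0.40, 0.25, 0.00]
kept = select_top_features(importances, k=4)
reduced = reduce_vector([1, 0, 0, 2, 1, 0], kept)
```

The model is then retrained on the reduced vectors, which matches the step of going from 25,000 features down to 20,000.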
Run the project by submitting the script to Spark with the labeled training data file as its argument:
spark-submit ClassificationRandomForest.py labeledTrainData.tsv