Predicting movies' IMDB rating scores with machine learning techniques, by Liuzhao 'Carlos' Tang
Code files:
- Exploratory Data Analysis.ipynb
- Extracting_IMDB_Data.ipynb
- Data Preprocessing.ipynb
- Model Selection.ipynb
Data files:
- IMDA_DATA.csv
- dependent_var.csv
- features.csv
Other files:
- Machine Learning Project Report.pdf
- README.md
The project predicts a movie's IMDB score from its properties and social network data using machine learning techniques.
It gives industry analysts a way to estimate movie quality and offers insights for film distributors.
Task: Predict a movie's rating score on IMDB.com based on its properties and related social network data.
Evaluation: Models are evaluated by mean squared error (MSE).
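As a quick illustration of the metric (the scores below are made up, not from the project's data), MSE averages the squared prediction errors:

```python
import numpy as np

# Hypothetical actual and predicted IMDB scores
y_true = np.array([7.5, 6.0, 8.2])
y_pred = np.array([7.0, 6.5, 8.0])

# MSE = mean of squared residuals; lower is better
mse = np.mean((y_true - y_pred) ** 2)  # ≈ 0.18
```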
The raw data came from an online dataset. It had 5043 rows and 28 columns (15 numerical variables, 12 categorical variables, and 1 target variable).
The dependent variable is the rating score (imdb_score).
The independent variables have two parts: properties variables and social network variables.
- Properties variables are attributes of a movie, such as box office performance, duration, budget, release year, aspect_ratio, content rating, cast, and so on.
- Social network variables are social network indicators related to a movie, like Facebook likes and the number of reviews.
- Missing Value
  - update missing values by extracting data from the IMDB database with the cinemagoer package
  - fill remaining missing values with 'Other' for categorical variables and with the column mean for numeric variables
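The fill rule above can be sketched with pandas (a toy two-column frame; `content_rating` and `budget` are columns from the dataset's schema):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "content_rating": ["PG", None, "R"],   # categorical
    "budget": [1e6, np.nan, 3e6],          # numeric
})

# Categorical: fill with the literal 'Other'
df["content_rating"] = df["content_rating"].fillna("Other")
# Numeric: fill with the column mean
df["budget"] = df["budget"].fillna(df["budget"].mean())
```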
- Outliers
  - locate variables that contain outliers based on EDA
  - remove the outliers and recheck the variable's distribution: keep mild outliers, remove extreme ones
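The mild/extreme distinction can be made with Tukey's fences (an assumption — the report doesn't name the exact rule): values between 1.5×IQR and 3×IQR beyond the quartiles are mild outliers, values beyond 3×IQR are extreme.

```python
import numpy as np

def classify_outliers(x):
    """Split values into mild and extreme outliers using Tukey's fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fences: mild boundary
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # outer fences: extreme boundary
    mild = [v for v in x
            if (outer[0] <= v < inner[0]) or (inner[1] < v <= outer[1])]
    extreme = [v for v in x if v < outer[0] or v > outer[1]]
    return mild, extreme

mild, extreme = classify_outliers([1, 2, 3, 4, 5, 100])
```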
- Feature Engineering
  - for categorical variables, there were two encoding strategies: (1) if a variable has few unique values (e.g. color), use one-hot encoding; (2) if it has many unique values, use hashing encoding
  - for the columns "movie_title" and "plot_keywords", create two new features: word_num (the number of words) and avg_length (the average word length)
- No transformation for numeric variables
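A sketch of the one-hot and text-feature steps with pandas (a toy two-row frame; the hashing-encoding branch is omitted here):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Color", "Black and White"],
    "movie_title": ["The Dark Knight", "Up"],
})

# Low-cardinality categorical: one-hot encoding
df = pd.get_dummies(df, columns=["color"])

# Text-derived features from the title
words = df["movie_title"].str.split()
df["word_num"] = words.str.len()                                  # number of words
df["avg_length"] = words.apply(lambda w: sum(len(t) for t in w) / len(w))
```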
- Models and Methods
- lasso regression model
- support vector machine (SVM) model
- random forest model
- extreme gradient boosting model
- blending ensemble model
- stacking ensemble model
- Dataset Split
- train set : test set = 80% : 20%
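A blending ensemble fits base models on part of the training split and a meta-model on their holdout predictions; a minimal scikit-learn sketch on synthetic data (the base models and hyperparameters are placeholders, not the project's tuned ones):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Carve a holdout set out of the training split for the meta-model
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

base = [Lasso(alpha=0.1), RandomForestRegressor(n_estimators=50, random_state=0)]
for m in base:
    m.fit(X_fit, y_fit)

# The meta-model learns how to weight the base models' predictions
meta_X = np.column_stack([m.predict(X_hold) for m in base])
meta = LinearRegression().fit(meta_X, y_hold)

blend_pred = meta.predict(np.column_stack([m.predict(X_test) for m in base]))
```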
The blending ensemble model had the lowest MSE (0.644).
| Model | MSE |
| --- | --- |
| Lasso Regression | 0.911 |
| SVM | 1.292 |
| Random Forest | 0.722 |
| XGB | 0.769 |
| Blending Ensemble Model | 0.644 |
| Stacking Ensemble Model | 0.653 |
Top 5 important features are:
- the number of voted users
- the movie duration
- the movie budget
- the release year of the movie
- the number of users for reviews
| Feature name | Importance |
| --- | --- |
| num_voted_users | 0.211623 |
| duration | 0.126319 |
| budget | 0.069417 |
| title_year | 0.065868 |
| num_user_for_reviews | 0.060038 |
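Importances like those in the table come directly from a fitted tree ensemble; a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort descending for a top-k list
order = np.argsort(rf.feature_importances_)[::-1]
top_importances = rf.feature_importances_[order]
```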
- The blending ensemble model achieved the best performance, so it is the recommended model to deploy to production.
- The number of voted users, duration, budget, release year, and the number of users for reviews have the highest importance. Social network data has less effect than the movie's properties data.
- Data
- choose a higher-quality dataset
- Feature engineering
  - check numeric variables' distributions; if they don't follow a normal distribution, apply a transformation such as Box-Cox before standardization
  - try other encoding techniques for high-cardinality columns
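The Box-Cox transformation mentioned above is available in SciPy; note it requires strictly positive inputs (a sketch on synthetic skewed data):

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive values (Box-Cox requires x > 0)
x = np.random.default_rng(0).lognormal(size=500)

x_bc, lam = stats.boxcox(x)                  # lam is the fitted lambda
x_std = (x_bc - x_bc.mean()) / x_bc.std()    # standardize afterwards
```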
- Model
  - try other models like Extra Randomized Trees (ERT) and neural networks
  - try other ensemble methods like bagging and boosting
  - implement grid search for the second-layer model in the stacking ensemble
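Grid search over a second-layer model could use scikit-learn's GridSearchCV; here Ridge stands in as a placeholder meta-model, and the parameter grid is illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",  # matches the project's MSE metric
    cv=5,
)
grid.fit(X, y)
```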
- Evaluation
- use multiple metrics to evaluate model performance