Predicting movies' IMDB rating scores with machine learning techniques, by Liuzhao 'Carlos' Tang
Code files:
- Exploratory Data Analysis.ipynb
- Extracting_IMDB_Data.ipynb
- Data Preprocessing.ipynb
- Model Selection.ipynb
Data files:
- IMDA_DATA.csv
- dependent_var.csv
- features.csv
Other files:
- Machine Learning Project Report.pdf
- README.md
The project predicts a movie's IMDB score from its properties and social network data using machine learning techniques.
It gives industry analysts a way to estimate movie quality and offers insights for film distributors.
Task: Predict a movie's rating score on IMDB.com based on its properties and related social network data.
Evaluation: Models are evaluated by mean squared error (MSE).
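As a quick illustration of the metric (the scores below are made up, not from the project's data), MSE averages the squared prediction errors:

```python
import numpy as np

# Hypothetical actual and predicted IMDB scores
y_true = np.array([7.5, 6.0, 8.2])
y_pred = np.array([7.0, 6.5, 8.0])

# MSE = mean of squared residuals; lower is better
mse = np.mean((y_true - y_pred) ** 2)  # ≈ 0.18
```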
The raw data came from an online dataset. It had 5043 rows and 28 columns (15 numerical variables, 12 categorical variables, and 1 target variable).
The dependent variable is the rating score (imdb_score).
The independent variables have two parts: properties variables and social network variables.
- Properties variables are attributes of a movie, such as box office performance, duration, budget, release year, aspect_ratio, content rating, cast, and so on.
- Social network variables are social network indicators related to a movie, like Facebook likes and the number of reviews.
- Missing Value
  - update missing values by extracting data from the IMDB database with the cinemagoer package
  - fill remaining missing values with 'Other' for categorical variables and with the column mean for numeric variables
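The fill rule above can be sketched with pandas (a toy two-column frame; `content_rating` and `budget` are columns from the dataset's schema):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "content_rating": ["PG", None, "R"],   # categorical
    "budget": [1e6, np.nan, 3e6],          # numeric
})

# Categorical: fill with the literal 'Other'
df["content_rating"] = df["content_rating"].fillna("Other")
# Numeric: fill with the column mean
df["budget"] = df["budget"].fillna(df["budget"].mean())
```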
- Outliers
  - locate variables that contain outliers based on EDA
  - remove the outliers and recheck the variable's distribution: keep mild outliers, remove extreme ones
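The mild/extreme distinction can be made with Tukey's fences (an assumption — the report doesn't name the exact rule): values between 1.5×IQR and 3×IQR beyond the quartiles are mild outliers, values beyond 3×IQR are extreme.

```python
import numpy as np

def classify_outliers(x):
    """Split values into mild and extreme outliers using Tukey's fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fences: mild boundary
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # outer fences: extreme boundary
    mild = [v for v in x
            if (outer[0] <= v < inner[0]) or (inner[1] < v <= outer[1])]
    extreme = [v for v in x if v < outer[0] or v > outer[1]]
    return mild, extreme

mild, extreme = classify_outliers([1, 2, 3, 4, 5, 100])
```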
- Feature Engineering
  - for categorical variables, there were two encoding strategies: (1) if a variable has few unique values (e.g. color), use one-hot encoding; (2) if it has many unique values, use hashing encoding
  - for the columns "movie_title" and "plot_keywords", create two new features: word_num (the number of words) and avg_length (the average word length)
- No transformation for numeric variables
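A sketch of the one-hot and text-feature steps with pandas (a toy two-row frame; the hashing-encoding branch is omitted here):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Color", "Black and White"],
    "movie_title": ["The Dark Knight", "Up"],
})

# Low-cardinality categorical: one-hot encoding
df = pd.get_dummies(df, columns=["color"])

# Text-derived features from the title
words = df["movie_title"].str.split()
df["word_num"] = words.str.len()                                  # number of words
df["avg_length"] = words.apply(lambda w: sum(len(t) for t in w) / len(w))
```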
- Models and Methods
- lasso regression model
- support vector machine (SVM) model
- random forest model
- extreme gradient boosting model
- blending ensemble model
- stacking ensemble model
- Dataset Split
- train set : test set = 80% : 20%
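A blending ensemble fits base models on part of the training split and a meta-model on their holdout predictions; a minimal scikit-learn sketch on synthetic data (the base models and hyperparameters are placeholders, not the project's tuned ones):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Carve a holdout set out of the training split for the meta-model
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

base = [Lasso(alpha=0.1), RandomForestRegressor(n_estimators=50, random_state=0)]
for m in base:
    m.fit(X_fit, y_fit)

# The meta-model learns how to weight the base models' predictions
meta_X = np.column_stack([m.predict(X_hold) for m in base])
meta = LinearRegression().fit(meta_X, y_hold)

blend_pred = meta.predict(np.column_stack([m.predict(X_test) for m in base]))
```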
The blending ensemble model had the lowest MSE (0.644).
| Model | MSE |
| --- | --- |
| Lasso Regression | 0.911 |
| SVM | 1.292 |
| Random Forest | 0.722 |
| XGB | 0.769 |
| Blending Ensemble Model | 0.644 |
| Stacking Ensemble Model | 0.653 |
Top 5 important features are:
- the number of voted users
- the movie duration
- the movie budget
- the release year of the movie
- the number of users for reviews
| Feature name | Importance |
| --- | --- |
| num_voted_users | 0.211623 |
| duration | 0.126319 |
| budget | 0.069417 |
| title_year | 0.065868 |
| num_user_for_reviews | 0.060038 |
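Importances like those in the table come directly from a fitted tree ensemble; a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort descending for a top-k list
order = np.argsort(rf.feature_importances_)[::-1]
top_importances = rf.feature_importances_[order]
```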
- The blending ensemble model achieved the best performance, so it is the recommended model to deploy to production.
- The number of voted users, duration, budget, release year, and the number of users for reviews have the highest importance. Social network data has less effect than the movie's properties data.
- Data
- choose a higher-quality dataset
- Feature engineering
  - check numeric variables' distributions; if they don't follow a normal distribution, apply a transformation such as Box-Cox before standardization
  - try other encoding techniques for high-cardinality columns
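The Box-Cox transformation mentioned above is available in SciPy; note it requires strictly positive inputs (a sketch on synthetic skewed data):

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive values (Box-Cox requires x > 0)
x = np.random.default_rng(0).lognormal(size=500)

x_bc, lam = stats.boxcox(x)                  # lam is the fitted lambda
x_std = (x_bc - x_bc.mean()) / x_bc.std()    # standardize afterwards
```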
- Model
  - try other models like Extra Randomized Trees (ERT) and neural networks
  - try other ensemble methods like bagging and boosting
  - implement grid search for the second-layer model in the stacking ensemble
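Grid search over a second-layer model could use scikit-learn's GridSearchCV; here Ridge stands in as a placeholder meta-model, and the parameter grid is illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",  # matches the project's MSE metric
    cv=5,
)
grid.fit(X, y)
```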
- Evaluation
- use multiple metrics to evaluate model performance