The IEEE document, which contains the images and a more in-depth explanation of the project, can be found inside the repository. The professor required it as an academic-paper best practice.
Data Extraction
It was brought to my attention that IMDb has the best dataset, so the data for this project came from the IMDb Dataset Interface.
Data Preprocessing
The extraction brought in about 5 thousand rows and 30 columns, which were then cleaned and filtered down to the relevant content. This step was done with the Pandas library, which filtered the data and saved it to another CSV file while discarding the irrelevant information.
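The filtering step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the tiny inline sample, and the output file name `movies_clean.csv` are all assumptions made for the example.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical raw export; the real file has about 5,000 rows and 30 columns.
RAW_CSV = """title,duration,genre,score,homepage
Movie A,120,Drama,7.1,http://example.com/a
Movie B,,Comedy,6.3,
Movie C,95,Action,,http://example.com/c
Movie D,110,Drama,8.0,
"""

# Columns assumed to be relevant for the model (illustrative names only).
RELEVANT_COLUMNS = ["duration", "genre", "score"]


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep only the relevant columns and drop rows with missing values."""
    return raw[RELEVANT_COLUMNS].dropna().reset_index(drop=True)


raw_df = pd.read_csv(io.StringIO(RAW_CSV))
clean_df = clean(raw_df)

# Save the filtered data to a separate CSV file (hypothetical file name).
out_path = os.path.join(tempfile.gettempdir(), "movies_clean.csv")
clean_df.to_csv(out_path, index=False)
```

Selecting the columns before calling `dropna()` means a row is only discarded when one of the relevant fields is missing, not when an unused column such as `homepage` is empty.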
Machine Learning Techniques
This step used the Sklearn library, which provided the functions for training and testing the KNN algorithm. The Statsmodels library was also used to run the Linear Regression algorithm and to document the results in a .txt file.
Python project to predict the rating of a movie based on:
- Genre
- Duration
- Rating
- Age Rating
- Actor Likes
- Movie Likes
- Director Likes
- Cast Total Likes
- User Reviews
- Critic Reviews
- Creating virtual environment:
python -m venv venv
- Installing requirements:
make install
- Running the application:
python main.py
Algorithm | Accuracy | Execution Time | Test Size (%) |
---|---|---|---|
KNN | 0.7895 | 0.0016s | 40 |
Linear Regression | 0.6202 | 4.741s | 40 |
Note that KNN performed considerably better than Linear Regression. A range of test sizes from 20 to 80 percent was tried, and the results remained the same. The same was done for the K value, ranging up to 70, and the results were unsurprising by the time K reached 35.
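The two sweeps described above could look something like the sketch below. It uses synthetic data and a KNN classifier, both of which are assumptions; the helper `knn_accuracy` and the fixed values used while the other parameter is swept are hypothetical choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cleaned movie data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X @ np.array([0.8, -0.5, 0.3, 0.0]) + rng.normal(scale=0.5, size=500) >= 0).astype(int)


def knn_accuracy(test_size: float, k: int) -> float:
    """Train KNN with the given K and return accuracy on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    return model.score(X_test, y_test)


# Sweep the test size from 20% to 80% with K held fixed (K=5 is an assumption).
size_scores = {size: knn_accuracy(size, 5) for size in (0.2, 0.4, 0.6, 0.8)}

# Sweep K from 1 to 70 with the test size held at 40%.
k_scores = {k: knn_accuracy(0.4, k) for k in range(1, 71)}
best_k = max(k_scores, key=k_scores.get)

for size, score in size_scores.items():
    print(f"test_size={size:.0%} accuracy={score:.4f}")
print(f"best K in this sketch: {best_k} (accuracy={k_scores[best_k]:.4f})")
```

Fixing `random_state` keeps the split identical across K values, so differences in accuracy come from K alone rather than from a different shuffle.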