Research members: Alejandro Munoz; Chun Ho Wong; Edmund Wong; Yuan Wang; Dong Liu
This code analyzes an anime dataset. It covers data preprocessing, normalization, K-Means clustering, correlation analysis, K-Nearest Neighbors (KNN) classification, and Non-negative Matrix Factorization (NMF).
The code starts by loading the dataset and applying the following preprocessing steps:
- Removing movies from the data by filtering out rows with the 'Type' column equal to 'Movie'.
- Removing rows with 'Unknown' values.
- Selecting relevant columns to create a new DataFrame: 'MAL_ID', 'Score', 'Episodes', 'Ranked', 'Popularity', 'Members', 'Favorites'.
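The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the small `anime` DataFrame is made-up sample data standing in for the real dataset, the `preprocess` helper is a hypothetical name, and it assumes that missing values appear as the literal string 'Unknown'.

```python
import pandas as pd

# Made-up sample rows standing in for the real anime dataset (assumption).
anime = pd.DataFrame({
    "MAL_ID": [1, 2, 3, 4, 5],
    "Type": ["TV", "Movie", "TV", "OVA", "TV"],
    "Score": ["8.5", "7.9", "Unknown", "6.4", "7.2"],
    "Episodes": ["26", "1", "12", "Unknown", "24"],
    "Ranked": ["28.0", "150.0", "900.0", "4000.0", "1100.0"],
    "Popularity": [39, 500, 1200, 3000, 800],
    "Members": [1200000, 250000, 40000, 9000, 75000],
    "Favorites": [60000, 3000, 150, 12, 900],
})

COLUMNS = ["MAL_ID", "Score", "Episodes", "Ranked",
           "Popularity", "Members", "Favorites"]


def preprocess(df):
    """Hypothetical helper mirroring the three steps listed above."""
    # 1. Remove movies.
    df = df[df["Type"] != "Movie"]
    # 2. Remove rows that contain 'Unknown' in any column.
    df = df[~df.isin(["Unknown"]).any(axis=1)]
    # 3. Keep only the relevant columns and convert text to numbers.
    df = df[COLUMNS].apply(pd.to_numeric)
    return df.reset_index(drop=True)


clean = preprocess(anime)
print(clean)
```

Converting the columns with `pd.to_numeric` after dropping 'Unknown' matters because a column that mixes numbers and the string 'Unknown' is loaded as text.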
The selected data is then normalized using MinMaxScaler() from the sklearn.preprocessing module.
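As a rough illustration of what min-max normalization does, the sketch below rescales each column to the range [0, 1]. The two-column `features` DataFrame is invented sample data, not the project's real values.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented sample values (assumption); the real data has more columns.
features = pd.DataFrame({
    "Score": [6.0, 7.0, 8.0],
    "Members": [1000, 50000, 100000],
})

scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
print(scaled)
```

Each value becomes (x - min) / (max - min) for its column, so features with very different ranges, such as 'Score' and 'Members', end up on a comparable scale before clustering or KNN.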
A K-Means clustering algorithm is applied to the normalized data, and the resulting clusters are visualized using a scatter plot.
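A minimal sketch of the clustering step is shown below. The data is synthetic (two well-separated groups of points in [0, 1]), and the choice of two clusters is an assumption made for the example; the number of clusters used in the project is not stated here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, already-normalized data: two well-separated groups (assumption).
rng = np.random.default_rng(0)
low = rng.uniform(0.0, 0.3, size=(20, 2))
high = rng.uniform(0.7, 1.0, size=(20, 2))
X = np.vstack([low, high])

# n_clusters=2 is chosen for this example only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)
print(kmeans.cluster_centers_)
```

To visualize the result, the `labels` array can be passed as the color argument of a scatter plot, for example `plt.scatter(X[:, 0], X[:, 1], c=labels)` with matplotlib.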
A correlation table is generated using the Pearson correlation method to analyze the relationship between features and the presence of sequels.
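The correlation step might look like the sketch below. The column name `Has_Sequel` and the sample values are hypothetical; the source does not say how the sequel flag is stored.

```python
import pandas as pd

# Invented sample data; 'Has_Sequel' is a hypothetical 0/1 column name.
df = pd.DataFrame({
    "Score": [1.0, 2.0, 3.0, 4.0],
    "Favorites": [10, 40, 20, 30],
    "Has_Sequel": [0, 0, 1, 1],
})

# Pearson correlation between every pair of columns.
corr_table = df.corr(method="pearson")

# Correlation of each feature with the sequel flag.
sequel_corr = corr_table["Has_Sequel"].drop("Has_Sequel")
print(sequel_corr)
```

Values close to 1 or -1 indicate a strong linear relationship with the sequel flag, while values near 0 indicate little linear relationship.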
The KNN algorithm is used to classify whether an anime has a sequel or not based on the selected features. The dataset is split into a training set and a test set, and the classifier is trained on the training set. The performance of the classifier is evaluated using a confusion matrix.
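The classification workflow can be sketched as follows. The data is synthetic, and the values of `n_neighbors`, `test_size`, and `random_state` are assumptions chosen for the example rather than the project's settings.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic normalized features: class 0 = no sequel, class 1 = has sequel.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.uniform(0.0, 0.3, size=(30, 2)),  # anime without a sequel
    rng.uniform(0.7, 1.0, size=(30, 2)),  # anime with a sequel
])
y = np.array([0] * 30 + [1] * 30)

# Hold out 25% of the rows for testing; stratify keeps both classes present.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)  # k=5 is an assumption
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
```

The diagonal of the confusion matrix counts correct predictions; the off-diagonal entries count anime wrongly predicted to have, or not have, a sequel.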
NMF is applied to the user-anime rating matrix, which is built from a separate ratings dataset. The NMF model decomposes the ratings matrix into latent factors (concepts), and the top anime for each concept are displayed.
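A small sketch of the NMF step is given below. The ratings, the column names `user_id`, `anime_id`, and `rating`, the `titles` lookup, and the choice of two components are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Invented long-format ratings: users 1-3 like anime 1 and 2,
# users 4-6 like anime 3 and 4 (assumption).
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "anime_id": [1, 2, 1, 2, 1, 2, 3, 4, 3, 4, 3, 4],
    "rating":   [10, 9, 10, 9, 10, 9, 8, 7, 8, 7, 8, 7],
})
titles = {1: "Anime A", 2: "Anime B", 3: "Anime C", 4: "Anime D"}  # hypothetical

# User-anime matrix; unrated entries are filled with 0.
R = ratings.pivot_table(index="user_id", columns="anime_id",
                        values="rating", fill_value=0)

# n_components=2 is chosen for this example only.
model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(R.values)   # user-to-concept weights
H = model.components_               # concept-to-anime weights

# For each concept, list the two anime with the largest weights.
top_anime = []
for row in H:
    top_ids = R.columns[np.argsort(row)[::-1][:2]]
    top_anime.append([titles[i] for i in top_ids])
print(top_anime)
```

Each row of `H` describes one concept as a set of non-negative weights over anime, so sorting a row in descending order gives the anime most strongly associated with that concept.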
To better understand the code, run each section separately and inspect its output. This shows what each step of the analysis does and how each technique is applied to the dataset.