This repository has been archived by the owner on Apr 5, 2024. It is now read-only.
code-movielens.Rmd
---
title: "edX Capstone Movielens Project"
author: "Ciro B Rosa"
date: "10-Oct-2021"
output:
  word_document: default
  pdf_document: default
---
### Introduction and Objective
This report presents the work performed on the "Movielens" dataset. The objective is to design a machine learning model that predicts the rating a given user would give to a movie that he or she has not previously seen, based on data such as that user's previous ratings of other movies, the user's preferences, etc. The performance of the model is measured by the RMSE (Root Mean Square Error): the lower, the better.
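For reference, the metric can be written as below. The symbols are notation introduced here for illustration: $N$ is the number of rated (user, movie) pairs, $y_{u,i}$ is the true rating of movie $i$ by user $u$, and $\hat{y}_{u,i}$ is the predicted rating.

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{u,i}\left(y_{u,i} - \hat{y}_{u,i}\right)^{2}}$$

Because the errors are squared before averaging, a few large misses weigh more than many small ones, and the result is expressed in the same unit as the ratings (stars).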
### Before you Begin: Prepare the Environment
This code has been tested in a dedicated Anaconda environment created on an Ubuntu 20.04 (MATE) Linux machine. All necessary scripts can be downloaded from the student's GitHub:
* https://github.com/cirobr/ds9-capstone-movielens.git
Next, the user should download and install Anaconda:
* https://www.anaconda.com/products/individual-d
Now it is time to create an environment named "r-gpu" by typing the following two commands in a terminal:
* conda create --name r-gpu python=3.9 notebook r-base=4.1 r-essentials r-e1071 r-irkernel r-varhandle r-foreach r-doparallel r-reticulate r-keras r-tfdatasets
* Rscript install-keras-gpu.R
Lastly, the script below, run in RStudio, executes the code provided by edX to generate the datasets and stores the files "edx.csv" and "validation.csv" in a sub-folder "./dat", which is also created in the process:
* code-preset.R
At this point, the code is ready for execution in RStudio through this script:
* code-movielens.R
### Code Organization and Project Development
The code is organized to execute the following tasks:
* Set up the environment and libraries, and define key functions, such as the one that calculates the RMSE;
* Read the edX dataset and split it into trainset / testset;
* Create, train and evaluate the models (naive average and neural network);
* Validate the model with the "validation" data set, and present the results.
The report will present all relevant outputs, as appropriate, in order to evidence the steps taken.
### Project Development
#### Setup environment
```{r warning=FALSE}
# environment
library(reticulate) # interface R / Python
use_condaenv("r-gpu", required = TRUE) # conda env for running tf and keras on gpu
# libraries
# library(stringi) # used on the code as stringi::stri_sub()
library(ggplot2)
library(lubridate)
library(tidyverse)
library(caret)
library(foreach) # multi-core computing for nzv()
library(keras) # tensorflow wrap
library(tfdatasets)
# global variables
numberOfDigits <- 5
options(digits = numberOfDigits)
proportionTestSet <- 0.20
numberOfEpochs <- 20 # keras training parameter
# error function
errRMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
# function: difference between timestamps in days
daysBetweenTimestamps <- function(x,y){
  difftime(as_date(as_datetime(x)),
           as_date(as_datetime(y)),
           units = c("days")) %>%
    as.numeric()
}
# function: extract each genre from column genres
# note: relies on the global variable genresVector being set before the call
extractGenresNames <- function(elementVector){
  as.numeric(grepl(elementVector, genresVector))
}
```
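As a quick sanity check, the two helper functions can be tried on toy values. This is a minimal sketch: the `toy_` values are hypothetical and not taken from the dataset, and the block is shown for illustration only, not evaluated as part of the report.

```r
library(lubridate)

# same definitions as in the setup chunk
errRMSE <- function(true_ratings, predicted_ratings) {
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
daysBetweenTimestamps <- function(x, y) {
  as.numeric(difftime(as_date(as_datetime(x)),
                      as_date(as_datetime(y)),
                      units = "days"))
}

# hypothetical ratings: every prediction misses by exactly one star
toy_true <- c(4.0, 3.0, 5.0)
toy_pred <- c(3.0, 4.0, 4.0)
toy_rmse <- errRMSE(toy_true, toy_pred)

# two Unix timestamps (in seconds) that are three days apart
toy_days <- daysBetweenTimestamps(3 * 86400, 0)
```

With every error equal to one star, the RMSE is also one star; the timestamps are converted to calendar dates first, so the difference is counted in whole days.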
#### Read and pre-process the edX dataset
To ensure the "validation" data set is not touched at all during the training stage, it is deleted from memory. Recall that its corresponding CSV file remains stored on the hard drive. Next, the "edx" data set is loaded into memory.
```{r warning=FALSE}
# clean memory
if(exists("validation")) {rm(validation)}
# read dataset from csv
print("pre-process edx")
if(!exists("edx")) {edx <- read_csv(file = "./dat/edx.csv") %>% as_tibble()}
head(edx)
```
Next, the following predictors will be extracted from the edX data set:
* "yearsFromRelease" is a predictor that indicates the number of years between the film's release and the timestamp of the rating. The year of release is extracted as "yearOfRelease" from the "title" column.
* "daysFromFirstUserRating" is a predictor that indicates the number of days between a given user's first rating and the timestamp of each rating made by that same user. The first rating from each user is a temporary variable.
* "daysFromFirstMovieRating" is similar to the above predictor. It gives the number of days between the first rating a given movie has received and the timestamp of each rating of that same movie. The first rating received by each movie is also a temporary variable.
* The column "genres" is a multi-label column that lists the genre(s) of the rated movie, separated by "|". The code collects the available genre categories and creates a binary column for each of them.
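The title and genre extractions described above can be sketched on two toy rows. The `toy_` names and values are hypothetical, and base R `substr()` is used here in place of `stringi::stri_sub()` to keep the sketch dependency-free; the block is illustrative and is not evaluated as part of the report.

```r
toy_titles <- c("Toy Story (1995)", "Pulp Fiction (1994)")
toy_genres <- c("Animation|Comedy", "Crime|Drama")

# the release year sits in the 5th-to-last through 2nd-to-last characters
toy_year <- as.numeric(substr(toy_titles,
                              nchar(toy_titles) - 4,
                              nchar(toy_titles) - 1))

# distinct genre labels found across all rows
toy_names <- unique(unlist(strsplit(toy_genres, "|", fixed = TRUE)))

# one 0/1 column per genre: 1 when the label appears in the row's genre string
toy_onehot <- sapply(toy_names,
                     function(g) as.numeric(grepl(g, toy_genres, fixed = TRUE)))
```

`toy_onehot` is a matrix with one row per rating and one named column per genre, which is the shape that is later bound to the main table.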
```{r warning=FALSE}
# move ratings to first column
edx2 <- edx %>% select(-c(rating))
edx2 <- cbind(rating = edx$rating, edx2)
# extract yearsFromRelease
edx2 <- edx2 %>%
  select(-c(genres)) %>%
  # the year is read from each row's own title, e.g. "Toy Story (1995)" -> 1995
  mutate(yearOfRelease = as.numeric(stringi::stri_sub(title, -5, -2)),
         timestampYear = year(as_datetime(timestamp)),
         yearsFromRelease = timestampYear - yearOfRelease) %>%
  select(-c(title, yearOfRelease, timestampYear))
# extract firstUserRating
dfFirstUserRating <- edx2 %>% group_by(userId) %>%
  select(userId, timestamp) %>%
  summarize(firstUserRating = min(timestamp))
edx2 <- left_join(edx2, dfFirstUserRating, by = "userId")
# extract firstMovieRating
dfFirstMovieRating <- edx2 %>% group_by(movieId) %>%
  select(movieId, timestamp) %>%
  summarize(firstMovieRating = min(timestamp))
edx2 <- left_join(edx2, dfFirstMovieRating, by = "movieId")
# extract daysFromFirstUserRating and daysFromFirstMovieRating
edx2 <- edx2 %>%
  mutate(daysFromFirstUserRating = daysBetweenTimestamps(timestamp, firstUserRating),
         daysFromFirstMovieRating = daysBetweenTimestamps(timestamp, firstMovieRating)) %>%
  select(-c(timestamp, firstUserRating, firstMovieRating))
# extract movie genres as predictors
genresNames <- strsplit(edx$genres, "|", fixed = TRUE) %>%
  unlist() %>%
  unique()
genresVector <- edx$genres
df <- sapply(genresNames, extractGenresNames) %>% as_tibble()
# remove hyphen from predictor names
colnames(df)[7] <- "SciFi"
colnames(df)[16] <- "FilmNoir"
colnames(df)[20] <- "NoGenre"
edx2 <- bind_cols(edx2, df)
head(edx2)
# clean memory
rm(df, edx)
```
#### Split the edX dataset and check for stratification
Next, the "edx" data set is split into trainset and testset. The resulting tables are then checked for correct stratification with the aid of a chart, which shows that each movie rating category, ranging over [0.5; 5.0], has also been split at approximately the same 80% trainset / 20% testset proportion.
```{r warning=FALSE}
# split train and test sets
set.seed(1, sample.kind = "Rounding")
test_index <- createDataPartition(edx2$rating,
                                  times = 1,
                                  p = proportionTestSet,
                                  list = FALSE)
test_set <- edx2 %>% slice(test_index)
train_set <- edx2 %>% slice(-test_index)
# check for stratification of train / test split
p1 <- train_set %>%
  group_by(rating) %>%
  summarize(qty = n()) %>%
  mutate(split = 'train_set')
p2 <- test_set %>%
  group_by(rating) %>%
  summarize(qty = n()) %>%
  mutate(split = 'test_set')
p <- bind_rows(p1, p2) %>% group_by(split)
p %>% ggplot(aes(rating, qty, fill = split)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Stratification of Testset / Trainset split")
```
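The same check can also be done numerically rather than visually. The following is a base-R sketch on hypothetical ratings; it samples 20% of the rows within each rating level, which is a simplified stand-in for the split and not a description of how `createDataPartition()` works internally. The block is illustrative and is not evaluated as part of the report.

```r
set.seed(1)

# hypothetical ratings: 1000 draws from the ten half-star levels
toy_rating <- sample(seq(0.5, 5, by = 0.5), size = 1000, replace = TRUE)

# stratified split: pick 20% of the row indices inside each rating level
toy_groups <- split(seq_along(toy_rating), toy_rating)
toy_test_index <- unlist(lapply(toy_groups, function(idx) {
  idx[sample.int(length(idx), size = round(0.2 * length(idx)))]
}))

# share of each rating level that landed in the test set
toy_in_test <- seq_along(toy_rating) %in% toy_test_index
toy_share <- tapply(toy_in_test, toy_rating, mean)
```

If the split is well stratified, every entry of `toy_share` sits close to 0.20, which is the numeric counterpart of bars of proportional height in the chart.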
#### Pre-processing of trainset
The trainset is now pre-processed for dimensionality reduction by eliminating the predictors with near-zero variance. This step is important because it reduces the computational workload of the neural network processing steps.
```{r warning=FALSE}
# remove movies and users from testset that are not present on trainset
test_set <- test_set %>%
  semi_join(train_set, by = "movieId") %>%
  semi_join(train_set, by = "userId")
# remove predictors with small variance
nzv <- train_set %>%
  select(-rating) %>%
  nearZeroVar(foreach = TRUE, allowParallel = TRUE)
# nzv holds column positions among the predictors only (rating excluded),
# so the names must be looked up in that same set of columns
predictorNames <- train_set %>% select(-rating) %>% colnames()
removedPredictors <- predictorNames[nzv]
removedPredictors
train_set <- train_set %>% select(-all_of(removedPredictors))
test_set <- test_set %>% select(-all_of(removedPredictors))
# cleanup memory
rm(edx2, test_index)
rm(p, p1, p2)
```
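To make the near-zero-variance filter more concrete, the sketch below computes the two quantities that `nearZeroVar()` reports, the frequency ratio and the percentage of unique values, on two hypothetical genre flags. The helper names are invented for illustration, and the thresholds are assumed to match caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`). The block is illustrative and is not evaluated as part of the report.

```r
# hypothetical helper, not part of caret
toy_nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # ratio of the most common value's count to the second most common value's count
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # distinct values as a percentage of the number of rows
  percent_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = percent_unique)
}

# a predictor is flagged when it is both dominated by one value and has few distinct values
toy_is_nzv <- function(m) {
  unname(m["freqRatio"] > 95 / 5 && m["percentUnique"] < 10)
}

toy_rare   <- c(rep(0, 990), rep(1, 10))   # genre flag set on 1% of the rows
toy_common <- c(rep(0, 600), rep(1, 400))  # genre flag set on 40% of the rows

toy_m_rare   <- toy_nzv_metrics(toy_rare)
toy_m_common <- toy_nzv_metrics(toy_common)
```

Under these assumptions the rare flag is dropped (frequency ratio 99) while the common one is kept (frequency ratio 1.5), which is the behaviour expected for sparse genre columns.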
#### First model: Naive Average
Next, the naive average model is built as a baseline for the project: any further model is expected, as a minimum, to deliver better performance than this one.
```{r warning=FALSE}
# predict by global average
mu <- mean(train_set$rating)
predicted <- mu
# use the error function defined in the setup chunk
err <- errRMSE(test_set$rating, predicted)
rmse_results <- tibble(model = "naiveAverage",
                       error = err)
rmse_results
```
#### Dataset preparation for neural network processing
The code below pre-processes the data for the forthcoming neural network. The following step is taken:
* A "movie bias" index and a "user bias" index (the mean rating of each movie and of each user in the trainset) are extracted, then merged into the train/test sets. Such features might allow the model to capture, for example, the "taste" of users for blockbusters.
```{r warning=FALSE}
# add movie bias effect
dfBiasMovie <- train_set %>%
  select(rating, movieId) %>%
  group_by(movieId) %>%
  summarize(biasMovie = mean(rating))
head(dfBiasMovie)
# add user bias effect
dfBiasUser <- train_set %>%
  select(rating, userId) %>%
  group_by(userId) %>%
  summarize(biasUser = mean(rating))
head(dfBiasUser)
df_train <- train_set %>%
  left_join(dfBiasMovie, by = "movieId") %>%
  left_join(dfBiasUser, by = "userId") %>%
  as_tibble()
df_test <- test_set %>%
  left_join(dfBiasMovie, by = "movieId") %>%
  left_join(dfBiasUser, by = "userId") %>%
  as_tibble()
head(df_train)
# clean memory
rm(train_set, test_set)
```
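The effect of these joins is easiest to see on a toy example. The sketch below uses base R `aggregate()` and `merge()` instead of dplyr, with hypothetical `toy_` data, and shows what happens to a row whose user never appears in the training data. It is illustrative only and is not evaluated as part of the report.

```r
toy_train <- data.frame(userId  = c(1, 1, 2, 2, 3),
                        movieId = c(10, 20, 10, 30, 20),
                        rating  = c(5, 3, 4, 2, 4))
# user 4 does not appear in toy_train
toy_new <- data.frame(userId = c(1, 4), movieId = c(30, 10), rating = c(3, 5))

# mean rating per movie and per user, computed on the training rows only
toy_bias_movie <- aggregate(rating ~ movieId, data = toy_train, FUN = mean)
names(toy_bias_movie)[2] <- "biasMovie"
toy_bias_user <- aggregate(rating ~ userId, data = toy_train, FUN = mean)
names(toy_bias_user)[2] <- "biasUser"

# left joins keep every row of toy_new; keys missing from training give NA
toy_feat <- merge(toy_new, toy_bias_movie, by = "movieId", all.x = TRUE)
toy_feat <- merge(toy_feat, toy_bias_user, by = "userId", all.x = TRUE)
```

The row for user 4 receives a movie bias of 4.5 but an `NA` user bias. Computing the indexes on the trainset only keeps information from held-out ratings out of the predictors, at the cost of such missing values for unseen users or movies.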
#### Second model: Neural Network
The code presented next takes the necessary steps to configure, compile, train and test a neural network model. This student has chosen this route as a novel approach, in the sense that it was not explored at all in classes. An excellent online lecture about the theory behind neural networks can be found here:
* https://youtu.be/Ih5Mr93E-2c
The baseline package used for training the model is "Keras", which is a wrapper for "TensorFlow". The technical references for programming with the package are found at the following links:
* https://cran.r-project.org/web/packages/keras/vignettes/index.html
* https://tensorflow.rstudio.com/tutorials/beginners/basic-ml/tutorial_basic_regression/
* https://datascience.stackexchange.com/questions/57171/how-to-improve-low-accuracy-keras-model-design/57292
The Keras package offers a variety of activation functions for its neurons. However, as training may consume a significant amount of computing resources, this project focuses only on the "ReLU" activation function and does not conduct a grid search over several activation functions.
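For readers unfamiliar with ReLU (rectified linear unit), the function simply clips negative inputs to zero. The base-R sketch below also shows a single dense unit, where the weights and bias are hypothetical numbers chosen for illustration; it is not evaluated as part of the report.

```r
# ReLU: element-wise maximum of zero and the input
relu <- function(x) pmax(0, x)

toy_z <- c(-2, -0.5, 0, 1.5)
toy_a <- relu(toy_z)            # negatives are clipped to zero

# one dense unit: weighted sum of the inputs plus a bias, then ReLU
toy_x <- c(1, 2)
toy_w <- c(0.5, -1)             # hypothetical weights
toy_b <- 0.25                   # hypothetical bias
toy_unit <- relu(sum(toy_w * toy_x) + toy_b)
```

Here the weighted sum is negative, so the unit outputs zero; stacking many such units in layers is what gives the network its non-linear behaviour.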
Please note that the code can take more than 2 hours to finish on a fairly typical Intel Core i7 machine with a GPU. At the end, the validation result for each epoch is presented on a chart:
```{r warning=FALSE}
# scale predictors
spec <- feature_spec(df_train, rating ~ . ) %>%
  step_numeric_column(all_numeric(), normalizer_fn = scaler_standard()) %>%
  fit()
spec
# wrap the model in a function
build_model <- function() {
  # create model
  input <- layer_input_from_dataset(df_train %>% select(-c(rating)))
  output <- input %>%
    layer_dense_features(dense_features(spec)) %>%
    layer_dense(units = 32, activation = "relu") %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = 8, activation = "relu") %>%
    layer_dense(units = 8, activation = "relu") %>%
    layer_dense(units = 1)
  model <- keras_model(input, output)
  summary(model)
  # compile model
  model %>%
    compile(
      loss = "mse",
      optimizer = optimizer_rmsprop(),
      metrics = list("mean_absolute_error")
    )
  model
}
# train the model
print_dot_callback <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 80 == 0) cat("\n")
    cat(".")
  }
)
early_stop <- callback_early_stopping(monitor = "val_loss",
                                      min_delta = 1e-5,
                                      patience = 5,
                                      mode = "min",
                                      restore_best_weights = TRUE)
model <- build_model()
history <- model %>% fit(
  x = df_train %>% select(-c(rating)),
  y = df_train$rating,
  epochs = numberOfEpochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(early_stop, print_dot_callback)
)
plot(history)
```
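The early-stopping rule configured above (`patience = 5`, `min_delta = 1e-5`) can be understood with a small simulation. The helper below is a simplified base-R sketch of the idea, not the Keras implementation, and the validation losses are hypothetical numbers. It is not evaluated as part of the report.

```r
# hypothetical helper: returns the (1-based) epoch at which training would stop
toy_early_stop_epoch <- function(val_loss, patience = 5, min_delta = 1e-5) {
  best <- Inf
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best - min_delta) {
      best <- val_loss[epoch]   # meaningful improvement: remember it, reset the counter
      wait <- 0
    } else {
      wait <- wait + 1          # no meaningful improvement in this epoch
      if (wait >= patience) return(epoch)
    }
  }
  length(val_loss)              # patience never ran out
}

toy_val_loss <- c(0.90, 0.80, 0.78, 0.79, 0.78, 0.785, 0.781, 0.79, 0.77)
toy_stop <- toy_early_stop_epoch(toy_val_loss)

# epoch whose weights would be kept when the best weights are restored
toy_best <- which.min(toy_val_loss[seq_len(toy_stop)])
```

In this toy run the loss stops improving after the third epoch, so training halts five epochs later and never reaches the ninth epoch, even though that epoch would have been better. This is the trade-off a small patience value makes in exchange for shorter training time.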
The prediction on the test set is now performed:
```{r warning=FALSE}
# predict
predicted <- model %>% predict(df_test %>% select(-c(rating)))
predicted <- predicted[ , 1]
# calculate error metrics
err <- errRMSE(df_test$rating, predicted)
# the model is a dense feed-forward network, so it is labelled accordingly
rmse_results <- bind_rows(rmse_results,
                          tibble(model = "neuralNetwork",
                                 error = err))
rmse_results
# clean memory
rm(df_train, df_test)
```
#### Validation: The Final Step
Now that we have a tested model, it is time to validate it. The "validation" dataset is loaded again, and predictions are made on it.
The validation set needs to be pre-processed before being used for predictions. This is accomplished in the same way as for the trainset/testset before:
```{r warning=FALSE}
# read dataset from csv
validation <- read_csv(file = "./dat/validation.csv") %>% as_tibble()
head(validation)
# prepare validation dataset
df_val <- validation %>%
  select(-c(rating))
df_val <- cbind(rating = validation$rating, df_val)
df_val <- df_val %>%
  select(-c(genres)) %>%
  # the year is read from each row's own title
  mutate(yearOfRelease = as.numeric(stringi::stri_sub(title, -5, -2)),
         timestampYear = year(as_datetime(timestamp)),
         yearsFromRelease = timestampYear - yearOfRelease) %>%
  select(-c(title, yearOfRelease, timestampYear)) %>%
  left_join(dfFirstUserRating, by = "userId") %>%
  left_join(dfFirstMovieRating, by = "movieId") %>%
  mutate(daysFromFirstUserRating = daysBetweenTimestamps(timestamp, firstUserRating),
         daysFromFirstMovieRating = daysBetweenTimestamps(timestamp, firstMovieRating)) %>%
  select(-c(timestamp, firstUserRating, firstMovieRating))
genresVector <- validation$genres
df <- sapply(genresNames, extractGenresNames) %>% as_tibble()
colnames(df)[7] <- "SciFi"
colnames(df)[16] <- "FilmNoir"
colnames(df)[20] <- "NoGenre"
df_val <- bind_cols(df_val, df)
df_val <- df_val %>%
  select(-all_of(removedPredictors)) %>%
  left_join(dfBiasMovie, by = "movieId") %>%
  left_join(dfBiasUser, by = "userId") %>%
  as_tibble()
head(df_val)
# predict
predicted <- model %>% predict(df_val %>% select(-c(rating)))
predicted <- predicted[ , 1]
validRows <- !is.na(predicted)
# calculate error metrics
err <- errRMSE(df_val$rating[validRows], predicted[validRows])
rmse_results <- bind_rows(rmse_results,
                          tibble(model = "neuralNetwork validation",
                                 error = err))
rmse_results
# clean memory
rm(df, df_val, validation, predicted, validRows)
```
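Because only the rows with a non-missing prediction enter the error calculation, it can be useful to look at how many rows were actually scored. The sketch below uses hypothetical `toy_` values; the coverage figure is a suggestion and is not part of the original analysis. It is not evaluated as part of the report.

```r
toy_rating    <- c(4, 3, 5, 2)
toy_predicted <- c(3.5, NA, 4.5, NA)   # NA where no prediction was produced

# keep only the rows that received a prediction
toy_valid <- !is.na(toy_predicted)

# fraction of rows that were scored
toy_coverage <- mean(toy_valid)

# RMSE over the scored rows only
toy_err <- sqrt(mean((toy_rating[toy_valid] - toy_predicted[toy_valid])^2))
```

In this toy case the RMSE of 0.5 describes only half of the rows, so reading the error together with the coverage gives a fairer picture of the result.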
### Conclusion
The student has demonstrated that he learned several skills from the classes that enabled him to pre-process data and to explore on his own the topic of "neural networks", which was not covered during the course, reaching an RMSE of around 0.880, a significant improvement over the naive average approach.
### Next steps / Future work
The student plans to study the following topics in more detail, in order to improve the result of this task and of all future tasks:
* KNN;
* PCA and SVM;
* Neural Networks and the Keras package;
* Ensembles.