---
title: "IODS"
author: "Rong Guang"
date: "22/11/2022"
output:
  html_document:
    fig_caption: yes
    theme: flatly
    highlight: haddock
    toc: yes
    toc_depth: 3
    toc_float: yes
    number_sections: no
  pdf_document:
    toc: yes
    toc_depth: '3'
subtitle: Chapter 4
bibliography: citations.bib
---
# **Chapter 4: Clustering and classification**
I could not come up with a better way to arrange everything than to follow the structure of the assignment requirements. I report the analysis with each main section corresponding to one requirement and each subsection to a component of that requirement, in order. I hope this also makes the rating process easier.
## 1 Assignment requirement #1
Quoted from the assignment:
_"Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’. Then include the file as a child file in your ‘index.Rmd’ file."_
### 1.1 Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’
Done!
### 1.2 Include the file as a child file in your ‘index.Rmd’ file
Done!
## 2 Assignment requirement #2
Quoted from the assignment:
_"Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. *(0-1 points)*"_
### 2.1 Load the Boston data from the MASS package.
```{r}
# access the MASS package
library(MASS)
# load the data
data("Boston")
# pass Boston to another object for easy typing
bos <- Boston
```
### 2.2 Explore the structure and the dimensions of the data
```{r}
library(tidyverse)
#explore structure
str(bos)
#explore dimensions
dim(bos)
#generate a codebook
# the variable descriptions below are copied from the dataset documentation
codebook <- data.frame(variable = "CRIM - per capita crime rate by town/ZN - proportion of residential land zoned for lots over 25,000 sq.ft./INDUS - proportion of non-retail business acres per town./CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)/NOX - nitric oxides concentration (parts per 10 million)/RM - average number of rooms per dwelling/AGE - proportion of owner-occupied units built prior to 1940/DIS - weighted distances to five Boston employment centres/RAD - index of accessibility to radial highways/TAX - full-value property-tax rate per $10,000/PTRATIO - pupil-teacher ratio by town/B - 1000(Bk-0.63)^2 where Bk is the proportion of blacks by town/LSTAT - % lower status of the population/MEDV - Median value of owner-occupied homes in $1000's")
codebook <- codebook %>%
  separate_rows(variable, sep = "/") %>% # "/" is the delimiter between rows
  separate(variable, sep = " - ",        # " - " separates name from description
           into = c("name", "description"), # names of the separated columns
           remove = T) # remove the old column
#check codebook
library(DT)
codebook %>% datatable
```
The dataset has 506 observations of 14 variables. The variable CHAS is a dummy variable where 1 means the town has a tract that bounds the Charles River and 0 means otherwise. It needs to be converted to a factor.
```{r}
bos <- bos %>%
  mutate(chas = chas %>%
           factor() %>%
           fct_recode(
             "With tracts that bound river" = "1", # old value 1 to new label
             "Otherwise" = "0")                    # old value 0 to new label
         )
```
### 2.3 Describe the dataset
Each of the 506 rows in the dataset describes a Boston suburb or town, and it has 14 columns with information such as the average number of rooms per dwelling, the pupil-teacher ratio, and the per capita crime rate. The last column gives the median value of owner-occupied homes.
## 3 Assignment requirement #3
Quoted from the assignment:
_"Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. *(0-2 points)*"_
### 3.1 Show a graphical overview of the data
```{r, fig.height = 8, fig.width = 14, fig.cap = "Visualized relations of Boston dataset, variable #1~#7"}
library(GGally)
library(ggplot2)
#define a function that allows me to fine-tune the matrix
my.fun <- function(data, mapping, method = "lm",...){ #define arguments
p <- ggplot(data = data, mapping = mapping) + #pass arguments
geom_point(size = 0.3,
color = "blue",...) + #define points size and color
geom_smooth(size = 0.5,
color = "red",
method = method) #define line size and color; define lm regression
p #print the results
}
#the abbreviated variable names are not self-explanatory, set column and row
#names to be the variable labels for better reading
#this new object will be used in ggpairs function
names1 <- pull(codebook[1:7,], description) # extract row 1:7 of var description
names1 <- sapply(names1, #collapse the description into multiple lines
function(x) paste(strwrap(x, 35), # for better reading
collapse = "\n")) # "\n" calls for a new line
ggpairs(bos,
        lower = list(
          continuous = my.fun,
          combo = wrap("facethist", bins = 20)),
        columns = 1:7,
        columnLabels = names1) #define column labels as the names I just set
```
Note that the crime-rate variable is plagued with outliers.
```{r, fig.height = 8, fig.width = 14,fig.cap = "Visualized relations of Boston dataset, variable #8~#14"}
#repeat what is done in the last chunk for variable 8~14
names2 <- pull(codebook[8:14,], description)
names2 <- sapply(names2, function(x) paste(strwrap(x, 35), collapse = "\n"))
ggpairs(bos,
        lower = list(
          continuous = my.fun),
        columns = 8:14,
        columnLabels = names2)
```
### 3.2 Show summaries of the variables in the data.
```{r}
library(finalfit)
#summarize the continuous data
ff_glimpse(bos)$Continuous %>% datatable
```
```{r}
# summarize the categorical data
ff_glimpse(bos)$Categorical %>% datatable
```
### 3.3 Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
#### 3.3.1 Interpreting the continuous variables
There are 13 continuous variables in the dataset. Their central tendencies and spreads are:

* per capita crime rate: 0.3 (0.1~3.7)%
* proportion of residential land zoned for lots over 25,000 sq.ft.: 0 (0~12.5)%
* proportion of non-retail business acres per town: 9.7 (5.2~18.1)%
* nitric oxides concentration: 0.5 (0.4~0.6) parts per 10 million
* average number of rooms per dwelling: 6.3 ± 0.7
* proportion of owner-occupied units built prior to 1940: 77.5 (45.0~94.1)%
* weighted distance to five Boston employment centres: 3.2 (2.1~5.2) kilometers
* index of accessibility to radial highways: 5 (4~24)
* full-value property-tax rate per \$10,000: \$330 (279~666)
* pupil-teacher ratio by town: 19.1 (17.4~20.2)
* 1000(Bk-0.63)^2, where Bk is the proportion of Black residents by town: 391.4 (375.4~396.2)
* proportion of lower-status population: 11.4 (6.9~17.0)%
* median value of owner-occupied homes: \$21.2 (17~25) × 1000
#### 3.3.2 Interpreting the categorical variable
35 (6.9%) of the towns have tracts that bound the Charles River.
#### 3.3.3 Commenting on the relationships between variables
Except for the binary variable about tracts bounding the river, each variable in the dataset has a correlation stronger than 0.3 in absolute value with at least one other variable, and some correlations are as high as 0.9. All of the correlation coefficients shown in the matrices are significant (_p_ < 0.001).
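As a quick numeric check of this claim (an optional sketch, not part of the assignment), the strongest absolute correlation of each variable with any other variable can be computed directly:
```{r}
# optional sketch: for each continuous variable, the strongest absolute
# correlation it has with any other variable (the binary chas, column 4,
# is dropped)
cors <- cor(Boston[, -4])
diag(cors) <- NA
round(apply(abs(cors), 1, max, na.rm = TRUE), 2)
```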
## 4 Assignment requirement #4
Quoted from the assignment:
_"Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. *(0-2 points)*"_
### 4.1 Standardize the dataset and print out summaries of the scaled data
```{r}
library(MASS)
#a binary variable coded as 0/1 does not affect the standardization or the
#clustering, so I reload Boston without re-coding the binary variable as a
#factor; keeping it numeric also keeps the matrix multiplication later on
#simple
bos <- Boston
bos.s <- as.data.frame(scale(bos))# bos.s means Boston Scaled
```
### 4.2 How did the variables change?
```{r}
ff_glimpse(bos.s)$Con %>% datatable
```
After scaling, all variables have a mean of 0 and most values fall between -4 and 4. The only exception is crim (crime rate), which is probably due to outliers (consistent with the finding from the correlation matrix).
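For reference, `scale()` standardizes each column by subtracting its mean and dividing by its standard deviation:

$$x_{scaled} = \frac{x - \bar{x}}{s_x}$$

where $\bar{x}$ and $s_x$ are the column mean and standard deviation.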
### 4.3 Use the quantiles as the break points in the categorical variable and drop the old crime rate variable from the dataset.
```{r}
#generate cutoff according to quantile
bins <- quantile(bos.s$crim)
#generate a categorical variable "crime" and re-code it
bos.s <- bos.s %>%
mutate(crime = crim %>%
cut(breaks = bins, include.lowest = TRUE) %>%
fct_recode("Low" = "[-0.419,-0.411]",
"MediumLow" = "(-0.411,-0.39]",
"MediumHigh" = "(-0.39,0.00739]",
"High" = "(0.00739,9.92]"))
#remove crim
bos.s <- bos.s %>% select(-crim)
```
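The `fct_recode()` step above depends on the exact interval labels that `cut()` happens to print. A slightly more robust alternative (just a sketch, not evaluated here) would pass the labels to `cut()` directly:
```{r, eval = FALSE}
# sketch only (eval = FALSE): label the quantile-based bins directly, so the
# code does not depend on how cut() prints the interval boundaries
bos.s$crime <- cut(bos.s$crim, breaks = bins, include.lowest = TRUE,
                   labels = c("Low", "MediumLow", "MediumHigh", "High"))
```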
### 4.4 Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
```{r}
set.seed(2022)
#generate an object containing the number of observations in bos dataset
n <- nrow(bos.s)
#generate an object "ind", which contains a random selected set of the indexing
#of bos dataset, and the number of indexing takes up 80% of number of observations
ind <- sample(1:n, size = n*0.8)
#generate train&test sets according to the random set of indexing number
train <- bos.s[ind,]
test <- bos.s[-ind,]
```
## 5 Assignment requirement #5
Quoted from the assignment:
_"Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot *(0-3 points)*"_
### 5.1 Fit the linear discriminant analysis on the train set (Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables)
```{r}
# fit a linear discriminant model on the train set and name it "lda.fit"
lda.fit <- lda(crime ~ ., data = train)
```
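Optionally (not required by the assignment), the share of the between-group variance captured by each discriminant can be checked from the fitted object; a minimal sketch:
```{r}
# sketch: the "proportion of trace" of each linear discriminant, i.e. the
# share of the between-class variance it explains
round(lda.fit$svd^2 / sum(lda.fit$svd^2), 3)
```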
### 5.2 Draw the LDA (bi)plot
```{r}
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(factor(train$crime))
```
```{r, fig.cap="Fig 5.2 Biplot for LDA for clustering crime rate"}
#plot the lda results
plot(lda.fit, dimen = 2, pch = classes, col = classes)
lda.arrows(lda.fit, myscale = 4)
```
A biplot based on LD1 and LD2 was generated, see Fig 5.2. Most of the four clusters separate poorly; only the cluster "High" stands clearly apart, and heavy overlap is observed between every pair of the other clusters. Clusters High and MediumHigh also show some notable overlap.
Based on the arrows, the variable lstat contributes the most to separating cluster High. The contributions of the variables to the other clusters are not clear due to the heavy overlap.
## 6 Assignment requirement #6
Quoted from the assignment:
_"Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. *(0-3 points)*"_
### 6.1 Save the crime categories from the test set and then remove the categorical crime variable from the test dataset
```{r}
#save crime into an object
classes.test <- test$crime
#remove crime
test$crime <- NULL
```
### 6.2 Predict the classes with the LDA model on the test data
```{r}
predicted.test <- predict(lda.fit, test)
```
### 6.3 Cross tabulate the results with the crime categories from the test set.
```{r}
#generate a table that evaluates the accuracy of the model and pass it into
#an object named "accuracy.tab"
accuracy.tab <- table(correct = classes.test, predicted = predicted.test$class )
#show the accuracy table
accuracy.tab
#ask R to identify the correct predictions and add them up
correct.n = 0 # the number of correct predictions starting at 0
for (i in 1:4){ #4 loops because we have 4 rows/columns
correct.c <- accuracy.tab[
which(rownames(accuracy.tab) == colnames(accuracy.tab)[i]),
i] # if a cell has same row and column names, pass its value into "correct.c"
correct.n = correct.c+ correct.n # update the value of correct prediction
} # by adding "correct.c"
# calculate the proportion of correct predictions in the test set
correct.n/nrow(test) #denominator is the number of observations in the test set
```
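For reference, the same proportion can be obtained more directly (a small sketch, assuming the row and column levels of the table are in the same order, which they are here because both come from the same factor):
```{r}
# equivalent shortcuts for the classification accuracy
sum(diag(accuracy.tab)) / sum(accuracy.tab)
mean(predicted.test$class == classes.test)
```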
### 6.4 Comment on the results
Overall, 66.2% of the predictions are correct, which is not a very satisfactory performance for our linear discriminant analysis. Looking at the results more closely, the model predicts the high and medium-high crime-rate regions best, with 90% (47/52) accuracy. For the low and medium-low regions the predictive performance drops sharply. Possible explanations are: *a.* violation of the assumption of multivariate normality (although there is evidence that LDA can remain accurate even when this assumption is violated); *b.* the large number of outliers in the dependent variable before re-coding (LDA is sensitive to outliers); *c.* the small size of the Low category in the test set; *d.* the need for a better categorization strategy for the dependent variable (the current categorization is based only on quantiles and lacks a more evidence-based foundation).
## 7 Assignment requirement #7
Quoted from the assignment:
_"Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. *(0-4 points)*"_
### 7.1 Reload the Boston dataset and standardize the dataset
```{r}
#reload Boston
data("Boston")
bos <- Boston
#standardize the dataset
bos.s <- as.data.frame(scale(bos))
```
### 7.2 Calculate the distances between the observations
```{r}
dis_eu <- dist(bos.s)
summary(dis_eu)
```
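By default, `dist()` computes the Euclidean distance between every pair of (scaled) observations:

$$d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$$

where $p = 14$ is the number of scaled variables.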
### 7.3 Run k-means algorithm on the dataset
```{r}
bos.s.km <- kmeans(bos.s, centers = 4)
```
### 7.4 Investigate what is the optimal number of clusters
```{r}
date()
set.seed(22) #22 is the date I carried out the analysis
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bos.s, k)$tot.withinss})
```
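For reference, the quantity collected here (`tot.withinss`) is the total within-cluster sum of squares,

$$TWCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,$$

where $\mu_k$ is the centroid of cluster $C_k$. It can only decrease as $K$ grows, so we look for the point where the decrease levels off.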
```{r, fig.cap="Fig 7.4 Elbow plot for trends of within-cluster sum-of-square with increasing number of k"}
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') +
  geom_vline(xintercept = 3, color = "red") + # mark the candidate number of clusters
  annotate("text",
           label = "←Elbow effect happens here",
           x = 4.8, y = 3800, color = "red")
```
There is a large reduction in the within-cluster variation up to K = 3, but after that the variation does not decrease as quickly. I will therefore use K = 3 and run the k-means clustering again.
### 7.5 Run the algorithm again
```{r}
km <- kmeans(bos.s, centers = 3)
```
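Before visualizing, a quick optional glance at the solution (a small sketch, not required by the assignment):
```{r}
# sketch: cluster sizes and the share of total variance that is between clusters
km$size
round(km$betweenss / km$totss, 3)
```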
### 7.6 Visualize the clusters
```{r, fig.height = 8, fig.width = 14, fig.cap="Fig 7.6.1 Correlation Matrix with clusters", warning = F, message = F}
#define a function that allows me to fine-tune the matrix
my.fun.km <- function(data, mapping,...){ #define arguments
p <- ggplot(data = data, mapping = mapping) + #pass arguments
geom_point(size = 0.3,
color = factor(km$cluster),
...) + #define points size and color
stat_ellipse(geom = "polygon", mapping = mapping, alpha = 0.5)
#calculate an ellipse layer that separate clusters
p #print the results
}
ggpairs(bos, mapping = aes(fill = factor(km$cluster)),
        lower = list(
          continuous = my.fun.km
        ),
        columns = 1:7)
```
```{r, fig.height = 8, fig.width = 14, fig.cap="Fig 7.6.2 Correlation Matrix with clusters", warning = F, message = F}
ggpairs(bos, mapping = aes(fill = factor(km$cluster)),
        lower = list(
          continuous = my.fun.km
        ),
        columns = 8:14)
```
### 7.7 Interpret the results
By observing the elbow plot, which depicts how the within-cluster sum of squares changes with the number of clusters k, an optimal k = 3 was determined. The resulting k-means clusters were then visualized in correlation matrices of the variables in the dataset. Some variables clearly contribute a great deal to the clustering. For example, the variable black separates one cluster from the other two quite cleanly (see the 5th column of Fig 7.6.2, x axis), and the variable age roughly separates a different cluster from the other two (see the 7th row of Fig 7.6.1, y axis). Some pairs of variables also play an important role in the clustering: for example, the combination of black and dis roughly separates all three clusters (see Fig 7.6.2, 1st column, 5th row, or Fig 7.6.3 below). Because more dimensions cannot be shown on a two-dimensional screen, I cannot visually inspect the clustering effect of larger combinations of variables; fortunately, k-means clustering has already done that mathematically.
```{r, fig.cap= "the clustering effect of variables black and dis combined"}
bos %>% ggplot(aes(x = dis, y = black, color = factor(km$cluster))) +
geom_point() +
geom_abline(intercept = 480, slope = -25)+
geom_abline(intercept = 400, slope = -80) +
stat_ellipse(geom = "polygon",
aes(fill = km$cluster),
alpha = 0.25)
```
## 8 Assignment requirement #8
Quoted from the assignment:
_"Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? *(0-2 points to compensate any loss of points from the above exercises)"*_
### 8.1 Perform k-means on the original Boston data with some reasonable number of clusters (> 2)(Remember to standardize the dataset)
```{r}
# k = 3 is the optimal clusters found
km <- kmeans(scale(Boston), centers = 3)
```
### 8.2 Perform LDA using the clusters as target classes.
```{r}
#reload and standardize data
bos.s <- as.data.frame(scale(Boston))
#save the clusters identified by k-means clustering as a column in the data set
bos.s$km.cluster <- km$cluster
lda.km <- lda(km.cluster ~ ., data = bos.s)
```
### 8.3 Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution)
```{r, fig.cap="Fig. 8.3 Biplot of the LDA for separating clusters identified by K means distance"}
# target classes as numeric
classes <- as.numeric(factor(bos.s$km.cluster))
#plot the lda results as biplot
plot(lda.km, dimen = 2, pch = classes, col = classes)
lda.arrows(lda.km, myscale = 4)
```
### 8.4 Interpret the results
A biplot based on LD1 and LD2 was generated, see Fig 8.3. The three clusters separate quite clearly: some overlap is observed between clusters 1 and 3 and between clusters 2 and 3, while clusters 1 and 2 are perfectly separated.
Based on the arrows, the variables rm, dis and crim contribute most to cluster 1; the variables indus, rad, tax and nox contribute most to cluster 2; and the variables black, chas and ptratio contribute most to cluster 3. The other variables' roles in the separation are much weaker.
## 9 Assignment requirement #9
Quoted from the assignment:
_"Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Install and access the plotly package. Create a 3D plot of the columns of the matrix product using the given code. Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? *(0-3 points to compensate any loss of points from the above exercises)*"_
### 9.1 Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
```{r}
#reload the data
bos.s <- as.data.frame(scale(Boston))
#generate cutoff according to quantile
bins <- quantile(bos.s$crim)
#generate a categorical variable "crime" and re-code it
bos.s <- bos.s %>%
mutate(crime = crim %>%
cut(breaks = bins, include.lowest = TRUE) %>%
fct_recode("Low" = "[-0.419,-0.411]",
"MediumLow" = "(-0.411,-0.39]",
"MediumHigh" = "(-0.39,0.00739]",
"High" = "(0.00739,9.92]"))
#remove crim
bos.s <- bos.s %>% select(-crim)
set.seed(2022)
#generate an object containing the number of observations
n <- nrow(bos.s)
#generate a random set of indexing number with n = 80% of the obs.
ind <- sample(1:n, size = n*0.8)
#generate the train set and test
train <- bos.s[ind,]
test <- bos.s[-ind,]
```
### 9.2 Install and access the plotly package. Create a 3D plot of the columns of the matrix product using the given code. Adjust the code: add color as an argument in the plot_ly() function. Set the color to be the crime classes of the train set.
```{r}
#select predictors for train set, with outcome variable removed
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication, saving the resulting matrix into matrix_product
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#turn matrix into data frame
matrix_product <- as.data.frame(matrix_product)
#plot the 3D plot with LD1, LD2 and LD3
library(plotly)
p1 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = train$crime, #Set the color to be the crime classes of the train set.
size = 2)
p1
```
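In other words, the matrix product above projects each scaled training observation onto the discriminant directions:

$$Z = X_{train} W,$$

where $X_{train}$ is the $404 \times 13$ matrix of predictors and $W$ = `lda.fit$scaling` is the $13 \times 3$ matrix of discriminant coefficients, so each row of $Z$ contains the LD1, LD2 and LD3 coordinates of one observation.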
### 9.3 Draw another 3D plot where the color is defined by the clusters of the k-means.
```{r}
#select predictors for train set, with outcome variable removed
model_predictors <- dplyr::select(train, -crime)
#get the clusters of k-means for the train set
train.km <- kmeans(model_predictors, centers = 3)
p2 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = factor(train.km$cluster), #color defined by clusters of the k-means
size = 1.5)
p2
```
### 9.4 How do the plots differ? Are there any similarities?
The LDA was trained on a categorization of crime rates based purely on quantiles, which has 4 categories, whereas k = 3 was adopted for the k-means clustering based on the within-cluster sum of squares. Since LDA is a supervised technique, we know what each category represents, and the categories are labelled in the plot. K-means clustering is an unsupervised method, so before looking closely I do not know what the three identified clusters represent in the real world.
However, looking at the two plots together, it is interesting to find that cluster 3 from k-means overlaps nicely with the High category from LDA, and cluster 2 from k-means roughly overlaps with the Low and MediumLow categories. I will therefore re-code the LDA categories accordingly and check more closely how consistent the k-means and LDA results are.
#### 9.4.1 Recode categories from LDA into High, Medium and Low (old Low + Medium Low)
```{r}
train.crime3 <- train %>%
mutate(crime3 = crime %>%
fct_recode("Medium" = "MediumHigh" ,
"Low" = "MediumLow",
"High" = "High",
"Low" = "Low"))
km.cluster <- factor(train.km$cluster)
levels(km.cluster) <- c("Medium","Low","High")
```
#### 9.4.2 Check the accuracy table
```{r}
accuracy.tab <- table(correct = train.crime3$crime3, kmean.pred = km.cluster)
accuracy.tab
```
#### 9.4.3 Calculate the accuracy rate
```{r}
# loop through the 3x3 matrix and add up the cells whose row and column names match
correct.n = 0
for (i in 1:3){
correct.c <- accuracy.tab[which(rownames(accuracy.tab) == colnames(accuracy.tab)[i]), i]
correct.n = correct.c+ correct.n
}
# calculate the accuracy rate; the denominator is the number of observations
# in the train set
correct.n/nrow(train.crime3)
```
This gives an accuracy rate of 71.3%, clearly outperforming the original LDA model and indicating that k-means clusters could be used as a cue for categorizing continuous variables.