---
title: "IODS"
author: "Rong Guang"
date: "22/11/2022"
output:
  html_document:
    fig_caption: yes
    theme: flatly
    highlight: haddock
    toc: yes
    toc_depth: 3
    toc_float: yes
    number_sections: no
  pdf_document:
    toc: yes
    toc_depth: '3'
subtitle: Chapter 4
bibliography: citations.bib
---
# **Chapter 4: Clustering and classification**
I could not come up with a better way to arrange everything than to follow the structure of the assignment requirements. I report the analysis with each main section corresponding to one requirement and each subsection to a component of that requirement, in order. I hope this also makes the rating process easier.
## 1 Assignment requirement #1
Quoted from the assignment:
_"Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’. Then include the file as a child file in your ‘index.Rmd’ file."_
### 1.1 Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’
Done!
### 1.2 Include the file as a child file in your ‘index.Rmd’ file
Done!
## 2 Assignment requirement #2
Quoted from the assignment:
_"Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. *(0-1 points)*"_
### 2.1 Load the Boston data from the MASS package.
```{r}
# access the MASS package
library(MASS)
# load the data
data("Boston")
# pass Boston to another object for easy typing
bos <- Boston
```
### 2.2 Explore the structure and the dimensions of the data
```{r}
library(tidyverse)
#explore structure
str(bos)
#explore dimensions
dim(bos)
#generate a codebook
# the variable descriptions below are copied from the dataset documentation
codebook <- data.frame(variable = "CRIM - per capita crime rate by town/ZN - proportion of residential land zoned for lots over 25,000 sq.ft./INDUS - proportion of non-retail business acres per town./CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)/NOX - nitric oxides concentration (parts per 10 million)/RM - average number of rooms per dwelling/AGE - proportion of owner-occupied units built prior to 1940/DIS - weighted distances to five Boston employment centres/RAD - index of accessibility to radial highways/TAX - full-value property-tax rate per $10,000/PTRATIO - pupil-teacher ratio by town/B - 1000(Bk-0.63)^2 where Bk is the proportion of blacks by town/LSTAT - % lower status of the population/MEDV - Median value of owner-occupied homes in $1000's")
codebook <- codebook %>%
  separate_rows(variable, sep = "/") %>% # "/" is the delimiter between rows
  separate(variable, sep = " - ",        # " - " separates name from description
           into = c("name", "description"), # names of the separated columns
           remove = T) # remove the old column
#check codebook
library(DT)
codebook %>% datatable
```
The dataset has 506 observations of 14 variables. The variable CHAS is a dummy variable where 1 means the town has a tract that bounds the Charles River and 0 means otherwise. It needs to be converted to a factor.
```{r}
bos <- bos %>%
  mutate(chas = chas %>%
           factor() %>%
           fct_recode(
             "With tracts that bound river" = "1", # old value 1 to new label
             "Otherwise" = "0")                    # old value 0 to new label
         )
```
### 2.3 Describe the dataset
Each of the 506 rows in the dataset describes a Boston suburb or town, and it has 14 columns with information such as the average number of rooms per dwelling, the pupil-teacher ratio, and the per capita crime rate. The last column gives the median value of owner-occupied homes.
## 3 Assignment requirement #3
Quoted from the assignment:
_"Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. *(0-2 points)*"_
### 3.1 Show a graphical overview of the data
```{r, fig.height = 8, fig.width = 14, fig.cap = "Visualized relations of Boston dataset, variable #1~#7"}
library(GGally)
library(ggplot2)
#define a function that allows me to fine-tune the matrix
my.fun <- function(data, mapping, method = "lm",...){ #define arguments
p <- ggplot(data = data, mapping = mapping) + #pass arguments
geom_point(size = 0.3,
color = "blue",...) + #define points size and color
geom_smooth(size = 0.5,
color = "red",
method = method) #define line size and color; define lm regression
p #print the results
}
#the abbreviated variable names are not self-explanatory, set column and row
#names to be the variable labels for better reading
#this new object will be used in ggpairs function
names1 <- pull(codebook[1:7,], description) # extract row 1:7 of var description
names1 <- sapply(names1, #collapse the description into multiple lines
function(x) paste(strwrap(x, 35), # for better reading
collapse = "\n")) # "\n" calls for a new line
ggpairs(bos,
        lower = list(
          continuous = my.fun,
          combo = wrap("facethist", bins = 20)),
        columns = 1:7,
        columnLabels = names1) #define column labels as the names I just set
```
Note that the crime-rate variable is plagued with outliers.
```{r, fig.height = 8, fig.width = 14,fig.cap = "Visualized relations of Boston dataset, variable #8~#14"}
#repeat what is done in the last chunk for variable 8~14
names2 <- pull(codebook[8:14,], description)
names2 <- sapply(names2, function(x) paste(strwrap(x, 35), collapse = "\n"))
ggpairs(bos,
        lower = list(
          continuous = my.fun),
        columns = 8:14,
        columnLabels = names2)
```
### 3.2 Show summaries of the variables in the data.
```{r}
library(finalfit)
#summarize the continuous data
ff_glimpse(bos)$Continuous %>% datatable
```
```{r}
# summarize the categorical data
ff_glimpse(bos)$Categorical %>% datatable
```
### 3.3 Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
#### 3.3.1 Interpreting the continuous variables
There are 13 continuous variables in the dataset. Their central tendencies and spreads are:

* per capita crime rate: 0.3 (0.1~3.7)%
* proportion of residential land zoned for lots over 25,000 sq.ft.: 0 (0~12.5)%
* proportion of non-retail business acres per town: 9.7 (5.2~18.1)%
* nitric oxides concentration: 0.5 (0.4~0.6) parts per 10 million
* average number of rooms per dwelling: 6.3 ± 0.7
* proportion of owner-occupied units built prior to 1940: 77.5 (45.0~94.1)%
* weighted distance to five Boston employment centres: 3.2 (2.1~5.2) kilometers
* index of accessibility to radial highways: 5 (4~24)
* full-value property-tax rate per \$10,000: \$330 (279~666)
* pupil-teacher ratio by town: 19.1 (17.4~20.2)
* 1000(Bk-0.63)^2, where Bk is the proportion of Black residents by town: 391.4 (375.4~396.2)
* proportion of lower-status population: 11.4 (6.9~17.0)%
* median value of owner-occupied homes: \$21.2 (17~25) × 1000
#### 3.3.2 Interpreting the categorical variable
35 (6.9%) of the towns have tracts that bound the Charles River.
#### 3.3.3 Commenting on the relationships between variables
Except for the binary variable about tracts bounding the river, each variable in the dataset has a correlation stronger than 0.3 in absolute value with at least one other variable, and some correlations are as high as 0.9. All of the correlation coefficients shown in the matrices are significant (_p_ < 0.001).
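As a quick numeric check of this claim (an optional sketch, not part of the assignment), the strongest absolute correlation of each variable with any other variable can be computed directly:
```{r}
# optional sketch: for each continuous variable, the strongest absolute
# correlation it has with any other variable (the binary chas, column 4,
# is dropped)
cors <- cor(Boston[, -4])
diag(cors) <- NA
round(apply(abs(cors), 1, max, na.rm = TRUE), 2)
```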
## 4 Assignment requirement #4
Quoted from the assignment:
_"Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. *(0-2 points)*"_
### 4.1 Standardize the dataset and print out summaries of the scaled data
```{r}
library(MASS)
#a binary variable coded as 0/1 does not affect the standardization or the
#clustering, so I reload Boston without re-coding the binary variable as a
#factor; keeping it numeric also keeps the matrix multiplication later on
#simple
bos <- Boston
bos.s <- as.data.frame(scale(bos))# bos.s means Boston Scaled
```
### 4.2 How did the variables change?
```{r}
ff_glimpse(bos.s)$Con %>% datatable
```
After scaling, all variables have a mean of 0 and most values fall between -4 and 4. The only exception is crim (crime rate), which is probably due to outliers (consistent with the finding from the correlation matrix).
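For reference, `scale()` standardizes each column by subtracting its mean and dividing by its standard deviation:

$$x_{scaled} = \frac{x - \bar{x}}{s_x}$$

where $\bar{x}$ and $s_x$ are the column mean and standard deviation.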
### 4.3 Use the quantiles as the break points in the categorical variable and drop the old crime rate variable from the dataset.
```{r}
#generate cutoff according to quantile
bins <- quantile(bos.s$crim)
#generate a categorical variable "crime" and re-code it
bos.s <- bos.s %>%
mutate(crime = crim %>%
cut(breaks = bins, include.lowest = TRUE) %>%
fct_recode("Low" = "[-0.419,-0.411]",
"MediumLow" = "(-0.411,-0.39]",
"MediumHigh" = "(-0.39,0.00739]",
"High" = "(0.00739,9.92]"))
#remove crim
bos.s <- bos.s %>% select(-crim)
```
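The `fct_recode()` step above depends on the exact interval labels that `cut()` happens to print. A slightly more robust alternative (just a sketch, not evaluated here) would pass the labels to `cut()` directly:
```{r, eval = FALSE}
# sketch only (eval = FALSE): label the quantile-based bins directly, so the
# code does not depend on how cut() prints the interval boundaries
bos.s$crime <- cut(bos.s$crim, breaks = bins, include.lowest = TRUE,
                   labels = c("Low", "MediumLow", "MediumHigh", "High"))
```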
### 4.4 Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
```{r}
set.seed(2022)
#generate an object containing the number of observations in bos dataset
n <- nrow(bos.s)
#generate an object "ind", which contains a random selected set of the indexing
#of bos dataset, and the number of indexing takes up 80% of number of observations
ind <- sample(1:n, size = n*0.8)
#generate train&test sets according to the random set of indexing number
train <- bos.s[ind,]
test <- bos.s[-ind,]
```
## 5 Assignment requirement #5
Quoted from the assignment:
_"Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot *(0-3 points)*"_
### 5.1 Fit the linear discriminant analysis on the train set (Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables)
```{r}
# fit a linear discriminant model on the train set and name it "lda.fit"
lda.fit <- lda(crime ~ ., data = train)
```
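Optionally (not required by the assignment), the share of the between-group variance captured by each discriminant can be checked from the fitted object; a minimal sketch:
```{r}
# sketch: the "proportion of trace" of each linear discriminant, i.e. the
# share of the between-class variance it explains
round(lda.fit$svd^2 / sum(lda.fit$svd^2), 3)
```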
### 5.2 Draw the LDA (bi)plot
```{r}
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(factor(train$crime))
```
```{r, fig.cap="Fig 5.2 Biplot for LDA for clustering crime rate"}
#plot the lda results
plot(lda.fit, dimen = 2, pch = classes, col = classes)
lda.arrows(lda.fit, myscale = 4)
```
A biplot based on LD1 and LD2 was generated, see Fig 5.2. Most of the four clusters separate poorly; only the cluster "High" stands clearly apart, and heavy overlap is observed between every pair of the other clusters. Clusters High and MediumHigh also show some notable overlap.
Based on the arrows, the variable lstat contributes the most to separating cluster High. The contributions of the variables to the other clusters are not clear due to the heavy overlap.
## 6 Assignment requirement #6
Quoted from the assignment:
_"Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. *(0-3 points)*"_
### 6.1 Save the crime categories from the test set and then remove the categorical crime variable from the test dataset
```{r}
#save crime into an object
classes.test <- test$crime
#remove crime
test$crime <- NULL
```
### 6.2 Predict the classes with the LDA model on the test data
```{r}
predicted.test <- predict(lda.fit, test)
```
### 6.3 Cross tabulate the results with the crime categories from the test set.
```{r}
#generate a table that evaluates the accuracy of the model and pass it into
#an object named "accuracy.tab"
accuracy.tab <- table(correct = classes.test, predicted = predicted.test$class )
#show the accuracy table
accuracy.tab
#ask R to identify the correct predictions and add them up
correct.n = 0 # the number of correct predictions starting at 0
for (i in 1:4){ #4 loops because we have 4 rows/columns
correct.c <- accuracy.tab[
which(rownames(accuracy.tab) == colnames(accuracy.tab)[i]),
i] # if a cell has same row and column names, pass its value into "correct.c"
correct.n = correct.c+ correct.n # update the value of correct prediction
} # by adding "correct.c"
# calculate the proportion of correct predictions in the test set
correct.n/nrow(test) #denominator is the number of observations in the test set
```
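For reference, the same proportion can be obtained more directly (a small sketch, assuming the row and column levels of the table are in the same order, which they are here because both come from the same factor):
```{r}
# equivalent shortcuts for the classification accuracy
sum(diag(accuracy.tab)) / sum(accuracy.tab)
mean(predicted.test$class == classes.test)
```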
### 6.4 Comment on the results
Overall, 66.2% of the predictions are correct, which is not a very satisfactory performance for our linear discriminant analysis. Looking at the results more closely, the model predicts the high and medium-high crime-rate regions best, with 90% (47/52) accuracy. For the low and medium-low regions the predictive performance drops sharply. Possible explanations are: *a.* violation of the assumption of multivariate normality (although there is evidence that LDA can remain accurate even when this assumption is violated); *b.* the large number of outliers in the dependent variable before re-coding (LDA is sensitive to outliers); *c.* the small size of the Low category in the test set; *d.* the need for a better categorization strategy for the dependent variable (the current categorization is based only on quantiles and lacks a more evidence-based foundation).
## 7 Assignment requirement #7
Quoted from the assignment:
_"Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. *(0-4 points)*"_
### 7.1 Reload the Boston dataset and standardize the dataset
```{r}
#reload Boston
data("Boston")
bos <- Boston
#standardize the dataset
bos.s <- as.data.frame(scale(bos))
```
### 7.2 Calculate the distances between the observations
```{r}
dis_eu <- dist(bos.s)
summary(dis_eu)
```
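By default, `dist()` computes the Euclidean distance between every pair of (scaled) observations:

$$d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$$

where $p = 14$ is the number of scaled variables.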
### 7.3 Run k-means algorithm on the dataset
```{r}
bos.s.km <- kmeans(bos.s, centers = 4)
```
### 7.4 Investigate what is the optimal number of clusters
```{r}
date()
set.seed(22) #22 is the date I carried out the analysis
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bos.s, k)$tot.withinss})
```
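For reference, the quantity collected here (`tot.withinss`) is the total within-cluster sum of squares,

$$TWCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,$$

where $\mu_k$ is the centroid of cluster $C_k$. It can only decrease as $K$ grows, so we look for the point where the decrease levels off.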
```{r, fig.cap="Fig 7.4 Elbow plot for trends of within-cluster sum-of-square with increasing number of k"}
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') +
  geom_vline(xintercept = 3, color = "red") + # mark the candidate number of clusters
  annotate("text",
           label = "←Elbow effect happens here",
           x = 4.8, y = 3800, color = "red")
```
There is a large reduction in the within-cluster variation up to K = 3, but after that the variation does not decrease as quickly. I will therefore use K = 3 and run the k-means clustering again.
### 7.5 Run the algorithm again
```{r}
km <- kmeans(bos.s, centers = 3)
```
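Before visualizing, a quick optional glance at the solution (a small sketch, not required by the assignment):
```{r}
# sketch: cluster sizes and the share of total variance that is between clusters
km$size
round(km$betweenss / km$totss, 3)
```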
### 7.6 Visualize the clusters
```{r, fig.height = 8, fig.width = 14, fig.cap="Fig 7.6.1 Correlation Matrix with clusters", warning = F, message = F}
#define a function that allows me to fine-tune the matrix
my.fun.km <- function(data, mapping,...){ #define arguments
p <- ggplot(data = data, mapping = mapping) + #pass arguments
geom_point(size = 0.3,
color = factor(km$cluster),
...) + #define points size and color
stat_ellipse(geom = "polygon", mapping = mapping, alpha = 0.5)
#calculate an ellipse layer that separate clusters
p #print the results
}
ggpairs(bos, mapping = aes(fill = factor(km$cluster)),
        lower = list(
          continuous = my.fun.km
        ),
        columns = 1:7)
```
```{r, fig.height = 8, fig.width = 14, fig.cap="Fig 7.6.2 Correlation Matrix with clusters", warning = F, message = F}
ggpairs(bos, mapping = aes(fill = factor(km$cluster)),
        lower = list(
          continuous = my.fun.km
        ),
        columns = 8:14)
```
### 7.7 Interpret the results
By observing the elbow plot, which depicts how the within-cluster sum of squares changes with the number of clusters k, an optimal k = 3 was determined. The resulting k-means clusters were then visualized in correlation matrices of the variables in the dataset. Some variables clearly contribute a great deal to the clustering. For example, the variable black separates one cluster from the other two quite cleanly (see the 5th column of Fig 7.6.2, x axis), and the variable age roughly separates a different cluster from the other two (see the 7th row of Fig 7.6.1, y axis). Some pairs of variables also play an important role in the clustering: for example, the combination of black and dis roughly separates all three clusters (see Fig 7.6.2, 1st column, 5th row, or Fig 7.6.3 below). Because more dimensions cannot be shown on a two-dimensional screen, I cannot visually inspect the clustering effect of larger combinations of variables; fortunately, k-means clustering has already done that mathematically.
```{r, fig.cap= "the clustering effect of variables black and dis combined"}
bos %>% ggplot(aes(x = dis, y = black, color = factor(km$cluster))) +
geom_point() +
geom_abline(intercept = 480, slope = -25)+
geom_abline(intercept = 400, slope = -80) +
stat_ellipse(geom = "polygon",
aes(fill = km$cluster),
alpha = 0.25)
```
## 8 Assignment requirement #8
Quoted from the assignment:
_"Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? *(0-2 points to compensate any loss of points from the above exercises)"*_
### 8.1 Perform k-means on the original Boston data with some reasonable number of clusters (> 2)(Remember to standardize the dataset)
```{r}
# k = 3 is the optimal clusters found
km <- kmeans(scale(Boston), centers = 3)
```
### 8.2 Perform LDA using the clusters as target classes.
```{r}
#reload and standardize data
bos.s <- as.data.frame(scale(Boston))
#save the clusters identified by k-means clustering as a column in the data set
bos.s$km.cluster <- km$cluster
lda.km <- lda(km.cluster ~ ., data = bos.s)
```
### 8.3 Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution)
```{r, fig.cap="Fig. 8.3 Biplot of the LDA for separating clusters identified by K means distance"}
# target classes as numeric
classes <- as.numeric(factor(bos.s$km.cluster))
#plot the lda results as biplot
plot(lda.km, dimen = 2, pch = classes, col = classes)
lda.arrows(lda.km, myscale = 4)
```
### 8.4 Interpret the results
A biplot based on LD1 and LD2 was generated, see Fig 8.3. The three clusters separate quite clearly: some overlap is observed between clusters 1 and 3 and between clusters 2 and 3, while clusters 1 and 2 are perfectly separated.
Based on the arrows, the variables rm, dis and crim contribute most to cluster 1; the variables indus, rad, tax and nox contribute most to cluster 2; and the variables black, chas and ptratio contribute most to cluster 3. The other variables' roles in the separation are much weaker.
## 9 Assignment requirement #9
Quoted from the assignment:
_"Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Install and access the plotly package. Create a 3D plot of the columns of the matrix product using the given code. Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? *(0-3 points to compensate any loss of points from the above exercises)*"_
### 9.1 Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
```{r}
#reload the data
bos.s <- as.data.frame(scale(Boston))
#generate cutoff according to quantile
bins <- quantile(bos.s$crim)
#generate a categorical variable "crime" and re-code it
bos.s <- bos.s %>%
mutate(crime = crim %>%
cut(breaks = bins, include.lowest = TRUE) %>%
fct_recode("Low" = "[-0.419,-0.411]",
"MediumLow" = "(-0.411,-0.39]",
"MediumHigh" = "(-0.39,0.00739]",
"High" = "(0.00739,9.92]"))
#remove crim
bos.s <- bos.s %>% select(-crim)
set.seed(2022)
#generate an object containing the number of observations
n <- nrow(bos.s)
#generate a random set of indexing number with n = 80% of the obs.
ind <- sample(1:n, size = n*0.8)
#generate the train set and test
train <- bos.s[ind,]
test <- bos.s[-ind,]
```
### 9.2 Install and access the plotly package. Create a 3D plot of the columns of the matrix product using the given code. Adjust the code: add color as an argument in the plot_ly() function. Set the color to be the crime classes of the train set.
```{r}
#select predictors for train set, with outcome variable removed
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication, saving the resulting matrix into matrix_product
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#turn matrix into data frame
matrix_product <- as.data.frame(matrix_product)
#plot the 3D plot with LD1, LD2 and LD3
library(plotly)
p1 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = train$crime, #Set the color to be the crime classes of the train set.
size = 2)
p1
```
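In other words, the matrix product above projects each scaled training observation onto the discriminant directions:

$$Z = X_{train} W,$$

where $X_{train}$ is the $404 \times 13$ matrix of predictors and $W$ = `lda.fit$scaling` is the $13 \times 3$ matrix of discriminant coefficients, so each row of $Z$ contains the LD1, LD2 and LD3 coordinates of one observation.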
### 9.3 Draw another 3D plot where the color is defined by the clusters of the k-means.
```{r}
#select predictors for train set, with outcome variable removed
model_predictors <- dplyr::select(train, -crime)
#get the clusters of k-means for the train set
train.km <- kmeans(model_predictors, centers = 3)
p2 <- plot_ly(x = matrix_product$LD1,
y = matrix_product$LD2,
z = matrix_product$LD3,
type= 'scatter3d',
mode='markers',
color = factor(train.km$cluster), #color defined by clusters of the k-means
size = 1.5)
p2
```
### 9.4 How do the plots differ? Are there any similarities?
The LDA was trained on a categorization of crime rates based purely on quantiles, which has 4 categories, whereas k = 3 was adopted for the k-means clustering based on the within-cluster sum of squares. Since LDA is a supervised technique, we know what each category represents, and the categories are labelled in the plot. K-means clustering is an unsupervised method, so before looking closely I do not know what the three identified clusters represent in the real world.
However, looking at the two plots together, it is interesting to find that cluster 3 from k-means overlaps nicely with the High category from LDA, and cluster 2 from k-means roughly overlaps with the Low and MediumLow categories. I will therefore re-code the LDA categories accordingly and check more closely how consistent the k-means and LDA results are.
#### 9.4.1 Recode categories from LDA into High, Medium and Low (old Low + Medium Low)
```{r}
train.crime3 <- train %>%
mutate(crime3 = crime %>%
fct_recode("Medium" = "MediumHigh" ,
"Low" = "MediumLow",
"High" = "High",
"Low" = "Low"))
km.cluster <- factor(train.km$cluster)
levels(km.cluster) <- c("Medium","Low","High")
```
#### 9.4.2 Check the accuracy table
```{r}
accuracy.tab <- table(correct = train.crime3$crime3, kmean.pred = km.cluster)
accuracy.tab
```
#### 9.4.3 Calculate the accuracy rate
```{r}
# loop through the 3x3 matrix and add up the cells whose row and column names match
correct.n = 0
for (i in 1:3){
correct.c <- accuracy.tab[which(rownames(accuracy.tab) == colnames(accuracy.tab)[i]), i]
correct.n = correct.c+ correct.n
}
# calculate the accuracy rate; the denominator is the number of observations
# in the train set
correct.n/nrow(train.crime3)
```
This gives an accuracy rate of 71.3%, clearly outperforming the original LDA model and indicating that k-means clusters could be used as a cue for categorizing continuous variables.