---
title: "Cluster Analysis"
output:
word_document: default
html_document: default
---
Clustering is a branch of unsupervised learning that deals with the grouping of objects/individuals according to appropriate similarity criteria that come from our data. These groups must be
- [internally homogeneous]{.underline} and [externally heterogeneous]{.underline}
- [few]{.underline} in number
## Data requirements
- [Low collinearity between the variables used]{.underline} since they should be real classification dimensions.
- [Check for outliers]{.underline} since [cluster algorithms]{.underline} are [sensitive to extreme values.]{.underline}
- [Standardize the data]{.underline} to achieve homogeneous units of measurement (avoid comparing different things).
- Data can be metric or non-metric (there is no requirement either way).
- Data do not have to be normally distributed or linearly related.
## Main steps of cluster analyses
1. Select a [proximity measure]{.underline} (distance/similarity measure) for individual observations
2. Choose a [clustering algorithm]{.underline}
3. Define the [new distance between two clusters]{.underline}
4. Determine the [optimal number of clusters]{.underline}
## Proximity measures (step 1)
They [describe the relationship between objects.]{.underline} On the basis of these relationships, the individual objects are summarized into groups; a short sketch follows the list below. We have
- [similarity measures]{.underline}: Pearson correlation,...
- [distance measures]{.underline}: City block distance, (squared) Euclidean distance,...
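As a quick illustration (a minimal sketch with made-up values), these measures can be computed directly in base R:
```{r}
# two hypothetical observations measured on three variables
x <- c(1.0, 2.5, 4.0)
y <- c(2.0, 0.5, 3.0)

sqrt(sum((x - y)^2))  # Euclidean distance
sum(abs(x - y))       # city block (Manhattan) distance
cor(x, y)             # Pearson correlation as a similarity measure
```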
## Algorithm and new distance (steps 2 & 3)
- [Single linkage (nearest neighbor)]{.underline}: new distance is smallest individual distance
- [Complete linkage (furthest neighbor)]{.underline}: new distance is largest individual distance
- [Ward:]{.underline} new distance is based on a specific formula (the merge chosen is the one that minimizes the increase in within-cluster variance); see the sketch below
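To see the effect of the linkage choice, here is a minimal sketch on hypothetical data; the linkage rule is passed to hclust() via its method argument:
```{r}
# hypothetical data: 6 observations on 2 standardized variables
set.seed(1)
toy <- matrix(rnorm(12), nrow = 6)
d <- dist(toy)  # Euclidean distances

# same distances, different merging rules give different trees
plot(hclust(d, method = "single"), main = "Single linkage")
plot(hclust(d, method = "complete"), main = "Complete linkage")
```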
## Optimal number of clusters (step 4)
The final solution must be
- interpretable.
- the best one w.r.t. the initial research problem.
In practice, we evaluate several solutions and choose the most suitable one, e.g. using the elbow criterion on the agglomeration coefficients (AC).
## Example
We will use the "cars.sav" data
```{r}
library(haven)
cars <- read_sav("D:/data/Empirical Research/7 Cluster Analysis/cars.sav")
head(cars)
```
We can check for pairwise collinearity among the variables:
```{r message=FALSE, warning=FALSE}
library(corrplot)
corrplot.mixed(cor(cars[,2:10]), lower.col='black', number.cex=.7)
```
As we can see, there is indeed correlation among several of the variables, which means that ideally we should discard some of them. We could also check for outliers, e.g. using boxplots (a sketch follows the standardization step below). However, we will keep the complete data set for our analysis. In any case, it is vital to standardize the data before proceeding:
```{r}
cardat = scale(cars[,2:10])
row.names(cardat) <- cars$Name
```
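Although we keep all observations, the outlier check mentioned above could be sketched on the standardized data like this; points far outside the whiskers (roughly beyond ±3 after standardization) are candidate outliers:
```{r}
# one boxplot per standardized variable
boxplot(cardat, las = 2, cex.axis = 0.7)
```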
As a first step, let us select the Euclidean distance as the proximity measure for our data. We calculate the distance for each pair of individuals. This can be done either manually for each pair, as in the example below,
```{r}
# Euclidean distance between the first two cars, computed by hand
# on the standardized data
edman = sqrt(sum((cardat[1, ] - cardat[2, ])^2))
edman
```
or, preferably, automatically for all pairs using the dist() function in R:
```{r}
eucdist = dist(cardat, method = "euclidean")  # for city block distance, use method = "manhattan"
eucdist
```
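For easier inspection, the dist object can be converted into a full symmetric matrix, e.g. to look at the first few pairwise distances:
```{r}
round(as.matrix(eucdist)[1:5, 1:5], 2)
```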
For squared Euclidean and other distance measures, we can use the distance() function from the philentropy package:
```{r eval=FALSE}
library(philentropy)
sqe = distance(cardat, method="squared_euclidean", use.row.names = TRUE)
sqe
```
Now we perform the clustering by defining the specific algorithm and the new-distance rule for the procedure. For example:
```{r}
# for single linkage use method="single"; for complete linkage, method="complete"
# note: "ward.D" assumes squared distances as input; "ward.D2" is the variant
# designed for unsquared Euclidean distances
clus = hclust(eucdist, method="ward.D")
clus
plot(clus)
# merge schedule: which clusters are joined at which height
data.frame(clus[2:1])
```
We observe that the result ranges from one extreme of a single cluster (where all 15 individuals belong to one group) to the other extreme of 15 clusters (where each individual forms its own group). Both cases are uninformative.
The optimal number of clusters can be determined by observing the rate of change of $height$ as we move along the stages, i.e. as the number of clusters decreases:
```{r}
NoC = 14:1  # number of clusters remaining after each of the 14 merge stages
plot(NoC, clus$height, type="l")
```
As we can see, the most radical change in $height$ occurs once we reach the stage with 2 clusters. Thus, according to the elbow rule, we choose 2 clusters for our analysis:
```{r}
plot(clus)
rect.hclust(clus, k=2, border="red")  # highlight the 2-cluster solution
```
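Finally, the actual cluster membership of each car for the chosen solution can be extracted with cutree():
```{r}
membership <- cutree(clus, k = 2)  # assign each car to one of the 2 clusters
table(membership)                  # cluster sizes
membership                         # cluster label for each car
```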