---
title: "Cluster Analysis"
output:
word_document: default
html_document: default
---
Clustering is a branch of unsupervised learning that deals with the grouping of objects/individuals according to appropriate similarity criteria that come from our data. These groups must be
- [internally homogeneous]{.underline} and [externally heterogeneous]{.underline}
- [few]{.underline} in number
## Data requirements
- [Low collinearity between the variables used]{.underline} since they should be real classification dimensions.
- [Check for outliers]{.underline} since [cluster algorithms]{.underline} are [sensitive to extreme values.]{.underline}
- [Standardize the data]{.underline} to achieve homogeneous units of measurement (avoid comparing different things).
- Data can be metric or non-metric (there is no requirement either way).
- Data do not have to be normally distributed or linearly related.
## Main steps of cluster analyses
1. Select a [proximity measure]{.underline} (distance/similarity measure) for individual observations
2. Choose a [clustering algorithm]{.underline}
3. Define the [new distance between two clusters]{.underline}
4. Determine the [optimal number of clusters]{.underline}
## Proximity measures (step 1)
They [describe the relationship between objects.]{.underline} On the basis of these relationships, the individual objects are summarized into groups; a short sketch follows the list below. We have
- [similarity measures]{.underline}: Pearson correlation,...
- [distance measures]{.underline}: City block distance, (squared) Euclidean distance,...
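As a quick illustration (a minimal sketch with made-up values), these measures can be computed directly in base R:
```{r}
# two hypothetical observations measured on three variables
x <- c(1.0, 2.5, 4.0)
y <- c(2.0, 0.5, 3.0)

sqrt(sum((x - y)^2))  # Euclidean distance
sum(abs(x - y))       # city block (Manhattan) distance
cor(x, y)             # Pearson correlation as a similarity measure
```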
## Algorithm and new distance (steps 2 & 3)
- [Single linkage (nearest neighbor)]{.underline}: new distance is smallest individual distance
- [Complete linkage (furthest neighbor)]{.underline}: new distance is largest individual distance
- [Ward:]{.underline} new distance is based on a specific formula (the merge chosen is the one that minimizes the increase in within-cluster variance); see the sketch below
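To see the effect of the linkage choice, here is a minimal sketch on hypothetical data; the linkage rule is passed to hclust() via its method argument:
```{r}
# hypothetical data: 6 observations on 2 standardized variables
set.seed(1)
toy <- matrix(rnorm(12), nrow = 6)
d <- dist(toy)  # Euclidean distances

# same distances, different merging rules give different trees
plot(hclust(d, method = "single"), main = "Single linkage")
plot(hclust(d, method = "complete"), main = "Complete linkage")
```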
## Optimal number of clusters (step 4)
The final solution must be
- interpretable.
- the best one w.r.t. the initial research problem.
In practice, we evaluate several solutions and choose the most suitable one, e.g. using the elbow criterion on the agglomeration coefficients (AC).
## Example
We will use the "cars.sav" data
```{r}
library(haven)
cars <- read_sav("D:/data/Empirical Research/7 Cluster Analysis/cars.sav")
head(cars)
```
We can check for pairwise collinearity among the variables:
```{r message=FALSE, warning=FALSE}
library(corrplot)
corrplot.mixed(cor(cars[,2:10]), lower.col='black', number.cex=.7)
```
As we can see, there is indeed correlation among several of the variables, which means that ideally we should discard some of them. We could also check for outliers, e.g. using boxplots (a sketch follows the standardization step below). However, we will keep the complete data set for our analysis. In any case, it is vital to standardize the data before proceeding:
```{r}
cardat = scale(cars[,2:10])
row.names(cardat) <- cars$Name
```
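Although we keep all observations, the outlier check mentioned above could be sketched on the standardized data like this; points far outside the whiskers (roughly beyond ±3 after standardization) are candidate outliers:
```{r}
# one boxplot per standardized variable
boxplot(cardat, las = 2, cex.axis = 0.7)
```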
As a first step, let us select the Euclidean distance as the proximity measure for our data. We calculate the distance for each pair of individuals. This can be done either manually for each pair, as in the example below,
```{r}
# Euclidean distance between the first two cars, computed by hand
# on the standardized data
edman = sqrt(sum((cardat[1, ] - cardat[2, ])^2))
edman
```
or, preferably, automatically for all pairs using the dist() function in R:
```{r}
eucdist = dist(cardat, method = "euclidean")  # for city block distance, use method = "manhattan"
eucdist
```
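For easier inspection, the dist object can be converted into a full symmetric matrix, e.g. to look at the first few pairwise distances:
```{r}
round(as.matrix(eucdist)[1:5, 1:5], 2)
```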
For squared Euclidean and other distance measures, we can use the distance() function from the philentropy package:
```{r eval=FALSE}
library(philentropy)
sqe = distance(cardat, method="squared_euclidean", use.row.names = TRUE)
sqe
```
Now we perform the clustering by defining the specific algorithm and the new-distance rule for the procedure. For example:
```{r}
# for single linkage use method="single"; for complete linkage, method="complete"
# note: "ward.D" assumes squared distances as input; "ward.D2" is the variant
# designed for unsquared Euclidean distances
clus = hclust(eucdist, method="ward.D")
clus
plot(clus)
# merge schedule: which clusters are joined at which height
data.frame(clus[2:1])
```
We observe that the result ranges from one extreme of a single cluster (where all 15 individuals belong to one group) to the other extreme of 15 clusters (where each individual forms its own group). Both cases are uninformative.
The optimal number of clusters can be determined by observing the rate of change of $height$ as we move along the stages, i.e. as the number of clusters decreases:
```{r}
NoC = 14:1  # number of clusters remaining after each of the 14 merge stages
plot(NoC, clus$height, type="l")
```
As we can see, the most radical change in $height$ occurs once we reach the stage with 2 clusters. Thus, according to the elbow rule, we choose 2 clusters for our analysis:
```{r}
plot(clus)
rect.hclust(clus, k=2, border="red")  # highlight the 2-cluster solution
```
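Finally, the actual cluster membership of each car for the chosen solution can be extracted with cutree():
```{r}
membership <- cutree(clus, k = 2)  # assign each car to one of the 2 clusters
table(membership)                  # cluster sizes
membership                         # cluster label for each car
```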