-
Notifications
You must be signed in to change notification settings - Fork 64
/
Copy pathREADME.Rmd
192 lines (142 loc) · 5.62 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
---
output: github_document
bibliography: vignettes/dbscan.bib
link-citations: yes
---
```{r echo=FALSE, results = 'asis'}
pkg <- 'dbscan'
source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan%2br")
```
## Introduction
This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data.
The package includes:
__Clustering__
- __DBSCAN:__ Density-based spatial clustering of applications with noise [@ester1996density].
- __Jarvis-Patrick Clustering__: Clustering using a similarity measure based
on shared near neighbors [@jarvis1973].
- __SNN Clustering__: Shared nearest neighbor clustering [@erdoz2003].
- __HDBSCAN:__ Hierarchical DBSCAN with simplified hierarchy extraction [@campello2015hierarchical].
- __FOSC:__ Framework for optimal selection of clusters for unsupervised and semisupervised clustering of hierarchical cluster tree [@campello2013density].
- __OPTICS/OPTICSXi:__ Ordering points to identify the clustering structure and cluster extraction methods
[@ankerst1999optics].
__Outlier Detection__
- __LOF:__ Local outlier factor algorithm [@breunig2000lof].
- __GLOSH:__ Global-Local Outlier Score from Hierarchies algorithm [@campello2015hierarchical].
__Cluster Evaluation__
- __DBCV:__ Density-based clustering validation [@moulavi2014].
__Fast Nearest-Neighbor Search (using kd-trees)__
- __kNN search__
- __Fixed-radius NN search__
The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are
for Euclidean distance typically faster than the native R implementations (e.g., dbscan in package `fpc`), or the
implementations in [WEKA](https://ml.cms.waikato.ac.nz/weka/), [ELKI](https://elki-project.github.io/) and [Python's scikit-learn](https://scikit-learn.org/).
```{r echo=FALSE, results = 'asis'}
pkg_usage(pkg)
pkg_citation(pkg, 2)
pkg_install(pkg)
```
## Usage
Load the package and use the numeric variables in the iris dataset
```{r}
library("dbscan")
data("iris")
x <- as.matrix(iris[, 1:4])
```
DBSCAN
```{r}
db <- dbscan(x, eps = .42, minPts = 5)
db
```
Visualize the resulting clustering (noise points are shown in black).
```{r dbscan}
pairs(x, col = db$cluster + 1L)
```
OPTICS
```{r}
opt <- optics(x, eps = 1, minPts = 4)
opt
```
Extract DBSCAN-like clustering from OPTICS
and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)
```{r OPTICS_extractDBSCAN, fig.height=3}
opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)
```
HDBSCAN
```{r}
hdb <- hdbscan(x, minPts = 4)
hdb
```
Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.
```{r hdbscan, fig.height=4}
plot(hdb, show_flat = TRUE)
```
## Using dbscan with tidyverse
`dbscan` provides for all clustering algorithms `tidy()`, `augment()`, and `glance()` so they can
be easily used with tidyverse, ggplot2 and [tidymodels](https://www.tidymodels.org/learn/statistics/k-means/).
```{r tidyverse, message=FALSE, warning=FALSE}
library(tidyverse)
db <- x %>% dbscan(eps = .42, minPts = 5)
```
Get cluster statistics as a tibble
```{r tidyverse2}
tidy(db)
```
Visualize the clustering with ggplot2 (use an x for noise points)
```{r tidyverse3}
augment(db, x) %>%
ggplot(aes(x = Petal.Length, y = Petal.Width)) +
geom_point(aes(color = .cluster, shape = noise)) +
scale_shape_manual(values=c(19, 4))
```
## Using dbscan from Python
R, the R package `dbscan`, and the Python package `rpy2` need to be installed.
```{python, eval = FALSE}
import pandas as pd
import numpy as np
### prepare data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
header = None,
names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]
# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')
# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()
db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5)
print(db)
```
```
## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
##
## 0 1 2
## 17 49 84
##
## Available fields: cluster, eps, minPts, dist, borderPoints
```
```{python, eval = FALSE}
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels
```
```
## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
## 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
## 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
## 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
## 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
## dtype=int32)
```
## License
The dbscan package is licensed under the [GNU General Public License (GPL) Version 3](https://www.gnu.org/licenses/gpl-3.0.en.html). The __OPTICSXi__ R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert.
## Changes
* List of changes from [NEWS.md](https://github.com/mhahsler/dbscan/blob/master/NEWS.md)
## References