---
title: "FastKNN [![Travis-CI Build Status](https://travis-ci.org/davpinto/fastknn.svg?branch=master)](https://travis-ci.org/davpinto/fastknn) [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/davpinto/fastknn?branch=master&svg=true)](https://ci.appveyor.com/project/davpinto/fastknn) [![codecov](https://codecov.io/github/davpinto/fastknn/branch/master/graphs/badge.svg)](https://codecov.io/github/davpinto/fastknn) [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/fastknn?color=blue)](https://cran.r-project.org/package=fastknn)"
output: github_document
urlcolor: magenta
---
> Fast k-Nearest Neighbors Classifier with shrinkage estimator for the class membership probabilities
```{r setup, include=FALSE}
## rmarkdown::render('README.Rmd', 'github_document')
## rmarkdown::render(input = 'README.Rmd', output_format = 'html_document', output_file = "index.html", output_dir = "./docs")
## pkgdown::build_site()
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
fig.align = "center", fig.height = 4, dpi = 120,
cache = TRUE)
```
```{r, echo=FALSE, out.width='156px', fig.align='center'}
knitr::include_graphics('fastknn_logo.png')
```
### Why `fastknn`?
------
`fastknn` is now available on [Kaggle](https://github.com/Kaggle/docker-rstats). Take a look at this [kernel](https://www.kaggle.com/davidpinto/d/uciml/forest-cover-type-dataset/fastknn-show-to-glm-what-knn-see-0-96) for an example of how to use `fastknn` to improve your performance in **Kaggle** competitions.
------
1. Build KNN classifiers with **large datasets** (> 100k rows) in a few seconds.
1. Predict more **calibrated probabilities** and reduce log-loss with the `"dist"` estimator.
1. Find the **best k** parameter according to a variety of loss functions, using n-fold cross-validation.
1. Plot beautiful classification **decision boundaries** for your dataset.
1. Do **feature engineering** and extract highly informative features from your dataset.
1. Compete in **Kaggle**.
Give it a try and let me know what you think!
## Fast Nearest Neighbor Searching
The `fastknn` method implements a k-Nearest Neighbor (KNN) classifier based on the [ANN](https://www.cs.umd.edu/~mount/ANN) library. ANN is written in `C++` and is able to find the k nearest neighbors for every point in a given dataset in `O(N log N)` time. The package [RANN](https://github.com/jefferis/RANN) provides an easy interface to the ANN library in `R`.
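For reference, here is a minimal sketch of such a fast neighbor search with `RANN::nn2()` on made-up data (illustrative only; `fastknn` wraps this kind of search for you):
```{r}
library("RANN")
## Made-up data: 1000 reference points and 100 query points in 10 dimensions
set.seed(123)
ref   <- matrix(rnorm(1000 * 10), ncol = 10)
query <- matrix(rnorm(100 * 10), ncol = 10)
## Find the 5 nearest reference points for every query point
nn <- nn2(data = ref, query = query, k = 5)
str(nn)  # nn$nn.idx: neighbor indices; nn$nn.dists: Euclidean distances
```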
## The FastKNN Classifier
The `fastknn` was developed to deal with very large datasets (> 100k rows) and is ideal for [Kaggle](https://www.kaggle.com) competitions. It can be about 50x faster than the popular `knn` method from the `R` package [class](https://cran.r-project.org/web/packages/class) for large datasets. Moreover, `fastknn` provides a shrinkage estimator for the class membership probabilities, based on the inverse distances of the nearest neighbors (see the equations on the [fastknn website](https://davpinto.github.io/fastknn/)):
$$
P(x_i \in y_j) = \displaystyle\frac{\displaystyle\sum\limits_{k=1}^K \left( \frac{1}{d_{ik}}\cdot(n_{ik} \in y_j) \right)}{\displaystyle\sum\limits_{k=1}^K \left( \frac{1}{d_{ik}} \right)}
$$
where $x_i$ is the $i^{\text{th}}$ test instance, $y_j$ is the $j^{\text{th}}$ unique class label, $n_{ik}$ is the $k^{\text{th}}$ nearest neighbor of $x_i$, and $d_{ik}$ is the distance between $x_i$ and $n_{ik}$. This estimator can be thought of as a weighted voting rule, where neighbors that are closer to $x_i$ have more influence on predicting $x_i$'s label.
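To make the rule concrete, here is the same computation written out by hand for a toy case with `k = 3` neighbors and two classes (an illustrative sketch only; `fastknn` does this internally when `method = "dist"`):
```{r}
d      <- c(0.5, 1.0, 2.0)       # distances from x_i to its 3 nearest neighbors
labels <- c("A", "A", "B")       # class labels of those neighbors
w      <- 1 / d                  # inverse-distance weights
tapply(w, labels, sum) / sum(w)  # P(x_i in A) = 6/7, P(x_i in B) = 1/7
```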
In general, the weighted estimator provides better-**calibrated probabilities** than the traditional estimator based on the label proportions of the nearest neighbors, and reduces the **logarithmic loss** (log-loss).
### How to install `fastknn`?
The package `fastknn` is not on CRAN, so you need to install it directly from GitHub:
```{r, eval=FALSE}
library("devtools")
install_github("davpinto/fastknn")
```
### Required Packages
The base of `fastknn` is the `RANN` package, but other packages are required to make `fastknn` work properly. All of them are installed automatically when you install `fastknn`:
* `RANN` for fast nearest neighbors searching,
* `foreach` and `doSNOW` to do parallelized cross-validation,
* `Metrics` to measure classification performance,
* `matrixStats` for fast matrix column-wise and row-wise statistics,
* `ggplot2` to plot classification decision boundaries,
* `viridis` for modern color palettes.
### Getting Started
Using `fastknn` is as simple as:
```{r}
## Load packages
library("fastknn")
library("caTools")
## Load toy data
data("chess", package = "fastknn")
## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = chess$y, SplitRatio = 0.7))
x.tr <- chess$x[tr.idx, ]
x.te <- chess$x[-tr.idx, ]
y.tr <- chess$y[tr.idx]
y.te <- chess$y[-tr.idx]
## Fit KNN
yhat <- fastknn(x.tr, y.tr, x.te, k = 10)
## Evaluate model on test set
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat$class)))
```
## Find the Best k
`fastknn` provides an interface to select the best `k` using n-fold cross-validation. There are 4 possible **loss functions**:
* Overall classification error rate: `eval.metric = "overall_error"`
* Mean per-class classification error rate: `eval.metric = "mean_error"`
* Mean per-class AUC: `eval.metric = "auc"`
* Cross-entropy / logarithmic loss: `eval.metric = "logloss"`
Cross-validation using the **voting** probability estimator:
```{r, results='hide'}
## Load dataset
library("mlbench")
data("Sonar", package = "mlbench")
x <- data.matrix(Sonar[, -61])
y <- Sonar$Class
## 5-fold CV using log-loss as evaluation metric
set.seed(123)
cv.out <- fastknnCV(x, y, k = 3:15, method = "vote", folds = 5, eval.metric = "logloss")
cv.out$cv_table
```
```{r, echo=FALSE}
pander::pander(cv.out$cv_table)
```
Cross-validation using the **weighted voting** probability estimator:
```{r, results='hide'}
## 5-fold CV using log-loss as evaluation metric
set.seed(123)
cv.out <- fastknnCV(x, y, k = 3:15, method = "dist", folds = 5, eval.metric = "logloss")
cv.out$cv_table
```
```{r, echo=FALSE}
pander::pander(cv.out$cv_table)
```
Note that the mean **log-loss** for the **weighted voting** estimator is lower for every `k` evaluated.
Parallelization is available. You can specify the number of threads via the `nthread` parameter.
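For instance, to run the same cross-validation on 4 threads (a sketch only, not evaluated here; adjust `nthread` to your machine):
```{r, eval=FALSE}
cv.out <- fastknnCV(x, y, k = 3:15, method = "dist", folds = 5,
                    eval.metric = "logloss", nthread = 4)
```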
## Plot Classification Decision Boundary
`fastknn` provides a plotting function, based on `ggplot2`, to draw two-dimensional decision boundaries. If your dataset has more than 2 variables, only the first two are considered. In future versions of `fastknn` the most descriptive variables will be selected automatically beforehand, using a **feature ranking** technique.
### Two-class Problem
```{r}
## Load toy data
data("spirals", package = "fastknn")
## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = spirals$y, SplitRatio = 0.7))
x.tr <- spirals$x[tr.idx, ]
x.te <- spirals$x[-tr.idx, ]
y.tr <- spirals$y[tr.idx]
y.te <- spirals$y[-tr.idx]
## Plot decision boundary
knnDecision(x.tr, y.tr, x.te, y.te, k = 15)
```
### Multi-class Problem
```{r}
## Load toy data
data("multi_spirals", package = "fastknn")
## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = multi_spirals$y, SplitRatio = 0.7))
x.tr <- multi_spirals$x[tr.idx, ]
x.te <- multi_spirals$x[-tr.idx, ]
y.tr <- multi_spirals$y[tr.idx]
y.te <- multi_spirals$y[-tr.idx]
## Plot decision boundary
knnDecision(x.tr, y.tr, x.te, y.te, k = 15)
```
## Performance Test
Here we test the performance of `fastknn` on the [Covertype](https://archive.ics.uci.edu/ml/datasets/Covertype) dataset. It is hosted in the [UCI](https://archive.ics.uci.edu/ml/) repository and has already been used in a **Kaggle** [competition](https://www.kaggle.com/c/forest-cover-type-prediction). The dataset contains 581012 observations on 54 numeric features, classified into 7 different categories.
All experiments were conducted on a **64-bit Ubuntu 16.04 with Intel Core i7-6700HQ 2.60GHz and 16GB RAM DDR4**.
### Computing Time
Here `fastknn` is compared with the `knn` method from the package `class`. We had to use small samples from the Covertype data because `knn` takes too much time (> 1500s) to fit the entire dataset.
```{r, results='hide'}
#### Load packages
library('class')
library('fastknn')
library('caTools')
#### Load data
data("covertype", package = "fastknn")
covertype$Target <- as.factor(covertype$Target)
#### Test with different sample sizes
N <- nrow(covertype)
sample.frac <- c(10e3, 15e3, 20e3)/N
res <- lapply(sample.frac, function(frac, dt) {
## Reduce dataset
set.seed(123)
sample.idx <- which(sample.split(dt$Target, SplitRatio = frac))
x <- as.matrix(dt[sample.idx, -55])
y <- dt$Target[sample.idx]
## Split data
set.seed(123)
tr.idx <- which(sample.split(y, SplitRatio = 0.7))
x.tr <- x[tr.idx, ]
x.te <- x[-tr.idx, ]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
## Measure time
t1 <- system.time({
yhat1 <- knn(train = x.tr, test = x.te, cl = y.tr, k = 10, prob = TRUE)
})
t2 <- system.time({
yhat2 <- fastknn(xtr = x.tr, ytr = y.tr, xte = x.te, k = 10, method = "dist")
})
## Return
list(
method = c('knn', 'fastknn'),
nobs = as.integer(rep(N*frac, 2)),
time_sec = c(t1[3], t2[3]),
accuracy = round(100 * c(sum(yhat1 == y.te), sum(yhat2$class == y.te)) / length(y.te), 2)
)
}, dt = covertype)
res <- do.call('rbind.data.frame', res)
res
```
```{r, echo=FALSE}
pander::pander(res)
```
`fastknn` takes **about 5s** to fit the entire dataset.
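For reference, a minimal sketch of such a full fit (not evaluated here; the timing above was measured on the hardware described earlier):
```{r, eval=FALSE}
## Fit the full Covertype data with a 70/30 train/test split
x <- as.matrix(covertype[, -55])
y <- covertype$Target
set.seed(123)
tr.idx <- which(sample.split(y, SplitRatio = 0.7))
system.time(
   yhat <- fastknn(xtr = x[tr.idx, ], ytr = y[tr.idx], xte = x[-tr.idx, ],
                   k = 10, method = "dist")
)
```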
### Probability Prediction
We compared the `voting` estimator with the `weighted voting` estimator:
**Voting**
```{r, results='hide'}
#### Extract input variables and response variable
x <- as.matrix(covertype[, -55])
y <- as.factor(covertype$Target)
#### 5-fold cross-validation
set.seed(123)
res <- fastknnCV(x, y, k = 10, method = "vote", folds = 5, eval.metric = "logloss")
res$cv_table
```
```{r, echo=FALSE}
pander::pander(res$cv_table)
```
**Weighted Voting**
```{r, results='hide'}
#### 5-fold cross-validation
set.seed(123)
res <- fastknnCV(x, y, k = 10, method = "dist", folds = 5, eval.metric = "logloss")
res$cv_table
```
```{r, echo=FALSE}
pander::pander(res$cv_table)
```
## Feature Engineering
**fastknn** provides a function to do **feature extraction** using KNN. It generates `k * c` new features, where `c` is the number of class labels. The new features are computed from the distances between the observations and their `k` nearest neighbors inside each class. The following example shows that the **KNN features** carry information that cannot be extracted from the data by a linear learner, such as a GLM model:
```{r, results='hide'}
library("mlbench")
library("caTools")
library("fastknn")
library("glmnet")
#### Load data
data("Ionosphere", package = "mlbench")
x <- data.matrix(subset(Ionosphere, select = -Class))
y <- Ionosphere$Class
#### Remove near zero variance columns
x <- x[, -c(1,2)]
#### Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
#### GLM with original features
glm <- glmnet(x = x.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, x.te, type = "class"))
yhat1 <- factor(yhat, levels = levels(y.tr))
#### Generate KNN features
set.seed(123)
new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)
#### GLM with KNN features
glm <- glmnet(x = new.data$new.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, new.data$new.te, type = "class"))
yhat2 <- factor(yhat, levels = levels(y.tr))
#### Performance
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat1)))
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat2)))
```
```{r, echo=FALSE}
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat1)))
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat2)))
```
We can see that the **KNN features** considerably improved the classification performance of the GLM model.
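As a quick sanity check of the `k * c` rule described above, we can inspect the dimensions of the extracted feature matrices from the Ionosphere example (here `k = 3` and there are 2 classes, so 6 new features are expected):
```{r}
dim(new.data$new.tr)  # one row per training observation, k * c = 6 columns expected
dim(new.data$new.te)  # one row per test observation, k * c = 6 columns expected
```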
The `knnExtract()` function is based on the ideas presented in the
[winner solution](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov) of the [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge) on **Kaggle**.
Parallelization is available. You can specify the number of threads via the `nthread` parameter.
### Understanding the KNN Features
KNN performs a nonlinear mapping of the original space, projecting the data into a new feature space in which the classes are linearly separable.
**Mapping the *chess* dataset**
```{r, results='hide'}
library("caTools")
library("fastknn")
library("ggplot2")
library("gridExtra")
## Load data
data("chess")
x <- data.matrix(chess$x)
y <- chess$y
## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)
## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)
```
**Mapping the *spirals* dataset**
```{r, results='hide'}
## Load data
data("spirals")
x <- data.matrix(spirals$x)
y <- spirals$y
## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)
## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)
```