---
title: "FastKNN [![Travis-CI Build Status](https://travis-ci.org/davpinto/fastknn.svg?branch=master)](https://travis-ci.org/davpinto/fastknn) [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/davpinto/fastknn?branch=master&svg=true)](https://ci.appveyor.com/project/davpinto/fastknn) [![codecov](https://codecov.io/github/davpinto/fastknn/branch/master/graphs/badge.svg)](https://codecov.io/github/davpinto/fastknn) [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/fastknn?color=blue)](https://cran.r-project.org/package=fastknn)"
output: github_document
urlcolor: magenta
---
> Fast k-Nearest Neighbors Classifier with shrinkage estimator for the class membership probabilities
```{r setup, include=FALSE}
## rmarkdown::render('README.Rmd', 'github_document')
## rmarkdown::render(input = 'README.Rmd', output_format = 'html_document', output_file = "index.html", output_dir = "./docs")
## pkgdown::build_site()
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
fig.align = "center", fig.height = 4, dpi = 120,
cache = TRUE)
```
```{r, echo=FALSE, out.width='156px', fig.align='center'}
knitr::include_graphics('fastknn_logo.png')
```
### Why `fastknn`?
------
`fastknn` is now available on [Kaggle](https://github.com/Kaggle/docker-rstats). Take a look at this [kernel](https://www.kaggle.com/davidpinto/d/uciml/forest-cover-type-dataset/fastknn-show-to-glm-what-knn-see-0-96) for an example of how to use `fastknn` to improve your performance in **Kaggle** competitions.
------
1. Build KNN classifiers with **large datasets** (> 100k rows) in a few seconds.
1. Predict more **calibrated probabilities** and reduce log-loss with the `"dist"` estimator.
1. Find the **best k** parameter according to a variety of loss functions, using n-fold cross-validation.
1. Plot beautiful classification **decision boundaries** for your dataset.
1. Do **feature engineering** and extract highly informative features from your dataset.
1. Compete in **Kaggle**.
Give it a try and let me know what you think!
## Fast Nearest Neighbor Searching
The `fastknn` method implements a k-Nearest Neighbor (KNN) classifier based on the [ANN](https://www.cs.umd.edu/~mount/ANN) library. ANN is written in `C++` and is able to find the k nearest neighbors for every point in a given dataset in `O(N log N)` time. The package [RANN](https://github.com/jefferis/RANN) provides an easy interface to the ANN library in `R`.
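For reference, here is a minimal sketch of such a fast neighbor search with `RANN::nn2()` on made-up data (illustrative only; `fastknn` wraps this kind of search for you):
```{r}
library("RANN")
## Made-up data: 1000 reference points and 100 query points in 10 dimensions
set.seed(123)
ref   <- matrix(rnorm(1000 * 10), ncol = 10)
query <- matrix(rnorm(100 * 10), ncol = 10)
## Find the 5 nearest reference points for every query point
nn <- nn2(data = ref, query = query, k = 5)
str(nn)  # nn$nn.idx: neighbor indices; nn$nn.dists: Euclidean distances
```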
## The FastKNN Classifier
The `fastknn` was developed to deal with very large datasets (> 100k rows) and is ideal for [Kaggle](https://www.kaggle.com) competitions. It can be about 50x faster than the popular `knn` method from the `R` package [class](https://cran.r-project.org/web/packages/class) for large datasets. Moreover, `fastknn` provides a shrinkage estimator for the class membership probabilities, based on the inverse distances of the nearest neighbors (see the equations on the [fastknn website](https://davpinto.github.io/fastknn/)):
$$
P(x_i \in y_j) = \displaystyle\frac{\displaystyle\sum\limits_{k=1}^K \left( \frac{1}{d_{ik}}\cdot(n_{ik} \in y_j) \right)}{\displaystyle\sum\limits_{k=1}^K \left( \frac{1}{d_{ik}} \right)}
$$
where $x_i$ is the $i^{\text{th}}$ test instance, $y_j$ is the $j^{\text{th}}$ unique class label, $n_{ik}$ is the $k^{\text{th}}$ nearest neighbor of $x_i$, and $d_{ik}$ is the distance between $x_i$ and $n_{ik}$. This estimator can be thought of as a weighted voting rule, where neighbors that are closer to $x_i$ have more influence on predicting $x_i$'s label.
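To make the rule concrete, here is the same computation written out by hand for a toy case with `k = 3` neighbors and two classes (an illustrative sketch only; `fastknn` does this internally when `method = "dist"`):
```{r}
d      <- c(0.5, 1.0, 2.0)       # distances from x_i to its 3 nearest neighbors
labels <- c("A", "A", "B")       # class labels of those neighbors
w      <- 1 / d                  # inverse-distance weights
tapply(w, labels, sum) / sum(w)  # P(x_i in A) = 6/7, P(x_i in B) = 1/7
```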
In general, the weighted estimator provides better-**calibrated probabilities** than the traditional estimator based on the label proportions of the nearest neighbors, and reduces the **logarithmic loss** (log-loss).
### How to install `fastknn`?
The package `fastknn` is not on CRAN, so you need to install it directly from GitHub:
```{r, eval=FALSE}
library("devtools")
install_github("davpinto/fastknn")
```
### Required Packages
The base of `fastknn` is the `RANN` package, but other packages are required to make `fastknn` work properly. All of them are installed automatically when you install `fastknn`:
* `RANN` for fast nearest neighbors searching,
* `foreach` and `doSNOW` to do parallelized cross-validation,
* `Metrics` to measure classification performance,
* `matrixStats` for fast matrix column-wise and row-wise statistics,
* `ggplot2` to plot classification decision boundaries,
* `viridis` for modern color palettes.
### Getting Started
Using `fastknn` is as simple as:
```{r}
## Load packages
library("fastknn")
library("caTools")
## Load toy data
data("chess", package = "fastknn")
## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = chess$y, SplitRatio = 0.7))
x.tr <- chess$x[tr.idx, ]
x.te <- chess$x[-tr.idx, ]
y.tr <- chess$y[tr.idx]
y.te <- chess$y[-tr.idx]
## Fit KNN
yhat <- fastknn(x.tr, y.tr, x.te, k = 10)
## Evaluate model on test set
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat$class)))
```
## Find the Best k
`fastknn` provides an interface to select the best `k` using n-fold cross-validation. There are 4 possible **loss functions**:
* Overall classification error rate: `eval.metric = "overall_error"`
* Mean per-class classification error rate: `eval.metric = "mean_error"`
* Mean per-class AUC: `eval.metric = "auc"`
* Cross-entropy / logarithmic loss: `eval.metric = "logloss"`
Cross-validation using the **voting** probability estimator:
```{r, results='hide'}
## Load dataset
library("mlbench")
data("Sonar", package = "mlbench")
x <- data.matrix(Sonar[, -61])
y <- Sonar$Class
## 5-fold CV using log-loss as evaluation metric
set.seed(123)
cv.out <- fastknnCV(x, y, k = 3:15, method = "vote", folds = 5, eval.metric = "logloss")
cv.out$cv_table
```
```{r, echo=FALSE}
pander::pander(cv.out$cv_table)
```
Cross-validation using the **weighted voting** probability estimator:
```{r, results='hide'}
## 5-fold CV using log-loss as evaluation metric
set.seed(123)
cv.out <- fastknnCV(x, y, k = 3:15, method = "dist", folds = 5, eval.metric = "logloss")
cv.out$cv_table
```
```{r, echo=FALSE}
pander::pander(cv.out$cv_table)
```
Note that the mean **log-loss** for the **weighted voting** estimator is lower for every `k` evaluated.
Parallelization is available. You can specify the number of threads via the `nthread` parameter.
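For instance, to run the same cross-validation on 4 threads (a sketch only, not evaluated here; adjust `nthread` to your machine):
```{r, eval=FALSE}
cv.out <- fastknnCV(x, y, k = 3:15, method = "dist", folds = 5,
                    eval.metric = "logloss", nthread = 4)
```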
## Plot Classification Decision Boundary
`fastknn` provides a plotting function, based on `ggplot2`, to draw two-dimensional decision boundaries. If your dataset has more than 2 variables, only the first two are considered. In future versions of `fastknn` the most descriptive variables will be selected automatically beforehand, using a **feature ranking** technique.
### Two-class Problem
```{r}
## Load toy data
data("spirals", package = "fastknn")
## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = spirals$y, SplitRatio = 0.7))
x.tr <- spirals$x[tr.idx, ]
x.te <- spirals$x[-tr.idx, ]
y.tr <- spirals$y[tr.idx]
y.te <- spirals$y[-tr.idx]
## Plot decision boundary
knnDecision(x.tr, y.tr, x.te, y.te, k = 15)
```
### Multi-class Problem
```{r}
## Load toy data
data("multi_spirals", package = "fastknn")
## Split data for training and test
set.seed(123)
tr.idx <- which(caTools::sample.split(Y = multi_spirals$y, SplitRatio = 0.7))
x.tr <- multi_spirals$x[tr.idx, ]
x.te <- multi_spirals$x[-tr.idx, ]
y.tr <- multi_spirals$y[tr.idx]
y.te <- multi_spirals$y[-tr.idx]
## Plot decision boundary
knnDecision(x.tr, y.tr, x.te, y.te, k = 15)
```
## Performance Test
Here we test the performance of `fastknn` on the [Covertype](https://archive.ics.uci.edu/ml/datasets/Covertype) dataset. It is hosted in the [UCI](https://archive.ics.uci.edu/ml/) repository and has already been used in a **Kaggle** [competition](https://www.kaggle.com/c/forest-cover-type-prediction). The dataset contains 581012 observations on 54 numeric features, classified into 7 different categories.
All experiments were conducted on a **64-bit Ubuntu 16.04 with Intel Core i7-6700HQ 2.60GHz and 16GB RAM DDR4**.
### Computing Time
Here `fastknn` is compared with the `knn` method from the package `class`. We had to use small samples from the Covertype data because `knn` takes too much time (> 1500s) to fit the entire dataset.
```{r, results='hide'}
#### Load packages
library('class')
library('fastknn')
library('caTools')
#### Load data
data("covertype", package = "fastknn")
covertype$Target <- as.factor(covertype$Target)
#### Test with different sample sizes
N <- nrow(covertype)
sample.frac <- c(10e3, 15e3, 20e3)/N
res <- lapply(sample.frac, function(frac, dt) {
## Reduce dataset
set.seed(123)
sample.idx <- which(sample.split(dt$Target, SplitRatio = frac))
x <- as.matrix(dt[sample.idx, -55])
y <- dt$Target[sample.idx]
## Split data
set.seed(123)
tr.idx <- which(sample.split(y, SplitRatio = 0.7))
x.tr <- x[tr.idx, ]
x.te <- x[-tr.idx, ]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
## Measure time
t1 <- system.time({
yhat1 <- knn(train = x.tr, test = x.te, cl = y.tr, k = 10, prob = TRUE)
})
t2 <- system.time({
yhat2 <- fastknn(xtr = x.tr, ytr = y.tr, xte = x.te, k = 10, method = "dist")
})
## Return
list(
method = c('knn', 'fastknn'),
nobs = as.integer(rep(N*frac, 2)),
time_sec = c(t1[3], t2[3]),
accuracy = round(100 * c(sum(yhat1 == y.te), sum(yhat2$class == y.te)) / length(y.te), 2)
)
}, dt = covertype)
res <- do.call('rbind.data.frame', res)
res
```
```{r, echo=FALSE}
pander::pander(res)
```
`fastknn` takes **about 5s** to fit the entire dataset.
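For reference, a minimal sketch of such a full fit (not evaluated here; the timing above was measured on the hardware described earlier):
```{r, eval=FALSE}
## Fit the full Covertype data with a 70/30 train/test split
x <- as.matrix(covertype[, -55])
y <- covertype$Target
set.seed(123)
tr.idx <- which(sample.split(y, SplitRatio = 0.7))
system.time(
   yhat <- fastknn(xtr = x[tr.idx, ], ytr = y[tr.idx], xte = x[-tr.idx, ],
                   k = 10, method = "dist")
)
```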
### Probability Prediction
We compared the `voting` estimator with the `weighted voting` estimator:
**Voting**
```{r, results='hide'}
#### Extract input variables and response variable
x <- as.matrix(covertype[, -55])
y <- as.factor(covertype$Target)
#### 5-fold cross-validation
set.seed(123)
res <- fastknnCV(x, y, k = 10, method = "vote", folds = 5, eval.metric = "logloss")
res$cv_table
```
```{r, echo=FALSE}
pander::pander(res$cv_table)
```
**Weighted Voting**
```{r, results='hide'}
#### 5-fold cross-validation
set.seed(123)
res <- fastknnCV(x, y, k = 10, method = "dist", folds = 5, eval.metric = "logloss")
res$cv_table
```
```{r, echo=FALSE}
pander::pander(res$cv_table)
```
## Feature Engineering
**fastknn** provides a function to do **feature extraction** using KNN. It generates `k * c` new features, where `c` is the number of class labels. The new features are computed from the distances between the observations and their `k` nearest neighbors inside each class. The following example shows that the **KNN features** carry information that cannot be extracted from the data by a linear learner, such as a GLM model:
```{r, results='hide'}
library("mlbench")
library("caTools")
library("fastknn")
library("glmnet")
#### Load data
data("Ionosphere", package = "mlbench")
x <- data.matrix(subset(Ionosphere, select = -Class))
y <- Ionosphere$Class
#### Remove near zero variance columns
x <- x[, -c(1,2)]
#### Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
#### GLM with original features
glm <- glmnet(x = x.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, x.te, type = "class"))
yhat1 <- factor(yhat, levels = levels(y.tr))
#### Generate KNN features
set.seed(123)
new.data <- knnExtract(xtr = x.tr, ytr = y.tr, xte = x.te, k = 3)
#### GLM with KNN features
glm <- glmnet(x = new.data$new.tr, y = y.tr, family = "binomial", lambda = 0)
yhat <- drop(predict(glm, new.data$new.te, type = "class"))
yhat2 <- factor(yhat, levels = levels(y.tr))
#### Performance
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat1)))
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat2)))
```
```{r, echo=FALSE}
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat1)))
sprintf("Accuracy: %.2f", 100 * (1 - classLoss(actual = y.te, predicted = yhat2)))
```
We can see that the **KNN features** considerably improved the classification performance of the GLM model.
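As a quick sanity check of the `k * c` rule described above, we can inspect the dimensions of the extracted feature matrices from the Ionosphere example (here `k = 3` and there are 2 classes, so 6 new features are expected):
```{r}
dim(new.data$new.tr)  # one row per training observation, k * c = 6 columns expected
dim(new.data$new.te)  # one row per test observation, k * c = 6 columns expected
```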
The `knnExtract()` function is based on the ideas presented in the
[winner solution](https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov) of the [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge) on **Kaggle**.
Parallelization is available. You can specify the number of threads via the `nthread` parameter.
### Understanding the KNN Features
KNN performs a nonlinear mapping of the original space, projecting the data into a new feature space in which the classes are linearly separable.
**Mapping the *chess* dataset**
```{r, results='hide'}
library("caTools")
library("fastknn")
library("ggplot2")
library("gridExtra")
## Load data
data("chess")
x <- data.matrix(chess$x)
y <- chess$y
## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)
## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)
```
**Mapping the *spirals* dataset**
```{r, results='hide'}
## Load data
data("spirals")
x <- data.matrix(spirals$x)
y <- spirals$y
## Split data
set.seed(123)
tr.idx <- which(sample.split(Y = y, SplitRatio = 0.7))
x.tr <- x[tr.idx,]
x.te <- x[-tr.idx,]
y.tr <- y[tr.idx]
y.te <- y[-tr.idx]
## Feature extraction with KNN
set.seed(123)
new.data <- knnExtract(x.tr, y.tr, x.te, k = 1)
## Decision boundaries
g1 <- knnDecision(x.tr, y.tr, x.te, y.te, k = 10) +
labs(title = "Original Features")
g2 <- knnDecision(new.data$new.tr, y.tr, new.data$new.te, y.te, k = 10) +
labs(title = "KNN Features")
grid.arrange(g1, g2, ncol = 2)
```