README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
library(comparator)
```

# comparator: Comparison Functions for Clustering and Record Linkage

<!-- badges: start -->
<!-- badges: end -->

comparator implements comparison functions for clustering and record linkage 
applications. It includes functions for comparing strings, sequences and 
numeric vectors. Where possible, comparators are implemented in C/C++ to 
ensure fast performance.

## Supported comparators

### String comparators:

#### Edit-based:

* `Levenshtein()`: Levenshtein distance/similarity
* `DamerauLevenshtein()` Damerau-Levenshtein distance/similarity
* `Hamming()`: Hamming distance/similarity
* `OSA()`: Optimal String Alignment distance/similarity
* `LCS()`: Longest Common Subsequence distance/similarity
* `Jaro()`: Jaro distance/similarity
* `JaroWinkler()`: Jaro-Winkler distance/similarity

#### Token-based: 

Not yet implemented.

#### Hybrid token-character:

* `MongeElkan()`: Monge-Elkan similarity
* `FuzzyTokenSet()`: Fuzzy Token Set distance

#### Other:

* `InVocabulary()`: Compares strings using a reference vocabulary. Useful for 
  comparing names.
* `Lookup()`: Retrieves distances/similarities from a lookup table
* `BinaryComp()`: Compares strings based on whether they agree/disagree 
  exactly.

### Numeric comparators:

* `Euclidean()`: Euclidean (L-2) distance
* `Manhattan()`: Manhattan (L-1) distance
* `Chebyshev()`: Chebyshev (L-∞) distance
* `Minkowski()`: Minkowski (L-p) distance

## Installation

You can install the latest release from [CRAN](https://CRAN.R-project.org) 
by entering:

``` r
install.packages("comparator")
```

The development version can be installed from GitHub using `devtools`:

``` r
# install.packages("devtools")
devtools::install_github("ngmarchant/comparator")
```

## Example

A comparator is instantiated by calling its constructor function. 
For example, we can instantiate a Levenshtein similarity comparator that 
ignores differences in upper/lowercase characters as follows:

```{r lev}
comparator <- Levenshtein(similarity = TRUE, normalize = TRUE, ignore_case = TRUE)
```

We can apply the comparator to character vectors element-wise as follows:

```{r elementwise-str}
x <- c("John Doe", "Jane Doe")
y <- c("jonathon doe", "jane doe")
elementwise(comparator, x, y)

# shorthand for above
comparator(x, y)
```

This comparator is also defined on sequences:

```{r elementwise-seq}
x_seq <- list(c(1, 2, 1, 1), c(1, 2, 3, 4))
y_seq <- list(c(4, 3, 2, 1), c(1, 2, 3, 1))
elementwise(comparator, x_seq, y_seq)

# shorthand for above
comparator(x_seq, y_seq)
```

Pairwise comparisons are also supported using the following syntax:

```{r pairwise}
# compare each string in x with each string in y and return a similarity matrix
pairwise(comparator, x, y, return_matrix = TRUE)

# compare the strings in x pairwise and return a similarity matrix
pairwise(comparator, x, return_matrix = TRUE)
```