Skip to content
derekeder edited this page Nov 13, 2012 · 25 revisions

Deduplication, entity resolution, record linkage, author disambiguation, and others ...

As different research communities encountered this problem, they each gave it a new name but, ultimately, its all about trying to figure out what records are referring to the same thing.

Dedupe is a command line tool that quickly de-duplicates large sets of data.

Features:

  • machine learning - reads in human labeled data to automatically create optimum weights and blocking rules
  • built as a library - so it can be integrated in to your applications or import scripts
  • extensible - supports adding custom data types, string comparators and blocking rules
  • open source - anyone can use or modify it
  • written in python - targeted at the academic and journalism communities

See our ChiPy presentation for more.

Overview

We can break down our problem into two parts. First we need to figure out some rule that we can use to decide when two records are about the same thing or not. Ideally, we want a rule that always puts together records that belong together and always separates records that are about different things. Usually, we don't have enough information to do that, so then we would like a rule that does as good a job as possible and indicates how uncertain we should be about whether two records really belong together.

Second, because good rules are often very complicated and take time to execute, we cannot compare each record to every other record because it would take much too long. We have to figure out how to make only a small fraction of possible comparisons, yet have those comparisons be the right ones.

The approach that we take is to find good solutions to both of these problems using machine learning. The work is based upon the synthesizing 2006 dissertation of Mikhail Bilenko, /Learnable Similarity Functions and Their Application to Record Linkage and Clustering./

The Dedupe process

1. Data model setup

First, a 'data model' that declares what fields to look at and their data type -- 'String', 'Lat-Long', 'Name', etc is provided. Depending upon the field type, the program will use an appropriate distance measure.

An example data model:

fields = {'name': {'type': 'String'},
          'address': {'type': 'String'},
          'city': {'type': 'String'},
          'cuisine': {'type': 'String'}
         }

Right now, we have only implemented the 'String' type which uses an affine-gap Levenshtein distance. Other field types and distances are easily added.

2. Active learning to determine weights

Using active learning, Dedupe provides a pair of records for the user to label as referring to the same thing, different things, or unclear. For each pair in the sample, the program calculates a distance for each field based upon the data model. Using labeled examples, a regularized logistic classifier finds weights for those field distances so that the weighted sum of these distances represents the uncertainty of whether the pairs are duplicates or distinct.

After a user has labeled an example, this example is added to the training set. Weights are relearned, and the record that the program is now most unsure of is given to the user to label.

3. Train blocking rules

Once the user has finished training the classifier, the training data is used again to find a good blocking of the data. Blocking is about finding a set of features which true duplicates tend to share and which distinct records do not, and then only comparing those records that share that feature.

For example, we might split the data into blocks that all have the same first five characters in an address field. Since most of the data do not share this feature, we will avoid making most comparisons.

Available blocking predicates:

  • whole field
  • token field
  • common integer
  • same three char start
  • same five char start
  • same seven char start
  • near integers
  • common four gram
  • common six gram
  • TF/IDF

We would like to find a way to block the data so that we have an opportunity to compare every true duplicate, but make as few comparisons of true distinct records as possible. Unfortunately, the only way to find the guaranteed optimal blocking is through exhaustive search. Since we would like to consider an large number of possible ways of blocking, we use a greedy algorithm, particularly Chvatal's Greedy Set-Cover algorithm.

4. Apply blocking rules on the original dataset

At the end of training we learned a classification rule for comparing records and a good blocking procedure for deciding which fraction of comparisons to make. We can then apply these to our full dataset. First we define the blocks, which is usually a O(n) operation. Then, based on the blocks, produce the set of candidate record pairs to classify. Once we have that set, we could easily split it into chunks and distribute the relatively expensive comparisons across many machines (parallel computing).

5. Clustering

The classification gives us a score for each candidate pair that represents how certain we are that a pair is a duplicate pair or distinct. However, more than two records can refer to the same thing, and we want to compare these pairwise scores into duplicate clusters. Right now we do this with a fast implementation of agglomerative hierarchical clustering, which is O(n^2) where n is the number of candidate pairs that have a duplicate certainty score above some threshold. At some point, this may get to slow, but since a small fraction of records are duplicates, this algorithm will likely work for very large data.

License Preprocessing

Clone this wiki locally