Skip to content

Grouping duplicates

derekeder edited this page Feb 27, 2013 · 8 revisions

Once we have calculated the probability that pairs of record are duplicates or not, we still have a kind of thorny problem because it's not just pairs of records that can be duplicates. Three, four, thousands of records could all refer to the same organization, but we only have pairwise measures.

Let's say we have measured the following pairwise probabilities between records A, B, and C.

A -- 0.6 -- B -- 0.6 -- C 

The probability that A and B are duplicates is 60%, the probability that B and C are duplicates is 60%, but what is the probability that A and C is 60%?

Let's say that everything is going perfectly and we can say there's a 36% probability that A and C are duplicates. It looks like we would be better off saying that A and C should maybe not be considered duplicates.

Okay, then should we say that A and B are a duplicate pair and C is a distinct record or that A is the distinct record and that B and C are duplicates?

Well... these are thorny problems and we tried solving them a few different ways. In the end, we found that hierarchical clustering with centroid linkage gave us the best results. What this algorithm does is say that all points within some distance of centroid are part of the same group. In this example B would be the centroid, and A, B, C and would all be put in the same group.

Unfortunately, a much more principled answer does not exist because we the estimated pairwise probabilities are not transitive.

Clustering the groups depends on us setting a threshold for group membership--the distance of the points to the centroid. Depending on how we choose that threshold, we'll get very different groups, and we will want to choose this threshold wisely.

Clone this wiki locally