OpenRefine/EditDistanceClusterer

# EditDistanceJoiner

EditDistanceJoiner is a Java library developed by the database group of Tsinghua University. Given a large collection of strings, it can very efficiently (a) select similar string pairs and (b) build clusters of similar strings, with similarity measured by edit distance.
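
To make tasks (a) and (b) concrete, here is a minimal brute-force sketch in Java. It is not this library's code or API: the class `NaiveSimilarityJoin` and all of its method names are hypothetical, and treating a cluster as a connected component of the similar-pair graph is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the EditDistanceJoiner API. */
class NaiveSimilarityJoin {

    /** Classic dynamic-programming Levenshtein distance, keeping two rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Task (a): index pairs (i, j) with i < j whose strings are within the threshold. */
    static List<int[]> similarPairs(List<String> strings, int threshold) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < strings.size(); i++) {
            for (int j = i + 1; j < strings.size(); j++) {
                if (editDistance(strings.get(i), strings.get(j)) <= threshold) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    /** Union-find lookup with path halving. */
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /**
     * Task (b), under the assumption that a cluster is a connected component
     * of the similar-pair graph. Singleton groups are dropped.
     */
    static List<List<String>> clusters(List<String> strings, int threshold) {
        int[] parent = new int[strings.size()];
        for (int i = 0; i < parent.length; i++) parent[i] = i;
        for (int[] p : similarPairs(strings, threshold)) {
            parent[find(parent, p[0])] = find(parent, p[1]);
        }
        Map<Integer, List<String>> groups = new LinkedHashMap<>();
        for (int i = 0; i < parent.length; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>())
                  .add(strings.get(i));
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> g : groups.values()) {
            if (g.size() > 1) result.add(g);
        }
        return result;
    }
}
```

The pair search compares every string with every other string, so its cost grows quadratically with the number of strings. That is fine for a handful of inputs but not for large datasets.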

### How does it work?

This library is based on a method called PassJoin, published at VLDB 2012, which was shown to be orders of magnitude faster than previous methods. The library can process in 2 minutes a dataset that takes 70 minutes with the naive brute-force implementation used in simile-vicino. Moreover, unlike simile-vicino, which uses blocking methods to speed up clustering at the cost of accuracy, this library produces exact results.
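
The pruning idea behind partition-based joins such as PassJoin can be sketched in a few lines. If a string `s` is split into `tau + 1` segments, then `tau` edits can damage at most `tau` of them, so any string within edit distance `tau` of `s` must contain at least one segment of `s` as a substring. The Java sketch below applies that test as a filter. It is a simplified illustration, not this library's implementation: the class `PartitionFilterSketch` and its method names are hypothetical, and it checks each pair directly, without the inverted indexes and substring selection that an efficient join would rely on.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the partition filter; not the library's code. */
class PartitionFilterSketch {

    /**
     * Splits s into tau + 1 contiguous segments of nearly equal length.
     * The last (s.length() % (tau + 1)) segments are one character longer.
     */
    static List<String> segments(String s, int tau) {
        int parts = tau + 1;
        int base = s.length() / parts;
        int longer = s.length() % parts;
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < parts; i++) {
            int len = base + (i >= parts - longer ? 1 : 0);
            out.add(s.substring(pos, pos + len));
            pos += len;
        }
        return out;
    }

    /**
     * Returns false only when r cannot be within edit distance tau of s.
     * A true result means "possibly similar": the pair still has to be
     * verified with an exact edit distance computation.
     */
    static boolean isCandidate(String r, String s, int tau) {
        // Length filter: each edit changes the length by at most one.
        if (Math.abs(r.length() - s.length()) > tau) return false;
        // Pigeonhole filter: at least one segment of s must survive in r.
        for (String seg : segments(s, tau)) {
            if (r.contains(seg)) return true;
        }
        return false;
    }
}
```

Because the filter never rejects a truly similar pair, verifying only the surviving candidates yields the same result as checking every pair, which is consistent with the exactness claim above. When `s` is shorter than `tau + 1` characters, some segments are empty and the filter prunes nothing, which is still correct.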

### Usage

This library uses an interface similar to that of simile-vicino. For examples of joining and clustering, see EditDistanceClustererTest and EditDistanceJoinerTest.
