Fuzzymatching

Match to strings based on String Distance.

Problem:

I was tasked to identify institutions from a Do Not Contact (DNC) list (n=2,000) and remove them from our master list (n=1,500,000). Unforunately, there were no unique IDs associated with the smaller DNC list, instead it only contains their institution name. Both lists were developed through manual data entry, leaving large inconsistencies in the names an organization can be found as on both lists (e.g., spelling errors, abreviations, white space, transpositions of words, etc.). The volume required for manual review of all DNC institutions was higher than fesibile possible given our resources.

Solution

Using the "stringdist" package in R, I was able to calculate the "distance" between two strings, essentially how similar they are. This generally took the form of how many transformations (Insertions, Deletions, Transpositions) needed to make one string into another.

However, the challege with this approach is that we need to compare every string in the DNC list with every string in the master list. Doing so using a vectorized approach would create an extremely large matrix of distances to calculate (1,500,000 X 2,000). So instead, I opted for a iterative approach, looping each record in the master list against the DNC list (creating a smaller 1 x 2,000 matrix), the best match was retained and stored in cumulative dataset (along with the distance value of the match).

The purpose was to speed up manual review rather than automate the entire process (accuracy was important). Final output data was used by reviewers to quickly exclude matches with small and large differences, and focus on ambigous cases best left for human reviewers.

Finally, I generalized this approach into a function to be used for similar problems.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Fuzzy String Match FunctionV1.R		Fuzzy String Match FunctionV1.R
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fuzzymatching

Problem:

Solution

About

Releases

Packages

Languages

Adamishere/Fuzzymatching

Folders and files

Latest commit

History

Repository files navigation

Fuzzymatching

Problem:

Solution

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages