Popular fuzzy name matching algorithms written in Rust. Name matching is not easy due to variation arising from spelling, missing components and pronunciation. These algorithms are designed to capture similarity between non-identical name sequences by assigning probability to a match between 0.0 and 1.0. Input names are preprocessed and transformed before comparison.
You'll find the following fuzzy string matching algorithms in this library:
Each of these algorithms excel at solving different challenges of name matching. You'll find that they tend to be rather complementary. This suggests that they work well in combination.
To better tackle the challenge of name matching, each algorithm comes with a preprocessing stage and a comparison stage. Input names undergo a series of preprocessing first before similary score is calculated in the subsequent comparison stage.
Measures edit distance between two strings. The higher the score, the more similar the strings are. See wikipedia.
Strength:
- Spelling mistakes and typos such as 'John' vs 'Jon', 'SmellyFish' vs 'JellyFish'.
- High Precision
Weakness:
Name transpositions such as 'John Doe' vs 'Doe, John'. Also lacking linguistic nuances
- Non alpha-numeric characters are converted to whitespace.
- Trim leading and ending whitespaces.
- All characters are converted to uppercase.
Refer to example.
let name1 = "John Doe"
let name2 = "Jon Doe"
let name_matcher = compare::JaroWinklerMatcher::default();
let score = name_matcher.get_score(name_1, name_2); //0.9666667
Measures overlapping tokens between two strings, defined by their intersection divided by the size of their union. The higher the score, the more similar the strings are. Names are tokenized by whitespace before score is calculated. See wikipedia.
Strength:
- Name transpositions. 'John Doe' vs 'Doe John'.
- Missing name components. 'John Adam Doe' vs 'John Doe'
Weakness:
Spelling mistakes. Slight character variation within a token can drastically reduce string similarity.
- Non alpha-numeric characters are converted to whitespace.
- Trim leading and ending whitespaces.
- All characters are converted to uppercase.
Refer to example.
let name1 = "John Doe"
let name2 = "Jon Doe"
let name_matcher = compare::JaccardMatcher::default();
let score = name_matcher.get_score(name_1, name_2); // 0.33333
Measures phoentic similarity between strings. Names are encoded in their Soundex form before comparison. See wikipedia. Score is binary [0.0, 1.0].
Strength:
- Similar sounding names "Robert" vs "Rupert"
Weakness:
- Low Precision.
- Last name is omitted using naive implementation. "Robert Doe" (R163) vs "Rupert John" (R163) (See improvement)
- Name transposition
- Non alpha-numeric characters are converted to whitespace.
- Trim leading and ending whitespaces.
- All characters are converted to uppercase.
- Reduced to their Soundex form.
Refer to example.
let name1 = "Roberto Doe"
let name2 = "Rupert Doe"
let name_matcher = compare::SoundexMatcher::default();
let score = name_matcher.get_score(name_1, name_2); // 1.0
An improvement over Naive Soundex. Each name is tokenized first before encoded in Soundex form. Jaccard Index of the two names is then calculated. The higher the score, the more similar the strings.
Illustration: "John Robert Doe" vs "Jim Rupert"
- Tokenized:
- {"John", "Robert", "Doe"} vs {"Jim", "Rupert"}
- Apply soundex:
- {"J500", "R163", "D00"} vs {"J500", "R163"}
- Jaccard Index:
- Ovelapping = {"J500", "R163"}
- Union = {"J500", "R163", "D00"}
- 2/3 = 0.66667
Strength:
- Higher recall than Classic Soundex.
- Every components within a name are taken into consideration.
- Works on transposed name. "Robert Jim" vs "John Rupert"
Weakness:
- Low Precision.
- Non alpha-numeric characters are converted to whitespace.
- Trim leading and ending whitespaces.
- All characters are converted to uppercase.
- Reduced to their Soundex form.
Refer to example.
let name1 = "Roberto Doe"
let name2 = "Rupert Doe"
let name_matcher = compare::SoundexJaccardMatcher::default();
let score = name_matcher.get_score(name_1, name_2); // 1.0
Each algorithm has its own set of weaknesses. Hence, a better approach would be to construct an ensemble model by combining two or more of such algorithms.
It is important to have a balanced mix of algorithms with high recall and high precision by weighing their score equally. This can be achieved using this
library by setting the weight
of every Matcher
instances upfront.
Refer to example.
- Financial Compliance (Sanction List, PEP, etc)
- Identity Verification
- Only works on ASCII characters.