Name Match

Popular fuzzy name matching algorithms written in Rust. Name matching is not easy due to variation arising from spelling, missing components and pronunciation. These algorithms are designed to capture similarity between non-identical name sequences by assigning probability to a match between 0.0 and 1.0. Input names are preprocessed and transformed before comparison.

Implementation Note

You'll find the following fuzzy string matching algorithms in this library:

Jaro Winkler Distance
Jaccard Index
Soundex
Soundex-Jaccard

Each of these algorithms excel at solving different challenges of name matching. You'll find that they tend to be rather complementary. This suggests that they work well in combination.

To better tackle the challenge of name matching, each algorithm comes with a preprocessing stage and a comparison stage. Input names undergo a series of preprocessing first before similary score is calculated in the subsequent comparison stage.

1. Jaro Winkler Distance (Classic)

Measures edit distance between two strings. The higher the score, the more similar the strings are. See wikipedia.

Use Case

Strength:

Spelling mistakes and typos such as 'John' vs 'Jon', 'SmellyFish' vs 'JellyFish'.
High Precision

Weakness:

Name transpositions such as 'John Doe' vs 'Doe, John'. Also lacking linguistic nuances

Preprocessing

Non alpha-numeric characters are converted to whitespace.
Trim leading and ending whitespaces.
All characters are converted to uppercase.

Usage

Refer to example.

let name1 = "John Doe"
let name2 = "Jon Doe"
let name_matcher = compare::JaroWinklerMatcher::default();
let score = name_matcher.get_score(name_1, name_2); //0.9666667

2. Jaccard Index (Classic)

Measures overlapping tokens between two strings, defined by their intersection divided by the size of their union. The higher the score, the more similar the strings are. Names are tokenized by whitespace before score is calculated. See wikipedia.

Use Case

Strength:

Name transpositions. 'John Doe' vs 'Doe John'.
Missing name components. 'John Adam Doe' vs 'John Doe'

Weakness:

Spelling mistakes. Slight character variation within a token can drastically reduce string similarity.

Preprocessing

Non alpha-numeric characters are converted to whitespace.
Trim leading and ending whitespaces.
All characters are converted to uppercase.

Usage

Refer to example.

let name1 = "John Doe"
let name2 = "Jon Doe"
let name_matcher = compare::JaccardMatcher::default();
let score = name_matcher.get_score(name_1, name_2); // 0.33333

3. Soundex (Classic)

Measures phoentic similarity between strings. Names are encoded in their Soundex form before comparison. See wikipedia. Score is binary [0.0, 1.0].

Use Case

Strength:

Similar sounding names "Robert" vs "Rupert"

Weakness:

Low Precision.
Last name is omitted using naive implementation. "Robert Doe" (R163) vs "Rupert John" (R163) (See improvement)
Name transposition

Preprocessing

Non alpha-numeric characters are converted to whitespace.
Trim leading and ending whitespaces.
All characters are converted to uppercase.
Reduced to their Soundex form.

Usage

Refer to example.

let name1 = "Roberto Doe"
let name2 = "Rupert Doe"
let name_matcher = compare::SoundexMatcher::default();
let score = name_matcher.get_score(name_1, name_2); // 1.0

4. Soundex-Jaccard (Custom)

An improvement over Naive Soundex. Each name is tokenized first before encoded in Soundex form. Jaccard Index of the two names is then calculated. The higher the score, the more similar the strings.

Illustration: "John Robert Doe" vs "Jim Rupert"

Tokenized:
- {"John", "Robert", "Doe"} vs {"Jim", "Rupert"}
Apply soundex:
- {"J500", "R163", "D00"} vs {"J500", "R163"}
Jaccard Index:
- Ovelapping = {"J500", "R163"}
- Union = {"J500", "R163", "D00"}
- 2/3 = 0.66667

Use Case

Strength:

Higher recall than Classic Soundex.
Every components within a name are taken into consideration.
Works on transposed name. "Robert Jim" vs "John Rupert"

Weakness:

Low Precision.

Preprocessing

Non alpha-numeric characters are converted to whitespace.
Trim leading and ending whitespaces.
All characters are converted to uppercase.
Reduced to their Soundex form.

Usage

Refer to example.

let name1 = "Roberto Doe"
let name2 = "Rupert Doe"
let name_matcher = compare::SoundexJaccardMatcher::default();
let score = name_matcher.get_score(name_1, name_2); // 1.0

Practical Use

Each algorithm has its own set of weaknesses. Hence, a better approach would be to construct an ensemble model by combining two or more of such algorithms. It is important to have a balanced mix of algorithms with high recall and high precision by weighing their score equally. This can be achieved using this library by setting the weight of every Matcher instances upfront.

Refer to example.

Applications

Financial Compliance (Sanction List, PEP, etc)
Identity Verification

Current Limitations

Only works on ASCII characters.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
benches		benches
examples		examples
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Name Match

Implementation Note

1. Jaro Winkler Distance (Classic)

Use Case

Preprocessing

Usage

2. Jaccard Index (Classic)

Use Case

Preprocessing

Usage

3. Soundex (Classic)

Use Case

Preprocessing

Usage

4. Soundex-Jaccard (Custom)

Use Case

Preprocessing

Usage

Practical Use

Applications

Current Limitations

About

Releases

Packages

Languages

License

henrybyhee/fuzzy_name_match

Folders and files

Latest commit

History

Repository files navigation

Name Match

Implementation Note

1. Jaro Winkler Distance (Classic)

Use Case

Preprocessing

Usage

2. Jaccard Index (Classic)

Use Case

Preprocessing

Usage

3. Soundex (Classic)

Use Case

Preprocessing

Usage

4. Soundex-Jaccard (Custom)

Use Case

Preprocessing

Usage

Practical Use

Applications

Current Limitations

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages