Weirdly High scores (False Positives) #41

Open
Pranav082001 opened this issue Aug 2, 2022 · 1 comment

Comments

@Pranav082001

I have been experimenting with PolyFuzz for a while and have observed some weird scoring behaviour. In the following case I get a very high score of 0.9 (90%) even though the strings are hardly equal. I would not expect such high scores just because of the common substring "america"; a plain edit-distance similarity against the list1 strings would be very low.

from polyfuzz import PolyFuzz

list1 = ["american Futures and Options Exchange", "America First Credit Union"]
list2 = ["america"]
model = PolyFuzz("EditDistance").match(list1, list2)
data = model.get_matches()
print(data)
                                    From       To  Similarity
0  american Futures and Options Exchange  america         0.9
1             America First Credit Union  america         0.9

Any workaround would be appreciated... Thanks!

@MaartenGr
Owner

The default edit-distance model is backed by RapidFuzz; more specifically, it scores each pair with the WRatio method. The output is expected for that scoring function. You can check it with the following:

>>> from rapidfuzz import process, fuzz
>>> match = process.extractOne(list2[0], list1, scorer=fuzz.WRatio)
>>> match
('american Futures and Options Exchange', 90.0, 0)
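To see where the 90 comes from, here is a rough pure-Python sketch. As far as I understand WRatio, when one string is much longer than the other it falls back to a partial (best-substring) match and scales that by 0.9. The helpers below (`levenshtein`, `similarity`, `best_window_similarity`) are hypothetical names written for illustration and are not RapidFuzz's implementation; the 0.9 factor is my assumption about how WRatio weights partial matches.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Whole-string similarity on a 0-100 scale, normalised by the longer length."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def best_window_similarity(short: str, long: str) -> float:
    """Best similarity of the shorter string against any equally long window."""
    if len(short) > len(long):
        short, long = long, short
    n = len(short)
    return max(similarity(short, long[i:i + n]) for i in range(len(long) - n + 1))


query = "america"
candidate = "american futures and options exchange"

full = similarity(query, candidate)                 # penalises the length gap
partial = best_window_similarity(query, candidate)  # "america" is a substring
scaled = 0.9 * partial                              # assumed WRatio-style scaling

print(f"full={full:.1f} partial={partial:.1f} scaled={scaled:.1f}")
```

The whole-string score stays low because 30 extra characters have to be inserted, while the substring score is perfect, so any scorer that takes the maximum of the two will report the match as very strong.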

If you want to use a different edit distance technique, you can do something like this instead:

from polyfuzz import PolyFuzz
from polyfuzz.models import EditDistance
from rapidfuzz.distance import Levenshtein

list1 = ["american Futures and Options Exchange","America First Credit Union"]
list2 = ["America"]

distance = EditDistance(scorer=Levenshtein.distance, normalize=False)
model = PolyFuzz(distance).match(list1, list2)
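If you would rather keep a similarity between 0 and 1 than a raw distance, a custom scorer is another option. The sketch below uses `difflib.SequenceMatcher` from the standard library; `lowercase_ratio` is a hypothetical helper, and the commented-out wiring assumes that `EditDistance` accepts any callable taking two strings and returning a float, which I have not verified for every PolyFuzz version.

```python
from difflib import SequenceMatcher


def lowercase_ratio(a: str, b: str) -> float:
    """Hypothetical scorer: case-insensitive whole-string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


list1 = ["american Futures and Options Exchange", "America First Credit Union"]
list2 = ["America"]

# Score every candidate against the single query string.
scores = {name: lowercase_ratio(name, list2[0]) for name in list1}
for name, score in scores.items():
    print(f"{name!r} -> {score:.2f}")

# Assumed wiring into PolyFuzz (not executed here):
# from polyfuzz import PolyFuzz
# from polyfuzz.models import EditDistance
# model = PolyFuzz(EditDistance(scorer=lowercase_ratio)).match(list1, list2)
```

Because the ratio is computed over both full strings, a short query that only matches the start of a long name scores well below 0.5 instead of 0.9.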
