Is there alias feature ? #249

Closed
mkandulavm opened this issue Aug 16, 2022 · 6 comments

Comments

@mkandulavm

Hi

Is there a way to provide aliases?
For example, "street", "st", and "road" could be aliases of one another in some scenarios.

How can this be done? Thank you.

@maxbachmann
Member

maxbachmann commented Aug 16, 2022

So far there is no way to alias characters or words. There is already a request for character-dependent weights: #241.
Once implemented, this would allow you to alias individual elements by setting their substitution cost to 0. It would still only work on individual symbols (here, whole tokens in a list):

from rapidfuzz.distance import Levenshtein

Levenshtein.distance(["street", "road"], ["st", "st"])  # result is 2

# Proposed in #241, not implemented yet: per-element substitution weights
weights = ...
weights["street", "st"] = 0
weights["st", "street"] = 0
weights["road", "st"] = 0
weights["st", "road"] = 0
weights["street", "road"] = 0
weights["road", "street"] = 0
Levenshtein.distance(["street", "road"], ["st", "st"], weights=weights)  # result would be 0

which might be enough for your use case.
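
Until #241 is implemented, the idea can be illustrated without the library. The following is a minimal pure-Python sketch, not RapidFuzz's API: it is the textbook dynamic-programming (Wagner-Fischer) edit distance over token lists, assuming unit insertion and deletion costs. The names `weighted_levenshtein`, `alias_cost`, `uniform_cost`, and `ALIAS_GROUP` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Hashable, Sequence

# Hypothetical alias group: any two members substitute for each other at cost 0.
ALIAS_GROUP = frozenset({"street", "st", "road"})


def uniform_cost(a: Hashable, b: Hashable) -> int:
    """Plain Levenshtein substitution cost: 0 for equal elements, 1 otherwise."""
    return 0 if a == b else 1


def alias_cost(a: Hashable, b: Hashable) -> int:
    """Substitution cost that treats members of ALIAS_GROUP as interchangeable."""
    if a == b or (a in ALIAS_GROUP and b in ALIAS_GROUP):
        return 0
    return 1


def weighted_levenshtein(
    s1: Sequence[Hashable],
    s2: Sequence[Hashable],
    substitution_cost: Callable[[Hashable, Hashable], int],
) -> int:
    """Edit distance with unit insert/delete cost and a custom substitution cost."""
    previous = list(range(len(s2) + 1))  # distance from the empty prefix of s1
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,                            # delete a
                current[j - 1] + 1,                         # insert b
                previous[j - 1] + substitution_cost(a, b),  # substitute a -> b
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein(["street", "road"], ["st", "st"], uniform_cost))  # 2
print(weighted_levenshtein(["street", "road"], ["st", "st"], alias_cost))    # 0
```

Because the cost is looked up per element pair, this only aliases whole tokens, which matches the limitation described above.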

@mkandulavm
Author

mkandulavm commented Aug 16, 2022

This is exactly what I need!
But is an equivalent call exposed in C++?

Also, which scorer is best for such scenarios, given that tokens can appear in any order?

@maxbachmann
Member

maxbachmann commented Aug 16, 2022

> But is an equivalent call exposed in C++?

So far this feature exists in neither the Python nor the C++ library. However, it will definitely be implemented in C++, and the Python implementation will only wrap it. It will extend: https://github.com/maxbachmann/rapidfuzz-cpp/blob/d937555ad76a6f1ed853ab4b7102a7b22b6f0fcf/rapidfuzz/distance/Levenshtein.hpp#L142

> Also, which scorer is best for such scenarios, given that tokens can appear in any order?

Right now the feature is only planned for Levenshtein/OSA/DamerauLevenshtein. None of those sort the tokens before comparing them.

@i30817

i30817 commented Sep 2, 2022

You can also preprocess the input strings so that the fuzzy matching runs on x and the result is a pair (x, y), where y is the original string. During preprocessing, replace every word that should score the same with one canonical word in x.

This is heavy on string manipulation, but if you want to use one of the token-sorting scorers, such as token_set_ratio, this is how you can do it.
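
A minimal sketch of that preprocessing step, assuming a small hand-written alias table. `CANONICAL`, `canonicalize`, and `with_canonical` are hypothetical names, and the tokenization rule (lowercase, keep only ASCII letters and digits) is just one possible choice:

```python
import re

# Hypothetical alias table: every alias maps to one canonical token.
CANONICAL = {"st": "street", "road": "street"}


def canonicalize(text: str) -> str:
    """Lowercase, split into alphanumeric tokens, and replace aliases."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(CANONICAL.get(tok, tok) for tok in tokens)


def with_canonical(originals):
    """Return (x, y) pairs: x is the canonical form to match on, y the original."""
    return [(canonicalize(y), y) for y in originals]


pairs = with_canonical(["12 Main St.", "12 Main Road", "5 Oak Avenue"])
for x, y in pairs:
    print(f"{x!r} <- {y!r}")
```

The scorer then compares only the x values, and the matching y is reported back to the user.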

The more words you replace (or the more similar tricks you apply, such as stripping combining accent characters), the more likely it is that two or more candidates tie for the best score, which can lead to inconsistent results across repeated runs on the same dataset.

If that matters, fetch the top 2 or 3 results (or keep going until the scores differ) and check whether they share the same score. If they do, either choose a consistent order to pick the 'winner' or, if both are valid and you are able to, combine the results.
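
One way to make the tie handling deterministic, sketched under the assumption that results arrive as `(score, original)` tuples. `best_matches` is a hypothetical helper, and sorting ties by the original string is only one possible "consistent order":

```python
def best_matches(scored, tolerance=0.0):
    """Return every (score, original) entry tied for the best score, in a stable order."""
    if not scored:
        return []
    top = max(score for score, _ in scored)
    tied = [item for item in scored if top - item[0] <= tolerance]
    # Sort ties by the original string so the winner does not depend on input order.
    return sorted(tied, key=lambda item: item[1])


scored = [(100.0, "12 Main St."), (80.0, "5 Oak Avenue"), (100.0, "12 Main Road")]
ties = best_matches(scored)
winner = ties[0]
print(ties)    # both 100.0 entries, "12 Main Road" first
print(winner)
```

If all tied candidates are valid, the caller can use the whole `ties` list instead of only `winner`.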

@maxbachmann
Member

> This is heavy on string manipulation, but if you want to use one of the token-sorting scorers, such as token_set_ratio, this is how you can do it.

This can be faster than using weights for Levenshtein for this purpose, since the weighted Levenshtein distance is quite a bit slower to calculate than the uniform Levenshtein distance. So, for example, when comparing a string to a list of known strings that you can preprocess ahead of time, it is likely faster to preprocess the strings yourself.
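
A rough sketch of the "preprocess ahead of time" pattern: the known strings are canonicalized once when the index is built, and each query only canonicalizes itself. `AliasIndex`, `ALIASES`, and `canonicalize` are hypothetical names, and the standard library's `difflib.SequenceMatcher` on sorted tokens is used here purely as a stand-in for a token-sorting scorer; it is not what RapidFuzz computes.

```python
import difflib

# Hypothetical alias table: every alias maps to one canonical token.
ALIASES = {"st": "street", "road": "street"}


def canonicalize(text: str) -> str:
    """Lowercase, drop simple punctuation, replace aliases, and sort the tokens."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(sorted(ALIASES.get(tok, tok) for tok in tokens))


class AliasIndex:
    def __init__(self, choices):
        # Canonical forms are computed once here, not once per query.
        self._entries = [(canonicalize(y), y) for y in choices]

    def best(self, query):
        """Return (score, original) for the closest known string."""
        q = canonicalize(query)
        scored = [
            (difflib.SequenceMatcher(None, q, x).ratio(), y)
            for x, y in self._entries
        ]
        return max(scored, key=lambda item: item[0])


index = AliasIndex(["12 Main Street", "5 Oak Avenue"])
score, original = index.best("Main St. 12")
print(score, original)  # 1.0 12 Main Street
```

With a real scorer the same structure applies: match against the stored canonical strings and return the original alongside the score.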

@maxbachmann
Member

Closing this, since it is tracked as part of #241
