-
-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Damerau-Levenshtein with character-dependent weights #241
Comments
I think this should start off with a similar interface to the levenshtein distance:
which should be extended by support for float. In addition this could support a custom dict like structure for character weights which behaves similar to: class PairwiseWeights:
...
a = PairwiseWeights()
a["a", "b"] = 0.75 The pure Python version will likely implement this using a dictionary, while for the C++ implementation an efficient storage structure is important. This could look something like this: template<typename T>
class PairwiseWeights
{
std::array<std::array<T, 256>, 256> m_extendedAscii;
my_hashmap<std::tuple<int64_t, int64_t>, T> m_map;
}; This has an overhead of around 256kb for the extended ascii matrix, which would make the common case of Edit: this needs a bit of thought on how to name the containers. For now we probably need one for Substitutions/Transpositions and one for Insertions/Deletions |
Yes, you clearly need an efficient storage + fast retrieval (hashtable-type) structure. As far as I understand, the existing distance implementations work on any unicode strings? Then why limit the character range to extended ASCII? One should be able to provide weights for any Unicode points pair, in my opinion. |
This continues to be true. It is a mixed structure: template<typename U, typename K>
T get(U ch1, K ch2)
{
if (ch1 < 256 && ch2 < 256)
return m_extendedAscii[ch1][ch2];
return m_map.get(ch1, ch2);
} since in many cases strings are ascii only this can improve the access times significantly, while it still supports any hashable type. Especially since Python strings are stored as template<typename U, typename K>
T get(U ch1, K ch2)
{
return m_extendedAscii[ch1][ch2];
} for strings with all characters < 256 |
Note that this will likely take a while to get implemented, since this needs to be pretty generic to cover most use cases. Some examples are:
How do those interact, since you probably want to be able to mix them to some extent. An example is a keyboard layout:
Would it be possible to combine e.g. character dependent and position dependent weights? This might be useful sometimes. In addition I am unsure how normalization would work with these weights? |
I was actually only suggesting the technical possibility to specify these weights, leaving the choice of their actual values to the package user (one could provide default ones to play with if really desired, of course). I have my own custom-made C++ implementation of a Generalized Edit Distance with character-dependent weights,that are fully configurable from simple user-editable txt files. Default values are taken by all those character (pairs) no explicitly mentioned in the config. file, so it enables me to act upon only those values I want to depart from the standard weight. Should you find it helpful, I would be glad to send you my implementation files. Seeing it properly cythonized in RapidFuzz (which I am technically unable to do myself) would be delightful. |
I was describing possible use cases for users of the API to decide on an API. I think for now it would be enough to add character dependent weights. If other weights need an API change this can be done in a later major release. |
Sure. Having a reference implementation would be helpful. |
In relation to issue #180:
It would be great to generalize the foreseen implementation of the Dameau-Levenshtein distance to weights depending not only on the nature of the operation (insert/delete/substitute/transpose), but also on the specific character pairs involved in the operation.
Such weights could be either read from ressource files directly by the C/C++ implementation code, or provided as parameters (Python dictionaries) by the calling Python function.
This would be a very useful and not-so-complicated first step in the direction of context-sensitive string distance functions.
The text was updated successfully, but these errors were encountered: