Values with the same hash are treated as equal #37
I wouldn't call this a bug. It is clearly documented that editdistance uses the hash value to compare two objects for equality. I do not really see a better way to achieve the similarity check.
I agree that this is essentially a bug, though a design bug rather than an implementation bug. The natural design is to check the objects for equality using ==, which would enable the OP's example to work. What would be the problem with checking for object equality rather than for hash-code equality? (Are you just concerned about speed?)

Another way of looking at this: what if you want to compute edit distance at the word level rather than the character level, so that you pass editdistance.eval() two sequences of string (str) instances? The eval() function should compare the strings for equality, not for identical hash codes. Not all strings have unique hash codes, so your library currently appears unable to compute word-level edit distance on sequences of strings correctly.
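For concreteness, word-level usage looks like this (a sketch using the library's public eval() call; the token values are illustrative):

```python
import editdistance

# Word-level edit distance: the elements being compared are whole
# words, not characters. Two substitutions ("quick" -> "slow" and
# "fox" -> "dog") give a distance of 2.
s1 = "the quick brown fox".split()
s2 = "the slow brown dog".split()
print(editdistance.eval(s1, s2))  # 2
```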
Yes, this implementation has the goal of being very fast. E.g. for:

```python
s1 = ["aaa"] * 64
s2 = ["bab"] * 64
editdistance.eval(s1, s2)
```

comparing elements with == instead of by hash would change the time complexity: each of the O(M·N) cell updates would need an O(K) element comparison instead of an O(1) hash comparison.
This is a trade-off between performance and correctness. You are always able to provide your own hash implementation to reduce the probability of a hash collision on your dataset. Or, if performance is no concern for you, you can write a simple implementation of Wagner-Fischer for this (see the sketch below). In that case this library is probably simply not the right choice for your application.
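A minimal Wagner-Fischer sketch along those lines, comparing elements with == instead of by hash. The function name is illustrative; this is a plain-Python example, not part of this library:

```python
def wagner_fischer(a, b):
    """Levenshtein distance between two sequences, comparing
    elements with == (correct, but slower than hashing)."""
    m, n = len(a), len(b)
    # prev[j] = distance between a[:i-1] and b[:j] (the previous row)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]
```

With the earlier example, wagner_fischer(["aaa"] * 64, ["bab"] * 64) returns 64 regardless of any hash collisions.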
There are a few things there I don't understand, possibly due to my own ignorance. Edit distance is said to be O(MN), where M and N are the lengths of the sequences, so why wouldn't your time complexity be O(N^2) to start, since in your example M = N = 64?

So far we've presumably been treating the average time to compare two sequence elements as essentially constant, and certainly unrelated to either M or N. Let's assume the string elements are all of equal length K (3 in your example of "aaa" and "bab"). Naively it would take O(K) character comparisons to compare two such strings, so for string elements of equal length wouldn't we have O(MNK) with the naive comparison?

But it seems to me that string equality checking is implemented by first checking the hash codes, and only if they are equal checking the individual characters one by one. The latter would very rarely be necessary, so there would not be a large performance impact. That's the point of a good hash code. You might say that you have to compute the hash codes in the first place, but that's only O((M+N)K), and I believe Python will cache the hash codes for the strings, so the hashes only need to be computed once.

I do appreciate having the library; I'm just thinking about whether it can be better.
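On the caching point: CPython does cache a str object's hash after the first computation, which can be observed directly (a CPython-specific illustration; the string size and timings are arbitrary):

```python
import time

s = "x" * 50_000_000  # long enough that hashing it is measurable

start = time.perf_counter()
hash(s)  # first call walks the whole string
first = time.perf_counter() - start

start = time.perf_counter()
hash(s)  # CPython returns the cached hash: near-instant
second = time.perf_counter() - start

print(f"first:  {first:.6f}s")
print(f"second: {second:.6f}s")  # orders of magnitude smaller
```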
Maybe it is not as bad for strings, since at least strings can be interned by Python. So an implementation could use the following equality check:

```python
equal = (hash1 == hash2) and (obj1 is obj2 or obj1 == obj2)
```

which would be pretty much free for interned strings. Edit: I forgot this is only done for strings known from the script and not for runtime-generated strings, which should be a lot more common when string matching.

Still, it could make sense to change the API in the following way:

```python
s1 = ["aaa"] * 64
s2 = ["aaa"] * 64

editdistance.eval(s1, s2)
# use hash + eq

editdistance.eval(editdistance.HashSequence(s1), editdistance.HashSequence(s2))
# directly work with hashes
```
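A rough sketch of what such a wrapper could look like; HashSequence is the name proposed above, nothing here exists in the editdistance API today, and eval() is assumed to accept any sequence of hashable values:

```python
class HashSequence:
    """Hypothetical wrapper: pre-hash every element once so the
    matcher can work directly on integer hash values."""
    def __init__(self, seq):
        self._hashes = [hash(x) for x in seq]

    def __len__(self):
        return len(self._hashes)

    def __getitem__(self, i):
        return self._hashes[i]

# Opt-in usage under this sketch: callers explicitly accept that
# colliding hashes will be treated as equal.
# editdistance.eval(HashSequence(s1), HashSequence(s2))
```

The design point is that hashing happens exactly once per element, and hash-only comparison becomes an explicit opt-in rather than a silent default.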
That sounds plausible. Thanks for your consideration and for the information. I may come back to this later.

I think your point is that for sequences whose edit distances are small relative to the sequence lengths, there will be a lot of hash collisions (from equal elements), so I was incorrect to say that collisions are rare. Good point. I think referring to "similar elements" is not a good description of the situation, though.
The following test fails because all of the `Bug` objects are incorrectly treated as equal:
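A minimal test in this spirit (Bug is the class name from the issue text; everything else is illustrative):

```python
import editdistance

class Bug:
    """Instances are never equal to each other, but all share one hash."""
    def __hash__(self):
        return 42  # a constant hash is valid, just collision-prone
    def __eq__(self, other):
        return self is other

a = [Bug(), Bug(), Bug()]
b = [Bug(), Bug(), Bug()]

# No element of a equals any element of b, so the true edit distance
# is 3 -- but the hash-based comparison treats every Bug as equal and
# reports 0, so this assertion fails.
assert editdistance.eval(a, b) == 3
```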
Note that the implementation of `__hash__` here is perfectly valid; to quote from https://docs.python.org/3/reference/datamodel.html#object.__hash__:

> The only required property is that objects which compare equal have the same hash value

In particular, there is no requirement that objects with the same hash value compare equal, as can also be demonstrated with even built-in types like `int`:
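For example, in CPython:

```python
# CPython reserves -1 as an error sentinel for hash functions at the
# C level, so hash(-1) is remapped to -2:
assert hash(-1) == hash(-2) == -2
assert -1 != -2

# Integer hashing is modular (mod 2**61 - 1 on 64-bit CPython), so
# widely separated ints can collide as well:
assert hash(0) == hash(2**61 - 1) == 0
assert 0 != 2**61 - 1
```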