-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tests that currently trick greedy algorithm #10
Comments
This kind of situation would be very rare in practice. You would need records A, B, C such that dice(A, B) = dice(A, C), which is unlikely. And if this is happening, then your separation between matches and non-matches should be improved. Hungarian algorithm-based solvers do perform marginally better under some conditions. However, they have complexity O(V²E), where V is the total number of records and E is the number of candidate pairs (the 2-party greedy solver is O(E)). They are also difficult to apply as they don’t operate on raw similarity scores, but on the probabilities of a match (the threshold represents a probability of ½). You can feed them the raw similarity scores, but this doesn’t have a good theoretical justification. I’m not sure there is much point pursuing this. @hardbyte are you happy to close the issue? |
Are you sure about that? I think in my first example Happy if you want to remove the |
If we let, dice(C, 1) = .85 in the first example, then greedy produces the mapping {(A, 1), (B, 3)}. You claim (please correct me if I’m wrong) that the mapping {(A, 2), (B, 3), (C, 1)} is superior. This is because two links with similarities .8 and .85 should take precedence over one link with similarity .9. But why is that? Should we be optimising for the highest sum of similarities? This doesn’t quite make sense because we would prefer similarities .8 + .85 over a pair with similarity 1, when a pair with similarity 1 should clearly always be accepted. Should we be going for the highest number of accepted pairs above the threshold? This fails for similar reasons. So in the case that we let dice(C, 1) = .85, it is not clear that the matching you propose is actually superior. It may well be, but this needs better justification. To really make this call, we’d have to go further than similarity scores, and actually derive the probability p(ismatch(X, Y)). We probably want dice(X, Y) = 0 ⟹ p(ismatch(X, Y)) = 0, dice(X, Y) = threshold ⟹ p(ismatch(X, Y)) = ½, and dice(X, Y) = 1 ⟹ p(ismatch(X, Y)) = 1. These probabilities are nontrivial to compute from raw similarity scores, although we could approximate. I have a solver that will take these probabilities into account to come up with the theoretically most likely matching. However, this solver performs only marginally better than greedy in practice. (And that’s if you know the exact probabilities!) |
Our greedy algorithm currently fails matching the following graph, where the connection between a and 1 looks likely, but ultimately shouldn't be chosen.
The network methods should succeed, and for now the greedy algorithm should fail. Correct mapping is:
{a: 2, b: 3, c: 1}
A similar variant where the result will probably be different between the methods (same desired output):
This issue tracks these issues - they exist as tests marked as
expectedFailure
.Aha! Link: https://csiro.aha.io/features/ANONLINK-81
The text was updated successfully, but these errors were encountered: