Tests that currently trick greedy algorithm #10

hardbyte · 2017-05-30T01:41:50Z

Our greedy algorithm currently fails matching the following graph, where the connection between a and 1 looks likely, but ultimately shouldn't be chosen.

The network methods should succeed, and for now the greedy algorithm should fail. Correct mapping is:
{a: 2, b: 3, c: 1}

A similar variant where the result will probably be different between the methods (same desired output):

This issue tracks these issues - they exist as tests marked as expectedFailure.

Aha! Link: https://csiro.aha.io/features/ANONLINK-81

The text was updated successfully, but these errors were encountered:

nbgl · 2018-11-07T02:11:45Z

This kind of situation would be very rare in practice. You would need records A, B, C such that dice(A, B) = dice(A, C), which is unlikely. And if this is happening, then your separation between matches and non-matches should be improved.

Hungarian algorithm-based solvers do perform marginally better under some conditions. However, they have complexity O(V²E), where V is the total number of records and E is the number of candidate pairs (the 2-party greedy solver is O(E)). They are also difficult to apply as they don’t operate on raw similarity scores, but on the probabilities of a match (the threshold represents a probability of ½). You can feed them the raw similarity scores, but this doesn’t have a good theoretical justification.

I’m not sure there is much point pursuing this. @hardbyte are you happy to close the issue?

hardbyte · 2018-11-07T11:31:12Z

You would need records A, B, C such that dice(A, B) = dice(A, C), which is unlikely.

Are you sure about that? I think in my first example dice(C, 1) just needs to be between 0.8 and 0.9 for our greedy solver to fail. The second example is the tie breaker condition which I agree is very unlikely.

Happy if you want to remove the expectedFailure tests and close.

nbgl · 2018-11-08T00:26:25Z

If we let, dice(C, 1) = .85 in the first example, then greedy produces the mapping {(A, 1), (B, 3)}. You claim (please correct me if I’m wrong) that the mapping {(A, 2), (B, 3), (C, 1)} is superior. This is because two links with similarities .8 and .85 should take precedence over one link with similarity .9.

But why is that? Should we be optimising for the highest sum of similarities? This doesn’t quite make sense because we would prefer similarities .8 + .85 over a pair with similarity 1, when a pair with similarity 1 should clearly always be accepted. Should we be going for the highest number of accepted pairs above the threshold? This fails for similar reasons.

So in the case that we let dice(C, 1) = .85, it is not clear that the matching you propose is actually superior. It may well be, but this needs better justification.

To really make this call, we’d have to go further than similarity scores, and actually derive the probability p(ismatch(X, Y)). We probably want dice(X, Y) = 0 ⟹ p(ismatch(X, Y)) = 0, dice(X, Y) = threshold ⟹ p(ismatch(X, Y)) = ½, and dice(X, Y) = 1 ⟹ p(ismatch(X, Y)) = 1. These probabilities are nontrivial to compute from raw similarity scores, although we could approximate.

I have a solver that will take these probabilities into account to come up with the theoretically most likely matching. However, this solver performs only marginally better than greedy in practice. (And that’s if you know the exact probabilities!)

hardbyte mentioned this issue May 30, 2017

Calculate nodes' degree to improve greedy method's accuracy #11

Closed

unzvfu self-assigned this Mar 20, 2018

hardbyte assigned nbgl and unassigned unzvfu Feb 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests that currently trick greedy algorithm #10

Tests that currently trick greedy algorithm #10

hardbyte commented May 30, 2017 •

edited

Loading

nbgl commented Nov 7, 2018

hardbyte commented Nov 7, 2018

nbgl commented Nov 8, 2018

Tests that currently trick greedy algorithm #10

Tests that currently trick greedy algorithm #10

Comments

hardbyte commented May 30, 2017 • edited Loading

nbgl commented Nov 7, 2018

hardbyte commented Nov 7, 2018

nbgl commented Nov 8, 2018

hardbyte commented May 30, 2017 •

edited

Loading