Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4189] Fix colliding filters by updating match method for MultiCharSequenceFilter to iterate through all patterns for the most complete match and not return on the first pattern found #4188

Merged
merged 9 commits into from
Feb 22, 2023

Conversation

saad-zaman
Copy link
Collaborator

@saad-zaman saad-zaman commented Feb 15, 2023

What this PR does / why we need it: Currently, the multiCharSequenceFilter matches function returns on the first pattern found here even if there are other patterns available with better matches. This results in a suboptimal byte[] val being returned which causes the public matches function to return a false negative here. The len(val) is not zero because a suboptimal val was returned when a better pattern could have been found if the multiCharSequenceFilter matches function iterated through all patterns available and looked for the best match.

Fixes # #4189

Special notes for your reviewer: Please provide any go style improvements or tips you think are relevant

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

Benchmark Test Results Before and After Changes to matches method of multiCharSequenceFilter:

testname old-bench-test-results.txt sec/op new-bench-test-results.txt sec/op sec/op vs base
MultiRangeFilterPrefixCollision-10 10.61n ± 1% 16.27n ± 4% +53.47% (p=0.000 n=10)
MultiRangeFilterSuffixCollision-10 11.51n ± 7% 17.00n ± 3% +47.65% (p=0.000 n=10)
MultiRangeFilterTwo-10 17.48n ± 5% 17.91n ± 2% +2.43% (p=0.004 n=10)
MultiRangeFilterSelectTwo-10 21.45n ± 3% 24.30n ± 2% +13.31% (p=0.000 n=10)
MultiRangeFilterTrieTwo-10 16.44n ± 4% 16.07n ± 2% ~ (p=0.342 n=10)
MultiRangeFilterSix-10 38.89n ± 1% 42.51n ± 2% +9.32% (p=0.000 n=10)
MultiRangeFilterSelectSix-10 56.47n ± 0% 65.47n ± 1% +15.94% (p=0.000 n=10)
MultiRangeFilterTrieSix-10 33.67n ± 0% 33.97n ± 4% +0.89% (p=0.005 n=10)
geomean 22.04n 25.57n +16.04%

…ilter to iterate through all patterns for the most complete match and not return on the first pattern found
@CLAassistant
Copy link

CLAassistant commented Feb 15, 2023

CLA assistant check
All committers have signed the CLA.

@saad-zaman saad-zaman changed the title [DRAFT PR] Fix colliding filters by updating match method for MultiCharSequenceF… [DRAFT PR] Fix colliding filters by updating match method for MultiCharSequenceFilter to iterate through all patterns for the most complete match and not return on the first pattern found Feb 15, 2023
@@ -27,6 +27,19 @@ import (
"github.com/stretchr/testify/require"
)

func TestPrefixCompositeR2Filter(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a better place to add these test cases is in TestMultiCharSequenceFilter in this file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I also think the naming also isn't great. Is R2 naming only relevant internally to Uber?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, there's R2 references aplenty in OSS world.

@justinjc
Copy link
Collaborator

It's not immediately clear that this is a bug without mentioning how it's used in the public boolean Matches function here.

Maybe explicitly mention this in the description and how it would cause false negatives.

return val[len(pattern):], true
if f.backwards {
if bytes.HasSuffix(val, pattern) {
if bestPattern == nil || len(pattern) > len(bestPattern) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: seems like the bestPattern == nil isn't needed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, updated.

@@ -349,6 +349,11 @@ func TestMultiCharSequenceFilter(t *testing.T) {
validateLookup(t, f, "12test", true, "12")
validateLookup(t, f, "12test2", true, "12")
validateLookup(t, f, "123book", true, "123")

f, err = newMultiCharSequenceFilter([]byte("arachne_failures,arachne_failures_by_rack"), false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a similar test case where backwards == true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@saad-zaman
Copy link
Collaborator Author

It's not immediately clear that this is a bug without mentioning how it's used in the public boolean Matches function here.

Maybe explicitly mention this in the description and how it would cause false negatives.

Updated in the description and the associated issue.

@saad-zaman saad-zaman marked this pull request as ready for review February 16, 2023 00:37
@saad-zaman saad-zaman changed the title [DRAFT PR] Fix colliding filters by updating match method for MultiCharSequenceFilter to iterate through all patterns for the most complete match and not return on the first pattern found [4189] Fix colliding filters by updating match method for MultiCharSequenceFilter to iterate through all patterns for the most complete match and not return on the first pattern found Feb 16, 2023
id1 := "arachne_failures_by_rack"
f, err := newMultiCharSequenceFilter([]byte("arachne_failures,arachne_failures_by_rack"), false)
require.NoError(t, err)
_, matches1 := f.matches([]byte(id0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm understanding the requirement wrong: Should "arachne_failures" match "arachne_failures" or "arachne_failures_by_rack" ? If your expectation was to match with "arachne_failures" with "arachne_failures" then wouldn't "len(pattern) > len(bestPattern)" make it match "arachne_failures_by_rack" ?

Follow-ups:

  1. Could you also assert the returned matchIndex ?
  2. Maybe add similar bestMatch testcase for backwards (HasSuffix) case as well ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're looking at an older commit possibly.

"arachne_failures" should match with "arachne-failures" and "arachne_failures_by_rack" should match with "arachne_failures_by_rack" even if it also matches with other patterns in the filter since this pattern is the best match.

In the latest commit I have tests for the length of the value in the returned byte array and with backwards set to true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha! Yeah was looking at the older commit. Thanks for the update!

@prateek
Copy link
Collaborator

prateek commented Feb 16, 2023

@saad-zaman before landing, can you run existing/make new benchmarks for this change too -- https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go is a good guide to get you going.

@@ -125,6 +125,14 @@ func BenchmarkRangeFilterRangeMatchNegation(b *testing.B) {
benchRangeFilterRange(b, []byte("!a-z"), []byte("p"), false)
}

func BenchmarkMultiRangeFilterPrefixCollision(b *testing.B) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prateek PTAL

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the results of the benchmark before and after the change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To Justin's point - please add a comment to the code with "After change results"

And in the PR, update description with both before/after results

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR description to add benchmark results and a table to help with comparison. Also updated the code to add a small optimization to stop iterating through patterns if an optimal match was found and updated tests to ensure it was working

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated table with a more legible benchmark test comparison

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add b.ReportAllocs() to these benchmarks?

Having a >50% impact seems pretty steep for this change, but I'm not seeing any obvious ways to make this function faster. Perhaps this impact is just very high because we're purposely checking the scenario where these "colliding filters" are present. Can we see the before/after numbers for the vanilla case where there aren't any colliding filters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran it with b.ReportAllocs() and there are 0 allocs with the old and new logic.

Updated the tests to show the change for all benchmark tests related to my code. There is a small difference (~0-15%) in test cases where the prefix/suffix collision does not occur likely due to the additional conditionals present after a prefix or suffix detection is found.

@saad-zaman saad-zaman enabled auto-merge (squash) February 22, 2023 18:02
@saad-zaman saad-zaman merged commit 3afc5be into master Feb 22, 2023
@saad-zaman saad-zaman deleted the szaman/M3-2186/fix-colliding-filters branch February 22, 2023 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants