Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add substrings search path for constant like pattern #10731

Closed
wants to merge 1 commit into from

Conversation

skadilover
Copy link
Contributor

@skadilover skadilover commented Aug 13, 2024

We noticed that some like constant-patterns can be equivalent to substrings search such as tpch queries Q13 and Q16: "%special%requests%" "%Customer%Complaints%"

  1. Add substring-search fast path for like with search pattern : "((%+[^%#_\\]+)+%+)"
  2. Some OLAP engine supports "_" for substring-search. In order to obtain the best performance, "_" is not supported here, '#' is not support for the same reason too.
  3. This PR will also affect single substring-search like tpch Q9, if a fast-simd-strstr to speedup string-search will be added as follow-up, changes nothing right now.

benchmark result :
before:

tpchQuery9 23.96ms 41.73
tpchQuery13 156.02ms 6.41
tpchQuery16Part 9.02ms 110.83
tpchQuery16Supplier 192.13ms 5.20

after:

tpchQuery9 24.30ms 41.14
tpchQuery13 55.03ms 18.17
tpchQuery16Part 8.92ms 112.07
tpchQuery16Supplier 17.64ms 56.69

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 13, 2024
Copy link

netlify bot commented Aug 13, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 2ee154c
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66f0df2d24d7f70008067085

@skadilover
Copy link
Contributor Author

skadilover commented Aug 13, 2024

@mbasmanova @Yuhta

Could you help to have a look at this PR ? Is it reasonable to add a fast-path here?
And could you help to look at if the "load_unaligned" method usage in smid_strstr_memcmp is safe as it may load some unkown-memory address here.

@skadilover skadilover force-pushed the like_fast_path branch 18 times, most recently from 91983d0 to ee87495 Compare August 13, 2024 10:22
Copy link
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a benchmark to show the performance gain?

velox/common/base/FixedMemCompare.h Outdated Show resolved Hide resolved
velox/common/base/FixedMemCompare.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
size_t n,
const char* needle,
MEMCMP memcmp_fun) {
assert(k > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_DCHECK

velox/common/base/FixedMemCompare.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
@skadilover skadilover force-pushed the like_fast_path branch 2 times, most recently from 1ac4f0f to 60b2015 Compare August 14, 2024 12:58
@skadilover
Copy link
Contributor Author

@Yuhta update code , could you have a look at this once more ?

i + CharVector::size <= n otherwise it's overreading, you need to handle trailing characters separately

I think this is the most interesting thing in this PR:

  1. I add pageSafe check here now , like ClickHouse does : https://github.com/ClickHouse/ClickHouse/blob/a51f867ce014a4cb8f97c46e004b90f5feafd80f/src/Common/StringSearcher.h#L31
    bool isPageSafe(const void * const ptr) const
    {
        return ((page_size - 1) & reinterpret_cast<std::uintptr_t>(ptr)) <= page_size - N;
    }

By this way, load_unaligned is safe when overreading for they are in same memory page.

  1. For StringSearch, ClickHouse use Volnitsky https://github.com/ClickHouse/ClickHouse/blob/1513eda3efe3a24b7a87b9f0e7803bb8693d1bc8/src/Common/Volnitsky.h#L373

Howevever, performance seems not best, Doris delete Volnitsky when they copy the StringSearch implement : https://github.com/apache/doris/blob/4e326e5f496109edb253d094869d057573ab7a59/be/src/vec/common/string_searcher.h#L44

And use first-two char compare, however, I test it , about 2x slow than this PR`s implement.

@@ -32,6 +32,12 @@ limited to 20 different expressions per instance and thread of execution.
SELECT like('abc', '%b%'); -- true
SELECT like('a_c', '%#_%', '#'); -- true

String sequence search: There are some patterns that are equivalent to simple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph doesn't fit with the existing "Patterns 'hello',..." docs. It would be nice to make the whole doc consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with "hello velox"

@@ -14,6 +14,7 @@
* limitations under the License.
*/
#include "velox/functions/lib/Re2Functions.h"
#include "velox/common/base/SimdUtil.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? Would you point me to where it is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

er...., forget this after removed simdstrstr

@skadilover
Copy link
Contributor Author

@mbasmanova ping ~

@@ -32,6 +32,12 @@ limited to 20 different expressions per instance and thread of execution.
SELECT like('abc', '%b%'); -- true
SELECT like('a_c', '%#_%', '#'); -- true

String sequence search: There are some patterns that are equivalent to simple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to embed the new information into the existing flow.

Not all patterns require compilation of regular expressions. Patterns ....

The new pattern needs to added to this list. The separate paragraph then can be about the limitations that apply to these special patterns. I.e. We would clarify separately whether special patterns allow for custom character or not.

Also, please, generate the docs to make sure they look nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

I am not sure if I understand your meaning, add a list for all fast path in velox.

@mbasmanova Could you look at this again ?

@skadilover skadilover force-pushed the like_fast_path branch 2 times, most recently from b963239 to 97822e5 Compare September 20, 2024 10:27
@@ -32,6 +32,22 @@ limited to 20 different expressions per instance and thread of execution.
SELECT like('abc', '%b%'); -- true
SELECT like('a_c', '%#_%', '#'); -- true

Not all patterns require compilation of regular expressions. For the patterns without custom escape:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is existing text

Not all patterns require
    compilation of regular expressions. Patterns 'hello', 'hello%', '_hello__%',....

We need to update it to add the new pattern. Currently, you are adding new text as if existing text doesn't exist or doesn't talk about similar thing.

Current:

Patterns 'hello', 'hello%', '_hello__%',
    '%hello', '%__hello_', '%hello%', where

New (proposed):

Patterns 'hello', 'hello%', '_hello__%',
    '%hello', '%__hello_', '%hello%', '%hello%velox%' where

@@ -32,6 +32,22 @@ limited to 20 different expressions per instance and thread of execution.
SELECT like('abc', '%b%'); -- true
SELECT like('a_c', '%#_%', '#'); -- true

Not all patterns require compilation of regular expressions. For the patterns without custom escape:

| patterns | description | fast search mothod |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table might be a nice alternative to current doc, but we'll need to iterate on columns, formatting and user-friendly descriptions. I suggest to do that in a separate PR.

@skadilover
Copy link
Contributor Author

@mbasmanova
Got it, thanks for detail the comment, updated code, would you check it again?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Sep 20, 2024
@Yuhta
Copy link
Contributor

Yuhta commented Sep 20, 2024

Is the regression in benchmark caused by the change? If so we probably want to merge this after #10858

@skadilover
Copy link
Contributor Author

skadilover commented Sep 20, 2024

Is the regression in benchmark caused by the change? If so we probably want to merge this after #10858

It`s strange because Q2 and suffix is not supported by this change.

  // Not supports prefix.
  test("aa%bb%%", {});
  // Not supports sufix.
  test("%aa%bb", {});

@Yuhta I do not see regression on my machine

before 9851a4e
wildcardExactlyN 3.88ms 257.73
wildcardAtLeastN 3.79ms 263.72
fixedPattern 22.03ms 45.39
prefixPattern 21.83ms 45.80
suffixPattern 7.35ms 136.12
tpchQuery2 10.11ms 98.95
tpchQuery9 23.54ms 42.48
tpchQuery13 154.68ms 6.46
tpchQuery14 9.49ms 105.40
tpchQuery16Part 8.78ms 113.85
tpchQuery16Supplier 180.59ms 5.54
tpchQuery20 7.34ms 136.22

after
wildcardExactlyN 3.88ms 257.82
wildcardAtLeastN 3.79ms 263.84
fixedPattern 22.10ms 45.24
prefixPattern 21.92ms 45.61
suffixPattern 7.36ms 135.85
tpchQuery2 10.06ms 99.36
tpchQuery9 25.06ms 39.90
tpchQuery13 56.61ms 17.66
tpchQuery14 9.59ms 104.28
tpchQuery16Part 8.74ms 114.41
tpchQuery16Supplier 18.43ms 54.26
tpchQuery20 7.53ms 132.89

@Yuhta Maybe input data is not same ? what`s 'data' means in the cicd result? Would you help to check if the cicd result not stable, because I could not reproduced on my machine ?

image

remove simd_strstr to a seperate PR

fix format
@facebook-github-bot
Copy link
Contributor

@Yuhta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Yuhta
Copy link
Contributor

Yuhta commented Sep 23, 2024

@skadilover It's just flakiness, disappeared after rerun

@skadilover
Copy link
Contributor Author

It's just flakiness, disappeared after rerun

I think it`s nothing to do with this change.

@facebook-github-bot
Copy link
Contributor

@Yuhta merged this pull request in 11e4047.

@@ -2154,6 +2201,14 @@ std::shared_ptr<exec::VectorFunction> makeLike(

PatternMetadata patternMetadata = PatternMetadata::generic();
try {
// Fast path for substrings search.
auto substrings =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After picking this PR, Gluten will be a unit test that fails.

Incorrect evaluation: ///__ LIKE %//%/% ESCAPE '/', actual: true, expected: false

It seems that this implementation did not take into account the ESCAPE. In this example, the last slash (/) is the ESCAPE, which causes the last percent (%) to be a regular character, not a wildcard. Therefore, the result should be false, but Velox returns true.
@skadilover @mbasmanova @Yuhta @xiaoxmeng Do you have any input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So sorry for the mistake, I mis-understand the code here:

    BaseVector* escape = inputArgs[2].constantValue.get();
    if (!escape) {
      return std::make_shared<LikeGeneric>();
    }

and the ut just skiped by '#'

  testLike("a%c", "%#%%", '#', true);
  testLike("%cd", "%#%%", '#', true);

fix it here : #11091

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants