-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improved performance of LIKE matching #1049
Conversation
Signed-off-by: Calum Murray <[email protected]>
Using the benchmarks in #1050 , I got the following results for this improvement over the original implementation:
@dgeorgievski are these improvements good enough for your use case? There are further performance improvements I could make if it is really needed, but they would make the code more complex so I'd prefer not to if there isn't a need |
chunks = append(chunks, ".*") | ||
for textIdx < textLen { | ||
// handle escaped characters -> pattern needs to increment two places here | ||
if patternIdx < patternLen-1 && pattern[patternIdx] == '\\' && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if moving the pattern[patternIdx] == '\\'
to be first would speed things up slightly since patternIdx < patternLen-1
will be true most of the time, and I think doing the check for \
is the biggest determining factor, so let's stop on that check as quickly as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh crap - never mind, I mixed up text* and pattern* variables. Although, I do wonder if you should add a check for the rest of the pattern being empty and then by-pass all of these if's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, an empty pattern is an interesting case. I guess currently you would just end up in the final else
clause and return false
. We could definitely add a quick check at the top to check the textLen
and patternLen
before entering the for loop and all the if statemenets
lastMatchIdx = textIdx | ||
patternIdx += 1 | ||
// greedy match didn't work, try again from the last known match | ||
} else if lastWildcardIdx != -1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably wrong, and it's probably due to the nature of % being so greedy, but are we sure that we don't need a stack of "last known wildcard indexes" instead of just 1? I'm wondering about a case where we've seen multiple %'s and then we don't match something and we need to back-up to the first % not just the 2nd %. I can't seem to write a testcase to show the issue, hence why I'm probably wrong, but I thought I'd mention it/ask....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me the trick to understanding this is that wildcard matching has optimal substructure: if we have already matched part of the input string against part of the pattern string, we only need to use the remaining parts of the pattern string and input string to determine if the entire string matches the pattern or not.
The only slight "exception" to the optimal substructure is that the %
character can consume more than one element of the input. So, if at some point our pattern no longer matches the input text, we need to go back to the last wildcard character we saw, as well as the place we were at in the input text when we saw it and try again, consuming one character from the input text in the process.
The key is that we know that we have a valid match up until the last wildcard character we saw, and that apart from going back to that wildcard and consuming more characters from the input text there we can not otherwise move backwards in the pattern or the input text. So, storing an older index would not end up being used, since it would be in part of the input text we had already matched.
@duglin @pierDipi since this code is pretty technical, I also ran it (with modifications for https://leetcode.com/problems/wildcard-matching/submissions/1253803571 We have reasonable coverage of the LIKE expression in the tck tests, but I bet LeetCode has more edge cases than we do, so this makes me feel a bit better about the accuracy of this |
Signed-off-by: Calum Murray <[email protected]>
Signed-off-by: Calum Murray <[email protected]>
cc @embano1 |
Good thinking! |
This PR attempts to improve the performance of CESQL
LIKE
matches by moving away from building a regex and instead taking a simple greedy algorithm approach (which guarantees O(n+m) time complexity to consume the full string, where n is the length of the input string, and m is the length of the pattern).This change has the added benefit of simplifying the code - we are no longer building and then evaluating a regex, simply looping over a string.