title | date | lastmod |
---|---|---|
Boyer-Moore Algorithm |
2022-11-08 |
2022-11-21 |
Definitions : T denotes the input text to be searched. Its length is n. : P denotes the string to be searched for, called the pattern. Its length is m.
Steps
- Process the text T[1...n] from __left to right
- Scan the pattern P[1...m] from __right to left.
- Generate tables to find the maximum steps to slide the pattern after a mismatch
- For every mismatch, we slide by the maximum amount returned by the 2 heuristics
- Return once we find the pattern or the pattern does not exist (n-m+1) comparisons
Case 1 (mismatched character does not exist in the rest of the pattern):
Case 2 (mismatched character is found in the rest of the pattern):
Right most occurrence such that we do not over slide.
When using charJump only, taking the max of charJump and
Example of where we could have made a left shift of the pattern, which is not what we want:
Derive maximum shift from the structure of the pattern. Run through each case below in order (1 -> 2 -> 3) so as to shift by the least amount such that a suffix is matched.
[!NOTE] General formula for slide
$(m-k)+(m-q$ )
- k is the position of the mismatch
- q is the position of end of the next matching suffix
Case 1: matching suffix occurs earlier in the pattern and __preceded by a different character.
Case 2: matching suffix occurs at the __start of the pattern.
Case 3: no occurrence of the matching suffix in the rest of the pattern.
Case 4: mismatch on the first character The last character array entry is always 1. We have no information about matching suffixes from the text as the first comparison is already mismatched. Hence, we can only safely shift by 1 character.