Skip to content

Latest commit

 

History

History
175 lines (128 loc) · 10.2 KB

README.markdown

File metadata and controls

175 lines (128 loc) · 10.2 KB

Z-Algorithm String Search

Goal: Write a simple linear-time string matching algorithm in Swift that returns the indexes of all the occurrencies of a given pattern.

In other words, we want to implement an indexesOf(pattern: String) extension on String that returns an array [Int] of integers, representing all occurrences' indexes of the search pattern, or nil if the pattern could not be found inside the string.

For example:

let str = "Hello, playground!"
str.indexesOf(pattern: "ground")   // Output: [11]

let traffic = "🚗🚙🚌🚕🚑🚐🚗🚒🚚🚎🚛🚐🏎🚜🚗🏍🚒🚲🚕🚓🚌🚑"
traffic.indexesOf(pattern: "🚑") // Output: [4, 21]

Many string search algorithms use a pre-processing function to compute a table that will be used in successive stage. This table can save some time during the pattern search stage because it allows to avoid un-needed characters comparisons. The Z-Algorithm is one of these functions. It borns as a pattern pre-processing function (this is its role in the Knuth-Morris-Pratt algorithm and others) but, just like we will show here, it can be used also as a single string search algorithm.

Z-Algorithm as pattern pre-processor

As we said, the Z-Algorithm is foremost an algorithm that process a pattern in order to calculate a skip-comparisons-table. The computation of the Z-Algorithm over a pattern P produces an array (called Z in the literature) of integers in which each element, call it Z[i], represents the length of the longest substring of P that starts at i and matches a prefix of P. In simpler words, Z[i] records the longest prefix of P[i...|P|] that matches a prefix of P. As an example, let's consider P = "ffgtrhghhffgtggfredg". We have that Z[5] = 0 (f...h...), Z[9] = 4 (ffgtr...ffgtg...) and Z[15] = 1 (ff...fr...).

But how do we compute Z? Before we describe the algorithm we must indroduce the concept of Z-box. A Z-box is a pair (left, right) used during the computation that records the substring of maximal length that occurs also as a prefix of P. The two indices left and right represent, respectively, the left-end index and the right-end index of this substring. The definition of the Z-Algorithm is inductive and it computes the elements of the array for every position k in the pattern, starting from k = 1. The following values (Z[k + 1], Z[k + 2], ...) are computed after Z[k]. The idea behind the algorithm is that previously computed values can speed up the calculus of Z[k + 1], avoiding some character comparisons that were already done before. Consider this example: suppose we are at iteration k = 100, so we are analyzing position 100 of the pattern. All the values between Z[1] and Z[99] were correctly computed and left = 70 and right = 120. This means that there is a substring of length 51 starting at position 70 and ending at position 120 that matches the prefix of the pattern/string we are considering. Reasoning on it a little bit we can say that the substring of length 21 starting at position 100 matches the substring of length 21 starting at position 30 of the pattern (because we are inside a substring that matches a prefix of the pattern). So we can use Z[30] to compute Z[100] without additional character comparisons. This a simple description of the idea that is behind this algorithm. There are a few cases to manage when the use of pre-computed values cannot be directly applied and some comparisons are to be made.

Here is the code of the function that computes the Z-array:

func ZetaAlgorithm(ptrn: String) -> [Int]? {

    let pattern = Array(ptrn.characters)
    let patternLength: Int = pattern.count

    guard patternLength > 0 else {
        return nil
    }

    var zeta: [Int] = [Int](repeating: 0, count: patternLength)

    var left: Int = 0
    var right: Int = 0
    var k_1: Int = 0
    var betaLength: Int = 0
    var textIndex: Int = 0
    var patternIndex: Int = 0

    for k in 1 ..< patternLength {
        if k > right {  // Outside a Z-box: compare the characters until mismatch
            patternIndex = 0

            while k + patternIndex < patternLength  &&
                pattern[k + patternIndex] == pattern[patternIndex] {
                patternIndex = patternIndex + 1
            }

            zeta[k] = patternIndex

            if zeta[k] > 0 {
                left = k
                right = k + zeta[k] - 1
            }
        } else {  // Inside a Z-box
            k_1 = k - left + 1
            betaLength = right - k + 1

            if zeta[k_1 - 1] < betaLength { // Entirely inside a Z-box: we can use the values computed before
                zeta[k] = zeta[k_1 - 1]
            } else if zeta[k_1 - 1] >= betaLength { // Not entirely inside a Z-box: we must proceed with comparisons too
                textIndex = betaLength
                patternIndex = right + 1

                while patternIndex < patternLength && pattern[textIndex] == pattern[patternIndex] {
                    textIndex = textIndex + 1
                    patternIndex = patternIndex + 1
                }

                zeta[k] = patternIndex - k
                left = k
                right = patternIndex - 1
            }
        }
    }
    return zeta
}

Let's make an example reasoning with the code above. Let's consider the string P = “abababbb". The algorithm begins with k = 1, left = right = 0. So, no Z-box is "active" and thus, because k > right we start with the character comparisons beetwen P[1] and P[0].

   01234567
k:  x
   abababbb
   x
Z: 00000000
left:  0
right: 0

We have a mismatch at the first comparison and so the substring starting at P[1] does not match a prefix of P. So, we put Z[1] = 0 and let left and right untouched. We begin another iteration with k = 2, we have 2 > 0 and again we start comparing characters P[2] with P[0]. This time the characters match and so we continue the comparisons until a mismatch occurs. It happens at position 6. The characters matched are 4, so we put Z[2] = 4 and set left = k = 2 and right = k + Z[k] - 1 = 5. We have our first Z-box that is the substring "abab" (notice that it matches a prefix of P) starting at position left = 2.

   01234567
k:   x
   abababbb
   x
Z: 00400000
left:  2
right: 5

We then proceed with k = 3. We have 3 <= 5. We are inside the Z-box previously found and inside a prefix of P. So we can look for a position that has a previously computed value. We calculate k_1 = k - left = 1 that is the index of the prefix's character equal to P[k]. We check Z[1] = 0 and 0 < (right - k + 1 = 3) and we find that we are exactly inside the Z-box. We can use the previously computed value, so we put Z[3] = Z[1] = 0, left and right remain unchanged. At iteration k = 4 we initially execute the else branch of the outer if. Then in the inner if we have that k_1 = 2 and (Z[2] = 4) >= 5 - 4 + 1. So, the substring P[k...r] matches for right - k + 1 = 2 chars the prefix of P but it could not for the following characters. We must then compare the characters starting at r + 1 = 6 with those starting at right - k + 1 = 2. We have P[6] != P[2] and so we have to set Z[k] = 6 - 4 = 2, left = 4 and right = 5.

   01234567
k:     x
   abababbb
   x
Z: 00402000
left:  4
right: 5

With iteration k = 5 we have k <= right and then (Z[k_1] = 0) < (right - k + 1 = 1) and so we set z[k] = 0. In iteration 6 and 7 we execute the first branch of the outer if but we only have mismatches, so the algorithms terminates returning the Z-array as Z = [0, 0, 4, 0, 2, 0, 0, 0].

The Z-Algorithm runs in linear time. More specifically, the Z-Algorithm for a string P of size n has a running time of O(n).

The implementation of Z-Algorithm as string pre-processor is contained in the ZAlgorithm.swift file.

Z-Algorithm as string search algorithm

The Z-Algorithm discussed above leads to the simplest linear-time string matching algorithm. To obtain it, we have to simply concatenate the pattern P and text T in a string S = P$T where $ is a character that does not appear neither in P nor T. Then we run the algorithm on S obtaining the Z-array. All we have to do now is scan the Z-array looking for elements equal to n (which is the pattern length). When we find such value we can report an occurrence.

extension String {

    func indexesOf(pattern: String) -> [Int]? {
        let patternLength: Int = pattern.characters.count
        /* Let's calculate the Z-Algorithm on the concatenation of pattern and text */
        let zeta = ZetaAlgorithm(ptrn: pattern + "💲" + self)

        guard zeta != nil else {
            return nil
        }

        var indexes: [Int] = [Int]()

        /* Scan the zeta array to find matched patterns */
        for i in 0 ..< zeta!.count {
            if zeta![i] == patternLength {
                indexes.append(i - patternLength - 1)
            }
        }

        guard !indexes.isEmpty else {
            return nil
        }

        return indexes
    }
}

Let's make an example. Let P = “CATA“ and T = "GAGAACATACATGACCAT" be the pattern and the text. Let's concatenate them with the character $. We have the string S = "CATA$GAGAACATACATGACCAT". After computing the Z-Algorithm on S we obtain:

            1         2
  01234567890123456789012
  CATA$GAGAACATACATGACCAT
Z 00000000004000300001300
            ^

We scan the Z-array and at position 10 we find Z[10] = 4 = n. So we can report a match occuring at text position 10 - n - 1 = 5.

As said before, the complexity of this algorithm is linear. Defining n and m as pattern and text lengths, the final complexity we obtain is O(n + m + 1) = O(n + m).

Credits: This code is based on the handbook "Algorithm on String, Trees and Sequences: Computer Science and Computational Biology" by Dan Gusfield, Cambridge University Press, 1997.

Written for Swift Algorithm Club by Matteo Dunnhofer