A FUNDAMENTAL OPERATION on strings is substring search: given a text string of length N and a pattern string of length M, find an occurrence of the pattern within the text. Most algorithms for this problem can easily be extended to find all occurrences of the pattern in the text, to count the number of occurrences of the pattern in the text, or to provide context (substrings of the text surrounding each occurrence of the pattern).
To best appreciate the algorithms, think of the pattern as being relatively short (with M equal to, say, 100 or 1,000) and the text as being relatively long (with N equal to, say, 1 million or 1 billion). In substring search, we typically preprocess the pattern in order to be able to support fast searches for that pattern in the text.
Substring search is an interesting and classic problem: several very different (and surprising) algorithms have been discovered that not only provide a spectrum of useful practical methods but also illustrate a spectrum of fundamental algorithm design techniques.
There is a simple brute-force algorithm for substring search that is in widespread use. While it has a worst-case running time proportional to MN, the strings that arise in many applications lead to a running time that is (except in pathological cases) proportional to M + N. Furthermore, it is well-suited to standard architectural features on most computer systems, so an optimized version provides a standard benchmark that is difficult to beat, even with a clever algorithm.
In 1970, S. Cook proved a theoretical result about a particular type of abstract machine that implied the existence of an algorithm that solves the substring search problem in time proportional to M + N in the worst case. D. E. Knuth and V. R. Pratt laboriously followed through the construction Cook used to prove his theorem (which was not intended to be practical) and refined it into a relatively simple and practical algorithm. This seemed a rare and satisfying example of a theoretical result with immediate (and unexpected) practical applicability. But it turned out that J. H. Morris had discovered virtually the same algorithm as a solution to an annoying problem confronting him when implementing a text editor (he wanted to avoid having to “back up” in the text string). The fact that the same algorithm arose from two such different approaches lends it credibility as a fundamental solution to the problem.
Knuth, Morris, and Pratt didn’t get around to publishing their algorithm until 1976, and in the meantime R. S. Boyer and J. S. Moore (and, independently, R. W. Gosper) discovered an algorithm that is much faster in many applications, since it often examines only a fraction of the characters in the text string. Many text editors use this algorithm to achieve a noticeable decrease in response time for substring search.
Both the Knuth-Morris-Pratt (KMP) and the Boyer-Moore algorithms require some complicated preprocessing on the pattern that is difficult to understand and has limited the extent to which they are used. (In fact, the story goes that an unknown systems programmer found Morris’s algorithm too difficult to understand and replaced it with a brute-force implementation.)
In 1980, M. O. Rabin and R. M. Karp used hashing to develop an algorithm almost as simple as the brute-force algorithm that runs in time proportional to M + N with very high probability. Furthermore, their algorithm extends to two-dimensional patterns and text, which makes it more useful than the others for image processing.
This story illustrates that the search for a better algorithm is still very often justified; indeed, one suspects that there are still more developments on the horizon even for this classic problem.
An obvious method for substring search is to check, for each possible position in the text at which the pattern could match, whether it does in fact match. The search() method below operates in this way to find the first occurrence of a pattern string pat in a given text string.
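A minimal Java sketch of such a brute-force search() method (the class name and the convention of returning the text length when there is no match are choices made for this sketch):

```java
public class BruteForce {
    // Return the index of the first occurrence of pat in txt,
    // or txt.length() if the pattern does not occur.
    public static int search(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        for (int i = 0; i <= N - M; i++) {      // candidate start position in the text
            int j;
            for (j = 0; j < M; j++)
                if (txt.charAt(i + j) != pat.charAt(j))
                    break;                      // mismatch: try the next position
            if (j == M) return i;               // found: pattern starts at i
        }
        return N;                               // not found
    }
}
```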
Proposition Brute-force substring search requires ~NM character compares to search for a pattern of length M in a text of length N, in the worst case.
The alternate implementation is instructive. As before, the program keeps one pointer (i) into the text and another pointer (j) into the pattern. As long as they point to matching characters, both pointers are incremented.
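That alternate formulation might be sketched as follows, with the backup on mismatch made explicit (again, names and conventions are choices made for this sketch):

```java
public class BruteForceAlt {
    // Same brute-force search, but with i pointing into the text and
    // j into the pattern; on a mismatch we back up both pointers.
    public static int search(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        int i, j;
        for (i = 0, j = 0; i < N && j < M; i++) {
            if (txt.charAt(i) == pat.charAt(j)) j++;
            else { i -= j; j = 0; }   // back up i past this candidate; restart pattern
        }
        if (j == M) return i - M;     // found (pattern ends just before i)
        else        return N;         // not found
    }
}
```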
The basic idea behind the algorithm discovered by Knuth, Morris, and Pratt is this: whenever we detect a mismatch, we already know some of the characters in the text (since they matched the pattern characters prior to the mismatch). We can take advantage of this information to avoid backing up the text pointer over all those known characters.
In KMP substring search, we never back up the text pointer i, and we use an array dfa to record how far to back up the pattern pointer j when a mismatch is detected. For every character c, dfa[c][j] is the pattern position to compare against the next text position after comparing c with pat.charAt(j).
A useful way to describe KMP is in terms of a DFA.
How do we compute the dfa array corresponding to a given pattern? Remarkably, the answer to this question lies in the DFA itself!
What should the DFA do with the next character? Exactly what it would have done if we had backed up, except if it finds a match with pat.charAt(j), when it should go to state j + 1. For example, to decide what the DFA should do when we have a mismatch at j = 5 for A B A B A C, we use the DFA to learn that a full backup would leave us in state 3 for B A B A, so we can copy dfa[][3] to dfa[][5], then set the entry for C to 6 because pat.charAt(5) is C (a match). Since we only need to know how the DFA runs for the first j - 1 characters when we are building the jth state, we can always get the information that we need from the partially built DFA.
The discussion above leads to the remarkably compact code below for constructing the DFA corresponding to a given pattern. For each j, it
- copies dfa[][X] to dfa[][j] (for mismatch cases)
- sets dfa[pat.charAt(j)][j] to j + 1 (for the match case)
- updates X, the state the DFA would be in after a full backup
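A runnable sketch combining this DFA construction with the KMP search loop (the alphabet size R = 256 and the class name are assumptions of this sketch):

```java
public class KMP {
    // Build the DFA from the pattern, then run it over the text.
    public static int search(String pat, String txt) {
        int R = 256;                            // assumed alphabet size (extended ASCII)
        int M = pat.length();
        int[][] dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];          // copy mismatch cases from restart state X
            dfa[pat.charAt(j)][j] = j + 1;      // set match case
            X = dfa[pat.charAt(j)][X];          // update restart state
        }
        int i, j, N = txt.length();
        for (i = 0, j = 0; i < N && j < M; i++)
            j = dfa[txt.charAt(i)][j];          // the text pointer i never backs up
        if (j == M) return i - M;               // found
        return N;                               // not found
    }
}
```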
Proposition Knuth-Morris-Pratt substring search accesses no more than M + N characters to search for a pattern of length M in a text of length N.
Another parameter comes into play: for an R-character alphabet, the total running time (and space) required to build the DFA is proportional to MR. It is possible to remove the factor of R by building a DFA where each state has a match transition and a mismatch transition (not transitions for each possible character), but the construction is somewhat more intricate.
When backup in the text string is not a problem, we can develop a significantly faster substring-searching method by scanning the pattern from right to left when trying to match it against the text. For example, when searching for the substring B A A B B A A, if we find matches on the seventh and sixth characters but not on the fifth, then we can immediately slide the pattern seven positions to the right, and check the 14th character in the text next, because our partial match found X A A where X is not B, which does not appear elsewhere in the pattern.
To implement the mismatched character heuristic, we use an array right that gives, for each character in the alphabet, the index of its rightmost occurrence in the pattern (or -1 if the character is not in the pattern). This value tells us precisely how far to skip if that character appears in the text and causes a mismatch during the string search.
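A sketch of computing that right array (the alphabet size R = 256 and the class name are assumptions of this sketch):

```java
public class BoyerMooreSkip {
    // right[c] = index of the rightmost occurrence of c in pat, or -1 if absent.
    public static int[] right(String pat) {
        int R = 256;                    // assumed alphabet size (extended ASCII)
        int[] right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;   // later occurrences overwrite earlier ones
        return right;
    }
}
```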
- If the character causing the mismatch does not appear in the pattern, we can slide the pattern past the mismatch (by incrementing i by j + 1).
- If the character c causing the mismatch appears in the pattern and its rightmost occurrence in the pattern is to the left of the mismatch, we can shift the pattern to align that character with its rightmost occurrence in the pattern (by incrementing i by j - right[c]).
- If the character causing the mismatch appears in the pattern but its rightmost occurrence is to the right of the mismatch, this computation would not increase i, so we just shift the pattern one position to the right (by incrementing i by 1).
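The three cases above collapse into a single skip formula, as in this sketch of the search with the mismatched-character heuristic only (no good-suffix rule; the alphabet size and names are assumptions of this sketch):

```java
public class BoyerMoore {
    // Return the index of the first occurrence of pat in txt, or txt.length() if none.
    public static int search(String pat, String txt) {
        int R = 256;                            // assumed alphabet size
        int M = pat.length(), N = txt.length();
        int[] right = new int[R];
        for (int c = 0; c < R; c++) right[c] = -1;
        for (int j = 0; j < M; j++) right[pat.charAt(j)] = j;
        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {  // scan the pattern right to left
                char c = txt.charAt(i + j);
                if (pat.charAt(j) != c) {
                    // j - right[c] covers the first two cases; the max with 1
                    // covers the third (always move at least one position).
                    skip = Math.max(1, j - right[c]);
                    break;
                }
            }
            if (skip == 0) return i;            // match: pattern starts at i
        }
        return N;                               // not found
    }
}
```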
Property On typical inputs, substring search with the Boyer-Moore mismatched character heuristic uses ~N/M character compares to search for a pattern of length M in a text of length N.
The method developed by M.O. Rabin and R.M. Karp is a completely different approach to substring search that is based on hashing. We compute a hash function for the pattern and then look for a match by using the same hash function for each possible M-character substring of the text.
A string of length M corresponds to an M-digit base-R number. To use a hash table of size Q for keys of this type, we need a hash function to convert an M-digit base-R number to an int value between 0 and Q - 1. Modular hashing provides an answer: take the remainder when dividing the number by Q.
With five-digit values, we could just do all the necessary calculations with int values, but what do we do when M is 100 or 1,000? A simple application of Horner's method solves this problem, precisely like the method that we examined for strings and other types of keys with multiple values.
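A sketch of that hash computation with Horner's method (the class name is an assumption; R and Q would be chosen for the application):

```java
public class ModularHash {
    // Treat key.charAt(0..M-1) as an M-digit base-R number and
    // compute its remainder mod Q, one digit at a time (Horner's method),
    // so no intermediate value ever exceeds roughly R * Q.
    public static long hash(String key, int M, int R, long Q) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (h * R + key.charAt(j)) % Q;  // one multiply, add, and mod per digit
        return h;
    }
}
```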
The Rabin-Karp method is based on efficiently computing the hash function for position i + 1 in the text, given its value for position i. It follows directly from a simple mathematical formulation.
We subtract off the leading digit, multiply by R, then add the trailing digit. Now, the crucial point is that we do not have to maintain the values of the numbers, just the values of their remainders when divided by Q. A fundamental property of the modulus operation is that if we take the remainder when divided by Q after each arithmetic operation, then we get the same answer as if we were to perform all of the arithmetic operations, then take the remainder when divided by Q.
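Putting the rolling update together with Horner's method gives a sketch of the Monte Carlo version, which compares hashes only (the class name and the particular prime Q are assumptions of this sketch; in practice Q would be a large random prime):

```java
public class RabinKarp {
    // Monte Carlo Rabin-Karp: report a match whenever hashes agree,
    // accepting a tiny probability of a false match.
    public static int search(String pat, String txt) {
        int R = 256;                     // assumed alphabet size
        long Q = 999999937L;             // an assumed large prime
        int M = pat.length(), N = txt.length();
        if (N < M) return N;
        long RM = 1;                     // R^(M-1) % Q, used to remove the leading digit
        for (int k = 1; k < M; k++)
            RM = (RM * R) % Q;
        long patHash = hash(pat, M, R, Q);
        long txtHash = hash(txt, M, R, Q);
        if (txtHash == patHash) return 0;
        for (int i = M; i < N; i++) {
            // Remove leading digit, multiply by R, add trailing digit -- all mod Q.
            txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            if (txtHash == patHash) return i - M + 1;
        }
        return N;                        // not found
    }

    private static long hash(String key, int M, int R, long Q) {
        long h = 0;                      // Horner's method, mod Q at each step
        for (int j = 0; j < M; j++)
            h = (h * R + key.charAt(j)) % Q;
        return h;
    }
}
```

A Las Vegas version would additionally verify each hash match character by character, trading the small chance of a wrong answer for a small chance of extra work.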
Property The Monte Carlo version of Rabin-Karp substring search is linear-time and extremely likely to be correct, and the Las Vegas version of Rabin-Karp substring search is correct and extremely likely to be linear-time.
The table at the bottom of the page summarizes the algorithms that we have discussed for substring search. As is often the case when we have several algorithms for the same task, each of them has attractive features:
- Brute-force search is easy to implement and works well in typical cases (Java's String uses brute-force search)
- Knuth-Morris-Pratt is guaranteed linear-time with no backup in the input
- Boyer-Moore is sublinear (by a factor of M) in typical situations
- Rabin-Karp is linear
Each also has drawbacks:
- Brute-force might require time proportional to MN
- Knuth-Morris-Pratt and Boyer-Moore use extra space
- Rabin-Karp has a relatively long inner loop (several arithmetic operations, as opposed to character compares in the other methods)
These characteristics are summarized in the table below.