A FUNDAMENTAL OPERATION on strings is substring search: given a text string of length N and a pattern string of length M, find an occurrence of the pattern within the text. Most algorithms for this problem can easily be extended to find all occurrences of the pattern in the text, to count the number of occurrences of the pattern in the text, or to provide context (substrings of the text surrounding each occurrence of the pattern).
To best appreciate the algorithms, think of the pattern as being relatively short (with M equal to, say, 100 or 1,000) and the text as being relatively long (with N equal to, say, 1 million or 1 billion). In substring search, we typically preprocess the pattern in order to be able to support fast searches for that pattern in the text.
Substring search is an interesting and classic problem: several very different (and surprising) algorithms have been discovered that not only provide a spectrum of useful practical methods but also illustrate a spectrum of fundamental algorithm design techniques.
A short history
There is a simple brute-force algorithm for substring search that is in widespread use. While it has a worst-case running time proportional to MN, the strings that arise in many applications lead to a running time that is (except in pathological cases) proportional to M + N. Furthermore, it is well-suited to standard architectural features on most computer systems, so an optimized version provides a standard benchmark that is difficult to beat, even with a clever algorithm.
In 1970, S. Cook proved a theoretical result about a particular type of abstract machine that implied the existence of an algorithm that solves the substring search problem in time proportional to M + N in the worst case. D. E. Knuth and V. R. Pratt laboriously followed through the construction Cook used to prove his theorem (which was not intended to be practical) and refined it into a relatively simple and practical algorithm. This seemed a rare and satisfying example of a theoretical result with immediate (and unexpected) practical applicability. But it turned out that J. H. Morris had discovered virtually the same algorithm as a solution to an annoying problem confronting him when implementing a text editor (he wanted to avoid having to “back up” in the text string). The fact that the same algorithm arose from two such different approaches lends it credibility as a fundamental solution to the problem.
Knuth, Morris, and Pratt didn’t get around to publishing their algorithm until 1976, and in the meantime R. S. Boyer and J. S. Moore (and, independently, R. W. Gosper) discovered an algorithm that is much faster in many applications, since it often examines only a fraction of the characters in the text string. Many text editors use this algorithm to achieve a noticeable decrease in response time for substring search.
Both the Knuth-Morris-Pratt (KMP) and the Boyer-Moore algorithms require some complicated preprocessing on the pattern that is difficult to understand and has limited the extent to which they are used. (In fact, the story goes that an unknown systems programmer found Morris's algorithm too difficult to understand and replaced it with a brute-force implementation.)
In 1980, M. O. Rabin and R. M. Karp used hashing to develop an algorithm almost as simple as the brute-force algorithm that runs in time proportional to M + N with very high probability. Furthermore, their algorithm extends to two-dimensional patterns and text, which makes it more useful than the others for image processing.
This story illustrates that the search for a better algorithm is still very often justified; indeed, one suspects that there are still more developments on the horizon even for this classic problem.
Brute-force substring search
An obvious method for substring search is to check, for each possible position in the text at which the pattern could match, whether it does in fact match. The search() method below operates in this way to find the first occurrence of a pattern string pat in a text string txt.


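A sketch of such a brute-force search() method (the method and variable names follow the text; returning the text length N to signal "not found" is one common convention):

```java
public class BruteForce {
    // Return the index of the first occurrence of pat in txt,
    // or txt.length() if the pattern does not occur.
    public static int search(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++)
                if (txt.charAt(i + j) != pat.charAt(j))
                    break;                 // mismatch: try next position
            if (j == M) return i;          // all M characters matched at i
        }
        return N;                          // not found
    }
}
```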
Proposition Brute-force substring search requires ~NM character compares to search for a pattern of length M in a text of length N, in the worst case.
The alternate implementation is instructive. As before, the program keeps one pointer (i) into the text and another pointer (j) into the pattern. As long as they point to matching characters, both pointers are incremented.


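That alternate implementation can be sketched as follows; on a mismatch it explicitly backs up both pointers, which is exactly the step KMP will later avoid:

```java
public class BruteForceAlt {
    // Alternate brute-force search: i indexes the text, j indexes the pattern.
    // On a mismatch, back up: i to one past the start of the attempted match,
    // j to the beginning of the pattern.
    public static int search(String pat, String txt) {
        int M = pat.length();
        int N = txt.length();
        int i, j;
        for (i = 0, j = 0; i < N && j < M; i++) {
            if (txt.charAt(i) == pat.charAt(j)) j++;
            else { i -= j; j = 0; }        // explicit backup in the text
        }
        if (j == M) return i - M;          // found (i is one past the match)
        else        return N;              // not found
    }
}
```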
Knuth-Morris-Pratt substring search
The basic idea behind the algorithm discovered by Knuth, Morris, and Pratt is this: whenever we detect a mismatch, we already know some of the characters in the text (since they matched the pattern characters prior to the mismatch). We can take advantage of this information to avoid backing up the text pointer over all those known characters.
Backing up the pattern pointer
In KMP substring search, we never back up the text pointer i, and we use an array dfa[][] to record how far to back up the pattern pointer j when a mismatch is detected. For every character c, dfa[c][j] is the pattern position to compare against the next text position after comparing c with pat.charAt(j).
KMP search method
A useful way to describe KMP is in terms of a deterministic finite-state automaton (DFA).


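Given a dfa[][] array built from the pattern as described in the text, the search itself is a simple DFA simulation; the sketch below assumes dfa[c][j] holds the next state after reading character c in state j. Note that i never decreases:

```java
public class KMPSearch {
    // Simulate the pattern DFA on the text (a sketch).
    // Returns the index of the first match, or txt.length() if none.
    public static int search(String txt, int[][] dfa, int M) {
        int N = txt.length();
        int i, j;
        for (i = 0, j = 0; i < N && j < M; i++)
            j = dfa[txt.charAt(i)][j];   // take the transition; no text backup
        if (j == M) return i - M;        // reached accept state: match found
        return N;                        // reached end of text: no match
    }
}
```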
How do we compute the dfa[][] array corresponding to a given pattern? Remarkably, the answer to this question lies in the DFA itself!
What should the DFA do with the next character? Exactly what it would have done if we had backed up, except if it finds a match with pat.charAt(j), when it should go to state j + 1. For example, to decide what the DFA should do when we have a mismatch at j = 5 for A B A B A C, we use the DFA to learn that a full backup would leave us in state 3 for B A B A, so we can copy dfa[][3] to dfa[][5], then set the entry for C to 6 because pat.charAt(5) is C (a match). Since we only need to know how the DFA runs for the first j - 1 characters when we are building the jth state, we can always get the information that we need from the partially built DFA.
The discussion above leads to the remarkably compact code below for constructing the DFA corresponding to a given pattern. For each j, it:
- Copies dfa[][x] to dfa[][j] (for mismatch cases)
- Sets dfa[pat.charAt(j)][j] to j + 1 (for the match case)
- Updates x

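Those three steps can be sketched as follows, where x is the restart state (where the DFA would be after a full backup) and R is the alphabet size, assumed here to be 256 for extended ASCII:

```java
public class KMPBuild {
    // Build the KMP DFA for pat over an R-character alphabet (a sketch).
    // dfa[c][j] = next state after reading character c in state j.
    public static int[][] buildDFA(String pat) {
        int R = 256;                        // assumed alphabet size
        int M = pat.length();
        int[][] dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][x];      // copy mismatch cases from restart state
            dfa[pat.charAt(j)][j] = j + 1;  // set match case
            x = dfa[pat.charAt(j)][x];      // update restart state
        }
        return dfa;
    }
}
```

For the example in the text, building the DFA for A B A B A C leaves the restart state at 3 when j = 5, so column 5 is a copy of column 3 except that the entry for C is 6.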
Proposition Knuth-Morris-Pratt substring search accesses no more than M + N characters to search for a pattern of length M in a text of length N.
Another parameter comes into play: for an R-character alphabet, the total running time (and space) required to build the DFA is proportional to MR. It is possible to remove the factor of R by building a DFA where each state has a match transition and a mismatch transition (not transitions for each possible character), though the construction is somewhat more intricate.
Boyer-Moore substring search
When backup in the text string is not a problem, we can develop a significantly faster substring-searching method by scanning the pattern from right to left when trying to match it against the text. For example, when searching for the substring B A A B B A A, if we find matches on the seventh and sixth characters but not on the fifth, then we can immediately slide the pattern seven positions to the right, and check the 14th character in the text next, because our partial match found X A A where X is not B, which does not appear elsewhere in the pattern.
To implement the mismatched character heuristic, we use an array right[] that gives, for each character in the alphabet, the index of its rightmost occurrence in the pattern (or -1 if the character is not in the pattern). This value tells us precisely how far to skip if that character appears in the text and causes a mismatch during the string search.
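Computing right[] is a short preprocessing pass over the pattern (a sketch, assuming an alphabet size of 256):

```java
public class BoyerMooreSkip {
    // Build the rightmost-occurrence table for pat (a sketch).
    public static int[] buildRight(String pat) {
        int R = 256;                       // assumed alphabet size
        int[] right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;                 // -1 for characters not in the pattern
        for (int j = 0; j < pat.length(); j++)
            right[pat.charAt(j)] = j;      // later occurrences overwrite earlier ones
        return right;
    }
}
```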
Substring search:
- If the text character causing the mismatch does not appear in the pattern, we can slide the pattern one position past the mismatch (by incrementing i by j + 1).
- If the text character c causing the mismatch appears in the pattern and its rightmost occurrence in the pattern is to the left of the mismatch, we can slide the pattern to align that character with its rightmost occurrence in the pattern (by incrementing i by j - right[c]).
- If the text character causing the mismatch appears in the pattern but its rightmost occurrence is to the right of the mismatch, that alignment would not increase i, so we just slide the pattern one position to the right (by incrementing i by 1).


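The three cases above can be combined in one sketch; the test against 1 handles the third case, since j - right[c] can be zero or negative when the rightmost occurrence is at or to the right of the mismatch:

```java
public class BoyerMoore {
    // Boyer-Moore search with the mismatched-character heuristic (a sketch).
    // Returns the index of the first match, or txt.length() if none.
    public static int search(String pat, String txt) {
        int M = pat.length(), N = txt.length();
        int[] right = new int[256];               // rightmost-occurrence table
        for (int c = 0; c < 256; c++) right[c] = -1;
        for (int j = 0; j < M; j++) right[pat.charAt(j)] = j;

        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {    // scan the pattern right to left
                if (pat.charAt(j) != txt.charAt(i + j)) {
                    skip = j - right[txt.charAt(i + j)];
                    if (skip < 1) skip = 1;       // third case: always advance
                    break;
                }
            }
            if (skip == 0) return i;              // full match at position i
        }
        return N;                                 // not found
    }
}
```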
Property On typical inputs, substring search with the Boyer-Moore mismatched character heuristic uses ~N/M character compares to search for a pattern of length M in a text of length N.
Rabin-Karp fingerprint search
The method developed by M. O. Rabin and R. M. Karp is a completely different approach to substring search that is based on hashing. We compute a hash function for the pattern and then look for a match by computing the same hash function for each possible M-character substring of the text.
A string of length M corresponds to an M-digit base-R number. To use a hash table of size Q for keys of this type, we need a hash function to convert an M-digit base-R number to an int value between 0 and Q - 1. Modular hashing provides an answer: take the remainder when dividing the number by Q.
With five-digit values, we could just do all the necessary calculations with int values, but what do we do when M is 100 or 1,000? A simple application of Horner's method works, precisely like the method that we examined for strings and other types of keys with multiple values.


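Horner's method processes the string one digit at a time, with one multiply, one add, and one remainder per digit, so intermediate values never exceed RQ (a sketch; R and Q are parameters here, and a long accumulator avoids overflow):

```java
public class HornerHash {
    // Hash of the first M characters of key, modulo Q, by Horner's method.
    // Treats the string as an M-digit base-R number.
    public static long hash(String key, int M, int R, long Q) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + key.charAt(j)) % Q;  // fold in one digit, reduce mod Q
        return h;
    }
}
```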
The Rabin-Karp method is based on efficiently computing the hash function for position i + 1 in the text, given its value for position i. It follows directly from a simple mathematical formulation.
We subtract off the leading digit, multiply by R, then add the trailing digit. Now, the crucial point is that we do not have to maintain the values of the numbers, just the values of their remainders when divided by Q. A fundamental property of the modulus operation is that if we take the remainder when divided by Q after each arithmetic operation, then we get the same answer as if we were to perform all of the arithmetic operations, then take the remainder when divided by Q.


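Putting the pieces together gives a sketch of the Monte Carlo version: a hash match is reported immediately, without verifying the characters. Q would ordinarily be a large random prime to make false positives extremely unlikely; a small fixed prime is used here only for illustration, and the precomputed value RM = R^(M-1) mod Q is what lets us remove the leading digit:

```java
public class RabinKarp {
    // Monte Carlo Rabin-Karp search (a sketch).
    // Returns the index of the first (hash) match, or txt.length() if none.
    public static int search(String pat, String txt) {
        int M = pat.length(), N = txt.length();
        int R = 256;                 // alphabet size
        long Q = 997;                // illustrative; normally a large random prime
        long RM = 1;                 // R^(M-1) % Q, for removing the leading digit
        for (int k = 1; k < M; k++)
            RM = (R * RM) % Q;
        long patHash = hash(pat, M, R, Q);
        long txtHash = hash(txt, M, R, Q);
        if (txtHash == patHash) return 0;
        for (int i = M; i < N; i++) {
            // Remove leading digit, multiply by R, add trailing digit, all mod Q.
            txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            if (txtHash == patHash) return i - M + 1;  // hash match (Monte Carlo)
        }
        return N;                    // no match
    }

    // Horner's-method hash of the first M characters of key, modulo Q.
    private static long hash(String key, int M, int R, long Q) {
        long h = 0;
        for (int j = 0; j < M; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }
}
```

The Las Vegas version differs only in checking the characters directly after a hash match, so it is always correct at the cost of occasional extra compares.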
Property The Monte Carlo version of Rabin-Karp substring search is linear-time and extremely likely to be correct, and the Las Vegas version of Rabin-Karp substring search is correct and extremely likely to be linear-time.
Summary
The table at the bottom of the page summarizes the algorithms that we have discussed for substring search. As is often the case when we have several algorithms for the same task, each of them has attractive features:
- Brute-force search is easy to implement and works well in typical cases (Java's indexOf() method in String uses brute-force search).
- Knuth-Morris-Pratt is guaranteed linear-time with no backup in the input.
- Boyer-Moore is sublinear (by a factor of M) in typical situations.
- Rabin-Karp is linear.
Each also has drawbacks:
- Brute-force search might require time proportional to MN.
- Knuth-Morris-Pratt and Boyer-Moore use extra space.
- Rabin-Karp has a relatively long inner loop (several arithmetic operations, as opposed to character compares in the other methods).
These characteristics are summarized in the table below.

