Data Structures and Algorithms Rewind: Substring

A FUNDAMENTAL OPERATION on strings is substring search: given a text string of length N and a pattern string of length M, find an occurrence of the pattern within the text. Most algorithms for this problem can easily be extended to find all occurrences of the pattern in the text, to count the number of occurrences of the pattern in the text, or to provide context (substrings of the text surrounding each occurrence of the pattern).

To best appreciate the algorithms, think of the pattern as being relatively short (with M equal to, say, 100 or 1,000) and the text as being relatively long (with N equal to, say, 1 million or 1 billion). In substring search, we typically preprocess the pattern in order to be able to support fast searches for that pattern in the text.

Substring search is an interesting and classic problem: several very different (and surprising) algorithms have been discovered that not only provide a spectrum of useful practical methods but also illustrate a spectrum of fundamental algorithm design techniques.

A short history

There is a simple brute-force algorithm for substring search that is in widespread use. While it has a worst-case running time proportional to MN, the strings that arise in many applications lead to a running time that is (except in pathological cases) proportional to M + N. Furthermore, it is well-suited to standard architectural features on most computer systems, so an optimized version provides a standard benchmark that is difficult to beat, even with a clever algorithm.

In 1970, S. Cook proved a theoretical result about a particular type of abstract machine that implied the existence of an algorithm that solves the substring search problem in time proportional to M + N in the worst case. D. E. Knuth and V. R. Pratt laboriously followed through the construction Cook used to prove his theorem (which was not intended to be practical) and refined it into a relatively simple and practical algorithm. This seemed a rare and satisfying example of a theoretical result with immediate (and unexpected) practical applicability. But it turned out that J. H. Morris had discovered virtually the same algorithm as a solution to an annoying problem confronting him when implementing a text editor (he wanted to avoid having to “back up” in the text string). The fact that the same algorithm arose from two such different approaches lends it credibility as a fundamental solution to the problem.

Knuth, Morris, and Pratt didn’t get around to publishing their algorithm until 1976, and in the meantime R. S. Boyer and J. S. Moore (and, independently, R. W. Gosper) discovered an algorithm that is much faster in many applications, since it often examines only a fraction of the characters in the text string. Many text editors use this algorithm to achieve a noticeable decrease in response time for substring search.

Both the Knuth-Morris-Pratt (KMP) and the Boyer-Moore algorithms require some complicated preprocessing on the pattern that is difficult to understand and has limited the extent to which they are used. (In fact, the story goes that an unknown systems programmer found Morris’s algorithm too difficult to understand and replaced it with a brute-force implementation.)

In 1980, M. O. Rabin and R. M. Karp used hashing to develop an algorithm almost as simple as the brute-force algorithm that runs in time proportional to M + N with very high probability. Furthermore, their algorithm extends to two-dimensional patterns and text, which makes it more useful than the others for image processing.

This story illustrates that the search for a better algorithm is still very often justified; indeed, one suspects that there are still more developments on the horizon even for this classic problem.

An obvious method for substring search is to check, for each possible position in the text at which the pattern could match, whether it does in fact match. The search() method below operates in this way to find the first occurrence of a pattern string pat in a text string txt.

public static int searchBF(String pattern, String text) {
    int pLength = pattern.length();
    int tLength = text.length();
    // Try every position at which the pattern could start,
    // including the last one at tLength - pLength
    for (int i = 0; i <= tLength - pLength; i++) {
        int j = 0;
        for ( ; j < pLength; j++) {
            if (pattern.charAt(j) != text.charAt(i + j)) {
                break;
            }
        }
        if (j == pLength) {
            return i;  // found: pattern starts at i
        }
    }
    return -1;  // not found
}

Proposition Brute-force substring search requires ~NM character compares to search for a pattern of length M in a text of length N, in the worst case.
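To see where the ~NM bound comes from, consider a pattern of A’s ending in B searched for in a text of all A’s: every window matches all but its last character before failing. The harness below is an illustrative sketch (the class name and compare counter are ours, not part of the text); it instruments the brute-force method to count character compares.

```java
public class BruteForceWorstCase {
    static long compares;  // character compares in the most recent search

    // Brute-force search, instrumented to count character compares
    static int searchBF(String pattern, String text) {
        compares = 0;
        int pLength = pattern.length();
        int tLength = text.length();
        for (int i = 0; i <= tLength - pLength; i++) {
            int j = 0;
            for ( ; j < pLength; j++) {
                compares++;
                if (pattern.charAt(j) != text.charAt(i + j)) {
                    break;
                }
            }
            if (j == pLength) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String pattern = "AAAAB";      // M = 5
        String text = "A".repeat(20);  // N = 20, so there is no match
        int result = searchBF(pattern, text);
        // Each of the N - M + 1 = 16 windows costs M = 5 compares: 80 in all
        System.out.println(result + " " + compares);  // prints "-1 80"
    }
}
```

With N equal to 1 million and M equal to 1,000, the same pathological shape costs on the order of a billion compares, which is why the linear-time guarantees discussed below matter.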

The alternate implementation is instructive. As before, the program keeps one pointer (i) into the text and another pointer (j) into the pattern. As long as they point to matching characters, both pointers are incremented.

public static int searchBF1(String pattern, String text) {
    int i = 0;
    int tLength = text.length();
    int j = 0;
    int pLength = pattern.length();
    for ( ; i < tLength && j < pLength; i++) {
        if (text.charAt(i) == pattern.charAt(j)) {
            j++;
        } else {
            i -= j;  // back up the text pointer to the start of the window...
            j = 0;   // ...and restart the pattern (i++ then moves one past it)
        }
    }
    if (j == pLength) {
        return i - pLength;  // found: i is one past the last matched character
    } else {
        return -1;           // not found
    }
}

The basic idea behind the algorithm discovered by Knuth, Morris, and Pratt is this: whenever we detect a mismatch, we already know some of the characters in the text (since they matched the pattern characters prior to the mismatch). We can take advantage of this information to avoid backing up the text pointer over all those known characters.

Backing up the pattern pointer

In KMP substring search, we never back up the text pointer i, and we use an array dfa[][] to record how far to back up the pattern pointer j when a mismatch is detected. For every character c, dfa[c][j] is the pattern position to compare against the next text position after comparing c with pat.charAt(j).

KMP search method

A useful way to describe KMP is in terms of a DFA.

public static int searchKMP(String pattern, String text) {
    int i = 0;
    int tLength = text.length();
    int j = 0;
    int pLength = pattern.length();
    int[][] dfa = constructDFA(pattern);
    for ( ; j < pLength && i < tLength; i++) {
        // Take the DFA transition for the current text character; the text
        // pointer i never backs up. (constructDFA() indexes by c - 'a', so
        // we must subtract 'a' here as well; lowercase a-z is assumed.)
        j = dfa[text.charAt(i) - 'a'][j];
    }
    if (j == pLength) {
        return i - pLength;  // found: reached the accept state
    } else {
        return -1;           // not found: reached the end of the text
    }
}

How do we compute the dfa[][] array corresponding to a given pattern? Remarkably, the answer to this question lies in the DFA itself!

What should the DFA do with the next character? Exactly what it would have done if we had backed up, except if it finds a match with pat.charAt(j), when it should go to state j + 1. For example, to decide what the DFA should do when we have a mismatch at j = 5 for A B A B A C, we use the DFA to learn that a full backup would leave us in state 3 for B A B A, so we can copy dfa[][3] to dfa[][5], then set the entry for C to 6 because pat.charAt(5) is C (a match). Since we only need to know how the DFA runs for the first j - 1 characters when we are building the jth state, we can always get the information that we need from the partially built DFA.

The discussion above leads to the remarkably compact code below for constructing the DFA corresponding to a given pattern. For each j, it:

  • Copies dfa[][x] to dfa[][j] (for mismatch cases)
  • Sets dfa[pat.charAt(j)][j] to j + 1 (for the match case)
  • Updates x
private static int[][] constructDFA(String pattern) {
    int length = pattern.length();
    int[][] dfa = new int[26][length];  // one row per character: lowercase a-z assumed
    dfa[pattern.charAt(0) - 'a'][0] = 1;
    // x is the restart state: where the DFA would be after a full backup
    for (int x = 0, j = 1; j < length; j++) {
        for (int c = 0; c < 26; c++) {
            dfa[c][j] = dfa[c][x];               // copy mismatch cases
        }
        dfa[pattern.charAt(j) - 'a'][j] = j + 1; // set match case
        x = dfa[pattern.charAt(j) - 'a'][x];     // update restart state
    }
    return dfa;
}
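As a concrete check, the sketch below (a self-contained copy of the construction above, under a hypothetical class name, with the pattern in lowercase to satisfy the a–z assumption) builds the DFA for a b a b a c and inspects state 5: its mismatch entries were copied from dfa[][3], and the entry for c was then set to 6.

```java
public class KMPDfaDemo {
    // Same construction as constructDFA() in the text, repeated here so
    // that the example is self-contained (lowercase a-z assumed)
    static int[][] constructDFA(String pattern) {
        int length = pattern.length();
        int[][] dfa = new int[26][length];
        dfa[pattern.charAt(0) - 'a'][0] = 1;
        for (int x = 0, j = 1; j < length; j++) {
            for (int c = 0; c < 26; c++) {
                dfa[c][j] = dfa[c][x];
            }
            dfa[pattern.charAt(j) - 'a'][j] = j + 1;
            x = dfa[pattern.charAt(j) - 'a'][x];
        }
        return dfa;
    }

    public static void main(String[] args) {
        int[][] dfa = constructDFA("ababac");
        // State 5's mismatch entries are copies of state 3 (where a full
        // backup on "baba" would leave us); the match entry goes to 6
        System.out.println(dfa['a' - 'a'][5]);  // prints 1 (copied from dfa['a'][3])
        System.out.println(dfa['b' - 'a'][5]);  // prints 4 (copied from dfa['b'][3])
        System.out.println(dfa['c' - 'a'][5]);  // prints 6 (the match transition)
    }
}
```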

Proposition Knuth-Morris-Pratt substring search accesses no more than M + N characters to search for a pattern of length M in a text of length N.

Another parameter comes into play: for an R-character alphabet, the total running time (and space) required to build the DFA is proportional to MR. It is possible to remove the factor of R by building a DFA where each state has a match transition and a mismatch transition (not transitions for each possible character), but the construction is somewhat more intricate.

When backup in the text string is not a problem, we can develop a significantly faster substring-searching method by scanning the pattern from right to left when trying to match it against the text. For example, when searching for the substring B A A B B A A, if we find matches on the seventh and sixth characters but not on the fifth, then we can immediately slide the pattern seven positions to the right, and check the 14th character in the text next, because our partial match found X A A, where X is not B, and that three-character string does not appear elsewhere in the pattern.

To implement the mismatched character heuristic, we use an array right[] that gives, for each character in the alphabet, the index of its rightmost occurrence in the pattern (or -1 if the character is not in the pattern). This value tells us precisely how far to skip if that character appears in the text and causes a mismatch during the string search.

Substring search:

  • If the string character causing the mismatch does not appear in the pattern, we can shift the pattern one position past the mismatch (by incrementing i by j + 1)
  • If the string character c causing the mismatch appears in the pattern and its rightmost occurrence in the pattern is to the left of the mismatch, we can shift the pattern to align the string character with its rightmost occurrence in the pattern (by incrementing i by j - right[c])
  • If the string character causing the mismatch appears in the pattern but its rightmost occurrence in the pattern is to the right of the mismatch, this rule would not increase i, so we just shift the pattern one position to the right (by incrementing i by 1)
public static int searchBM(String pattern, String text) {
    int[] right = buildRight(pattern);
    int pLength = pattern.length();
    int tLength = text.length();
    int skip;
    for (int i = 0; i <= tLength - pLength; i += skip) {
        skip = 0;
        // Scan the pattern from right to left
        for (int j = pLength - 1; j >= 0; j--) {
            if (pattern.charAt(j) != text.charAt(i + j)) {
                skip = j - right[text.charAt(i + j)];
                if (skip < 1) {
                    skip = 1;  // never move the pattern to the left
                }
                break;
            }
        }
        if (skip == 0) {
            return i;  // found: every pattern character matched
        }
    }
    return -1;  // not found
}

private static int[] buildRight(String pattern) {
    int[] right = new int[256];
    for (int c = 0; c < 256; c++) {
        right[c] = -1;  // -1 for characters not in the pattern
    }
    // rightmost position for each character in the pattern
    for (int i = 0; i < pattern.length(); i++) {
        right[pattern.charAt(i)] = i;
    }
    return right;
}
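To make the skip rules concrete, the sketch below (an illustrative harness with a hypothetical class name, reusing the B A A B B A A pattern from the discussion) prints a few right[] entries and the skip the heuristic computes for a mismatch at j = 4 on a character X not in the pattern. Note that the heuristic alone is conservative here: it slides 5 positions, while the ideal slide described above is 7.

```java
public class BoyerMooreRightDemo {
    // Same table as buildRight() in the text, repeated for self-containment
    static int[] buildRight(String pattern) {
        int[] right = new int[256];
        for (int c = 0; c < 256; c++) {
            right[c] = -1;  // -1 for characters not in the pattern
        }
        for (int i = 0; i < pattern.length(); i++) {
            right[pattern.charAt(i)] = i;
        }
        return right;
    }

    public static void main(String[] args) {
        int[] right = buildRight("BAABBAA");
        System.out.println(right['A']);      // prints 6: rightmost 'A'
        System.out.println(right['B']);      // prints 4: rightmost 'B'
        System.out.println(right['X']);      // prints -1: 'X' not in pattern
        // Mismatch at j = 4 on text character 'X': skip = 4 - (-1) = 5
        System.out.println(4 - right['X']);  // prints 5
    }
}
```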

Property On typical inputs, substring search with the Boyer-Moore mismatched character heuristic uses ~N/M character compares to search for a pattern of length M in a text of length N.

The method developed by M. O. Rabin and R. M. Karp is a completely different approach to substring search that is based on hashing. We compute a hash function for the pattern and then look for a match by using the same hash function for each possible M-character substring of the text.

A string of length M corresponds to an M-digit base-R number. To use a hash table of size Q for keys of this type, we need a hash function to convert an M-digit base-R number to an int value between 0 and Q - 1. Modular hashing provides an answer: take the remainder when dividing the number by Q.

With five-digit values, we could just do all the necessary calculations with int values, but what do we do when M is 100 or 1,000? A simple application of Horner’s method, precisely like the method that we examined for strings and other types of keys with multiple values, provides the answer.

private long hash(String key, int M) {
    // Compute hash for key[0..M-1] by Horner's method; R (the radix) and
    // Q (the prime modulus) are fields of the enclosing class
    long h = 0;
    for (int j = 0; j < M; j++) {
        h = (R * h + key.charAt(j)) % Q;
    }
    return h;
}
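As a worked instance of this idea, the sketch below (with R = 10 and the small prime Q = 997 chosen purely for illustration, and digit characters converted to digit values) uses Horner’s method to compute 26535 % 997 one digit at a time:

```java
public class ModularHashDemo {
    static final int R = 10;    // radix: one decimal digit per character
    static final long Q = 997;  // small prime modulus, for illustration only

    // Horner's method: taking the remainder after every operation keeps
    // the running value below R * Q, so it never overflows
    static long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++) {
            h = (R * h + (key.charAt(j) - '0')) % Q;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("26535", 5));  // prints 613, i.e. 26535 % 997
    }
}
```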

The Rabin-Karp method is based on efficiently computing the hash function for position i + 1 in the text, given its value for position i. It follows directly from a simple mathematical formulation.

We subtract off the leading digit, multiply by R, then add the trailing digit. Now, the crucial point is that we do not have to maintain the values of the numbers, just the values of their remainders when divided by Q. A fundamental property of the modulus operation is that if we take the remainder when divided by Q after each arithmetic operation, then we get the same answer as if we were to perform all of the arithmetic operations, then take the remainder when divided by Q.
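The update can be checked on the same base-10 illustration (R = 10 and Q = 997 remain assumptions for the example): sliding the window from 2 6 5 3 5 to 6 5 3 5 8 via subtract-leading, multiply, add-trailing yields the same value as hashing the new window directly.

```java
public class RollingHashDemo {
    static final int R = 10;    // radix, for the digit illustration
    static final long Q = 997;  // small prime modulus, for illustration only

    static long hash(String key, int M) {
        long h = 0;
        for (int j = 0; j < M; j++) {
            h = (R * h + (key.charAt(j) - '0')) % Q;
        }
        return h;
    }

    // Slide the window: drop the leading digit (weighted by rm = R^(M-1) % Q),
    // multiply by R, add the trailing digit -- taking remainders throughout
    static long roll(long h, char lead, char trail, long rm) {
        h = (h + Q - rm * (lead - '0') % Q) % Q;
        return (h * R + (trail - '0')) % Q;
    }

    public static void main(String[] args) {
        String text = "265358";
        int M = 5;
        long rm = 1;
        for (int i = 1; i < M; i++) {
            rm = (R * rm) % Q;  // ends at 10^4 % 997 = 30
        }
        long h = hash(text, M);  // 613, the hash of "26535"
        long rolled = roll(h, text.charAt(0), text.charAt(M), rm);
        System.out.println(rolled);                      // prints 553
        System.out.println(hash(text.substring(1), M));  // prints 553 as well
    }
}
```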

private static final long PRIME = 997;  // small prime for illustration; a large
                                        // prime makes hash collisions very unlikely
private static final int ALPHABET_SIZE = 256;

private static long hash(String key, int length) {
    long result = 0L;
    for (int i = 0; i < length; i++) {
        result = (ALPHABET_SIZE * result + key.charAt(i)) % PRIME;
    }
    return result;
}

public static int searchRK(String pattern, String text) {
    int pLength = pattern.length();
    int tLength = text.length();
    if (pLength > tLength) {
        return -1;
    }
    long rm = 1;
    // Compute ALPHABET_SIZE ^ (pLength - 1) % PRIME
    for (int i = 1; i <= pLength - 1; i++) {
        rm = (ALPHABET_SIZE * rm) % PRIME;
    }
    long patternHash = hash(pattern, pLength);
    long textHash = hash(text, pLength);
    if (patternHash == textHash) {
        return 0;  // Monte Carlo version: a hash match is reported without
                   // verifying the characters
    }
    for (int i = pLength; i < tLength; i++) {
        // Remove leading digit
        textHash = (textHash + PRIME - rm * text.charAt(i - pLength) % PRIME) % PRIME;
        // Add trailing digit
        textHash = (textHash * ALPHABET_SIZE + text.charAt(i)) % PRIME;
        // Check for match
        if (patternHash == textHash) {
            return i - pLength + 1;
        }
    }
    return -1;
}

Property The Monte Carlo version of Rabin-Karp substring search is linear-time and extremely likely to be correct, and the Las Vegas version of Rabin-Karp substring search is correct and extremely likely to be linear-time.

Summary

The table at the bottom of the page summarizes the algorithms that we have discussed for substring search. As is often the case when we have several algorithms for the same task, each of them has attractive features:

  • Brute-force search is easy to implement and works well in typical cases (Java’s indexOf() method in String uses brute-force search)
  • Knuth-Morris-Pratt is guaranteed linear-time with no backup in the input
  • Boyer-Moore is sublinear (by a factor of M) in typical situations
  • Rabin-Karp is linear

Each also has drawbacks:

  • Brute-force might require time proportional to MN
  • Knuth-Morris-Pratt and Boyer-Moore use extra space
  • Rabin-Karp has a relatively long inner loop (several arithmetic operations, as opposed to character compares in the other methods)

These characteristics are summarized in the table below.

algorithm           version                          guarantee  typical  backup?  correct?  extra space
brute force         --                               MN         1.1N     yes      yes       1
Knuth-Morris-Pratt  full DFA                         2N         1.1N     no       yes       MR
                    mismatch transitions only        3N         1.1N     no       yes       M
Boyer-Moore         full algorithm                   3N         N/M      yes      yes       R
                    mismatched character heuristic   MN         N/M      yes      yes       R
Rabin-Karp          Monte Carlo                      7N         7N       no       yes*      1
                    Las Vegas                        7N*        7N       yes      yes       1

* probabilistic guarantee (see the Property above)