Data Structures and Algorithms Rewind: Tries

As with sorting, we can take advantage of properties of strings to develop search methods (symbol-table implementations) that can be more efficient than the general-purpose methods for typical applications where search keys are strings.

Specifically, the methods that we consider in this section achieve the following performance characteristics in typical applications, even for huge tables:

  • Search hits take time proportional to the length of the search key
  • Search misses involve examining only a few characters

On reflection, these performance characteristics are quite remarkable, one of the crowning achievements of algorithmic technology and a primary factor in enabling the development of the computational infrastructure we now enjoy that has made so much information instantly accessible. Moreover, we can extend the symbol-table API to include character-based operations defined for string keys (but not necessarily for all Comparable types of keys) that are powerful and quite useful in practice, as in the following API:

1
2
3
4
5
6
7
8
9
10
StringST()
void put(String key, Value, val)
Value get(String key)
void delete(String key)
boolean isEmpty()
String longestPrefixOf(String s)
Iterable<String> keysWithPrefix(String s)
Iterable<String> keysThatMatch(String s)
int size()
Iterable<String> keys()

As we saw for sorting with string keys, it is often quite important to be able to work with strings from a specified alphabet. Simple and efficient implementations that are the method of choice for small alphabets turn out to be useless for large alphabets because they consume too much space. In such cases, it is certainly worthwhile to add a constructor that allows clients to specify the alphabet. We will consider the implementation of such a constructor later in this section but omit it from the API for now, in order to concentrate on string keys.

The following descriptions of the three new methods use the keys {she sells sea shells by the sea shore} to give examples:

  • longestPrefixOf() takes a string as argument and returns the longest key in the symbol table that is a prefix of that string. For the keys above, longestPrefixOf("shell") is she and longestPrefixOf("shellsort") is shells
  • keysWithPrefix() takes a string as argument and returns all the keys in the symbol table having that string as prefix. For the keys above, keysWithPrefix("she") is she and shells, and keysWithPrefix("se") is sells and sea
  • keysThatMatch() takes a string as argument and returns all the keys in the symbol table that match that string, in the sense that a period (.) in the argument string matches any character. For the keys above, keysThatMatch(".he") returns she and the, and keysThatMatch("s..") returns she and sea

Tries

In this section, we consider a search tree known as a trie, a data structure built from the characters of the string keys that allows us to use the characters of the search key to guide the search. The name “trie” is a bit of wordplay introduced by E. Fredkin in 1960 because the data structure is used for retrieval, but we pronounce it “try” to avoid confusion with “tree.”

Basic properties

As with search trees, tries are data structures composed of nodes that contain links that are either null or references to other nodes. Each node is pointed to by just one other node, which is called its parent (except root), and each node has R links, where R is the alphabet size.

Each node also has a corresponding value, which may be null or the value associated with one of the string keys in the symbol table. Specifically, we store the value associated with each key in the node corresponding to its last character. It’s very important to bear in mind the following fact: nodes with null values exist to facilitate search in the trie and do not correspond to keys.

1
2
3
4
5
6
7
8
9
10
11
public class TrieST<Value> {
private static int RADIX = 256;
private Node root = new Node();
private static class Node {
private Object val;
private Node[] next = new Node[RADIX];
}
}

Search in a trie

Finding the value associated with a given string key in a trie is a simple process, guided by the characters in the search key. Each node in the trie has a link corresponding to each possible string character. We start at the root, then follow the link associated with the first character in the key; from that node we follow the link associated with the second character in the key; from that node we follow the link associated with the third character in the key and so forth, until reaching the last character of the key or a null link.

  • The value at the node corresponding to the last character in the key is not null (as in the searches for shells and she depicted at left above). This result is a search hit — the value associated with the key is the value in the node corresponding to its last character
  • The value in the node corresponding to the last character in the key is null (as in the search for shell depicted at top right above). This result is a search miss: the key is not in the table
  • The search terminated with a null link (as in the search for shore depicted at bottom right above). This result is also a search miss
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
public Value get(String key) {
Node n = get(root, key, 0);
if (n == null) {
return null;
}
return (Value) n.val;
}
private Node get(Node n, String key, int currentLength) {
if (n == null) {
return null;
}
if (currentLength == key.length()) {
return n;
}
char c = key.charAt(currentLength);
return get(n.next[c], key, currentLength + 1);
}

Insertion into a trie

As with binary search trees, we insert by first doing a search: in a trie that means using the characters of the key to guide us down the trie until reaching the last character of the key or a null link. At this point, one of the following two conditions holds:

  • We encountered a null link before reaching the last character of the key. In this case, there is no trie node corresponding to the last character in the key, so we need to create nodes for each of the characters in the key not yet encountered and set the value in the last one to the value to be associated with the key
  • We encountered the last character of the key before reaching a null link. In this case, we set that node’s value to the value to be associated with the key (whether or not that value is null), as usual with our associative array convention
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public void put(String key, Value value) {
root = put(root, key, value, 0);
}
private Node put(Node node, String key, Value value, int currentLength) {
if (node == null) {
node = new Node();
}
if (currentLength == key.length()) {
node.val = value;
return node;
}
char c = key.charAt(currentLength);
node.next[c] = put(node.next[c], key, value, currentLength + 1);
return node;
}

Node representation

As mentioned at the outset, our trie diagrams do not quite correspond to the data structures our programs will build, because we do not draw null links. Taking null links into account emphasizes the following important characteristics of tries:

  • Every node has R links, one for each possible character
  • Characters and keys are implicitly stored in the data structure
1
2
3
4
5
6
7
8
private static int RADIX = 256;
private Node root = new Node();
private static class Node {
private Object val;
private Node[] next = new Node[RADIX];
}

Size

As for the binary search trees, three straightforward options are available for implementing size():

  • An eager implementation where we maintain the number of keys in an instance variable N
  • A very eager implementation where we maintain the number of keys in a subtrie as a node instance variable that we update after the recursive calls in put() and delete()
  • A lazy recursive implementation like the one at right. It traverses all of the nodes in the trie, counting the number having a non-null value
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
public int size() {
return size(root);
}
private int size(Node node) {
if (node == null) {
return 0;
}
int count = 0;
if (node.val != null) {
count += 1;
}
for (int i = 0; i < RADIX; i++) {
count += size(node.next[i]);
}
return count;
}

Collecting keys

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
public Iterable<String> keys() {
return keysWithPrefix("");
}
private Iterable<String> keysWithPrefix(String prefix) {
List<String> keys = new ArrayList<>();
collect(get(root, prefix, 0), prefix, keys);
return keys;
}
private void collect(Node node, String prefix, List<String> keys) {
if (node == null) {
return;
}
if (node.val != null) {
keys.add(prefix);
}
for (int i = 0; i < RADIX; i++) {
collect(node, prefix + (char) i, keys);
}
}

Wildcard match

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
public Iterable<String> keysThatMatch(String pattern) {
List<String> keys = new ArrayList<>();
collect(root, "", pattern, keys);
return keys;
}
private void collect(Node node, String prefix, String pattern, List<String> keys) {
if (node == null) {
return;
}
int currentLength = prefix.length();
if (currentLength == pattern.length()) {
if (node.val != null) {
keys.add(prefix);
} else {
return;
}
}
char next = pattern.charAt(currentLength);
for (char c = 0; c < RADIX; c++) {
if (next == '.' || next == c) {
collect(node.next[c], prefix + c, pattern, keys);
}
}
}

Longest prefix

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
public String longestPrefixOf(String s) {
int length = search(root, s, 0, 0);
return s.substring(0, length);
}
private int search(Node node, String s, int currentLength, int lastValidLength) {
if (node == null) {
return lastValidLength;
}
if (node.val != null) {
lastValidLength = currentLength;
}
if (currentLength == s.length()) {
return lastValidLength;
}
char c = s.charAt(currentLength);
return search(node.next[c], s, currentLength + 1, lastValidLength);
}

Deletion

The first step needed to delete a key-value pair from a trie is to use a normal search to find the node corresponding to the key and set the corresponding value to null.

  • If that node has a non-null link to a child, then no more work is required
  • If all the links are null, we need to remove the node from the data structure
  • If doing so leaves all the links null in its parent, we need to remove that node, and so forth
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
public void delete(String key) {
root = delete(root, key, 0);
}
private Node delete(Node node, String key, int currentLength) {
if (node == null) {
return null;
}
if (currentLength == key.length()) {
node.val = null;
} else {
char c = key.charAt(currentLength);
node.next[c] = delete(node.next[c], key, currentLength + 1);
}
if (node.val != null) {
return node;
}
// Returns null if all the next pointers are null, otherwise return the node itself
for (char c = 0; c < RADIX; c++) {
if (node.next[c] != null) {
return node;
}
}
return null;
}

Properties of tries

Proposition The linked structure (shape) of a trie is independent of the key insertion/deletion order: there is a unique trie for any given set of keys.

Worst-case time bound for search and insert

Proposition The number of array access when searching in a trie or inserting a key into a trie is at most 1 plus the length of the key.

Expected time bound for search miss

Proposition The average number of nodes examined for search miss in a trie built from N random keys over an alphabet of size R is ~logRN.

From a practical standpoint, the most important implication of this proposition is that search miss does not depend on the key length. For example, it says that unsuccessful search in a trie built with 1 million random keys will require examining only three or four nodes, whether the keys are 7-digit license plates or 20-digit account numbers.

Space

Proposition The number of links in a trie is between RN and RNw, where w is the average key length.

The table on the facing page shows the costs for some typical applications that we have considered. It illustrates the following rules of thumb for tries:

  • When keys are short, the number of links is close to RN
  • When keys are long, the number of links is close to RNw
  • Therefore, decreasing R can save a huge amount of space

One-way branching

The primary reason that trie space is excessive for long keys is that long keys tend to have long tails in the trie, with each node having a single link to the next node (and, therefore, R−1 null links). This situation known as external one-way branching is not difficult to correct. A trie might also have internal one-way branching. For example, two long keys may be equal except for their last character. This situation is a bit more difficult to address.

These changes can make trie space usage a less important factor than for the straightforward implementation that we have considered, but they are not necessarily effective in practical applications. Next, we consider an alternative approach to reducing space usage for tries.

THE BOTTOM LINE is this: do not try to use the sample algorithm for large numbers of long keys taken from large alphabets, because it will require space proportional to R times the total number of key characters. Otherwise, if you can afford the space, trie performance is difficult to beat.

Tenary Search Tries (TSTs)

To help us avoid the excessive space cost associated with R-way tries, we now consider an alternative representation: the ternary search trie (TST). In a TST, each node has a character, three links, and a value. The three links correspond to keys whose current characters are less than, equal to, or greater than the node’s character.

In the R-way tries, trie nodes are represented by R links, with the character corresponding to each non-null link implictly represented by its index. In the corresponding TST, characters appear explicitly in nodes — we find characters corresponding to keys only when we are traversing the middle links.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
public class TernaryST<Value> {
private Node root;
@SuppressWarnings("unchecked")
public Value get(String key) {
Node node = get(root, key, 0);
if (node == null) {
return null;
}
return (Value) node.value;
}
private Node get(Node node, String key, int currentLength) {
if (node == null) {
return null;
}
if (currentLength == key.length()) {
return node;
}
char c = key.charAt(currentLength);
if (c < node.c) {
return get(node.left, key, currentLength);
} else if (c > node.c) {
return get(node.right, key, currentLength);
} else {
return get(node.mid, key, currentLength + 1);
}
}
public void put(String key, Value value) {
root = put(root, key, value, 0);
}
private Node put(Node node, String key, Value value, int currentLength) {
char c = key.charAt(currentLength);
if (node == null) {
node = new Node();
node.c = c;
}
if (c < node.c) {
node.left = put(node.left, key, value, currentLength);
} else if (c > node.c) {
node.right = put(node.right, key, value, currentLength);
} else if (currentLength < key.length() - 1) {
node.mid = put(node.mid, key, value, currentLength + 1);
} else {
node.value = value;
}
return node;
}
private static class Node {
char c;
Node left, mid, right;
Object value;
}
}

Properties of TSTs

Space

Proposition The number of links in a TST built from N string keys of average length w is between 3N and 3Nw.

Search cost

Proposition A search miss in a TST built from N random string keys requires ~lnN character compares on the average. A search hit or an insertion in a TST uses ~lnN+L character compares, where L is the length of the search key.

Summary

As with string sorting, we are naturally interested in how the string-searching methods that we have considered compare to the general-purpose methods that we considered in CHAPTER 3. The following table summarizes the important characteristics of the algorithms that we have discussed in this section (the rows for BSTs, red-black BSTs and hashing are included from CHAPTER 3, for comparison). For a particular application, these entries must be taken as indicative, not definitive, since so many factors (such as characteristics of keys and mix of operations) come into play when studying symbol-table implementations.

1
2
3
4
5
6
7
8
algorithm characters examined memory usage sweet pot
for search miss
binary tree search c1(lnN)^2 64N randomly ordered keys
2-3 tree search c2(lgN)^2 64N guaranteed performance
linear probing w 32N to 128N built-in types cached has values
trie search (R-way) logRN (8R + 56)N short keys, small alphabets
to (8R + 56)Nw
trie search (TST) 1.39lgN 64N to 64Nw nonrandom keys
  • If space is available, R-way tries provide the fastest search, essentially completing the job with a constant number of character compares
  • For large alphabets, where space may not be available for R-way tries, TSTs are preferable, since they use a logarithmic number of character compares, while BSTs use a logarithmic number of key compares
  • Hashing can be competitive, but, as usual, cannot support ordered symbol-table operations or extended character-based API operations such as prefix or wildcard match