Prologue
Data Structures and Algorithms Rewind is a series of notes / articles that will help me rewind all the basic / essential stuff of common data structures / algorithms. Most of the code snippets / contents will come from books below:
 Algorithms 4th
 Introduction to Algorithms 3rd
 Foundations of Algorithms 5th
 Algorithm Design Manual 2nd
 Data Structures and Algorithms Analysis in Java 3rd
(I might tweak the code a little bit so the code / text won’t follow the original one identically. e.g. Using stateofart libraries and proper namings)
Data Structures and Algorithms Rewind: Graph
 Is there a way to connect one item to another by following the connections?
 How many other items are connected to a given item?
 What is the shortest chain of connections between this item and this other item?
To model such situations, we use abstract mathematical objects called graphs.
Undirected Graphs
Definitions and Terms
Definitions:
 A graph is a set of vertices and a collection of edges that each connect a pair of vertices
 A path in a graph is a sequence of vertices connected by edges. A simple path is one with no repeated vertices. A cycle is a path with at least one edge whose first and last vertices are the same. A simple cycle is a cycle with no repeated edges or vertices (except the requisite repetition of the first and last vertices). The length of a path or a cycle is its number of edges
 A graph is connected if there is a path from every vertex to every other vertex in the graph. A graph that is not connected consists of a set of connected components, which are maximal connected subgraphs
 A tree is an acyclic connected graph. A disjoint set of trees is called a forest. A spanning tree of a connected graph is a subgraph that contains all of that graph’s vertices and is a single tree. A spanning forest of a graph is the union of spanning trees of its connected components
Terms: Selfloop, Parallel edges, Adjacent, Incident, Subgraph, Density, Dense, Sparse, Bipartite
Data type


Representations
 Adjacency matrix: Too much space cost
 Array of edges:
adj()
would examine all the edges in the graph  Array of ajacency lists: Satisfies both use cases
Adjacencylist data structure


Graph Processing / Searching


DepthFirst Search
The classic recursive method for searching in a connected graph mimics Tremaux maze explorations. To search a graph, invoke a recursive method that visits vertices. To visit a vertex:
 Mark it as having been visited
 Visit (recursively) all the vertices that are adjacent to it and that have not yet been marked


Proposition: DFS marks all the vertices connected to a given source in time proportional to the sum of their degrees.
We could use DFS to address:
 Conenctivity: Given a graph, support queries of the form Are two given vertices connected? and How many connected components does the graph have?
 Singlesource paths: Given a graph and a source vertex
s
, support queries of the form Is there a path froms
to a given target vertexv
? If so, find such a path


Proposition: DFS allows us to provide clients with a path from given source to any marked vertex in time proportional its length.
BreathFirst Search
Singlesource shortest paths: Given a graph and a source vertex s
, support queries of the form Is there a path from s
to a given target vertex v
? If so, find a shortest such path (one with a minimal number of edges)
DFS is analogous to one person exploring a maze. BFS is analogous to a group of researchers exploring by fanning out in all directions, each unrolling his or her own ball of string.


Proposition: For any vertex v
reachable from s
, BFS computes a shortest path from s
to v
. BFS takes time proportional to V + E
in the worst case.
Connected Components




Proposition: DFS uses preprocessing time and space proportional to V + E
to support constanttime connectivity queries in a graph.
Cycle Detection
Is a given graph acylic?


Twocolorability
Can the vertices of a given graph be assigned one of two colors in such a way that no edge connect vertices of the same color? which is equivalent to: Is the graph bipartite?


Directed Graphs
Glossary
A directed graph (or digraph) is a set of vertices and a collection of directed edges. Each directed edge connects an ordered pair of vertices.
A directed path in a digraph is a sequence of vertices in which there is a (directed) edge pointing from each vertex in the sequence to its successor in the sequence.
A directed cycle is a directed path with at least one edge whose first and last vertices are the same.
A simple path is a path with no repeated vertices.
A simple cycle is a cycle with no repeated edges or vertices (except the requisite repetition of the first and last vertices).
The length of a path or a cycle is its number of edges.
Digraph data type


Implementation:


Reachability in digraphs
Singlesource reachability: Given a digraph and a source vertex s
, support queries of the form Is there a directed path from s
to a given target vertex v
?
Multiplesource reachability: Given a digraph and a set of source vertices, support queries of the form Is there a directed path from some vertex in the set to a given target vertex v
?


Proposition: DFS marks all the vertices in a digraph reachable from a give set of sources in time proportional to the sum of the outdegrees of the vertices marked.
Application: Markandsweep garbage collection.
Questions:
Singlesource directed paths: Given a digraph and a source vertex s
, support queries of the form Is there a directed path from s
to a given target vertex v
? If so, find such a path.


Singlesource shortest directed paths: Given a digraph and a source vertex s
, support queries of the form Is there a directed path from s
to a given target vertex v
? If so, find a shortest such path.


Cycles and DAGs
Directed Cycle Detection
Directed cycle detection: Does a given digraph have a directed cycle? If so, find the vertices on some such cycle, in order from some vertex back to itself.
Definition: A directed acyclic graph (DAG) is a graph with no directed cycles.


When executing doDfs(digraph, v)
, we have followed a directed path from the source to v
. To keep track of this path, DirectedCycle
maintains a vertexindexed array on onStack[]
that marks the vertices on the recursive call stack (by setting onStack[v]
to true
on entry to doDfs(digraph, v)
and to false
on exit). DirectedCycle
also maintains an edgeTo[]
array so that it can return the cycle when it is detected.
Depthfirst orders and topological sort
Precedenceconstrained scheduling amounts to computing a topological order for the vertices of a DAG:


Proposition: A digraph has a topological order if and only if it is a DAG.




Proposition: Reverse postorder in a DAG is a topological sort.
Proposition: With DFS, we can topologically sort a DAG in time proportional to V + E
.
Strong connectivity in digraphs
Definition: Two vertice v
and w
are strongly connected if they are mutually reachable: that is, if there is a directed path from v
to w
and a directed path from w
to v
. A digraph is strongly connected if all its vertices are strongly connected to each other.
It is easy to see that two vertices are strongly connected if and only if there exists a general directed cycle that contains them both.
Strong components: Like connectivity in undirected graphs, strong connectivity in digraphs is an equivalence relation on the set of vertices, as it has the following properties:
 Reflexive: Every vertex
v
is strongly connected to itself  Symmetric: If
v
is strongly connected tow
, thenw
is strongly connected tov
 Transitive: If
v
is strongly connected tow
andw
is strongly connected tox
, thenv
is also strongly connected tox
API:


KosarajuSharir algorithm
 Given a digraph G, use
DepthFirstOrder
to compute the reverse postorder of its reverse digraph GR  Run standard DFS on G, but consider the unmarked vertices in the order just computed instead of the standard numerical order
 All vertices visited on a call to the recursive
dfs()
from the constructor are a strong component


Proposition: The KosarajuSharir algorithm uses preprocessing time and space proportional to V + E
to support constanttime strong connectivity queries in a diagraph.
Reachability revisited
What about pairs of vertices that are not strongly connected? There may be a path from v
to w
or a path from w
to v
, but not both
Allpairs reachability: Given a digraph, support queries of the form Is there a directed path from a given vertex v
to another given vertex w
?
For undirected graphs, the corresponding problem is equivalent to the connectivity problem; for digraphs, it is quite different from the strong connectivity problem.
Definition: The transitive closure of a digraph G is another digraph with the same set of vertices, but with an edge from v
to w
in the transitive closure if and only if w
is reachable from v
in G.
API:




The above solution is ideal for small or dense digraphs, but it is not a solution for the large digraphs we might encounter in practice because the constructor uses space proportional to V^2 and time proportional to V(V+E).
A general solution that achieves constanttime queries with substantially less than quadratic space is an unsolved research problem, with important practical implications.
Minimum Spanning Trees
An edgeweighted graph is a graph model where we associate weights or costs with each edge.
Definition: Recall that a spanning tree of a graph is a connected subgraph with no cycles that includes all the vertices. A minimum spanning tree (MST) of an edgeweighted graph is a spanning tree whose weight (the sum of the weights of its edges) is no larger than the weight of any other spanning tree.
Assumptions:
 The graph is connected
 The edge weights are not necessarily distances
 The edge weights may be zero or negative
 The edge weights are all different (this assumption is not restrictive because our algorithms work without modification in the presence of equal weights)
Underlying principles
 Adding an edge that connects two vertices in a tree creates a unique cycle
 Removing an edge from a tree breaks it into two separate subtrees
Cut property: This property, which we refer to as the cut property, has to do with identifying edges that must be in the MST of a given edgeweighted graph, by dividing vertices into two sets and examining edges that cross the division.
Definition: A cut of a graph is a partition of its vertices into two nonempty disjoint sets. A crossing edge of a cut is an edge that connects a vertex in one set with a vertex in the other.
Proposition (Cut property) Given any cut in an edgeweighted graph, the crossing edge of minimum weight is in the MST of the graph.
Proposition (Greedy MST algorithm) The following method colors black all edges in the MST of any connected edgeweighted graph with V
vertices: starting with all edges colored gray, find a cut with no black edges, color its minimumweight edge black, and continue until V  1
edges have been colored black.
Edgeweighted graph data type


Implementation:




Implementation:


MST API and test client


Test Data


Prim’s algorithm
Prim’s algorithm, is to attach a new edge to a single growing tree at each step. Start with any vertex as a singlevertex tree, than add V  1
edges to it, always taking next (coloring black) the minimum weight edge that connects a vertex on the tree to a vertex not yet on the tree (a crossing edge for the cut defined by tree vertices).
Proposition Prim’s algorithm computes the MST of any connected edgeweighted graph.
Data structures: We implement Prim’s algorithm with the aid of a few simple and familiar data strucutres:
 Vertices on the tree: We use a vertexindexed boolean array
marked[]
, wheremarked[v]
istrue
ifv
is on the tree  Edges in the tree: We use one of two data structures: either a queue
mst
to collect edges in the MST or a vertexindexed arrayedgeTo[]
ofEdge
objects, whereedgeTo[v]
is theEdge
that connectsv
to the tree  Crossing edges: We use a
MinPQ<Edge>
priority queue that compares edges by weight
Maintaining the set of crossing edges Each time we add an edge to the tree, we also add a vertex to the tree. To maintain the set of crossing edges, we need to add to the priority queue all edges from that vertex to any nontree vertex (using marked[]
to identify such edges). Bu we must do more: any edge connecting the vertex just added to a tree vertex that is already on the priority queue now becomes ineligible (it is no longer a crossing edge because it connects two tree vertices).
Lazy version of Prim’s algorithm


Proposition The lazy version of Prim’s algorithm uses space proportional to E
and time proportional to ElogE
(in the worst case) to compute the MST of a connected edgeweighted graph with E
edges and V
vertices.
Eager version of Prim’s algorithm
To improve the LazyPrimMst
, we might try to delete ineligible edges from the priority queue, so that the priority queue contains only the crossing edges between tree vertices and nontree vertices.
But we can eliminate even more edges. The key is to note that our only interest is in the minimal edge from each nontree vertex to a tree vertex. When we add a vertex v
to the tree, the only possible change with respect to each nontree vertex w
is that adding v
brings w
closer than before to the tree.
In short, we do not need to keep on the priority queue all
of the edges from w
to tree vertices  we just need to keep track of the minimumweight edge and check whether the addition of v
to the tree necessiates that we update that minimum (because of an edge vw
that has lower weight), which we can do as we process each edge in v
‘s adjacency list.
In other words, we maintain on the priority queue just one edge for each nontree vertex w
: the lightest edge that connects it to the tree. Any heavier edge connecting w
to the tree will become ineligible at some point, so there is no need to keep it on the priority queue.


Proposition The eager version of Prim’s algorithm uses extra space proportional to V
and time proportional to ElogV
(in the worst case) to compute the MST of a connected edgeweighted graph with E
edges and V
vertices.
Kruskal’s algorithm
Kruskal’s algorithm is to process the edges in order of their weight values (smallest to largest), taking for the MST (coloring black) each edge that does not form a cycle with edges previously added, stopping after adding V  1
edges.
Proposition Kruskal’s algorithm computes the MST of any connected edgeweighted graph


Proposition Kruskal’s algorithm uses space proportional to E
and time proportional to ElogE
(in the worst case) to compute the MST of an edgeweighted connected graph with E
edges and V
vertices.
Perspective


No theorectical results have been developed that deny the existence of an MST algorithm that is guaranteed to run in linear time for all graphs. On the other hand, the goal of developing algorithms for computing the MST of sparse graphs in linear time remains elusive.
Shortest Paths
The possibility of oneway roads means that we will need to consider edgeweighted digraphs. In this model, this problem is easy to formulate:
Find a lowestcost way to get from one vertex to another.
Definition A shortest path from vertex s
to vertex t
in an edgeweighted digraph is a directed path from s
to t
with the propertythat no other such path has a lower weight.
Singlesource shortest paths Given an edgeweighted digraph and a source vertex s
m support queries of the form Is there a directed path from s
to a given target vertex t
? If so, find a shortest such path (one whose total weight is minimal)
This section will cover:
 APIs and implementations for edgeweighted digraphs, and a singlesource shortestpaths API
 The classic Dijkstra’s algorithm for the problem when weights are nonnegative
 A faster algorithm for acyclic edgeweighted digraphs (edgeweighted DAGs) that works even when edge weights can be negative
 The classic BellmanFord algorithm for use in the general case, when cycles may be present, edge weights may be negative, and we need algorithms for finding negativeweight cycles and shortest paths in edgeweighted digraphs with no such cycles
Propperties of shortest paths:
 Paths are directed
 The weights are not necessarily distances
 Not all vertices need be reachable
 Negative weights introduce complications
 Shortest paths are normally simple
 Shortest paths are not necessarily unique
 Parallel edges and selfloop may be present
Definition Given an edgeweighted digraph and a designated source vertex s
, a shortestpaths tree for vertex s
is a subgraph containing s
and all the vertices reachable from s
that forms a directed tree rooted at s
such that every tree path is a shortest path in the digraph.
Edgeweighted digraph data types


Implementation




Implementation


Shortestpaths API:


Data structures for shortest paths
 Edges on the shortestpath tree: As for DFS, BFS, and Prim’s algorithm, we use a parentedge representation in the form of a vertexindexed array
edgeTo[]
ofDirectedEdge
objects, whereedgeTo[v]
is the edge that connectsv
to its parent in the tree  Distance to the source: We use a vertexindexed array
distTo[]
such thatdistTo[v]
is the length of the shortest known path froms
tov
By convention, edgeTo[s]
is null
and distTo[s]
is 0
. We also adopt the convention that distances to vertices that are not reachable from the source are all Double.POSITIVE_INFINITY
.
Edge relaxation To relax an edge v>w
means to test whether the best known way from s
to w
is to go from s
to v
, then take the edge from v
to w
, if so, update data structure to indicate that to be the case.


Vertex relaxation All of our implementations actually relax all the edges pointing from a given vertex as shown in the (overloaded) implementation of relax()
below.


Theoretical basis for shortestpaths algorithms
Proposition (Shortestpaths optimality conditions) Let G be an edgeweighted digraph, with s
a source vertex in G and distTo[]
a vertexindexed array of path lengths in G such that, for all v
reachable from s
, the value of distTo[v]
is the length of some path from s
to v
with distTo[v]
equal to infinity for all v
not reachable from s
. These values are the lengths of shortest paths if and only if they satisfy distTo[s] = 0
and distTo[w] <= distTo[v] + e.weight()
for each edge e
from v
or w
.
Proposition (Generic shortestpaths algorithm) Initialize distTo[s]
to 0
and all other distTo[]
values to infinity, and proceed as follows:
Relax any edge in G, continuing until no edge is eligible
For all vertices w
reachable from s
, the value of distTo[w]
after this computation is the length of a shortest path from s
to w
(and edgeTo[w]
is the last edge on such a path)
Dijkstra’s algorithm
We begin by initializing distTo[s]
to 0
and all other distTo[]
entries to positive infinity, then we relax and add to the tree a nontree vertex with the lowest distTo
value, continuing untill all vertices are on the tree or no nontree vertex has a finite distTo
value.
Data structures To implement Dijkstra’s algorithm we add to our distTo[]
and edgeTo[]
data structures an index priority queue pq
to keep track of vertices that are candidates for being the next to be relaxed.


Proposition Dijkstra’s algorithm use extra space proportional to V
and time proportional to ElogV
(in the worst case) to solve the singlesource shortest paths problem in an edgeweighted digraph with E
edges and V
vertices.
Variants:
 Singlesource shortest paths in undirected graphs
 Sourcesink shortest paths
 Allpairs shortest paths
 Shortest paths in Euclidean graphs
Acyclic edgeweighted digraphs
We now consider an algorithm for finding shortest paths that is simpler and faster than Dijkstra’s algorithm for edgeweighted DAGs. Specifically, it
 Solves the singlesource problem in linear time
 Handles negative edge weights
 Solves related problems, such as finding longest paths
These algorithms are straightforward extensions to the algorithm for topological sort in DAGs.
Specifically, vertex relaxation, in combination with topological sorting, immediately presents a solution to the singlesourceshortestpaths problem for edgeweighted DAGs. We initialize distTo[s]
to 0
and all other dist[]
values to infinity, then relax the vertices, one by one, taking the vertices in topological order.


Proposition By relaxing vertices in topological order, we can solve the singlesource shortestpaths problem for edgeweighted DAGs in time proportional to E + V
.
Longest paths
Singlesource longest paths in edgeweighted DAGs Given an edgeweighted DAG (with negative weights allowed) and a source vertex s
, support queries of the form: Is there a directed path from s
to a given target vertex v
? If so, find a longest such path.
Proposition We can solve the longestpaths problem in edgeweighted DAGs in time proportional to E + V
.
Parallel job scheduling
Parallel precedenceconstrained scheduling Given a set of jobs of specified duration to be completed, with precedence constraints that specify that certain jobs have to be completed before certain other jobs are begun, how can we schedule the jobs on identical processors (as many as needed) such that they are all completed in the minimum amount of time while still respecting the constraints?
Remarkably, a lineartime algorithm is available  an approach known as the critical path method demonstrates that the problem is equivalent to a longestpathss problem in an edgeweighted DAG.
Shortest paths in general edgeweighted digraphs
What if the digraph has negativeweight edges?
Definition A negative cycle in an edgeweighted digraph is a directed cycle whose total weight (sum of the weights of its edges) is negative.
Proposition There exists a shortest path from s
to v
in an edgeweighted digraph if and only if there exists at least one directed path from s
to v
and no vertex on any directed path from s
to v
is on a negative cycle.
Two important problems
Negative cycle detection Does a given edgeweighted digraph have a negative cycle? If it does, find one such cycle.
Singlesource shortest paths when nagative cycles are not reachable Given an edgeweighted digraph and a source s
with no negative cycles reachable from s
, support queries of the form Is their a directed path from s
to a given target vertex v
? If so, find a shortest such path.
BellmanFord algorithm
Proposition (BellmanFord algorithm) The following method solves the singlesource shortestpaths problem from a given source s
for any edgeweighted digraph with V vertices and no negative cycles reachable from s
: Initialize distTo[s]
to 0 and all other distTo[]
values to infinity. Then considering the digraph’s edges in any order, relax all edges. Make V
such pass.
Proposition The BellmanFord algorithm takes time proportional to EV
and extra space proportional
to V
.


We do not consider this version in detail because it always relaxes VE edges, and a simple modification makes the algorithm much more efficient for typical applications.
Queuebased BellmanFord
Specifically, we can easily determine a priori that numerous edges are not going to lead to a sucessful relaxation in any given pass: the only edges that could lead to a change in distTo[]
are thoseleaving a vertex whose distTo[]
value changed in the previous pass. To keep track of such vertices, we use a FIFO queue.
Implementation:


Proposition The queuebased implementation of the BellmanFord algorithm solves the single souce shortestpaths problem from a given source s
(or finds a negative cycle reachable from s
) for any edgeweighted digraph with E
edges and V
vertices, in time proportional to EV
and extra space proportional to V
, in the worst case.
Proof If there is no negative cycle reachable from s
, the algorithm terminates after relaxation corresponding to the (V  1)
st pass of the generic algorithm (since all shortest paths have fewer than V
edges). If there does exist a negative cycle reachable from s
, the queue never empties. If any edge is relaxed during the Vth
pass of the generic algorithm, then the edgeTo[]
array has a directed cycle and such cycle is a negative cycle.
Arbitrage
Proposition The arbitrage problem is a negativecycledetection problem in edgeweighted digraphs.
Proof Replace each weight by its logarithm, negated. With this change, computing path weights by multiplying edge weights edge weights in the original problem corresponds to adding them in the transformed problem. Specifically, any product w1w2...wk
corresponds to a sum ln(w1)ln(w2)...ln(wk)
. The transformed edge weights might be negative or positive, a path from v
to w
gives a way of converting from currency v
to currency w
, and any negative cycle is an arbitrage opportunity.