Java Concurrency in Practice Notes: Performance and Scalability

One of the primary reasons to use threads is to improve performance. Using threads can improve resource utilization by letting applications more easily exploit available processing capacity, and improve responsiveness by letting applications begin processing new tasks immediately while existing tasks are still running.

While better performance is often desirable - and improving performance can be very satisfying - safety always comes first. First make your program right, then make it fast - and then only if your performance requirements and measurements tell you it needs to be faster.

Performance and Scalability

Thinking about performance

Improving performance means doing more work with fewer resources. When the performance of an activity is limited by availability of a particular resource, we say it is bound by that resource: CPU-bound, database-bound, etc.

While the goal may be to improve performance overall, using multiple threads always introduces some performance costs compared to the single-threaded approach. These include overhead associated with coordinating between threads (locking, signaling, and memory synchronization), increased context switching, thread creation and teardown, and scheduling overhead.

In using concurrency to achieve better performance, we are trying to do two things:

  • utilize the processing resources we have more effectively
  • enable our program to exploit additional processing resources if they become available

From a performance monitoring perspective, this means we are looking to keep the CPUs as busy as possible.

Performance versus scalability

Scalability describes the ability to improve throughput or capacity when additional computing resources (such as additional CPUs, memory, storage, or I/O bandwidth) are added.

Designing and tuning concurrent applications for scalability can be very different from traditional performance optimization.

  • When tuning for performance, the goal is usually to do the same work with less effort, such as by reusing previously computed results through caching or by replacing an algorithm with a better one
  • When tuning for scalability, you are instead trying to find ways to parallelize the problem so you can take advantage of additional processing resources to do more work with more resources

These two aspects of performance - how fast and how much - are completely separate, and sometimes even at odds with each other. In order to achieve higher scalability or better hardware utilization, we often end up increasing the amount of work done to process each individual task, such as when we divide tasks into multiple “pipelined” subtasks.

A monolithic application where presentation, business logic, and persistence are intertwined would almost certainly provide better performance for the first unit of work than would a well-factored multitier implementation distributed over multiple systems.

However, when the monolithic system reaches its processing capacity, we could have a serious problem: it may be prohibitively difficult to significantly increase capacity. So we often accept the performance costs of longer service time or greater computing resources used per unit of work so that our application can scale to handle greater load by adding more resources.

Of the various aspects of performance, the “how much” aspects - scalability, throughput, and capacity - are usually of greater concern for server applications than the “how fast” aspects.

Evaluating performance tradeoffs

Avoid premature optimization. First make it right, then make it fast - if it is not already fast enough.

Most performance decisions involve multiple variables and are highly situational. Before deciding that one approach is “faster” than another, ask yourself some questions:

  • What do you mean by “faster”?
  • Under what conditions will this approach actually be faster? Under light or heavy load? With large or small data sets? Can you support your answer with measurements?
  • How often are these conditions likely to arise in your situation? Can you support your answer with measurements?
  • Is this code likely to be used in other situations where the conditions may be different?
  • What hidden costs, such as increased development or maintenance risk, are you trading for this improved performance? Is this a good tradeoff?

The quest for performance is probably the single greatest source of concurrency bugs. The belief that synchronization was “too slow” led to many clever-looking but dangerous idioms for reducing synchronization (such as double-checked locking). Worse, when you trade safety for performance, you may get neither.

Measure, don’t guess.

Amdahl’s law

If one of our primary reasons for using threads is to harness the power of multiple processors, we must also ensure that the problem is amenable to parallel decomposition and that our program effectively exploits this potential for parallelization.

Amdahl’s law describes how much a program can theoretically be sped up by additional computing resources, based on the proportion of parallelizable and serial components. If F is the fraction of the calculation that must be executed serially, then Amdahl’s law says that on a machine with N processors, we can achieve a speedup of at most:

Speedup <= 1 / (F + (1 - F) / N)
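For example, if F = 0.1 (10% of the work must run serially), then with N = 10 processors the speedup is at most 1 / (0.1 + 0.9 / 10) ≈ 5.3, and no number of processors can push it past 1 / F = 10.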

The figure below shows the maximum possible processor utilization for varying degrees of serial execution and numbers of processors.

[Figure: maximum processor utilization under Amdahl’s law]

In order to predict what kind of speedup is possible from running your application on a multiprocessor system, you also need to identify the sources of serialization in your tasks.

import java.util.concurrent.BlockingQueue;

public class WorkerThread extends Thread {
    private final BlockingQueue<Runnable> queue;

    public WorkerThread(BlockingQueue<Runnable> queue) {
        this.queue = queue;
    }

    public void run() {
        while (true) {
            try {
                Runnable task = queue.take();
                task.run();
            } catch (InterruptedException e) {
                break; /* Allow thread to exit */
            }
        }
    }
}

At first glance, it may appear that the code above is completely parallelizable: tasks do not wait for each other, and the more processors available, the more tasks can be processed concurrently. However, there is a serial component as well - fetching the task from the work queue.

This example also ignores another common source of serialization: result handling. If results are placed in a shared container, access to that container is serialized; if instead each thread maintains its own data structure whose contents are merged after all the tasks are performed, then the final merge is a source of serialization. A minimal sketch of the latter pattern follows.
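Here is a small illustrative sketch of that pattern (the PartialResultsDemo class and its workload are hypothetical, chosen only to show where the serial phase sits):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartialResultsDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(4);
        List<Future<List<Long>>> partials = new ArrayList<Future<List<Long>>>();

        // Parallel phase: each task builds a private result list, with no sharing.
        for (int t = 0; t < 4; t++) {
            final long offset = t * 1000L;
            partials.add(exec.submit(new Callable<List<Long>>() {
                public List<Long> call() {
                    List<Long> local = new ArrayList<Long>();
                    for (long i = offset; i < offset + 1000; i++)
                        local.add(i * i);
                    return local;
                }
            }));
        }

        // Serial phase: a single thread merges all the partial results.
        List<Long> merged = new ArrayList<Long>();
        for (Future<List<Long>> f : partials)
            merged.addAll(f.get());
        exec.shutdown();
        System.out.println("merged " + merged.size() + " results");
    }
}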

All concurrent applications have some sources of serialization; if you think yours does not, look again.

Example: serialization hidden in frameworks

[Figure: throughput comparison of queue implementations]

The figure above shows a simple application in which multiple threads repeatedly remove an element from a shared Queue and process it. The throughput of ConcurrentLinkedQueue continues to improve until the thread count reaches the number of processors, and then remains mostly constant. On the other hand, the throughput of the synchronized LinkedList shows some improvement up to three threads, but then falls off as synchronization overhead increases.

The difference in throughput comes from differing degrees of serialization between the two queue implementations. The synchronized LinkedList guards the entire queue state with a single lock that is held for the duration of the offer or remove call; ConcurrentLinkedQueue uses a sophisticated non-blocking queue algorithm that uses atomic references to update individual link pointers.
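A minimal sketch of the two access patterns being compared (the QueueAccessPatterns class and its methods are illustrative, not the benchmark code itself):

import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueAccessPatterns {
    private final Queue<String> concurrent = new ConcurrentLinkedQueue<String>();
    private final LinkedList<String> synchronizedList = new LinkedList<String>();

    // Non-blocking, CAS-based removal; threads rarely block each other.
    public String takeConcurrent() {
        return concurrent.poll();
    }

    // The whole list is guarded by one lock held for the duration of the call.
    public String takeSynchronized() {
        synchronized (synchronizedList) {
            return synchronizedList.isEmpty() ? null : synchronizedList.removeFirst();
        }
    }
}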

Applying Amdahl’s law qualitatively

When evaluating an algorithm, thinking “in the limit” about what would happen with hundreds or thousands of processors can offer some insight into where scaling limits might appear.

Costs introduced by threads

Scheduling and interthread coordination have performance costs; for threads to offer a performance improvement, the performance benefits of parallelization must outweigh the costs introduced by concurrency.

Context switching

A context switch requires saving the execution context of the currently running thread and restoring the execution context of the newly scheduled thread.

When a new thread is switched in, the data it needs is unlikely to be in the local processor cache, so a context switch causes a flurry of cache misses, and thus threads run a little more slowly when they are first scheduled. This is one of the reasons that schedulers give each runnable thread a certain minimum time quantum even if many other threads are waiting: it amortizes the cost of the context switch and its consequences over more uninterrupted execution time, improving overall throughput.

When a thread blocks because it is waiting for a contended lock, the JVM usually suspends the thread and allows it to be switched out. If threads block frequently, they will be unable to use their full scheduling quantum.

Memory synchronization

The performance cost of synchronization comes from several sources. The visibility guarantees provided by synchronized and volatile may entail using special instructions called memory barriers that can flush or invalidate caches, flush hardware write buffers, and stall execution pipelines. Memory barriers may also have indirect performance consequences because they inhibit other compiler optimizations; most operations cannot be reordered across memory barriers.

// Synchronizing on a brand-new object that no other thread can ever see
synchronized (new Object()) {
    // this lock can never be contended
}

The JVM can optimize away the lock in the code above, because the lock object is never shared with another thread and therefore can never be contended (lock elision).

Don’t worry excessively about the cost of uncontended synchronization. The basic mechanism is already quite fast, and JVMs can perform additional optimizations that further reduce or eliminate the cost. Instead, focus optimization efforts on areas where lock contention actually occurs.

Blocking

Uncontended synchronization can be handled entirely within the JVM; contended synchronization may require OS activity, which adds to the cost. When locking is contended, the losing threads must block.

The JVM can implement blocking either via spin-waiting (repeatedly trying to acquire the lock until it succeeds) or by suspending the blocked thread through the OS.
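As a toy illustration of spin-waiting only (the SpinLock class here is a hypothetical sketch; real JVMs choose between spinning and suspension adaptively, and production code should use the JDK’s locks):

import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Keep retrying the acquire instead of asking the OS to suspend the thread.
        while (!locked.compareAndSet(false, true))
            Thread.onSpinWait(); // busy-wait hint to the CPU (Java 9+)
    }

    public void unlock() {
        locked.set(false);
    }
}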

Reducing lock contention

Serialization hurts scalability and context switches hurt performance; contended locking causes both, so reducing lock contention can improve both performance and scalability.

The principal threat to scalability in concurrent applications is the exclusive resource lock.

Two factors influence the likelihood of contention for a lock: how often that lock is requested and how long it is held once acquired.

There are three ways to reduce lock contention:

  • Reduce the duration for which locks are held
  • Reduce the frequency with which locks are requested
  • Replace exclusive locks with coordination mechanisms that permit greater concurrency

Narrowing lock scope (“Get in, get out”)

An effective way to reduce the likelihood of contention is to hold locks as briefly as possible. This can be done by moving code that doesn’t require the lock out of synchronized blocks, especially for expensive operations and potentially blocking operations such as I/O.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import net.jcip.annotations.GuardedBy;
import net.jcip.annotations.ThreadSafe;

@ThreadSafe
public class AttributeStore {
    @GuardedBy("this")
    private final Map<String, String> attributes = new HashMap<String, String>();

    public synchronized boolean userLocationMatches(String name, String regexp) {
        String key = "users." + name + ".location";
        String location = attributes.get(key);
        if (location == null)
            return false;
        else
            return Pattern.matches(regexp, location);
    }
}

AttributeStore shows an example of holding a lock longer than necessary. The entire userLocationMatches method is synchronized, but the only portion of the code that actually needs the lock is the call to Map.get.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import net.jcip.annotations.GuardedBy;
import net.jcip.annotations.ThreadSafe;

@ThreadSafe
public class BetterAttributeStore {
    @GuardedBy("this")
    private final Map<String, String> attributes = new HashMap<String, String>();

    public boolean userLocationMatches(String name, String regexp) {
        String key = "users." + name + ".location";
        String location;
        synchronized (this) {
            location = attributes.get(key);
        }
        if (location == null)
            return false;
        else
            return Pattern.matches(regexp, location);
    }
}

BetterAttributeStore rewrites AttributeStore to significantly reduce the duration for which the lock is held. Reducing the scope of the lock substantially reduces the number of instructions that are executed with the lock held. By Amdahl’s law, this removes an impediment to scalability because the amount of serialized code is reduced.

Because AttributeStore has only one state variable, attributes, we can improve it further with the technique of delegating thread safety: replace attributes with a thread-safe Map and the explicit synchronization goes away entirely, as sketched below.
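A minimal sketch of that delegation (the class name DelegatingAttributeStore and the setAttribute method are illustrative, not from the book):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;
import net.jcip.annotations.ThreadSafe;

@ThreadSafe
public class DelegatingAttributeStore {
    // Thread safety is delegated to the ConcurrentMap; no synchronized blocks needed.
    private final ConcurrentMap<String, String> attributes =
            new ConcurrentHashMap<String, String>();

    public boolean userLocationMatches(String name, String regexp) {
        String location = attributes.get("users." + name + ".location");
        return location != null && Pattern.matches(regexp, location);
    }

    public void setAttribute(String key, String value) {
        attributes.put(key, value);
    }
}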

While shrinking synchronized blocks can improve scalability, a synchronized block can be too small - operations that need to be atomic (such as updating multiple variables that participate in an invariant) must be contained in a single synchronized block.

Reducing lock granularity

The other way to reduce the fraction of time that a lock is held is to have threads ask for it less often. This can be accomplished by lock splitting and lock striping, which involve using separate locks to guard multiple independent state variables previously guarded by a single lock.

import java.util.HashSet;
import java.util.Set;
import net.jcip.annotations.GuardedBy;
import net.jcip.annotations.ThreadSafe;

@ThreadSafe
public class ServerStatusBeforeSplit {
    @GuardedBy("this") public final Set<String> users;
    @GuardedBy("this") public final Set<String> queries;

    public ServerStatusBeforeSplit() {
        users = new HashSet<String>();
        queries = new HashSet<String>();
    }

    public synchronized void addUser(String u) {
        users.add(u);
    }

    public synchronized void addQuery(String q) {
        queries.add(q);
    }

    public synchronized void removeUser(String u) {
        users.remove(u);
    }

    public synchronized void removeQuery(String q) {
        queries.remove(q);
    }
}

ServerStatusBeforeSplit shows a portion of the monitoring interface for a database server that maintains the set of currently logged-on users and the set of currently executing queries.

Instead of guarding both users and queries with the ServerStatus lock, we can instead guard each with a separate lock, as shown below.

import java.util.HashSet;
import java.util.Set;
import net.jcip.annotations.GuardedBy;
import net.jcip.annotations.ThreadSafe;

@ThreadSafe
public class ServerStatusAfterSplit {
    @GuardedBy("users") public final Set<String> users;
    @GuardedBy("queries") public final Set<String> queries;

    public ServerStatusAfterSplit() {
        users = new HashSet<String>();
        queries = new HashSet<String>();
    }

    public void addUser(String u) {
        synchronized (users) {
            users.add(u);
        }
    }

    public void addQuery(String q) {
        synchronized (queries) {
            queries.add(q);
        }
    }

    public void removeUser(String u) {
        synchronized (users) {
            users.remove(u);
        }
    }

    public void removeQuery(String q) {
        synchronized (queries) {
            queries.remove(q);
        }
    }
}

Lock striping

Splitting a heavily contended lock into two is likely to result in two heavily contended locks. While this will produce a small scalability improvement by enabling two threads to execute concurrently instead of one, it still does not dramatically improve prospects for concurrency on a system with many processors.

Lock splitting can sometimes be extended to partition locking on a variable-sized set of independent objects, in which case it is called lock striping. For example, the implementation of ConcurrentHashMap uses an array of 16 locks, each of which guards 1/16 of the hash buckets; bucket N is guarded by lock N mod 16. Assuming the hash function provides reasonable spreading characteristics and keys are accessed uniformly, this should reduce the demand for any given lock by approximately a factor of 16. It is this technique that enables ConcurrentHashMap to support up to 16 concurrent writers.

One of the downsides of lock striping is that locking the collection for exclusive access is more difficult and costly than with a single lock. Usually an operation can be performed by acquiring at most one lock, but occasionally you need to lock the entire collection, as when ConcurrentHashMap needs to expand the map and rehash the values into a larger set of buckets.

import net.jcip.annotations.ThreadSafe;

@ThreadSafe
public class StripedMap {
    // Synchronization policy: buckets[n] guarded by locks[n % N_LOCKS]
    private static final int N_LOCKS = 16;
    private final Node[] buckets;
    private final Object[] locks;

    private static class Node {
        Node next;
        Object key;
        Object value;
    }

    public StripedMap(int numBuckets) {
        buckets = new Node[numBuckets];
        locks = new Object[N_LOCKS];
        for (int i = 0; i < N_LOCKS; i++)
            locks[i] = new Object();
    }

    private final int hash(Object key) {
        return Math.abs(key.hashCode() % buckets.length);
    }

    public Object get(Object key) {
        int hash = hash(key);
        synchronized (locks[hash % N_LOCKS]) {
            for (Node m = buckets[hash]; m != null; m = m.next)
                if (m.key.equals(key))
                    return m.value;
        }
        return null;
    }

    public void clear() {
        // Clears one bucket at a time, so the map is never locked as a whole
        // (and the clear is not atomic with respect to other operations).
        for (int i = 0; i < buckets.length; i++) {
            synchronized (locks[i % N_LOCKS]) {
                buckets[i] = null;
            }
        }
    }
}

Avoiding hot fields

Lock splitting and lock striping can improve scalability because they enable different threads to operate on different data without interfering with each other. A program that would benefit from lock splitting necessarily exhibits contention for a lock more often than for the data guarded by the lock.

Lock granularity cannot be reduced when there are variables that are required for every operation. This is yet another area where raw performance and scalability are often at odds with each other; common optimizations such as caching frequently computed values can introduce “hot fields” that limit scalability.

Alternatives to exclusive locks

A third technique for mitigating the effect of lock contention is to forgo exclusive locks in favor of a more concurrency-friendly means of managing shared state. These include the concurrent collections, read-write locks, immutable objects, and atomic variables.

ReadWriteLock enforces a multiple-reader, single-writer locking discipline: more than one reader can access the shared resource concurrently so long as none of them wants to modify it, but writers must acquire the lock exclusively. For read-mostly data structures, ReadWriteLock can offer greater concurrency than exclusive locking; for read-only data structures, immutability can eliminate the need for locking entirely.
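A minimal sketch of a read-mostly map guarded by a ReadWriteLock (the ReadMostlyMap class is illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadMostlyMap<K, V> {
    private final Map<K, V> map = new HashMap<K, V>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public V get(K key) {
        lock.readLock().lock();      // shared: many readers may hold this at once
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public V put(K key, V value) {
        lock.writeLock().lock();     // exclusive: writers go one at a time
        try {
            return map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}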

Atomic variables offer a means of reducing the cost of updating “hot fields” such as statistics counters, sequence generators, or the reference to the first node in a linked data structure.
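For example, a hot statistics counter can be maintained with an AtomicLong instead of a lock (the HitCounter class is illustrative):

import java.util.concurrent.atomic.AtomicLong;

public class HitCounter {
    private final AtomicLong hits = new AtomicLong();

    public void record() {
        hits.incrementAndGet();   // lock-free read-modify-write via CAS
    }

    public long current() {
        return hits.get();
    }
}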

Monitoring CPU utilization

Tools like vmstat and mpstat on Unix systems or perfmon on Windows systems can tell you just how “hot” the processors are running.

If the CPUs are asymmetrically utilized, your first goal should be to find increased parallelism in your program. Asymmetric utilization indicates that most of the computation is going on in a small set of threads, and your application will not be able to take advantage of additional processors.

If the CPUs are not fully utilized, you need to figure out why. There are several likely causes:

  • Insufficient load
  • I/O-bound
  • Externally bound
  • Lock contention

Just say no to object pooling

The performance of object allocation and garbage collection has improved substantially, which makes object pooling unnecessary. Object pooling is yet another technique intended as a performance optimization but that turned into a scalability hazard.

Example: Comparing Map performance

[Figure: throughput comparison of Map implementations]

The major scalability impediment for the synchronized Map implementations is that there is a single lock for the entire map, so only one thread can access the map at a time. On the other hand, ConcurrentHashMap does no locking for most successful read operations, and uses lock striping for write operations and those few read operations that do require locking. As a result, multiple threads can access the Map concurrently without blocking.
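A minimal sketch of the two Map flavors being compared (the MapChoices class is illustrative):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapChoices {
    // One lock for the whole map: every get/put serializes on it.
    private final Map<String, String> syncMap =
            Collections.synchronizedMap(new HashMap<String, String>());

    // Striped, mostly lock-free internals: readers usually do not block at all.
    private final Map<String, String> concMap =
            new ConcurrentHashMap<String, String>();
}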

Reducing context switch overhead

Many tasks involve operations that may block; transitioning between the running and blocked states entails a context switch. One source of blocking in server applications is generating log messages in the course of processing requests; to illustrate how throughput can be improved by reducing context switches, we’ll analyze the scheduling behavior of two logging approaches.

Most logging frameworks are thin wrappers around println; when you have something to log, just write it out right then and there. Another approach is to perform logging in a dedicated background thread instead of in the requesting thread, as sketched below.
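A minimal sketch of the background-thread approach (the BackgroundLogger class is illustrative, not a production logging framework):

import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BackgroundLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    private final PrintWriter writer;
    private final Thread loggerThread;

    public BackgroundLogger(PrintWriter writer) {
        this.writer = writer;
        this.loggerThread = new Thread(new Runnable() {
            public void run() {
                drainQueue();
            }
        }, "logger");
    }

    public void start() {
        loggerThread.start();
    }

    // Callers hand the message off and continue; the blocking I/O happens
    // on the single logger thread.
    public void log(String msg) throws InterruptedException {
        queue.put(msg);
    }

    private void drainQueue() {
        try {
            while (true)
                writer.println(queue.take());
        } catch (InterruptedException e) {
            // allow the logger thread to exit
        } finally {
            writer.flush();
        }
    }
}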

To some extent, we are just moving the work around, moving the I/O to a thread where its cost isn’t perceived by the user. But by moving all the logging I/O to a single thread, we also eliminate the chance of contention for the output stream and thus eliminate a source of blocking. This improves overall throughput because fewer resources are consumed in scheduling, context switching, and lock management.

Summary

Because one of the most common reasons to use threads is to exploit multiple processors, in discussing the performance of concurrent applications, we are usually more concerned with throughput or scalability than we are with raw service time.

Amdahl’s law tells us that the scalability of an application is driven by the proportion of code that must be executed serially. Since the primary source of serialization in Java programs is the exclusive resource lock, scalability can often be improved by spending less time holding locks, either by reducing lock granularity, reducing the duration for which locks are held, or replacing exclusive locks with nonexclusive or nonblocking alternatives.

(To Be Continued)