Highlights of Cantrill and Bonwick’s “Real-world Concurrency”

Reference: Bryan Cantrill and Jeff Bonwick (Sun Microsystems), “Real-world Concurrency,” Communications of the ACM 51, 11 (Nov. 2008), 34–39. DOI: 10.1145/1400214.1400227.

 

  1. “Many have built highly concurrent, highly scalable […] systems without explicitly creating a single thread or acquiring a single lock; it is concurrency by architecture instead of by implementation.” [Emphasis added.]
  2. Because “writing parallel code remains something of a black art, […] we try both to argue that most code can (and should) achieve concurrency without explicit parallelism, and at the same time to elucidate techniques for those who must write explicitly parallel code. This article is half stern lecture on the merits of abstinence and half Kama Sutra.”

 

Cantrill and Bonwick’s paper is in three parts:

  1. Introduction and history
  2. Why you are unlikely to write much explicitly parallel code
  3. Practical tips for those who do need to write such code

 

Introduction and history

Introduction   

  1. It’s been said that with the rise of multicore “the free lunch [of ever-rising clock rates] is over”. But don’t believe the hype.
  2. Although, of course, “writing multithreaded code […] remains, in a word, hard”, “few software engineers actually need to write multithreaded code: for most, concurrency can be achieved by standing on the shoulders of those subsystems that already are highly parallel in implementation. Those practitioners who are implementing a database or an operating system or a virtual machine will continue to need to sweat the details of writing multithreaded code, but for everyone else, the challenge is not how to implement those components but rather how best to use them to deliver a scalable system. While lunch might not be exactly free, it is practically all-you-can-eat—and the buffet is open!”
  3. “What does the proliferation of concurrency mean for the software that you develop? […] — Your software’s relationship to concurrency depends on where it physically executes, where it is in the stack of abstraction, and the business model that surrounds it.”
  4. “You may be able to leave some of your code forever executing in sequential bliss, and some may need to be highly parallel and explicitly multithreaded. […] Much of your code will […] be essentially sequential in nature but will need to be aware of concurrency at some level.”

 

History

Summary: “It was the maturity of concurrent software that led architects to consider concurrency on the die.” (“The ‘free lunch’ that some decry as being over is in fact, at long last, being served.”)

  1. It became apparent soon after the dawn of electronic computing that “a single central processing unit executing a single instruction stream would result in unnecessarily limited system performance”. 
  2. The approach of the 1961 “Burroughs B5000 […] ultimately proved to be the way forward: disjoint CPUs concurrently executing different instruction streams but sharing a common memory.”  But it “was not until the 1980s” that multiprocessing became a common research focus, producing work on cache coherence protocols, prototype parallel operating systems, and parallel databases.
  3. “In the 1990s, […] many computer companies [placed] big bets on symmetric multiprocessing” and on the concurrent software needed to run such machines — “if an operating system cannot execute in parallel, neither can much else in the system” — realizing “that their operating systems would need to be rewritten around the notion of concurrent execution”, as in “OpenSolaris, FreeBSD, and Linux”.
  4. Also in the 1990s, Oracle and others bet big on “highly parallel relational databases” and displaced the old mainframe-based approach to transactions.
  5. By the end of the 1990s there were no uniprocessors on the TOP500 list of supercomputers, and “many transaction-oriented applications scaled with CPU, allowing users to realize the dream of expanding a system without revisiting architecture”.
  6. Also during the 1990s, “CPU clock rate continued to increase”, but “the speed of main memory was not keeping up”, hence “deeper (and more complicated) pipelines, caches, and prediction units”. And anyway, “only a slim fraction of code could actually achieve […] the rate of one cycle per instruction—most code was […] spending three, four, five (or many more) cycles per instruction”.
  7. Two trends — “the rise of concurrency and the futility of increasing clock rate”.  The logical conclusion: “instead of spending transistor budget on ‘faster’ CPUs that weren’t actually yielding much in terms of performance gains (and had terrible costs in terms of power, heat, and area), […] take advantage of the rise of concurrent software and use transistors to effect multiple (simpler) cores per die”.
  8. “There is a perception that microprocessor architects have—out of malice, cowardice, or despair—inflicted concurrency on software. In reality […] it was the maturity of concurrent software that led architects to consider concurrency on the die.” The paper points to “one of the earliest chip multiprocessors” for a detailed discussion of this motivation: “Were software not ready, these microprocessors would not be commercially viable today.”

 

Why you are unlikely to write much explicitly parallel code

“Concurrency is for performance.” The three ways that “concurrent execution can improve performance”:

  1. “reduce latency (that is, make a unit of work execute faster)”, as in climate modeling
  2. “hide latency (that is, allow the system to continue doing work during a long-latency operation)”, as in overlapping local computation with remote communication
  3. “increase throughput (that is, make the system able to perform more work)”, as in queueing independent batch jobs

Reducing latency

  1. “Many important problems are simply not amenable to” parallelization.
  2. And, of those that are, many are “so parallelizable that they do not require the tight coupling of a shared memory—and they are often able to execute more economically on grids of small machines instead of a smaller number of highly concurrent ones.”

Hiding latency

  1. The complexity of “a parallel program can become a substantial […] burden to bear just for improved responsiveness.” “One can often achieve the same effect by employing nonblocking operations (e.g., asynchronous I/O) and an event loop (e.g., [poll()/select() calls]).”
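A minimal sketch of that alternative (not from the paper; NFDS, fds[], and handle_ready() are hypothetical placeholders): nonblocking descriptors serviced by a single-threaded poll()-based event loop, so long-latency operations are overlapped without writing any multithreaded code.

```c
/* Sketch: a single-threaded event loop over nonblocking descriptors. */
#include <fcntl.h>
#include <poll.h>

#define NFDS 2

static void handle_ready(int fd) {
    (void)fd;  /* hypothetical: service whatever is ready without blocking */
}

static int run_event_loop(int fds[NFDS]) {
    struct pollfd pfds[NFDS];

    for (int i = 0; i < NFDS; i++) {
        int flags = fcntl(fds[i], F_GETFL, 0);
        fcntl(fds[i], F_SETFL, flags | O_NONBLOCK);  /* nonblocking I/O */
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }

    for (;;) {
        if (poll(pfds, NFDS, -1) < 0)        /* sleep until some fd is ready */
            return -1;
        for (int i = 0; i < NFDS; i++)
            if (pfds[i].revents & POLLIN)
                handle_ready(pfds[i].fd);    /* purely sequential logic */
    }
}
```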

Increasing throughput

  1. “A system using concurrency to increase throughput need not [contain much] multithreaded code. [The] components of the system that share no state can be left entirely sequential, with the system executing multiple instances of these components concurrently. The sharing in the system can then be offloaded to components explicitly designed around parallel execution on shared state, which can ideally be reduced to those elements already known to operate well in concurrent environments: the database and/or the operating system.”
  2. “To make this concrete, in a typical MVC (model-view-controller) application, the view (typically implemented in environments such as JavaScript, PHP, or Flash) and the controller (typically implemented in environments such as J2EE or Ruby on Rails) can consist purely of sequential logic and still achieve high levels of concurrency, provided that the model (typically implemented in terms of a database) allows for parallelism.  [Hence, ] many have built highly concurrent, highly scalable MVC systems without explicitly creating a single thread or acquiring a single lock; it is concurrency by architecture instead of by implementation.”

 

Practical tips for writing explicitly parallel code

  1. “If you are the one developing the operating system or database [… you] do not need to be warned that writing multithreaded code is hard.”  But contrary to conventional wisdom, it’s feasible to write “scalable and correct multithreaded code” and “to organize and maintain large systems that rely on locking.”
  2. A large “part of the difficulty […] is the scarcity of written wisdom from experienced practitioners: oral tradition in lieu of formal writing has left the domain shrouded in mystery.”

 

Know your cold paths from your hot paths.

Terminology

  • Hot path = “code you want to be able to execute in parallel”
  • Cold path = code that “can execute sequentially without affecting performance”, such as “when initializing, in administrative paths, when unloading, etc.”
  1. “Not only is it a waste of time to make such cold paths execute with a high degree of parallelism, but it is also dangerous: these paths are often among the most difficult and error-prone to parallelize.”
  2. “In cold paths, keep the locking as coarse-grained as possible. Don’t hesitate to have one lock that covers a wide range of rare activity in your subsystem.”
  3. “In hot paths […] locking strategies must be simple and fine-grained, and you must be careful to avoid activity that can become a bottleneck.”
  4. “What if you don’t know [yet whether] a given body of code will be the hot path in the system? [Then assume] that your code is in a cold path and adopt a correspondingly coarse-grained locking strategy—but be prepared to be proven wrong by the data.”
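A minimal sketch of the default strategy in items 2 and 4 above (the names subsys_admin_lock, subsys_init, and subsys_unload are assumed, not from the paper): one coarse lock covering all of a subsystem’s rare, cold-path activity, leaving the hot paths to their own finer-grained strategy.

```c
/* Sketch: one coarse lock covering a subsystem's rare, cold-path activity. */
#include <pthread.h>

static pthread_mutex_t subsys_admin_lock = PTHREAD_MUTEX_INITIALIZER;
static int subsys_initialized;

void subsys_init(void) {                    /* cold path: coarse-grained */
    pthread_mutex_lock(&subsys_admin_lock);
    if (!subsys_initialized) {
        /* ... allocate tables, register handlers, etc. ... */
        subsys_initialized = 1;
    }
    pthread_mutex_unlock(&subsys_admin_lock);
}

void subsys_unload(void) {                  /* another rare path, same lock */
    pthread_mutex_lock(&subsys_admin_lock);
    /* ... tear down state ... */
    subsys_initialized = 0;
    pthread_mutex_unlock(&subsys_admin_lock);
}
```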

 

Intuition is frequently wrong—be data intensive.

  1. “Many scalability problems can be attributed to a hot path [… intuitively] believed […] to be a cold path. [As soon as you have a functional prototype], the time for intuition has ended: your gut must defer to the data.”
  2. “Gathering data on a concurrent system is a tough problem. […] It requires you to put load on the system that resembles the load you expect to see when your system is deployed into production. [And] you must have the infrastructure to be able to dynamically instrument the system”.

Load

  1. “[Loads] are difficult and expensive to simulate in development. […] You must treat load generation and simulation as a first-class problem; the earlier you tackle this problem in your development, the earlier you will be able to get critical data that may have tremendous implications for your software. […] Timeliness is more important than absolute accuracy. […] It is much better to put a multithreaded system under the wrong kind of load than under no load whatsoever.”

Dynamic instrumentation

  1. “Understanding scalability inhibitors on a production system requires the ability to safely dynamically instrument its synchronization primitives. In developing Solaris, our need for this was so […] acute that it led […] us […] to develop [lockstat], [which] became instantly essential [for us …]—[and eventually…]  to further generalize dynamic instrumentation into DTrace, a system for nearly arbitrary dynamic instrumentation of production systems.”
  2. Not just “to find those parts of the system that are inhibiting scalability, but also […] to understand which techniques will be best suited for reducing that contention. […] Before breaking up a lock or re-architecting a subsystem to make it more parallel, we always strive to have the data in hand indicating that the subsystem’s lack of parallelism is a clear inhibitor to system scalability!”

 

Know when—and when not—to break up a lock.

  1. When “data indicates a single hot lock”, don’t blindly “break up the lock into per-CPU locks, a hash table of locks, per-structure locks, etc.” without first considering that: “contention can [often be] more easily reduced by decreasing the hold time of the lock. This can be done by algorithmic improvements (many scalability improvements have been achieved by reducing execution under the lock from quadratic time to linear time!) or by finding activity that is needlessly protected by the lock.”
  2. “Classic example” of needless protection: if “you are spending time […] deallocating elements from a shared data structure, you could dequeue and gather the data that needs to be freed with the lock held, but “defer the actual deallocation of the data until after the lock is dropped”.
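A minimal sketch of that classic example (the names struct node, list_lock, and purge_list are assumed): the doomed nodes are unlinked while the lock is held, but the calls to free() happen only after the lock is dropped, shortening the hold time.

```c
/* Sketch: gather doomed nodes under the lock, free them after dropping it. */
#include <pthread.h>
#include <stdlib.h>

struct node {
    struct node *next;
    /* ... payload ... */
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *list_head;

void purge_list(void) {
    pthread_mutex_lock(&list_lock);
    struct node *doomed = list_head;   /* gather everything to be freed ... */
    list_head = NULL;                  /* ... by unlinking it under the lock */
    pthread_mutex_unlock(&list_lock);

    while (doomed != NULL) {           /* actual deallocation: lock not held */
        struct node *next = doomed->next;
        free(doomed);
        doomed = next;
    }
}
```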

 

Be wary of readers/writer locks.

Scenario: a data structure is frequently read, but is seldom written

  1. (Background – According to Paul Bridger, “A Read/Write Lock is a performance improvement over a standard mutex for cases where reads outnumber writes. The idea is to divide locks up into two classes: reading and writing. Multiple simultaneous read locks may be held, but write locks are exclusively held.  The exclusive writing lock ensures that race conditions do not occur, since if one client is writing to the data no other client may read or write. Also, the allowance for multiple simultaneous read locks decreases resource contention since multiple readers can safely use the shared data. This increases performance over a standard mutex for the assumed usage pattern of frequent simultaneous reads and infrequent writes.”)
  2. Unless “the hold time for the lock is long”, resist the temptation to “replace a […] mutex […] with a readers/writer lock”, because this “may scale worse than having a single lock.”
  3. “Why? Because the state associated with the readers/writer lock must itself be updated atomically, and in the absence of a more sophisticated (and less space-efficient) synchronization primitive, a readers/writer lock will use a single word of memory to store the number of readers. Because the number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction—a read-to-own—as acquiring a mutex, and contention on that line can hurt every bit as much.”
  4. When hold times are long, “(e.g., performing I/O under a lock as reader) […], one should [still] gather data to make sure that [using a readers/writer lock] is having the desired effect on scalability.”
  5. Caution about blocking semantics: “If […] the lock implementation blocks new readers when a writer is blocked (a common paradigm to avoid writer starvation), one cannot recursively acquire a lock as reader: if a writer blocks between the initial acquisition as reader and the recursive acquisition as reader, deadlock will result when the recursive acquisition is blocked.” [Emphasis in original.]
  6. “All of this is not to say that readers/writer locks shouldn’t be used—just that they shouldn’t be romanticized.”
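A minimal sketch of points 3 and 5 above, assuming POSIX readers/writer locks (the names table_lock, table_read, and table_write are hypothetical): even the read path updates the lock word atomically, so a hot read path can contend much like a plain mutex, and a reader must not recursively re-acquire the lock if the implementation blocks new readers behind a waiting writer.

```c
/* Sketch: a read-mostly table behind a POSIX readers/writer lock. */
#include <pthread.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
static int table_value;

int table_read(void) {
    int v;
    /* Shared acquisition: the reader count in the lock word is still updated
     * atomically, so heavy read traffic can contend much like a plain mutex.
     * Never re-acquire the lock as reader while it is held here if the
     * implementation blocks new readers behind a waiting writer (deadlock). */
    pthread_rwlock_rdlock(&table_lock);
    v = table_value;
    pthread_rwlock_unlock(&table_lock);
    return v;
}

void table_write(int v) {
    pthread_rwlock_wrlock(&table_lock);  /* exclusive acquisition for writers */
    table_value = v;
    pthread_rwlock_unlock(&table_lock);
}
```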

 

Know when to broadcast—and when to signal.

Two ways to awaken threads sleeping on a condition variable.

Signal

  • one thread sleeping on the variable is awakened
  • indicates resource availability
  • e.g., Java notify()

Broadcast

  • all threads sleeping on the variable are awakened
  • indicates state change
  • e.g., Java notifyAll()

A bad habit, common in Java-based systems, is to use a broadcast to announce the availability of a resource: every waiting thread wakes up to “fight over the lock protecting the condition variable”, and all but one go straight back to sleep, degrading performance to no purpose.
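A minimal sketch of the distinction using POSIX condition variables (the queue names q_lock, q_nonempty, and q_items are hypothetical): when a single item becomes available, pthread_cond_signal() wakes exactly one consumer; pthread_cond_broadcast() would be appropriate only for a state change that every waiter must re-evaluate.

```c
/* Sketch: a single-item handoff through a condition variable. */
#include <pthread.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
static int q_items;

void produce_one(void) {
    pthread_mutex_lock(&q_lock);
    q_items++;
    pthread_cond_signal(&q_nonempty);   /* one new item: wake exactly one
                                           consumer; a broadcast here would
                                           wake every waiter needlessly */
    pthread_mutex_unlock(&q_lock);
}

void consume_one(void) {
    pthread_mutex_lock(&q_lock);
    while (q_items == 0)                /* always re-check after waking up */
        pthread_cond_wait(&q_nonempty, &q_lock);
    q_items--;
    pthread_mutex_unlock(&q_lock);
}
```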

 

Design your systems to be composable.

Two ways to make lock-based systems completely composable:

  1. “Make locking entirely internal to the subsystem” with “control flow […] never returned to the caller with locks held”. Requires a “crisp interface […] between software components.” Example: “in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system.”
  2. “[Leave] locking up to the client of the subsystem,” which “can be used concurrently by different subsystems and in different contexts.” Requires there “be no global subsystem state—subsystem state must be captured in per-instance state”. It’s “up to consumers of the subsystem to assure that they do not access their instance in parallel”. Example: “the AVL tree implementation used extensively in the Solaris kernel.” It’s “used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized.”
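A minimal sketch of the second approach, with hypothetical names (a toy counter subsystem rather than the Solaris AVL code): the subsystem keeps no global state and does no locking of its own; all state is per-instance, and each consumer serializes access to its own instance.

```c
/* Sketch: no global state, no internal locking; callers serialize access
 * to their own instances. */
#include <stdlib.h>

struct counter {          /* all state is per-instance */
    long value;
};

struct counter *counter_create(void) {
    return calloc(1, sizeof(struct counter));
}

void counter_add(struct counter *c, long n) {
    c->value += n;        /* no internal locking: the caller serializes */
}

long counter_get(const struct counter *c) {
    return c->value;
}
```

Disjoint subsystems can use such a component concurrently with their own instances; the only constraint is that manipulation of any single instance must be serialized by its owner.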

 

Eight additional tips are explained in an extended version of the article in ACM Queue:

  1. Consider per-CPU locking.
  2. Learn to debug postmortem.
  3. Don’t use a semaphore where a mutex would suffice.
  4. Consider memory retiring to implement per-chain hash-table locks.
  5. Be aware of false sharing.
  6. Consider using nonblocking synchronization routines to monitor contention.
  7. When reacquiring locks, consider using generation counts to detect state change.
  8. Use wait- and lock-free structures only if you absolutely must.

More blog posts about Computing.
