
N4215: Towards Implementation and Use of memory_order_consume

Doc. No.: WG21/N4215

Date: 2014-10-05

Reply to: Paul E. McKenney, Torvald Riegel, Jeff Preshing, Hans Boehm, Clark Nelson, and Olivier Giroux

Email: [email protected], [email protected], [email protected]@acm.org, [email protected], and [email protected]

Other contributors: Alec Teal, David Howells, David Lang, George Spelvin, Jeff Law, Joseph S. Myers, Lawrence Crowl, Linus Torvalds, Mark Batty, Michael Matz, Peter Sewell, Peter Zijlstra, Ramana Radhakrishnan, Richard Biener, Will Deacon, ...

October 11, 2014

This document is a revision of WG21/N4036, based on feedback at the 2014 Rapperswil meeting, at the 2014 Redmond SG1 meeting, and on the various email reflectors.

1 Introduction

The most obscure member of the C11 and C++11 memory_order enum seems to be memory_order_consume [27]. The purpose of memory_order_consume is to allow reading threads to correctly traverse linked data structures without the need for locks, atomic instructions, or (with the exception of DEC Alpha) memory-fence instructions, even though new elements are being inserted into these linked structures before, during, and after the traversal. Without memory_order_consume, both the compiler and (again, in the case of DEC Alpha) the CPU would be within their rights to carry out aggressive data-speculation optimizations that would permit readers to see pre-initialization values in the newly added data elements. The purpose of memory_order_consume is to prevent these optimizations.

Of course, memory_order_acquire may be used as a substitute for memory_order_consume; however, doing so results in costly explicit memory-fence instructions (or, where available, load-acquire instructions) on weakly ordered systems such as ARM, Itanium, and PowerPC [9, 3, 12, 13]. These systems enforce dependency ordering in hardware; in other words, if the address used by one memory-reference instruction depends on the value from a preceding load instruction, the hardware forces that earlier load to complete before the later memory-reference instruction commences.1 Similarly, if the data to be stored by a given store instruction depends on the value from a preceding load instruction, the hardware again forces that earlier load to complete before the later store instruction commences. Recent software tools for ARM and PowerPC can help explicate their memory models [19, 1, 2]. Note that strongly ordered systems like x86, IBM mainframe, and SPARC TSO enforce dependency ordering as a side effect of the fact that they do not reorder loads with subsequent memory references. Therefore, memory_order_consume is beneficial on hot code paths, removing the need for hardware ordering instructions for weakly ordered systems and permitting additional compiler optimizations on strongly ordered systems.

1 But please note that hardware can and does take advantage of the as-if rule, just as compilers do.
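The two kinds of hardware dependency described above can be sketched in C11 as follows. This is only an illustration under stated assumptions: the variable and function names (gp, gv, out, read_through(), copy_plus_one()) are hypothetical, and an implementation is free to treat memory_order_consume as memory_order_acquire.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

int *_Atomic gp;	/* Hypothetical pointer published by an updater. */
_Atomic int gv;		/* Hypothetical value published by an updater. */
_Atomic int out;

/* Address dependency: the address of the second load comes from the first. */
int read_through(void)
{
	int *p = atomic_load_explicit(&gp, memory_order_consume);

	return p ? *p : -1;
}

/* Data dependency: the value stored is computed from the loaded value. */
void copy_plus_one(void)
{
	int v = atomic_load_explicit(&gv, memory_order_consume);

	atomic_store_explicit(&out, v + 1, memory_order_relaxed);
}
```

In both functions the later access cannot be issued until the earlier load has produced its value, which is the ordering that weakly ordered hardware provides without extra instructions.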

When implementing concurrent insertion-only data structures, a few of which are found in the Linux kernel, memory_order_consume is all that is required. However, most data structures also require removal of data elements. Such removal requires that the thread removing the data element wait for all readers to release their references to it before reclaiming that element. The traditional way to do this is via garbage collectors (GCs), which have been available for more than half a century [15] and which are now available even for C and C++ [4]. Another way to wait for readers is to use read-copy update (RCU) [23, 21], which explicitly marks read-side regions of code and provides primitives that wait for all pre-existing readers to complete. RCU is gaining significant use both within the Linux kernel [16] and outside of it [6, 5, 8, 14, 28].

Despite the growing number of memory_order_consume use cases, there are no known high-performance implementations of memory_order_consume loads in any C11 or C++11 environments. This situation suggests that some change is in order: After all, if implementations do not support the standard’s memory_order_consume facility, users can be expected to continue to exploit whatever implementation-specific facilities allow them to get their jobs done. This document therefore provides a brief overview of RCU in Section 2 and surveys memory_order_consume use cases within the Linux kernel in Section 3. Section 4 looks at how dependency ordering is currently supported in pre-C11 implementations, and then Section 5 looks at possible ways to support those use cases in existing C11 and C++11 implementations, followed by some thoughts on incremental paths towards official support of these use cases in the standards. Section 6 lists some weaknesses in the current C11 and C++11 specification of dependency ordering, and finally Section 7 outlines a few possible alternative dependency-ordering specifications.

Note: SC22/WG14 liaison issue.

2 Introduction to RCU

The RCU synchronization mechanism is often used as a replacement for reader-writer locking because RCU avoids the high-overhead cache thrashing that is characteristic of many common reader-writer-locking implementations. RCU is based on three fundamental concepts:

1. Light-weight in-memory publish-subscribe operation.

2. Operation that waits for pre-existing readers.

3. Maintaining multiple versions of data to avoid disrupting old readers that are still referencing old versions.

These three concepts taken together allow readers and updaters to make forward progress concurrently.

We would like to use C11’s and C++11’s memory_order_consume to implement RCU’s lightweight subscribe operation, rcu_dereference(). We assume that rcu_dereference() is a good example of how developers would exploit the dependency-ordering feature of weakly ordered systems, so we look to rcu_dereference() as an indication of the semantics that memory_order_consume should have.
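As a rough illustration of this mapping, the following C11 sketch expresses the publish and subscribe operations in terms of memory_order_release and memory_order_consume. The macros my_rcu_assign_pointer() and my_rcu_dereference(), and the functions built on them, are hypothetical stand-ins rather than the Linux-kernel definitions, and the sketch assumes that the protected pointer is declared _Atomic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
};

/* Hypothetical stand-ins for the Linux-kernel primitives. */
#define my_rcu_assign_pointer(p, v) \
	atomic_store_explicit(&(p), (v), memory_order_release)
#define my_rcu_dereference(p) \
	atomic_load_explicit(&(p), memory_order_consume)

struct foo *_Atomic gp;	/* RCU-protected pointer, initially NULL. */

/* Updater: initialize the element fully, then publish it. */
void publish(struct foo *p, int a)
{
	p->a = a;
	my_rcu_assign_pointer(gp, p);
}

/* Reader: subscribe, then dereference through the dependency chain. */
int subscribe(void)
{
	struct foo *p = my_rcu_dereference(gp);

	return p ? p->a : -1;
}
```

A reader that observes the new pointer is thereby guaranteed to observe the initialization of ->a that preceded the release store.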

In one typical RCU use case, updaters publish new versions of a data structure while readers concurrently subscribe to whatever version is current at the time a given reader starts. Once all pre-existing readers complete, old versions can be reclaimed. This sort of use case may be a bit unfamiliar to many, but it is extremely effective in many situations, offering excellent performance, scalability, real-time latency, deadlock avoidance, and read-side composability. More details on RCU are readily available [8, 17, 18, 20, 21, 22, 24].
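The update side of this use case might be sketched as below, assuming a single updater (or an update-side lock held by the caller). The names struct config, cur_config, update_timeout(), and read_timeout() are hypothetical. The sketch deliberately stops short of reclamation: it returns the old version, and the caller must wait for all pre-existing readers by some means not shown here before freeing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct config {
	int timeout;
};

struct config *_Atomic cur_config;	/* Current version, seen by readers. */

/*
 * Publish a new version with the given timeout and return the old one.
 * The old version must not be freed until all pre-existing readers
 * are done with it; that wait is outside the scope of this sketch.
 */
struct config *update_timeout(int timeout)
{
	struct config *oldp = atomic_load_explicit(&cur_config,
						   memory_order_relaxed);
	struct config *newp = malloc(sizeof(*newp));

	if (!newp)
		abort();
	if (oldp)
		*newp = *oldp;		/* Copy the old version, */
	newp->timeout = timeout;	/* update the copy, */
	atomic_store_explicit(&cur_config, newp,
			      memory_order_release);	/* and publish it. */
	return oldp;
}

/* Reader: sees either the old version or the new one, never a mixture. */
int read_timeout(void)
{
	struct config *p = atomic_load_explicit(&cur_config,
						memory_order_consume);

	return p ? p->timeout : -1;
}
```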

Figure 1 shows the growth of RCU usage over time within the Linux kernel, which is strong evidence of RCU’s effectiveness. However, RCU is a specialized mechanism, so its use is much smaller than general-purpose techniques such as locking, as can be seen in Figure 2. It is unlikely that RCU’s usage will ever approach that of locking because RCU coordinates only between readers and updaters, which means that some other mechanism is required to coordinate among concurrent updates. In the Linux kernel, that update-side mechanism is normally locking, although pretty much any synchronization mechanism may be used, including transactional memory [10, 11, 26].

Figure 1: Growth of RCU Usage (plot of # RCU API Uses versus Year)

However, RCU is now being used in many situations where reader-writer locking would be used. Figure 3 shows that the use of reader-writer locking has changed little since RCU was introduced. This data suggests that RCU is at least as important to parallel software as is reader-writer locking.

In more recent years, a user-level library implementation of RCU has been available [7]. This library is now available for many platforms and has been included in a number of Linux distributions. It has been pressed into service for a number of open-source software projects, proprietary products, and research efforts.

Full and fully performant C11/C++11 support for memory_order_consume is therefore quite important. However, good progress can often be made in the short term by focusing on the cases that are commonly used in practice rather than on the general case. The next section therefore takes a rough census of the Linux kernel’s use of the rcu_dereference() family of primitives, which memory_order_consume is intended to implement.

Figure 2: Growth of RCU Usage vs. Locking (plot of # RCU/locking API Uses versus Year; series: locking, RCU, rwlock)

3 Linux-Kernel Use Cases

Section 3.1 lists types of dependency chains in the Linux kernel, Section 3.2 lists operators used within these dependency chains, Section 3.3 lists operators that are considered to terminate dependency chains, Section 3.4 lists operators that often act as the last link in a dependency chain, and finally Section 3.5 surveys a longer-than-average (but by no means maximal) dependency chain that appears in the Linux kernel.

It is worth reviewing the relationship between memory_order_acquire and memory_order_consume loads, both of which interact with memory_order_release stores.

A memory_order_acquire load is said to synchronize with a memory_order_release store if that load returns the value stored or, in some special cases, some later value [27, 1.10p6-1.10p8]. When a memory_order_acquire load synchronizes with a memory_order_release store, any memory reference preceding the memory_order_release store will happen before any memory reference following the memory_order_acquire load [27, 1.10p11-1.10p12]. This property allows a linked structure to be locklessly traversed by using memory_order_release stores when updating pointers to reference new data elements and by using memory_order_acquire loads when loading pointers while locklessly traversing the data structure, as shown in Figure 4.

Figure 3: Growth of RCU Usage vs. Reader-Writer Locking (plot of # RCU/rwlocking API Uses versus Year; series: RCU, rwlock)

Unfortunately, a memory_order_acquire load requires expensive special load instructions or memory-fence instructions on weakly ordered systems such as ARM, Itanium, and PowerPC. Furthermore, in traverse(), the address of each memory_order_acquire load within the while loop depends on the value of the previous memory_order_acquire load.2 Therefore, in this case, weakly ordered systems don’t really need the special load instructions or the memory-fence instructions, as these systems can instead rely on the hardware-enforced dependency ordering.

2 The initial load on line 16 might well depend on an earlier load, but for simplicity, this example assumes that the initial foo_head structure is statically allocated, and thus not subject to updates.

     1 void new_element(struct foo **pp, int a)
     2 {
     3   struct foo *p = malloc(sizeof(*p));
     4
     5   if (!p)
     6     abort();
     7   p->a = a;
     8   atomic_store_explicit(pp, p, memory_order_release);
     9 }
    10
    11 int traverse(struct foo_head *ph)
    12 {
    13   int a = -1;
    14   struct foo *p;
    15
    16   p = atomic_load_explicit(&ph->h, memory_order_acquire);
    17   while (p != NULL) {
    18     a = p->a;
    19     p = atomic_load_explicit(&p->n, memory_order_acquire);
    20   }
    21   return a;
    22 }

Figure 4: Release/Acquire Linked Structure Traversal

This is the use case for memory_order_consume, which can be substituted for memory_order_acquire in cases where hardware dependency ordering applies. One such case is the preceding example, and Figure 5 shows that same example recast in terms of memory_order_consume. A memory_order_release store is dependency ordered before a memory_order_consume load when that load returns the value stored, or in some special cases, some later value [27, 1.10p1]. Then, if the load carries a dependency to some later memory reference [27, 1.10p9], any memory reference preceding the memory_order_release store will happen before that later memory reference [27, 1.10p9-1.10p12]. This means that when there is dependency ordering, memory_order_consume gives the same guarantees that memory_order_acquire does, but at lower cost.

On the other hand, memory_order_consume requires the compiler to track the carries-a-dependency relationships, with the set of such relationships headed by a given memory_order_consume load being called that load’s dependency chains. It is quite possible that the complexity of implementing this capability has thus far prevented high-quality memory_order_consume implementations from appearing. It is therefore worthwhile to review use of dependency chains in practice in order to determine what types of operations typically appear in dependency chains, which might result in guidance to implementations or perhaps even modifications to the definition of memory_order_consume.

    int traverse(struct foo_head *ph)
    {
      int a = -1;
      struct foo *p;

      p = atomic_load_explicit(&ph->h, memory_order_consume);
      while (p != NULL) {
        a = p->a;
        p = atomic_load_explicit(&p->n, memory_order_consume);
      }
      return a;
    }

Figure 5: Release/Consume Linked Structure Traversal
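For readers who wish to compile the release/consume traversal of Figure 5, a self-contained variant might look as follows. The layouts of struct foo and struct foo_head are assumptions, since the figures leave these types undefined, and the pointers are declared _Atomic so that the C11 atomic operations are well-formed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Assumed layouts; the figures do not define these types. */
struct foo {
	int a;
	struct foo *_Atomic n;	/* Next element, or NULL. */
};

struct foo_head {
	struct foo *_Atomic h;	/* First element, or NULL. */
};

/* Initialize a new element, then publish it at *pp with a release store. */
void new_element(struct foo *_Atomic *pp, int a)
{
	struct foo *p = malloc(sizeof(*p));

	if (!p)
		abort();
	p->a = a;
	atomic_init(&p->n, NULL);
	atomic_store_explicit(pp, p, memory_order_release);
}

/* Return the ->a field of the last element, or -1 for an empty list. */
int traverse(struct foo_head *ph)
{
	int a = -1;
	struct foo *p;

	p = atomic_load_explicit(&ph->h, memory_order_consume);
	while (p != NULL) {
		a = p->a;	/* Ordered after the load of p by the dependency. */
		p = atomic_load_explicit(&p->n, memory_order_consume);
	}
	return a;
}
```

Each loop iteration starts a fresh dependency chain at the consume load of the next pointer, so the fetch of ->a always depends on the load that produced p.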

3.1 Types of Linux-Kernel Dependency Chains

One goal for memory_order_consume is to implement rcu_dereference(), which heads a Linux-kernel dependency-ordering tree. There are a number of variants of rcu_dereference() in the Linux kernel in order to implement the four flavors of RCU and also to enable RCU usage diagnostics for code that is shared by readers and updaters. These variants are rcu_dereference(), rcu_dereference_bh(), rcu_dereference_bh_check(), rcu_dereference_check(), rcu_dereference_index_check(), rcu_dereference_protected(), rcu_dereference_raw(), rcu_dereference_sched(), rcu_dereference_sched_check(), srcu_dereference(), and srcu_dereference_check(). Taken together, there are about 1300 uses of these functions in version 3.13 of the Linux kernel. However, about 250 of those are rcu_dereference_protected(), which is used only in update-side code and thus does not head up read-side dependency chains, which leaves about 1000 uses to be inspected for dependency-ordering usage.

3.2 Operators in Linux-Kernel Dependency Chains

A surprisingly small fraction of the possible C operators appear in dependency chains in the Linux kernel, namely ->, infix =, casts, prefix &, prefix *, [], infix +, infix -, ternary ?:, and infix (bitwise) &.

By far the two most common operators are the -> pointer field selector and the = assignment operator. Enabling the carries-dependency relationship through only these two operators would likely cover better than 90% of the Linux-kernel use cases.

Casts, the prefix * indirection operator, and the prefix & address-of operator are used to implement Linux’s list primitives, which translate from list pointers embedded in a structure to the structure itself. These operators are also used to get some of the effects of C++ subtyping in the C language.
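The translation from an embedded list pointer back to the enclosing structure can be sketched as below. The macro my_container_of() and the types are hypothetical stand-ins for the Linux-kernel list primitives; this variant is written with a cast and offsetof() rather than copied from the kernel, and the list is singly linked to keep the sketch short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal stand-in for an embedded list pointer. */
struct list_node {
	struct list_node *_Atomic next;
};

struct item {
	int key;
	struct list_node link;	/* List pointer embedded in the structure. */
};

/* Translate a pointer to an embedded member into a pointer to its container. */
#define my_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Return the key of the first item on the list, or -1 if the list is empty. */
int first_key(struct list_node *head)
{
	struct list_node *n = atomic_load_explicit(&head->next,
						   memory_order_consume);

	if (n == NULL)
		return -1;
	/* The cast and subtraction must carry the dependency from n to ->key. */
	return my_container_of(n, struct item, link)->key;
}
```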

The [] array-indexing operator and the infix + and - arithmetic operators are used to manipulate RCU-protected arrays, as well as to index into arrays contained within RCU-protected structures. RCU-protected arrays are becoming less common because they are being converted into more complex data structures, such as trees. However, RCU-protected structures containing arrays are still fairly common.
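An array contained within an RCU-protected structure might be read as in the following sketch. The names struct table, cur_table, and read_slot() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 4

struct table {
	int slot[NSLOTS];	/* Array contained within a protected structure. */
};

struct table *_Atomic cur_table;

/* Return slot[i] of the current table, or -1 if none is published. */
int read_slot(size_t i)
{
	struct table *t = atomic_load_explicit(&cur_table,
					       memory_order_consume);

	if (t == NULL || i >= NSLOTS)
		return -1;
	return t->slot[i];	/* Address is t plus an offset, so it depends on t. */
}
```

Here the -> and [] operators together form the address from the loaded pointer, which is why both need to carry the dependency.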

The ternary ?: if-then-else expression is used to handle default values for RCU-protected pointers, for example, as shown in Figure 6, or in C++11 form in Figure 7. Note that the dependency is carried only through the rightmost two operands of ?:, never through the leftmost one.


    struct foo {
      int a;
    };
    struct foo *fp;
    struct foo default_foo;

    int bar(void)
    {
      struct foo *p;

      p = rcu_dereference(fp);
      return p ? p->a : default_foo.a;
    }

Figure 6: Default Value For RCU-Protected Pointer, Linux Kernel

    #include <atomic>

    struct foo {
      int a;
    };
    std::atomic<foo *> fp;
    foo default_foo;

    int bar(void)
    {
      foo *p;

      p = fp.load(std::memory_order_consume);
      return p ? std::kill_dependency(p->a) : default_foo.a;
    }

Figure 7: Default Value For RCU-Protected Pointer, C++11

The infix & operator is used to mask low-order bits from RCU pointers. These bits are used by some algorithms as markers. Such markers, though not common in the Linux kernel, are well-known in the art, with hazard pointers being but one example [25]. Note that both operands of infix & are expected to have some non-zero bits, because otherwise a NULL pointer will result, and NULL pointers cannot reasonably be said to carry much of anything, let alone a dependency. Although I did not find any infix | operators in my census of Linux-kernel dependency chains, symmetry considerations argue for also including it, for example, for read-side pointer tagging. Presumably both of the operands of infix | must have at least one zero bit.
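Masking a marker bit out of a published pointer might look like the sketch below. The names (TAG_MASK, tagged, publish_tagged(), read_val()) are hypothetical, the sketch assumes that nodes are at least 2-byte aligned so that the low-order bit is free, and it stores the pointer as a uintptr_t. Whether a given implementation carries the dependency through the integer conversions is a separate question that this sketch does not settle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)1)	/* Low-order bit used as a marker (assumed). */

struct node {
	int val;
};

_Atomic uintptr_t tagged;	/* Node pointer, possibly with the low bit set. */

/* Publish p, optionally marked, using infix | to set the marker bit. */
void publish_tagged(struct node *p, int marked)
{
	atomic_store_explicit(&tagged,
			      (uintptr_t)p | (marked ? TAG_MASK : 0),
			      memory_order_release);
}

/* Strip the marker with infix & and dereference; *marked receives the tag. */
int read_val(int *marked)
{
	uintptr_t v = atomic_load_explicit(&tagged, memory_order_consume);
	struct node *p = (struct node *)(v & ~TAG_MASK);

	*marked = (int)(v & TAG_MASK);
	return p ? p->val : -1;
}
```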

To recap, the operators appearing in Linux-kernel dependency chains are: ->, infix =, casts, prefix &, prefix *, [], infix +, infix -, ternary ?:, infix (bitwise) &, and probably also |.

3.3 Operators Terminating Linux-Kernel Dependency Chains

Although C++11 has the kill_dependency() function to terminate a dependency chain, no such function exists in the Linux kernel. Instead, Linux-kernel dependency chains are judged to have terminated upon exit from the outermost RCU read-side critical section,3 when existence guarantees are handed off from RCU to some other synchronization mechanism (usually locking or reference counting), or when the variable carrying the dependency goes out of scope.

That said, it is possible to analyze Linux-kernel dependency chains to see what part of the chain is actually required by the algorithm in question. We can therefore define the essential subset of a dependency chain to be that subset within which ordering is required by the algorithm. In the 3.13 version of the Linux kernel, the following operators always mark the end of the essential subset of a dependency chain: (), !, ==, !=, &&, ||, infix *, /, and %.

3 The beginning of a given RCU read-side critical section is marked with rcu_read_lock(), rcu_read_lock_bh(), rcu_read_lock_sched(), or srcu_read_lock(), and the end by the corresponding primitive from the list rcu_read_unlock(), rcu_read_unlock_bh(), rcu_read_unlock_sched(), or srcu_read_unlock(). There is currently no C++11 counterpart for an RCU read-side critical section.

The postfix () function-invocation operator is an interesting special case in the Linux kernel. In theory, RCU could be used to protect JITed function bodies, but in current practice RCU is instead used to wait for all pre-existing callers to the function referenced by the previous pointer. The functions are all compiled into the kernel, and the dependency chains are therefore irrelevant to the () operator. Hence, in version 3.13 of the Linux kernel, the () operator marks the end of the essential subset of any dependency chain that it resides in.

The !, ==, !=, &&, and || operators are used exclusively in “if” statements to make control-flow decisions, and therefore also mark the end of the essential subset of any dependency chains that they reside in. In theory, these relational and boolean operators could be used to form array indexes, but in practice the Linux kernel does not yet do this in RCU dependency chains. The other relational operators (>, <, >=, and <=) should probably also be added to this list.

The infix *, /, and % arithmetic operators could potentially be used to construct array addresses, but they are not yet used that way in the Linux kernel. Instead, they are used to do computation on values fetched as the last operation in an essential subset of a dependency chain.

In short, in the current Linux kernel, (), !, ==, !=, &&, ||, infix *, /, and % all mark the end of the essential subset of a dependency chain. That said, there is potential for them to be used as part of the essential subset of dependency chains in future versions of the Linux kernel. And the same is of course true of the remaining C-language operators, which did not appear within any of the dependency chains in version 3.13 of the Linux kernel.

3.4 Operators Acting as Last Link in Linux-Kernel Dependency Chains

Although the -> operator is frequently used as part of a Linux-kernel dependency chain, it often is intended to be the last link in that chain. Therefore, the use cases for the -> operator deserve special mention.

The first use case involves fetching non-pointer data from an RCU-protected data structure. For example, in the DRBD subsystem in Linux, -> is used to fetch a timeout value. This code requires that dependency ordering apply to this fetch, but it does not require a dependency chain extending beyond that point. This sort of case would require a kill_dependency() for implementations based on the C++11 and C11 standards.
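In C11 terms, this first use case might be written as in the sketch below, using the kill_dependency() macro from <stdatomic.h>. The structure and function names (struct conn, cur_conn, conn_timeout()) are hypothetical and are not taken from the kernel code mentioned above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct conn {
	int timeout;
};

struct conn *_Atomic cur_conn;

/* Fetch a non-pointer value; the chain need not extend past the -> fetch. */
int conn_timeout(void)
{
	struct conn *p = atomic_load_explicit(&cur_conn, memory_order_consume);

	if (p == NULL)
		return 0;
	/* Dependency ordering covers the fetch of p->timeout, then ends. */
	return kill_dependency(p->timeout);
}
```

Terminating the chain at the fetch tells the implementation that callers doing arithmetic on the returned timeout need no further dependency tracking.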

The second use case involves linked data structures where an RCU update might be applied on any pointer in the chain, for example, the standard Linux-kernel linked list. The -> operator provides dependency ordering for the fetch of the ->next pointer, but that fetch must itself be a memory_order_consume load in order to provide the required dependency ordering for the fields in the next structure in the list. Thus, a linked-list traversal consists of a series of back-to-back non-overlapping dependency chains.

These two use cases raise the question of whether a dependency chain can continue beyond a -> operator. The answer is “yes,” and this occurs when a linked structure is made visible to RCU readers as a unit. For example, consider a linked list where each list element links to a constant binary search tree. If this tree is in place when the element is added to the list, then a memory_order_consume load is needed only when fetching the pointer to the element. The dependency chain headed by this fetch suffices to order accesses to the binary search tree.
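This third use case can be sketched as follows, simplified to a single published element rather than a full list. The type and function names are hypothetical. Only the load of the element pointer uses memory_order_consume; the tree below it is assumed to be fully built before publication and never modified afterwards, so plain loads within the same dependency chain suffice.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct tree {
	int key;
	const struct tree *left, *right;	/* Constant after publication. */
};

struct elem {
	const struct tree *root;	/* Tree is in place before elem is published. */
};

struct elem *_Atomic cur_elem;

/* Search the tree hanging off the current element; return 1 if key is found. */
int lookup(int key)
{
	/* The only consume load: it heads the chain for everything below. */
	struct elem *e = atomic_load_explicit(&cur_elem, memory_order_consume);
	const struct tree *t = e ? e->root : NULL;

	while (t != NULL) {
		if (key == t->key)
			return 1;
		t = key < t->key ? t->left : t->right;	/* Plain loads. */
	}
	return 0;
}
```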

These cases need to be differentiated. The third use case appears to be the least frequent, which suggests that the -> operator (or a sequence of -> operators) always be the last link of a dependency chain.

3.5 Linux-Kernel Dependency Chain Length

Many Linux-kernel dependency chains are very short and contained, with a fair number living within the confines of a single C statement. If there were only a few short dependency chains in the Linux kernel, one could imagine decorating all the operators in each dependency chain, for example, replacing the -> operator with something like the mythical field_dep() operator shown on lines 16, 19, and 20 of Figure 8.

    11 int traverse(struct foo_head *ph)
    12 {
    13   int a = -1;
    14   struct foo *p;
    15
    16   p = atomic_load_explicit(&field_dep(ph, h),
    17                            memory_order_consume);
    18   while (p != NULL) {
    19     a = field_dep(p, a);
    20     p = atomic_load_explicit(&field_dep(p, n),
    21                              memory_order_consume);
    22   }
    23   return a;
    24 }

Figure 8: Decorated Linked Structure Traversal

However, there are a great many dependency chains that extend across multiple functions. One relatively modest example is in the Linux network stack, in the arp_process() function. This dependency chain extends as follows, with deeper nesting indicating deeper function-call levels:

• The arp_process() function invokes in_dev_get_rcu(), which returns an RCU-protected pointer. The head of the dependency chain is therefore within the in_dev_get_rcu() function.

• The arp_process() function invokes the following macros and functions:

  – IN_DEV_ROUTE_LOCALNET(), which expands to the ipv4_devconf_get() function.

  – arp_ignore(), which in turn calls:

    ∗ IN_DEV_ARP_IGNORE(), which expands to the ipv4_devconf_get() function.

    ∗ inet_confirm_addr(), which calls:

      · dev_net(), which in turn calls read_pnet().

  – IN_DEV_ARPFILTER(), which expands to ipv4_devconf_get().

  – IN_DEV_CONF_GET(), which also expands to ipv4_devconf_get().

  – arp_fwd_proxy(), which calls:

    ∗ IN_DEV_PROXY_ARP(), which expands to ipv4_devconf_get().

    ∗ IN_DEV_MEDIUM_ID(), which also expands to ipv4_devconf_get().

  – arp_fwd_pvlan(), which calls:

    ∗ IN_DEV_PROXY_ARP_PVLAN(), which expands to ipv4_devconf_get().

  – pneigh_enqueue().

Again, although a great many dependency chains in the Linux kernel are quite short, there are quite a few that spread both widely and deeply. We therefore cannot expect Linux kernel hackers to look fondly on any mechanism that requires them to decorate each and every operator in each and every dependency chain as was shown in Figure 8. In fact, even kill_dependency() will likely be an extremely difficult sell.

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do not have any formal notion of dependency ordering, but these implementations are nevertheless used to build the Linux kernel, and most likely all other software using RCU. This section lays out a few straightforward rules for both implementers (Section 4.2) and users of these pre-C11 C-language implementations (Section 4.1).

4.1 Rules for C-Language RCU Users

The rules for C-language RCU users have evolved over time, so this section will present them in reverse chronological order.


4.1.1 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler determine the value of any variable in any dependency chain. This primary rule implies a number of secondary rules:

1. Use only intrinsic operators on basic types. If you are making use of C++ template metaprogramming or operator overloading, more elaborate rules apply, and those rules are outside the scope of this document.

2. Use a volatile load to head the dependency chain. This is necessary to avoid the compiler repeating the load or making use of (possibly erroneous) prior knowledge of the contents of the memory location, each of which can break dependency chains.

3. Avoid use of single-element RCU-protected arrays. The compiler is within its rights to assume that the value of an index into such an array must necessarily evaluate to zero. The compiler could then substitute the constant zero for the computation, breaking the dependency chain and introducing misordering.

4. Avoid cancellation when using the + and - infix arithmetic operators. For example, for a given variable x, avoid (x - x). The compiler is within its rights to substitute zero for any such cancellation, breaking the dependency chain and again introducing misordering. Similar arithmetic pitfalls must be avoided if the infix *, /, or % operators appear in the essential subset of a dependency chain.

5. Avoid all-zero operands to the bitwise & operator, and similarly avoid all-ones operands to the bitwise | operator. If the compiler is able to deduce the value of such operands, it is within its rights to substitute the corresponding constant for the bitwise operation. Once again, this breaks the dependency chain, introducing misordering.

6. If you are using RCU to protect JITed functions, so that the () function-invocation operator is a member of the essential subset of the dependency tree, you may need to interact directly with the hardware to flush instruction caches. This issue arises on some systems when a newly JITed function is using the same memory that was used by an earlier JITed function.

7. Do not use the boolean && and || operators in essential dependency chains. The reason for this prohibition is that they are often compiled using branches. Weak-memory machines such as ARM or PowerPC order stores after such branches, but can speculate loads, which can break data dependency chains.

8. Do not use relational operators (==, !=, >, >=, <, or <=) in the essential subset of a dependency chain. The reason for this prohibition is that, as for boolean operators, relational operators are often compiled using branches. Weak-memory machines such as ARM or PowerPC order stores after such branches, but can speculate loads, which can break dependency chains.

9. Be very careful about comparing pointers in the essential subset of a dependency chain. As Linus Torvalds explained, if the two pointers are equal, the compiler could substitute the pointer you are comparing against for the pointer in the essential subset of the dependency chain. On ARM and Power hardware, it might be that only the original value carried a hardware dependency, so this substitution would break the chain, in turn permitting misordering. Such comparisons are OK in the following cases:

   (a) The pointer being compared against references memory that was initialized at boot time, or otherwise long enough ago that readers cannot still have pre-initialized data cached. Examples include module-init time for module code, before kthread creation for code running in a kthread, while the update-side lock is held, and so on.

   (b) The pointer is never dereferenced after being compared. This exception applies when comparing against the NULL pointer or when scanning RCU-protected circular linked lists.

   (c) The pointer being compared against is part of the essential subset of a dependency chain. This can be a different dependency chain, but only as long as that chain stems from a pointer that was modified after any initialization of interest. This exception can apply when carrying out RCU-protected traversals from different entry points that converged on the same data structure.

   (d) The pointer being compared against is fetched using rcu_access_pointer() and all subsequent dereferences are stores.

   (e) The pointers compared not-equal and the compiler does not have enough information to deduce the value of the pointer. (For example, if the compiler can see that the pointer will only ever take on one of two values, then it will be able to deduce the exact value based on a not-equals comparison.)

10. Disable any value-speculation optimizations that your compiler might provide, especially if you are making use of feedback-based optimizations that take data collected from prior runs.
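Rule 2 can be illustrated with the pre-C11 idiom of a volatile cast, sketched below. The macro load_head() and the other names are hypothetical, __typeof__ is a GCC/Clang extension rather than standard C, and on DEC Alpha a read-side barrier would be needed in addition. This sketch shows only how the chain is headed; it does nothing to enforce the remaining rules.

```c
#include <assert.h>
#include <stddef.h>

struct foo {
	int a;
};

struct foo *gp;	/* Plain (non-_Atomic) pointer, as in pre-C11 code. */

/*
 * Head the dependency chain with a volatile load (rule 2), so that the
 * compiler neither repeats the load nor reuses an earlier value.
 * __typeof__ is a GCC/Clang extension.
 */
#define load_head(p) (*(volatile __typeof__(p) *)&(p))

int reader(void)
{
	struct foo *p = load_head(gp);	/* Loaded exactly once. */

	return p ? p->a : -1;
}
```

By contrast, an expression such as p + (x - x) would invite the cancellation described in rule 4, because the compiler may replace the subtraction with the constant zero.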

4.1.2 Rules for 2003 GCC Implementations

Prior to the 2.6.9 version of the Linux kernel, there was neither rcu_dereference() nor rcu_assign_pointer(). Instead, explicit memory barriers were used, smp_read_barrier_depends() by readers and smp_wmb() by updaters. For example, the code shown for current Linux kernels in Figure 6 would be as shown in Figure 9 for 2.6.8 and earlier versions of the Linux kernel. A similar transformation relates the older use of smp_wmb() and the more recent use of rcu_assign_pointer().

    struct foo {
      int a;
    };
    struct foo *fp;
    struct foo default_foo;

    int bar(void)
    {
      struct foo *p;

      p = fp;
      smp_read_barrier_depends();
      return p ? p->a : default_foo.a;
    }

Figure 9: Default Value For RCU-Protected Pointer, Old Linux Kernel

This older API was clearly much more vulnerable to compiler optimizations than is the current API, but the real motivation for this change was readability and maintainability, as can be seen from the commit log for the mid-2004 patch introducing rcu_dereference():

This patch introduced an rcu_dereference() macro that replaces most uses of smp_read_barrier_depends(). The new macro has the advantage of explicitly documenting which pointers are protected by RCU – in contrast, it is sometimes difficult to figure out which pointer is being protected by a given smp_read_barrier_depends() call.

The commit log for the mid-2004 patch introducing rcu_assign_pointer() justifies the change in terms of eliminating hard-to-use explicit memory barriers:

Attached is a patch that adds an rcu_assign_pointer() that allows a number of explicit smp_wmb() memory barriers to be dispensed with, improving readability.

The importance of suppressing compiler optimizations did not become apparent until much later. In fact, a volatile cast was not added to the implementation of rcu_dereference() until 2.6.24 in early 2008.

4.1.3 Rules for 1990s Sequent C Implementations

1990s systems featured far slower CPUs and much less memory than is commonly provisioned today, and the compilers were correspondingly less sophisticated. Therefore, at that time, a simple C-language field selector was used instead of any sort of rcu_dereference() or memory_order_consume operation. Not only was there no volatile cast, there also was nothing resembling smp_read_barrier_depends(). The lack of smp_read_barrier_depends() is not too surprising, given that DYNIX/ptx did not run on DEC Alpha.

     1 void new_element(struct foo **pp, int a)
     2 {
     3   struct foo *p = malloc(sizeof(*p));
     4
     5   if (!p)
     6     abort();
     7   p->a = a;
     8   rcu_assign_pointer(pp, p);
     9 }
    10
    11 int traverse(struct foo_head *ph)
    12 {
    13   int a = -1;
    14   struct foo *p;
    15
    16   p = rcu_dereference(&ph->h);
    17   while (p != NULL) {
    18     if (p == (struct foo *)0xbadfab1e)
    19       a = ((struct foo *)0xbadfab1e)->a;
    20     else
    21       a = p->a;
    22     p = rcu_dereference(&p->n);
    23   }
    24   return a;
    25 }

Figure 10: Dangerous Optimizations: Hardware Branch Predictions

This approach was nevertheless quite reliable because the use cases within the DYNIX/ptx kernel were straightforward, and provided little or no opportunity for optimizations that might break dependency chains.

4.2 Rules for C-Language Implementers

The main rule for C-language implementers is to avoid any sort of value speculation, or, at the very least, provide means for the user to disable such speculation. An example of a value-speculation optimization that can be carried out with the help of hardware branch prediction is shown in Figure 10, which is an optimized version of the code in Figure 5. This sort of transformation might result from feedback-directed optimization, where profiling runs determined that the value loaded from ph was almost always 0xbadfab1e. Although this transformation is correct in a single-threaded environment, in a concurrent environment, nothing stops the compiler or the CPU from speculating the load on line 19 before it executes the rcu_dereference() on line 16, which could result in line 19 executing before the corresponding store on line 7, resulting in a garbage value in variable a.4

There are some situations where this sort of optimization would be safe, including:

1. The value speculated is a numeric value rather than a pointer, so that if the guess proves correct after the fact, the computation will be appropriate after the fact.

2. The value speculated is a pointer to invariant data, so that reasonable values are produced by dereferencing, even if the guess proves to have been correct only after the fact.

3. As above, but where any updates result in data that produces appropriate computations at any and all phases of the update.

However, this list does not contain the general case of memory_order_consume loads.
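The first of these safe situations can be sketched in C as follows. This is an illustration only: the function names and the assumption that profiling found the value 7 to be common are hypothetical. Because the speculated quantity is a plain number, the speculated path computes exactly what the ordinary path would have computed whenever the guess is correct.

```c
#include <assert.h>

/* Straightforward computation on a numeric value. */
int scale_plain(int x)
{
    return x * 3 + 1;
}

/* Value-speculated variant: a hypothetical profile says x is usually 7,
 * so the fast path computes from the guess rather than from x itself. */
int scale_speculated(int x)
{
    if (x == 7)
        return 7 * 3 + 1;   /* computed from the guess */
    return x * 3 + 1;       /* computed from the actual value */
}

/* Self-check: both variants agree on every input in a small range. */
void check_equivalence(void)
{
    for (int x = -100; x <= 100; x++)
        assert(scale_plain(x) == scale_speculated(x));
}
```

No memory is dereferenced through the guess, so there is no pre-initialization data for the speculation to expose, which is what distinguishes this case from line 19 of Figure 10.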

Pure hardware implementations of value speculation can avoid this problem because they monitor cache-coherence protocol events that would result from some other CPU invalidating the guess.

In short, compiler writers must provide means to disable all forms of value speculation, unless the speculation is accompanied by some means of detecting the race condition that Figure 10 is subject to.

Are there other dependency-breaking optimizations that should be called out separately?

4 Kudos to Olivier Giroux for pointing out use of branch prediction to enable value speculation.


5 Dependency Ordering in C11 and C++11 Implementations

The simplest way to avoid dependency-ordering issues is to strengthen all memory_order_consume operations to memory_order_acquire. This functions correctly, but may result in unacceptable performance due to memory-barrier instructions on weakly ordered systems such as ARM and PowerPC,5 and may further unnecessarily suppress code-motion optimizations.
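At the source level, this strengthening amounts to something like the following C11 sketch. The macro load_consume_as_acquire() and the surrounding functions are hypothetical names introduced for illustration; an implementation would perform the equivalent promotion internally rather than through a macro.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
    int a;
};

_Atomic(struct foo *) gp;

/* Hypothetical wrapper: every would-be consume load becomes an acquire
 * load, which is correct but may cost a fence on ARM or PowerPC. */
#define load_consume_as_acquire(ptr) \
    atomic_load_explicit((ptr), memory_order_acquire)

void publish(struct foo *p)
{
    assert(p != NULL);
    atomic_store_explicit(&gp, p, memory_order_release);
}

int read_a(int fallback)
{
    struct foo *p = load_consume_as_acquire(&gp);

    return p ? p->a : fallback;
}
```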

Another straightforward approach is to avoid value speculation and other dependency-breaking optimizations. This might result in missed opportunities for optimization, but avoids any need for dependency-chain annotations and also all issues that might otherwise arise from use of dependency-breaking optimizations. This approach is fully compatible with the Linux kernel community’s current approach to dependency chains. Unfortunately, there are any number of valuable optimizations that break dependency chains, so this approach seems impractical.

A third approach is to avoid value speculation and other dependency-breaking optimizations in any function containing either a memory_order_consume load or a [[carries_dependency]] attribute. For example, the hardware-branch-prediction optimization shown in Figure 10 would be prohibited in such functions, as would cancellation optimizations such as optimizing a = b + c - c into a = b. This too can result in missed opportunities for optimization, though very probably many fewer than the previous approach. This approach can also result in issues due to dependency-breaking optimizations in functions lacking [[carries_dependency]] attributes, for example, function d() in Figure 11. It can also result in spurious memory-barrier instructions when a dependency chain goes out of scope, for example, with the return statement of function g() in Figure 12.

5 From a Linux-kernel community viewpoint, that should read “will result in unacceptable performance”.

    int a(struct foo *p [[carries_dependency]])
    {
        return kill_dependency(p->a != 0);
    }

    int b(int x)
    {
        return x;
    }

    foo *c(void)
    {
        return fp.load_explicit(memory_order_consume);
        /* return rcu_dereference(fp) in Linux kernel. */
    }

    int d(void)
    {
        int a;
        foo *p;

        rcu_read_lock();
        p = c();
        a = p->a;
        rcu_read_unlock();
        return a;
    }

Figure 11: Example Functions for Dependency Ordering, Part 1

    [[carries_dependency]] struct foo *e(void)
    {
        return fp.load_explicit(memory_order_consume);
        /* return rcu_dereference(fp) in Linux kernel. */
    }

    int f(void)
    {
        int a;
        foo *p;

        rcu_read_lock();
        p = e();
        a = p->a;
        rcu_read_unlock();
        return kill_dependency(a);
    }

    int g(void)
    {
        int a;
        foo *p;

        rcu_read_lock();
        p = e();
        a = p->a;
        rcu_read_unlock();
        return b(a);
    }

Figure 12: Example Functions for Dependency Ordering, Part 2

A fourth approach is to add a compile-time operation corresponding to the beginning and end of an RCU read-side critical section. These would need to be evaluated at compile time, taking into account the fact that these critical sections can nest and can be conditionally entered and exited. Note that the exit from an outermost RCU read-side critical section should imply a kill_dependency() operation on each variable that is live at that point in the code.6

Although it is probably impossible to precisely determine the bounds of a given RCU read-side critical section in the general case, conservative approaches that might overestimate the extent of a given section should be acceptable in almost all cases. This approach would make functions c() and d() in Figure 11 handle dependency chains in a natural manner, but avoiding whole-program analysis would require something similar to the [[carries_dependency]] annotations called out in the C11 and C++11 standards.
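The nesting rule that such a compile-time operation would have to respect can be modeled at run time as in the following sketch. This is only a single-threaded model with hypothetical names (model_rcu_read_lock(), model_rcu_read_unlock()); the fourth approach itself calls for the compiler to reason about this statically, possibly conservatively.

```c
#include <assert.h>

/* Hypothetical single-threaded model of read-side nesting depth. */
int rcu_depth;
int outermost_exits;

void model_rcu_read_lock(void)
{
    rcu_depth++;
}

/* Returns 1 when this unlock ends an outermost critical section, which
 * is the only point at which an implied kill_dependency() would apply.
 * Returns 0 for an unlock that is nested inside another section. */
int model_rcu_read_unlock(void)
{
    assert(rcu_depth > 0);
    if (--rcu_depth == 0) {
        outermost_exits++;
        return 1;
    }
    return 0;
}
```

In this model, an inner unlock leaves live dependency chains intact, and only the final unlock marks the point where they would be terminated.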

A fifth approach would be to require that all operations on the essential subset of any dependency chain be annotated. This would greatly ease implementation, but would not be likely to be accepted by the Linux kernel community.

A sixth approach is to track dependencies as called out in the C11 and C++11 standards. However, instead of emitting a memory-barrier instruction when a dependency chain flows into or out of a function without the benefit of [[carries_dependency]], insert an implicit kill_dependency() invocation. Implementations should also optionally issue a diagnostic in this case. The motivation for this approach is that it is expected that many more kill_dependency() invocations than [[carries_dependency]] attributes would be required to convert the Linux kernel’s RCU code to C11. In the example in Figure 12, this approach would allow function g() to avoid emitting an unnecessary memory-barrier instruction, but without function f()’s explicit kill_dependency().
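The effect of the sixth approach on the functions of Figure 12 can be approximated in C11 as shown below. This is a sketch under stated assumptions: C11 has no attributes, so e() is unannotated, the RCU read-side markers are omitted, and the kill_dependency() in g() is written out by hand to show where the compiler would insert it implicitly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
    int a;
};

_Atomic(struct foo *) fp;

struct foo *e(void)
{
    return atomic_load_explicit(&fp, memory_order_consume);
}

int b(int x)
{
    return x;
}

/* f(): the developer ends the chain explicitly, as in Figure 12. */
int f(void)
{
    struct foo *p = e();
    int a;

    assert(p != NULL);
    a = p->a;
    return kill_dependency(a);
}

/* g(): under the sixth approach the compiler would supply this
 * kill_dependency() itself when the chain flows into b(), instead of
 * emitting a memory-barrier instruction. */
int g(void)
{
    struct foo *p = e();

    assert(p != NULL);
    return b(kill_dependency(p->a));
}
```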

6 What if a given rcu_read_unlock() sometimes marked the end of an outermost RCU read-side critical section, but other times was nested in some other RCU read-side critical section? In that case, there should be no kill_dependency().

A seventh and final approach is to track dependencies as called out in the C11 and C++11 standards. With this approach, functions e() and f() properly preserve the required amount of dependency ordering.

    1 p = atomic_load_explicit(gp, memory_order_consume);
    2 if (p == ptr_a)
    3   a = p->special_a;
    4 else
    5   a = p->normal_a;

Figure 13: Dependency-Ordering Value-Narrowing Hazard

6 Weaknesses in C11 and C++11 Dependency Ordering

Experience has shown several weaknesses in the dependency ordering specified in the C11 and C++11 standards:

1. The C11 standard does not provide attributes, and in particular, does not provide the [[carries_dependency]] attribute. This prevents the developer from specifying that a given dependency chain passes into or out of a given function.

2. The implementation complexity of the dependency-chain tracking required by both standards can be quite onerous on the one hand, and the overhead of unconditionally promoting memory_order_consume loads to memory_order_acquire can be excessive on weakly ordered implementations on the other. There is therefore no easy way out for a memory_order_consume implementation on a weakly ordered system.

3. The function-level granularity of [[carries_dependency]] seems too coarse. One problem is that points-to analysis is non-trivial, so that compilers are likely to have difficulty determining whether or not a given pointer carries a dependency. For example, the current wording of the standard (intentionally!) does not disallow dependency chaining through stores and loads. Therefore, if a dependency-carrying value might ever be written to a given variable, an implementation might reasonably assume that any load from that variable carries a dependency.

4. The rules set out in the standard [27, 1.10p9] do not align well with the rules that developers must currently adhere to in order to maintain dependency chains when using pre-C11 and pre-C++11 compilers (see Section 4.1). For example, the standard requires x-x to carry a dependency, and providing this guarantee would at the very least require the compiler to also turn off optimizations that remove x-x (and similar patterns) if x might possibly be carrying a dependency. For another example, consider the value-speculation-like code shown in Figure 13 that is sometimes written by developers, and that was described in bullet 9 of Section 4.1. In this example, the standard requires dependency ordering between the memory_order_consume load on line 1 and the subsequent dereference on line 3, but a typical compiler would not be expected to differentiate between these two apparently identical values. These two examples show that a compiler would need to detect and carefully handle these cases either by artificially inserting dependencies, omitting optimizations, differentiating between apparently identical values, or even by emitting memory_order_acquire fences.

5. The whole point of memory_order_consume and the resulting dependency chains is to allow developers to optimize their code. Such optimization attempts can be completely defeated by the memory_order_acquire fences that the standard currently requires when a dependency chain goes out of scope without the benefit of a [[carries_dependency]] attribute. Preventing the compiler from emitting these fences requires liberal use of kill_dependency(), which clutters code, requires large developer effort, and further requires that the developer know quite a bit about which code patterns a given version of a given compiler can optimize (thus avoiding needless fences) and which it cannot (thus requiring manual insertion of kill_dependency()).
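The two hazards named in the fourth weakness can be written as complete C11 functions, as in the sketch below. The structure layout, the global node_a standing in for ptr_a, and the function names are assumptions made for this example. The comments mark the places where a conventional optimizer would be inclined to drop the dependency that the standard nevertheless requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int special_a;
    int normal_a;
};

_Atomic(struct node *) gp;
struct node node_a = { .special_a = 1, .normal_a = 2 };

/* Value narrowing, after Figure 13: inside the "then" branch p is known
 * to equal &node_a, so a compiler might load node_a.special_a directly,
 * which no longer depends on the consume load. */
int read_narrowed(void)
{
    struct node *p = atomic_load_explicit(&gp, memory_order_consume);

    assert(p != NULL);
    if (p == &node_a)
        return p->special_a;
    return p->normal_a;
}

/* Cancellation: x - x is always zero, so a compiler would normally fold
 * the index to a constant, yet the standard says it carries a dependency. */
int read_cancelled(const int *table, _Atomic int *gi)
{
    int x = atomic_load_explicit(gi, memory_order_consume);

    return table[x - x];
}
```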

As of this writing, no known implementations fully support C11 or C++11 dependency ordering.

It is worth asking why Paul didn’t anticipate these weaknesses. There are several reasons for this:

1. Compiler optimizations have become more aggressive over the seven years since Paul started working on standardization.

2. New dependency-ordering use cases have arisen during that same time, in particular, there are longer dependency chains and more of them, including dependency chains spanning multiple compilation units.

3. The number of dependency chains has increased by roughly an order of magnitude during that time, so that changes in code style can be expected to face a commensurate increase in resistance from the Linux kernel community, unless those changes bring some tangible benefit.

With that, let’s look at some potential alternatives to dependency ordering as defined in the C11 and C++11 standards.

7 Potential Alternatives to C11 and C++11 Dependency Ordering

Given the weaknesses in the current standard’s specification of dependency ordering, it is quite reasonable to consider alternatives. To this end, Section 7.1 discusses ease-of-use issues involved with revisions to the C11 and C++11 definitions of dependency ordering, Section 7.2 enlists help from the type system, but also imposes value restrictions (thus revising the C11 and C++11 semantics for dependencies), Section 7.3 enlists help from the type system without the value restrictions, and Section 7.4 describes a whole-program approach to dependency chains (also revising the C11 and C++11 semantics for dependencies). Section 7.5 describes a post-Rapperswil proposal that dependency chains be restricted to function-scope local variables and temporaries, and Section 7.6 describes a second post-Rapperswil proposal that the [[carries_dependency]] attribute be used to label local-scope variables that carry dependencies. Section 7.7 describes a proposal discussed verbally at Rapperswil that explicitly marks the ends of dependency chains. Each approach appears to have advantages and disadvantages, so it is hoped that further discussion will either help settle on one of these alternatives or generate something better. To help initiate this discussion, Section 7.8 provides an initial comparative evaluation.

7.1 Revising C11 and C++11 Dependency-Ordering Definition

The following sections each describe a proposed revision of the dependency-ordering definition from that in the current C11 and C++11 standards. In many of these proposals, developers are required to follow an additional rule in order to be able to rely on dependency ordering: Subsequent execution must not lead to a situation where there is only one possible value for the variable that is intended to carry the dependency.7 This is shown in Figure 15, where the compiler is permitted to break dependency ordering on line 6 because it knows that the value of p is equal to that of q, which means that it could substitute the latter value for the former, which would break dependency ordering. In short, a dependency chain breaks if it comes to a point where only a single value is possible, regardless of the value of the memory_order_consume load heading up the chain. At first glance, this additional rule could be quite difficult to live with, as dependency ordering could come and go depending on small details of code far away from that point in the dependency chain.

However, a review of the Linux-kernel operators in Section 3.2 shows that the most commonly used operators act identically under both definitions. The problem-free operators include ->, infix =, casts, prefix &, prefix *, and ternary ?:.

7 This restricted notion of dependence is sometimes called semantic dependence, and the value at the end of a dependence chain that does not represent a semantic dependence is sometimes said to be independent of the value at the head of the dependency chain.

    1 int my_array[MY_ARRAY_SIZE];
    2
    3 i = atomic_load_explicit(gi, memory_order_consume);
    4 r1 = my_array[i];

Figure 14: Single-Element Arrays and Dependency Ordering

One example of a potentially troublesome operator, namely ==, is shown in Figure 15, where line 6 breaks dependency ordering because the value of p is known to be equal to that of q, which is not part of a dependency chain. This example could be addressed through careful diagnostic design coupled with appropriate coding standards. For example, the compiler could emit a warning on line 6, but remain silent for the equivalent line substituting q for p, namely, do_something_with(q->a).

Another example is the use of postfix [] that is shown in Figure 14. If this code fragment were compiled with MY_ARRAY_SIZE equal to one, there would be no dependency ordering between lines 3 and 4, but that same code fragment compiled with MY_ARRAY_SIZE equal to two or greater would be dependency-ordered. Here a diagnostic for single-element arrays might prove useful, and such a diagnostic can easily be supplied in this case using #if and #error.
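One way such a diagnostic might be written is sketched below. The default value of MY_ARRAY_SIZE, the wording of the error message, and the read_element() wrapper are assumptions added so that the fragment from Figure 14 compiles on its own.

```c
#include <assert.h>
#include <stdatomic.h>

#ifndef MY_ARRAY_SIZE
#define MY_ARRAY_SIZE 4   /* hypothetical default for this sketch */
#endif

/* With a single element the only valid index is zero, so the load from
 * my_array[] would not depend on the value of i under the revised rule. */
#if MY_ARRAY_SIZE < 2
#error "my_array has one element: indexing cannot carry a dependency"
#endif

int my_array[MY_ARRAY_SIZE];
_Atomic int gi;

int read_element(void)
{
    int i = atomic_load_explicit(&gi, memory_order_consume);

    assert(i >= 0 && i < MY_ARRAY_SIZE);
    return my_array[i];
}
```

Building with MY_ARRAY_SIZE defined as 1 then fails at compile time instead of silently losing the ordering between the two lines.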

In the Linux kernel, infix + and - are used for pointer and array computations. These are all safe in that they operate on an integer and pointer, so that any cancellation will not normally be detectable at compile time. However, one big purpose of diagnostics is to detect abnormal conditions indicating probable bugs. Therefore, in cases where the compiler can determine that two values from dependency chains are annihilating each other via infix + and -, a diagnostic would be appropriate.
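A typical instance of such pointer arithmetic is recovering an enclosing structure from a pointer to an embedded list node. The sketch below uses a hypothetical item_of() macro in the spirit of the kernel's container_of(); it is not the kernel's definition. The subtraction combines a dependency-carrying pointer with a compile-time constant, so nothing cancels.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct list_node {
    struct list_node *next;
};

struct item {
    int value;
    struct list_node link;
};

/* Hypothetical stand-in for container_of(): pointer minus a constant. */
#define item_of(nodep) \
    ((struct item *)((char *)(nodep) - offsetof(struct item, link)))

_Atomic(struct list_node *) head;

int first_value(void)
{
    struct list_node *n = atomic_load_explicit(&head, memory_order_consume);

    assert(n != NULL);
    /* The result still varies with n, so the chain remains intact. */
    return item_of(n)->value;
}
```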

Similarly, the Linux kernel uses infix (bitwise) & to manipulate bits at the bottom of a pointer, where again cancellation will not normally be detectable at compile time, except in the case of operations on a NULL pointer, for which dependency ordering is not meaningful in any case. However, as with infix + and -, if the compiler detects value annihilation, a diagnostic would be appropriate.
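The low-order-bit idiom can be sketched as follows, assuming that struct foo is aligned to at least two bytes so that bit 0 of its address is free to hold a flag. The names tagged_gp, publish_tagged(), and read_tagged() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct foo {
    int a;
};

#define TAG_MASK ((uintptr_t)1)

_Atomic uintptr_t tagged_gp;   /* pointer value with a flag in bit 0 */

void publish_tagged(struct foo *p, int flag)
{
    uintptr_t v = (uintptr_t)p;

    assert((v & TAG_MASK) == 0);   /* relies on the alignment of struct foo */
    atomic_store_explicit(&tagged_gp, v | (flag ? TAG_MASK : 0),
                          memory_order_release);
}

int read_tagged(int *flag)
{
    uintptr_t v = atomic_load_explicit(&tagged_gp, memory_order_consume);
    /* Masking off the flag leaves a value that still varies with v. */
    struct foo *p = (struct foo *)(v & ~TAG_MASK);

    *flag = (int)(v & TAG_MASK);
    return p->a;
}
```

The mask removes only the flag bit, so the resulting pointer continues to depend on the loaded value; annihilation would require a mask of zero, which is the sort of case a diagnostic could flag.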

    1 value_dep_preserving struct foo *p;
    2
    3 p = atomic_load_explicit(gp, memory_order_consume);
    4 q = some_other_pointer;
    5 if (p == q)
    6   do_something_with(p->a);
    7 else
    8   do_something_else_with(p->b);

Figure 15: Single-Value Variables and Dependency Ordering

Although issues with false positives and negatives need further investigation, there is reason to hope that this revision of the definition of dependency ordering might avoid significant impacts on ease of use. With this hope, we proceed to the specific proposals.

7.2 Type-Based Designation of Dependency Chains With Restrictions

This approach was formulated by Torvald Riegel in response to Linus Torvalds’s spirited criticisms of the current C11 and C++11 wording.

This approach introduces a new value_dep_preserving type qualifier. Dependency ordering is preserved only via variables having this type qualifier. This is meant to model the real scope of dependencies, which is data flow, not execution at function-level granularity. This approach should therefore give developers much finer control of which dependencies are tracked.

Assigning from a value_dep_preserving value to a non-value_dep_preserving variable terminates the tracking of dependencies in much the same way that an explicit kill_dependency() would. However, unlike an explicit kill_dependency(), compilers should be able to emit a suppressible warning on implicit conversions, so as to alert the developer about otherwise silent dropping of dependency tracking.8

8 Other choices are possible in this case, including emitting a memory_order_acquire fence in order to conservatively preserve a potentially intended ordering.

Next, we specify that memory_order_consume loads return a value_dep_preserving type by default; the compiler must assume such a load to be capable of producing any value of the underlying type. In other words, the implementation is not permitted to apply any value-restriction knowledge it might gain from whole-program analysis.

This allows developers to start with a clean slate for the additional rule that they must follow to be able to rely on dependency ordering: Subsequent execution must not lead to a situation where there is only one possible value for the value_dep_preserving expression, because otherwise the implementation is permitted to break the dependency chain. As noted earlier, this is shown in Figure 15, where the compiler is permitted to break dependency ordering on line 6 because it knows that the value of p is equal to that of q, which means that it could substitute the latter value for the former, which would break dependency ordering.
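Since no compiler implements the proposed qualifier, the following C11 sketch models it with an empty macro purely to show where the annotations would appear and where tracking would end. The placement of the qualifier follows Figure 15; everything else (the structure, gp, and read_a()) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical: the qualifier is only a proposal, so here it expands to
 * nothing and has no effect on code generation. */
#define value_dep_preserving

struct foo {
    int a;
    int b;
};

_Atomic(struct foo *) gp;

int read_a(void)
{
    /* Qualified local: the dependency from the consume load is tracked. */
    value_dep_preserving struct foo *p =
        atomic_load_explicit(&gp, memory_order_consume);
    int a;   /* unqualified: receives a value, not a dependency */

    assert(p != NULL);
    /* Assignment to an unqualified variable ends tracking, in much the
     * same way that an explicit kill_dependency() would. */
    a = p->a;
    return a;
}
```

Under the proposal, the implicit conversion in the assignment to a is the point at which a compiler could emit its suppressible warning.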

This approach has several advantages:

1. The implementation is simpler because no dependency chains need to be traced. The implementation can instead drive optimization decisions strictly from type information.

2. Use of the value_dep_preserving type modifier allows the developer to limit the extent of the dependency chains.

3. This type modifier can be used to mark a dependency chain’s entry to and exit from a function in a straightforward way, without the need for attributes.

4. The value_dep_preserving type modifiers serve as valuable documentation of the developer’s intent.

5. This approach permits many additional optimizations compared to those permitted by the current standard on code that carries a dependency. Expressions such as x-x no longer require establishment of artificial dependencies and the compiler is no longer required to detect value-narrowing hazards like that shown in Figure 13. However, the compiler is still prohibited from adding its own value-speculation optimizations.

6. Linus Torvalds seems to be OK with it, which indicates that this set of rules might be practical from the perspective of developers who currently exploit dependency chains.

According to Peter Sewell, one disadvantage is that this approach will be quite difficult to model, which in turn will pose obstacles for the analysis tooling that will be increasingly necessary for large-scale concurrent programming efforts. In particular, the concern is that forcing the compiler to assume that a memory_order_consume load could possibly return any value permitted by its type might require program-analysis tools to consider counterfactual hypothetical executions, which might complicate specification of semantics and verification.

7.3 Type-Based Designation of Dependency Chains

Jeff Preshing made an off-list suggestion of using a value_dep_preserving type modifier as suggested by Torvald Riegel, but using this type modifier to strictly enforce dependency ordering. For example, consider the code fragment shown in Figure 15. The scheme described in Section 7.2 would not necessarily enforce dependency ordering between the load on line 3 and the access on line 6, while the approach described in this section would enforce dependency ordering in this case.

Furthermore, cancelling or value-destruction operations on value_dep_preserving values would not disrupt dependency ordering. As with the current C11 and C++11 standards, the implementation would be required to emit a memory-barrier instruction or compute an artificial dependency for such operations. (Note however that use of cancelling or value-destruction operations on dependency chains has proven quite rare in practice.)

This approach shares many of the advantages of Torvald Riegel’s approach:

1. The implementation is simpler because no dependency chains need be traced. The implementation can instead drive optimization decisions strictly from type information.

2. Use of the value_dep_preserving type modifier allows the developer to limit the extent of the dependency chains.

3. This type modifier can be used to mark a dependency chain’s entry to and exit from a function in a straightforward way, without the need for attributes.

4. The value_dep_preserving type modifiers serve as valuable documentation of the developer’s intent.

5. Although optimizations on a dependency chain are restricted just as in the current standard, the use of value_dep_preserving restricts the dependency chains to those intended by the developer.

6. Restricting dependency-breaking optimizations on all dependency chains marked value_dep_preserving, without exceptions for cases in which the compiler knows too much, might make this approach easier to learn and to use.

It is expected that modeling this approach should be straightforward because the modeling tools would be able to make use of the type information.

7.4 Whole-Program Option

This approach, also suggested off-list by Jeff Preshing, has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring recompilation in most cases).9 For example, this approach permits an instance of std::map to be referenced by a pointer loaded via memory_order_consume and to provide that std::map instance with the benefits of dependency ordering without any code changes whatsoever to std::map. It is important to note that this protection will be provided only to a read-only std::map that is referenced by a changing pointer loaded via memory_order_consume, in particular, not to a concurrently updated std::map referenced by a pointer (read-only or otherwise) loaded via memory_order_consume. This latter case would require changes to the underlying std::map implementation, at a minimum, changing some of the loads to be memory_order_consume loads. Nevertheless, the ability to provide dependency-ordering protection to pre-existing linked data structures is valuable, even with this read-only restriction.

9 A module or library that is known to never carry a dependency need not be recompiled.
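The read-only use case can be illustrated in C with a small lookup table standing in for std::map. In this sketch, table_lookup() plays the role of pre-existing code that knows nothing about dependency ordering, while publish_table() and lookup_current() are hypothetical wrappers. The table itself is never modified after publication; only the pointer to it changes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct table {
    size_t n;
    const int *keys;
    const int *vals;
};

/* Pre-existing lookup code, written without any atomics or annotations. */
int table_lookup(const struct table *t, int key, int missing)
{
    for (size_t i = 0; i < t->n; i++)
        if (t->keys[i] == key)
            return t->vals[i];
    return missing;
}

_Atomic(const struct table *) current_table;

void publish_table(const struct table *t)
{
    assert(t != NULL);
    atomic_store_explicit(&current_table, t, memory_order_release);
}

/* Under the whole-program option, the chain headed by this consume load
 * would extend through table_lookup() without changes to that function. */
int lookup_current(int key, int missing)
{
    const struct table *t =
        atomic_load_explicit(&current_table, memory_order_consume);

    return t ? table_lookup(t, key, missing) : missing;
}
```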

This approach, which again does require full recompilation, can be implemented using either of two approaches:

1. Promote all memory_order_consume loads to memory_order_acquire, as may be done with the current standard.

2. On architectures that respect dependency ordering, prohibit all dependency-breaking optimizations throughout the entire program, but only in cases where a change in the value returned by a memory_order_consume load could cause a change in the value computed later in that same dependency chain. Note again that the possibility of storing a value obtained from a memory_order_consume load, then loading it later, means that normal loads as well as memory_order_relaxed loads often must be considered to head their own dependency chains.

Some implementations might allow the developer to choose between these two approaches, for example, by using a compiler switch provided for that purpose.

This approach also has the effect of permitting a trivial implementation of a memory_order_consume atomic_thread_fence(). When using the first implementation approach, the atomic_thread_fence() is simply promoted to memory_order_acquire. Interestingly enough, when using the second approach, the memory_order_consume atomic_thread_fence() may simply be ignored. The reason for this is that this approach has the effect of promoting memory_order_relaxed loads to memory_order_consume, which already globally enforces all the ordering that the memory_order_consume atomic_thread_fence() is required to provide locally.10

10 Of course, this presumed promotion from memory_order_relaxed to memory_order_consume means that architectures such as DEC Alpha that do not respect dependency ordering must continue to use the first option of emitting memory-ordering instructions for memory_order_consume loads.

This approach has its own set of advantages and disadvantages:

1. This approach dispenses with the [[carries_dependency]] attribute and the kill_dependency() primitive.

2. This approach better promotes reuse of existing source code. In particular, it should require no changes to the current Linux-kernel source base, aside from changes to the rcu_dereference() family of primitives.

3. This approach allows implementations to carry out dependency-breaking optimizations on dependency chains as long as a change in the value from the memory_order_consume load does not change values further down the dependency chain, both with and without the optimization. Jeff conjectures that the set of dependency-breaking optimizations used in practice applies only outside of dependency chains, by the revised definition in which single-value restrictions break dependency chains.11 If this conjecture holds, it also applies to Torvald’s approach described in Section 7.2.

4. Code that follows the rules presented in Section 4.1 (substituting memory_order_consume loads for volatile loads) would have its dependency ordering properly preserved.

It is unlikely that this approach could be modeled reasonably given the current state of the art. The requirement that any given memory_order_consume load be able to generate at least two different values at the tail of the dependency chain is believed to be a show-stopper.

11 This is certainly the case for the usual optimizations exemplified by replacing x-x with zero.

7.5 Local-Variable Restriction

This approach, suggested off-list by Hans Boehm, limits the extent of dependency trees to a local, which includes local variables, temporaries, function arguments, and return values. Assigning a value from a memory_order_consume load to such an object begins a dependency chain. Assigning a value loaded from such a local to a global variable (including function-local variables marked static) or to the heap implies a kill_dependency(), so that dependency chains are confined to locals. However, if the compiler is unable to see the full dependency chain, for example, because it passes into a function in another translation unit that is not marked [[carries_dependency]], the compiler should promote memory_order_consume to memory_order_acquire.12
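A minimal C11 sketch of the local-versus-global distinction follows. The names cached and read_and_cache() are hypothetical, and the kill_dependency() on the store to the global is written explicitly here only to mark the point at which the proposal would imply one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
    int a;
};

_Atomic(struct foo *) gp;

/* Ordinary global: outside any dependency chain under this proposal. */
struct foo *cached;

int read_and_cache(void)
{
    /* Local variable: assigning from the consume load begins a chain. */
    struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

    assert(p != NULL);
    /* Store to a global: the proposal implies a kill_dependency() here,
     * so a later load from cached would not be dependency-ordered. */
    cached = kill_dependency(p);
    /* The chain through the local p is unaffected by the store above. */
    return p->a;
}
```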

Future work includes checking applicability to the Linux kernel. Section 3.2 indicates that the following operators should transmit dependency status from one local variable or temporary to another: ->, infix =, casts, prefix &, prefix *, [], infix +, infix -, ternary ?:, infix (bitwise) &, and probably also |. Similarly, Section 3.3 indicates that the following operators should imply a kill_dependency(): (), !, ==, !=, &&, ||, infix *, /, and %.

It will also be necessary to check whether Linux-kernel usage expects dependency chains to pass through globals and heap objects that are in some way thread-local. If there are such use cases, and if they are sane and cannot easily be changed to use local variables, should [[carries_dependency]] be used to flag dependency-carrying globals and heap objects?

This approach has the following advantages and disadvantages:

1. This approach requires that the C language add the [[carries_dependency]] attribute if dependency chains are to span multiple translation units, as is the case in some parts of the Linux kernel.

2. The implementation is likely to be somewhat simpler because only those dependency chains passing through local variables, compiler-generated temporaries, compiler-visible function arguments, and compiler-visible return values need be traced.

3. Many irrelevant dependency chains are pruned by default, thus fewer std::kill_dependency() calls are required.

4. Although optimizations on dependency chains must be restricted, the restricted scope of dependency chains reduces the impact of these restrictions.

5. Applying this approach to the Linux kernel would require relatively small changes, as the only markings required are on function parameters and return values corresponding to cross-translation-unit function calls.

12 Some implementations might provide means to allow the user to specify that a diagnostic be generated if such promotion is necessary.

It is expected that modeling this approach should be no more difficult than for the current C11 and C++11 standards.

7.6 Mark Dependency-Carrying Local Variables

This approach, suggested off-list by Clark Nelson, uses the [[carries_dependency]] attribute to mark non-static local-scope variables as carrying a dependency, in addition to its current use marking function arguments and return values as carrying dependencies. It is not permissible to mark global variables or structure members with this attribute. Assigning from a [[carries_dependency]] object to a non-[[carries_dependency]] object results in an implicit kill_dependency().

This approach is similar to that of Section 7.3, except that it uses an attribute rather than a type modifier. As such, it has many of the advantages and disadvantages of that approach. However, some believe that an attribute-based approach will be more acceptable to the committee than would a type-modifier approach.13 On the other hand, this approach does require that C add attributes.

13 Lawrence Crowl suggests a third approach, namely a variable modifier.


This leaves the question of which operators transmit dependency chains from one [[carries_dependency]] object to another. Section 3.2 indicates that the following operators should transmit dependency status from one local variable or temporary to another: ->, infix =, casts, prefix &, prefix *, [], infix +, infix -, ternary ?:, infix (bitwise) &, and probably also |. Similarly, Section 3.3 shows that the following operators should imply a kill_dependency(): (), !, ==, !=, &&, ||, infix *, /, and %.
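The marking and operator rules above can be pictured with the following C sketch. It is only an illustration under stated assumptions: CARRIES_DEPENDENCY is a hypothetical macro that expands to nothing (C has no such attribute today), and the struct node type, the gp pointer, and the publish() and read_val() functions are invented names. The comments describe what this proposal would require of a compiler, not what any current compiler does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for [[carries_dependency]] on a local variable.
 * It expands to nothing, so here it serves as documentation only. */
#define CARRIES_DEPENDENCY

struct node {
    int key;
    int vals[4];
};

static _Atomic(struct node *) gp;   /* hypothetical published pointer */

/* Publisher: the node is fully initialized before it becomes reachable. */
static void publish(struct node *n)
{
    atomic_store_explicit(&gp, n, memory_order_release);
}

/* Reader: returns vals[idx] of the published node, or -1 if none. */
static int read_val(size_t idx)
{
    assert(idx < 4);

    /* Head of the dependency chain, held in a marked local variable. */
    CARRIES_DEPENDENCY struct node *p =
        atomic_load_explicit(&gp, memory_order_consume);

    /* "==" would imply kill_dependency(): the comparison result
     * carries no dependency, but p itself still does. */
    if (p == NULL)
        return -1;

    /* "->", "[]", prefix "&", and infix "=" would transmit the
     * dependency from one marked local to another. */
    CARRIES_DEPENDENCY int *vp = &p->vals[idx];

    /* The load through vp is ordered after the consume load. Assigning
     * to the unmarked v would act as an implicit kill_dependency(). */
    int v = *vp;
    return v;
}
```

Under this reading, only p and vp would need to be traced by the compiler, and the chain ends at the assignment to v without any explicit kill_dependency() call.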


This approach has the following advantages and disadvantages:

1. This approach requires that the C language add the [[carries_dependency]] attribute.

2. The implementation is likely to be simpler because only those dependency chains passing through variables marked with the [[carries_dependency]] attribute need be traced.

3. Many irrelevant dependency chains are pruned by default, thus fewer std::kill_dependency() calls are required.

4. The [[carries_dependency]] attributes serve as valuable documentation of the developer’s intent.

5. Although optimizations on dependency chains must be restricted, use of explicit [[carries_dependency]] greatly reduces unnecessary restriction of optimizations on unintentional dependency chains.

6. Applying this to the Linux kernel would require significant marking of variables carrying dependencies, given that the Linux kernel currently requires no such markings.

It is expected that modeling this approach should be no more difficult than for the current C11 and C++11 standards.

1 p = atomic_load_explicit(&gp, memory_order_consume);
2 if (p != NULL)
3   do_it(atomic_dependency(p, gp));

Figure 16: Explicit Dependency Operations

7.7 Explicitly Marked Dependency Chains

This approach, suggested at Rapperswil by Olivier Giroux, can be thought of as the inverse of std::kill_dependency(). Instead of explicitly marking where the dependency chains terminate, Olivier’s proposal uses a std::dependency() primitive to indicate the locations in the code that the dependency chains are required to reach. The first argument to std::dependency() is the value to which the dependency must be carried, and the second argument is the variable that heads the dependency chain; in other words, the second argument is the variable that was loaded from by a memory_order_consume load. This proposal differs from the others in that it is expected to be implemented not necessarily by preserving the dependency, but instead by inserting barriers in those cases where optimizations have eliminated any required dependencies. The goal here is to impose minimal restrictions on optimizations of code containing dependency chains.

A C-language example is shown in Figure 16, where std::dependency() is transliterated to the C-language atomic_dependency() function. On line 3, atomic_dependency() returns the value of its first argument (p), while ensuring that the data dependency from the memory_order_consume load from gp is faithfully reflected in the assembly language implementing this code fragment. The assembly-language reflection of this dependency might be in terms of an assembly-language dependency (for example, on ARM or PowerPC), implicit memory ordering (for example, on x86 or mainframe), or an explicit memory-barrier instruction. However, if there were no atomic_dependency() function, the compiler would be under no obligation to preserve the dependency.
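One conservative way to picture the barrier-insertion option is to treat every atomic_dependency() as if the dependency had been optimized away. The C11 sketch below makes that assumption explicit: atomic_dependency() is written as a hypothetical macro that merely names its second argument, issues an acquire fence, and then yields its first argument. The struct bar layout, the do_it() body, and the reader() wrapper are invented for illustration, and a real implementation would be expected to emit the fence only where the dependency is actually lost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct bar {
    int b;
};

static _Atomic(struct bar *) gp;   /* head of the dependency chain */

/*
 * Hypothetical fallback for atomic_dependency(val, head): name the
 * head without evaluating it, emit an acquire fence, then evaluate
 * and yield val. This corresponds to the "explicit memory-barrier
 * instruction" case, applied unconditionally.
 */
#define atomic_dependency(val, head) \
    ((void)sizeof(head), atomic_thread_fence(memory_order_acquire), (val))

/* Invented consumer: dereferences the pointer it is handed. */
static int do_it(struct bar *p)
{
    return p->b;
}

/* Mirrors Figure 16, returning -1 when nothing has been published. */
static int reader(void)
{
    struct bar *p = atomic_load_explicit(&gp, memory_order_consume);

    if (p != NULL)
        return do_it(atomic_dependency(p, gp));
    return -1;
}
```

Because the fence follows the load from gp and precedes the dereference in do_it(), the reader sees the initialized contents of the structure even if the compiler has broken the data dependency.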

These explicitly specified dependencies may be combined with [[carries_dependency]] attributes on function arguments, for example, as shown in Figure 17. Note the interplay of atomic_dependency() and [[carries_dependency]], where line 8 establishes the dependency between the load from gp and the [[carries_dependency]] argument q of foo(), and where line 4 establishes the further dependency between argument q of foo() and do_it()’s argument.

1 void foo(struct bar *q [[carries_dependency]])
2 {
3   if (q != NULL)
4     do_it(atomic_dependency(q->b, q));
5 }
6
7 p = atomic_load_explicit(&gp, memory_order_consume);
8 foo(atomic_dependency(p, gp));

Figure 17: Explicit Dependency Operations and carries_dependency

This approach is not yet complete. One issue is the possibility of a given operation being dependent on multiple memory_order_consume loads. One approach is of course to omit this functionality, and another is to have atomic_dependency() accept an expression as its first argument followed by a variable-length list of memory_order_consume-loaded variables.

Another issue is connecting [[carries_dependency]] return values to subsequent atomic_dependency() invocations. There are a number of possible resolutions to this issue. One approach would be to use the [[carries_dependency]] attribute to mark the declaration of the variable to which the function’s return value is assigned, bringing the proposal from Section 7.6 to bear. In the special case where the memory_order_consume load is in the same function body as the atomic_dependency() that depends on it, the atomic_dependency() could reference the variable that was the source of the original memory_order_consume load. Another approach would be to allow function-return [[carries_dependency]] attributes to define names that could be used by later atomic_dependency() invocations.

A third issue arises if optimizations along a needed dependency chain allow ordering the dependent operation to precede the head of the dependency chain, in which case inserting barriers would be ineffective. The current proposal for addressing this issue is to suppress memory-movement optimizations across the atomic_dependency(), perhaps using something like atomic_signal_fence() or the Linux kernel’s barrier() macro. This approach allows dependency checking and fence insertion to be carried out as a final pass in the compilation process.
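The suppression of memory-movement optimizations described above might be sketched as follows. This is a hypothetical macro form of atomic_dependency() that emits only a compiler-level barrier through C11 atomic_signal_fence(). The final-pass dependency check and any hardware fence it would insert are not modeled, and the struct bar layout and function bodies are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct bar {
    int b;
};

static _Atomic(struct bar *) gp;   /* head of the dependency chain */

/*
 * Hypothetical compiler-barrier form of atomic_dependency(val, head).
 * atomic_signal_fence() emits no fence instruction but restrains the
 * compiler from moving memory accesses across this point, so that val
 * is evaluated after the head of the chain was loaded. The head is
 * only named here, never evaluated.
 */
#define atomic_dependency(val, head) \
    ((void)sizeof(head), atomic_signal_fence(memory_order_seq_cst), (val))

/* Follows the shape of foo() in Figure 17, minus the attribute. */
static int foo(struct bar *q)
{
    if (q != NULL)
        return atomic_dependency(q->b, q);
    return -1;
}

static int reader(void)
{
    struct bar *p = atomic_load_explicit(&gp, memory_order_consume);

    return foo(atomic_dependency(p, gp));
}
```

With the compiler barrier in place, the load of q->b cannot be hoisted above the load from gp at compile time, which is what lets a later pass decide whether the remaining hardware-level dependency suffices or a fence must be added.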


This approach has the following advantages and disadvantages:

1. This approach requires that the C language add the [[carries_dependency]] attribute.

2. The implementation is likely to be simpler because only those dependency chains having explicit atomic_dependency() calls (and, optionally, intermediate [[carries_dependency]] attributes) need be traced.

3. Irrelevant dependency chains are pruned by default, with no std::kill_dependency() calls required.

4. The atomic_dependency() calls serve as valuable documentation of the developer’s intent.

5. Although optimizations on dependency chains must be restricted, use of explicit atomic_dependency() greatly reduces unnecessary restriction of optimizations on unintentional dependency chains.

6. Applying this to the Linux kernel would require significant marking of dependency chains, given that the Linux kernel currently relies on implicit ends of dependency chains.

It is not yet known whether this approach can be reasonably modeled.

7.8 Evaluation

This evaluation starts by enumerating the different audiences that any change to memory_order_consume must address (Section 7.8.1) and then compares the various proposals based on the perceived viewpoints of these audiences (Section 7.8.2).


7.8.1 Audiences

The main audiences for any change to memory_order_consume include standards committee members, compiler implementers, formal-methods researchers, developers intending to write new code, and developers working with existing RCU code. The Linux kernel community is of course a notable example of this last category.

Standards committee members would like a clean and non-intrusive change to the standard. They would of course also like a solution that minimized the number and vehemence of complaints from the other audiences, or, failing that, reduced the complaints to a tolerable noise level.

Compiler implementers would like a mechanism that fits nicely into current implementations, which does much to explain their satisfaction with the approach of strengthening memory_order_consume to memory_order_acquire. In particular, they would like to avoid unbounded tracing of dependencies, and would prefer minimal constraints on their ability to apply time-honored optimizations.

Formal-methods researchers would like a definition of memory_order_consume that fits into existing theoretical frameworks without undue conceptual violence. Of particular concern is any need to deal with counter-factuals, in other words, any need to reason not only about values of variables required for the solution of a given litmus test, but also about other unrelated values for these variables. As such, counter-factuals are the rock upon which otherwise attractive approaches involving semantic dependency have foundered.14 Some practitioners might wonder why the opinion of formal-methods researchers should be given any weight at all. The answer is that the work of formal-methods researchers provides the much-needed tools to analyze both the memory-ordering specification itself and programs using that specification.

Developers writing new code need something that expresses their algorithm with a minimum of syntactic saccharine, that is easy to learn, and that is easy to maintain. For example, one of the weaknesses of the current standards’ definition of memory_order_consume is the need to sprinkle large numbers of kill_dependency() calls throughout one’s code. In short, developers would like it to be easy to write, analyze, and maintain code that uses dependency ordering.

14 That said, Alan Jeffries is making another attempt to come up with a suitable formal definition of semantic dependency.

Developers with existing RCU code have the same desires as do developers writing new code, but are also very interested in minimizing the code churn required to adhere to the standard.

The challenge is of course to find a proposal that addresses the viewpoints of all of these audiences. As we will see in the next section, this is not easy.

7.8.2 Comparison

A summary comparison of the proposals is shown in Table 1.

The dependency type can either be “dep” for normal dependency or “sdep” for semantic dependency. Variable, formal-parameter, and return-value marking can either be type-based (“T”), attribute-based (“A”), or not required (“ ”). End-of-chain handling can either require an explicit kill_dependency() (“K”), an implicit kill_dependency() (“k”), explicit designation of dependency (“D”), or not required (“ ”).15 Dependency tracking might be required for all chains (“Y”), explicitly designated chains (“y”), or not required at all (“ ”). C-language [[carries_dependency]] support might be required (“Y”) or not (“ ”).

The ideal proposal would have dependency type “dep” (thus making it easier to model dependency ordering), no need for variable, formal-parameter, or return-value marking (thus minimizing changes required for existing RCU code), implicit “do the right thing” end-of-chain handling16 (thus minimizing the need for whack-a-mole source-code markups), no need for dependency tracking (thus making it easier to implement), and no need for C-language support for the [[carries_dependency]] attribute (thus minimizing changes to the C standard).

15 Variables that go out of scope always have any dependency chain implicitly killed.

16 Perhaps implemented by a careful choice of exactly which operators carry dependencies in which situations.

| Proposal | Dependency Type | Variable Marking | Formal-Parameter Marking | Return-Value Marking | End-Of-Chain Handling | Dependency Tracing Required | C Attribute Support Required |
|---|---|---|---|---|---|---|---|
| C11 / C++11 | dep | | A | A | K | Y | Y |
| Type-Based Designation of Dependency Chains With Restrictions (Section 7.2) | sdep | T | T | T | k | | |
| Type-Based Designation of Dependency Chains (Section 7.3) | dep | T | T | T | k | | |
| Whole-Program Option (Section 7.4) | sdep | | | | | | |
| Local-Variable Restriction (Section 7.5) | dep | | A | A | k | | Y |
| Mark Dependency-Carrying Local Variables (Section 7.6) | dep | A | A | A | k | | Y |
| Explicitly Marked Dependency Chains (Section 7.7) | dep | | A | A | Dk | y | Y |

Table 1: Comparison of Consume Proposals

We clearly have some more work to do.

8 Summary

This document has analyzed Linux-kernel use of dependency ordering and has laid out the status-quo interaction between the Linux kernel and pre-C11 compilers. It has also put forward some possible ways of building towards a full implementation of C11’s and C++11’s handling of dependency ordering. Finally, it calls out some weaknesses in C11’s and C++11’s handling of dependency ordering and offers some alternatives.

References

[1] Alglave, J., Maranget, L., Pawan, P., Sarkar, S., Sewell, P., Williams, D., and Nardelli, F. Z. PPCMEM/ARMMEM: A tool for exploring the POWER and ARM memory models. http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf, June 2011.

[2] Alglave, J., Maranget, L., and Tautschnig, M. Herding cats: Modelling, simulation, testing, and data-mining for weak memory. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2014), PLDI ’14, ACM, pp. 40–40.

[3] ARM Limited. ARM Architecture Reference Manual: ARMv7-A and ARMv7-R Edition, 2010.

[4] Boehm, H. J. Space efficient conservative garbage collection. SIGPLAN Not. 39, 4 (Apr. 2004), 490–501.

[5] Bonzini, P., and Day, M. RCU implementation for Qemu. http://lists.gnu.org/archive/html/qemu-devel/2013-08/msg02055.html, August 2013.

[6] Dalton, M. THE DESIGN AND IMPLEMENTATION OF DYNAMIC INFORMATION FLOW TRACKING SYSTEMS FOR SOFTWARE SECURITY. PhD thesis, Stanford University, 2009. Available: http://csl.stanford.edu/~christos/publications/2009.michael_dalton.phd_thesis.pdf [Viewed March 9, 2010].

[7] Desnoyers, M. [RFC git tree] userspace RCU (urcu) for Linux. http://urcu.so, February 2009.

[8] Desnoyers, M., McKenney, P. E., Stern, A., Dagenais, M. R., and Walpole, J. User-level implementations of read-copy update. IEEE Transactions on Parallel and Distributed Systems 23 (2012), 375–382.

[9] Grisenthwaite, R. ARM Barrier Litmus Tests and Cookbook. ARM Limited, 2009.

[10] Howard, P. W., and Walpole, J. A relativistic enhancement to software transactional memory. In Proceedings of the 3rd USENIX conference on Hot topics in parallelism (Berkeley, CA, USA, 2011), HotPar’11, USENIX Association, pp. 1–6.

[11] Howard, P. W., and Walpole, J. Relativistic red-black trees. Concurrency and Computation: Practice and Experience (2013), n/a–n/a.

[12] Intel Corporation. A Formal Specification of Intel Itanium Processor Family Memory Ordering, 2002. Available: http://developer.intel.com/design/itanium/downloads/251429.htm ftp://download.intel.com/design/Itanium/Downloads/25142901.pdf [Viewed: January 10, 2007].

[13] International Business Machines Corporation. Power ISA Version 2.07, 2013.

[14] Kannan, H. Ordering decoupled metadata accesses in multiprocessors. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (New York, NY, USA, 2009), ACM, pp. 381–390.



[15] McCarthy, J. Recursive functions of symbolic expressions and their computation by machine, part i. Commun. ACM 3, 4 (Apr. 1960), 184–195.

[16] McKenney, P. E. Read-copy update (RCU) usage in Linux kernel. Available: http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html [Viewed January 14, 2007], October 2006.

[17] McKenney, P. E. What is RCU? part 2: Usage. Available: http://lwn.net/Articles/263130/ [Viewed January 4, 2008], January 2008.

[18] McKenney, P. E. The RCU API, 2010 edition. http://lwn.net/Articles/418853/, December 2010.

[19] McKenney, P. E. Validating memory barriers and atomic instructions. http://lwn.net/Articles/470681/, December 2011.

[20] McKenney, P. E. Is Parallel Programming Hard, And, If So, What Can You Do About It? kernel.org, Corvallis, OR, USA, 2012.

[21] McKenney, P. E. Structured deferral: synchronization via procrastination. Commun. ACM 56, 7 (July 2013), 40–49.

[22] McKenney, P. E., Purcell, C., Algae, Schumin, B., Cornelius, G., Qwertyus, Conway, N., Sbw, Blainster, Rufus, C., Zoicon5, Anome, and Eisen, H. Read-copy update. http://en.wikipedia.org/wiki/Read-copy-update, July 2006.

[23] McKenney, P. E., and Slingwine, J. D. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems (Las Vegas, NV, October 1998), pp. 509–518.

[24] McKenney, P. E., and Walpole, J. What is RCU, fundamentally? Available: http://lwn.net/Articles/262464/ [Viewed December 27, 2007], December 2007.

[25] Michael, M. M. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems 15, 6 (June 2004), 491–504.

[26] Rossbach, C. J., Hofmann, O. S., Porter, D. E., Ramadan, H. E., Bhandari, A., and Witchel, E. TxLinux: Using and managing hardware transactional memory in an operating system. In SOSP’07: Twenty-First ACM Symposium on Operating Systems Principles (October 2007), ACM SIGOPS. Available: http://www.sosp2007.org/papers/sosp056-rossbach.pdf [Viewed October 21, 2007].

[27] Toit, S. D. Working draft, standard for programming language C++. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3691.pdf, May 2013.

[28] Vigueras, G., Orduna, J. M., and Lozano, M. A Read-Copy Update based parallel server for distributed crowd simulations. The Journal of Supercomputing (Apr. 2012).
