n4321: towards implementation and use of memory order … · 2015-07-14 · n4321: towards...

40
N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19 Reply to: Paul E. McKenney, Torvald Riegel, Jeff Preshing, Hans Boehm, Clark Nelson, and Olivier Giroux Email: [email protected], [email protected], jeff@preshing.com [email protected], [email protected], and [email protected] Other contributors: Alec Teal, David Howells, David Lang, George Spelvin, Jeff Law, Joseph S. Myers, Lawrence Crowl, Linus Torvalds, Mark Batty, Michael Matz, Peter Sewell, Peter Zijlstra, Ramana Radhakrishnan, Richard Biener, Will Deacon, Faisal Vali, ... July 13, 2015 This document is a revision of WG21/N4321, based on email discusssions and including yet another pro- posal in Section 7.9. This proposal has been fur- ther refined by discussions on various email reflectors. WG21/N4321 is itself a revision of WG21/N4215, based on feedback at the 2014 UIUC meeeting and on the various email reflectors. WG21/N4215 is in turn a revision of WG21/N4036, based on feedback at the 2014 Rapperswil meeting, at the 2014 Redmond SG1 meeting, and on the various email reflectors. A detailed change log appears starting on page 38. 1 Introduction The most obscure member of the C11 and C++11 memory order enum seems to be memory order consume [28]. The purpose of memory order consume is to allow reading threads to correctly tra- verse linked data structures without the need for locks, atomic instructions, or (with the exception of DEC Alpha) memory-fence instructions, even though new elements are being inserted into these linked structures before, during, and after the traversal. Without memory order consume, both the compiler and (again, in the case of DEC Alpha) the CPU would be within their rights to carry out aggres- sive data-speculation optimizations that would per- mit readers to see pre-initialization values in the newly added data elements. The purpose of memory order consume is to prevent these optimizations. Of course, memory order acquire may be used as a substitute for memory order consume, however do- ing so results in costly explicit memory-fence instruc- tions (or, where available, load-acquire instructions) on weakly ordered systems such as ARM, Itanium, and PowerPC [3, 9, 12, 13]. These systems enforce dependency ordering in hardware, in other words, if the address used by one memory-reference instruction depends on the value from a preceding load instruc- tion, the hardware forces that earlier load to com- plete before the later memory-reference instruction commences. 1 Similarly, if the data to be stored by a 1 But please note that hardware can and does take advan- 1

Upload: others

Post on 22-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

N4321 Towards Implementation and Use of

memory order consume

Doc No WG21N4321 (revised)Date 2015-04-19

Reply to Paul E McKenney Torvald Riegel Jeff PreshingHans Boehm Clark Nelson and Olivier Giroux

Email paulmcklinuxvnetibmcom triegelredhatcom jeffpreshingcomboehmacmorg clarknelsonintelcom and OGirouxnvidiacom

Other contributors Alec Teal David Howells David Lang George SpelvinJeff Law Joseph S Myers Lawrence Crowl Linus Torvalds Mark Batty

Michael Matz Peter Sewell Peter Zijlstra Ramana Radhakrishnan Richard BienerWill Deacon Faisal Vali

July 13 2015

This document is a revision of WG21N4321 basedon email discusssions and including yet another pro-posal in Section 79 This proposal has been fur-ther refined by discussions on various email reflectorsWG21N4321 is itself a revision of WG21N4215based on feedback at the 2014 UIUC meeeting and onthe various email reflectors WG21N4215 is in turna revision of WG21N4036 based on feedback at the2014 Rapperswil meeting at the 2014 Redmond SG1meeting and on the various email reflectors

A detailed change log appears starting on page 38

1 Introduction

The most obscure member of the C11 and C++11memory order enum seems to be memory order

consume [28] The purpose of memory order

consume is to allow reading threads to correctly tra-verse linked data structures without the need forlocks atomic instructions or (with the exception ofDEC Alpha) memory-fence instructions even though

new elements are being inserted into these linkedstructures before during and after the traversalWithout memory order consume both the compilerand (again in the case of DEC Alpha) the CPUwould be within their rights to carry out aggres-sive data-speculation optimizations that would per-mit readers to see pre-initialization values in thenewly added data elements The purpose of memoryorder consume is to prevent these optimizations

Of course memory order acquire may be used asa substitute for memory order consume however do-ing so results in costly explicit memory-fence instruc-tions (or where available load-acquire instructions)on weakly ordered systems such as ARM Itaniumand PowerPC [3 9 12 13] These systems enforcedependency ordering in hardware in other words ifthe address used by one memory-reference instructiondepends on the value from a preceding load instruc-tion the hardware forces that earlier load to com-plete before the later memory-reference instructioncommences1 Similarly if the data to be stored by a

1 But please note that hardware can and does take advan-

1

WG21N4321 2

given store instruction depends on the value from apreceding load instruction the hardware again forcesthat earlier load to complete before the later store in-struction commences Recent software tools for ARMand PowerPC can help explicate their memory mod-els [1 2 19] Note that strongly ordered systems likex86 IBM mainframe and SPARC TSO enforce de-pendency ordering as a side effect of the fact thatthey do not reorder loads with subsequent memoryreferences Therefore memory order consume is ben-eficial on hot code paths removing the need for hard-ware ordering instructions for weakly ordered systemsand permitting additional compiler optimizations onstrongly ordered systems

When implementing concurrent insertion-onlydata structures a few of which are found in the Linuxkernel memory order consume is all that is requiredHowever most data structures also require removalof data elements Such removal requires that thethread removing the data element wait for all read-ers to release their references to it before reclaim-ing that element The traditional way to do this isvia garbage collectors (GCs) which have been avail-able for more than half a century [15] and which arenow available even for C and C++ [4] Anotherway to wait for readers is to use read-copy update(RCU) [21 24] which explicitly marks read-side re-gions of code and provides primitives that wait forall pre-existing readers to complete RCU is gainingsignificant use both within the Linux kernel [16] andoutside of it [5 6 8 14 29]

Despite the growing number of memory order

consume use cases there are no known high-performance implementations of memory order

consume loads in any C11 or C++11 environmentsThis situation suggests that some change is in or-der After all if implementations do not supportthe standardrsquos memory order consume facility userscan be expected to continue to exploit whateverimplementation-specific facilities allow them to gettheir jobs done This document therefore providesa brief overview of RCU in Section 2 and surveysmemory order consume use cases within the Linuxkernel in Section 3 Section 4 looks at how depen-

tage of the as-if rule just as compilers do

dency ordering is currently supported in pre-C11 im-plementations and then Section 5 looks at possibleways to support those use cases in existing C11 andC++11 implementations followed by some thoughtson incremental paths towards official support of theseuse cases in the standards Section 6 lists some weak-nesses in the current C11 and C++11 specification ofdependency ordering and finally Section 7 outlines afew possible alternative dependency-ordering specifi-cations

Note SC22WG14 liason issue

2 Introduction to RCU

The RCU synchronization mechanism is often used asa replacement for reader-writer locking because RCUavoids the high-overhead cache thrashing that is char-acteristic of many common reader-writer-locking im-plementations RCU is based on three fundamentalconcepts

1 Light-weight in-memory publish-subscribe oper-ation

2 Operation that waits for pre-existing readers

3 Maintaining multiple versions of data to avoiddisrupting old readers that are still referencingold versions

These three concepts taken together allow readersand updaters to make forward progress concurrently

We would like to use C11rsquos and C++11rsquos memory

order consume to implement RCUrsquos lightweight sub-scribe operation rcu dereference() We assumethat rcu dereference() is a good example of howdevelopers would exploit the dependency-orderingfeature of weakly ordered systems so we look to rcu

dereference() as an indication of the semantics thatmemory order consume should have

In one typical RCU use case updaters publishnew versions of a data structure while readers con-currently subscribe to whatever version is currentat the time a given reader starts Once all pre-existing readers complete old versions can be re-claimed This sort of use case may be a bit unfa-miliar to many but it is extremely effective in many

WG21N4321 3

0

2000

4000

6000

8000

10000

12000 2

002

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

AP

I Use

s

Year

Figure 1 Growth of RCU Usage

situations offering excellent performance scalabilityreal-time latency deadlock avoidance and read-sidecomposability More details on RCU are readily avail-able [8 17 18 20 21 23 25]

Figure 1 shows the growth of RCU usage over timewithin the Linux kernel which is strong evidence ofRCUrsquos effectiveness However RCU is a specializedmechanism so its use is much smaller than general-purpose techniques such as locking as can be seen inFigure 2 It is unlikely that RCUrsquos usage will everapproach that of locking because RCU coordinatesonly between readers and updaters which meansthat some other mechanism is required to coordinateamong concurrent updates In the Linux kernel thatupdate-side mechanism is normally locking althoughpretty much any synchronization mechanism may beused including transactional memory [10 11 27]

However RCU is now being used in many situa-tions where reader-writer locking would be used Fig-ure 3 shows that the use of reader-writer locking haschanged little since RCU was introduced This datasuggests that RCU is at least as important to parallelsoftware as is reader-writer locking

0

20000

40000

60000

80000

100000

120000

140000

200

2

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

lock

ing

AP

I Use

sYear

locking

RCUrwlock

Figure 2 Growth of RCU Usage vs Locking

In more recent years a user-level library implemen-tation of RCU has been available [7] This library isnow available for many platforms and has been in-cluded in a number of Linux distributions It hasbeen pressed into service for a number of open-sourcesoftware projects proprietary products and researchefforts

Fully and fully performant C11C++11 supportfor memory order consume is therefore quite impor-tant However good progress can often be made inthe short term by focusing on the cases that are com-monly used in practice rather than on the generalcase The next section therefore takes a rough censusof the Linux kernelrsquos use of the rcu dereference()

family of primitives which memory order consume isintended to implement

3 Linux-Kernel Use Cases

Section 31 lists types of dependency chains in theLinux kernel Section 32 lists operators used withinthese dependency chains Section 33 lists operatorsthat are considered to terminate dependency chainsSection 34 lists operator that often act as the last link

WG21N4321 4

0

2000

4000

6000

8000

10000

12000 2

002

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

rw

lock

ing

AP

I Use

s

Year

RCU

rwlock

Figure 3 Growth of RCU Usage vs Reader-WriterLocking

in a dependency chain and finally Section 35 surveysa longer-than-average (but by no means maximal)dependency chain that appears in the Linux kernel

It is worth reviewing the relationship betweenmemory order acquire and memory order consume

loads both of which interact with memory order

release stores

A memory order release load is said to synchro-nize with a memory order acquire store if that loadreturns the value stored or in some special casessome later value [28 110p6-110p8] When a memory

order acquire load synchronizes with a memory

order release store any memory reference preced-ing the memory order acquire load will happen be-fore any memory reference following the memory

order release store [28 110p11-110p12] Thisproperty allows a linked structure to be locklessly tra-versed by using memory order release stores whenupdating pointers to reference new data elements andby using memory order acquire loads when loadingpointers while locklessly traversing the data struc-ture as shown in Figure 4

Unfortunately a memory order acquire load re-

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_acquire)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_acquire)20 21 return a22 23

24

Figure 4 ReleaseAcquire Linked Structure Traver-sal

quires expensive special load instructions or memory-fence instructions on weakly ordered systems suchas ARM Itanium and PowerPC Furthermore intraverse() the address of each memory order

acquire load within the while loop depends on thevalue of the previous memory order acquire load2

Therefore in this case most weakly ordered systemsdonrsquot really need the special load instructions or thememory-fence instructions as these systems can in-stead rely on the hardware-enforced dependency or-dering

This is the use case for memory order consumewhich can be substituted for memory order acquire

in cases where hardware dependency ordering appliesOne such case is the preceding example and Fig-ure 5 shows that same example recast in terms ofmemory order consume A memory order release

store is dependency ordered before a memory order

consume load when that load returns the value storedor in some special cases some later value [28 110p1]

2 The initial load on line 16 might well depend on an earlierload but for simplicity this example assumes that the initialfoo head structure is statically allocated and thus not subjectto updates

WG21N4321 5

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_consume)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_consume)20 21 return a22 23

24

Figure 5 ReleaseConsume Linked Structure Traver-sal

Then if the load carries a dependency to somelater memory reference [28 110p9] any memoryreference preceding the memory order release storewill happen before that later memory reference [28110p9-110p12] This means that when there is de-pendency ordering memory order consume gives thesame guarantees that memory order acquire doesbut at lower cost

On the other hand memory order consume re-quires the compiler to track the carries-a-dependencyrelationships with the set of such relationshipsheaded by a given memory order consume load be-ing called that loadrsquos dependency chains It is quitepossible that the complexity of implementing this ca-pability has thus far prevented high-quality memory

order consume implementations from appearing Itis therefore worthwhile to review use of dependencychains in practice in order to determine what typesof operations typically appear in dependency chainswhich might result in guidance to implementationsor perhaps even modifications to the definition ofmemory order consume

31 Types of Linux-Kernel Depen-dency Chains

One goal for memory order consume is to implementrcu dereference() which heads a Linux-kerneldependency-ordering tree There are a numberof variant of rcu dereference() in the Linuxkernel in order to implement the four flavors ofRCU and also to enable RCU usage diagnositicsfor code that is shared by readers and updatersThese additional variants are rcu dereference()rcu dereference bh() rcu dereference

bh check() rcu dereference bh check()rcu dereference check() rcu dereference

index check() rcu dereference protected()rcu dereference raw() rcu dereference

sched() rcu dereference sched check() srcu

dereference() and srcu dereference check()Taken together there are about 1300 uses of thesefunctions in version 313 of the Linux kernelHowever about 250 of those are rcu dereference

protected() which is used only in update-side codeand thus does not head up read-side dependencychains which leaves about 1000 uses to be inspectedfor dependency-ordering usage

32 Operators in Linux-Kernel De-pendency Chains

A surprisingly small fraction of the possible C opera-tors appear in dependency chains in the Linux kernelnamely -gt infix = casts prefix amp prefix [] infix+ infix - ternary and infix (bitwise) amp

By far the two most common operators are the-gt pointer field selector and the = assignment oper-ator Enabling the carries-dependency relationshipthrough only these two operators would likely coverbetter than 90 of the Linux-kernel use cases

Casts the prefix indirection operator and theprefix amp address-of operator are used to implementLinuxrsquos list primitives which translate from listpointers embedded in a structure to the structure it-self These operators are also used to get some of theeffects of C++ subtyping in the C language

The [] array-indexing operator and the infix +

and - arithmetic operators are used to manipulate

WG21N4321 6

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = rcu_dereference(fp)12 return p p-gta default_fooa13

Figure 6 Default Value For RCU-Protected PointerLinux Kernel

1 class foo 2 int a3 4 stdatomicltfoo gt fp5 foo default_foo67 int bar(void)8 9 stdatomicltfoo gt p

1011 p = fpload_explicit(memory_order_consume)12 return p kill_dependency(p-gta) default_fooa13

Figure 7 Default Value For RCU-Protected PointerC++11

RCU-protected arrays as well as to index into arrayscontained within RCU-protected structures RCU-protected arrays are becoming less common becausethey are being converted into more complex datastructures such as trees However RCU-protectedstructures containing arrays are still fairly common

The ternary if-then-else expression is used tohandle default values for RCU-protected pointers forexample as shown in Figure 6 or in C++11 formin Figure 7 Note that the dependency is carriedonly through the rightmost two operands of neverthrough the leftmost one

The infix amp operator is used to mask low-order bitsfrom RCU pointers These bits are used by somealgorithms as markers Such markers though notcommon in the Linux kernel are well-known in theart with hazard pointers being but one example [26]This operator is also sometimes used to locate thebeginning of an aligned structure for example if p

references a field within a data structure that is 4096bytes in size (or smaller) and that is also aligned to a4096-byte boundary then p amp ~0xfff will with theaddition of appropriate casting produce a pointer tothe beginning of the structure Note that it is ex-pected that both operands of infix amp are expected tohave some non-zero bits because otherwise a NULL

pointer will result and NULL pointers cannot reason-ably be said to carry much of anything let alone adependency

Although I did not find any infix | operators inmy census of Linux-kernel dependency chains sym-metry considerations argue for also including it forexample for read-side pointer tagging or for anotherexample locating the beginning of the next in an ar-ray of aligned structures Presumably both of theoperands of infix | must have at least one zero bit

To recap the operators appearing in Linux-kerneldependency chains are -gt infix = casts prefix ampprefix [] infix + infix - ternary infix (bitwise)amp and probably also |

33 Operators Terminating Linux-Kernel Dependency Chains

Although C++11 has the kill dependency() func-tion to terminate a dependency chain no such func-tion exists in the Linux kernel Instead Linux-kerneldependency chains are judged to have terminatedupon exit from the outermost RCU read-side criticalsection3 when existence guarantees are handed offfrom RCU to some other synchronization mechanism(usually locking or reference counting) or when thevariable carrying the dependency goes out of scope

That said it is possible to analyze Linux-kerneldependency chains to see what part of the chain isactually required by the algorithm in question Wecan therefore define the essential subset of a depen-dency chain to be that subset within which ordering

3 The beginning of a given RCU read-side critical section ismarked with rcu read lock() rcu read lock bh() rcu read

lock sched() or srcu read lock() and the end by the cor-responding primitive from the list rcu read unlock() rcu

read unlock bh() rcu read unlock sched() or srcu read

unlock() There is currently no C++11 counterpart for anRCU read-side critical section

WG21N4321 7

is required by the algorithm In the 313 version ofthe Linux kernel the following operators always markthe end of the essential subset of a dependency chain() == = ampamp || infix and

The postfix () function-invocation operator is aninteresting special case in the Linux kernel In theoryRCU could be used to protect JITed function bodiesbut in current practice RCU is instead used to waitfor all pre-existing callers to the function referencedby the previous pointer The functions are all com-piled into the kernel and the dependency chains aretherefore irrelevant to the () operator Hence in ver-sion 313 of the Linux kernel the () operator marksthe end of the essential subset of any dependencychain that it resides in

The == = ampamp and || operators are used ex-clusively in rdquoifrdquo statements to make control-flow de-cisions and therefore also mark the end of the essen-tial subset of any dependency chains that they residein In theory these relational and boolean operatorscould be used to form array indexes but in practicethe Linux kernel does not yet do this in RCU depen-dency chains The other relational operators (gt ltgt= and lt=) should probably also be added to thislist

The infix and arithmetic operators couldpotentially be used for construct array addresses butthey are not yet used that way in the Linux kernelInstead they are used to do computation on valuesfetched as the last operation in an essential subset ofa dependency chain

In short in the current Linux kernel () === ampamp || infix and all mark the end of theessential subset of a dependency chain That saidthere is potential for them to be used as part of theessential subset of dependency changes in future ver-sions of the Linux kernel And the same is of coursetrue of the remaining C-language operators whichdid not appear within any of the dependency chainsin version 313 of the Linux kernel

34 Operators Acting as Last Link inLinux-Kernel Dependency Chains

Although the -gt operator is frequently used as part ofa Linux-kernel dependency chain it often is intended

to be the last link in that chain Therefore the usescases for the -gt operator deserve special mention

The first use case involves fetching non-pointerdata from an RCU-protected data structure Forexample in the DRDB subsystem in Linux -gt isused to fetch a timeout value This code requiresthat dependency ordering apply to this fetch but itdoes not require a dependency chain extending be-yond that point This sort of case would requirea kill dependency() for implementations based onthe C++11 and C11 standards

The second use case involves linked data struc-tures where an RCU update might be applied onany pointer in the chain for example the stan-dard Linux-kernel linked list The -gt operator pro-vides dependency ordering for the fetch of the -gtnextpointer but that fetch must itself be a memory order

consume load in order to provide the required depen-dency ordering for the fields in the next structurein the list Thus a linked-list traversal consists ofa series of back-to-back non-overlapping dependencychains

These two use cases raise the question of whethera dependency chain can continue beyond a -gt oper-ator The answer is ldquoyesrdquo and this occurs when alinked structure is made visible to RCU readers as aunit For exapmle consider a linked list where eachlist element links to a constant binary search treeIf this tree is in place when the element is added tothe list then a memory order consume load is neededonly when fetching the pointer to the element Thedependency chain headed by this fetch suffices to or-der accesses to the binary search tree

These cases need to be differentiated The thirduse case appears to be the least frequent which sug-gests that the -gt operator (or a sequence of -gt oper-ators) always be the last link of a dependency chain

35 Linux-Kernel Dependency ChainLength

Many Linux-kernel dependency chains are very shortand contained with a fair number living within theconfines of a single C statement If there were onlya few short dependency chains in the Linux kernelone could imagine decorating all the operators in each

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 2: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 2

given store instruction depends on the value from apreceding load instruction the hardware again forcesthat earlier load to complete before the later store in-struction commences Recent software tools for ARMand PowerPC can help explicate their memory mod-els [1 2 19] Note that strongly ordered systems likex86 IBM mainframe and SPARC TSO enforce de-pendency ordering as a side effect of the fact thatthey do not reorder loads with subsequent memoryreferences Therefore memory order consume is ben-eficial on hot code paths removing the need for hard-ware ordering instructions for weakly ordered systemsand permitting additional compiler optimizations onstrongly ordered systems

When implementing concurrent insertion-onlydata structures a few of which are found in the Linuxkernel memory order consume is all that is requiredHowever most data structures also require removalof data elements Such removal requires that thethread removing the data element wait for all read-ers to release their references to it before reclaim-ing that element The traditional way to do this isvia garbage collectors (GCs) which have been avail-able for more than half a century [15] and which arenow available even for C and C++ [4] Anotherway to wait for readers is to use read-copy update(RCU) [21 24] which explicitly marks read-side re-gions of code and provides primitives that wait forall pre-existing readers to complete RCU is gainingsignificant use both within the Linux kernel [16] andoutside of it [5 6 8 14 29]

Despite the growing number of memory order

consume use cases there are no known high-performance implementations of memory order

consume loads in any C11 or C++11 environmentsThis situation suggests that some change is in or-der After all if implementations do not supportthe standardrsquos memory order consume facility userscan be expected to continue to exploit whateverimplementation-specific facilities allow them to gettheir jobs done This document therefore providesa brief overview of RCU in Section 2 and surveysmemory order consume use cases within the Linuxkernel in Section 3 Section 4 looks at how depen-

tage of the as-if rule just as compilers do

dency ordering is currently supported in pre-C11 im-plementations and then Section 5 looks at possibleways to support those use cases in existing C11 andC++11 implementations followed by some thoughtson incremental paths towards official support of theseuse cases in the standards Section 6 lists some weak-nesses in the current C11 and C++11 specification ofdependency ordering and finally Section 7 outlines afew possible alternative dependency-ordering specifi-cations

Note SC22WG14 liason issue

2 Introduction to RCU

The RCU synchronization mechanism is often used asa replacement for reader-writer locking because RCUavoids the high-overhead cache thrashing that is char-acteristic of many common reader-writer-locking im-plementations RCU is based on three fundamentalconcepts

1 Light-weight in-memory publish-subscribe oper-ation

2 Operation that waits for pre-existing readers

3 Maintaining multiple versions of data to avoiddisrupting old readers that are still referencingold versions

These three concepts taken together allow readersand updaters to make forward progress concurrently

We would like to use C11rsquos and C++11rsquos memory

order consume to implement RCUrsquos lightweight sub-scribe operation rcu dereference() We assumethat rcu dereference() is a good example of howdevelopers would exploit the dependency-orderingfeature of weakly ordered systems so we look to rcu

dereference() as an indication of the semantics thatmemory order consume should have

In one typical RCU use case updaters publishnew versions of a data structure while readers con-currently subscribe to whatever version is currentat the time a given reader starts Once all pre-existing readers complete old versions can be re-claimed This sort of use case may be a bit unfa-miliar to many but it is extremely effective in many

WG21N4321 3

0

2000

4000

6000

8000

10000

12000 2

002

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

AP

I Use

s

Year

Figure 1 Growth of RCU Usage

situations offering excellent performance scalabilityreal-time latency deadlock avoidance and read-sidecomposability More details on RCU are readily avail-able [8 17 18 20 21 23 25]

Figure 1 shows the growth of RCU usage over timewithin the Linux kernel which is strong evidence ofRCUrsquos effectiveness However RCU is a specializedmechanism so its use is much smaller than general-purpose techniques such as locking as can be seen inFigure 2 It is unlikely that RCUrsquos usage will everapproach that of locking because RCU coordinatesonly between readers and updaters which meansthat some other mechanism is required to coordinateamong concurrent updates In the Linux kernel thatupdate-side mechanism is normally locking althoughpretty much any synchronization mechanism may beused including transactional memory [10 11 27]

However RCU is now being used in many situa-tions where reader-writer locking would be used Fig-ure 3 shows that the use of reader-writer locking haschanged little since RCU was introduced This datasuggests that RCU is at least as important to parallelsoftware as is reader-writer locking

0

20000

40000

60000

80000

100000

120000

140000

200

2

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

lock

ing

AP

I Use

sYear

locking

RCUrwlock

Figure 2 Growth of RCU Usage vs Locking

In more recent years a user-level library implemen-tation of RCU has been available [7] This library isnow available for many platforms and has been in-cluded in a number of Linux distributions It hasbeen pressed into service for a number of open-sourcesoftware projects proprietary products and researchefforts

Fully and fully performant C11C++11 supportfor memory order consume is therefore quite impor-tant However good progress can often be made inthe short term by focusing on the cases that are com-monly used in practice rather than on the generalcase The next section therefore takes a rough censusof the Linux kernelrsquos use of the rcu dereference()

family of primitives which memory order consume isintended to implement

3 Linux-Kernel Use Cases

Section 31 lists types of dependency chains in theLinux kernel Section 32 lists operators used withinthese dependency chains Section 33 lists operatorsthat are considered to terminate dependency chainsSection 34 lists operator that often act as the last link

WG21N4321 4

0

2000

4000

6000

8000

10000

12000 2

002

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

rw

lock

ing

AP

I Use

s

Year

RCU

rwlock

Figure 3 Growth of RCU Usage vs Reader-WriterLocking

in a dependency chain and finally Section 35 surveysa longer-than-average (but by no means maximal)dependency chain that appears in the Linux kernel

It is worth reviewing the relationship betweenmemory order acquire and memory order consume

loads both of which interact with memory order

release stores

A memory order release load is said to synchro-nize with a memory order acquire store if that loadreturns the value stored or in some special casessome later value [28 110p6-110p8] When a memory

order acquire load synchronizes with a memory

order release store any memory reference preced-ing the memory order acquire load will happen be-fore any memory reference following the memory

order release store [28 110p11-110p12] Thisproperty allows a linked structure to be locklessly tra-versed by using memory order release stores whenupdating pointers to reference new data elements andby using memory order acquire loads when loadingpointers while locklessly traversing the data struc-ture as shown in Figure 4

Unfortunately a memory order acquire load re-

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_acquire)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_acquire)20 21 return a22 23

24

Figure 4 ReleaseAcquire Linked Structure Traver-sal

quires expensive special load instructions or memory-fence instructions on weakly ordered systems suchas ARM Itanium and PowerPC Furthermore intraverse() the address of each memory order

acquire load within the while loop depends on thevalue of the previous memory order acquire load2

Therefore in this case most weakly ordered systemsdonrsquot really need the special load instructions or thememory-fence instructions as these systems can in-stead rely on the hardware-enforced dependency or-dering

This is the use case for memory order consumewhich can be substituted for memory order acquire

in cases where hardware dependency ordering appliesOne such case is the preceding example and Fig-ure 5 shows that same example recast in terms ofmemory order consume A memory order release

store is dependency ordered before a memory order

consume load when that load returns the value storedor in some special cases some later value [28 110p1]

2 The initial load on line 16 might well depend on an earlierload but for simplicity this example assumes that the initialfoo head structure is statically allocated and thus not subjectto updates

WG21N4321 5

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_consume)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_consume)20 21 return a22 23

24

Figure 5 ReleaseConsume Linked Structure Traver-sal

Then if the load carries a dependency to somelater memory reference [28 110p9] any memoryreference preceding the memory order release storewill happen before that later memory reference [28110p9-110p12] This means that when there is de-pendency ordering memory order consume gives thesame guarantees that memory order acquire doesbut at lower cost

On the other hand memory order consume re-quires the compiler to track the carries-a-dependencyrelationships with the set of such relationshipsheaded by a given memory order consume load be-ing called that loadrsquos dependency chains It is quitepossible that the complexity of implementing this ca-pability has thus far prevented high-quality memory

order consume implementations from appearing Itis therefore worthwhile to review use of dependencychains in practice in order to determine what typesof operations typically appear in dependency chainswhich might result in guidance to implementationsor perhaps even modifications to the definition ofmemory order consume

31 Types of Linux-Kernel Depen-dency Chains

One goal for memory order consume is to implementrcu dereference() which heads a Linux-kerneldependency-ordering tree There are a numberof variant of rcu dereference() in the Linuxkernel in order to implement the four flavors ofRCU and also to enable RCU usage diagnositicsfor code that is shared by readers and updatersThese additional variants are rcu dereference()rcu dereference bh() rcu dereference

bh check() rcu dereference bh check()rcu dereference check() rcu dereference

index check() rcu dereference protected()rcu dereference raw() rcu dereference

sched() rcu dereference sched check() srcu

dereference() and srcu dereference check()Taken together there are about 1300 uses of thesefunctions in version 313 of the Linux kernelHowever about 250 of those are rcu dereference

protected() which is used only in update-side codeand thus does not head up read-side dependencychains which leaves about 1000 uses to be inspectedfor dependency-ordering usage

32 Operators in Linux-Kernel De-pendency Chains

A surprisingly small fraction of the possible C opera-tors appear in dependency chains in the Linux kernelnamely -gt infix = casts prefix amp prefix [] infix+ infix - ternary and infix (bitwise) amp

By far the two most common operators are the-gt pointer field selector and the = assignment oper-ator Enabling the carries-dependency relationshipthrough only these two operators would likely coverbetter than 90 of the Linux-kernel use cases

Casts the prefix indirection operator and theprefix amp address-of operator are used to implementLinuxrsquos list primitives which translate from listpointers embedded in a structure to the structure it-self These operators are also used to get some of theeffects of C++ subtyping in the C language

The [] array-indexing operator and the infix +

and - arithmetic operators are used to manipulate

WG21N4321 6

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = rcu_dereference(fp)12 return p p-gta default_fooa13

Figure 6 Default Value For RCU-Protected PointerLinux Kernel

1 class foo 2 int a3 4 stdatomicltfoo gt fp5 foo default_foo67 int bar(void)8 9 stdatomicltfoo gt p

1011 p = fpload_explicit(memory_order_consume)12 return p kill_dependency(p-gta) default_fooa13

Figure 7 Default Value For RCU-Protected PointerC++11

RCU-protected arrays as well as to index into arrayscontained within RCU-protected structures RCU-protected arrays are becoming less common becausethey are being converted into more complex datastructures such as trees However RCU-protectedstructures containing arrays are still fairly common

The ternary if-then-else expression is used tohandle default values for RCU-protected pointers forexample as shown in Figure 6 or in C++11 formin Figure 7 Note that the dependency is carriedonly through the rightmost two operands of neverthrough the leftmost one

The infix amp operator is used to mask low-order bitsfrom RCU pointers These bits are used by somealgorithms as markers Such markers though notcommon in the Linux kernel are well-known in theart with hazard pointers being but one example [26]This operator is also sometimes used to locate thebeginning of an aligned structure for example if p

references a field within a data structure that is 4096bytes in size (or smaller) and that is also aligned to a4096-byte boundary then p amp ~0xfff will with theaddition of appropriate casting produce a pointer tothe beginning of the structure Note that it is ex-pected that both operands of infix amp are expected tohave some non-zero bits because otherwise a NULL

pointer will result and NULL pointers cannot reason-ably be said to carry much of anything let alone adependency

Although I did not find any infix | operators inmy census of Linux-kernel dependency chains sym-metry considerations argue for also including it forexample for read-side pointer tagging or for anotherexample locating the beginning of the next in an ar-ray of aligned structures Presumably both of theoperands of infix | must have at least one zero bit

To recap the operators appearing in Linux-kerneldependency chains are -gt infix = casts prefix ampprefix [] infix + infix - ternary infix (bitwise)amp and probably also |

33 Operators Terminating Linux-Kernel Dependency Chains

Although C++11 has the kill dependency() func-tion to terminate a dependency chain no such func-tion exists in the Linux kernel Instead Linux-kerneldependency chains are judged to have terminatedupon exit from the outermost RCU read-side criticalsection3 when existence guarantees are handed offfrom RCU to some other synchronization mechanism(usually locking or reference counting) or when thevariable carrying the dependency goes out of scope

That said it is possible to analyze Linux-kerneldependency chains to see what part of the chain isactually required by the algorithm in question Wecan therefore define the essential subset of a depen-dency chain to be that subset within which ordering

3 The beginning of a given RCU read-side critical section ismarked with rcu read lock() rcu read lock bh() rcu read

lock sched() or srcu read lock() and the end by the cor-responding primitive from the list rcu read unlock() rcu

read unlock bh() rcu read unlock sched() or srcu read

unlock() There is currently no C++11 counterpart for anRCU read-side critical section

WG21N4321 7

is required by the algorithm In the 313 version ofthe Linux kernel the following operators always markthe end of the essential subset of a dependency chain() == = ampamp || infix and

The postfix () function-invocation operator is aninteresting special case in the Linux kernel In theoryRCU could be used to protect JITed function bodiesbut in current practice RCU is instead used to waitfor all pre-existing callers to the function referencedby the previous pointer The functions are all com-piled into the kernel and the dependency chains aretherefore irrelevant to the () operator Hence in ver-sion 313 of the Linux kernel the () operator marksthe end of the essential subset of any dependencychain that it resides in

The == = ampamp and || operators are used ex-clusively in rdquoifrdquo statements to make control-flow de-cisions and therefore also mark the end of the essen-tial subset of any dependency chains that they residein In theory these relational and boolean operatorscould be used to form array indexes but in practicethe Linux kernel does not yet do this in RCU depen-dency chains The other relational operators (gt ltgt= and lt=) should probably also be added to thislist

The infix and arithmetic operators couldpotentially be used for construct array addresses butthey are not yet used that way in the Linux kernelInstead they are used to do computation on valuesfetched as the last operation in an essential subset ofa dependency chain

In short in the current Linux kernel () === ampamp || infix and all mark the end of theessential subset of a dependency chain That saidthere is potential for them to be used as part of theessential subset of dependency changes in future ver-sions of the Linux kernel And the same is of coursetrue of the remaining C-language operators whichdid not appear within any of the dependency chainsin version 313 of the Linux kernel

34 Operators Acting as Last Link inLinux-Kernel Dependency Chains

Although the -gt operator is frequently used as part ofa Linux-kernel dependency chain it often is intended

to be the last link in that chain Therefore the usescases for the -gt operator deserve special mention

The first use case involves fetching non-pointerdata from an RCU-protected data structure Forexample in the DRDB subsystem in Linux -gt isused to fetch a timeout value This code requiresthat dependency ordering apply to this fetch but itdoes not require a dependency chain extending be-yond that point This sort of case would requirea kill dependency() for implementations based onthe C++11 and C11 standards

The second use case involves linked data struc-tures where an RCU update might be applied onany pointer in the chain for example the stan-dard Linux-kernel linked list The -gt operator pro-vides dependency ordering for the fetch of the -gtnextpointer but that fetch must itself be a memory order

consume load in order to provide the required depen-dency ordering for the fields in the next structurein the list Thus a linked-list traversal consists ofa series of back-to-back non-overlapping dependencychains

These two use cases raise the question of whethera dependency chain can continue beyond a -gt oper-ator The answer is ldquoyesrdquo and this occurs when alinked structure is made visible to RCU readers as aunit For exapmle consider a linked list where eachlist element links to a constant binary search treeIf this tree is in place when the element is added tothe list then a memory order consume load is neededonly when fetching the pointer to the element Thedependency chain headed by this fetch suffices to or-der accesses to the binary search tree

These cases need to be differentiated The thirduse case appears to be the least frequent which sug-gests that the -gt operator (or a sequence of -gt oper-ators) always be the last link of a dependency chain

35 Linux-Kernel Dependency ChainLength

Many Linux-kernel dependency chains are very shortand contained with a fair number living within theconfines of a single C statement If there were onlya few short dependency chains in the Linux kernelone could imagine decorating all the operators in each

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 3: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 3

0

2000

4000

6000

8000

10000

12000 2

002

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

AP

I Use

s

Year

Figure 1 Growth of RCU Usage

situations offering excellent performance scalabilityreal-time latency deadlock avoidance and read-sidecomposability More details on RCU are readily avail-able [8 17 18 20 21 23 25]

Figure 1 shows the growth of RCU usage over timewithin the Linux kernel which is strong evidence ofRCUrsquos effectiveness However RCU is a specializedmechanism so its use is much smaller than general-purpose techniques such as locking as can be seen inFigure 2 It is unlikely that RCUrsquos usage will everapproach that of locking because RCU coordinatesonly between readers and updaters which meansthat some other mechanism is required to coordinateamong concurrent updates In the Linux kernel thatupdate-side mechanism is normally locking althoughpretty much any synchronization mechanism may beused including transactional memory [10 11 27]

However RCU is now being used in many situa-tions where reader-writer locking would be used Fig-ure 3 shows that the use of reader-writer locking haschanged little since RCU was introduced This datasuggests that RCU is at least as important to parallelsoftware as is reader-writer locking

0

20000

40000

60000

80000

100000

120000

140000

200

2

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

lock

ing

AP

I Use

sYear

locking

RCUrwlock

Figure 2 Growth of RCU Usage vs Locking

In more recent years a user-level library implemen-tation of RCU has been available [7] This library isnow available for many platforms and has been in-cluded in a number of Linux distributions It hasbeen pressed into service for a number of open-sourcesoftware projects proprietary products and researchefforts

Fully and fully performant C11C++11 supportfor memory order consume is therefore quite impor-tant However good progress can often be made inthe short term by focusing on the cases that are com-monly used in practice rather than on the generalcase The next section therefore takes a rough censusof the Linux kernelrsquos use of the rcu dereference()

family of primitives which memory order consume isintended to implement

3 Linux-Kernel Use Cases

Section 31 lists types of dependency chains in theLinux kernel Section 32 lists operators used withinthese dependency chains Section 33 lists operatorsthat are considered to terminate dependency chainsSection 34 lists operator that often act as the last link

WG21N4321 4

0

2000

4000

6000

8000

10000

12000 2

002

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

rw

lock

ing

AP

I Use

s

Year

RCU

rwlock

Figure 3 Growth of RCU Usage vs Reader-WriterLocking

in a dependency chain and finally Section 35 surveysa longer-than-average (but by no means maximal)dependency chain that appears in the Linux kernel

It is worth reviewing the relationship betweenmemory order acquire and memory order consume

loads both of which interact with memory order

release stores

A memory order release load is said to synchro-nize with a memory order acquire store if that loadreturns the value stored or in some special casessome later value [28 110p6-110p8] When a memory

order acquire load synchronizes with a memory

order release store any memory reference preced-ing the memory order acquire load will happen be-fore any memory reference following the memory

order release store [28 110p11-110p12] Thisproperty allows a linked structure to be locklessly tra-versed by using memory order release stores whenupdating pointers to reference new data elements andby using memory order acquire loads when loadingpointers while locklessly traversing the data struc-ture as shown in Figure 4

Unfortunately a memory order acquire load re-

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_acquire)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_acquire)20 21 return a22 23

24

Figure 4 ReleaseAcquire Linked Structure Traver-sal

quires expensive special load instructions or memory-fence instructions on weakly ordered systems suchas ARM Itanium and PowerPC Furthermore intraverse() the address of each memory order

acquire load within the while loop depends on thevalue of the previous memory order acquire load2

Therefore in this case most weakly ordered systemsdonrsquot really need the special load instructions or thememory-fence instructions as these systems can in-stead rely on the hardware-enforced dependency or-dering

This is the use case for memory order consumewhich can be substituted for memory order acquire

in cases where hardware dependency ordering appliesOne such case is the preceding example and Fig-ure 5 shows that same example recast in terms ofmemory order consume A memory order release

store is dependency ordered before a memory order

consume load when that load returns the value storedor in some special cases some later value [28 110p1]

2 The initial load on line 16 might well depend on an earlierload but for simplicity this example assumes that the initialfoo head structure is statically allocated and thus not subjectto updates

WG21N4321 5

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_consume)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_consume)20 21 return a22 23

24

Figure 5 ReleaseConsume Linked Structure Traver-sal

Then if the load carries a dependency to somelater memory reference [28 110p9] any memoryreference preceding the memory order release storewill happen before that later memory reference [28110p9-110p12] This means that when there is de-pendency ordering memory order consume gives thesame guarantees that memory order acquire doesbut at lower cost

On the other hand memory order consume re-quires the compiler to track the carries-a-dependencyrelationships with the set of such relationshipsheaded by a given memory order consume load be-ing called that loadrsquos dependency chains It is quitepossible that the complexity of implementing this ca-pability has thus far prevented high-quality memory

order consume implementations from appearing Itis therefore worthwhile to review use of dependencychains in practice in order to determine what typesof operations typically appear in dependency chainswhich might result in guidance to implementationsor perhaps even modifications to the definition ofmemory order consume

31 Types of Linux-Kernel Depen-dency Chains

One goal for memory order consume is to implementrcu dereference() which heads a Linux-kerneldependency-ordering tree There are a numberof variant of rcu dereference() in the Linuxkernel in order to implement the four flavors ofRCU and also to enable RCU usage diagnositicsfor code that is shared by readers and updatersThese additional variants are rcu dereference()rcu dereference bh() rcu dereference

bh check() rcu dereference bh check()rcu dereference check() rcu dereference

index check() rcu dereference protected()rcu dereference raw() rcu dereference

sched() rcu dereference sched check() srcu

dereference() and srcu dereference check()Taken together there are about 1300 uses of thesefunctions in version 313 of the Linux kernelHowever about 250 of those are rcu dereference

protected() which is used only in update-side codeand thus does not head up read-side dependencychains which leaves about 1000 uses to be inspectedfor dependency-ordering usage

32 Operators in Linux-Kernel De-pendency Chains

A surprisingly small fraction of the possible C opera-tors appear in dependency chains in the Linux kernelnamely -gt infix = casts prefix amp prefix [] infix+ infix - ternary and infix (bitwise) amp

By far the two most common operators are the-gt pointer field selector and the = assignment oper-ator Enabling the carries-dependency relationshipthrough only these two operators would likely coverbetter than 90 of the Linux-kernel use cases

Casts the prefix indirection operator and theprefix amp address-of operator are used to implementLinuxrsquos list primitives which translate from listpointers embedded in a structure to the structure it-self These operators are also used to get some of theeffects of C++ subtyping in the C language

The [] array-indexing operator and the infix +

and - arithmetic operators are used to manipulate

WG21N4321 6

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = rcu_dereference(fp)12 return p p-gta default_fooa13

Figure 6 Default Value For RCU-Protected PointerLinux Kernel

1 class foo 2 int a3 4 stdatomicltfoo gt fp5 foo default_foo67 int bar(void)8 9 stdatomicltfoo gt p

1011 p = fpload_explicit(memory_order_consume)12 return p kill_dependency(p-gta) default_fooa13

Figure 7 Default Value For RCU-Protected PointerC++11

RCU-protected arrays as well as to index into arrayscontained within RCU-protected structures RCU-protected arrays are becoming less common becausethey are being converted into more complex datastructures such as trees However RCU-protectedstructures containing arrays are still fairly common

The ternary if-then-else expression is used tohandle default values for RCU-protected pointers forexample as shown in Figure 6 or in C++11 formin Figure 7 Note that the dependency is carriedonly through the rightmost two operands of neverthrough the leftmost one

The infix amp operator is used to mask low-order bitsfrom RCU pointers These bits are used by somealgorithms as markers Such markers though notcommon in the Linux kernel are well-known in theart with hazard pointers being but one example [26]This operator is also sometimes used to locate thebeginning of an aligned structure for example if p

references a field within a data structure that is 4096bytes in size (or smaller) and that is also aligned to a4096-byte boundary then p amp ~0xfff will with theaddition of appropriate casting produce a pointer tothe beginning of the structure Note that it is ex-pected that both operands of infix amp are expected tohave some non-zero bits because otherwise a NULL

pointer will result and NULL pointers cannot reason-ably be said to carry much of anything let alone adependency

Although I did not find any infix | operators inmy census of Linux-kernel dependency chains sym-metry considerations argue for also including it forexample for read-side pointer tagging or for anotherexample locating the beginning of the next in an ar-ray of aligned structures Presumably both of theoperands of infix | must have at least one zero bit

To recap the operators appearing in Linux-kerneldependency chains are -gt infix = casts prefix ampprefix [] infix + infix - ternary infix (bitwise)amp and probably also |

33 Operators Terminating Linux-Kernel Dependency Chains

Although C++11 has the kill dependency() func-tion to terminate a dependency chain no such func-tion exists in the Linux kernel Instead Linux-kerneldependency chains are judged to have terminatedupon exit from the outermost RCU read-side criticalsection3 when existence guarantees are handed offfrom RCU to some other synchronization mechanism(usually locking or reference counting) or when thevariable carrying the dependency goes out of scope

That said it is possible to analyze Linux-kerneldependency chains to see what part of the chain isactually required by the algorithm in question Wecan therefore define the essential subset of a depen-dency chain to be that subset within which ordering

3 The beginning of a given RCU read-side critical section ismarked with rcu read lock() rcu read lock bh() rcu read

lock sched() or srcu read lock() and the end by the cor-responding primitive from the list rcu read unlock() rcu

read unlock bh() rcu read unlock sched() or srcu read

unlock() There is currently no C++11 counterpart for anRCU read-side critical section

WG21N4321 7

is required by the algorithm In the 313 version ofthe Linux kernel the following operators always markthe end of the essential subset of a dependency chain() == = ampamp || infix and

The postfix () function-invocation operator is aninteresting special case in the Linux kernel In theoryRCU could be used to protect JITed function bodiesbut in current practice RCU is instead used to waitfor all pre-existing callers to the function referencedby the previous pointer The functions are all com-piled into the kernel and the dependency chains aretherefore irrelevant to the () operator Hence in ver-sion 313 of the Linux kernel the () operator marksthe end of the essential subset of any dependencychain that it resides in

The == = ampamp and || operators are used ex-clusively in rdquoifrdquo statements to make control-flow de-cisions and therefore also mark the end of the essen-tial subset of any dependency chains that they residein In theory these relational and boolean operatorscould be used to form array indexes but in practicethe Linux kernel does not yet do this in RCU depen-dency chains The other relational operators (gt ltgt= and lt=) should probably also be added to thislist

The infix and arithmetic operators couldpotentially be used for construct array addresses butthey are not yet used that way in the Linux kernelInstead they are used to do computation on valuesfetched as the last operation in an essential subset ofa dependency chain

In short in the current Linux kernel () === ampamp || infix and all mark the end of theessential subset of a dependency chain That saidthere is potential for them to be used as part of theessential subset of dependency changes in future ver-sions of the Linux kernel And the same is of coursetrue of the remaining C-language operators whichdid not appear within any of the dependency chainsin version 313 of the Linux kernel

34 Operators Acting as Last Link inLinux-Kernel Dependency Chains

Although the -gt operator is frequently used as part ofa Linux-kernel dependency chain it often is intended

to be the last link in that chain Therefore the usescases for the -gt operator deserve special mention

The first use case involves fetching non-pointerdata from an RCU-protected data structure Forexample in the DRDB subsystem in Linux -gt isused to fetch a timeout value This code requiresthat dependency ordering apply to this fetch but itdoes not require a dependency chain extending be-yond that point This sort of case would requirea kill dependency() for implementations based onthe C++11 and C11 standards

The second use case involves linked data struc-tures where an RCU update might be applied onany pointer in the chain for example the stan-dard Linux-kernel linked list The -gt operator pro-vides dependency ordering for the fetch of the -gtnextpointer but that fetch must itself be a memory order

consume load in order to provide the required depen-dency ordering for the fields in the next structurein the list Thus a linked-list traversal consists ofa series of back-to-back non-overlapping dependencychains

These two use cases raise the question of whethera dependency chain can continue beyond a -gt oper-ator The answer is ldquoyesrdquo and this occurs when alinked structure is made visible to RCU readers as aunit For exapmle consider a linked list where eachlist element links to a constant binary search treeIf this tree is in place when the element is added tothe list then a memory order consume load is neededonly when fetching the pointer to the element Thedependency chain headed by this fetch suffices to or-der accesses to the binary search tree

These cases need to be differentiated The thirduse case appears to be the least frequent which sug-gests that the -gt operator (or a sequence of -gt oper-ators) always be the last link of a dependency chain

35 Linux-Kernel Dependency ChainLength

Many Linux-kernel dependency chains are very shortand contained with a fair number living within theconfines of a single C statement If there were onlya few short dependency chains in the Linux kernelone could imagine decorating all the operators in each

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 4: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 4

0

2000

4000

6000

8000

10000

12000 2

002

200

4

200

6

200

8

201

0

201

2

201

4

201

6

R

CU

rw

lock

ing

AP

I Use

s

Year

RCU

rwlock

Figure 3 Growth of RCU Usage vs Reader-WriterLocking

in a dependency chain and finally Section 35 surveysa longer-than-average (but by no means maximal)dependency chain that appears in the Linux kernel

It is worth reviewing the relationship betweenmemory order acquire and memory order consume

loads both of which interact with memory order

release stores

A memory order release load is said to synchro-nize with a memory order acquire store if that loadreturns the value stored or in some special casessome later value [28 110p6-110p8] When a memory

order acquire load synchronizes with a memory

order release store any memory reference preced-ing the memory order acquire load will happen be-fore any memory reference following the memory

order release store [28 110p11-110p12] Thisproperty allows a linked structure to be locklessly tra-versed by using memory order release stores whenupdating pointers to reference new data elements andby using memory order acquire loads when loadingpointers while locklessly traversing the data struc-ture as shown in Figure 4

Unfortunately a memory order acquire load re-

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_acquire)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_acquire)20 21 return a22 23

24

Figure 4 ReleaseAcquire Linked Structure Traver-sal

quires expensive special load instructions or memory-fence instructions on weakly ordered systems suchas ARM Itanium and PowerPC Furthermore intraverse() the address of each memory order

acquire load within the while loop depends on thevalue of the previous memory order acquire load2

Therefore in this case most weakly ordered systemsdonrsquot really need the special load instructions or thememory-fence instructions as these systems can in-stead rely on the hardware-enforced dependency or-dering

This is the use case for memory order consumewhich can be substituted for memory order acquire

in cases where hardware dependency ordering appliesOne such case is the preceding example and Fig-ure 5 shows that same example recast in terms ofmemory order consume A memory order release

store is dependency ordered before a memory order

consume load when that load returns the value storedor in some special cases some later value [28 110p1]

2 The initial load on line 16 might well depend on an earlierload but for simplicity this example assumes that the initialfoo head structure is statically allocated and thus not subjectto updates

WG21N4321 5

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_consume)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_consume)20 21 return a22 23

24

Figure 5 ReleaseConsume Linked Structure Traver-sal

Then if the load carries a dependency to somelater memory reference [28 110p9] any memoryreference preceding the memory order release storewill happen before that later memory reference [28110p9-110p12] This means that when there is de-pendency ordering memory order consume gives thesame guarantees that memory order acquire doesbut at lower cost

On the other hand memory order consume re-quires the compiler to track the carries-a-dependencyrelationships with the set of such relationshipsheaded by a given memory order consume load be-ing called that loadrsquos dependency chains It is quitepossible that the complexity of implementing this ca-pability has thus far prevented high-quality memory

order consume implementations from appearing Itis therefore worthwhile to review use of dependencychains in practice in order to determine what typesof operations typically appear in dependency chainswhich might result in guidance to implementationsor perhaps even modifications to the definition ofmemory order consume

31 Types of Linux-Kernel Depen-dency Chains

One goal for memory order consume is to implementrcu dereference() which heads a Linux-kerneldependency-ordering tree There are a numberof variant of rcu dereference() in the Linuxkernel in order to implement the four flavors ofRCU and also to enable RCU usage diagnositicsfor code that is shared by readers and updatersThese additional variants are rcu dereference()rcu dereference bh() rcu dereference

bh check() rcu dereference bh check()rcu dereference check() rcu dereference

index check() rcu dereference protected()rcu dereference raw() rcu dereference

sched() rcu dereference sched check() srcu

dereference() and srcu dereference check()Taken together there are about 1300 uses of thesefunctions in version 313 of the Linux kernelHowever about 250 of those are rcu dereference

protected() which is used only in update-side codeand thus does not head up read-side dependencychains which leaves about 1000 uses to be inspectedfor dependency-ordering usage

32 Operators in Linux-Kernel De-pendency Chains

A surprisingly small fraction of the possible C opera-tors appear in dependency chains in the Linux kernelnamely -gt infix = casts prefix amp prefix [] infix+ infix - ternary and infix (bitwise) amp

By far the two most common operators are the-gt pointer field selector and the = assignment oper-ator Enabling the carries-dependency relationshipthrough only these two operators would likely coverbetter than 90 of the Linux-kernel use cases

Casts the prefix indirection operator and theprefix amp address-of operator are used to implementLinuxrsquos list primitives which translate from listpointers embedded in a structure to the structure it-self These operators are also used to get some of theeffects of C++ subtyping in the C language

The [] array-indexing operator and the infix +

and - arithmetic operators are used to manipulate

WG21N4321 6

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = rcu_dereference(fp)12 return p p-gta default_fooa13

Figure 6 Default Value For RCU-Protected PointerLinux Kernel

1 class foo 2 int a3 4 stdatomicltfoo gt fp5 foo default_foo67 int bar(void)8 9 stdatomicltfoo gt p

1011 p = fpload_explicit(memory_order_consume)12 return p kill_dependency(p-gta) default_fooa13

Figure 7 Default Value For RCU-Protected PointerC++11

RCU-protected arrays as well as to index into arrayscontained within RCU-protected structures RCU-protected arrays are becoming less common becausethey are being converted into more complex datastructures such as trees However RCU-protectedstructures containing arrays are still fairly common

The ternary if-then-else expression is used tohandle default values for RCU-protected pointers forexample as shown in Figure 6 or in C++11 formin Figure 7 Note that the dependency is carriedonly through the rightmost two operands of neverthrough the leftmost one

The infix amp operator is used to mask low-order bitsfrom RCU pointers These bits are used by somealgorithms as markers Such markers though notcommon in the Linux kernel are well-known in theart with hazard pointers being but one example [26]This operator is also sometimes used to locate thebeginning of an aligned structure for example if p

references a field within a data structure that is 4096bytes in size (or smaller) and that is also aligned to a4096-byte boundary then p amp ~0xfff will with theaddition of appropriate casting produce a pointer tothe beginning of the structure Note that it is ex-pected that both operands of infix amp are expected tohave some non-zero bits because otherwise a NULL

pointer will result and NULL pointers cannot reason-ably be said to carry much of anything let alone adependency

Although I did not find any infix | operators inmy census of Linux-kernel dependency chains sym-metry considerations argue for also including it forexample for read-side pointer tagging or for anotherexample locating the beginning of the next in an ar-ray of aligned structures Presumably both of theoperands of infix | must have at least one zero bit

To recap the operators appearing in Linux-kerneldependency chains are -gt infix = casts prefix ampprefix [] infix + infix - ternary infix (bitwise)amp and probably also |

33 Operators Terminating Linux-Kernel Dependency Chains

Although C++11 has the kill dependency() func-tion to terminate a dependency chain no such func-tion exists in the Linux kernel Instead Linux-kerneldependency chains are judged to have terminatedupon exit from the outermost RCU read-side criticalsection3 when existence guarantees are handed offfrom RCU to some other synchronization mechanism(usually locking or reference counting) or when thevariable carrying the dependency goes out of scope

That said it is possible to analyze Linux-kerneldependency chains to see what part of the chain isactually required by the algorithm in question Wecan therefore define the essential subset of a depen-dency chain to be that subset within which ordering

3 The beginning of a given RCU read-side critical section ismarked with rcu read lock() rcu read lock bh() rcu read

lock sched() or srcu read lock() and the end by the cor-responding primitive from the list rcu read unlock() rcu

read unlock bh() rcu read unlock sched() or srcu read

unlock() There is currently no C++11 counterpart for anRCU read-side critical section

WG21N4321 7

is required by the algorithm In the 313 version ofthe Linux kernel the following operators always markthe end of the essential subset of a dependency chain() == = ampamp || infix and

The postfix () function-invocation operator is aninteresting special case in the Linux kernel In theoryRCU could be used to protect JITed function bodiesbut in current practice RCU is instead used to waitfor all pre-existing callers to the function referencedby the previous pointer The functions are all com-piled into the kernel and the dependency chains aretherefore irrelevant to the () operator Hence in ver-sion 313 of the Linux kernel the () operator marksthe end of the essential subset of any dependencychain that it resides in

The == = ampamp and || operators are used ex-clusively in rdquoifrdquo statements to make control-flow de-cisions and therefore also mark the end of the essen-tial subset of any dependency chains that they residein In theory these relational and boolean operatorscould be used to form array indexes but in practicethe Linux kernel does not yet do this in RCU depen-dency chains The other relational operators (gt ltgt= and lt=) should probably also be added to thislist

The infix and arithmetic operators couldpotentially be used for construct array addresses butthey are not yet used that way in the Linux kernelInstead they are used to do computation on valuesfetched as the last operation in an essential subset ofa dependency chain

In short in the current Linux kernel () === ampamp || infix and all mark the end of theessential subset of a dependency chain That saidthere is potential for them to be used as part of theessential subset of dependency changes in future ver-sions of the Linux kernel And the same is of coursetrue of the remaining C-language operators whichdid not appear within any of the dependency chainsin version 313 of the Linux kernel

34 Operators Acting as Last Link inLinux-Kernel Dependency Chains

Although the -gt operator is frequently used as part ofa Linux-kernel dependency chain it often is intended

to be the last link in that chain Therefore the usescases for the -gt operator deserve special mention

The first use case involves fetching non-pointerdata from an RCU-protected data structure Forexample in the DRDB subsystem in Linux -gt isused to fetch a timeout value This code requiresthat dependency ordering apply to this fetch but itdoes not require a dependency chain extending be-yond that point This sort of case would requirea kill dependency() for implementations based onthe C++11 and C11 standards

The second use case involves linked data struc-tures where an RCU update might be applied onany pointer in the chain for example the stan-dard Linux-kernel linked list The -gt operator pro-vides dependency ordering for the fetch of the -gtnextpointer but that fetch must itself be a memory order

consume load in order to provide the required depen-dency ordering for the fields in the next structurein the list Thus a linked-list traversal consists ofa series of back-to-back non-overlapping dependencychains

These two use cases raise the question of whethera dependency chain can continue beyond a -gt oper-ator The answer is ldquoyesrdquo and this occurs when alinked structure is made visible to RCU readers as aunit For exapmle consider a linked list where eachlist element links to a constant binary search treeIf this tree is in place when the element is added tothe list then a memory order consume load is neededonly when fetching the pointer to the element Thedependency chain headed by this fetch suffices to or-der accesses to the binary search tree

These cases need to be differentiated The thirduse case appears to be the least frequent which sug-gests that the -gt operator (or a sequence of -gt oper-ators) always be the last link of a dependency chain

35 Linux-Kernel Dependency ChainLength

Many Linux-kernel dependency chains are very shortand contained with a fair number living within theconfines of a single C statement If there were onlya few short dependency chains in the Linux kernelone could imagine decorating all the operators in each

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 5: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 5

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampph-gth memory_order_consume)17 while (p = NULL) 18 a = p-gta19 p = atomic_load_explicit(ampp-gtn memory_order_consume)20 21 return a22 23

24

Figure 5 ReleaseConsume Linked Structure Traver-sal

Then if the load carries a dependency to somelater memory reference [28 110p9] any memoryreference preceding the memory order release storewill happen before that later memory reference [28110p9-110p12] This means that when there is de-pendency ordering memory order consume gives thesame guarantees that memory order acquire doesbut at lower cost

On the other hand memory order consume re-quires the compiler to track the carries-a-dependencyrelationships with the set of such relationshipsheaded by a given memory order consume load be-ing called that loadrsquos dependency chains It is quitepossible that the complexity of implementing this ca-pability has thus far prevented high-quality memory

order consume implementations from appearing Itis therefore worthwhile to review use of dependencychains in practice in order to determine what typesof operations typically appear in dependency chainswhich might result in guidance to implementationsor perhaps even modifications to the definition ofmemory order consume

31 Types of Linux-Kernel Depen-dency Chains

One goal for memory order consume is to implementrcu dereference() which heads a Linux-kerneldependency-ordering tree There are a numberof variant of rcu dereference() in the Linuxkernel in order to implement the four flavors ofRCU and also to enable RCU usage diagnositicsfor code that is shared by readers and updatersThese additional variants are rcu dereference()rcu dereference bh() rcu dereference

bh check() rcu dereference bh check()rcu dereference check() rcu dereference

index check() rcu dereference protected()rcu dereference raw() rcu dereference

sched() rcu dereference sched check() srcu

dereference() and srcu dereference check()Taken together there are about 1300 uses of thesefunctions in version 313 of the Linux kernelHowever about 250 of those are rcu dereference

protected() which is used only in update-side codeand thus does not head up read-side dependencychains which leaves about 1000 uses to be inspectedfor dependency-ordering usage

32 Operators in Linux-Kernel De-pendency Chains

A surprisingly small fraction of the possible C opera-tors appear in dependency chains in the Linux kernelnamely -gt infix = casts prefix amp prefix [] infix+ infix - ternary and infix (bitwise) amp

By far the two most common operators are the-gt pointer field selector and the = assignment oper-ator Enabling the carries-dependency relationshipthrough only these two operators would likely coverbetter than 90 of the Linux-kernel use cases

Casts the prefix indirection operator and theprefix amp address-of operator are used to implementLinuxrsquos list primitives which translate from listpointers embedded in a structure to the structure it-self These operators are also used to get some of theeffects of C++ subtyping in the C language

The [] array-indexing operator and the infix +

and - arithmetic operators are used to manipulate

WG21N4321 6

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = rcu_dereference(fp)12 return p p-gta default_fooa13

Figure 6 Default Value For RCU-Protected PointerLinux Kernel

1 class foo 2 int a3 4 stdatomicltfoo gt fp5 foo default_foo67 int bar(void)8 9 stdatomicltfoo gt p

1011 p = fpload_explicit(memory_order_consume)12 return p kill_dependency(p-gta) default_fooa13

Figure 7 Default Value For RCU-Protected PointerC++11

RCU-protected arrays as well as to index into arrayscontained within RCU-protected structures RCU-protected arrays are becoming less common becausethey are being converted into more complex datastructures such as trees However RCU-protectedstructures containing arrays are still fairly common

The ternary if-then-else expression is used tohandle default values for RCU-protected pointers forexample as shown in Figure 6 or in C++11 formin Figure 7 Note that the dependency is carriedonly through the rightmost two operands of neverthrough the leftmost one

The infix amp operator is used to mask low-order bitsfrom RCU pointers These bits are used by somealgorithms as markers Such markers though notcommon in the Linux kernel are well-known in theart with hazard pointers being but one example [26]This operator is also sometimes used to locate thebeginning of an aligned structure for example if p

references a field within a data structure that is 4096bytes in size (or smaller) and that is also aligned to a4096-byte boundary then p amp ~0xfff will with theaddition of appropriate casting produce a pointer tothe beginning of the structure Note that it is ex-pected that both operands of infix amp are expected tohave some non-zero bits because otherwise a NULL

pointer will result and NULL pointers cannot reason-ably be said to carry much of anything let alone adependency

Although I did not find any infix | operators inmy census of Linux-kernel dependency chains sym-metry considerations argue for also including it forexample for read-side pointer tagging or for anotherexample locating the beginning of the next in an ar-ray of aligned structures Presumably both of theoperands of infix | must have at least one zero bit

To recap the operators appearing in Linux-kerneldependency chains are -gt infix = casts prefix ampprefix [] infix + infix - ternary infix (bitwise)amp and probably also |

33 Operators Terminating Linux-Kernel Dependency Chains

Although C++11 has the kill dependency() func-tion to terminate a dependency chain no such func-tion exists in the Linux kernel Instead Linux-kerneldependency chains are judged to have terminatedupon exit from the outermost RCU read-side criticalsection3 when existence guarantees are handed offfrom RCU to some other synchronization mechanism(usually locking or reference counting) or when thevariable carrying the dependency goes out of scope

That said it is possible to analyze Linux-kerneldependency chains to see what part of the chain isactually required by the algorithm in question Wecan therefore define the essential subset of a depen-dency chain to be that subset within which ordering

3 The beginning of a given RCU read-side critical section ismarked with rcu read lock() rcu read lock bh() rcu read

lock sched() or srcu read lock() and the end by the cor-responding primitive from the list rcu read unlock() rcu

read unlock bh() rcu read unlock sched() or srcu read

unlock() There is currently no C++11 counterpart for anRCU read-side critical section

WG21N4321 7

is required by the algorithm In the 313 version ofthe Linux kernel the following operators always markthe end of the essential subset of a dependency chain() == = ampamp || infix and

The postfix () function-invocation operator is aninteresting special case in the Linux kernel In theoryRCU could be used to protect JITed function bodiesbut in current practice RCU is instead used to waitfor all pre-existing callers to the function referencedby the previous pointer The functions are all com-piled into the kernel and the dependency chains aretherefore irrelevant to the () operator Hence in ver-sion 313 of the Linux kernel the () operator marksthe end of the essential subset of any dependencychain that it resides in

The == = ampamp and || operators are used ex-clusively in rdquoifrdquo statements to make control-flow de-cisions and therefore also mark the end of the essen-tial subset of any dependency chains that they residein In theory these relational and boolean operatorscould be used to form array indexes but in practicethe Linux kernel does not yet do this in RCU depen-dency chains The other relational operators (gt ltgt= and lt=) should probably also be added to thislist

The infix and arithmetic operators couldpotentially be used for construct array addresses butthey are not yet used that way in the Linux kernelInstead they are used to do computation on valuesfetched as the last operation in an essential subset ofa dependency chain

In short in the current Linux kernel () === ampamp || infix and all mark the end of theessential subset of a dependency chain That saidthere is potential for them to be used as part of theessential subset of dependency changes in future ver-sions of the Linux kernel And the same is of coursetrue of the remaining C-language operators whichdid not appear within any of the dependency chainsin version 313 of the Linux kernel

34 Operators Acting as Last Link inLinux-Kernel Dependency Chains

Although the -gt operator is frequently used as part ofa Linux-kernel dependency chain it often is intended

to be the last link in that chain Therefore the usescases for the -gt operator deserve special mention

The first use case involves fetching non-pointerdata from an RCU-protected data structure Forexample in the DRDB subsystem in Linux -gt isused to fetch a timeout value This code requiresthat dependency ordering apply to this fetch but itdoes not require a dependency chain extending be-yond that point This sort of case would requirea kill dependency() for implementations based onthe C++11 and C11 standards

The second use case involves linked data struc-tures where an RCU update might be applied onany pointer in the chain for example the stan-dard Linux-kernel linked list The -gt operator pro-vides dependency ordering for the fetch of the -gtnextpointer but that fetch must itself be a memory order

consume load in order to provide the required depen-dency ordering for the fields in the next structurein the list Thus a linked-list traversal consists ofa series of back-to-back non-overlapping dependencychains

These two use cases raise the question of whethera dependency chain can continue beyond a -gt oper-ator The answer is ldquoyesrdquo and this occurs when alinked structure is made visible to RCU readers as aunit For exapmle consider a linked list where eachlist element links to a constant binary search treeIf this tree is in place when the element is added tothe list then a memory order consume load is neededonly when fetching the pointer to the element Thedependency chain headed by this fetch suffices to or-der accesses to the binary search tree

These cases need to be differentiated The thirduse case appears to be the least frequent which sug-gests that the -gt operator (or a sequence of -gt oper-ators) always be the last link of a dependency chain

35 Linux-Kernel Dependency ChainLength

Many Linux-kernel dependency chains are very shortand contained with a fair number living within theconfines of a single C statement If there were onlya few short dependency chains in the Linux kernelone could imagine decorating all the operators in each

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 6: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 6

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = rcu_dereference(fp)12 return p p-gta default_fooa13

Figure 6 Default Value For RCU-Protected PointerLinux Kernel

1 class foo 2 int a3 4 stdatomicltfoo gt fp5 foo default_foo67 int bar(void)8 9 stdatomicltfoo gt p

1011 p = fpload_explicit(memory_order_consume)12 return p kill_dependency(p-gta) default_fooa13

Figure 7 Default Value For RCU-Protected PointerC++11

RCU-protected arrays as well as to index into arrayscontained within RCU-protected structures RCU-protected arrays are becoming less common becausethey are being converted into more complex datastructures such as trees However RCU-protectedstructures containing arrays are still fairly common

The ternary if-then-else expression is used tohandle default values for RCU-protected pointers forexample as shown in Figure 6 or in C++11 formin Figure 7 Note that the dependency is carriedonly through the rightmost two operands of neverthrough the leftmost one

The infix amp operator is used to mask low-order bitsfrom RCU pointers These bits are used by somealgorithms as markers Such markers though notcommon in the Linux kernel are well-known in theart with hazard pointers being but one example [26]This operator is also sometimes used to locate thebeginning of an aligned structure for example if p

references a field within a data structure that is 4096bytes in size (or smaller) and that is also aligned to a4096-byte boundary then p amp ~0xfff will with theaddition of appropriate casting produce a pointer tothe beginning of the structure Note that it is ex-pected that both operands of infix amp are expected tohave some non-zero bits because otherwise a NULL

pointer will result and NULL pointers cannot reason-ably be said to carry much of anything let alone adependency

Although I did not find any infix | operators inmy census of Linux-kernel dependency chains sym-metry considerations argue for also including it forexample for read-side pointer tagging or for anotherexample locating the beginning of the next in an ar-ray of aligned structures Presumably both of theoperands of infix | must have at least one zero bit

To recap the operators appearing in Linux-kerneldependency chains are -gt infix = casts prefix ampprefix [] infix + infix - ternary infix (bitwise)amp and probably also |

33 Operators Terminating Linux-Kernel Dependency Chains

Although C++11 has the kill dependency() func-tion to terminate a dependency chain no such func-tion exists in the Linux kernel Instead Linux-kerneldependency chains are judged to have terminatedupon exit from the outermost RCU read-side criticalsection3 when existence guarantees are handed offfrom RCU to some other synchronization mechanism(usually locking or reference counting) or when thevariable carrying the dependency goes out of scope

That said it is possible to analyze Linux-kerneldependency chains to see what part of the chain isactually required by the algorithm in question Wecan therefore define the essential subset of a depen-dency chain to be that subset within which ordering

3 The beginning of a given RCU read-side critical section ismarked with rcu read lock() rcu read lock bh() rcu read

lock sched() or srcu read lock() and the end by the cor-responding primitive from the list rcu read unlock() rcu

read unlock bh() rcu read unlock sched() or srcu read

unlock() There is currently no C++11 counterpart for anRCU read-side critical section

WG21N4321 7

is required by the algorithm In the 313 version ofthe Linux kernel the following operators always markthe end of the essential subset of a dependency chain() == = ampamp || infix and

The postfix () function-invocation operator is aninteresting special case in the Linux kernel In theoryRCU could be used to protect JITed function bodiesbut in current practice RCU is instead used to waitfor all pre-existing callers to the function referencedby the previous pointer The functions are all com-piled into the kernel and the dependency chains aretherefore irrelevant to the () operator Hence in ver-sion 313 of the Linux kernel the () operator marksthe end of the essential subset of any dependencychain that it resides in

The == = ampamp and || operators are used ex-clusively in rdquoifrdquo statements to make control-flow de-cisions and therefore also mark the end of the essen-tial subset of any dependency chains that they residein In theory these relational and boolean operatorscould be used to form array indexes but in practicethe Linux kernel does not yet do this in RCU depen-dency chains The other relational operators (gt ltgt= and lt=) should probably also be added to thislist

The infix and arithmetic operators couldpotentially be used for construct array addresses butthey are not yet used that way in the Linux kernelInstead they are used to do computation on valuesfetched as the last operation in an essential subset ofa dependency chain

In short in the current Linux kernel () === ampamp || infix and all mark the end of theessential subset of a dependency chain That saidthere is potential for them to be used as part of theessential subset of dependency changes in future ver-sions of the Linux kernel And the same is of coursetrue of the remaining C-language operators whichdid not appear within any of the dependency chainsin version 313 of the Linux kernel

34 Operators Acting as Last Link inLinux-Kernel Dependency Chains

Although the -gt operator is frequently used as part ofa Linux-kernel dependency chain it often is intended

to be the last link in that chain Therefore the usescases for the -gt operator deserve special mention

The first use case involves fetching non-pointerdata from an RCU-protected data structure Forexample in the DRDB subsystem in Linux -gt isused to fetch a timeout value This code requiresthat dependency ordering apply to this fetch but itdoes not require a dependency chain extending be-yond that point This sort of case would requirea kill dependency() for implementations based onthe C++11 and C11 standards

The second use case involves linked data struc-tures where an RCU update might be applied onany pointer in the chain for example the stan-dard Linux-kernel linked list The -gt operator pro-vides dependency ordering for the fetch of the -gtnextpointer but that fetch must itself be a memory order

consume load in order to provide the required depen-dency ordering for the fields in the next structurein the list Thus a linked-list traversal consists ofa series of back-to-back non-overlapping dependencychains

These two use cases raise the question of whethera dependency chain can continue beyond a -gt oper-ator The answer is ldquoyesrdquo and this occurs when alinked structure is made visible to RCU readers as aunit For exapmle consider a linked list where eachlist element links to a constant binary search treeIf this tree is in place when the element is added tothe list then a memory order consume load is neededonly when fetching the pointer to the element Thedependency chain headed by this fetch suffices to or-der accesses to the binary search tree

These cases need to be differentiated The thirduse case appears to be the least frequent which sug-gests that the -gt operator (or a sequence of -gt oper-ators) always be the last link of a dependency chain

35 Linux-Kernel Dependency ChainLength

Many Linux-kernel dependency chains are very shortand contained with a fair number living within theconfines of a single C statement If there were onlya few short dependency chains in the Linux kernelone could imagine decorating all the operators in each

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 7: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 7

is required by the algorithm In the 313 version ofthe Linux kernel the following operators always markthe end of the essential subset of a dependency chain() == = ampamp || infix and

The postfix () function-invocation operator is aninteresting special case in the Linux kernel In theoryRCU could be used to protect JITed function bodiesbut in current practice RCU is instead used to waitfor all pre-existing callers to the function referencedby the previous pointer The functions are all com-piled into the kernel and the dependency chains aretherefore irrelevant to the () operator Hence in ver-sion 313 of the Linux kernel the () operator marksthe end of the essential subset of any dependencychain that it resides in

The == = ampamp and || operators are used ex-clusively in rdquoifrdquo statements to make control-flow de-cisions and therefore also mark the end of the essen-tial subset of any dependency chains that they residein In theory these relational and boolean operatorscould be used to form array indexes but in practicethe Linux kernel does not yet do this in RCU depen-dency chains The other relational operators (gt ltgt= and lt=) should probably also be added to thislist

The infix and arithmetic operators couldpotentially be used for construct array addresses butthey are not yet used that way in the Linux kernelInstead they are used to do computation on valuesfetched as the last operation in an essential subset ofa dependency chain

In short in the current Linux kernel () === ampamp || infix and all mark the end of theessential subset of a dependency chain That saidthere is potential for them to be used as part of theessential subset of dependency changes in future ver-sions of the Linux kernel And the same is of coursetrue of the remaining C-language operators whichdid not appear within any of the dependency chainsin version 313 of the Linux kernel

34 Operators Acting as Last Link inLinux-Kernel Dependency Chains

Although the -gt operator is frequently used as part ofa Linux-kernel dependency chain it often is intended

to be the last link in that chain Therefore the usescases for the -gt operator deserve special mention

The first use case involves fetching non-pointerdata from an RCU-protected data structure Forexample in the DRDB subsystem in Linux -gt isused to fetch a timeout value This code requiresthat dependency ordering apply to this fetch but itdoes not require a dependency chain extending be-yond that point This sort of case would requirea kill dependency() for implementations based onthe C++11 and C11 standards

The second use case involves linked data struc-tures where an RCU update might be applied onany pointer in the chain for example the stan-dard Linux-kernel linked list The -gt operator pro-vides dependency ordering for the fetch of the -gtnextpointer but that fetch must itself be a memory order

consume load in order to provide the required depen-dency ordering for the fields in the next structurein the list Thus a linked-list traversal consists ofa series of back-to-back non-overlapping dependencychains

These two use cases raise the question of whethera dependency chain can continue beyond a -gt oper-ator The answer is ldquoyesrdquo and this occurs when alinked structure is made visible to RCU readers as aunit For exapmle consider a linked list where eachlist element links to a constant binary search treeIf this tree is in place when the element is added tothe list then a memory order consume load is neededonly when fetching the pointer to the element Thedependency chain headed by this fetch suffices to or-der accesses to the binary search tree

These cases need to be differentiated The thirduse case appears to be the least frequent which sug-gests that the -gt operator (or a sequence of -gt oper-ators) always be the last link of a dependency chain

35 Linux-Kernel Dependency ChainLength

Many Linux-kernel dependency chains are very shortand contained with a fair number living within theconfines of a single C statement If there were onlya few short dependency chains in the Linux kernelone could imagine decorating all the operators in each

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 8: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 8

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 atomic_store_explicit(pp p memory_order_release)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = atomic_load_explicit(ampfield_dep(ph h)17 memory_order_consume)18 while (p = NULL) 19 a = field_dep(p a)20 p = atomic_load_explicit(ampfield_dep(p n)21 memory_order_consume)22 23 return a24

Figure 8 Decorated Linked Structure Traversal

dependency chain for example replacing the -gt op-erator with something like the mythical field dep()

operator shown on lines 16 19 and 20 of Figure 8

However there are a great many dependencychains that extend across multiple functions Onerelatively modest example is in the Linux networkstack in the arp process() function This depen-dency chain extends as follows with deeper nestingindicating deeper function-call levels

bull The arp process() function invokes in dev

get rcu() which returns an RCU-protectedpointer The head of the dependency chain istherefore within the in dev get rcu() func-tion

bull The arp process() function invokes the follow-ing macros and functions

ndash IN DEV ROUTE LOCALNET() which expandsto the ipv4 devconf get() function

ndash arp ignore() which in turn calls

lowast IN DEV ARP IGNORE() which expandsto the ipv4 devconf get() function

lowast inet confirm addr() which calls

middot dev net() which in turn callsread pnet()

ndash IN DEV ARPFILTER() which expands toipv4 devconf get()

ndash IN DEV CONF GET() which also expands toipv4 devconf get()

ndash arp fwd proxy() which calls

lowast IN DEV PROXY ARP() which expandsto ipv4 devconf get()

lowast IN DEV MEDIUM ID() which also ex-pands to ipv4 devconf get()

ndash arp fwd pvlan() which calls

lowast IN DEV PROXY ARP PVLAN() which ex-pands to ipv4 devconf get()

ndash pneigh enqueue()

Again although a great many dependency chainsin the Linux kernel are quite short there are quite afew that spread both widely and deeply We thereforecannot expect Linux kernel hackers to look fondlyon any mechanism that requires them to decorateeach and every operator in each and every depen-dency chain as was shown in Figure 8 In fact evenkill dependency() will likely be an extremely diffi-cult sell

4 Dependency Ordering in Pre-C11 Implementations

Pre-C11 implementations of the C language do nothave any formal notion of dependency ordering butthese implementations are nevertheless used to buildthe Linux kernelmdashand most likely all other softwareusing RCU This section lays out a few straightfor-ward rules for both implementers (Section 42) andusers of these pre-C11 C-language implementations(Section 41)

41 Rules for C-Language RCU Users

The rules for C-language RCU users have evolvedover time so this section will present them in reversechronological order

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 9: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 9

411 Rules for 2014 GCC Implementations

The primary rule for developers implementing RCU-based algorithms is to avoid letting the compiler de-terming the value of any variable in any dependencychain This primary rule implies a number of sec-ondary rules

1 Use only intrinsic operators on basic types Ifyou are making use of C++ template metapro-gramming or operator overloading more elabo-rate rules apply and those rules are outside thescope of this document

2 Use a volatile load to head the dependency chainThis is necessary to avoid the compiler repeatingthe load or making use of (possibly erroneous)prior knowledge of the contents of the memorylocation each of which can break dependencychains

3 Avoid use of single-element RCU-protected ar-rays The compiler is within its right to assumethat the value of an index into such an arraymust necessarily evaluate to zero The com-piler could then substitute the constant zero forthe computation breaking the dependency chainand introducing misordering

4 Avoid cancellation when using the + and - infixarithmetic operators For example for a givenvariable x avoid (xminusx) The compiler is withinits rights to substitute zero for any such cancel-lation breaking the dependency chain and againintroducing misordering Similar arithmetic pit-falls must be avoided if the infix or oper-ators appear in the essential subset of a depen-dency chain

5 Avoid all-zero operands to the bitwise amp opera-tor and similarly avoid all-ones operands to thebitwise | operator If the compiler is able todeduce the value of such operands it is withinits rights to substitute the corresponding con-stant for the bitwise operation Once again thisbreaks the dependency chain introducing mis-ordering

Please note that single-bit operands to bitwise amp

can be dangerous because the compiler requiresonly a small amount of additional information todeduce the exact value which could again resultin constant substitution Operands to bitwise |

that have only one zero bit are similarly danger-ous

6 If you are using RCU to protect JITed functionsso that the () function-invocation operator is amember of the essential subset of the dependencytree you may need to interact directly with thehardware to flush instruction caches This issuearises on some systems when a newly JITed func-tion is using the same memory that was used byan earlier JITed function

7 Do not use the boolean ampamp and || operators inessential dependency chains The reason for thisprohibition is that they are often compiled usingbranches Weak-memory machines such as ARMor PowerPC order stores after such branches butcan speculate loads which can break data depen-dency chains

8 Do not use relational operators (== = gt gt=lt or lt=) in the essential subset of a dependencychain The reason for this prohibition is that asfor boolean operators relational operators areoften compiled using branches Weak-memorymachines such as ARM or PowerPC order storesafter such branches but can speculate loadswhich can break dependency chains

9 Be very careful about comparing pointers in theessential subset of a dependency chain As LinusTorvalds explained if the two pointers are equalthe compiler could substitute the pointer you arecomparing against for the pointer in the essen-tial subset of the dependency chain On ARMand Power hardware it might be that only theoriginal value carried a hardware dependency sothis substitution would break the chain in turnpermitting misordering Such comparisons areOK in the following cases

(a) The pointer being compared against refer-ences memory that was initialized at boot

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 10: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 10

time or otherwise long enough ago thatreaders cannot still have pre-initialized datacached Examples include module-init timefor module code before kthread creationfor code running in a kthread while theupdate-side lock is held and so on

(b) The pointer is never dereferenced afterbeing compared This exception applieswhen comparing against the NULL pointeror when scanning RCU-protected circularlinked lists

(c) The pointer being compared against is partof the essential subset of a dependencychain This can be a different dependencychain but only as long as that chain stemsfrom a pointer that was modified after anyinitialization of interest This exception canapply when carrying out RCU-protectedtraversals from different entry points thatconverged on the same data structure

(d) The pointer being compared against isfetched using rcu access pointer() andall subsequent dereferences are stores

(e) The pointers compared not-equal and thecompiler does not have enough informationto deduce the value of the pointer (Forexample if the compiler can see that thepointer will only ever take on one of twovalues then it will be able to deduce theexact value based on a not-equals compar-ison)

10 Disable any value-speculation optimizations thatyour compiler might provide especially if you aremaking use of feedback-based optimizations thattake data collected from prior runs

412 Rules for 2003 GCC Implementations

Prior to the 269 version of the Linux kernel therewas neither rcu dereference() nor rcu assign

pointer() Instead explicit memory barriers wereused smp read barrier depends() by readers andsmp wmb() by updaters For example the code shownfor current Linux kernels in Figure 6 would be as

1 struct foo 2 int a3 4 struct foo fp5 struct foo default_foo67 int bar(void)8 9 struct foo p

1011 p = fp12 smp_read_barrier_depends()13 return p p-ampgta default_fooa14

Figure 9 Default Value For RCU-Protected PointerOld Linux Kernel

shown in Figure 9 for 268 and earlier versions of theLinux kernel A similar transformation relates theolder use of smp wmb() and the more recent use ofrcu assign pointer()

This older API was clearly much more vulnerableto compiler optimizations than is the current APIbut the real motivation for this change was read-ability and maintainability as can be seen from thecommit log for the mid-2004 patch introducing rcu

dereference()

This patch introduced an rcu

dereference() macro that replaces mostuses of smp read barrier depends()The new macro has the advantage ofexplicitly documenting which pointersare protected by RCU ndash in contrast itis sometimes difficult to figure out whichpointer is being protected by a givensmp read barrier depends() call

The commit log for the mid-2004 patch introducingrcu assign pointer() justifies the change in termsof eliminating hard-to-use explicit memory barriers

Attached is a patch that adds an rcu

assign pointer() that allows a number ofexplicit smp wmb() memory barriers to bedispensed with improving readability

The importance of suppressing compiler optimiza-tions did not become apparent until much later In

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 11: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 11

fact a volatile cast was not added to the implementa-tion of rcu dereference() until 2624 in early 2008

413 Rules for 1990s Sequent C Implemen-tations

1990s systems featured far slower CPUs and muchless memory that is commonly provisioned to-day and the compilers were correspondingly lesssophisticated Therefore at that time a sim-ple C-language field selector was used instead ofany sort of rcu dereference() or memory order

consume operation Not only was there novolatile cast there also was nothing resembling smp

read barrier depends() The lack of smp read

barrier depends() is not too surprising given thatDYNIXptx did not run on DEC Alpha

This approach was nevertheless quite reliable be-cause the use cases within the DYNIXptx kernelwere both few and straightforward and provided lit-tle or no opportunity for optimizations that mightbreak dependency chains

42 Rules for C-Language Imple-menters

The main rule for C-language implementers is toavoid any sort of value speculation or at the veryleast provide means for the user to disable suchspeculation An example of a value-speculation op-timization that can be carried out with the help ofhardware branch prediction is shown in Figure 10which is an optimized version of the code in Fig-ure 5 This sort of transformation might result fromfeedback-directed optimization where profiling runsdetermined that the value loaded from ph was almostalway 0xbadfab1e Although this transformation iscorrect in a single-threaded environment in a concur-rent environment nothing stops the compiler or theCPU from speculating the load on line 19 before itexecutes the rcu dereference() on line 16 whichcould result in line 19 executing before the corre-sponding store on line 7 resulting in a garbage valuein variable a4

4 Kudos to Olivier Giroux for pointing out use of branchprediction to enable value speculation

1 void new_element(struct foo pp int a)2 3 struct foo p = malloc(sizeof(p))45 if (p)6 abort()7 p-gta = a8 rcu_assign_pointer(pp p)9

1011 int traverse(struct foo_head ph)12 13 int a = -114 struct foo p1516 p = rcu_dereference(ampph-gth)17 while (p = NULL) 18 if (p == (struct foo )0xbadfab1e)19 a = ((struct foo )0xbadfab1e)-gta20 else21 a = p-gta22 p = rcu_dereference(ampp-gtn)23 24 return a25

Figure 10 Dangerous Optimizations HardwareBranch Predictions

There are some situations where this sort of opti-mization would be safe including

1 The value speculated is a numeric value ratherthan a pointer so that if the guess proves correctafter the fact the computation will be appropri-ate after the fact

2 The value speculated is a pointer to invariantdata so that reasonable values are produced bydereferencing even if the guess proves to havebeen correct only after the fact

3 As above but where any updates result in datathat produces appropriate computations at anyand all phases of the update

However this list does not contain the general caseof memory order consume loads

Pure hardware implementations of value specula-tion can avoid this problem because they monitorcache-coherence protocol events that would resultfrom some other CPU invalidating the guess

In short compiler writers must provide means todisable all forms of value speculation unless the spec-

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 12: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 12

ulation is accompanied by some means of detectingthe race condition that Figure 10 is subject to

Are there other dependency-breaking optimizationsthat should be called out separately

5 Dependency Ordering in C11and C++11 Implementations

The simplest way to avoid dependency-ordering is-sues is to strengthen all memory order consume oper-ations to memory order acquire This functions cor-rectly but may result in unacceptable performancedue to memory-barrier instructions on weakly or-dered systems such as ARM and PowerPC5 and mayfurther unnecessarily suppress code-motion optimiza-tions

Another straightforward approach is to avoid valuespeculation and other dependency-breaking opti-mizations This might result in missed opportu-nities for optimization but avoids any need fordependency-chain annotations and also all issuesthat might otherwise arise from use of dependency-breaking optimizations This approach is fully com-patible with the Linux kernel communityrsquos currentapproach to dependency chains Unfortunately thereare any number of valuable optimizations that breakdependency chains so this approach seems impracti-cal

A third approach is to avoid value speculationand other dependency-breaking optimizations in anyfunction containing either a memory order consume

load or a [[carries dependency]] attribute Forexample the hardware-branch-predition optimiza-tion shown in Figure 10 would be prohibited in suchfunctions as would cancellation optimizations suchas optimizing a = b + c - c into a = b This toocan result in missed opportunities for optimizationthough very probably many fewer than the previousapproach This approach can also result in issues dueto dependency-breaking optimizations in functionslacking [[carries dependency]] attributes for ex-ample function d() in Figure 11 It can also result

5 From a Linux-kernel community viewpoint that shouldread ldquowill result in unacceptable performancerdquo

1 int a(struct foo p [[carries_dependency]])2 3 return kill_dependency(p-gta = 0)4 56 int b(int x)7 8 return x9

1011 foo c(void)12 13 return fpload_explicit(memory_order_consume)14 return rcu_dereference(fp) in Linux kernel 15 1617 int d(void)18 19 int a20 foo p2122 rcu_read_lock()23 p = c()24 a = p-gta25 rcu_read_unlock()26 return a27

Figure 11 Example Functions for Dependency Or-dering Part 1

1 [[carries_dependency]] struct foo e(void)2 3 return fpload_explicit(memory_order_consume)4 return rcu_dereference(fp) in Linux kernel 5 67 int f(void)8 9 int a

10 foo p1112 rcu_read_lock()13 p = e()14 a = p-gta15 rcu_read_unlock()16 return kill_dependency(a)17 1819 int g(void)20 21 int a22 foo p2324 rcu_read_lock()25 p = e()26 a = p-gta27 rcu_read_unlock()28 return b(a)29

Figure 12 Example Functions for Dependency Or-dering Part 2

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 13: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 13

in spurious memory-barrier instructions when a de-pendency chain goes out of scope for example withthe return statement of function g() in Figure 12

A fourth approach is to add a compile-time op-eration corresponding to the beginning and end ofRCU read-side critical section These would need tobe evaluated at compile time taking into accountthe fact that these critical sections can nest and canbe conditionally entered and exited Note that theexit from an outermost RCU read-side critical sec-tion should imply a kill dependency() operation oneach variable that is live at that point in the code6

Although it is probably impossible to precisely de-termine the bounds of a given RCU read-side criticalsection in the general case conservative approachesthat might overestimate the extent of a given sec-tion should be acceptable in almost all cases Thisapproach would make functions c() and d() in Fig-ure 11 handle dependency chains in a natural mannerbut avoiding whole-program analysis would requiresomething similar to the [[carries dependency]]

annotations called out in the C11 and C++11 stan-dards

A fifth approach would be to require that all op-erations on the essential subset of any dependencychain be annotated This would greatly ease imple-mentation but would not be likely to be accepted bythe Linux kernel community

A sixth approach is to track dependencies as calledout in the C11 and C++11 standards However in-stead of emitting a memory-barrier instruction whena dependency chain flows into or out of a functionwithout the benefit of [[carries dependency]] in-sert an implicit kill dependency() invocation Im-plementation should also optionally issue a diagnosticin this case The motivation for this approach is thatit is expected that many more kill dependencies()

than [[carries dependency]] would be required toconvert the Linux kernelrsquos RCU code to C11 In theexample in Figure 12 this approach would allow func-tion g() to avoid emitting an unnecessary memory-barrier instruction but without function f()rsquos ex-

6 What if a given rcu read unlock() sometimes marked theend of an outermost RCU read-side critical section but othertimes was nested in some other RCU read-side critical sectionIn that case there should be no kill dependency()

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a)3 a = p-gtspecial_a4 else5 a = p-gtnormal_a

Figure 13 Dependency-Ordering Value-NarrowingHazard

plicit kill dependency() Both functions are in Fig-ure 12

A seventh and final approach is to track dependen-cies as called out in in the C11 and C++11 standardsWith this approach functions e() and f() properlypreserve the required amount of dependency order-ing

6 Weaknesses in C11 andC++11 Dependency Or-dering

Experience has shown several weaknesses in the de-pendency ordering specified in the C11 and C++11standards

1 The C11 standard does not provide attributesand in particular does not provide the[[carries dependency]] attribute This pre-vents the developer from specifying that a givendependency chain passes into or out of a givenfunction

2 The implementation complexity of thedependency-chain tracking required by bothstandard can be quite onerous on the one handand the overhead of unconditionally promotingmemory order consume loads to memory order

acquire can be excessive on weakly orderedimplementations on the other There is thereforeno easy way out for a memory order consume

implementation on a weakly ordered system

3 The function-level granularity of [[carries

dependency]] seems too coarse One problemis that points-to analysis is non-trivial so that

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 14: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 14

compilers are likely to have difficulty determin-ing whether or not a given pointer carries a de-pendency For example the current wording ofthe standard (intentionally) does not disallowdependency chaining through stores and loadsTherefore if a dependency-carrying value mightever be written to a given variable an implemen-tation might reasonably assume that any loadfrom that variable must be assumed to carry adependency

4 The rules set out in the standard [28 110p9]do not align well with the rules that develop-ers must currently adhere to in order to main-tain dependency chains when using pre-C11 andpre-C++11 compilers (see Section 41) For ex-ample the standard requires (x-x) to carry adependency and providing this guarantee wouldat the very least require the compiler to alsoturn off optimizations that remove (x-x) (andsimilar patterns) if x might possibly be carry-ing a dependency For another example con-sider the value-speculation-like code shown inFigure 13 that is sometimes written by devel-opers and that was described in bullet 9 ofSection 41 In this example the standard re-quires dependency ordering between the memoryorder consume load on line 1 and the sub-sequent dereference on line 3 but a typicalcompiler would not be expected to differenti-ate between these two apparently identical val-ues These two examples show that a compilerwould need to detect and carefully handle thesecases either by artificially inserting dependen-cies omitting optimizations differentiating be-tween apparently identical values or even byemitting memory order acquire fences

5 The whole point of memory order consume andthe resulting dependency chains is to allow de-velopers to optimize their code Such optimiza-tion attempts can be completely defeated by thememory order acquire fences that the standardcurrently requires when a dependency chain goesout of scope without the benefit of a [[carries

dependency]] attribute Preventing the com-piler from emitting these fences requires liberal

use of kill dependency() which clutters coderequires large developer effort and further re-quires that the developer know quite a bit aboutwhich code patterns a given version of a givencompiler can optimize (thus avoiding needlessfences) and which it cannot (thus requiring man-ual insertion of kill dependency()

As of this writing no known implementations fullysupport C11 or C++11 dependency ordering

It is worth asking why Paul didnrsquot anticipate theseweaknesses There are several reasons for this

1 Compiler optimizations have become more ag-gressive over the seven years since Paul startedworking on standardization

2 New dependency-ordering use cases have arisenduring that same time in particular there arelonger dependency chains and more of themincluding dependency chains spanning multiplecompilation units

3 The number of dependency chains has increasedby roughly an order of magnitude during thattime so that changes in code style can be ex-pected to face a commeasurate increase in resis-tance from the Linux kernel community ndash unlessthose changes bring some tangible benefit

With that letrsquos look at some potential alternativesto dependency ordering as defined in the C11 andC++11 standards

7 Potential Alternatives to C11and C++11 Dependency Or-dering

Given the weaknesses in the current standardrsquos spec-ification of dependency ordering it is quite reason-able to consider alternatives To this end Section 71discusses ease-of-use issues involved with revisions tothe C11 and C++11 definitions of dependency or-dering Section 72 enlists help from the type systembut also imposes value restrictions (thus revising the

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 15: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 15

C11 and C++11 semantics for dependencies) Sec-tion 73 enlists help from the type system withoutthe value restrictions and Section 74 describes awhole-program approach to dependency chains (alsorevising the C11 and C++11 semantics for depen-dencies) Section 75 describes a post-Rapperswilproposal that dependency chains be restricted tofunction-scope local variables and temporaries andSection 76 describes a second post-Rapperswil pro-posal that the [[carries dependency]] attributebe used to label local-scope variables that carry de-pendencies Section 77 describes a proposal dis-cussed verbally at Rapperswil that explicitly marksthe tails of dependency chains Section 78 describesthe inverse namely marking the heads of dependencychains Section 79 describes an approach that avoidsmarking by sharply restricting the number and typeof operations permitted in dependency chains Eachapproach appears to have advantages and disadvan-tages so it is hoped that further discussion will eitherhelp settle on one of these alternatives or generatesomething better To help initiate this discussionSection 710 provides an initial comparative evalua-tion

71 Revising C11 and C++11Dependency-Ordering Definition

The following sections each describe a proposed revi-sion of the dependency-ordering definition from thatin the current C11 and C++11 standards In manyof these proposals developers are required to followan additional rule in order to be able to rely on de-pendency ordering Subsequent execution must notlead to a situation where there is only one possiblevalue for the variable that is intended to carry thedependency7 This is shown in Figure 17 where thecompiler is permitted to break dependency orderingon line 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would break

7 This restricted notion of dependence is sometimes calledsemantic dependence and the value at the end of a depen-dence chain that does not represent a semantic dependence issometimes said to be independent of the value at the head ofthe dependency chain

1 int my_array[MY_ARRAY_SIZE]23 i = atomic_load_explicit(gi memory_order_consume)4 r1 = my_array[i]

Figure 14 Single-Element Arrays and DependencyOrdering

dependency ordering In short a dependency chainbreaks if it comes to a point where only a single valueis possible regardless of the value of the memory

order consume load heading up the chain At firstglance this additional rule could be quite difficult tolive with as dependency ordering could come and godepending on small details of code far away from thatpoint in the dependency chain

However a review of the Linux-kernel operators inSection 32 shows that the most commonly used op-erators act identically under both definitions Theproblem-free operators include -gt infix = casts pre-fix amp prefix and ternary

One example of a potentially troublesome opera-tor namely == is shown in Figure 17 where line 6breaks dependency ordering because the value of p isknown to be equal to that of q which is not part of adependency chain This example could be addressedthrough careful diagnostic design coupled with appro-priate coding standards For example the compilercould emit a warning on line 6 but remain silent forthe equivalent line substituting q for p namely dosomething with(q-gta)

Another example is the use of postfix [] thatis shown in Figure 14 If this code fragment wascompiled with MY ARRAY SIZE equal to one there isno dependency ordering between lines 3 and 4 butthat same code fragment compiled with MY ARRAY

SIZE equal to two or greater would be dependency-ordered Here a diagnostic for single-element arraysmight prove useful and such a diagnostic can easilybe supplied in this case using if and error

In the Linux kernel infix + and - are used forpointer and array computations These are all safein that they operate on an integer and pointer sothat any cancellation will not normally be detectableat compile time However one big purpose of diag-nostics is to detect abnormal conditions indicating

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 16: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 16

1 struct liststackhead 2 struct liststack __rcu first3 45 struct liststack 6 struct liststack __rcu next7 void t8 struct rcu_head rh9

1011 _Carries_dependency12 void ls_front(struct liststackhead head)13 14 _Carries_dependency void data15 struct liststack lsp1617 rcu_read_lock()18 lsp = rcu_dereference(head-gtfirst)19 if (lsp == NULL)20 data = NULL21 else22 data = rcu_dereference(lsp-gtt)23 rcu_read_unlock()24 return data25

Figure 15 List-Based-Stack Example Code 1 of 2

probable bugs Therefore in cases where the com-piler can determine that two values from dependencychains are annihilating each other via infix + and -a diagnostic would be appropriate

Similarly the Linux kernel uses infix (bitwise) amp tomanipulate bits at the bottom of a pointer whereagain cancellation will not normally be detectable atcompile timemdashexcept in the case of operations on aNULL pointer for which dependency ordering is notmeaningful in any case However as with infix +

and - if the compiler detects value annihilation adiagnostic would be appropriate

Although issues with false positives and negativesneeds further investigation there is reason to hopethat this revision of the definition of dependency or-dering might avoid significant impacts on ease of useWith this hope we proceed to the specific propos-als using the code in Figures 15 and 16 to showsome sample code using Linux-kernel nomenclaturewith the addition of a mythical C keyword Carries

dependency to annotate parameters variables andreturn values that carry dependencies Please notethat this code example in no way endorses the dubi-ous practice of creating a parallel program with thesort of choke point exemplified by the head of this

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 16 List-Based-Stack Example Code 2 of 2

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 17: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 17

1 value_dep_preserving struct foo p23 p = atomic_load_explicit(gp memory_order_consume)4 q = some_other_pointer5 if (p == q)6 do_something_with(p-gta)7 else8 do_something_else_with(p-gtb)

Figure 17 Single-Value Variables and DependencyOrdering

list Note also that cmpxchg() heads a dependencychain which is completely reasonable within the con-text of the Linux kernel due to its acquire semanticswhich of course might be argued to indicate that theannotations in ls pop() are unnecessary

72 Type-Based Designation of De-pendency Chains With Restric-tions

This approach was formulated by Torvald Riegel inresponse to Linus Torvaldsrsquos spirited criticisms of thecurrent C11 and C++11 wording

This approach introduces a new value dep

preserving type qualifier Dependency ordering ispreserved only via variables having this type quali-fier This is meant to model the real scope of depen-dencies which is data flow not execution at function-level granularity This approach should therefore givedevelopers much finer control of which dependenciesare tracked

Assigning from a value dep preserving value to anon-value dep preserving variable terminates thetracking of dependencies in much the same way thatan explicit kill dependency() would However un-like an explicit kill dependency() compilers shouldbe able to emit a suppressable warning on implicitconversions so as to alert the developer about other-wise silent dropping of dependency tracking8

Next we specify that memory order consume loadsreturn a value dep preserving type by default thecompiler must assume such a load to be capable of

8 Other choices are possible in this case including emit-ting a memory order acquire fence in order to conservativelypreserve a potentially intended ordering

producing any value of the underlying type In otherwords the implementation is not permitted to applyany value-restriction knowledge it might gain fromwhole-program analysis We call this a local seman-tic dependency to distinguish not only from a pure(syntactic) dependency but also from a global seman-tic dependency where global information may be ap-plied Note that any global semantic dependency isalso a local semantic dependency but that any localsemantic dependency which is headed by a variablethat can be proven to take on only a single value is nota global semantic dependency The term ldquosemanticdependencyrdquo should be interpreted to mean a globalsemantic dependency unless otherwise stated

This allows developers to start with a clean slatefor the additional rule that they must follow to beable to rely on dependency ordering Subsequent ex-ecution must not lead to a situation there is only onepossible value for the value dep preserving expres-sion because otherwise the implementation is per-mitted to break the dependency chain As notedearlier this is shown in Figure 17 where the com-piler is permitted to break dependency ordering online 6 because it knows that the value of p is equalto that of q which means that it could substitutethe latter value from the former which would breakdependency ordering

This approach has several advantages

1 The implementation is simpler because no de-pendency chains need to be traced The imple-mentation can instead drive optimization deci-sions strictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serveas valuable documentation of the developerrsquos in-tent

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 18: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 18

5 This approach permits many additional opti-mizations compared to those permitted by thecurrent standard on code that carries a depen-dency Expressions such as (x-x) no longerrequire establishment of artificial dependenciesand the compiler is no longer required to detectvalue-narrowing hazards like that shown in Fig-ure 13 However the compiler is still prohibitedfrom adding its own value-speculation optimiza-tions

6 Linus Torvalds seems to be OK with it whichindicates that this set of rules might be practicalfrom the perspective of developers who currentlyexploit dependency chains

According to Peter Sewell one disadvantage is thatthis approach will be quite difficult to model which inturn will pose obstacles for the analysis tooling thatwill be increasingly necessary for large-scale concur-rent programming efforts In particular the concernis that forcing the compiler to assume that a memory

order consume load could possibly return any valuepermitted by its type might require program-analysistools to consider counterfactual hypothetical execu-tions which might complicate specification of seman-tics and verification

Figures 18 and 19 show how this approach playsout with the list-based stack

73 Type-Based Designation of De-pendency Chains

Jeff Preshing made an off-list suggestion of using avalue dep preserving type modifier as suggestedby Torvald Riegel but using this type modifier tostrictly enforce dependency ordering For exampleconsider the code fragment shown in Figure 17 Thescheme described in Section 72 would not necessar-ily enforce dependency ordering between the load online 3 and the access one line 6 while the approachdescribed in this section would enforce dependencyordering in this case

Furthermore cancelling or value-destruction oper-ations on value dep preserving values would notdisrupt dependency ordering As with the cur-rent C11 and C++11 standards the implementation

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack value_dep_preserving first6 78 struct liststack 9 struct liststack value_dep_preserving next

10 void t11 struct rcu_head rh12 1314 value_dep_preserving15 void ls_front(struct liststackhead head)16 17 value_dep_preserving void data18 value_dep_preserving struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 18 List-Based-Stack Restricted Type-BasedDesignation 1 of 2

would be required to emit a memory-barrier instruc-tion or compute an artificial dependency for such op-erations (Note however that use of cancelling orvalue-destruction operations on dependency chainshas proven quite rare in practice)

This approach shares many of the advantages ofTorvald Riegelrsquos approach

1 The implementation is simpler because no de-pendency chains need be traced The implemen-tation can instead drive optimization decisionsstrictly from type information

2 Use of the value dep preserving type modifierallows the developer to limit the extent of thedependency chains

3 This type modifier can be used to mark a depen-dency chainrsquos entry to and exit from a functionin a straightforward way without the need forattributes

4 The value dep preserving type modifiers serve

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 19: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 19

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 value_dep_preserving37 void ls_pop(struct liststackhead head)38 39 value_dep_preserving struct liststack lsp40 struct liststack lsnp141 value_dep_preserving struct liststack lsnp242 value_dep_preserving void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 19 List-Based-Stack Restricted Type-BasedDesignation 2 of 2

as valuable documentation of the developerrsquos in-tent

5 Although optimizations on a dependency chainare restricted just as in the current standardthe use of value dep preserving restricts thedependency chains to those intended by the de-veloper

6 Restricting dependency-breaking optimizationson all dependency chains marked value dep

preserving without exceptions for cases inwhich the compiler knows too much might makethis approach easier to learn and to use

It is expected that modeling this approach shouldbe straightforward because the modeling tools wouldbe able to make use of the type information Thisapproach results in the same code as shown in Fig-ures 18 and 19 of the previous section

74 Whole-Program Option

This approach also suggested off-list by Jeff Presh-ing has the goal of reusing existing non-dependency-ordered source code unchanged (albeit requiring re-compilation in most cases)9 For example this ap-proach permits an instance of stdmap to be refer-enced by a pointer loaded via memory order consume

and to provide that stdmap instance with thebenefits of dependency ordering without any codechanges whatsoever to stdmap It is important tonote that this protection will be provided only to aread-only stdmap that is referenced by a changingpointer loaded via memory order consume in partic-ular not to a concurrently updated stdmap refer-enced by a pointer (read-only or otherwise) loadedvia memory order consume This latter case wouldrequire changes to the underlying stdmap implemen-tation at a minimum changing some of the loads tobe memory order consume loads Nevertheless theability to provide dependency-ordering protection topre-existing linked data structures is valuable evenwith this read-only restriction

9 A module or library that is known to never carry a de-pendency need not be recompiled

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 20: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 20

This approach which again does require full re-compilation can be implemented using two ap-proaches

1 Promote all memory order consume loads tomemory order acquire as may be done withthe current standard

2 On architectures that respect memory order-ing prohibit all dependency-breaking optimiza-tions throughout the entire program but onlyin cases where a change in the value returnedby a memory order consume load could cause achange in the value computed later in that samedependency chain in other words where there isa global semantic dependency Note again thatthe possibility of storing a value obtained froma memory order consume load then loading itlater means that normal loads as well as memoryorder relaxed loads often must be consideredto head their own dependency chains but onlywhen loaded by the same thread that did thestore

Some implementations might allow the developerto choose between these two approaches for exampleby using a compiler switch provided for that purpose

This approach also has the effect of permitting atrivial implementation of a memory order consume

atomic thread fence() When using the first im-plementation approach the atomic thread fence()

is simply promoted to memory order acquire In-terestingly enough when using the second ap-proach the memory order consume atomic thread

fence() may simply be ignored The reason forthis is that this approach has the effect of promot-ing memory order relaxed loads to memory order

consume which already globally enforces all theordering that the memory order consume atomic

thread fence() is required to provide locally10

This approach has its own set of advantages anddisadvantages

10 Of course this presumed promotion from memory order

relaxed to memory order consume means that architecturessuch as DEC Alpha that do not respect dependency order-ing must continue to use the first option of emitting memory-ordering instructions for memory order consume loads

1 This approach dispenses with the [[carries

dependency]] attribute and the kill

dependency() primitive

2 This approach better promotes reuse of existingsource code In particular it should require nochanges to the current Linux-kernel source baseaside from changes to the rcu dereference()

family of primitives

3 This approach allows implementations to carryout dependency-breaking optimizations on de-pendency chains as long as a change in thevalue from the memory order consume load doesnot change values further down the dependencychain both with and without the optimizationJeff conjectures that the set of dependency-breaking optimizations used in practice applyonly outside of dependency chains by the re-vised definition in which single-value restrictionsbreak dependency chains11 If this conjectureholds it also applies to Torvaldrsquos approach de-scribed in Section 72

4 Code that follows the rules presented in Sec-tion 41 (substituting memory order consume

loads for volatile loads) would have its depen-dency ordering properly preserved

It is unlikely that this approach could be modeledreasonably given the current state of the art Therequirement that any given memory order consume

load be able to generate at least two different val-ues at the tail of the dependency chain is believedto be a show-stopper especially when coupled withwhole-program analysis which might find that thereis only one value entering at the head of the depen-dency chain

This approach allows annotations to be discardedas shown in Figures 20 and 21 However the memoryorder consume loads are still required in order toenable the promote-to-acquire implementation style

11 This is certainly the case for the usual optimizations ex-emplified by replacing (x-x) with zero

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 21: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 21

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 void ls_front(struct liststackhead head)15 16 void data17 struct liststack lsp1819 rcu_read_lock()20 lsp = rcu_dereference(head-gtfirst)21 if (lsp == NULL)22 data = NULL23 else24 data = rcu_dereference(lsp-gtt)25 rcu_read_unlock()26 return data27

Figure 20 List-Based-Stack Whole-Program Ap-proach 1 of 2

75 Local-Variable Restriction

This approach suggested off-list by Hans Boehmlimits the extent of dependency trees to a local whichincludes local variables temporaries function argu-ments and return variables Assigning a value from amemory order consume load to such an object beginsa dependency chain Assigning a value loaded fromsuch a local to a global variable (including function-local variables marked static) or to the heap impliesa kill dependency() so that dependency chains areconfined to locals However if the compiler is unableto see the full dependency chain for example be-cause it passes into a function in another translationunit that is not marked [[carries dependency]]the compiler should promote memory order consume

to memory order acquire12

Section 32 indicates that the following operatorsshould transmit dependency status from one localvariable or temporary to another -gt infix = casts

12 Some implementations might provide means to allow theuser to specify that a diagnostic be generated if such promotionis necessary

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 void ls_pop(struct liststackhead head)37 38 struct liststack lsp39 struct liststack lsnp140 struct liststack lsnp241 void data4243 rcu_read_lock()44 lsnp2 = rcu_dereference(head-gtfirst)45 do 46 lsnp1 = lsnp247 if (lsnp1 == NULL) 48 rcu_read_unlock()49 return NULL50 51 lsp = rcu_dereference(lsnp1-gtnext)52 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)53 while (lsnp1 = lsnp2)54 data = rcu_dereference(lsnp2-gtt)55 rcu_read_unlock()56 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)57 return data58

Figure 21 List-Based-Stack Whole-Program Ap-proach 2 of 2

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 22: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 22

prefix amp prefix [] infix + infix - ternary infix (bitwise) amp and probably also | Similarly Sec-tion 33 indicates that the following operators shouldimply a kill dependency() () == = ampamp ||infix and

It will also be necesary to check whether Linux-kernel usage expects dependency chains to passthrough globals and heap objects that are in someway thread-local If there are such use cases and ifthey are sane and cannot easily be changed to uselocal variables should [[carries dependency]] beused to flag dependency-carrying globals and heapobjects

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute if de-pendency chains are to span multiple translationunits as is the case in some parts of the Linuxkernel

2 The implementation is likely to be some-what simpler because only those dependencychains passing through local variables compiler-generated temporaries compiler-visible functionarguments and compiler-visible return valuesneed be traced One could also argue that func-tion arguments and return values marked with[[carries dependency]] attribute also need tobe traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 Although optimizations on dependency chainsmust be restricted the restricted scope of de-pendency chains reduces the impact of these re-strictions

5 Applying this approach to the Linux kernelwould only require the addition of markings onfunction parameters and return values corre-sponding to cross-translation-unit function callsHowever there are a significant number of theseso this approach can expect significant resistancefrom the Linux community

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 22 List-Based-Stack Local-Variable Restric-tion 1 of 2

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach allows local-variable annotations tobe dropped as shown in Figure 22 and 23

76 Mark Dependency-Carrying Lo-cal Variables

This approach suggested offlist by Clark Nelson usesthe [[carries dependency]] attribute to mark non-static local-scope variables as carrying a dependencyin addition to its current use marking function ar-guments and return values as carrying dependen-cies It is not permissible to mark global variablesor structure members with this attribute Assigningfrom a [[carries dependency]] object to a non-[[carries dependency]] object results in an im-plicit kill dependency()

This approach is similar to that of Section 73 ex-cept that it uses an attribute rather than a type mod-ifier As such it has many of the advantages and dis-

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 23: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 23

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 23 List-Based-Stack Local-Variable Restric-tion 2 of 2

advantages of that approach however some believethat an attribute-based approach will be more ac-ceptable to the committee than would a type-modifierapproach13 However this approach does requirethat C add attributes

This leave the question of which operatorstransmit dependency chains from one [[carries

dependency]] object to another Section 32 indi-cates that the following operators should transmitdependency status from one local variable or tem-porary to another -gt infix = casts prefix amp prefix [] infix + infix - ternary infix (bitwise) ampand probably also | Similarly Section 33 showsthat the following operators should imply a kill

dependency() () == = ampamp || infix and

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains passingthrough variables marked with the [[carries

dependency]] attribute need be traced

3 Many irrelevant dependency chains are prunedby default thus fewer stdkill dependency()

calls are required

4 The [[carries dependency]] calls serve asvaluable documentation of the developerrsquos in-tent

5 Although optimizations on dependency chainsmust be restricted use of explicit [[carries

dependency]] greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of variables carrying depen-dencies given that the Linux kernel currentlyrequires no such markings

13 Lawrence Crowl suggests a third approach namely a vari-able modifier

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 24: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 24

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 _Carries_dependency void data18 _Carries_dependency struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data = rcu_dereference(lsp-gtt)26 rcu_read_unlock()27 return data28

Figure 24 List-Based-Stack Marked Local Variables1 of 2

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

It is expected that modeling this approach shouldbe no more difficult than for the current C11 andC++11 standards

This approach results in code as shown in Fig-ures 24 and 25 where the [[carries dependency]]

attributes have been replaced with a mythicalCarries dependency C keyword

77 Explicitly Tail-Marked Depen-dency Chains

This approach suggested at Rapperswil by OlivierGiroux can be thought of as the inverse of

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 _Carries_dependency struct liststack lsp40 struct liststack lsnp141 _Carries_dependency struct liststack lsnp242 _Carries_dependency void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt)56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 25 List-Based-Stack Marked Local Variables2 of 2

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 25: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 25

1 p = atomic_load_explicit(ampgp memory_order_consume)2 if (p = NULL)3 do_it(atomic_dependency(p gp))

Figure 26 Explicit Dependency Operations

stdkill dependency() Instead of explicitlymarking where the dependency chains terminateOlivierrsquos proposal uses a stddependency() prim-itive to indicate the locations in the code that thedependency chains are required to reach The first ar-gument to stddependency() is the value to whichthe dependency must be carried and the second argu-ment is the variable that heads the dependency chainin other words the second argument is the variablethat was loaded from by a memory order consume

load This proposal differs from the others in thatit is expected to be implemented not necessarily bypreserving the dependency but instead by insertingbarriers in those cases where optimizations have elim-inated any required dependencies The goal here is toimpose minimal restrictions on optimizations of codecontaining dependency chains

A C-language example is shown in Figure 26where stddependency() is transliterated to the C-language atomic dependency() function On line 3atomic dependency() returns the value of its firstargument (p) while ensuring that the data depen-dency from the memory order consume load from gp

is faithfully reflected in the assembly language imple-menting this code fragment The assembly-languagereflection of this dependency might be in terms ofan assembly-language dependency (for example onARM or PowerPC) implicit memory ordering (forexample on x86 or mainframe) or by an explicitmemory-barrier instruction However if there was noatomic dependency() function the compiler wouldbe under no obligation to preserve the dependency14

These explicitly specified dependencies may becombined with [[carries dependency]] attributeson function arguments for example as shown in Fig-ure 27 Note the interplay of atomic dependency()

and [[carries dependency]] where line 8 estab-

14 Would it be better to have the second argument to atomic

dependency() be a label rather than an expression

1 void foo(struct bar q [[carries_dependency]])2 3 if (q = NULL)4 do_it(atomic_dependency(q-gtb q))5 67 p = atomic_load_explicit(ampgp memory_order_consume)8 foo(atomic_dependency(p gp))

Figure 27 Explicit Dependency Operations and car-ries dependency

lishes the dependency between the load from gp andthe [[carries dependency]] argument q of foo()and where line 4 establishes the further dependencybetween argument q of foo() and do it()s argu-ment

This approach is not yet complete One issue isthe possibility of a given operation being dependenton multiple memory order consume loads One ap-proach is of course to omit this functionality and an-other is to allow atomic dependency() to allow anexpression as its first argument and a variable list ofmemory order consume loaded variables

Another issue is connecting [[carries

dependency]] return values to subsequent atomic

dependency() invocations There are a numberof possible resolutions to this issue One approachwould be to use [[carries dependency]] attributeto mark the declaration of the variable to whichthe functionrsquos return value is assigned bringing theproposal from Section 76 to bear In the specialcase where the memory order consume load is in thesame function body as the atomic dependency()

that depends on it the atomic dependency()

could reference the variable that was the sourceof the original memory order consume load An-other approach would be to allow function-returncarries dependency]] attributes to define namesthat could be used by later atomic dependency()

invocations

A third issue arises when atomic dependency()

must be applied after the head of the dependencychain has gone out of scope for example if the headwas contained in a variable defined in an inner scopethat has since been exited

A fourth issue arises if optimizations along a

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 26: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 26

needed dependency chain allow ordering the depen-dent operation to precede the head of the depen-dency chain in which case inserting barriers wouldbe ineffective The current proposal for addressingthis issue is to suppress memory-movement optimiza-tions across the atomic dependency() perhaps us-ing something like atomic signal fence() or theLinux kernelrsquos barrier() macro This approach al-lows dependency checking and fence insertion to becarried out as a final pass in the compilation process

This approach has the following advantages anddisadvantages

1 This approach requires that the C language addthe [[carries dependency]] attribute

2 The implementation is likely to be simpler be-cause only those dependency chains having ex-plicit atomic dependency() calls (and option-ally intermediate [[carries dependency]] at-tributes) need be traced

3 Irrelevant dependency chains are pruned by de-fault with no stdkill dependency() calls re-quired

4 The atomic dependency() calls serve as valu-able documentation of the developerrsquos intent

5 Although optimizations on dependency chainsmust be restricted use of explicit atomic

dependency() greatly reduces unnecessary re-striction of optimizations on unintentional de-pendency chains

6 Applying this to the Linux kernel would requiresignificant marking of dependency chains giventhat the Linux kernel currently relies on implicitends of dependency chains

7 Common types of abstraction need to be han-dled correctly For example there are more than500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(atomic_dependency(lsp-gtt27 head-gtfirst))28 rcu_read_unlock()29 return atomic_dependency(data lsp-gtt)30

Figure 28 List-Based-Stack Tail-Marked Dependen-cies 1 of 2

It is not yet known whether this approach can bereasonably modeled

The result is shown in Figures 28 and 29

78 Explicitly Head-Marked Depen-dency Chains

This approach suggested via email by Olivier Girouxcan be thought of as another inverse of stdkilldependency() In this case the heads of the de-pendency chains are marked indicating to whichpointed-to objects dependencies should be carriedThis description is an extrapolation of a very con-cise proposal and corrections are welcome

The general idea is to provide an augmentedform of the load() member function that indicatesdependencies for example xload(memory order

consume x-gtnext) would cause a dependency tobe carried through the -gtnext field but throughno other field This is shown in Figure 30 where

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 27: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 27

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(atomic_dependency(lsnp2-gtt lsnp))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return atomic_dereference(data lsnp2-gtt)59

Figure 29 List-Based-Stack Tail-Marked Dependen-cies 2 of 2

1 struct foo 2 struct foo a3 struct foo b4 struct foo c5 int d6 78 p = atomic_load_explicit(ampgp memory_order_consume9 p-gta p-gtb)

10 qa = p-gta Dependency carried 11 qb = p-gtb Dependency carried 12 qc = p-gtc No dependency carried 13 d = p-gtd No dependency carried

Figure 30 Explicit Dependency Operations and Aug-mented Load

the explicit dependency information on line 9 causeslines 10 and 11 to carry a dependency but lines 12and 13 not to do so

Some open questions regarding this approach

1 How does this interact with arguments and re-turn values Do the corresponding annotationsneed to indicate to which fields dependenciesmight be carried Should mismatches be con-sidered an error and if so which sorts of mis-matches

2 How are opaque types handled For exam-ple consider the Linux kernel linked-list facil-ity which embeds a list head structure intothe enclosing object that is to be placed on thelist The memory order consume load returnsa (struct list head ) but it may be nec-essary to carry a dependency to one or more ofthe fields in the enclosing object Should thisbe handled via something like xload(memory

order consume x) but if so doesnrsquot thisre-introduce the need for lots of stdkill

dependency() calls

3 Larger structures might have quite a few fieldsthat need dependencies carried Should therebe some sort of shorthand to make this easierto code for example tagging the fields needingdependency ordering in the declaration of thestruct or class

4 Common types of abstraction need to be han-dled correctly For example there are more than

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 28: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 28

1 define rcu_dereference(x) 2 atomic_load_explicit((x) memory_order_consume)34 struct liststackhead 5 struct liststack first6 78 struct liststack 9 struct liststack next

10 void t11 struct rcu_head rh12 1314 _Carries_dependency15 void ls_front(struct liststackhead head)16 17 void data18 struct liststack lsp1920 rcu_read_lock()21 lsp = rcu_dereference(head-gtfirst head-gtfirst-gtt)22 if (lsp == NULL)23 data = NULL24 else25 data =26 rcu_dereference(lsp-gtt lsp-gtt)) 28 rcu_read_unlock()29 return data30

Figure 31 List-Based-Stack Head-Marked Depen-dencies 1 of 2

500 invocations of list for each entry rcu()

in the v41 Linux kernel each of which headsa distinct dependency chain In addition someof these distinct dependency chains invoke com-mon functions and macros It is not clear thatcompile-time marking suffices for these cases

79 Restricted Dependency Chains

This approach restricts dependency chains to opera-tions for which compilers would naturally carry de-pendencies As such this approach can be consid-ered to be a refinement of the whole-program option(Section 74) restricted as described in Section 412but also omitting control dependencies and RCU-protected array indexes As always a memory order

consume load heads a dependency chain and as al-ways a memory order consume load may be imple-mented with a simple load instruction on architec-tures such as x86 ARM and Power

1 int ls_push(struct liststackhead head void t)2 3 struct liststack lsp4 struct liststack lsnp15 struct liststack lsnp26 size_t sz78 sz = sizeof(lsp)9 sz = (sz + CACHE_LINE_SIZE - 1) CACHE_LINE_SIZE

10 sz = CACHE_LINE_SIZE11 lsp = malloc(sz)12 if (lsp)13 return -ENOMEM14 if (t)15 abort()16 lsp-gtt = t17 rcu_read_lock()18 lsnp2 = ACCESS_ONCE(head-gtfirst)19 do 20 lsnp1 = lsnp221 lsp-gtnext = lsnp122 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)23 while (lsnp1 = lsnp2)24 rcu_read_unlock()25 return 026 2728 static void ls_rcu_free_cb(struct rcu_head rhp)29 30 struct liststack lsp3132 lsp = container_of(rhp struct liststack rh)33 free(lsp)34 3536 _Carries_dependency37 void ls_pop(struct liststackhead head)38 39 struct liststack lsp40 struct liststack lsnp141 struct liststack lsnp242 void data4344 rcu_read_lock()45 lsnp2 = rcu_dereference(head-gtfirst head-gtfirst-gtnext)46 do 47 lsnp1 = lsnp248 if (lsnp1 == NULL) 49 rcu_read_unlock()50 return NULL51 52 lsp = rcu_dereference(lsnp1-gtnext lsnp-gtnext-gtt)53 lsnp2 = cmpxchg(amphead-gtfirst lsnp1 lsp)54 while (lsnp1 = lsnp2)55 data = rcu_dereference(lsnp2-gtt lsnp2-gtt))56 rcu_read_unlock()57 call_rcu(amplsnp2-gtrh ls_rcu_free_cb)58 return data59

Figure 32 List-Based-Stack Head-Marked Depen-dencies 2 of 2

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 29: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 29

1 define rcu_ptr_extract(p) 2 ( 3 uintptr_t ___ip = (uintptr_t)(p) 4 5 ___ip = ___ip amp ~0x7 6 (typeof(p)) ___ip 7 )89 define rcu_ptr_set_bits(p bits)

10 ( 11 uintptr_t ___ip = (uintptr_t)(p) 12 13 ___ip = ___ip amp ~0x7 14 ___ip = ___ip | (bits) 15 (typeof(p)) ___ip 16 )1718 define rcu_ptr_get_bits(p bits) 19 ( 20 uintptr_t ___ip = (uintptr_t)(p) 21 22 ___ip amp ~0x7 23 )

Figure 33 Pointers and Bit Manipulation on 64-BitSystem

This results in a specific list of operations that ex-tend dependency chains and a separate specific list ofoperations that terminate such chains each coveredin its own section

791 Extending Dependency Chains

The following primitive operations extend thatchain15

1 If any value is part of a dependency chain thenusing that value as the left-hand side of an as-signment expression extends the chain to coverthe assignment This rule is exercised in theLinux kernel by stores into fields making up anRCU protected data element

2 If any value is part of a dependency chain thenusing that value as the right-hand side of an as-signment expression extends the chain to coverboth the assignment and the value returned bythat assignment statement Line 20 of Figure 20shows how this rule may be used to extend adependency chain into a local variable

15 In case of operator overloading the actual functions calledmust be analyzed in order to determine their effects on depen-dency chains

3 If any value that is part of a dependency chain isstored to a non-shared variable then any valueloaded by a later load from that same variable bythat same thread is also part of the dependencychain Lines 20 and 24 of Figure 20 illustratethis rule though this rule would apply even iflocal variable lsp was a intptr t instead of apointer

4 If a pointer that is part of a dependency chainis stored to any variable then any value loadedby a later load from that same variable by thatsame thread is also part of the dependency chainLines 20 and 24 of Figure 20 illustrate this rulethough this rule would apply even if local vari-able lsp was instead a shared variable

5 If a pointer is part of a dependency chain thenadding an integral value to that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers and alsoto addition via the infix + operator and via thepostfix [] operator Note that the addition mustbe carried out on a pointer Casting to an inte-gral type and then carrying out the addition willbreak the dependency chain Therefore insteadof casting to an integral type to carry out theaddition cast to a pointer to char16 Line 24 ofFigure 20 illustrates this given that the -gtt actsas a pointer offset prior to indirection

6 If a pointer is part of a dependency chain thensubtracting an integer from that pointer extendsthe chain to the resulting value This appliesfor both positive and negative integers Againcasting to an integral type and then carryingout the subtraction will break the dependencychain so instead cast to a pointer to char TheLinux-kernel container of() macro illustratesthis This macro is used to find the beginning ofa structure given a pointer to a field within thatsame structure

16 Yes some old systems had strange formats for characterpointers and this restriction does exclude those systems fromthis nuance of dependency ordering However to the best ofmy knowledge all such systems were uniprocessors so this isnot a real problem

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 30: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 30

7 If a pointer is part of a dependency chain thendereferencing it using the prefix operator ex-tends the chain through the dereference opera-tion Line 24 of Figure 20 illustrates this giventhat the -gtt acts as a pointer offset prior to in-direction

8 If a pointer is part of a dependency chain thendereferencing it using the -gt field-selection oper-ator extends the chain to the field Note that thewhen the -gt operator is followed by one or more operators these latter operators are equiva-lent to adding a constant integer to the originalpointer Line 24 of Figure 20 directly illustratesthis rule

9 If a pointer is part of a dependency chain thencasting it (either explicitly or implicitly) to anypointer-sized type extends the chain to the re-sult Line 3 of Figure 33 illustrates this rule

10 If a value of type intptr t or uintptr t ispart of a dependency chain then casting it toa pointer type extends the chain to the resultLine 6 of Figure 33 illustrates this rule

11 If a value of type intptr t or uintptr t is partof a dependency chain then the bitwise infix amp

and | operators extend the dependency chain tothe resulting value Line 5 of Figure 33 illus-trates this rule

12 If a value of type intptr t or uintptr t is partof a dependency chain the the bitwise infix ^

operator extends the dependency chain to theresulting value The use case for this traversal ofbuddy-allocator-like lists or dense-array heapsbut it is not clear whether these use cases justifythis addition to dependency ordering

13 If a pointer is part of a dependency chain thenapplying the unary amp address-of operator op-tionally casting this address to a pointer type(perhaps repeatedly to different pointer typeseither explicitly or implicitly) then applying the dereference operator extends the chain to theresult This is used by some of the Linux-kernellist-processing macros

14 If a pointer is part of a dependency chain andthat pointer appears in the entry of a expres-sion selected by the condition then the chainextends to the result Please note that doesnot extend chains from its condition only fromits second or third argument

15 If a pointer to a function is part of a depen-dency chain then invoking the pointed-to func-tion extends the chain from the pointer to theinstructions executed Note that the exact mech-anism used to update instructions is implemen-tation defined and might require use of specialinstruction-cache-flush operations17

16 If a pointer is part of a dependency chain thenif that pointer is used as the actual parameter ofa function call the dependency chain extends tothe formal parameter

17 If a pointer is part of a dependency chain thenif that pointer is returned from a function thedependency chain extends to the returned valuein the calling function

18 If a given operation extends a dependency chainthen so does its atomic counterpart For exam-ple the rules applying to assignments also applyto atomic loads and stores

Any other operation terminates a dependencychain

792 Terminating Dependency Chains

Even though all other operations terminate depen-dency chains there are a few that deserve specialmention

1 Equality comparisons

2 Narrowing magnitude comparisons

3 Narrowing arithmetic operations

4 Narrowing bitwise operations

17 This may sound strange but just you try implementingdynamic linking without the ability to update instructions

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 31: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 31

1 struct bar 2 struct bar next3 int a4 int b5 6 struct bar head = amphead 1 2 78 for (p = head-gtnext p p = rcu_dereference(p-gtnext)) 9 foo += p-gta

10 if (p == amphead)11 break12 13 bar = head-gtb

Figure 34 Back-Propagation of Dependency-ChainBreakage Due to Comparisons

1 if (p gt ampfoo)2 do_something(p)3 else if (p lt ampfoo)4 do_something_else(p)5 else6 do_something_nodep(p)

Figure 35 Inequality-Comparison Dependency-Chain Breakage

5 Storing non-pointers into shared variables

6 Passing values between threads without using amemory order consume load

7 Undefined behavior

8 Use of kill dependency

Each of these is covered below

Equality comparisons If a pointer is part of adependency chain then a == or = comparison thatcompares equal to some other pointer where thatother pointer is not part of any dependency chainwill cause any uses of the original pointer to no longerbe part of the dependency chain This dependency-chain breakage can back-propagate to earlier uses ofthe pointer so that in Figure 34 if the comparisonon line 10 compares equal then the access on line 9is not part of the dependency chain This is admit-tedly a rather strange code fragment and besidesthe Linux-kernel barrier() macro could prevent thisif placed between lines 9 and 10 Furthermore the

Linux kernelrsquos list macros avoid this situation becausethe equal comparison terminates the loop

So what if the compiler introduces an equality com-parison This might happen when doing feedback-directed optimization where the compiler might no-tice (for example) that a particularly statically allo-cated structure was almost always the first elementon a given list The compiler might therefore in-troduce a specialization optimization comparing theaddresses and generating code using the statically al-located structure on equals comparison On the onehand in the cases where the Linux kernel adds a stat-ically allocated structure to an RCU-protected linkeddata structure that structure has been initialized atcompile time so that dependency ordering is not re-quired On the other hand this appears to be anextremely dubious optimization for linked data struc-tures In a great many cases the added overhead ofthe comparison would overwhelm the benefits of gen-erating code based on the statically allocated struc-ture

High-quality implementations would therefore beexpected to provide means for disabling this sort ofoptimization especially for pointers obtained fromthe heap After all use of statically allocated struc-tures in RCU-protected lists could be quite usefulduring out-of-memory conditions in which case thespecialization optimization would almost always re-duce performance which is not what optmizationsare supposed to be doing

Narrowing magnitude comparisons A seriesof gt lt gt= or lt= operators that informs the compilerof the exact value of a pointer causes that pointerto no longer be part of the dependency chain SeeFigure 35 for an example of this On line 6 of thisfigure the compiler knows that the value of p is equalto ampfoo so although there is dependency ordering tolines 2 and 4 there is no dependency ordering toline 6 This dependency-chain breakage can back-propagate just as for equality comparisons How-ever dependencies are maintained for normal usesfor exmaple the use of comparisons for deadlockavoidance when acquiring locks contained in multi-ple RCU-protected data elements

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 32: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 32

Narrowing arithmetic operations If a pointeris part of a dependency chain and if the valuesadded to or subtracted from that pointer cancel thepointer value so as to allow the compiler to pre-cisely determine the resulting value then the result-ing value will not be part of any dependency chainFor example if p is part of a dependency chain then((char )p-(uintptr t)p)+65536 will not be18

Narrowing bitwise operations If a value of typeintptr t or uintptr t is part of a dependency chainand if that value is one of the operands to an amp

or | infix operator whose result has too few or toomany bits set then the resulting value will not bepart of any dependency chain For example on a64-bit system if p is part of a dependency chainthen (p amp 0x7) provides just the tag bits and nor-mally cannot even be legally dereferenced Similarly(p | ~0) normally cannot be legally dereferencedHowever (p amp ~0x7) will provide a usable pointerthat is part of prsquos dependency chain Setting orclearing bits that are not used by the implementa-tion or that would be zero for a properly aligned ob-ject will not break the dependency chain That saidthe compiler might not know the definition of ldquoprop-erly alignedrdquo for example in the cases of manuallycache-aligned or page-aligned objects

Storing non-pointers into shared variables Ifa non-pointer value that is part of a dependency chainis stored into a shared variable then the dependencychain does not extend to a later load from that vari-able

Passing values between threads If a value thatis part of a dependency chain is stored into a variableby one thread and loaded from that same variable bysome other thread using either a non-atomic load ora memory order relaxed load then the dependencychain does not extend to the second thread To getthis effect the second thread would instead need touse a memory order consume load Note that this

18 That said 57p4 of C++ and 656p8 of C both say thatindexing outside of an object is undefined behavior so the lossof dependency ordering is likely the least of the problems here

would extend the dependency chain even if the cor-responding store was a memory order relaxed storebecause the required store-side ordering is providedby the dependency chain

Undefined behavior If undefined behavior is in-voked then consistent with the notion of undefinedbehavior there are no dependency-chain guarantees

If a given pointer intptr t or uintptr t is usedsuch that only one value avoids undefined behaviorthen the dependency chain is broken in the same wayas it would be in the case of an equality comparisonwith that same value

kill dependency() The result of calling kill

dependency is never part of any dependency chainThis operation can be used to suppress diagnosticsthat implementations might omit for likely misusesof dependency ordering

793 Restricted Dependency Chains and theLinux Kernel

This covers all known pointer-based RCU uses in theLinux kernel aside from RCU-protected array in-dexes (more on these later) However as noted ear-lier these restrictions might prove too constrainingfor future code Therefore it might be necessary tocombine this approach with some variation on one ofthe methods for explicitly marking variables formalparameters and return values that are intended tocarry dependencies

Section 32 discussed dependency chains headedby memory order consume loads of integers that arelater used as array indexes or pointer offsets Al-though there are a (very) few such uses in the Linuxkernel accommodating dependency chains headed byloads of integers greatly complicates the handling ofdependency chains For example given an integer x

produced by a memory order consume load we mustcorrectly handle expressions containing (x - x) in-cluding cases where the cancellation is not at all obvi-ous from the source code In contrast given a pointerp the expression (p - p) not a pointer and the rulesgiven above do not require carrying a dependency

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 33: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 33

1 p = atomic_load_explicit(gp memory_order_consume)2 if (p == ptr_a) 3 q = kill_dependency(p)4 a = q-gtspecial_a5 else 6 a = p-gtnormal_a7

Figure 36 Avoiding Diagnostic Due To Dependency-Ordering Value-Narrowing Hazard

through such an expression This approach there-fore excludes dependency chains headed by memory

order consume loads from non-pointer atomic vari-ables

This raises the question of what should be doneabout the Linux kernel code that relies on order-ing carried through integer array indexes Paulhas (hopefully) answered this question by creatinga Linux-kernel patch that removes the kernelrsquos de-pendency on RCU-protected array indexes [22]

794 Restricted Dependency Chains Ad-vantages and Disavantages

This approach to dependency ordering has the fol-lowing advantages and disadvantages

1 There is no need for compilers to trace depen-dency chains Instead dependencies are an au-tomatic result of current code-generation and op-timization practices

2 There would be little or no limitation on com-piler optimizations Compilers are no longer re-quired to establish artificial dependencies for ex-pressions such as (x - x) or to detect value-narrowing hazards involving == and =

3 The breaking of dependency chains when apointer compares equal to some other pointermight prove to be onerous but it is not aproblem for current Linux use cases19 Themost common cases are comparison against alist header (in which case an equality compar-ison terminates the traversal) and comparison

19 This needs to be re-verified

against NULL (in which case an equality com-parison indicates a pointer that cannot be deref-erenced in any case)

4 The compiler is prohibited from carrying outvalue-speculation optimizations on pointers orvalues of type intptr t or uintptr t that havebeen cast from a pointer type and subjected tono operations other than infix amp | and ^ (Isthis an advantage or a disadvantage The an-swer to this question is left to the reader)

5 In a great many cases dependency chains canreliably pass through library functions compiledby pre-C11 compilers

6 It would not be necessary to use stdkill

dependency() calls in most cases That saiduse of stdkill dependency() might at somefuture point allow the compiler to produce bet-ter diagnostics for dubious dependency-chain usecases For example a compiler might issue awarning for the value-narrowing hazard shownon line 3 of Figure 13 and that diagnostic mightbe suppressed as shown in Figure 36

7 More generally this approach allows annotationsto be discarded as was shown in Figures 20and 21 However the memory order consume

loads are still required in order to support DECAlpha and prevent compiler optimizations thatmight otherwise destroy the dependency chain

8 It would not be necessary to use [[carries

dependency]] attributes However as withstdkill dependency() use of something like[[carries dependency]] might produce bet-ter diagnostics for dubious use dependency-chainuse cases That said it might be preferable tosubstitute either type or variable modifiers forattributes given that use of attributes is not per-mitted to change the meaning of the programand also that attributes are not yet supportedby the C language

9 With the exception of one use case involving ar-rays no changes to the Linux-kernel source code

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 34: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 34

are required As noted earlier a patch is avail-able to remove the Linux kernelrsquos dependency onRCU-protected array indexes [22]

In the future this approach could be augmentedby attributes type modifiers variable modifiers orother markings to allow more elaborate dependencychains to be created on the one hand and to improvethe compilerrsquos ability to emit diagnostics for dubioususes of dependency chains on the other

710 Evaluation

This evaluation starts by enumerating the differentaudiences that any change to memory order consume

must address (Section 7101) and then compares thevarious proposals based on the perceived viewpointsof these audiences (Section 7102)

7101 Audiences

The main audiences for any change to memory

order consume include standards committee mem-bers compiler implementers formal-methods re-searchers developers intending to write new codeand developers working with existing RCU code TheLinux kernel community is of course a notable exam-ple of this last category

Standards committee members would like a cleanand non-intrusive change to the standard Theywould of course also like solutions minimizing thenumber and vehemence of complaints from the otheraudiences or failing that reducing the complaintsto a tolerable noise level

Compiler implementers would like a mechanismthat fits nicely into current implementations whichdoes much to explain their satisfaction with the ap-proach of strengthening memory order consume tomemory order acquire In particular they wouldlike to avoid unbounded tracing of dependencies andwould prefer minimal constraints on their ability toapply time-honored optimizations

Formal-methods researchers would like a definitionof memory order consume that fits into existing the-oretical frameworks without undue conceptual vio-lence Of particular concern is any need to deal with

counter-factuals in other words any need to rea-son not only about values of variables required forthe solution of a given litmus test but also aboutother unrelated values for these variables As suchcounter-factuals are the rock upon which otherwiseattractive approaches involving semantic dependencyhave foundered20 Some practitioners might won-der why the opinion of formal-methods researchersshould be given any weight at all and the answer tothis question is that it is the work of formal-methodsresearchers that provides us the much-needed toolsthat we need to analyze both the memory-orderingspecification itself as well as programs using thatspecification

Developers writing new code need something thatexpresses their algorithm with a minimum of syn-tactic saccharine that is easy to learn and that iseasy to maintain For example one of the weak-nesses of the current standardsrsquo definition of memoryorder consume is the need to sprinkle large numbersof kill dependency() calls throughout onersquos codeIn short developers would like it to be easy to writeanalyze and maintain code that uses dependency or-dering

Developers with existing RCU code have the samedesires as do developers writing new code but arealso very interested in minimizing the code churn re-quired to adhere to the standard

The challenge if of course to find a proposal thataddresses the viewoints of all of these audiences Aswe will see in the next session this is not easy How-ever there is some hope that the approach presentedin Section 79 might suffice

7102 Comparison

A summary comparison of the proposals is shown inTable 1

The dependency type can either be ldquodeprdquo fornormal dependency ldquordeprdquo for the restricted de-pendencies discussed in Section 79 ldquosdeprdquo for(global) semantic dependency or ldquolsdeprdquo for local

20 That said Alan Jeffries is making another attempt tocome up with a suitable formal definition of semantic depen-dency

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 35: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 35

Dep

enden

cyType

Variable

Marking

Form

al-Parameter

Marking

Return-V

alueMarking

Beginning-O

f-Chain

Handling

End-O

f-Chain

Handling

Dep

enden

cyTracingRequired

CAttribute

Support

Req

uired

C11 C++11 dep A A K Y Y

Type-Based Designation of DependencyChains With Restrictions (Section 72)

lsdep T T T k

Type-Based Designation of DependencyChains (Section 73)

dep T T T k

Whole-Program Option (Section 74) sdep

Local-Variable Restriction (Section 75) dep A A k Y

Mark Dependency-Carrying Local Variables(Section 76)

dep A A A k Y

Explicitly Tail-Marked Dependency Chains(Section 77)

dep A A Dk y Y

Explicitly Head-Marked Dependency Chains(Section 78)

dep D k y Y

Restricted Dependency Chains (Section 79) rdep

Marking ldquoArdquo attribute ldquoTrdquo typeBeginning of chain ldquoDrdquo explicit designationEnd of chain ldquoDrdquo explicit designation ldquokKrdquo implicitexplicit kill dependencyDependency tracking required ldquoyrdquo only for marked chains ldquoYrdquo always

Table 1 Comparison of Consume Proposals

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 36: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 36

semantic dependency21 Variable formal-parameterand return-value marking can either be type-based(ldquoTrdquo) attribute-based (ldquoArdquo) or not required (ldquo rdquo)Beginning-of-chain handling can either require ex-plicit indication to which quantities dependenciesmust be carried (ldquoDrdquo) or nothing (ldquo rdquo) End-of-chain handling can either require an explicit kill

dependency (ldquoKrdquo) an implicit kill dependency

(ldquokrdquo) explicit designation of dependency (ldquoDrdquo) ornothing (ldquo rdquo)22 Dependency tracking might berequired for all chains (ldquoYrdquo) explicitly designatedchains (ldquoyrdquo) or not required at all (ldquo rdquo) C-language[[carries dependency]] support might be required(ldquoYrdquo) or not (ldquo rdquo)

The ideal proposal would have dependency typeldquodeprdquo (thus making it easier to model dependencyordering and making it unnecessary for developer tohave to outwit full-program optimizations) no needfor variable formal-parameter or return-value mark-ing (thus minimizing changes required for existingRCU code) implicit ldquodo the right thingrdquo end-of-chainhandling23 (thus minimizing the need for whack-a-mole source-code markups) no need for dependencytracking (thus making it easier to implement) andno need for C-language support for the [[carries

dependency]] attribute (thus minimizing changes tothe C standard)

7103 Other Approaches

If the C standards committee is unwilling to ac-cept attributes perhaps a new keyword such asCarries dependency would be an acceptable alter-native (This was suggested at the 2014 UIUC meet-ing but I cannot recall who suggested it)

21 Recall that a local semantic dependency remains a de-pendency even if the memory order consume load at its headcan return only a single value In contrast a global seman-tic dependency remains a dependency only if more than onevalue can appear at the end of the chain Therefore optimiza-tions based on global full-program analysis can break a globalsemantic dependency but can break neither a local semanticdependency nor a normal dependency

22 Variables that go out of scope always have any dependencychain implicitly killed

23 Perhaps implemented by a careful choice of exactly whichoperators carry dependencies in which situations

It might also be possible to combine different as-pects of the various proposals perhaps even arrivingat an improved proposal

Nevertheless we clearly have some more work todo

8 Summary

This document has analyzed Linux-kernel use of de-pendency ordering and has laid out the status-quo in-teraction between the Linux kernel and pre-C11 com-pilers It has also put forward some possible ways ofbuilding towards a full implementation of C11rsquos andC++11rsquos handling of dependency ordering Finallyit calls out some weaknesses in C11rsquos and C++11rsquoshandling of dependency ordering and offers some al-ternatives

References

[1] Alglave J Maranget L Pawan PSarkar S Sewell P Williams D andNardelli F Z PPCMEMARMMEM A toolfor exploring the POWER and ARM memorymodels httpwwwclcamacuk~pes20

ppc-supplementalpldi105-sarkarpdfJune 2011

[2] Alglave J Maranget L andTautschnig M Herding cats Modellingsimulation testing and data-mining for weakmemory In Proceedings of the 35th ACM SIG-PLAN Conference on Programming LanguageDesign and Implementation (New York NYUSA 2014) PLDI rsquo14 ACM pp 40ndash40

[3] ARM Limited ARM Architecture ReferenceManual ARMv7-A and ARMv7-R Edition2010

[4] Boehm H J Space efficient conservativegarbage collection SIGPLAN Not 39 4 (Apr2004) 490ndash501

[5] Bonzini P and Day M RCU imple-mentation for Qemu httplistsgnu

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 37: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 37

orgarchivehtmlqemu-devel2013-08

msg02055html August 2013

[6] Dalton M THE DESIGN AND IMPLE-MENTATION OF DYNAMIC INFORMATIONFLOW TRACKING SYSTEMS FOR SOFT-WARE SECURITY PhD thesis StanfordUniversity 2009 Available httpcsl

stanfordedu~christospublications

2009michael_daltonphd_thesispdf

[Viewed March 9 2010]

[7] Desnoyers M [RFC git tree] userspace RCU(urcu) for Linux httpurcuso February2009

[8] Desnoyers M McKenney P E SternA Dagenais M R and Walpole JUser-level implementations of read-copy updateIEEE Transactions on Parallel and DistributedSystems 23 (2012) 375ndash382

[9] Grisenthwaite R ARM Barrier LitmusTests and Cookbook ARM Limited 2009

[10] Howard P W and Walpole J A rel-ativistic enhancement to software transactionalmemory In Proceedings of the 3rd USENIX con-ference on Hot topics in parallelism (BerkeleyCA USA 2011) HotParrsquo11 USENIX Associa-tion pp 1ndash6

[11] Howard P W and Walpole J Relativis-tic red-black trees Concurrency and Computa-tion Practice and Experience (2013) nandashna

[12] Intel Corporation A Formal Speci-fication of Intel Itanium Processor Fam-ily Memory Ordering 2002 Avail-able httpdeveloperintelcom

designitaniumdownloads251429htm

ftpdownloadintelcomdesign

ItaniumDownloads25142901pdf [ViewedJanuary 10 2007]

[13] International Business Machines Corpo-ration Power ISA Version 207 2013

[14] Kannan H Ordering decoupled metadata ac-cesses in multiprocessors In MICRO 42 Pro-ceedings of the 42nd Annual IEEEACM Inter-national Symposium on Microarchitecture (NewYork NY USA 2009) ACM pp 381ndash390

[15] McCarthy J Recursive functions of symbolicexpressions and their computation by machinepart i Commun ACM 3 4 (Apr 1960) 184ndash195

[16] McKenney P E Read-copy update(RCU) usage in Linux kernel Availablehttpwwwrdropcomuserspaulmck

RCUlinuxusagerculocktabhtml [ViewedJanuary 14 2007] October 2006

[17] McKenney P E What is RCU part 2Usage Available httplwnnetArticles

263130 [Viewed January 4 2008] January2008

[18] McKenney P E The RCU API 2010 edi-tion httplwnnetArticles418853 De-cember 2010

[19] McKenney P E Validating memory barri-ers and atomic instructions httplwnnet

Articles470681 December 2011

[20] McKenney P E Is Parallel ProgrammingHard And If So What Can You Do About Itkernelorg Corvallis OR USA 2012

[21] McKenney P E Structured deferral syn-chronization via procrastination CommunACM 56 7 (July 2013) 40ndash49

[22] McKenney P E [PATCH tipcorercu14] mce Stop using array-index-based RCUprimitives [PATCHtipcorercu14]mce

Stopusingarray-index-basedRCUprimitivesMay 2015

[23] McKenney P E Purcell C Al-gae Schumin B Cornelius G Qwer-tyus Conway N Sbw Blainster Ru-fus C Zoicon5 Anome and Eisen HRead-copy update httpenwikipedia

orgwikiRead-copy-update July 2006

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 38: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 38

[24] McKenney P E and Slingwine J DRead-copy update Using execution history tosolve concurrency problems In Parallel andDistributed Computing and Systems (Las VegasNV October 1998) pp 509ndash518

[25] McKenney P E and Walpole J What isRCU fundamentally Available httplwn

netArticles262464 [Viewed December 272007] December 2007

[26] Michael M M Hazard pointers Safe mem-ory reclamation for lock-free objects IEEETransactions on Parallel and Distributed Sys-tems 15 6 (June 2004) 491ndash504

[27] Rossbach C J Hofmann O S PorterD E Ramadan H E Bhandari A andWitchel E TxLinux Using and managinghardware transactional memory in an operatingsystem In SOSPrsquo07 Twenty-First ACMSymposium on Operating Systems Principles(October 2007) ACM SIGOPS Avail-able httpwwwsosp2007orgpapers

sosp056-rossbachpdf [Viewed October 212007]

[28] Toit S D Working draft stan-dard for programming language C++httpwwwopen-stdorgjtc1sc22wg21

docspapers2013n3691pdf May 2013

[29] Vigueras G Orduna J M and LozanoM A Read-Copy Update based parallel serverfor distributed crowd simulations The Journalof Supercomputing (Apr 2012)

Change Log

This paper first appeared as N4026 in May of 2014Revisions to this document are as follows

bull Add a suggestion to Section 71 that developersprovide diagnostics for singleton arrays when us-ing RCU-protected indexes (May 28 2014)

bull Apply changes from Torvald Riegel feedback

1 Move Section 71 which summarizes theLinux-kernel uses of dependency chains upto the beginning of Section 7

2 Fix some typos in Section 74 and addwords accounting for DEC Alpha whichmust continue to promote memory order

consume loads to memory order acquire

with this sectionrsquos approach Also clarifyJeff Preshingrsquos conjecture about the lack ofoptimizations that break semantic depen-dency chains

(June 5 2014)

bull Add Section 75 which covers Hans Boehmrsquosapproach which restricts dependency chains tolocal variable Also add Section 76 whichcovers Clark Nelsorsquos approach which usesthe [[carries dependency]] attribute to markthose local variables that carry dependencies(July 17 2014)

bull Mark the paper as a revision of N4036 (July 172014)

bull Add Jeff Preshing Hans Boehm and Clark Nel-son as authors and clarify Hansrsquos proposal inSection 75 (August 13 2014)

bull Add Olivier Girouxrsquos proposal for explicitlymarking the tails of dependency chains as Sec-tion 77 and add Olivier as author Update theadvantages and disadvantages of Jeff Preshingrsquoswhole-program option in Section 74 Add advan-tages and disadvantages for Hans Boehmrsquos pro-posal in Section 75 and for Clark Nelsonrsquos pro-posal in Section 76 (August 23 2014)

bull Update Olivier Girouxrsquos explicit dependency-chain marking proposal in Section 77 based onfeedback from Olivier (August 24 2014)

bull Add a section header (Section 411) for thedependency-chain rules for 2014 GCC implemen-tations Add Section 413 laying out the sim-pler rules for managing dependency chains in theolder 1990s Sequent C implementations Adda footnote to Section 76 recording Lawrence

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 39: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 39

Crowlrsquos preference for variable modifiers insteadof either attributes or type modifiers Addwords to Section 77 noting that atomic signal

fence() or the Linux kernelrsquos barrier() can beused to suppress code-motion optimizations thatmight otherwise prevent fixing up tail-markeddependency chains that were broken by code-motion optimizations (October 1 2014)

bull Add Section 34 which describes operators thatact as the last link in the Linux kernelrsquos depen-dency chains including the ambiguous role ofthe -gt operator (October 1 2014)

bull Update plots of RCU API usage (October 52014)

bull Flesh out Section 412 which lays out a shorthistory of dependency-chain management rulesfor the Linux kernel Numerous minor clarifica-tions and corrections throughout the document(October 5 2014)

bull Add ldquoandrdquo before final authorrsquos name and emailaddress Also add a pointer back to N4036 (Oc-tober 5 2014)

bull Add a brain-dead build script (October 52014)

bull Add Section 710 which has the beginnings ofan evaluation of the various proposals (October5 2014)

The paper was then published as N4215 whichwas revised as follows

bull Add verbiage to Sections 72 and 7102 differ-entiating between various forms of syntactic andsemantic dependencies (November 11 2014)

bull Add Section 7103 suggesting a C keyword in-stead of the C++ [[carries dependency]] at-tribute This section also suggests that combin-ing ideas from several proposals might be helpful(November 11 2014)

bull Add Section 78 which contains a first attemptto describe Olivier Girouxrsquos head-marked depen-dency chains (November 11 2014)

bull Add example code in Figures 15 and 16 Addsimilar figures to the proposals showing how theymark the dependency chains in the sample code(November 11 2014)

At this point the paper was published as N4321which was revised as follows

bull Add a warning against using operands to amp and| that leave only one bit set (or respectivelycleared) to Section 411 (April 9 2015)

bull Update plots of RCU API usage (April 192014)

bull Add Section 79 which restricts dependencychains with the goal of allowing unmarked de-pendency chains while avoiding restricting com-piler optimizations One important restriction isthe elimination of RCU-protected array indexes(April 19 2014)

bull Add verbiage to Section 32 noting that the amp

operator is sometimes used to find the beginningof a power-of-two aligned structure (April 192014)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Linus Torvalds to Sec-tion 79 (May 20 2015)

bull Apply feedback from Torvald Riegel to Sec-tion 79 (May 20 2015)

bull Donrsquot allow integers to carry dependencies intoor out of unmarked functions in Section 79(May 21 2015)

bull Update Section 79 to note that assignments can-not extend a dependency chain from one threadto another and that assignments cannot extendan intptr t- or uintptr t-based dependencychain through a shared variable even within agiven thread (May 21 2015)

bull Update Section 79 to note that atomic op-erations can extend dependency chains in thesame way that their non-atomic counterpartscan (May 21 2015)

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary
Page 40: N4321: Towards Implementation and Use of memory order … · 2015-07-14 · N4321: Towards Implementation and Use of memory order consume Doc. No.: WG21/N4321 (revised) Date: 2015-04-19

WG21N4321 40

bull Update Section 79 to explicitly call out the factthat stores and subsequent loads can extend de-pendency chains (May 21 2015)

bull Update Section 79 to explicitly state that un-defined behavior terminates dependency chainsand that if only one value of a pointer avoidsundefined behavior any dependency chainsthrough that pointer at that point are termi-nated (May 22 2015)

bull Update Section 79 to document the difficul-ties stemming from value-specialization opti-mizations (May 23 2015)

bull Update Section 79 to note the back-propagationof the dependency-chain-breaking effects of thecompiler deducing the exact value of a pointer(May 27 2015)

bull Update Section 79 to give more informationon how much bit-setting and bit-clearing is toomuch (June 8 2015)

bull Update Section 79 to note that the Linux ker-nel does not rely on dependency ordering whena statically allocated structure is linked into anRCU-protected structure but also note that op-timizations breaking dependency ordering in thiscase seem quite dubious (July 11 2015)

bull Update Section 79 illustrating dependency-chain rules with examples (July 11 2015)

bull Update Section 79 noting that simple loads suf-fice for memory order consume loads (July 112015)

bull Update Sections 77 and 78 calling out abstrac-tion challenges for head- and tail-marking ap-proaches (July 13 2015)

bull Add subsections to Section 79 (July 13 2015)

bull Update Section 79 to amplify objections to spe-cialization optimizations applied to heap-basedpointers (July 13 2015)

  • 1 Introduction
  • 2 Introduction to RCU
  • 3 Linux-Kernel Use Cases
    • 31 Types of Linux-Kernel Dependency Chains
    • 32 Operators in Linux-Kernel Dependency Chains
    • 33 Operators Terminating Linux-Kernel Dependency Chains
    • 34 Operators Acting as Last Link in Linux-Kernel Dependency Chains
    • 35 Linux-Kernel Dependency Chain Length
      • 4 Dependency Ordering in Pre-C11 Implementations
        • 41 Rules for C-Language RCU Users
          • 411 Rules for 2014 GCC Implementations
          • 412 Rules for 2003 GCC Implementations
          • 413 Rules for 1990s Sequent C Implementations
            • 42 Rules for C-Language Implementers
              • 5 Dependency Ordering in C11 and C++11 Implementations
              • 6 Weaknesses in C11 and C++11 Dependency Ordering
              • 7 Potential Alternatives to C11 and C++11 Dependency Ordering
                • 71 Revising C11 and C++11 Dependency-Ordering Definition
                • 72 Type-Based Designation of Dependency Chains With Restrictions
                • 73 Type-Based Designation of Dependency Chains
                • 74 Whole-Program Option
                • 75 Local-Variable Restriction
                • 76 Mark Dependency-Carrying Local Variables
                • 77 Explicitly Tail-Marked Dependency Chains
                • 78 Explicitly Head-Marked Dependency Chains
                • 79 Restricted Dependency Chains
                  • 791 Extending Dependency Chains
                  • 792 Terminating Dependency Chains
                  • 793 Restricted Dependency Chains and the Linux Kernel
                  • 794 Restricted Dependency Chains Advantages and Disavantages
                    • 710 Evaluation
                      • 7101 Audiences
                      • 7102 Comparison
                      • 7103 Other Approaches
                          • 8 Summary