
Distrib. Comput. (1997) 10: 99-116

Software transactional memory*
Nir Shavit**, Dan Touitou

School of Mathematical Sciences, Tel-Aviv University, Ramat Aviv, Tel-Aviv 69978, Israel

Received: January 1996 /Revised: June 1996 /Accepted: August 1996

Summary. As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a Load-Linked/Store-Conditional operation on a single word. Building on the hardware based transactional synchronization methodology of Herlihy and Moss, we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is non-blocking, and can be implemented on existing machines using only a Load-Linked/Store-Conditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to non-blocking ones based on implementing a k-word compare&swap STM-transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms the non-blocking translation methods in the style of Barnes, and outperforms Herlihy's translation method for sufficiently large numbers of processors. The key to the efficiency of our software-transactional approach is that unlike Barnes style methods, it is not based on a costly "recursive helping" policy.

Key words: Multiprocessor synchronization - Lock-free - Transactional memory - Distributed shared memory

1 Introduction

A major obstacle on the way to making multiprocessor machines widely acceptable is the difficulty of programmers in designing highly concurrent programs and data structures. Given the growing realization that unpredictable delay is an increasingly serious problem in modern multiprocessor architectures, we argue that conventional techniques for implementing concurrent objects by means

* A preliminary version of this paper appeared in the 14th ACM Symposium on the Principles of Distributed Computing, Ottawa, Ontario, Canada, 1995

Correspondence to: N. Shavit (e-mail: shanir@theory.lcs.mit.edu)

of critical sections are unsuitable, since they limit parallelism, increase contention for memory and interconnect, and make the system vulnerable to timing anomalies and processor failures. The key to highly concurrent programming is to decrease the number and size of critical sections a multiprocessor program uses (possibly eliminating critical sections altogether) by constructing classes of implementations that are non-blocking [7, 15, 16]. As we learn from the literature, flexibility in choosing the synchronization operations greatly simplifies the task of designing non-blocking concurrent programs. Examples are the non-blocking data-structures of Massalin and Pu [24] which use a Compare&Swap on two words, Anderson's [2] parallel path compression on lists which uses a special Splice operation, the counting networks of [5] which use a combination of Fetch&Complement and Fetch&Inc, Israeli and Rappoport's Heap [20] which can be implemented using a three-word Compare&Swap, and many more. Unfortunately, most of the current or soon to be developed architectures support operations on the level of a Load-Linked/Store-Conditional operation for a single word, making most of these highly concurrent algorithms impractical in the near future.

Bershad [7] suggested to overcome the problem of providing efficient programming primitives on existing machines by employing operating system support. Herlihy and Moss [17] have proposed an ingenious hardware solution: transactional memory. By adding a specialized associative cache and making several minor changes to the cache consistency protocols, they are able to support a flexible transactional language for writing synchronization operations. Any synchronization operation can be written as a transaction and executed using an optimistic algorithm built into the consistency protocol. Unfortunately though, this solution is blocking.

This paper proposes to adopt the transactional approach, but not its hardware based implementation. We introduce software transactional memory (STM), a novel design that supports flexible transactional programming of synchronization operations in software. Though we cannot aim for the same overall performance, our software transactional memory has clear advantages in terms of applicability to today's machines, portability among


machines, and resiliency in the face of timing anomalies and processor failures.

We focus on implementations of a software transactional memory that support static transactions, that is, transactions which access a pre-determined sequence of locations. This class includes most of the known and proposed synchronization primitives in the literature.

1.1 STM in a nutshell

In a non-faulty environment, the way to ensure the atomicity of the operations is usually based on locking or acquiring exclusive ownerships on the memory locations accessed by a given operation Op. If a transaction cannot acquire an ownership it fails, and releases the ownerships already acquired. Otherwise, it succeeds in executing Op and frees the ownerships acquired. To guarantee liveness, one must first eliminate deadlocks, which for static transactions is done by acquiring the ownerships needed in some increasing order. In order to continue ensuring liveness in a faulty environment, we must make certain that every transaction completes even if the process which executes it has been delayed, swapped out, or crashed. This is achieved by a "helping" methodology, forcing other transactions which are trying to acquire the same location to help the owner of this location to complete its own transaction. The key feature in the transactional approach is that in order to free a location one need only help its single "owner" transaction. Moreover, one can effectively avoid the overhead of coordination among several transactions attempting to help release a location by employing a "reactive" helping policy which we call non-redundant-helping.
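For concreteness, the following C-style fragment sketches the skeleton just described: acquire the (sorted) data set one ownership at a time, and on failure release everything and help the failing location's single owner before retrying. It is an illustration only, not the implementation of Sect. 3; try_own, release, apply_updates and help_owner are hypothetical placeholders.

    /* Illustration only: acquire a static transaction's locations in
       increasing address order, release them all on failure, and help the
       single owner of the location that caused the failure.
       try_own(), release(), apply_updates() and help_owner() are
       hypothetical primitives, declared here only so the sketch is complete. */
    typedef struct { int addr[16]; int size; } txn_t;

    int  try_own(txn_t *t, int location);
    void release(txn_t *t, int location);
    void apply_updates(txn_t *t);
    void help_owner(int location);

    int run_static_txn(txn_t *t)      /* t->addr[] sorted in increasing order */
    {
        for (;;) {
            int acquired = 0;
            while (acquired < t->size && try_own(t, t->addr[acquired]))
                acquired++;                       /* bid for the next location */

            if (acquired == t->size) {            /* owns the whole data set   */
                apply_updates(t);                 /* perform the operation     */
                for (int j = 0; j < t->size; j++)
                    release(t, t->addr[j]);
                return 1;                         /* success                   */
            }
            /* failed on addr[acquired]: free what we hold, help that
               location's single owner, then retry from the beginning */
            for (int j = 0; j < acquired; j++)
                release(t, t->addr[j]);
            help_owner(t->addr[acquired]);
        }
    }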

1.2 Sequential-to-non-blocking translation

One can use STM to provide a general highly concurrent method for translating sequential object implementations into non-blocking ones based on the caching approach of [6, 28]. The approach is straightforward: use transactional memory to implement any collection of changes to a shared object, performing them as an atomic k-word Compare&Swap transaction (see Fig. 2) on the desired locations. The non-blocking STM implementation guarantees that some transaction will always succeed.
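As a small illustration of this translation (ours, not code from the text), the fragment below wraps a sequential two-word update in a retry loop around a k-word Compare&Swap transaction; stm_k_word_cas is an assumed interface with the semantics of the k-word Compare&Swap transaction of Fig. 2.

    /* Hypothetical interface: atomically replace old[] by new_[] at the k
       addresses in addr[], returning nonzero on success (Fig. 2 semantics). */
    int stm_k_word_cas(int k, long *addr[], const long old[], const long new_[]);

    /* Sequential object: a counter plus an update count, changed together. */
    long value, updates;

    void increment_both(void)
    {
        long *addr[2] = { &value, &updates };
        long old[2], new_[2];
        do {                                    /* optimistic retry loop      */
            old[0] = value;  old[1] = updates;  /* read the current state     */
            new_[0] = old[0] + 1;               /* compute the sequential     */
            new_[1] = old[1] + 1;               /* update on the copies       */
        } while (!stm_k_word_cas(2, addr, old, new_)); /* validates the reads */
    }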

Herlihy, in [16] (referred to in the sequel as Herlihy's method), was the first to offer a general transformation of sequential objects into non-blocking concurrent ones. According to his methodology, updating a data structure is done by first copying it into a newly allocated block of memory, making the changes on the new version, and tentatively switching the pointer to the new data structure, all with the help of Load-Linked/Store-Conditional atomic operations. Unfortunately, Herlihy's method does not provide a suitable solution for large data structures and, like the standard approach of locking the whole object, does not support concurrent updating. Alemany and Felten [4] and LaMarca [22] suggested to improve the efficiency of this general method at the price of losing portability, by using operating system support making a set of strong assumptions on system behavior.

To overcome the limitations of Herlihy's method, Barnes, in [6], introduced his caching method, which avoids copying the whole object and allows concurrent disjoint updating. A similar approach was independently proposed by Turek, Shasha, and Prakash [28]. According to Barnes, a process first "simulates" the execution of the update in its private memory, i.e., reading a location for the first time is done from the shared memory but writing is done into the private memory. Then, the process uses a non-blocking k-word Read-Modify-Write atomic operation which checks if the values contained in the memory are equivalent to the values read in the cached update. If this is the case, the operation stores the new values in the memory. Otherwise, the process restarts from the beginning. Barnes suggested to implement the k-word Read-Modify-Write by locking locations in ascending key order. The key to achieving the non-blocking resilient behavior in the caching approach [6, 28] is the cooperative method: whenever a process needs (depends on) a location already locked by another process it helps the locking process to complete its own operation, and this is done recursively along the dependency chain. Though Barnes as well as Turek, Shasha, and Prakash are vague on specific implementation details, a recent paper by Israeli and Rappoport [21] gives, using the cooperative method, a clean and streamlined implementation of a non-blocking k-word Compare&Swap using Load-Linked/Store-Conditional. Our STM based translation method is similar to that of Israeli and Rappoport in that it uses the Barnes caching algorithm to acquire and release locations. However, it introduces a new transactional approach to providing non-blocking resilient behavior. We do so in order to overcome the two major drawbacks of the general cooperative method [6, 21, 28]:

- The cooperative method's recursive structure of "helping" frequently causes processes to help other processes which access a disjoint part of the data structure.

- Unlike STM's transactional k-word Compare&Swap operations, which mostly fail on the transaction level and are thus not "helped," a high percentage of cooperative k-word Compare&Swap operations fail but generate contention since they are nevertheless helped by other processes.

Take for example a process P which executes a 2-word Compare&Swap operation on locations a and b. Assume that some other process Q already owns b. According to the cooperative method, P first helps Q complete its operation and only then acquires b and continues on its own operation. However, in many cases P's Compare&Swap operation will not change the memory, since Q changed b after P already read it, and P will have to retry. All the processes waiting for location a will have to first help P, then Q, and again P, when in any case P's operation will likely fail. Moreover, after P has acquired b, all the processes requesting b will also redundantly help P.

On the other hand, if P executes the 2-word Compare&Swap operation as an STM transaction, P will fail to acquire b, release a, help Q, and restart. The processes waiting for a will have to help only P. The processes waiting for b will not have to help P. Finally, if Q has not changed b, P will most likely find the value of b in its own cache.

1.3 Empirical results

To make sequential-to-non-blocking translation methods acceptable, one needs to reduce the performance overhead one has to pay when the system is stable (non-faulty). We present (see Sect. 5) the first experimental comparison of the performance under stable conditions of the translation techniques cited above. We use the well accepted Proteus Parallel Hardware Simulator [8, 9].

We found that on a simulated Alewife [1] cache-coherent distributed shared-memory machine, as the potential for concurrency in accessing the object grows, the STM non-blocking translation method outperforms both Herlihy's method and the cooperative method. Unfortunately, our experiments show that in general STM and other non-blocking techniques are inferior to standard non-resilient lock-based methods such as queue-locks [25]. Results for a shared bus architecture were similar in flavor.

In summary, STM offers a novel software package of flexible coordination operations for the design of highly concurrent shared objects, which ensures resiliency in faulty runs and improved performance in non-faulty ones. The following section introduces STM. In Sects. 3 and 4 we describe our implementation and its correctness proof. Finally, in Sect. 5 we present our empirical performance evaluation.

2 Transactional memory

We begin by presenting software transactional memory, a variant of the transactional memory of [17]. A transaction is a finite sequence of local and shared memory machine instructions:

Read-transactional - reads the value of a shared location into a local register.

Write-transactional - stores the contents of a local register into a shared location.

The data set of a transaction is the set of shared locations accessed by the Read-transactional and Write-transactional instructions. Any transaction may either fail, or complete successfully, in which case its changes are visible atomically to other processes. For example, dequeuing a value from the head of a doubly linked list as in Fig. 1 may be performed as a transaction. If the transaction terminates successfully it returns the dequeued item or an Empty value.

A k-word Compare&Swap transaction as in Fig. 2 is a transaction which gets as parameters the data set, its size, and two vectors Old and New of the data set's size. A successful k-word Compare&Swap transaction checks whether the values stored in the memory are equivalent to Old. In that case, the transaction stores the New values into the memory and returns a C&S-Success value; otherwise it returns C&S-Failure.

A software transactional memory (STM) is a shared object which behaves like a memory that supports multiple changes to its addresses by means of transactions. A transaction is a thread of control that applies a finite sequence of primitive operations to memory.

    Dequeue()
      BeginTransaction
        DeletedItem := Read-transactional(Head)
        if DeletedItem = Null
          ReturnedValue := Empty
        else
          Write-transactional(Head, DeletedItem^.Next)
          if DeletedItem^.Next = Null
            Write-transactional(Tail, Null)
          ReturnedValue := DeletedItem^.Value
      EndTransaction
    end Dequeue

Fig. 1. A non-static transaction

    k-word-C&S(Size, DataSet[], Old[], New[])
      BeginTransaction
        for i := 1 to Size do
          if Read-transactional(DataSet[i]) ≠ Old[i]
            ReturnedValue := C&S-Failure
            ExitTransaction
        for i := 1 to Size do
          Write-transactional(DataSet[i], New[i])
        ReturnedValue := C&S-Success
      EndTransaction
    end k-word-C&S

Fig. 2. A static transaction

A static transaction is a special form of transaction in which the data set is known in advance, and can thus be thought of as an atomic procedure which gets as parameters the data set and a deterministic transition function which determines the new values to be stored in the data set. This procedure updates the memory and returns the previous values stored. This paper will focus on implementations of a transactional memory that supports static transactions, a class that includes most of the known and proposed synchronization operations in the literature. The k-word Compare&Swap transaction in Fig. 2 is an example of a static transaction, while the Dequeue procedure in Fig. 1 is not.

An STM implementation is wait-free if any process which repeatedly attempts to execute a given transaction terminates successfully after a finite number of machine steps. It is non-blocking if repeated attempts to execute some transaction by a process imply that some process (not necessarily the same one and with a possibly different transaction) will terminate successfully after a finite number of machine steps in the whole system. An STM implementation is swap tolerant if it is non-blocking under the assumption that a process cannot be swapped out infinitely many times. The hardware implemented transactions of [17] could in theory repeatedly fail forever, if processes try to write two locations in different order (as when updating a doubly linked list). However, if used only for static transactions, their implementation can be made swap tolerant (but not non-blocking, since a single process running alone and being repeatedly swapped out during the execution of a transaction will never terminate successfully).

2.1 The system model

Our computation model follows Herlihy and Wing [14] and can also be cast in terms of the I/O automata model of Lynch and Tuttle [23]. A concurrent system consists of a collection of processes. Processes communicate through shared data structures called objects. Each object has a set of primitive operations that provide the only means to manipulate that object. Each process is a sequential thread of control [14] which applies a sequence of operations to objects by issuing an invocation and receiving the associated response. A history is a sequence of invocations and responses of some system execution. Each history induces a "real-time" order of operations (→) where an operation A precedes another operation B if A's response occurs before B's invocation. Two operations are concurrent if they are unrelated by the real-time order. A sequential history is a history in which each invocation is followed immediately by its corresponding response. The sequential specification of an object is the set of legal sequential histories associated with it. The basic correctness requirement for a concurrent implementation is linearizability [14]: every concurrent history is "equivalent" to some legal sequential history which is consistent with the partial real-time order induced by the concurrent history. In a linearizable implementation, operations appear to take effect atomically at some point between their invocation and response. In our model, every shared memory location l of a multiprocessor machine's memory is formally modeled as an object which provides every processor i = 1 ... n four types of possible operations, with the following sequential specification:

Read_i(l) reads location l and returns its value v.

Load-Linked_i(l) reads location l and returns its value v. Marks location l as "read by i."

Store-Conditional_i(l, v) if location l is marked as "read by i," the operation writes the value v to l, erases all existing marks by other processors on l and returns a success status. Otherwise it returns a failure status.

Write_i(l, v) writes the value v to location l, erases all existing "read by" marks by other processors on l.

A more detailed formal specification of these operations can be found in [15, 16].
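Purely to illustrate the specification above (it is not meant as a hardware model), the following C sketch keeps one value and a vector of per-processor "read by i" marks for a single location; NPROC and the sequential treatment of the marks inside the model are our simplifying assumptions.

    /* Toy model of one shared location with per-processor "read by i" marks,
       mirroring the sequential specification above (illustration only). */
    #define NPROC 64

    typedef struct { long val; int marked[NPROC]; } loc_t;

    long Read(loc_t *l, int i)        { (void)i; return l->val; }

    long Load_Linked(loc_t *l, int i) { l->marked[i] = 1; return l->val; }

    int Store_Conditional(loc_t *l, int i, long v)
    {
        if (!l->marked[i]) return 0;              /* not marked "read by i": failure */
        l->val = v;
        for (int p = 0; p < NPROC; p++)
            if (p != i) l->marked[p] = 0;         /* erase other processors' marks   */
        return 1;                                 /* success                          */
    }

    void Write(loc_t *l, int i, long v)
    {
        l->val = v;
        for (int p = 0; p < NPROC; p++)
            if (p != i) l->marked[p] = 0;         /* erase marks by other processors */
    }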

2.2 A sequential specification of STM

The following is the sequential specification of STM. Let L be a set of locations. A memory state is a function s: L → V which returns for each location l of L a value from some set V. Let S be the set of all possible memory states. A transition function t: S → S is a computable function which gets as a parameter a state and returns a new state. Given a subset ds ⊆ L, we say that a transition function t is ds dependent if the following conditions hold: (a) for every state s and every location l, if l ∉ ds then t(s)(l) = s(l); (b) if s1 and s2 are two states s.t. for every l ∈ ds, s1(l) = s2(l), then for every l ∈ ds, t(s1)(l) = t(s2)(l). In other words, (a) states that locations not in ds are not affected and (b) states that effects of a transaction depend only on locations in ds.

Given a set L of locations, a Static Transactional Memory over L is a concurrent object which provides every process i with a Tran_i(DataSet, f, r, status) operation (we add the subscript to denote that this operation is executed by i, and omit it when the id of the processor performing the operation is unimportant). It has as input DataSet, a subset of L, and f, a transition function which is DataSet dependent. It returns a function r: DataSet → V and a boolean value status.

Let h = o_1 o_2 o_3 ... be a finite or infinite sequential history where o_i is the ith operation executed. For every finite prefix h^m = o_1 o_2 o_3 ... o_m of h, we define the terminating state of h^m, TS(h^m), in the following inductive way: If m = 0 then TS(h^m) = e where e is the function e(l) = 0 for every l ∈ L. If m > 0 then assume w.l.o.g. that o_m = Tran(DS, f, r, status) and let h^{m-1} = o_1 o_2 o_3 ... o_{m-1}. If status = success then TS(h^m) = f(TS(h^{m-1})), otherwise TS(h^m) = TS(h^{m-1}).

We can now proceed to define the sequential specification of the static transactional memory. Given a function f: A → B and A' ⊆ A, we define the restriction of f on A' (denoted f|A') to be the function f': A' → B s.t. for every a ∈ A', f'(a) = f(a). We require that a correct implementation of an STM object meet the following sequential specification:

Definition 2.1. The sequential specification includes the set of sequential histories, such that for each finite or infinite history h = o_1 o_2 o_3 ..., for all k, if o_k = Tran(DataSet, f, r, status) and status = success then r = TS(o_1 o_2 o_3 ... o_{k-1})|DataSet.

In other words, each concurrent execution can be linearized to a sequential one, in which each successful transaction returns the values that were stored in the DataSet locations before the transaction started.
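As an example (a restatement of Fig. 2 in the present notation, not a formula taken from the text above), the k-word Compare&Swap with data set ds = {x_1, ..., x_k} and parameter vectors Old and New corresponds to the ds dependent transition function

    % k-word Compare&Swap of Fig. 2 as a ds-dependent transition function
    \[
    t(s)(l) =
    \begin{cases}
      \mathit{New}[j] & \text{if } l = x_j \text{ and } s(x_m) = \mathit{Old}[m] \text{ for all } m = 1,\dots,k,\\
      s(l)            & \text{otherwise.}
    \end{cases}
    \]

Locations outside ds are untouched, satisfying condition (a), and the effect depends only on the values of the locations in ds, satisfying condition (b).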

3 A non-blocking implementation of STM

We implement a non-blocking static STM of size M using the following data structures (see Fig. 3):

- Memory[M], a vector which contains the data stored in the transactional memory.

- Ownerships[M], a vector which determines, for any cell in Memory, which transaction owns it.

Fig. 3. STM implementation: shared data structures

Each process i keeps in the shared memory a record, pointed to by Rec_i, that will be used to store information on the current transaction it initiated. It has the following fields: Size, which contains the size of the data set; Add[], a vector which contains the data set addresses in increasing order; OldValues[], a vector of the data set's size whose cells are initialized to Null at the beginning of every transaction. In case of a successful transaction this vector will contain the former values stored in the involved locations. The other fields are used to synchronize between the owner of the record and the processes which may eventually help its transactions. Version is an integer, initially 0, which determines the instance number of the transaction. This field is incremented every time the process terminates a transaction.
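For concreteness, the shared data structures of Fig. 3 can be pictured as in the following C-style sketch. The field and array names follow the description above; the concrete types, the bounds M, MAX_DATASET and NUM_PROCS, and the status encoding are our assumptions.

    /* Illustrative sketch of the shared STM data structures (Fig. 3). */
    #define M            1024   /* size of the transactional memory (assumed) */
    #define MAX_DATASET    16   /* maximal data-set size (assumed)            */
    #define NUM_PROCS      64   /* number of processes (assumed)              */

    typedef struct { int value; int failadd; } status_t;  /* e.g. (Failure, j) */

    typedef struct rec {
        int      size;                    /* size of the data set                  */
        int      Add[MAX_DATASET];        /* data-set addresses, increasing order  */
        long     OldValues[MAX_DATASET];  /* former values, Null-initialized       */
        status_t status;                  /* Null / (Success, 0) / (Failure, loc)  */
        int      AllWritten;              /* set once the memory has been updated  */
        int      stable;                  /* record describes a complete transaction */
        unsigned long version;            /* instance number, bumped per transaction */
    } rec_t;

    long   Memory[M];        /* the data stored in the transactional memory   */
    rec_t *Ownerships[M];    /* which transaction record owns each cell       */
    rec_t  Rec[NUM_PROCS];   /* one record per process i, pointed to by Rec_i */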

A process i initiates the execution of a transaction by calling the StartTransaction routine of Fig. 4. The StartTransaction routine first initializes the process's record and then declares the record as stable, ensuring that any processors helping the transaction complete will read a consistent description of the transaction. After executing the transaction the process checks if the transaction has succeeded, and if so returns the content of the vector OldValues.

The procedure Transaction (Fig. 5) gets as parameters Rec, the address of the record of the transaction being executed, and a boolean value IsInitiator, indicating whether Transaction was called by the initiating process or by a helping process. The parameter version contains the instance number of the record executed.^1 This parameter is not used when the routine is called by the initiating process since the version field will never change during the call. Transaction first tries to acquire ownership on the data set's locations by calling AcquireOwnerships. If it fails to do so, then upon returning from AcquireOwnerships the status field will be set to (Failure, failadd). If the status field does not have a value yet, the process sets it to (Success, 0). In case of success the process writes the old values into the transaction's record, calculates the new values to be stored, writes them to the memory and releases the ownerships. Otherwise, the status field contains the location that caused the failure. The process first releases the ownerships that it already owns and, in the case that it is not a helping process, it helps the transaction which owns the failing location. Helping is performed only if the helped transaction's record is in a stable state. Since the procedure AcquireOwnerships of Fig. 6 may be called either by the initiator or by the helping processes, we must ensure that (1) all processes will try to acquire ownership on the same locations (this is done by checking the version between the Load-Linked and the Store-Conditional instructions), and (2) from the moment that the status of the transaction becomes fixed, no additional ownerships are allowed for that transaction. The second property is essential for proving not only atomicity but also the non-blocking property. Any process which reads a free location must, before acquiring ownership on it, confirm that the transaction status is still undecided. This is done by writing (with Store-Conditional) (Null, 0) in the status field. This prevents any process that read the location in the past, while it was owned by a different transaction, from setting the status to Failure.

^1 The use of this unbounded field can be avoided if an additional Validate operation is available [20, 21]. A standard 64 bit field will however suffice in practice.

    StartTransaction(DataSet)
      Initialize(Rec_i, DataSet)
      Rec_i^.stable := True
      Transaction(Rec_i, Rec_i^.version, True)
      Rec_i^.stable := False
      Rec_i^.version++
      if Rec_i^.status = Success then
        return (Success, Rec_i^.OldValues)
      else
        return Failure

    Initialize(Rec_i, DataSet)
      Rec_i^.status := Null
      Rec_i^.AllWritten := Null
      Rec_i^.size := |DataSet|
      for j := 1 to |DataSet| do
        Rec_i^.Add[j] := DataSet[j]
        Rec_i^.OldValues[j] := Null

Fig. 4. StartTransaction

When writing the new values with UpdateMemory as in Fig. 6, the processes synchronize in order to prevent a slow process from updating the memory after the ownerships have been released. To do so, every process sets the AllWritten field to True after updating the memory and before releasing the ownerships.
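The recurring pattern that enforces these checks is: Load-Linked the target word, re-read the record's status and version to confirm the help is still current, and only then Store-Conditional. A hedged C-style paraphrase of the checks in Figs. 4-6 (not a verbatim excerpt) is shown below; LL and SC are assumed primitives with the semantics of Sect. 2.1, where the theoretical Load-Linked, unlike most hardware versions, tolerates intervening loads.

    /* Assumed single-word primitives with the semantics of Sect. 2.1. */
    long LL(volatile long *w);
    int  SC(volatile long *w, long v);

    /* Write v into *target on behalf of the transaction described by a
       record, but only if that record still carries instance `version' and
       its status is still undecided; otherwise the helper silently gives up. */
    void guarded_write(volatile long *target, long v,
                       volatile long *rec_version, long version,
                       volatile long *rec_status, long status_null)
    {
        long seen = LL(target);                 /* 1. link to the target word   */
        (void)seen;
        if (*rec_status != status_null) return; /* 2. status already decided    */
        if (*rec_version != version) return;    /* 3. record reused: stale help */
        SC(target, v);                          /* 4. store only if still linked;
                                                      a failed SC means another
                                                      helper already wrote it or
                                                      the link was invalidated  */
    }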


    Transaction(rec, version, IsInitiator)
      AcquireOwnerships(rec, version)
      (status, failadd) := LL(rec^.status)
      if status = Null then
        if version ≠ rec^.version then return
        SC(rec^.status, (Success, 0))
        (status, failadd) := LL(rec^.status)
      if status = Success then
        AgreeOldValues(rec, version)
        NewValues := CalcNewValues(rec^.OldValues)
        UpdateMemory(rec, version, NewValues)
        ReleaseOwnerships(rec, version)
      else
        ReleaseOwnerships(rec, version)
        if IsInitiator then
          failtran := Ownerships[failadd]
          if failtran = Nobody then
            return
          else
            failversion := failtran^.version
            if failtran^.stable then
              Transaction(failtran, failversion, False)

Fig. 5. Transaction

4 Correctness proof

Given a run (we freely interchange between run and history) of the STM implementation, the nth transaction execution of process i is marked as T(i, n). The transaction record for T(i, n) is denoted as Rec_i, and by definition only process i updates Rec_i^.version. It is thus clear that the number n in T(i, n) is equal to the content of Rec_i^.version during T(i, n)'s execution. The executing processes of T(i, n) consist of process i, called the initiator, and the helping processes, those executing Transaction with parameters (Rec_i, n, False).

The following are the definitions of the register operations, where the superscript of an operation marks the id of the process which executed it, and the subscript marks the transaction instance that the process executes. Sometimes, when subscript and superscript are not needed, we will omit them.

W^i_T(variable, value): Process i performs a Write operation on variable with value while executing transaction T.

R^i_T(variable, value): Process i performs a Read operation on variable which returns value while executing transaction T.

LL^i_T(variable, value): Process i performs a Load-Linked operation on variable which returns value while executing T.

SC^i_T(variable, value): Process i performs a successful Store-Conditional operation on variable with value while executing T.

R^i_T(Φ(variable)) is a short form for R^i_T(variable, value) where Φ(value) holds, for some predicate Φ.

    AcquireOwnerships(rec, version)
      size := rec^.size
      for j := 1 to size do
        while true do
          location := rec^.Add[j]
          if LL(rec^.status) ≠ Null then return
          owner := LL(Ownerships[rec^.Add[j]])
          if rec^.version ≠ version then return
          if owner = rec then exit while loop
          if owner = Nobody then
            if SC(rec^.status, (Null, 0)) then
              if SC(Ownerships[location], rec) then exit while loop
          else
            if SC(rec^.status, (Failure, j)) then return

    ReleaseOwnerships(rec, version)
      size := rec^.size
      for j := 1 to size do
        location := rec^.Add[j]
        if LL(Ownerships[location]) = rec then
          if rec^.version ≠ version then return
          SC(Ownerships[location], Nobody)

    AgreeOldValues(rec, version)
      size := rec^.size
      for j := 1 to size do
        location := rec^.Add[j]
        if LL(rec^.OldValues[j]) = Null then
          if rec^.version ≠ version then return
          SC(rec^.OldValues[j], Memory[location])

    UpdateMemory(rec, version, newvalues)
      size := rec^.size
      for j := 1 to size do
        location := rec^.Add[j]
        oldvalue := LL(Memory[location])
        if rec^.AllWritten then return
        if version ≠ rec^.version then return
        if oldvalue ≠ newvalues[j] then
          SC(Memory[location], newvalues[j])
      if (not LL(rec^.AllWritten)) then
        if version ≠ rec^.version then return
        SC(rec^.AllWritten, True)

Fig. 6. Ownerships and Memory access

Clearly, any implementation of transactional memory which is based on an ownership policy only, without helping, will satisfy the linearizability requirement: if a single process is able to lock all the needed memory locations it will be able to update the memory atomically. Consequently, in order to prove the linearizability of our implementation, we must show that even though many processes may execute the same transaction, they behave as if a single process were running it alone. In the following proof we will first show that all the executing processes of a transaction perform the same transaction that the initiator intended. Then, we will prove that all the executing processes agree on the final status of the transaction. Finally, we will demonstrate that the executing processes of a successful transaction will update the memory correctly.

The non-blocking property of the implementation will be established by showing first that no executing process will ever be able to acquire an ownership after the transaction has failed, and then showing that since locations are acquired in increasing order, some transaction will eventually succeed.

4.1 Linearizability

We first show that although process i uses the same record for all its transactions, and may eventually change it while some executing process reads its content, all the executing processes of a transaction read a consistent description of what it is supposed to do.

Claim 4.1. Given an execution r of the STM implementation, the helping processes of a transaction T(i, n) in r read the same data set vector which was stored by i. Any executing process of T(i, n) which read a different data set will not update any of the shared data structures.

Proof. Assume by way of contradiction that there is a helping process j of T(i, n) which read a different description of the transaction. That means that for some location a from T(i, n)'s description, R^j_{T(i,n)}(a, x) and W^i_{T(i,n)}(a, y) but x ≠ y. By the algorithm, only process i updates the description fields in Rec_i, and it does it only once per transaction. Assume first that W^i_{T(i,n)}(a, y) → R^j_{T(i,n)}(a, x). Since x ≠ y, there is some write operation W^i_{T(i,n')}(a, x) s.t.

  W^i_{T(i,n)}(a, y) → W^i_{T(i,n')}(a, x) → R^j_{T(i,n)}(a, x)

where n' > n. Since

  W^i_{T(i,n)}(Rec_i^.version, n+1) → W^i_{T(i,n')}(a, x) → R^j_{T(i,n)}(a, x)

and all the helping processes of T(i, n) compare n and Rec_i^.version before executing a SC operation, j will not, from this point on, update any shared data structure. Assume now that R^j_{T(i,n)}(a, x) → W^i_{T(i,n)}(a, y). By the algorithm (lines 19, 20 in the Transaction procedure),

  R^j_{T(i,n)}(Rec_i^.version, n) → R^j_{T(i,n)}(Rec_i^.stable, true) → R^j_{T(i,n)}(a, x),

and in that case

  W^i_{T(i,n-1)}(Rec_i^.version, n) → W^i_{T(i,n)}(Rec_i^.stable, true) → W^i_{T(i,n)}(a, y)

which is a contradiction to the description of the StartTransaction procedure. □

Next we show that all the executing processes of a transaction agree on its terminating status.

Claim 4.2. Assume that i and j are two executing processes of some transaction T(i, n). If i and j read different values of the terminating status (line 6 in the Transaction procedure), at least one of them will henceforth not update the shared data structures.

Proof. Assume by way of contradiction that

  R^k_{T(i,n)}(Rec_i^.status, Failure)

and

  R^j_{T(i,n)}(Rec_i^.status, Success),

and assume w.l.o.g. that

  R^k_{T(i,n)}(Rec_i^.status, Failure) → R^j_{T(i,n)}(Rec_i^.status, Success).

In that case, there is some process z such that

  R^k_{T(i,n)}(Rec_i^.status, Failure) → LL^z(Rec_i^.status, Null) → SC^z(Rec_i^.status, Success) → R^j_{T(i,n)}(Rec_i^.status, Success).

Since i is the only process which initializes Rec_i^.status, it follows that

  R^k_{T(i,n)}(Rec_i^.status, Failure) → W^i(Rec_i^.status, Null) → SC^z(Rec_i^.status, Success) → R^j_{T(i,n)}(Rec_i^.status, Success).

By the algorithm

  R^k_{T(i,n)}(Rec_i^.version, n) → R^k_{T(i,n)}(Rec_i^.stable, true) → R^k_{T(i,n)}(Rec_i^.status, Failure)

and

  W^i(Rec_i^.version, n+1) → W^i(Rec_i^.status, Null).

We may therefore conclude that process j will not update the shared data structures any more after executing R^j_{T(i,n)}(Rec_i^.status, Success). □

Thanks to Claim 4.2, we can now define a transaction as successful if its terminating status is Success and failing otherwise. From the algorithm and Claim 4.2 it is clear that executing processes of failing transactions will never change the Memory data structure.

Claim 4.3. Every successful transaction has:

(a) only one executing process which writes Success as the terminating status of the transaction, and

(b) only one executing process which sets the AllWritten field to true.

Proof. Assume that during a successful transaction T(i, n), one of those fields, f, was updated by two executing processes k and j. Both have executed

  R_{T(i,n)}(Rec_i^.version, n) → LL_{T(i,n)}(f, Null) → R_{T(i,n)}(Rec_i^.version, n) → W(Rec_i^.stable, True) → LL(f, Null) → SC_{T(i,n)}(f, v).

Assume w.l.o.g. that SC^k_{T(i,n)}(f, v) → SC^j_{T(i,n)}(f, v). By the specification of the Load-Linked/Store-Conditional operation,

  LL^k_{T(i,n)}(f, Null) → SC^k_{T(i,n)}(f, v) → LL^j_{T(i,n)}(f, Null) → SC^j_{T(i,n)}(f, v).

But since only process i writes Null into field f, it follows that

  W^i(Rec_i^.stable, False) → W^i(Rec_i^.version, n) → W^i_{T(i,n)}(f, Null).

Process j thus read Rec_i^.stable as false and therefore should not have helped T(i, n). □

For any successful transaction T(i, n), let SU(i, n) be the SC operation which has set T(i, n)'s status to Success and let AW(i, n) be the SC operation which has set the AllWritten field to True. By the above claims those operations are well defined. The following lemma shows that successful transactions access the memory atomically.

Lemma 4.4. For every n and every process i, if T(i, n) is a successful transaction, then

(a) between SU(i, n) and AW(i, n) all the entries in the Ownerships vector from T(i, n)'s data set contain Rec_i, and

(b) at W^i_{T(i,n)}(Rec_i^.version, n+1) no entry contains Rec_i.

Proof. The proof is by joint induction on n. Assume that the properties hold for n' < n and let us prove them for n.

To prove (a), consider j, the process which executed SU(i, n). By the algorithm, j has performed

  R^j(Rec_i^.version, n) → φ^j_{Tran(i,n)}(Ownerships[x_1], Rec_i) → ... → φ^j_{Tran(i,n)}(Ownerships[x_l], Rec_i) → SU(i, n)

where φ is either a SC or R operation, and x_1 ... x_l are Tran(i, n)'s data set locations. Assume that for some location x_r, at SU(i, n) the content of Ownerships[x_r] differs from Rec_i. By the algorithm this may happen only if

  φ^j_{Tran(i,n)}(Ownerships[x_r], Rec_i) → SC^k(Ownerships[x_r], Null).

Therefore, there is some process k executing ReleaseOwnerships during Tran(i, n') for n' < n. More precisely, the following sequence of operations has occurred:

  LL^k_{Tran(i,n')}(Ownerships[x_r], Rec_i) → R^k_{Tran(i,n')}(Rec_i^.version, n') → W^i(Rec_i^.version, n) → R^j(Rec_i^.version, n) → SC^k_{Tran(i,n')}(Ownerships[x_r], Null).

By the induction hypothesis on property (b), at W^i(Rec_i^.version, n), Ownerships[x_r] differs from Rec_i and therefore the SC^k_{Tran(i,n')}(Ownerships[x_r], Null) should have failed. A contradiction.

To prove (b), note that from the algorithm it follows that process i has executed

  φ^i_{Tran(i,n)}(Rec_i^.status, Success) → φ^i_{Tran(i,n)}(Ownerships[x_1], Null) → ... → φ^i_{Tran(i,n)}(Ownerships[x_l], Null) → W^i_{T(i,n)}(Rec_i^.version, n+1).

Assume by way of contradiction that at W^i_{T(i,n)}(Rec_i^.version, n+1) there is some location x_r where Ownerships[x_r] = Rec_i. By the induction hypothesis on property (b), x_r belongs to Tran(i, n)'s data set. Let k be the processor that wrote Rec_i on Ownerships[x_r]. By the algorithm, k performed

  LL^k(Rec_i^.status, Null) → LL^k(Ownerships[x_r], Null) → SC^k(Rec_i^.status, Null) → φ^i_{Tran(i,n)}(Rec_i^.status, Success) → SC^k(Ownerships[x_r], Rec_i).

By property (a), at the point of executing SC_{Tran(i,n)}(Rec_i^.status, Success), Ownerships[x_r] = Rec_i and therefore SC^k(Ownerships[x_r], Rec_i) should have failed, a contradiction. □

The following corollary will be useful when proving the non-blocking property of the implementation. The proof is similar to the proof of part (b) in Lemma 4.4.

Corollary 4.5. Let T(i, n) be a failing transaction; then at the point of executing W^i_{T(i,n)}(Rec_i^.version, n+1), no entry contains Rec_i.

We can now complete the proof of linearizability. We define the execution state of the implementation at any point of the execution to be the function F s.t. F(x) = Memory[x] for every x ∈ L.

Lemma 4.6. Let T(i, n) be a successful transaction, and let F1 and F2 be the execution states at SU(i, n) and AW(i, n) respectively. The following properties hold:

(a) At AW(i, n), Rec_i^.OldValues = F1|DataSet_(i,n).

(b) If F1 and F2 are the execution states of SU(i, n) and AW(i, n) respectively, then F2|DataSet_(i,n) = f_(i,n)(F1)|DataSet_(i,n), where f_(i,n) is the transition function of T(i, n).

(c) After AW(i, n) no process executing T(i, n) will update Memory.

Proof. The proof is by joint induction on the length of the execution.

To prove (a), let F1 be the execution state at SU(i, n). Assume by way of contradiction that there is some location x ∈ DataSet_(i,n) s.t. Rec_i^.OldValues[x] ≠ F1(x). That means that Memory[x] was changed between SU(i, n) and the point in the execution in which Rec_i^.OldValues[x] was set. Since, by the algorithm, all the executing processes of T(i, n) update Rec_i^.OldValues before updating the Memory, Memory[x] was altered by an executing process of some other successful transaction T(i', n'). By Lemma 4.4, AW(i', n') → SU(i, n) and therefore by the induction hypothesis on property (c), we have a contradiction.

To prove (b), assume by way of contradiction that at AW(i, n) there is some location x_r ∈ DataSet_(i,n) s.t. Memory[x_r] ≠ f_(i,n)(F1)(x_r). Let j be the process which executed AW(i, n). By the algorithm, as a part of the UpdateMemory procedure, j performed either

  R^j_{T(i,n)}(Memory[x_r], f_(i,n)(F1)(x_r)) → AW(i, n)

or

  R^j_{T(i,n)}(Memory[x_r] ≠ f_(i,n)(F1)(x_r)) → SC^j_{T(i,n)}(Memory[x_r], f_(i,n)(F1)(x_r)) → AW(i, n).

Therefore, there is some process k which performed SC^k on Memory[x_r] with a value different from f_(i,n)(F1)(x_r) after R^j(x_r, *) and before AW(i, n). Assume w.l.o.g. that k is executing the transaction Tran(i', n'). If i = i' then clearly n' < n and by the induction hypothesis on property (c), k's writing should have failed. Therefore i' ≠ i. If AW(i', n') → SC^k then AW(i', n') → AW(i, n) and using the induction hypothesis on property (c) we again have a contradiction. Therefore SC^k → AW(i', n'). In that case we have a contradiction to Lemma 4.4, since at SC^k, Ownerships[x_r] is supposed to contain both Rec_i and Rec_i'.

To prove (c), assume by way of contradiction that some executing process j of T(i, n) updated a location x_r in memory after AW(i, n). Process j performed the following sequence of operations:

  LL^j_{Tran(i,n)}(Memory[x_r], val) → R^j_{Tran(i,n)}(Rec_i^.AllWritten, False) → R^j_{Tran(i,n)}(Rec_i^.version, n) → SC^j_{Tran(i,n)}(Memory[x_r], f_(i,n)(F1)(x_r))

where

  val ≠ f_(i,n)(F1)(x_r)

and therefore

  LL^j_{Tran(i,n)}(Memory[x_r], val) → AW(i, n).

By property (b), at AW(i, n), Memory[x_r] contains f_(i,n)(F1)(x_r) and therefore SC^j_{Tran(i,n)}(Memory[x_r], f_(i,n)(F1)(x_r)) should have failed. □

In order to prove that the implementation is linearizable, let us first consider executions of the STM implementation which contain successful transactions only. Let HS be one of those executions and let AW_1 → AW_2 → AW_3 → ... be the sequential subsequence of all the AW events that occurred during HS. Since an AW event occurs only once for every successful transaction, let H be the sequence

  T_{AW_1}(DataSet_1, f_1, ov_1, success) T_{AW_2}(DataSet_2, f_2, ov_2, success) T_{AW_3}(DataSet_3, f_3, ov_3, success) ...

of transaction executions induced by the AW events, where for every T_{AW_n} the tuple (DataSet_n, f_n, ov_n, success) represents the content of the DataSet, f, OldValues, and status fields respectively in T_{AW_n}'s record at AW_n. By Lemma 4.6, it is a simple exercise to show by induction that H is a legal sequential history according to Definition 2.1. Since failing transactions do not cause any change to Memory, we may conclude that:

Theorem 4.7. The implementation is linearizable.

4.2 Non-blocking

We denote the executing process which wrote Failure to Rec_i^.status of a transaction T(i, n) as its failing process. In order to prove the non-blocking property of the implementation, define the failing location of T(i, n) to be the location that the failing process failed to acquire.

Claim 4.8. Given a failing transaction T(i, n), no executing process of T(i, n) will ever acquire a location which is higher than or equal to the failing location of T(i, n).

Proof. Assume by way of contradiction that some executing process of T(i, n) acquired a location at least as high as T(i, n)'s failing location. Since the process read all the lower locations acquired for T(i, n), let j be the executing process of T(i, n) that acquired the failing location x_r of T(i, n). By the algorithm, j performed the following sequence of operations:

  LL^j_{T(i,n)}(Rec_i^.status, Null) → LL^j_{T(i,n)}(Ownerships[x_r], Nobody) → SC^j_{T(i,n)}(Rec_i^.status, Null) → SC^j_{T(i,n)}(Ownerships[x_r], Rec_i).

The failing process of T(i, n), k, has performed the following sequence of operations:

  LL^k_{T(i,n)}(Rec_i^.status, Null) → LL^k_{T(i,n)}(Ownerships[x_r], other) → SC^k_{T(i,n)}(Rec_i^.status, Failure),

where other is neither Null nor Rec_i. Assume that

  SC^k_{T(i,n)}(Rec_i^.status, Failure) → SC^j_{T(i,n)}(Ownerships[x_r], Rec_i).

In that case

  SC^j_{T(i,n)}(Rec_i^.status, Null) → LL^k_{T(i,n)}(Rec_i^.status, Null) → LL^k_{T(i,n)}(Ownerships[x_r], other) → SC^k_{T(i,n)}(Rec_i^.status, Failure).

Consequently k must have seen Ownerships[x_r] already owned by i, or the SC^j_{T(i,n)}(Ownerships[x_r], Rec_i) should have failed. Therefore

  SC^j_{T(i,n)}(Ownerships[x_r], Rec_i) → SC^k_{T(i,n)}(Rec_i^.status, Failure).

Now, if SC^j_{T(i,n)}(Ownerships[x_r], Rec_i) → LL^k_{T(i,n)}(Ownerships[x_r], other) we have a contradiction, since a process executing T(i, n) never releases its ownership before the status is set and processes executing T(i, n'), n' < n, will see that Rec_i^.version has changed. For that reason,

  LL^k_{T(i,n)}(Ownerships[x_r], other) → SC^j_{T(i,n)}(Ownerships[x_r], Rec_i) → SC^k_{T(i,n)}(Rec_i^.status, Failure).

In that case

  LL^k_{T(i,n)}(Rec_i^.status, Null) → LL^k_{T(i,n)}(Ownerships[x_r], other) → LL^j_{T(i,n)}(Ownerships[x_r], Nobody) → SC^j_{T(i,n)}(Rec_i^.status, Null) → SC^j_{T(i,n)}(Ownerships[x_r], Rec_i)

and SC^k_{T(i,n)}(Rec_i^.status, Failure) must have failed, a contradiction. □

Theorem 4.9. The implementation is non-blocking.


Proof. Assume by way of contradiction that there is an infinite schedule in which no transaction terminates successfully. Assume first that the number of failing transactions is finite. This happens only if, from some point on in the computation, all the processes are "stuck" in the AcquireOwnerships routine. In this case there are several processes which try to get ownership on the same location for the same transaction. This means that at least one process will succeed or will fail the transaction, a contradiction. It must thus be the case that the number of failing transactions is infinite. In that case, there is at least one location which is a failing address infinitely often. Consider A, the highest of those addresses. Since the initiator of a transaction tries to help the transaction which has failed it before retrying, and since by Corollary 4.5 all acquired locations are released before helping, it follows that there are infinitely many transactions which have acquired ownership on A but have failed. By Claim 4.8 those transactions have failed on addresses higher than A, a contradiction to the fact that A is the highest failed location. □

To avoid major overheads when no failures occur, any algorithm based on the helping paradigm must avoid "redundant helping" as much as possible. A process's helping is redundant when the helping process is not faster than the helped process. Such helping will only increase contention and consequently will cause the helped process to release the ownerships later than it would have released them if not helped. A straightforward solution is to modify the STM implementation such that a process i will help the transactions of some other process j not every time that it is blocked by it, but rather every r_ij ≥ 1 times. Choosing the correct value for r_ij is important. If it is too high, it will take too much time for the system to discover and help a slow process. On the other hand, if it is too low, it will result in a considerable amount of redundant helping. The solution for process i is to vary r_ij according to its expected estimate of process j's speed. On multiprocessor architectures, the relative speed between two processes is not completely random, but rather depends on the location and on the load of the processors on which the processes are located. One can thus use the history of redundant helps that occurred in the past to provide a good estimate of the relative speed of other processes. In our implementation, a process i discovers that helping was redundant if the version field changed while it was helping. Each time a process i discovers that help to process j was not redundant it decreases r_ij by one; otherwise it increases it by one (up to some upper bound).
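One possible realization of this reactive policy (a sketch only; the per-process bookkeeping arrays, the bound R_MAX, and the function names are assumptions, as the text above does not give code for it):

    /* Adaptive non-redundant helping, one instance per process i. */
    #define NPROC  64
    #define R_MAX   8          /* assumed upper bound on r_ij                */

    static int r[NPROC];       /* r[j]: current helping threshold r_ij       */
    static int blocked[NPROC]; /* times blocked by j since we last helped it */

    /* Called when transaction owner j blocks us.
       Returns nonzero if we should help j this time. */
    int should_help(int j)
    {
        if (r[j] < 1) r[j] = 1;               /* lazy initialization         */
        if (++blocked[j] < r[j]) return 0;    /* skip: assume j is fast      */
        blocked[j] = 0;
        return 1;
    }

    /* Called after helping j; `redundant' is nonzero if the helped record's
       version changed while we were helping (j finished on its own). */
    void helping_feedback(int j, int redundant)
    {
        if (redundant) { if (r[j] < R_MAX) r[j]++; }  /* help j less often   */
        else           { if (r[j] > 1)     r[j]--; }  /* j really was slow   */
    }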

5 An empirical evaluation of translation methods

5.1 Methodology

We compared the performance of STM and other software methods on 64 processor bus and network architectures using the Proteus simulator developed by Brewer, Dellarocas, Colbrook, and Weihl [8]. Proteus simulates parallel code by multiplexing several parallel threads on a single CPU. Each thread runs on its own virtual CPU with accompanying local memory, cache, and communications hardware, keeping track of how much time is spent using each component. In order to facilitate fast simulations, Proteus does not do complete hardware simulations. Instead, operations which are local (do not interact with the parallel environment) are run uninterrupted on the simulating machine's CPU and memory. The amount of time used for local calculations is added to the time spent performing (simulated) globally visible operations to derive each thread's notion of the current time. Proteus makes sure a thread can only see global events within the scope of its local time.

In the simulated bus architecture, processors communicate with shared memory modules through a common bus. Uniform shared-memory access is assumed, that is, access of any memory module from any processor, when the bus is free, takes the same amount of time, which is 4 cycles. Each processor has a cache with 2048 lines of 8 bytes, and cache coherence is maintained using Goodman's [18] "snoopy" cache-coherence protocol.

The simulated network architecture is similar to that of the Alewife cache-coherent distributed-memory machine currently under development at MIT [1]. Each node of the machine's torus shaped communication grid consists of a processor, cache memory, router, and a portion of the globally-addressable memory. The cost of switching or wiring in the Alewife architecture is 1 cycle/packet. Each processor has a cache with 2048 lines of 8 bytes. Cache coherence is provided using a version of Chaiken's [12] directory-based cache-coherence protocol.

The current version of Proteus does not support Load-Linked/Store-Conditional instructions. Instead we used a slightly modified version that supports a 64-bit Compare&Swap operation where 32 bits serve as a time stamp. Naturally this operation is less efficient than the theoretical Load-Linked/Store-Conditional proposed in [6, 16, 20] (which we could have built directly into Proteus), since a failing Compare&Swap will cost a memory access while a failing Store-Conditional will not. However, we believe the 64-bit Compare&Swap is closer to the real world than the theoretical Load-Linked/Store-Conditional, since existing implementations of Load-Linked/Store-Conditional as on the Alpha [13] or PowerPC [19] do not allow access to the shared memory between the Load-Linked and the Store-Conditional operations. On existing machines the 64-bit Compare&Swap may be implemented by using a 64 bit Load-Linked/Store-Conditional as on the Alpha, or using Bershad's lock-free methodology^2 [7].
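The flavor of such a primitive can be sketched as follows; this is only an illustration of the time-stamp idea (pairing a 32-bit value with a 32-bit stamp so that any intervening write is detected), not the actual simulator extension.

    #include <stdint.h>
    #include <stdatomic.h>

    /* A 64-bit word holding a 32-bit value and a 32-bit time stamp. Every
       successful update increments the stamp, so a Compare&Swap that expects
       the old word fails if any write occurred in between. */
    typedef _Atomic uint64_t stamped_t;

    static inline uint64_t pack(uint32_t val, uint32_t stamp)
    { return ((uint64_t)stamp << 32) | val; }

    static inline uint32_t value_of(uint64_t w) { return (uint32_t)w; }
    static inline uint32_t stamp_of(uint64_t w) { return (uint32_t)(w >> 32); }

    /* "Load-Linked": remember the whole stamped word. */
    static inline uint64_t ll_read(stamped_t *w) { return atomic_load(w); }

    /* "Store-Conditional": succeed only if the word (value and stamp) is
       unchanged since ll_read, installing the new value with a fresh stamp. */
    static inline int sc_write(stamped_t *w, uint64_t seen, uint32_t newval)
    {
        uint64_t desired = pack(newval, stamp_of(seen) + 1);
        return atomic_compare_exchange_strong(w, &seen, desired);
    }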

We used four synthetic benchmarks for evaluating various methods for implementing shared data structures. The benchmarks vary in the size of the data structure and the amount of parallelism.

We ran each benchmark varying the number of processors participating in the simulation. We measured throughput, the average number of operations performed in a one million cycle period. The throughput was calculated using the formula throughput = (10^6 × operations) / elapsed-time, where operations is the total number of operations performed in the benchmark and elapsed-time is the number of cycles that elapsed from the beginning up to the end of the simulation.

^2 The non-blocking property will be achieved only if the number of spurious failures is finite.

Counting. Each of n processes increments a shared counter 10000/n times. In this benchmark updates are short, change the whole object state, and have no built-in parallelism.

Resource allocation. A resource allocation scenario [10]: a few processes share a set of resources and from time to time a process tries to atomically acquire a subset of size s of those resources. This is the typical behavior of a well designed distributed data structure. For lack of space we show only the benchmark in which n processes atomically increment, 5000/n times, s = 2, 4, 6 locations chosen uniformly at random from a vector of length 60. The benchmark captures the behavior of highly concurrent queue and counter implementations as in [26, 27].
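
A sketch of a single operation of this benchmark (ours; increment_locations is an assumed stand-in for the atomic s-word increment performed by the method under test):

#include <stdlib.h>

#define VECTOR_LEN 60

extern void increment_locations(long *vector, const int *idx, int s);

/* Pick s distinct slots uniformly at random from the 60-element vector
   and increment them atomically (s is 2, 4 or 6 in the benchmark). */
void acquire_and_increment(long vector[VECTOR_LEN], int s)
{
    int idx[6];
    for (int i = 0; i < s; i++) {
        int c, fresh;
        do {                                   /* resample until distinct */
            c = rand() % VECTOR_LEN;
            fresh = 1;
            for (int j = 0; j < i; j++)
                if (idx[j] == c) fresh = 0;
        } while (!fresh);
        idx[i] = c;
    }
    increment_locations(vector, idx, s);       /* the atomic s-word update */
}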

Priority queue. A shared priority queue on a heap of size n. We used a variant of a sequential heap implementation [11]. In this benchmark each of the n processes alternately enqueues a random value in a heap and dequeues the greatest value from it, 5000/n times. The heap is initially empty and its maximal size is n. This is probably the most trying benchmark, since there is no potential for concurrency and the size of the data structure increases with n.
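
A sketch of the per-process loop (ours; heap_enqueue and heap_dequeue_max are assumed names for the translated sequential heap operations of [11]):

#include <stdlib.h>

extern void heap_enqueue(int value);
extern int  heap_dequeue_max(void);

/* Each of the n processes alternately inserts a random value and
   removes the greatest one, 5000/n times. */
void priority_queue_benchmark(int n)
{
    for (int i = 0; i < 5000 / n; i++) {
        heap_enqueue(rand());
        heap_dequeue_max();
    }
}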

Doubly linked queue. An implementation of a queue as a doubly linked list in an array. The first two cells of the array contain the head and the tail of the list. Every item in the list is a pair of cells in the array, which represent the index of the previous and next element respectively. Each process enqueues a new item by updating the tail to contain the new item's index, and dequeues an item by updating the head to contain the index of the next item in the list. Each process executes 5000/n pairs of enqueue/dequeue operations on a queue of initial size n. This benchmark supports limited parallelism since, when the queue is not empty, enqueues and dequeues update the tail and head of the queue without interfering with each other. For a high number of processes, the size of the updated locations in each enqueue/dequeue is relatively small compared to the object size.
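
The array layout can be pictured as follows (a sketch of the description above, not the paper's code; the field names and the MAX_ITEMS bound are assumptions):

#define MAX_ITEMS 128          /* assumed capacity of the array */
#define HEAD      0            /* cell holding the index of the first item */
#define TAIL      1            /* cell holding the index of the last item */

typedef struct {
    int prev;                  /* index of the previous item in the list */
    int next;                  /* index of the next item in the list */
} link_t;

typedef struct {
    int    ends[2];            /* ends[HEAD] and ends[TAIL] */
    link_t item[MAX_ITEMS];    /* each list item is a pair of cells */
} dlqueue_t;

/* An enqueue writes the new item's links and swings ends[TAIL]; a dequeue
   swings ends[HEAD]. When the queue is non-empty the two operations touch
   disjoint locations, which is the limited parallelism noted above. */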

We simplified the general STM implementation for the case in which k-word Compare&Swap transactions are being performed (given in Fig. 2). The simplification is that processes executing a transaction do not have to agree on the value stored in the Dataset before the transaction started, only on a boolean value which is true if the value is equal to old[].
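
The intent of the simplification can be read off the sequential semantics of the transaction (a sketch only; the real code of Fig. 2 additionally acquires ownerships, helps, and releases, and the identifiers here are ours):

/* Helpers need to agree only on a single boolean, namely whether every
   location still holds its expected old value, rather than on the full
   old contents of the data set. */
int kword_compare_and_swap(long *dataset[], const long old_vals[],
                           const long new_vals[], int k)
{
    int equal = 1;                       /* the one agreed-upon boolean */
    for (int i = 0; i < k && equal; i++)
        if (*dataset[i] != old_vals[i])
            equal = 0;
    if (equal)
        for (int i = 0; i < k; i++)      /* done atomically by the transaction */
            *dataset[i] = new_vals[i];
    return equal;                        /* true iff the k-word CAS took effect */
}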

We used the above benchmarks to compare STM to the two non-blocking software translation methods described earlier and to a blocking solution based on an MCS queue-lock [25] (the data structure is accessed in a mutually exclusive manner). The non-blocking methods include Herlihy's method and Israeli and Rappoport's k-word Compare&Swap based implementation. All the non-blocking methods use exponential backoff [3] to reduce contention.
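
For reference, exponential backoff in the style of [3] can be sketched as follows (our own illustration; the delay bounds are assumptions, not the constants used in the experiments):

#include <stdlib.h>

#define MIN_DELAY 16
#define MAX_DELAY 4096

/* After a failed update, spin for a random number of iterations whose
   upper bound doubles on every consecutive failure, up to a cap; callers
   reset *limit to MIN_DELAY after a success. */
void backoff(unsigned *limit)
{
    unsigned delay = (unsigned)rand() % *limit;
    if (*limit < MAX_DELAY)
        *limit *= 2;
    for (volatile unsigned i = 0; i < delay; i++)
        ;                                /* busy-wait, no shared-memory traffic */
}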

5.2 Results

The data to be presented leads us to conclude that there are three factors differentiating the performance of the four methods:

1. Potential for parallelism: Both locking and Herlihy's method do not exploit the potential for parallelism3, and only one process at a time is allowed to update the data structure. The software-transactional and the cooperative methods allow concurrent processes to access disjoint parts of the data structure.

2. The price of a failing update: In Herlihy's non-blocking method, the number of memory accesses of a failing update is at least the size of the object (reading the object and copying it to the private copy, and reading and writing to the pointer). Fortunately, the nature of the cache coherence protocols is such that almost all accesses performed when the process updates its private copy are local. In both caching methods (STM and Israeli and Rappoport), the price of a failure is at least the number of locations accessed during the cached execution.

3. The amount of helping by other processes: Helping exists only in the software-transactional and the cooperative methods. In the Israeli and Rappoport method, k-word Compare&Swap operations, including failing ones, are helped not only by the k-word Compare&Swap operations that access the same locations concurrently, but also by all the operations that are in turn helping them, and so on. In the STM method, a k-word Compare&Swap is helped only by operations that need non-disjoint locations. Moreover, and this is a crucial performance factor, in STM most of the unsuccessful updates terminate as failing transactions, not as failing k-word Compare&Swaps, and when a transaction fails on the first location, it is not helped.

The results for the counting benchmark are given in Fig. 7. The horizontal axis shows the number of processors and the vertical axis shows the throughput achieved. This benchmark is cruel to the caching based methods, since the amount of updated memory is equivalent to the size of the object and there is no potential for parallelism. On the bus architecture, locking and Herlihy's method give significantly higher throughput than the caching methods.

The results of the resource allocation benchmark are shown in Fig. 8. We measured the potential for parallelism as the percentage of the atomic s-word increments that succeeded on the first attempt. When s = 2 this percentage varies from 73-75% at 10 processors down to 33-34% at 60 processors. For s = 4 the potential for parallelism is 40-44% at 4 processors down to 16% at 60 processors, and when s = 6 it varies from 24-29% at 10 processors down to 9-10% at 60 processors. In general, as the number of processors increases, local work can be performed concurrently, and thus the performance of the STM improves. Beyond a certain number of processors, the potential for parallelism declines, causing a growing number of k-word Compare&Swap conflicts, and the throughput degrades.

3 Note that we use a very unsophisticated locking solution; ones with more potential for parallelism can be designed.



Fig. 7. Counting benchmark

Fig. 8. Resource allocation benchmark



Fig. 9. Priority queue benchmark

Fig. 10. Doubly linked queue benchmark

This is the reason for the relatively low throughput of the STM method for small numbers of processors and for the concave form of the STM graphs. As one can see, when s = 2 on the bus, or when s = 2 or 4 on the Alewife architecture, the STM method outperforms even the queue-lock method.

Figure 9 describes the results of the priority queue benchmark. A priority queue is a data structure that does not allow concurrency, and as the number of processors increases, the number of locations accessed increases too. Still, the number of accessed locations is smaller than the size of the object. Therefore, the STM performs better than Herlihy's method at most levels of concurrency.

Figure 10 contains the doubly linked queue results. There is more concurrency in accessing the object than in the counter benchmark, though it is limited: at most two processes may concurrently update the queue. Herlihy's method performs poorly because the penalty paid for a failed update grows linearly with queue size: usually twice the number of processes. In the STM method, the low granularity of the two-word Compare&Swap transactions implies that the price of a failure remains constant at all concurrency levels, though local work is still higher than in the queue-lock method.

5.3 A comparison of non-blocking methods

Every theoretical method can be improved in many ways when implemented in practice. In order to get a fair comparison between the non-blocking methods, we believe one should use them in their "purest" form. We therefore compare the performance of all the non-blocking methods without backoff (in all the methods) and without the non-redundant-helping policy (in STM). We show the results of running these "pure" algorithms in high load situations, when all processes are repeatedly trying to access the data structure, and in low load situations, where the access patterns are sparse.

5.3.1 A high load comparison

In general, our results show that STM outperforms the cooperative method in all circumstances and, except for the counter benchmark, STM outperforms Herlihy's method too. The results of the counter benchmark are shown in Fig. 11. As in the previous tests, Herlihy's method performs better than the caching methods on both architectures. On the bus, Herlihy's method is 2.91 times faster than STM on 10 processors, down to 1.35 times faster than STM on 60 processors. On the Alewife style architecture, Herlihy's method is 3.38 times faster than STM on 10 processors, down to 2.23 times faster than STM on 60 processors. STM is from 1.97 times faster than the Israeli and Rappoport method on the bus architecture, up to 8.44 times faster than Israeli and Rappoport on 60 processors.



Fig. 11. Non-blocking comparison: counting benchmark

Fig. 12. Non-blocking comparison: priority queue benchmark

Fig. 13. Non-blocking comparison: doubly linked queue benchmark

On the Alewife architecture, STM is from 1.92 up to 7.6 times faster than the Israeli and Rappoport method. The degradation in the performance of the Israeli and Rappoport method is due to the high number of failing k-word Compare&Swap operations: up to 8.4 times the number of successful ones! For STM the number of failing k-word Compare&Swap operations is at most 0.26 times the number of successful ones. Thus most of the transactions in STM terminate as failing transactions and are not helped, since they failed in acquiring the first (and last) location needed.

In the priority queue benchmark, on the simulated bus architecture, Herlihy's method is from 2.36 times faster than STM down to 2.8 times slower than STM. On the Alewife architecture, Herlihy's method has a throughput that is 2.41 times higher than STM's, down to 1.1 times lower. The results of the doubly linked queue appear in Fig. 13. On the bus architecture STM is up to 3.37 times faster than Israeli and Rappoport and up to 59 times faster than Herlihy's method.



Fig. 14. Non-blocking comparison: resource allocation benchmark

On the Alewife style architecture, STM had 12.9 times higher throughput than Herlihy's method and 7.28 times higher throughput than the Israeli and Rappoport method. In the resource allocation benchmark (Fig. 14) STM also outperforms the other methods. On the bus architecture it is 1.1-1.6 times faster; on the Alewife architecture it is 1.09-1.68 times faster. Note that in this benchmark the factor that affects the Israeli and Rappoport performance is not the number of failing k-word Compare&Swap operations, which is relatively low, but the increased redundant helping by remote processors.

We also compared the cooperative k-word Compare&Swap with STM on a specific implementation which explicitly needs such a software-supported operation. We chose Israeli and Rappoport's algorithm for a concurrent priority queue [20], since it is based on recursive helping. Therefore, all the recursive helping done by a process during the execution of a k-word Compare&Swap has a higher chance of being non-redundant, and for that reason Israeli and Rappoport's method is expected to perform better. Our implementation is slightly different since it uses a 3-word Compare&Swap operation instead of a 2-word Store Conditional operation (in fact, using a 3-word Compare&Swap operation simplifies the implementation since it avoids freezing [20] nodes).



Fig. 15. Non-blocking comparison: Israeli & Rappoport priority queue

Fig. 16. Sparse access pattern — resource allocation benchmark



Fig. 17. Sparse access pattern — counting benchmark

The results of the concurrent priority queue benchmark are given in Fig. 15. Though the inherent structure of the algorithm should give the advantage to the Israeli and Rappoport method, STM gives the highest throughput. As in the counter and the sequential priority queue benchmarks, the reason is the high number of failing k-word Compare&Swap operations in the Israeli and Rappoport method: up to 2.5 times the number of successful k-word Compare&Swap operations.

5.3.2 A low load comparison

The experimental results we presented above test the various non-blocking methods in high load situations, when all the processors repeatedly attempt to access the shared data structure. The reader may ask herself what the relative overhead of using the various non-blocking methods might be under sparse access patterns, when only a small number of the processors at a time try to update the shared data structure. To answer this question, we compared the performance of the non-blocking methods while varying the access patterns. We present here two benchmarks in which each of the processors, after every operation performed on the object, waits rand cycles before accessing the object again. The value rand is chosen each time randomly between 0 and some upper bound W. We ran the benchmarks with 30 and 60 processors, varying the value of W between 10^3 and 10^6. Since most of the running time is spent waiting rather than accessing the data structures, throughput is no longer a good performance measure. We therefore measured latency: the number of cycles it takes on average for a processor to complete a single access to the data structure.
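
The access pattern can be sketched as follows (ours; current_cycle, access_object and idle_cycles are assumed names for simulator facilities, and the bookkeeping simply mirrors the latency measure described above):

#include <stdlib.h>

extern unsigned long current_cycle(void);        /* simulated clock */
extern void access_object(void);                 /* one operation, any method */
extern void idle_cycles(unsigned long cycles);   /* local waiting */

/* Perform 'ops' operations, waiting a random number of cycles in [0, W)
   between consecutive accesses, and return the average latency. */
unsigned long low_load_benchmark(int ops, unsigned long W)
{
    unsigned long total = 0;
    for (int i = 0; i < ops; i++) {
        unsigned long start = current_cycle();
        access_object();
        total += current_cycle() - start;
        idle_cycles((unsigned long)rand() % W);
    }
    return total / ops;
}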

The results of the resource allocation and the counting benchmarks on the simulated Alewife architecture are presented in Figs. 16 and 17 respectively. The latency of Herlihy's method in the resource allocation benchmark is completely off the scale and therefore not shown. As expected, it outperforms the other methods in the counting benchmark (Fig. 17), since it introduces little overhead. In both benchmarks STM is never worse than Israeli and Rappoport's method, but as the access patterns become sparser the latencies of the two methods rapidly converge. This is because at low loads every process has a good chance of accessing the object alone, so the differences between the two algorithmic "helping" policies have no noticeable effect on the performance.

6 Conclusions

Our paper introduces a non-blocking software version of Herlihy and Moss's transactional memory approach. There are many possible directions in which it can be extended. One issue is to design better non-blocking translation engines, possibly by limiting STM's expressibility to a smaller set of implementable transactions. Another interesting question is what performance guarantees one can get with a less robust STM software package, possibly programmed at the machine's message-passing level. Finally, the ability to add an STM component to existing software-based virtual shared memory systems raises theoretical questions about the computational power of a programming abstraction based on having a variety of "operations" that can be applied to memory locations, versus the traditional approach of thinking of synchronization operations as "objects."

Acknowledgements. We wish to thank Greg Barnes, Maurice Herlihy, and the anonymous referees for their many helpful comments.

References

1. Agarwal A, Chaiken D, Johnson K, Krantz D, Kubiatowicz J, Kurihara K, Lim B, Maa G, Nussbaum D: The MIT Alewife machine: a large-scale distributed memory multiprocessor. In: Proceedings of the Workshop on Scalable Shared Memory Multiprocessors. Kluwer Academic Publishers, 1991. An extended version of this paper has been submitted for publication, and appears as MIT/LCS Memo TM-454, 1991

2. Anderson RJ: Primitives for asynchronous list-compression. In: Proceedings of the 4th ACM Symposium on Parallel Algorithms and Architectures, pp 199-208, 1992

3. Anderson TE: The performance of spin lock alternatives for shared memory multiprocessors. IEEE Trans Parallel Distrib Syst 1(1): 6-16 (1990)

4. Alemany J, Felten EW: Performance issues in non-blocking synchronization on shared-memory multiprocessors. In: Proceedings of the 11th ACM Symposium on Principles of Distributed Computing, pp 125-134, August 1992



5. Aspnes J, Herlihy MP, Shavit N: Counting networks. J ACM 41(5): 1020-1048 (1994)

6. Barnes G: A method for implementing lock-free shared data structures. In: Proceedings of the 5th ACM Symposium on Parallel Algorithms and Architectures, 1993

7. Bershad BN: Practical considerations for lock-free concurrent objects. Technical Report CMU-CS-91-183, Carnegie Mellon University, September 1991

8. Brewer EA, Dellarocas CN, Colbrook A, Weihl WE: Proteus: a high-performance parallel-architecture simulator. MIT/LCS/TR-516, September 1989

9. Brewer EA, Dellarocas CN: Proteus user documentation

10. Chandy K, Misra J: The drinking philosophers problem. ACM Trans Program Lang Syst 6(4): 632-646 (1984)

11. Cormen TH, Leiserson CE, Rivest RL: Introduction to algorithms. MIT Press

12. Chaiken D: Cache coherence protocols for large-scale multiprocessors. S.M. thesis, Massachusetts Institute of Technology, Laboratory for Computer Science Technical Report MIT/LCS/TR-489, September 1990

13. Digital Equipment Corporation: Alpha system reference manual

14. Herlihy M, Wing JM: Linearizability: a correctness condition for concurrent objects. ACM Trans Program Lang Syst 12(3): 463-492 (1990)

15. Herlihy M: Wait-free synchronization. ACM Trans Program Lang Syst 13(1): 124-149 (1991)

16. Herlihy M: A methodology for implementing highly concurrent data objects. ACM Trans Program Lang Syst 15(9): 745-770 (1993)

17. Herlihy M, Moss JEB: Transactional memory: architectural support for lock-free data structures. In: Proceedings of the 20th Annual Symposium on Computer Architecture, pp 289-300, May 1993

18. Goodman JR: Using cache-memory to reduce processor-memory traffic. In: Proceedings of the 10th International Symposium on Computer Architecture, pp 124-131, June 1983

19. IBM: PowerPC reference manual

20. Israeli A, Rappoport L: Efficient wait-free implementation of a concurrent priority queue. In: Workshop on Distributed Algorithms on Graphs 1993. Lect Notes Comput Sci, vol 725. Springer, Berlin Heidelberg New York, pp 1-17, 1993

21. Israeli A, Rappoport L: Disjoint-access-parallel implementations of strong shared memory. In: Proceedings of the 13th ACM Symposium on Principles of Distributed Computing, pp 151-160, 1993

22. LaMarca A: A performance evaluation of lock-free synchronization protocols. In: Proceedings of the 13th ACM Symposium on Principles of Distributed Computing, pp 130-140, 1993

23. Lynch N, Tuttle M: Hierarchical correctness proofs for distributed algorithms. In: Proceedings of the 6th ACM Symposium on Principles of Distributed Computing, pp 137-151, August 1987. Full version available as MIT Technical Report MIT/LCS/TR-387

24. Massalin H, Pu C: A lock-free multiprocessor OS kernel. Technical Report CUCS-005-91, Columbia University, March 1991

25. Mellor-Crummey JM, Scott ML: Synchronization without contention. In: Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991

26. Rudolph L, Slivkin M, Upfal E: A simple load balancing scheme for task allocation in parallel machines. In: Proceedings of the 3rd ACM Symposium on Parallel Algorithms and Architectures, pp 237-245, July 1991

27. Shavit N, Zemach A: Diffracting trees. ACM Trans Comput Syst, vol 18 (1996)

28. Turek J, Shasha D, Prakash S: Locking without blocking: making lock-based concurrent data structure algorithms non-blocking. In: Proceedings of the 1992 Conference on the Principles of Database Systems, pp 212-222, 1992

29. Touitou D: Lock-free programming: a thesis proposal. Tel Aviv University, April 1993

Nir N. Shavit received a B.A. (1984) and an M.Sc. (1986) in computer science from the Technion, Israel Institute of Technology, and a Ph.D. (1990) in computer science from the Hebrew University in Jerusalem. He has been a postdoctoral researcher at IBM Almaden, Stanford, and MIT, and is presently an assistant professor in the computer science department at Tel-Aviv University. His research interests include theoretical and practical aspects of synchronization and coordination, ranging from tightly coupled multiprocessors to computer networks.

Dan Touitou received the B.Sc. degree in Computer Science from the Technion, Haifa, in 1986 and the M.Sc. degree in Computer Science from Tel-Aviv University in 1992. He is now studying for a Ph.D. in Computer Science at Tel-Aviv University. His research interests are the practical and theoretical aspects of synchronization in distributed computation.

