
Distrib. Comput. (2009) 21:433–455
DOI 10.1007/s00446-008-0073-1

Distributed speculative execution for reliability and fault tolerance: an operational semantics

Cristian Tapus · Jason Hickey

Received: 8 September 2005 / Accepted: 12 September 2008 / Published online: 11 November 2008
© Springer-Verlag 2008

Abstract This paper examines the use of speculations, a form of distributed transactions, for improving the reliability and fault tolerance of distributed systems. A speculation is defined as a computation that is based on an assumption that is not validated before the computation is started. If the assumption is later found to be false, the computation is aborted and the state of the program is rolled back; if the assumption is found to be true, the results of the computation are committed. The primary difference between a speculation and a transaction is that a speculation is not isolated: for example, a speculative computation may send and receive messages, and it may modify shared objects. As a result, processes that share those objects may be absorbed into a speculation. We present a syntax and an operational semantics in two forms. The first is a speculative model, which takes full advantage of the speculative features. The second is a nonspeculative, nondeterministic model, where aborts are treated as failures. We prove the equivalence of the two models, demonstrating that speculative execution is equivalent to failure-free computation.

Keywords Speculations · Operational semantics · Distributed systems · Fault tolerance · Transactions

C. Tapus (B) · J. Hickey
Computer Science Department, California Institute of Technology, Pasadena, USA
e-mail: [email protected]

J. Hickey
e-mail: [email protected]

1 Introduction

Building safe and reliable programs is a difficult and important problem. It is even more challenging in the context of distributed environments because of the need for potentially complex synchronization operations in the presence of process and network failures. Ideally, when a process reaches a synchronization point it should be possible to allow it to continue executing by assuming that the synchronization succeeded, even if the rest of the processes involved in the computation have not yet reached that state. We call this kind of optimistic computation a speculation. Processes start to speculate as soon as they pass a synchronization point and, if it is later determined that a process in the computation has failed, they roll back to a common point and continue their execution.

This paper presents the operational semantics of a new programming model that provides reliability and fault tolerance using the notion of speculative execution. A speculation, or speculative execution, defines a computation that is based on an assumption whose verification may be delayed. If the assumption is later verified, the speculation block is committed and the execution of the program is continued as expected. If the assumption is false, the speculation is aborted, and the computation is rolled back and started on a different path of execution, as described by the speculative execution block. In our implementation, speculations use lightweight checkpointing to be able to perform the rollback of processes. Processes use a copy-on-write mechanism to back up their address space in memory, rather than saving the entire state of the system on stable storage, and similar mechanisms are used for shared data. The speculative execution model we propose uses programming language primitives to define the speculative blocks, and it also defines the interaction between speculations in distributed environments as well as the actions that are taken when distributed speculations are started, committed or aborted. Besides providing reliability and fault tolerance, speculations can also be used to improve performance of certain network protocols and distributed environments [31].

Speculations share many traits with traditional distributed transactions [10], which are a commonly used construct to provide reliability and fault-tolerance in distributed databases. Traditional transactions guarantee the well-known ACID properties: atomicity, consistency, isolation and durability. Speculations do not provide isolation, and allow other processes in the system to observe the actions performed inside a speculation. Due to the lack of isolation, if other processes depend on the values generated by a process while inside a speculation they will be required to join that process's speculation and be rolled back together in case of a failure.

The focus of this paper is on the semantics of simple speculative execution in the context of distributed environments. The rest of the paper is structured as follows. Section 2 discusses some related projects that have studied the use of transactions to provide reliability and fault-tolerance. The syntax of the language constructs used to write speculative programs is defined in Sect. 3.2. Section 3 presents the operational semantics for the speculative programming model, and is followed by a discussion of a nonspeculative, nondeterministic programming model similar to nondeterministic Turing machines (Sect. 4). The proof of equivalence of the two models is addressed in Sect. 5, and the paper is concluded with a brief overview of future research avenues emerging from this work.

2 Related work

The related work can be broadly classified in three major categories. Firstly, there is a vast area of research that covers database transactions [10], distributed transactions and their adoption in compilers, operating systems, and hardware. Secondly, we can find significant work in using checkpointing and recovery as tools for fault-tolerance and software reliability. Thirdly, there is work on optimistic execution (or speculative execution) that focuses on how hardware and software systems can use short-lived speculative blocks to improve their performance. We discuss each of these categories below, providing specific examples and comparisons with our system.

2.1 Transactions

Database transactions, known for their ACID (Atomic, Consistent, Isolated, Durable) properties, are a powerful concept in providing reliability and fault-tolerance. They have been studied both from a theoretical point of view, and through application to various areas of computer science where reliability is desired, like programming languages, filesystems, and shared memory systems.

2.1.1 Theory of transactions

While database transactions have been studied extensively [10], an operational semantics describing the behavior of speculations that could be applied to other domains has only recently been provided [25]. In their approach, Prinz and Thalheim discuss ACID transactions and do not take into consideration any relaxation of the properties.

Black et al. [2] provide a very interesting equational theory of various types of transactions. They discuss lightweight transactions that deviate from traditional transactions by relaxing either the consistency or the durability property [2]. Their work is particularly interesting since they provide a theoretical analysis of how lightweight transactions may be nested, or composed through either parallel or sequential composition. The theory they provide is presented using an equational calculus that has limited expressiveness, as it only analyzes actions and does not capture state. Furthermore, in their approach isolation is implicit.

Non-isolated transactions have been studied in the context of long-lived transactions that can hold on to database resources for long periods of time, delaying the termination of other transactions. Garcia-Molina and Salem [9] introduced the concept of a saga, which is a non-isolated, non-atomic transaction formed of a set of smaller isolated, atomic transactions and a set of compensating transactions that undo the actions of the smaller transactions if those have to be rolled back. Recent work by Bruni et al. [3] provides a description of four different compensation policies used in the context of parallel sagas. Relaxing isolation creates complex dependencies between transactions, which is why this has not seen the same level of exposure as other lightweight transactions.

Moss [21] introduced the notion of nested transactions in his Ph.D. thesis. His model is similar in many respects to the nested speculations model we presented. The main difference remains that speculations are not isolated, which allows them to create distributed dependencies in the system.

Chothia and Duggan [7] introduce a family of calculi that use logs to specify application-specific protocols for fault tolerance in applications distributed over wide area networks where the network topology is exposed to the application. Their work follows the same goal as ours, in that it tries to establish a family of programming languages that define consensus and provide mechanisms for fault resilience. However, their approach is different, as follows: First, they define a process as a "simple, assembly language, concurrent program", while we introduce high-level programming language concepts. Second, they use logs to record actions and to undo actions in case of an aborted conclave (a collection of processes), but specify that upon abort, a process inside the conclave will execute, presumably undoing some of the effects of the aborted conclave. We try to provide stronger guarantees by restoring the exact state of the process to what it was before it started a speculation. We use checkpoints instead of logs and we try to provide a semantics that is closer to a real implementation rather than an abstraction of the concepts we introduce.

A transactional system with shared objects where several processes may belong to the same transaction and share temporary data is studied by Busi and Zavattaro [5]. They present a serializable semantics of transactions in JavaSpace. In our approach we study the propagation of speculations in a dynamic distributed environment, rather than discussing transactional inter-process communication for only a static set of processes. We also analyze the behavior of the system in the presence of implicitly speculative processes that do not have a say in the outcome of the speculation, but that may have to roll back in case the speculation is aborted.

Danos and Krivine present an extension of CCS, called RCCS, that distinguishes reversible and irreversible actions in the formalization of the notion of transactions and incorporates a mechanism for distributed backtracking that guarantees the correctness of their approach.

The extension of Join calculus, CJoin, introduced by Bruni et al. [4] provides a very interesting result in the effort to lay the foundations for global computing. Their concept of contract or negotiation describes the consensus-reaching mechanism required by a system that uses cooperating processes. Each process has a compensation associated with it in case its execution fails. While the concepts of contract and speculation have common points, they are different in several respects. First, the speculation is strict in the definition of the compensation associated with a process in that it assumes that the compensation is restoring the process state to the exact state it had before speculating. We achieve such strict guarantees by describing a mechanism for generating compensations based on lightweight checkpointing. We present high-level programming language constructs that can be applied to existing applications and describe the behavior of such systems in a dynamic distributed environment. Furthermore, we consider implicitly speculative processes whose failure does not affect the outcome of the speculation they belong to.

Our concept of speculations and traditional database transactions share many traits, but they are distinct in one significant way: speculations do not provide isolation. Thus, processes executing inside speculations expose their actions to the outside world and can absorb other processes in their speculation. We believe that it is a powerful mechanism that, if used properly, can have a significant positive impact on the performance of applications and on their reliability.

Furthermore, we push the concept of non-isolated transactions from databases to programming languages and distributed environments. This requires redefining its semantics to the new domain it is used in, which we do by specifying several models of speculative execution in the form of operational semantics.

2.1.2 Transactional systems

We discuss applications of transactions to programming languages, filesystems and shared memory. Transactions are used in all these domains primarily for their ability to provide atomicity and isolation.

Transactional shared memory and programming languages. Software transactional memory was introduced in a seminal paper by Herlihy [13] in which he proposes a new methodology for constructing wait-free and non-blocking implementations of concurrent objects. This area has seen a significant amount of work. Recent research enables automatic conversion of correct sequential programs to concurrent code that does not use locking or other complex synchronization mechanisms that are prone to deadlocks and errors [20].

The Plurix operating system is implemented on top of a transactional distributed shared memory infrastructure [33]. It integrates transactions with optimistic synchronization mechanisms to guarantee sequential consistency. The Plurix operating system is purely transactional, in that the running unit is not a process but a transaction. This makes the Plurix operating system behave like a distributed database system.

In the Venari project, Haines et al. [11] implement a transaction mechanism as part of Standard ML, utilizing a mutation log produced by a generational garbage collector to implement undoability.

Harris and Keir provide a conditional critical regions implementation in Java. The implementation also supports transactional memory and atomic execution blocks [12]. Their approach has been very successful in moving away from locks and condition variables in writing concurrent applications. Unlike speculative execution, their transactional memory considers only isolated atomic blocks that could be evaluated in parallel.

Support for transactions in hardware. Software transactional memory and hardware implementations of similar atomic primitives have evolved in parallel. One of the early works describes an architecture with support for transactions [14]. The authors introduce transactional primitives for accessing memory, including the following: load-transactional, load-transactional-exclusive, store-transactional, commit, abort, and validate. They use a simulation environment and show the advantages that transactional memory has over traditional locking in terms of performance.


Other architectures with the same flavor have been suggested. Most of them set a limit on the size of the operations inside each transaction. Ananian et al. [1] introduce an architecture for supporting unbounded transactional memory. Their approach addresses the issue of having transactions that are larger than the CPU cache, for which they devise special mechanisms to enable rollback. However, the life of transactions supported in hardware is a few orders of magnitude shorter than that which we introduced through speculative execution. Furthermore, the traditional transactional memory supports transactions that have the ACID properties. Hardware support for transactions does not consider distributed dependencies or distributed rollback.

Another interesting topic in the area of transactional memory support in hardware is building mechanisms that can identify, at runtime, lock-protected critical sections in programs and execute them without actually acquiring the lock [27].

The speculative execution in our system provides a programming model that extends optimistic computation to distributed environments.

2.2 Checkpoint and recovery

Another area of research that is related to speculations is using checkpointing and rollback mechanisms to provide recoverability. Both theoretical and practical works have discussed various approaches and protocols that enable distributed applications to recover in a consistent state based on saved checkpoints. This is achieved by either optimistic or pessimistic message logging, and by coordinated or uncoordinated checkpoints. The goal of checkpoint and recovery algorithms is to provide distributed applications with mechanisms to survive failures and roll back their state to a previously globally consistent state. They usually assume the existence of stable storage that survives failures.

Strom and Yemini introduced in [29] the notion of optimistic recovery. They define it as a technique based on dependency tracking, which avoids the domino effect while allowing the computation, the checkpointing and the "committing" to proceed asynchronously. Their approach requires the analysis of existing checkpoints and computing a global safe recovery line. This can be done in either a centralized [17] or a distributed [28] manner. Furthermore, their approach assumes that upon rollback the execution continues on the same path as before, hence messages that have been logged since the checkpoint can be replayed. This is true for most of the subsequent work based on optimistic logging [17,22]. This approach is incompatible with speculative execution since processes may take a different execution path when the speculation is aborted.

The "Virtual Time" [16] paper introduces a mechanism for optimistic speculative execution in a distributed system. Processes exchange messages and assume that the messages they receive are in order. In case this assumption is violated the computation is rolled back, the messages in the receive queue are reordered and the computation continues by processing messages in the newly provided order. Processes are implicitly forced to checkpoint regularly and the rollback mechanism can be cascading. The specification of the system requires the state of each process to be saved after each send or receive operation. On rollback, certain messages have to be "annihilated" using "anti-messages".

While speculations are similar to the concept of lookahead-rollback introduced by the TimeWarp [16] mechanism, we extend the concept by allowing both explicit and implicit speculations through programming language extensions. We also introduce shared objects as part of the speculative model.

Other projects, like Condor [19], CRAK [34] or Score [30] support only heavyweight checkpointing and recovery mechanisms. Furthermore, none of these systems present a formalized operational semantics of their checkpoint/recovery mechanism.

The Rx [26] system uses checkpointing and rollback to enable applications to survive software bugs. It regularly saves checkpoints of running applications and, in case of process failure, it rolls back the process to one of the previously saved checkpoints. It performs various modifications to the environment in which the application is running, like padding newly allocated buffers to prevent buffer overflows, and it re-starts the program from the saved checkpoint. This mechanism proved to be efficient in combating failures due to certain race conditions and buffer overflows. The limitation of the Rx system is that it operates solely on isolated applications. A similar mechanism that would enable the recovery of distributed applications could be implemented using speculations instead of the traditional checkpoint and rollback mechanism.

The main differences between this area of research and our approach are:

– speculations can provide programs with alternate execution paths upon rollback,

– speculations are lightweight checkpoints that are stored in memory and can be coupled with a real checkpointing mechanism for increased reliability, and

– we expose speculations as programming language primitives that have a semantics closer to that of transactions than that of checkpoints.

In our implementation of speculation we do use mechanisms similar to those designed for checkpointing/rollback systems, like the protocol designed by Damani and Garg [8], to ensure safe recovery lines in case of distributed speculation rollback.


2.3 Speculative execution

Concepts of optimistic execution similar to speculations are used to address optimizations of I/O operations [6], fault-tolerant networking [32], shared memory systems [18] and also to increase the performance of processors [24]. By introducing programming language primitives we extend the usability of speculations to a wider range of applications.

One of the most recent systems, BlueFs [23], uses speculations to improve the performance of NFS. It relies on a kernel-level implementation of speculations and speculative actions based on internal primitives similar to those introduced by us in [31]. As mentioned before, our approach pushes speculative primitives to user level and provides an implementation that is able to handle distributed speculations and distributed rollback, making it more generic and more widely applicable. Furthermore, our system is based on a strong formal semantics which increases the confidence in our implementation.

The angelic nondeterminism [15] concept introduced by Hoare has a semantics that is similar to speculative execution. It defines nondeterminism (P Q) as the "execution of both P and Q concurrently until the environment chooses an event which is possible for one but not the other." This implementation of nondeterminism has a high cost in terms of efficiency. In our speculative model we optimize the angelic nondeterminism implementation by setting a higher preference for one of the two execution branches, based on the assumption that we make. This permits a more efficient implementation. Furthermore, we consider communicating processes and the effects of rollback (abort) on the state of the entire distributed system.

3 Model for a speculative distributed objects system

We present a speculative distributed objects system model consisting of processes and shared objects. Shared objects store values that can be accessed (read or written) by any process in the system. They are the vehicles that propagate speculations through the system.

Processes execute programs and start speculations by executing the speculate call. After a speculation is started, we say that the speculation is active until a commit or an abort call is executed. We say that a process is executing inside a speculation if the process's program is executed as part of a speculative computation. An object becomes part of a speculation, or it is involved in a speculation, if a process that is inside a speculation accesses the object. A process becomes absorbed in a speculation if it reads data from a speculative object. The merger of two speculations is defined as the operation by which two speculations, initiated by different processes, change their speculation identifier to a shared, common one. From that point on the new identifier is used to refer to either of the two initial speculations.

A process can be executing within at most one speculation at any given time. In other words, a process cannot start a speculation while it executes inside another speculation. A process is allowed to access multiple objects and an object can be accessed by multiple processes.

We say that a process or an object belongs to a speculation if it started that speculation or if it was absorbed in the speculation. A shared object belongs to at most one speculation at any given time.
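The bookkeeping implied by these definitions can be sketched in a few lines of Python. This is only an illustration, not the semantics defined later in the paper: the class World, its fields, and the method names are hypothetical, and two behaviors are assumptions of the sketch rather than statements from the text, namely that a merger is triggered when a speculative process accesses an object belonging to a different speculation, and that a write by a nonspeculative process to a speculative object has no effect on membership.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class World:
    """Hypothetical bookkeeping of speculation membership (illustration only)."""
    # speculation id -> ids of the processes that co-own it
    owners: Dict[str, Set[str]] = field(default_factory=dict)
    # process id / object id -> speculation id (absent means nonspeculative)
    proc_spec: Dict[str, str] = field(default_factory=dict)
    obj_spec: Dict[str, str] = field(default_factory=dict)

    def speculate(self, pid: str, new_sid: str) -> str:
        current = self.proc_spec.get(pid)
        if current is not None:
            if pid in self.owners[current]:
                raise RuntimeError("nested speculations are not allowed")
            # An absorbed process that speculates explicitly becomes a co-owner.
            self.owners[current].add(pid)
            return current
        self.owners[new_sid] = {pid}
        self.proc_spec[pid] = new_sid
        return new_sid

    def merge(self, keep: str, drop: str) -> str:
        """Rename speculation `drop` to `keep`, the shared common identifier."""
        self.owners[keep] |= self.owners.pop(drop)
        for table in (self.proc_spec, self.obj_spec):
            for key, sid in table.items():
                if sid == drop:
                    table[key] = keep
        return keep

    def access(self, pid: str, oid: str, is_read: bool) -> None:
        p = self.proc_spec.get(pid)
        o = self.obj_spec.get(oid)
        if p is not None and o is not None and p != o:
            self.merge(p, o)              # assumption: merger happens here
        elif p is not None and o is None:
            self.obj_spec[oid] = p        # object becomes involved
        elif p is None and o is not None and is_read:
            self.proc_spec[pid] = o       # reader is absorbed, not an owner
```

In this sketch an absorbed process appears in proc_spec but not in owners, which mirrors the distinction between belonging to a speculation and co-owning it.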

3.1 Examples

Consider the simple example of a bank transfer performed at an automatic teller machine (ATM). Translating the operations of the ATM to our speculative language is illustrated in Fig. 1. The two accounts involved in the transfer, Acc1 and Acc2, are shared objects in the distributed environment.

When a transfer operation is initiated at the ATM, a speculation is started under the assumption that the transfer will be successful. The amount found in each account is read by the ATM, and if there are enough funds in the first account, then the requested sum is transferred to the second account. If the transfer would generate a negative balance, the speculation is aborted and the transfer operation fails. If any of the read or write operations performed by the ATM program fail due to unexpected reasons, the speculation is also aborted. The abort of the speculation restores the amounts found in the two accounts to their original values.

This example shows the natural way in which speculative execution provides fault-tolerance. If anything goes wrong inside the speculation, the ATM fails, or there are not enough funds in the accounts, the speculation is aborted and the state of the entire system is reverted to a safe state. It also illustrates how the speculative programming model simplifies the program by eliminating the recovery code and highlighting the computation, as compared to the traditional approach. It is important to mention that, unlike software transactional memory, speculations do not try to replace existing synchronization mechanisms. For example, each account accessed by the transfer function can be protected by a global lock to serialize concurrent accesses to it.

    speculate(
        let v1 = read(Acc1) in
        let v2 = read(Acc2) in
        if (v1 > sum) then
            write(v1 − sum, Acc1);
            write(v2 + sum, Acc2);
            commit()
        else
            abort()
      ⊕
        print "Transfer failed!"
    )

Fig. 1 Atomic transfer of the amount sum from account Acc1 to Acc2
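The behavior of the transfer can be imitated in ordinary Python, as a rough single-process sketch. It makes several assumptions that are not part of the paper: shared objects are entries of a dictionary, the checkpoint is a deep copy standing in for the copy-on-write mechanism, returning normally from the commit branch plays the role of commit(), and the names speculate, transfer, and Abort are hypothetical.

```python
import copy


class Abort(Exception):
    """Raised inside the commit branch to request an explicit abort."""


def speculate(store, commit_branch, abort_branch):
    # Checkpoint the shared state; a deep copy stands in for copy-on-write.
    checkpoint = copy.deepcopy(store)
    try:
        # Assumption: returning normally is treated as commit().
        return commit_branch(store)
    except Exception:
        # Explicit abort or unexpected failure: restore the checkpoint
        # in place, then take the abort branch.
        store.clear()
        store.update(checkpoint)
        return abort_branch(store)


def transfer(store, amount):
    def commit_branch(s):
        v1 = s["Acc1"]
        v2 = s["Acc2"]
        if v1 > amount:              # same guard as in Fig. 1
            s["Acc1"] = v1 - amount
            s["Acc2"] = v2 + amount
            return "ok"
        raise Abort()

    def abort_branch(s):
        return "Transfer failed!"

    return speculate(store, commit_branch, abort_branch)
```

Calling transfer({"Acc1": 100, "Acc2": 50}, 30) updates both balances, while a request larger than the balance of Acc1 leaves the dictionary exactly as it was. The sketch does not model distribution or absorption of other processes; it only shows why no hand-written recovery code is needed in the commit branch.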

Another interesting example considers the interaction between different active entities (processes) in the system while they perform speculative operations. We use a reservation system to illustrate this. A client needs to reserve both a plane ticket and a hotel room subject to time and financial constraints. The client speculates that she will be able to satisfy her constraints and contacts a flight agent for the airfare, and a hotel for lodging. The programs in pseudo-code with speculative constructs for the client and the agents are shown in Fig. 2. We use message passing with shared communication channels between the processes in this example. It is important to note that channels can be implemented with shared objects, but we prefer to illustrate this example using communication channels because it is more intuitive in this context.

Figure 3 shows the speculative dependencies for a successful reservation. It involves operations like absorbing nonspeculative entities in speculations and merging speculations, as described below.

The client speculates that she will be able to secure the hotel and the flight for her trip subject to the constraints she has (speculation s). Then she requests quotes from both the flight agent and the hotel. The hotel receives the request and it is absorbed in the client's speculation. It successfully processes it and sends the quote to the client.

The flight agent receives the request and it becomes implicitly involved in the client's speculation as well. It checks its local information and it speculates (explicitly this time) that the information is accurate and that it will be confirmed by the airline. This explicit speculation makes the flight agent a co-owner of the client's speculation, since in this model we do not allow nested speculation and the agent is already implicitly speculative. This means that the flight agent will have decision power on the outcome of speculation s, whereas the hotel does not. If either of the two assumptions, the client's or the flight agent's, turns out to be false then speculation s will be aborted. However, if both are validated, the speculation will be committed.

    Client:
        speculate
            request-hotel
            request-flight
            get-quotes
            if expensive then
                abort()
            else
                wait confirmation
                if all-OK then
                    pay for services
                    commit()
                else abort()
        ⊕
            try different agents

    Flight Agent:
        receive-request
        check-reservations
        speculate
            send price-quote
            check with airline
            if unavailable then
                abort()
            else
                send-confirmation
                get payment
                commit()
        ⊕
            do nothing

    Hotel:
        receive-request
        check-availability
        if room then
            send price-quote
            get payment
        else
            send NO-ROOM

Fig. 2 Speculative programs for reservation system

Fig. 3 The reservation succeeds

The flight agent computes a quote based on its local information and sends the quote to the client. Meanwhile it receives final confirmation from the airline and it commits its speculation. What this means is that the outcome of the speculation is now to be decided by the client only.

When the client receives the speculative quote from the flight agent she verifies the prices against the allocated budget and issues payments to both the hotel and the flight agent. Next she commits the speculation. This fully commits the speculation on all three entities.

Figure 4 illustrates the behavior of the system in case of an aborted speculation. The first set of events is identical with the previous case. The flight agent, however, receives a report from the airline that the flight is full, so it aborts speculation s. This triggers a universal abort of speculation s, forcing both the client and the hotel to abort the speculation locally. This is true even if the client reaches a point where it commits the speculation locally. The commit is postponed until all co-owners (in this case the client and the flight agent) decide the outcome of the speculation based on the evaluation of their assumption. After receiving the quotes from the flight agent and the hotel, the client decides the prices are too high and aborts its speculation. This triggers the rollback of both the client and the flight agent.

Fig. 4 The reservation fails because the airline's database differs from that of the agent's
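The decision rule used in the two reservation scenarios (any co-owner may abort, and the commit takes effect only once every co-owner has committed) can be sketched as a small Python class. The names SharedSpeculation, Outcome, add_owner, commit, and abort are hypothetical, and the sketch assumes that requests from processes without decision power are simply ignored, which the text does not specify.

```python
from enum import Enum
from typing import Iterable, Set


class Outcome(Enum):
    ACTIVE = "active"
    COMMITTED = "committed"
    ABORTED = "aborted"


class SharedSpeculation:
    """Hypothetical tracker for the outcome of one speculation."""

    def __init__(self, owners: Iterable[str], absorbed: Iterable[str] = ()):
        # Co-owners that have not yet committed locally.
        self.pending: Set[str] = set(owners)
        # Every process that must roll back if the speculation aborts.
        self.members: Set[str] = set(self.pending) | set(absorbed)
        self.outcome = Outcome.ACTIVE

    def add_owner(self, pid: str) -> None:
        # An absorbed process that speculates explicitly becomes a co-owner.
        if self.outcome is Outcome.ACTIVE:
            self.members.add(pid)
            self.pending.add(pid)

    def commit(self, pid: str) -> Outcome:
        # A local commit is postponed until all co-owners have committed.
        if self.outcome is Outcome.ACTIVE and pid in self.pending:
            self.pending.remove(pid)
            if not self.pending:
                self.outcome = Outcome.COMMITTED
        return self.outcome

    def abort(self, pid: str) -> Set[str]:
        # Only a co-owner that has not committed can abort; the result is
        # the set of processes that have to roll back.
        if self.outcome is Outcome.ACTIVE and pid in self.pending:
            self.outcome = Outcome.ABORTED
            return set(self.members)
        return set()
```

With the client as initial owner and the hotel and flight agent absorbed, calling add_owner("agent") followed by commit("agent") leaves the speculation active until the client also commits, whereas abort("agent") after a local commit by the client still rolls back all three processes.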

3.2 Overview of the language

The terms of the language are defined in Fig. 5. The base language (L) can be any language that does not have a read/write interface for shared objects. We extend it with speculative constructs and with a special call for accessing shared objects.

The speculative construct speculate(e1 ⊕ e2) defines a speculation. In the speculative mode, the program executes e1. If it executes an abort(), the speculation is aborted and the process rolls back and executes e2. If a commit() is encountered in e1, the process will never roll back its state to take the e2 branch. We refer to e1 as the "commit" branch, and to e2 as the "abort" branch.

In this paper we are only considering non-nested speculations, thus we do not allow constructs of the form: speculate(e1; speculate(e2 ⊕ e3) ⊕ e4). We do allow constructs of the form: speculate(e1 ⊕ e2; speculate(e3 ⊕ e4)), where a new speculation can be started after the initial speculation has been either committed or aborted.

The let v = read(o j ) in e[v] construct assigns the valueof shared object o j to variable v, which is bound in e. Thewrite operation is represented by write(x, o j ). It stores thevalue x in shared object o j .

The syntactically valid terms presented above can besequenced using the “;” separator.

Construct                          Description
i ::= L                            The base language
    | speculate(e1 ⊕ e2)           Speculate call
    | commit()                     Commit call
    | abort()                      Abort call
    | let v = read(o_j) in e[v]    Read the value of object o_j
    | write(x, o_j)                Write value x to object o_j
e ::= i                            Instruction
    | i ; e                        Sequencing

Fig. 5 Syntactically valid terms
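The grammar of Fig. 5 can be sketched as a small abstract syntax. The following is only an illustration: the class names, the tuple encoding of sequencing, and the function is_non_nested are assumptions of this sketch, not part of the paper. The check is deliberately conservative and rejects any speculate that appears syntactically inside a commit branch.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Base:
    """An instruction of the base language L, kept opaque in this sketch."""
    name: str


@dataclass(frozen=True)
class Commit:
    """commit()"""


@dataclass(frozen=True)
class Abort:
    """abort()"""


@dataclass(frozen=True)
class Read:
    """let v = read(o_j) in e[v]; the body e[v] is the rest of the sequence."""
    var: str
    obj: str


@dataclass(frozen=True)
class Write:
    """write(x, o_j)"""
    value: object
    obj: str


@dataclass(frozen=True)
class Speculate:
    """speculate(e1 ⊕ e2): e1 is the commit branch, e2 the abort branch."""
    commit_branch: Tuple["Instr", ...]
    abort_branch: Tuple["Instr", ...]


Instr = Union[Base, Commit, Abort, Read, Write, Speculate]


def is_non_nested(prog: Tuple[Instr, ...], in_commit_branch: bool = False) -> bool:
    """Conservative syntactic check: no speculate inside a commit branch."""
    for ins in prog:
        if isinstance(ins, Speculate):
            if in_commit_branch:
                return False
            if not is_non_nested(ins.commit_branch, True):
                return False
            # The abort branch runs after rollback, outside the speculation.
            if not is_non_nested(ins.abort_branch, False):
                return False
    return True
```

With this encoding, the disallowed form speculate(e1; speculate(e2 ⊕ e3) ⊕ e4) is rejected, while speculate(e1 ⊕ e2; speculate(e3 ⊕ e4)) is accepted.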

3.3 Terminology and notation

The notation used in this section is illustrated through a set of sample rules similar to those used in the operational semantics.

The operational semantics defines a reduction system that operates on states. If a distributed system in state ∆_i reduces in one step to state ∆′_i we write ∆_i ⇒ ∆′_i. The meaning of ∆_i ⇒ ∆′_i is that the state of one, and only one, process reduces as part of the reduction step. The state of a process reduces to another state either because the process executes an instruction in its program, or because it finds itself in a state that requires an implicit action that modifies its state, regardless of the program that it has to execute. Formally, this is written as shown in rule Simple-Sample-Rule below.

Simple-Sample-Rule

  s : {p_i}; Σ  ‡  o_j : [⟨s_O, V′⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  i; e] ...
⇒ s : {p_i}; Σ  ‡  o_j : [⟨s_O, V′⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ′′  e] ...

The state of a distributed system has three components.

– The speculations environment (Σ) is a set of definitions of the form s : S where S is a set of process ids (p_i) of processes co-owning speculation s. We call s the identifier of the speculation.

– The state of shared objects in the system (Θ).

– The state of processes in the system (Π).

Each speculation has a unique identifier that is generated when the speculation is created. Furthermore, we define two constants that may be used to replace speculation identifiers: aborted and committed. The two constants are explicitly used in the operational semantics rules, and it should not be assumed that any one of the s, s_P, or s_O identifiers may take the value of the two constants unless an explicit substitution occurs, as described below. The main usage of these two constants is to direct the behavior of the processes when speculations are aborted and committed. The use of these constants will become clear when we present the operational semantics rules.

The state (P_i) of a speculative process (p_i) is defined by three components:

– a checkpoint (c) (present in the top half of the box),

– a local environment (Γ), and

– the instructions of the program it executes (i; e).


The checkpoint of a process has three components.

– The unique id of the speculation (s) that generated it, along with a flag (fl) that evaluates to own if the process is the “owner” of the speculation, and to client if the process was absorbed in the speculation due to a read/write operation.

– The local environment of the process at the time the process became part of the speculation.

– The program to be executed in case of rollback, the “abort” branch of the speculation.

A process is the owner of a speculation if it started that speculation.

Rule Simple-Sample-Rule illustrates the semantics of sequencing of program instructions. If process p_i has a program defined by the sequencing i; e, then its state may change (its local environment may change) by executing instruction i. The process continues the execution with the rest of the program defined by e.

The state (O_j) of a shared object (o_j) is characterized by the value it stores (V) and by an optional checkpoint, which stores the speculation id (s) along with the value (V′) the object had before entering the speculation. The checkpoint, presented in the upper half of the object’s state box of our graphical representation, is empty if the object is not part of any speculation.

Initially the speculations environment is empty, which we denote by using the notation Σ∅. When a new speculation is started it gets added to the speculations environment. When two speculations merge, they are erased from Σ and the speculation representing the merger is added. When a speculation aborts or commits it is erased from Σ.

Next, we define a set of well-formedness conditions for the states and environments described above.

– The state of a distributed system (∆) is well formed if:

  – for each definition s : S in Σ, with S = {p_i1, . . . , p_ik}, processes p_i1, . . . , p_ik are explicitly speculative and they are involved in speculation s.

  – if a process p_i or an object o_j belongs to speculation s then there is a definition for s in Σ.

We only consider well formed states of the distributed system. These conditions will be preserved by the operational semantics rules that we present.
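The three components of the system state and the two well-formedness conditions can be sketched as follows. This is a minimal illustration under stated assumptions: the class and field names are hypothetical, "explicitly speculative" is read as carrying the own flag for that speculation, and checkpoints holding the constants aborted or committed are not treated as belonging to a speculation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

CONSTANTS = {"aborted", "committed"}


@dataclass
class ProcState:
    env: dict        # local environment Γ
    program: list    # remaining instructions i; e
    # (speculation id, flag "own" or "client", saved environment, abort program)
    checkpoint: Optional[Tuple[str, str, dict, list]] = None


@dataclass
class ObjState:
    value: object    # current value V
    # (speculation id, value V' held before entering the speculation)
    checkpoint: Optional[Tuple[str, object]] = None


@dataclass
class System:
    sigma: Dict[str, Set[str]] = field(default_factory=dict)   # Σ
    theta: Dict[str, ObjState] = field(default_factory=dict)   # Θ
    pi: Dict[str, ProcState] = field(default_factory=dict)     # Π


def well_formed(d: System) -> bool:
    # Every co-owner listed in Σ is explicitly speculative in that speculation.
    for s, owners in d.sigma.items():
        for pid in owners:
            proc = d.pi.get(pid)
            if proc is None or proc.checkpoint is None:
                return False
            if proc.checkpoint[0] != s or proc.checkpoint[1] != "own":
                return False
    # Every speculation referenced by a checkpoint is defined in Σ.
    for state in list(d.pi.values()) + list(d.theta.values()):
        cp = state.checkpoint
        if cp is not None and cp[0] not in CONSTANTS and cp[0] not in d.sigma:
            return False
    return True
```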

Additional notation used by the operational semantics is shown in Table 1.

Table 1 Notation for speculative processes

Π ::= p_1 : P_1 . . . p_n : P_n       Set of states for processes p_1 . . . p_n
Θ ::= o_1 : O_1 . . . o_m : O_m       Set of states for objects o_1 . . . o_m
s : S ::= s : p_l1 . . . p_lq         List of peer processes co-owning speculation s
Σ ::= s_0 : S_0; . . . ; s_k : S_k    Speculations environment

The next sample rule illustrates one of the key operations performed in our rules: the substitution of speculation ids. The substitution operation involves changes throughout the state of the distributed system, as shown below. If the outcome of a speculation is decided or if the speculation changes its identifier, all objects and processes that have checkpoints depending on it have to be notified. For example, if speculation s_P is aborted, committed or replaced with the new identifier s (as a consequence of a merger) then the new state of the system reflects the change by substituting all occurrences of s_P with either aborted, committed, or s as needed. After the substitution is performed, the new state of the system may restrict which actions the process can take. For example, if the speculation was aborted and the speculation identifier was replaced with the aborted constant, then the process would only be able to continue execution by following operational semantics rules that explicitly deal with speculations that have been aborted. This is explained in more detail in the operational semantics rules presented further down in the text.

The notation Σ[s_P] is used if the speculation id s_P occurs in the speculation environment. The substitution of s_P with s is represented by Σ[s]. For simplicity, the reduction rules use the notation for bound speculation ids only in conjunction with the substitution operation. In all the other cases where substitution does not occur the above notation is omitted.

The substitution is also reflected in the states of processes and objects, as shown in the sample rule Substitution-Sample-Rule below.

Substitution-Sample-Rule

  Σ[s_P]  ‡  o_r : [s_P] ...  ‡  p_i : [⟨ ⟩  Γ  e]  p_k : [s_P] ...
⇒ Σ[s]  ‡  o_r : [s] ...  ‡  p_i : [⟨ ⟩  Γ  e]  p_k : [s] ...
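The substitution of speculation ids described above can be sketched as a single pass over the three state components. The dictionary layout (a "checkpoint" entry whose first element is the speculation id) and the function name substitute are assumptions made for this illustration only.

```python
ABORTED, COMMITTED = "aborted", "committed"


def substitute(sigma, objects, processes, old, new):
    """Replace speculation id `old` by `new` throughout the system state.

    `new` is either a speculation id (merger) or one of the two constants.
    """
    owners = sigma.pop(old, set())
    if new not in (ABORTED, COMMITTED):
        # A merger keeps the owners under the new identifier.
        sigma[new] = sigma.get(new, set()) | owners
    for state in list(objects.values()) + list(processes.values()):
        cp = state.get("checkpoint")
        if cp is not None and cp[0] == old:
            state["checkpoint"] = (new,) + tuple(cp[1:])
```

After substitute(sigma, objects, processes, "sP", ABORTED), every checkpoint that referred to "sP" carries the constant aborted, so only the rules written for aborted speculations can match those states.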

The operational semantics rules presented in this section describe the behavior of the system in the case of “well-behaved”, correct programs. All the rules that are omitted in the presentation of the operational semantics can be considered as either taking the process that tries to execute an unexpected instruction to an “error” state or “blocking” it indefinitely. For example, we do not discuss the case when a process that has not started a speculation tries to execute either an abort or a commit statement. Also, the rules that involve aborting or committing a speculation restrict the behavior of the processes that are involved in the speculation, through the use of the aborted and committed constants respectively, until the speculation is fully committed or aborted by each entity involved. Furthermore, the speculative operational semantics rules that we present are defined in such a way that at most one of them may be applied at any time. The only exception to this rule is the Ab-Fail rule that describes the behavior of a process that fails due to external factors, independent of the program that it executes. This rule may be applied at any time and matches any state the process might find itself in.

3.4 Speculate

A process outside any speculation successfully starts a new speculation by using the speculate(e1 ⊕ e2) construct. A fresh speculation identifier (s) is created and added to the speculations environment, and a checkpoint of the process is taken. The checkpoint is labeled with the identifier of the speculation (s) and it becomes part of that process’s state. The process advances with the execution to the next instruction in its program, as specified by e1. If the speculation is successful the code that will be executed is the one specified by program e1; e3. However, if the speculation is aborted the execution is defined by the abort branch of the speculation, specified by program e2; e3. In both cases, there might be a common execution block (e3) that is executed regardless of the outcome of the speculation. This is along the lines of the money transfer example presented in Sect. 3.1.

Spec

  Σ  ‡  Θ  ‡  p_i : [⟨ ⟩  Γ  speculate(e1 ⊕ e2); e3] ...
⇒ s : {p_i}; Σ  ‡  Θ  ‡  p_i : [⟨(s, own), Γ, e2; e3⟩  Γ  e1; e3] ...

There is one more rule involving speculations that we include in our operational semantics for the case when a process that is implicitly absorbed in a speculation (due to speculative I/O, as shown later in rule Read-Spec-Obj) needs to start an explicit speculation. This operation is allowed because a process that is implicitly speculative is not aware of the speculation it belongs to and we want to preserve this property.

The behavior described in rule Spec-Implicit is the following. When an implicitly speculative process initiates a new speculation it becomes a co-owner of the speculation it is a part of and continues execution on the commit branch specified by its explicit speculation. This means that the process will be able to contribute to the outcome of the speculation.

Spec-Implicit

  s : S; Σ  ‡  Θ  ‡  p_i : [⟨(s, client), Γ′, e′⟩  Γ  speculate(e1 ⊕ e2); e3] ...
⇒ s : S ∪ {p_i}; Σ  ‡  Θ  ‡  p_i : [⟨(s, own), Γ′, e′⟩  Γ  e1; e3] ...

It is important to notice that since the process is already in a speculation we do not need to save e2 since, in the case the speculation is aborted, the execution is resumed from the saved checkpoint, which preceded the execution of e2.
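Rules Spec and Spec-Implicit can be sketched as one function over a dictionary-based process state. The state layout, the id generator fresh_id, and the representation of programs as Python lists are assumptions of this illustration, not the paper's definitions.

```python
import itertools

_counter = itertools.count(1)


def fresh_id():
    """Hypothetical generator of fresh speculation identifiers."""
    return f"s{next(_counter)}"


def speculate(sigma, proc, pid, e1, e2, e3):
    """Reduce speculate(e1 ⊕ e2); e3 for process `pid`."""
    if proc["checkpoint"] is None:
        # Spec: fresh id, checkpoint holds the current Γ and the abort branch e2; e3.
        s = fresh_id()
        sigma[s] = {pid}
        proc["checkpoint"] = (s, "own", dict(proc["env"]), e2 + e3)
    else:
        s, flag, saved_env, saved_prog = proc["checkpoint"]
        if flag != "client" or s in ("aborted", "committed"):
            raise ValueError("no rule applies (nested or decided speculation)")
        # Spec-Implicit: become a co-owner; the existing checkpoint is kept, e2 is not saved.
        sigma[s] = sigma[s] | {pid}
        proc["checkpoint"] = (s, "own", saved_env, saved_prog)
    proc["program"] = e1 + e3
    return s
```

Note that the second branch leaves the saved environment and program untouched, which mirrors the remark that an abort resumes from the checkpoint taken when the process was absorbed.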

3.5 Reading from a shared object

The reduction rules for the read operation (let v = read(o) in e[v]) take into consideration whether the object and the process belong to a speculation or not. The local variable v is assigned the value read from the object and becomes part of the process’s local environment.

3.5.1 Both the process and the object are outside speculations

Reading the value of a shared object when neither the object nor the process is part of any speculation is illustrated by the following reduction rule.


Read-NoSpec

  Σ  ‡  o_j : [⟨ ⟩  V] ...  ‡  p_i : [⟨ ⟩  Γ  let v = read(o_j) in e[v]] ...
⇒ Σ  ‡  o_j : [⟨ ⟩  V] ...  ‡  p_i : [⟨ ⟩  Γ, v : V  e[v]] ...

The process continues the execution with the next instruction in its program.

3.5.2 The process is inside a speculation and the object is outside any speculation

When a speculative process reads the value of a shared object that is not part of any speculation it does not absorb the object in its speculation. The behavior of the system is defined as such to prevent the needless propagation of speculations in the distributed system.

Read-Spec-Proc

  Σ  ‡  o_j : [⟨ ⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  let v = read(o_j) in e[v]] ...
⇒ Σ  ‡  o_j : [⟨ ⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ, v : V  e[v]] ...

The process continues the execution with the next instruction in its program.

3.5.3 The process is outside any speculation and the object is inside a speculation

After executing the read operation the process’s state depends on speculative information, so the process is absorbed in the object’s speculation. A checkpoint of the process is created to allow rollback if the speculation is later aborted.

Read-Spec-Obj

  Σ  ‡  o_j : [⟨s, V′⟩  V] ...  ‡  p_i : [⟨ ⟩  Γ  let v = read(o_j) in e[v]] ...
⇒ Σ  ‡  o_j : [⟨s, V′⟩  V] ...  ‡  p_i : [⟨(s, client), Γ, let v = read(o_j) in e[v]⟩  Γ, v : V  e[v]] ...

3.5.4 Both the process and the object are inside the same speculation

The reduction rule is similar to the case when neither the process nor the object were part of any speculation (Rule Read-NoSpec). Only the internal environment of the process changes and the program continues the execution with the next instruction.

Read-Same-Spec

  Σ  ‡  o_j : [⟨s, V′⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  let v = read(o_j) in e[v]] ...
⇒ Σ  ‡  o_j : [⟨s, V′⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ, v : V  e[v]] ...


3.5.5 The process and the object are inside different speculations; speculations are merged

The most interesting case for reading the value of a shared object is when both the process and the object are speculative and they belong to different speculations (s_P and s_O, respectively). After the read operation is performed the state of the process depends on the speculative data stored in the shared object at the time of the access. Since we do not consider nested speculations in this model, the only way to guarantee the process is rolled back if the object’s speculation is aborted is to merge the two speculations. The merger of the two speculations, s_P and s_O, is represented in the operational semantics rule by a substitution operation of s_P and s_O with s, which is the “new” speculation id created from the merger of the initial two speculations. The objects and processes that belong to either of the two speculations become aware of the merger. This operation guarantees that the outcome of the “new” speculation will be broadcast to all interested parties, as described by the abort and the commit rules below.

Read-Spec-Merge

  s_O : S_O; s_P : S_P; Σ  ‡  o_j : [⟨s_O, V′⟩  V]  o_r : [s_O] ... o_m : [s_P]  ‡  p_i : [⟨(s_P, fl), Γ′, e′⟩  Γ  let v = read(o_j) in e[v]]  p_k : [s_O] ... p_q : [s_P]
⇒ s : S_O ∪ S_P; Σ  ‡  o_j : [⟨s, V′⟩  V]  o_r : [s] ... o_m : [s]  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ, v : V  e[v]]  p_k : [s] ... p_q : [s]
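The five read rules can be summarized in one dispatch function. This is a sketch under assumptions: states are plain dictionaries, programs are lists, the read instruction is encoded as the tuple ("read", var, oid), and merged speculations receive ids from a hypothetical counter.

```python
import itertools

_counter = itertools.count(1)
CONSTANTS = ("aborted", "committed")


def spec_of(state):
    cp = state["checkpoint"]
    return cp[0] if cp is not None else None


def merge(sigma, objs, procs, s_a, s_b):
    """Substitute both ids with a new one and join the owner sets."""
    s = f"m{next(_counter)}"
    sigma[s] = sigma.pop(s_a) | sigma.pop(s_b)
    for st in list(objs.values()) + list(procs.values()):
        cp = st["checkpoint"]
        if cp is not None and cp[0] in (s_a, s_b):
            st["checkpoint"] = (s,) + tuple(cp[1:])
    return s


def read(sigma, objs, procs, pid, oid, var, rest):
    """Reduce `let var = read(oid) in rest` for process `pid`."""
    p, o = procs[pid], objs[oid]
    s_p, s_o = spec_of(p), spec_of(o)
    if s_p in CONSTANTS or s_o in CONSTANTS:
        raise ValueError("only the abort/commit rules apply to decided speculations")
    if s_o is not None and s_p is None:
        # Read-Spec-Obj: absorbed as client; rollback re-executes the read.
        p["checkpoint"] = (s_o, "client", dict(p["env"]), [("read", var, oid)] + rest)
    elif s_o is not None and s_p != s_o:
        # Read-Spec-Merge
        merge(sigma, objs, procs, s_p, s_o)
    # Read-NoSpec, Read-Spec-Proc and Read-Same-Spec only extend the environment.
    p["env"][var] = o["value"]
    p["program"] = rest
```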

3.6 Writing data to a shared object

The operational semantics defines different reduction rules to describe the action of writing data to shared objects, depending on whether processes and objects are inside or outside speculations. After a process executes the first instruction defined by the write(V, o_j); e program the value of object o_j is updated with value V and execution of the program continues with e.

3.6.1 Both the process and the object are outside speculations

Writing a value to a shared object, when neither the object nor the process accessing it are part of any speculation is illustrated by the following reduction rule.

Write-NoSpec

  Σ  ‡  o_j : [⟨ ⟩  V] ...  ‡  p_i : [⟨ ⟩  Γ  write(V′, o_j); e] ...
⇒ Σ  ‡  o_j : [⟨ ⟩  V′] ...  ‡  p_i : [⟨ ⟩  Γ  e] ...

3.6.2 The process is inside a speculation and the object is outside any speculation

When a process that speculates writes a value to a shared object that is not part of any speculation the object is absorbed in the process’s speculation. This behavior is different from the one seen for the read operation because the value stored in the shared object is speculative and the object has to become speculative itself. The system creates a checkpoint for the object. The checkpoint stores the value the object had before becoming part of the speculation, which allows it to roll back if the speculation is aborted. The speculation id is also stored as part of the checkpoint. From this moment, the object will absorb processes that read its value into the same speculation.

Write-Spec-Proc

  Σ  ‡  o_j : [⟨ ⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  write(V′, o_j); e] ...
⇒ Σ  ‡  o_j : [⟨s, V⟩  V′] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  e] ...


3.6.3 The process is outside any speculation and the object is inside a speculation

A nonspeculative process that writes to a speculative object extracts the object from the speculation it belongs to. This behavior is expected, because regardless of the outcome of the speculation the object belongs to, the nonspeculative write would have been performed.

Write-Spec-Obj

  Σ  ‡  o_j : [⟨s, V′⟩  V] ...  ‡  p_i : [⟨ ⟩  Γ  write(V′′, o_j); e] ...
⇒ Σ  ‡  o_j : [⟨ ⟩  V′′] ...  ‡  p_i : [⟨ ⟩  Γ  e] ...

3.6.4 Both the process and the object are inside the same speculation

The reduction rule is similar to the case where neither the process nor the object were part of any speculation (Rule Write-NoSpec).

Write-Same-Spec

  Σ  ‡  o_j : [⟨s, V′⟩  V] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  write(V′′, o_j); e] ...
⇒ Σ  ‡  o_j : [⟨s, V′⟩  V′′] ...  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  e] ...

3.6.5 The process and the object are inside different speculations; speculations are merged

The most interesting case for writing an object is when the process and the object are inside different speculations. The two speculations, s_P and s_O, merge and are substituted in all the states of objects and processes composing the distributed system with s, the “new” speculation id created from the merger of the initial two speculations.

Write-Spec-Merge

  s_O : S_O; s_P : S_P; Σ  ‡  o_j : [⟨s_O, V′⟩  V]  o_r : [s_O] ... o_m : [s_P]  ‡  p_i : [⟨(s_P, fl), Γ′, e′⟩  Γ  write(V′′, o_j); e]  p_k : [s_O] ... p_q : [s_P]
⇒ s : S_O ∪ S_P; Σ  ‡  o_j : [⟨s, V′⟩  V′′]  o_r : [s] ... o_m : [s]  ‡  p_i : [⟨(s, fl), Γ′, e′⟩  Γ  e]  p_k : [s] ... p_q : [s]

Again, as in the case of the read operation (Rule Read-Spec-Merge) the merger is known to all the involved processes and objects.
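The write rules differ from the read rules mainly in what happens to the object's checkpoint. A minimal sketch, assuming dictionary-based states and naming a merged speculation by joining the two old ids (a hypothetical convention chosen only to keep the example deterministic):

```python
CONSTANTS = ("aborted", "committed")


def write(sigma, objs, procs, pid, oid, value, rest):
    """Reduce `write(value, oid); rest` for process `pid`."""
    p, o = procs[pid], objs[oid]
    s_p = p["checkpoint"][0] if p["checkpoint"] else None
    s_o = o["checkpoint"][0] if o["checkpoint"] else None
    if s_p in CONSTANTS or s_o in CONSTANTS:
        raise ValueError("only the abort/commit rules apply to decided speculations")
    if s_p is None:
        # Write-NoSpec / Write-Spec-Obj: a nonspeculative write extracts the object.
        o["checkpoint"] = None
    elif s_o is None:
        # Write-Spec-Proc: the object is absorbed and remembers its old value.
        o["checkpoint"] = (s_p, o["value"])
    elif s_o != s_p:
        # Write-Spec-Merge: substitute both ids with the new one everywhere.
        s = f"{s_o}+{s_p}"
        sigma[s] = sigma.pop(s_o) | sigma.pop(s_p)
        for st in list(objs.values()) + list(procs.values()):
            cp = st["checkpoint"]
            if cp is not None and cp[0] in (s_o, s_p):
                st["checkpoint"] = (s,) + tuple(cp[1:])
    # Write-Same-Spec changes only the stored value.
    o["value"] = value
    p["program"] = rest
```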

3.7 Abort a speculation

A speculation can be aborted by the initiating process when the assumption it was based on turns out to be false. Speculations also abort if their initiating process fails due to external factors.


The reduction rules presented in this section are constructed such that they guarantee that when a process or an object becomes aware that the speculation they belong to has been aborted they will roll back their state. Furthermore, they would not be allowed to continue execution inside the aborted speculation. This is ensured by the following. If a reduction rule contains a speculation identifier, like s, s_P, or s_O, it means that the identifier cannot be a constant, like aborted.

3.7.1 Processes and aborted speculations

A process that is inside a speculation that it owns is allowed to abort it. The reduction rule for the abort() call substitutes the id of the aborted speculation with the aborted special constant and rolls back the process to the state it was before entering the speculation. Also, the speculation id is erased from the speculations environment. The substitution operation guarantees that once a speculation has been aborted no other process or object can be absorbed in that speculation, since its id has been replaced by the special constant aborted. The state change is presented below.

Ab-Owner

  s : S; Σ  ‡  o_r : [s] ...  ‡  p_i : [⟨(s, own), Γ′, e′⟩  Γ  abort(); e]  p_k : [s] ...
⇒ Σ  ‡  o_r : [aborted] ...  ‡  p_i : [⟨ ⟩  Γ′  e′]  p_k : [aborted] ...

when p_i ∈ S

Processes can also fail at any time during their execution due to external factors. When a process fails while it executes inside a speculation that it owns, the system aborts the speculation, and the process rolls back and executes the abort branch.

Ab-Fail

  s : S; Σ  ‡  o_r : [s] ...  ‡  p_i : [⟨(s, own), Γ′, e′⟩  Γ  e]  p_k : [s] ...
⇒ Σ  ‡  o_r : [aborted] ...  ‡  p_i : [⟨ ⟩  Γ′  e′]  p_k : [aborted] ...

when p_i ∈ S

If a process is inside a speculation that has been aborted by another process it rolls back and restarts the computation from the point where it was absorbed in the speculation. If the process is the owner of the speculation it rolls back to where it started its own original speculation.

This behavior is illustrated by the next operational semantics rule.

Ab-Proc

  Σ  ‡  Θ  ‡  p_i : [⟨(aborted, fl), Γ′, e′⟩  Γ  e] ...
⇒ Σ  ‡  Θ  ‡  p_i : [⟨ ⟩  Γ′  e′] ...

3.7.2 Objects and aborted speculations

If an object is inside an aborted speculation it rolls back its state to the state saved in the checkpoint.

Ab-Object

  Σ  ‡  o_j : [⟨aborted, V′⟩  V] ...  ‡  Π
⇒ Σ  ‡  o_j : [⟨ ⟩  V′] ...  ‡  Π
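The abort rules split the work in two: the aborting owner marks the speculation as aborted everywhere, and each remaining process or object later rolls itself back. The sketch below illustrates this with dictionary-based states; the function names are hypothetical, and abort_owner covers both Ab-Owner and Ab-Fail since they produce the same state.

```python
ABORTED = "aborted"


def rollback_process(p):
    """Ab-Proc: restore the saved environment and program."""
    s, _flag, saved_env, saved_prog = p["checkpoint"]
    if s != ABORTED:
        raise ValueError("the speculation has not been aborted")
    p["env"], p["program"], p["checkpoint"] = saved_env, saved_prog, None


def rollback_object(o):
    """Ab-Object: restore the value held before the speculation."""
    s, old_value = o["checkpoint"]
    if s != ABORTED:
        raise ValueError("the speculation has not been aborted")
    o["value"], o["checkpoint"] = old_value, None


def abort_owner(sigma, objs, procs, pid):
    """Ab-Owner / Ab-Fail: owner `pid` aborts the speculation it is in."""
    s, flag, _env, _prog = procs[pid]["checkpoint"]
    if flag != "own" or pid not in sigma.get(s, set()):
        raise ValueError("only an owner may abort its speculation")
    del sigma[s]
    # Substitute s with the constant `aborted` in every checkpoint.
    for st in list(objs.values()) + list(procs.values()):
        cp = st["checkpoint"]
        if cp is not None and cp[0] == s:
            st["checkpoint"] = (ABORTED,) + tuple(cp[1:])
    rollback_process(procs[pid])
```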

3.8 Commit a speculation

3.8.1 Processes and committed speculations

Only processes that own a speculation can commit it. If the owner of a speculation commits it while its peers are still executing inside the speculation then it becomes a client of the speculation and continues executing inside the speculation until every one of its peers commits the speculation or one of them executes an abort call. The reduction rule below illustrates this behavior. The flag associated with the speculation changes from own to client.

Comm-Peer

  s : {p_i} ∪ S; Σ  ‡  Θ  ‡  p_i : [⟨(s, own), Γ′, e′⟩  Γ  commit(); e] ...
⇒ s : S; Σ  ‡  Θ  ‡  p_i : [⟨(s, client), Γ′, e′⟩  Γ  e] ...

when S ≠ ∅


If a process is the only owner of a speculation, or if it is the last owner to commit it, then it substitutes the id of the speculation with the committed special constant. The checkpoint is discarded, and the speculation is erased from the speculations environment. The operational semantics rule showing the state change is the following.

Comm-Owner

  s : {p_i}; Σ  ‡  o_r : [s] ...  ‡  p_i : [⟨(s, own), Γ′, e′⟩  Γ  commit(); e]  p_k : [s] ...
⇒ Σ  ‡  o_r : [committed] ...  ‡  p_i : [⟨ ⟩  Γ  e]  p_k : [committed] ...

If a process was absorbed in a speculation that has been fully committed by its owners it can only continue its execution by the following rule, which discards the saved checkpoint and continues the execution of the process outside any speculation. The same rule applies for processes that owned the speculation and committed it before the other owners of it (as per rule Comm-Peer).

Comm-Client

  Σ  ‡  Θ  ‡  p_i : [⟨(committed, client), Γ′, e′⟩  Γ  e] ...
⇒ Σ  ‡  Θ  ‡  p_i : [⟨ ⟩  Γ  e] ...

3.8.2 Objects and committed speculations

When a speculation is committed, the objects absorbed in the speculation have to discard their saved checkpoint. The following reduction rule illustrates this behavior.

Comm-Obj

  Σ  ‡  o_j : [⟨committed, V′⟩  V] ...  ‡  Π
⇒ Σ  ‡  o_j : [⟨ ⟩  V] ...  ‡  Π
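The commit rules can be sketched in the same style: an owner that commits early becomes a client, and the last owner to commit marks the speculation as committed for everyone. The dictionary layout and function names are assumptions of this illustration.

```python
COMMITTED = "committed"


def commit(sigma, objs, procs, pid, rest):
    """Reduce `commit(); rest` for owner `pid` (Comm-Peer or Comm-Owner)."""
    s, flag, saved_env, saved_prog = procs[pid]["checkpoint"]
    if flag != "own" or pid not in sigma.get(s, set()):
        raise ValueError("only an owner may commit its speculation")
    remaining = sigma[s] - {pid}
    if remaining:
        # Comm-Peer: other owners are still undecided; stay inside as a client.
        sigma[s] = remaining
        procs[pid]["checkpoint"] = (s, "client", saved_env, saved_prog)
    else:
        # Comm-Owner: last owner; substitute s with the constant `committed`.
        del sigma[s]
        for st in list(objs.values()) + list(procs.values()):
            cp = st["checkpoint"]
            if cp is not None and cp[0] == s:
                st["checkpoint"] = (COMMITTED,) + tuple(cp[1:])
        procs[pid]["checkpoint"] = None
    procs[pid]["program"] = rest


def discard_checkpoint(state):
    """Comm-Client / Comm-Obj: drop the checkpoint of a committed speculation."""
    if state["checkpoint"][0] != COMMITTED:
        raise ValueError("the speculation has not been committed")
    state["checkpoint"] = None
```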

4 Nonspeculative model

In this section we define a nonspeculative model for the speculative constructs introduced in Sect. 3.2. We show how the execution of speculative programs is equivalent to the execution of programs that use this nonspeculative, nondeterministic model. This may allow us to reason about properties of speculative programs by using existing tools that can reason about the nonspeculative model. We use the speculative model in Sect. 3 for the equivalence proof that is presented in Sect. 5.

The nonspeculative model discussed in this section uses processes and shared objects in a way similar to that of the speculative model. An object is characterized by the value V it stores. Processes execute programs. A program is defined by its current environment Γ, and the expression e to be reduced.

The expression e is a sequence of statements. Statements include the speculative programming constructs presented in Sect. 3.2.

The state of the distributed system is composed of the state of all processes (Π_NS) and the state of all objects (Θ_NS) in the system. We use a diagram representation of the system state, where objects are shown in the upper half, and processes in the lower half of the block corresponding to the state.

We use a graphical representation of the model similar to the one used in the speculative model.

A sample rule, showing the two components of the distributed system state is illustrated below.

NS-Sample

  o_j : [V]; ...  ‡  p_i : [Γ  i; e2] ...
⇒NS o_j : [V]; ...  ‡  p_i : [Γ′  e2] ...

4.1 Nonspeculative operational semantics

When the process encounters a speculate call, it may choose nondeterministically to take either the commit or the abort branch. This behavior is illustrated by Rules NS-Spec-C-br and NS-Spec-A-br below.

For the purpose of proving the equivalence of this model to the speculative model presented in Sect. 3, we introduce two history variables that are local to each process’s environment and to each object. There is one history variable, labeled S, that monitors whether the object or the process is currently inside a speculation. The S variable points to a list of process identifiers, L = {p_i, . . . , p_k}, that own the speculation that the object or the process belongs to. If the process or object is nonspeculative the history variable points to an empty list. The second history variable, labeled A, maintains information about the state of the process or the object before it became part of the speculation. If they are nonspeculative the history variable does not have any meaning. For processes it refers to the abort branch that is ignored by the program, while for objects the A history variable refers to the value of the object before it became absorbed in the speculation. These history variables are not accessible to programs. For simplicity we omit these history variables in the description of the rules below when they are either irrelevant or empty and they do not cause any confusion.

NS-Spec-C-br

  ...  ‡  p_i : [Γ, S : ∅  speculate(e1 ⊕ e2); e3] ...
⇒NS ...  ‡  p_i : [Γ, S : {p_i}, A : (Γ e2; e3)  e1; e3] ...

NS-Spec-Implicit-C-br

  ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  speculate(e1 ⊕ e2); e3] ...
⇒NS ...  ‡  p_i : [Γ, S : L ∪ {p_i}, A : (Γ′ e′)  e1; e3] ...

when p_i ∉ L

NS-Spec-A-br

  ...  ‡  p_i : [Γ  speculate(e1 ⊕ e2); e3] ...
⇒NS ...  ‡  p_i : [Γ  e2; e3] ...

NS-Spec-Implicit-A-br

  ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  speculate(e1 ⊕ e2); e3] ...
⇒NS ...  ‡  p_i : [Γ  e′] ...

when p_i ∉ L

The commit operation modifies the history variables. The process continues the execution, as shown in the following reduction rules.

NS-Commit

  ...  ‡  p_i : [Γ, S : {p_i}, A : (Γ′ e′)  commit(); e] ...
⇒NS ...  ‡  p_i : [Γ  e] ...

If the process is the only owner of the speculation then the history variables are cleared. Otherwise, the variables are updated by removing the process from the list of owners, as described below.

NS-Commit-Peer

  ...  ‡  p_i : [Γ, S : {..., p_i, ...}, A : (Γ′ e′)  commit(); e]  p_k : [S : {..., p_i, ...}] ...
⇒NS ...  ‡  p_i : [Γ, S : {...}, A : (Γ′ e′)  e]  p_k : [S : {...}] ...


Reading the value of a shared object is defined by the following reduction rules. Local variable v is assigned the value of object o_j and becomes part of the local environment of the process. As in the speculative model we distinguish several cases that differ only in how the history variables (invisible to the user) are modified. As far as the real environment and the state of the process are concerned, all five rules below are identical.

NS-Read-NoSpec

  o_j : [V]; ...  ‡  p_i : [Γ  let v = read(o_j) in e[v]] ...
⇒NS o_j : [V]; ...  ‡  p_i : [Γ, v : V  e[v]] ...

NS-Read-NoObjSpec

  o_j : [V]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  let v = read(o_j) in e[v]] ...
⇒NS o_j : [V]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′), v : V  e[v]] ...

NS-Read-ObjSpec

  o_j : [V, S : L, A : V′]; ...  ‡  p_i : [Γ, S : ∅  let v = read(o_j) in e[v]] ...
⇒NS o_j : [V, S : L, A : V′]; ...  ‡  p_i : [Γ, S : L, A : (Γ let v = read(o_j) in e[v]), v : V  e[v]] ...

NS-Read-SameSpec

  o_j : [V, S : L, A : V′]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  let v = read(o_j) in e[v]] ...
⇒NS o_j : [V, S : L, A : V′]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′), v : V  e[v]] ...

NS-Read-MergeHistSpec

  o_j : [V, S : L1, A : V′]; o_l : [V, S : L2, A : V′′] ...  ‡  p_i : [Γ, S : L2  let v = read(o_j) in e[v]]  p_k : [S : L1] ... p_q : [S : L2]
⇒NS o_j : [V, S : L1 ∪ L2, A : V′]; o_l : [V, S : L1 ∪ L2, A : V′′] ...  ‡  p_i : [Γ, S : L1 ∪ L2, v : V  e[v]]  p_k : [S : L1 ∪ L2] ... p_q : [S : L1 ∪ L2]

Writing a value to a shared object is defined by the following reduction rules. The new value, as described by the write operation, is stored in the object and the program executing the write continues its execution. Again, we distinguish five different rules that differ only in how the history variables are modified. The effect of the write operation on the real state of the system is identical in all five rules.

NS-Write

  o_j : [V]; ...  ‡  p_i : [Γ  write(V′, o_j); e] ...
⇒NS o_j : [V′]; ...  ‡  p_i : [Γ  e] ...

NS-Write-NoObjSpec

  o_j : [V]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  write(V′′, o_j); e] ...
⇒NS o_j : [V′′, S : L, A : V]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  e] ...

NS-Write-ObjSpec

  o_j : [V, S : L, A : V′]; ...  ‡  p_i : [Γ, S : ∅  write(V′′, o_j); e] ...
⇒NS o_j : [V′′]; ...  ‡  p_i : [Γ, S : ∅  e] ...

NS-Write-SameSpec

  o_j : [V, S : L, A : V′]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  write(V′′, o_j); e] ...
⇒NS o_j : [V′′, S : L, A : V′]; ...  ‡  p_i : [Γ, S : L, A : (Γ′ e′)  e] ...

NS-Write-MergeHistSpec

  o_j : [V, S : L1, A : V′]; o_l : [V′′′′, S : L2, A : V′′′] ...  ‡  p_i : [Γ, S : L2  write(V′′, o_j); e]  p_k : [S : L1] ... p_q : [S : L2]
⇒NS o_j : [V′′, S : L1 ∪ L2, A : V′]; o_l : [V′′′′, S : L1 ∪ L2, A : V′′′] ...  ‡  p_i : [Γ, S : L1 ∪ L2  e]  p_k : [S : L1 ∪ L2] ... p_q : [S : L1 ∪ L2]

There is no rule for the abort operation, since rollback is not defined in the nonspeculative model. If a nonspeculative process tries to execute an abort operation it is considered to have taken an invalid execution branch and is said to be “stuck.”

In the nonspeculative operational semantics processes may choose the abort branch when they encounter a speculate instruction.
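The nondeterministic choice at speculate, together with the treatment of abort as a stuck state, can be illustrated for a single process with no shared objects. This sketch ignores the history variables S and A, and its tuple encoding of instructions is an assumption of the example.

```python
def ns_runs(prog, done=()):
    """Return the set of completed runs of `prog` in the nonspeculative model.

    A run is the tuple of base instructions executed; runs that reach abort()
    are stuck and therefore not reported.
    """
    if not prog:
        return {done}
    head, rest = prog[0], prog[1:]
    kind = head[0]
    if kind == "speculate":
        _, e1, e2 = head
        # NS-Spec-C-br and NS-Spec-A-br: explore both branches.
        return ns_runs(e1 + rest, done) | ns_runs(e2 + rest, done)
    if kind == "abort":
        return set()  # no rule applies: the process is stuck
    if kind == "commit":
        return ns_runs(rest, done)
    return ns_runs(rest, done + (head[1],))
```

For the program speculate(a; abort() ⊕ b); c the only completed run executes b and then c, which matches what the speculative model produces after rolling back the commit branch.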

5 Equivalence of the speculative and nonspeculative versions of the distributed objects system model

5.1 Algebraic representation of the operational semantics rules

For the purpose of a clear and concise presentation of the definitions, theorems and proofs in this section we resort to an algebraic representation of the operational semantics rules presented in Sects. 3 and 4.

Figure 6 presents the three different states that we encounter in our speculative model and their equivalent algebraic representations. The state of a process, p_i, will be referred to using the symbol P_i, while the state of an object, o_j, is represented by O_j. The set of processes in the system that are not explicitly shown in any given rule are referred to using the symbol Π, while the equivalent symbol for objects is Θ.

Process’s state:  p_i : [⟨(s, fl), Γ, e⟩  Γ  e]
Object’s state:   o_j : [⟨s, V⟩  V]
System’s state:   ∆ ::= Σ ‡ Θ ‡ Π

Fig. 6 Graphical and algebraic representation of speculative states

Figure 7 presents the three different types of states we encounter in the nonspeculative model and their equivalent algebraic representations. A similar notation as in the case of the speculative model is used to refer to the states without explicitly showing their components. The state of a process will be referred to using notation P_NS, while the state of an object is represented by O_NS. The set of processes in the system that are not shown in any given rule are referred to using the symbol Π_NS, while the equivalent symbol for objects is Θ_NS.

Process’s state:  p_i : [Γ  e]
Object’s state:   o_j : [V]
System’s state:   ∆_NS ::= Θ_NS ‡ Π_NS

Fig. 7 Graphical and algebraic representation of nonspeculative states

5.2 Definitions and abstractions

Definition 5.1 The reduction trace of a distributed system is defined as the sequence of distributed system states given by the reduction rules that are applied during the execution of the programs. The states need not be distinct.

\[
\overrightarrow{\Delta} ::= \Delta_0 \Rightarrow \cdots \Rightarrow \Delta_n
\]

We call the transition $\Delta_i \Rightarrow \Delta_i$ a no-op. The definition of the reduction trace refers only to single reduction steps. We also introduce the notation $\Rightarrow^{*}$ for one or more reduction steps. Definition 5.1 can also be written as:

\[
\overrightarrow{\Delta} ::= \Delta_0 \Rightarrow^{*} \Delta_n
\]

Definition 5.2 An oracle, $\mathcal{O}$, is defined as a mapping from speculations (ids) to either commit or abort, and provides the outcome of each speculation.

The oracles, as defined above, allow speculations to be mapped to commit or abort regardless of the actual outcome of the speculation. We introduce a relation between oracles and reduction traces that guarantees that oracles do not contradict the reduction rules composing a trace. For example, if a trace contains the rule Ab-Owner involving speculation $s$, then the oracle should not predict the outcome of speculation $s$ as commit.

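Definition 5.3 The set of valid oracles, $\Omega(\overrightarrow{\Delta})$, for a given reduction trace, $\overrightarrow{\Delta}$, is defined as follows:

\[
\Omega(\overrightarrow{\Delta}) = \left\{\, \mathcal{O} \in \Big( \big( \textstyle\bigcup_{i=0}^{n} \text{active speculations of state } i \big) \to \{a, c\} \Big) \;\middle|\;
\begin{array}{l}
\forall\, \Delta_{i-1}, \Delta_i \in \overrightarrow{\Delta}, \\
\text{if } \Delta_{i-1} \stackrel{\text{merge } s_1, s_2 \text{ into } s_3}{\Longrightarrow} \Delta_i \text{ then } \mathcal{O}(s_1) = \mathcal{O}(s_2) = \mathcal{O}(s_3) \\
\text{if } \Delta_{i-1} \stackrel{\text{abort } s}{\Longrightarrow} \Delta_i \text{ then } \mathcal{O}(s) = a \\
\text{if } \Delta_{i-1} \stackrel{\text{commit } s}{\Longrightarrow} \Delta_i \text{ then } \mathcal{O}(s) = c. \\
\text{Furthermore, if speculation } s \text{ is neither committed} \\
\text{nor aborted in } \overrightarrow{\Delta} \text{ then } \mathcal{O}(s) = c.
\end{array}
\right\}
\]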

There are two special constants used for committed or aborted speculations. Any valid oracle extends over these two constants.

Definition 5.4 The definition of a valid oracle extends over the constants aborted and committed as follows:

\[
\forall\, \mathcal{O} \in \Omega(\overrightarrow{\Delta}),\quad \mathcal{O}(\text{aborted}) = a \ \text{ and } \ \mathcal{O}(\text{committed}) = c
\]
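Definitions 5.2 to 5.4 can be read as a check on a dictionary. The sketch below is only an illustration under stated assumptions: a trace is abstracted to a list of labelled events (`("merge", s1, s2, s3)`, `("abort", s)`, `("commit", s)`), the names `is_valid_oracle` and `extend` are hypothetical, and a speculation that was merged into another one is treated as sharing the fate of its merge target when the "neither committed nor aborted" clause is applied.

```python
from typing import Dict, List, Set, Tuple

Oracle = Dict[str, str]   # speculation id -> "c" (commit) or "a" (abort)
Event = Tuple[str, ...]   # ("merge", s1, s2, s3), ("abort", s) or ("commit", s)

def is_valid_oracle(oracle: Oracle, events: List[Event], active: Set[str]) -> bool:
    """Check the conditions of Definition 5.3 against the labelled steps of a trace."""
    if set(oracle) != active or any(v not in ("a", "c") for v in oracle.values()):
        return False
    parent: Dict[str, str] = {}   # merged speculation -> speculation it merged into
    resolved: Set[str] = set()    # speculations explicitly committed or aborted
    for ev in events:
        if ev[0] == "merge":
            _, s1, s2, s3 = ev
            if not (oracle[s1] == oracle[s2] == oracle[s3]):
                return False
            for s in (s1, s2):
                if s != s3:
                    parent[s] = s3
        elif ev[0] == "abort":
            if oracle[ev[1]] != "a":
                return False
            resolved.add(ev[1])
        elif ev[0] == "commit":
            if oracle[ev[1]] != "c":
                return False
            resolved.add(ev[1])

    def root(s: str) -> str:
        while s in parent:
            s = parent[s]
        return s

    # Speculations whose fate is never decided in the trace must map to commit.
    return all(oracle[s] == "c" for s in active if root(s) not in resolved)

def extend(oracle: Oracle) -> Oracle:
    """Extend an oracle over the two special constants (Definition 5.4)."""
    return {**oracle, "aborted": "a", "committed": "c"}
```

For instance, with events `[("merge", "s1", "s2", "s3"), ("abort", "s3")]`, an oracle is accepted only if it maps `s1`, `s2` and `s3` to `"a"` and any other active speculation to `"c"`.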

Next we define a conversion from speculative states into nonspeculative ones. For clarity we use both a graphical representation and its equivalent algebraic notation throughout this section.

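Definition 5.5 The action of lowering ($\Downarrow(\cdot)$) a speculative state to a nonspeculative state, given an oracle $\mathcal{O}$ and a speculations environment $\Sigma$, is a function from speculative states to nonspeculative states defined as follows:

\[
\begin{aligned}
\Downarrow(o_j : O) &= \Downarrow(o_j : [\, \langle\,\rangle \mid V \,]) = o_j : [\, V \,] \\
\Downarrow(o_j : O) &= \Downarrow(o_j : [\, \langle s, V' \rangle \mid V \,]) =
\begin{cases}
o_j : [\, V,\ S : L_s,\ A : V' \,] & \text{if } \mathcal{O}(s) = c \\
o_j : [\, V' \,] & \text{if } \mathcal{O}(s) = a
\end{cases} \\
\Downarrow(p_i : P) &= \Downarrow(p_i : [\, \langle\,\rangle \mid \Gamma \vdash e \,]) = p_i : [\, \Gamma \vdash e \,] \\
\Downarrow(p_i : P) &= \Downarrow(p_i : [\, \langle (s, fl), \Gamma', e' \rangle \mid \Gamma \vdash e \,]) =
\begin{cases}
[\, \Gamma,\ S : L_s,\ A : \Gamma' \vdash e' \ \vdash e \,] & \text{if } \mathcal{O}(s) = c \\
[\, \Gamma' \vdash e' \,] & \text{if } \mathcal{O}(s) = a
\end{cases} \\
\Downarrow(\Theta) &= \Downarrow(o_1 : O_1 \ldots o_m : O_m) = \Downarrow(o_1 : O_1) \ldots \Downarrow(o_m : O_m) \\
\Downarrow(\Pi) &= \Downarrow(p_1 : P_1 \ldots p_n : P_n) = \Downarrow(p_1 : P_1) \ldots \Downarrow(p_n : P_n) \\
\Downarrow(\Delta) &= \Downarrow(\Sigma \ddagger \Theta \ddagger \Pi) = \Downarrow(\Theta) \ddagger \Downarrow(\Pi)
\end{aligned}
\]

The lowering function over a reduction trace $\overrightarrow{\Delta}$ is defined given an oracle $\mathcal{O}$ in $\Omega(\overrightarrow{\Delta})$.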

The lowering operation strips the checkpoint of a speculative state in order to create an equivalent nonspeculative state. The result of the lowering operation depends on the chosen oracle.

Furthermore, in this definition we assume the existence of a function that maps the speculation identifier ($s$) of a speculation from the speculative model to the list of processes that own that speculation ($L_s$) in the nonspeculative model. From the construction of the speculations environment ($\Sigma$) this translation is straightforward.
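The case analysis of Definition 5.5 can be sketched in a few lines. The Python below is a hedged illustration, not the paper's formalization: the class names (`SpecObject`, `SpecProcess`, `NSState`), the opaque string expressions, and the `owners_of` dictionary (standing in for the assumed map from $s$ to $L_s$) are all choices made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Optional, Tuple

@dataclass(frozen=True)
class SpecObject:
    value: Any                                    # current value V
    checkpoint: Optional[Tuple[str, Any]] = None  # (s, V'), or None for the empty checkpoint

@dataclass(frozen=True)
class SpecProcess:
    env: Tuple[Tuple[str, Any], ...]              # environment as a tuple of bindings
    expr: str                                     # expression e, kept opaque here
    checkpoint: Optional[Tuple[str, str, Tuple, str]] = None  # (s, fl, saved env, saved expr)

@dataclass(frozen=True)
class NSState:
    """Nonspeculative object or process state with optional history variables."""
    current: Any                                  # V for objects, (env, expr) for processes
    owners: Optional[FrozenSet[str]] = None       # history variable S (the list L_s)
    alt: Any = None                               # history variable A

def lower_object(o: SpecObject, oracle: Dict[str, str],
                 owners_of: Dict[str, FrozenSet[str]]) -> NSState:
    if o.checkpoint is None:
        return NSState(o.value)
    s, saved = o.checkpoint
    if oracle[s] == "c":
        # Commit: keep the current value and remember the checkpoint as history.
        return NSState(o.value, owners_of[s], saved)
    return NSState(saved)  # abort: fall back to the checkpointed value

def lower_process(p: SpecProcess, oracle: Dict[str, str],
                  owners_of: Dict[str, FrozenSet[str]]) -> NSState:
    if p.checkpoint is None:
        return NSState((p.env, p.expr))
    s, _flag, saved_env, saved_expr = p.checkpoint
    if oracle[s] == "c":
        return NSState((p.env, p.expr), owners_of[s], (saved_env, saved_expr))
    return NSState((saved_env, saved_expr))  # abort: resume from the checkpoint
```

Lowering the same `SpecObject(2, ("s", 1))` under the two possible oracles gives a state carrying history variables in the commit case and the bare checkpointed value `1` in the abort case, which mirrors the two branches of the definition.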


Lemma 5.1 Given any valid oracle, $\mathcal{O}$, the result of applying $\Downarrow(\cdot)$ to any well-formed speculative state, $O$, $P$, $\Theta$, $\Pi$, or $\Delta$, is a well-formed nonspeculative state, $O^{NS}$, $P^{NS}$, $\Theta^{NS}$, $\Pi^{NS}$, or $\Delta^{NS}$, respectively.

Proof The lemma follows directly from Definition 5.5.

We can also show that the operation of substituting the special constants committed and aborted for any speculation $s$, as defined in Sect. 3, is consistent with the definition of lowering a speculative state.

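Lemma 5.2 For any valid oracle, $\mathcal{O}$, the following properties hold.

\[
\Downarrow(\Theta[s]) =
\begin{cases}
\Downarrow(\Theta[\text{committed}]) & \text{if } \mathcal{O}(s) = c \\
\Downarrow(\Theta[\text{aborted}]) & \text{if } \mathcal{O}(s) = a
\end{cases}
\]

\[
\Downarrow(\Pi[s]) =
\begin{cases}
\Downarrow(\Pi[\text{committed}]) & \text{if } \mathcal{O}(s) = c \\
\Downarrow(\Pi[\text{aborted}]) & \text{if } \mathcal{O}(s) = a
\end{cases}
\]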

Proof This lemma immediately follows from the definition of $\Downarrow(\cdot)$ (Definition 5.5), the extended definition of oracles (Definition 5.4), and the definition of substitution presented in Sect. 3.

We also define the inverse of the lowering function, which we call lifting ($\Uparrow(\cdot)$) a nonspeculative state to a speculative state.

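Definition 5.6 Lifting a nonspeculative state to a speculative state is a function from nonspeculative states to speculative states defined as follows:

\[
\begin{aligned}
\Uparrow(o_j : O^{NS}) &= \Uparrow(o_j : [\, V \,]) = o_j : [\, \langle\,\rangle \mid V \,] \\
\Uparrow(o_j : O^{NS}) &= \Uparrow(o_j : [\, V,\ S : L,\ A : V' \,]) = o_j : [\, \langle s, V' \rangle \mid V \,] \\
\Uparrow(p_i : P^{NS}) &= \Uparrow(p_i : [\, \Gamma \vdash e \,]) = p_i : [\, \langle\,\rangle \mid \Gamma \vdash e \,] \\
\Uparrow(p_i : P^{NS}) &= \Uparrow(p_i : [\, \Gamma,\ S : L,\ A : e' \ \vdash e \,]) =
\begin{cases}
p_i : [\, \langle (L, \text{own}), \Gamma, e' \rangle \mid \Gamma \vdash e \,] & \text{if } p_i \in L \\
p_i : [\, \langle (L, \text{client}), \Gamma, e' \rangle \mid \Gamma \vdash e \,] & \text{if } p_i \notin L
\end{cases} \\
\Uparrow(\Theta^{NS}) &= \Uparrow(o_1 : O^{NS}_1 \ldots o_m : O^{NS}_m) = \Uparrow(o_1 : O^{NS}_1) \ldots \Uparrow(o_m : O^{NS}_m) \\
\Uparrow(\Pi^{NS}) &= \Uparrow(p_1 : P^{NS}_1 \ldots p_n : P^{NS}_n) = \Uparrow(p_1 : P^{NS}_1) \ldots \Uparrow(p_n : P^{NS}_n) \\
\Uparrow(\Delta^{NS}) &= \Uparrow(\Theta^{NS} \ddagger \Pi^{NS}) = \Sigma \ddagger \Uparrow(\Theta^{NS}) \ddagger \Uparrow(\Pi^{NS}),
\end{aligned}
\]

where $\Sigma$ contains definitions of speculations from $\Uparrow(\Theta^{NS})$ and $\Uparrow(\Pi^{NS})$.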

In certain cases, when the history variables exist in the local environment of a process, the lifting function creates a non-empty checkpoint in the speculative process state. This is required to keep track of whether a process is still inside a speculation. The proofs of the equivalence theorems discuss in more detail when this lifting rule is applied.

The definition of the lifting operation uses the straightforward mapping of the list of owners of a speculation ($L$) from the nonspeculative model to a speculation identifier ($s$) in the speculative model. The straightforward mapping is the identity function that names the speculation by the identities of its owners, as we can have at most one speculation per process at any given time. Furthermore, the components of the distributed nonspeculative state must have consistent history variables in order to be able to build a consistent speculations environment. The speculations environment ($\Sigma$) is reconstructed from the history variables.
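The lifting cases of Definition 5.6 admit an equally small sketch. As before, this is an illustration under assumptions: states are plain tuples, the function names `lift_object` and `lift_process` are hypothetical, and the speculation identifier is taken to be the owner set itself, following the identity mapping described above. Reconstruction of the speculations environment is left out.

```python
from typing import Any, FrozenSet, Optional

def lift_object(value: Any, owners: Optional[FrozenSet[str]] = None, saved: Any = None):
    """Return (checkpoint, value); checkpoint is None for the empty checkpoint."""
    if owners is None:
        return (None, value)
    s = owners  # the speculation is named by the identities of its owners
    return ((s, saved), value)

def lift_process(pid: str, env: Any, expr: Any,
                 owners: Optional[FrozenSet[str]] = None, alt_expr: Any = None):
    """Return (checkpoint, env, expr) for process pid."""
    if owners is None:
        return (None, env, expr)
    # The flag records whether pid owns the speculation or merely joined it.
    flag = "own" if pid in owners else "client"
    return (((owners, flag), env, alt_expr), env, expr)
```

A process listed in the owner set is lifted with the `own` flag, and any other process carrying the same history variables is lifted as a `client`.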

Lemma 5.3 Given any valid oracle, $\mathcal{O}$, the result of applying $\Uparrow(\cdot)$ to any well-formed nonspeculative state, $O^{NS}$, $P^{NS}$, $\Theta^{NS}$, $\Pi^{NS}$, or $\Delta^{NS}$, is a well-formed speculative state, $O$, $P$, $\Theta$, $\Pi$, or $\Delta$, respectively.

Proof The lemma follows directly from Definition 5.6.

5.3 Equivalence theorems

The proof of equivalence between the two models could also be done using a simulation relation. However, showing that the speculative model has a nonspeculative reduction trace requires, in case of an abort, discarding states and replacing them with no-ops in the simulation. This is not natural and can be avoided by using oracles.

Theorem 5.1 For any speculative reduction rule $\Delta_{i-1} \Rightarrow \Delta_i$ and for any valid oracle $\mathcal{O}$, there is a sequence of nonspeculative reduction rules such that

\[
\Downarrow(\Delta_{i-1}) \Rightarrow^{*}_{NS} \Downarrow(\Delta_i).
\]

Proof The proof of this theorem is done by examining each possible rule defined in the speculative operational semantics.

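Rule Spec.

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= \Downarrow(\Sigma \ddagger \Theta \ddagger [\, \langle\,\rangle \mid \Gamma \vdash \text{speculate}(e_1 \oplus e_2); e_3 \,] ;\ \Pi) \\
&= \Downarrow(\Theta) \ddagger \Downarrow([\, \langle\,\rangle \mid \Gamma \vdash \text{speculate}(e_1 \oplus e_2); e_3 \,] ;\ \Pi) \\
&= \Theta^{NS} \ddagger [\, \Gamma \vdash \text{speculate}(e_1 \oplus e_2); e_3 \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= \Downarrow(\Sigma \ddagger \Theta \ddagger [\, \langle (s, \text{own}), \Gamma, e_2; e_3 \rangle \mid \Gamma \vdash e_1; e_3 \,] ;\ \Pi) \\
&= \Downarrow(\Theta) \ddagger \Downarrow([\, \langle (s, \text{own}), \Gamma, e_2; e_3 \rangle \mid \Gamma \vdash e_1; e_3 \,] ;\ \Pi)
\end{aligned}
\]

Now, given the outcome of speculation $s$, the evaluation of $\Downarrow(\Delta_i)$ is as follows.

– if $\mathcal{O}(s) = c$ then

\[
\Downarrow(\Delta_i) = \Theta^{NS} \ddagger [\, \Gamma,\ S : p_i,\ A : \Gamma \vdash e_2; e_3 \ \vdash e_1; e_3 \,] ;\ \Pi^{NS}
\]

– if $\mathcal{O}(s) = a$ then

\[
\Downarrow(\Delta_i) = \Theta^{NS} \ddagger [\, \Gamma \vdash e_2; e_3 \,] ;\ \Pi^{NS}
\]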


This means that reduction rule Spec is equivalent to either the NS-Spec-C-br or the NS-Spec-A-br reduction rule from the nonspeculative operational semantics, depending on the chosen oracle.

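Rule Read-Spec-Obj. Since this rule involves speculations, the lowering process has to take into account the outcome of the speculation, as illustrated in Definition 5.5.

– if $\mathcal{O}(s) = c$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V,\ S : L,\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma \vdash \text{let } v = \text{read}(o_j) \text{ in } e[v] \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V,\ S : L,\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L,\ A : \Gamma \vdash \text{let } v = \text{read}(o_j) \text{ in } e[v],\ v : V \ \vdash e[v] \,] ;\ \Pi^{NS}
\end{aligned}
\]

– if $\mathcal{O}(s) = a$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma \vdash \text{let } v = \text{read}(o_j) \text{ in } e[v] \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma \vdash \text{let } v = \text{read}(o_j) \text{ in } e[v] \,] ;\ \Pi^{NS}
\end{aligned}
\]

The Read-Spec-Obj reduction rule is equivalent to either the NS-Read-ObjSpec nonspeculative rule or to a no-op, depending on the outcome of speculation $s$.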

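Rule Read-Spec-Merge. The two speculations, $s_P$ and $s_O$, merge and a new speculation, $s$, is substituted for them. Since the oracle, $\mathcal{O}$, is valid it must predict the same outcome for all three speculations.

– if $\mathcal{O}(s) = \mathcal{O}(s_P) = \mathcal{O}(s_O) = c$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V,\ S : L_{s_O},\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_{s_P},\ A : \Gamma' \vdash e' \ \vdash \text{let } v = \text{read}(o_j) \text{ in } e[v] \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V,\ S : L_s,\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_s,\ A : \Gamma' \vdash e',\ v : V \ \vdash e[v] \,] ;\ \Pi^{NS}
\end{aligned}
\]

The Read-Spec-Merge reduction rule is equivalent to the NS-Read-MergeHistSpec nonspeculative rule.

– if $\mathcal{O}(s) = \mathcal{O}(s_P) = \mathcal{O}(s_O) = a$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma' \vdash e' \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma' \vdash e' \,] ;\ \Pi^{NS}
\end{aligned}
\]

In this case, the Read-Spec-Merge reduction rule is equivalent to a no-op, since the speculations will be aborted and all the actions performed since entering them will be equivalent to null.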

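Rule Write-Spec-Obj.

– if $\mathcal{O}(s) = c$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V,\ S : L,\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma \vdash \text{write}(V'', o_j); e \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V'' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma \vdash e \,] ;\ \Pi^{NS}
\end{aligned}
\]

– if $\mathcal{O}(s) = a$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma \vdash \text{write}(V'', o_j); e \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V'' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma \vdash e \,] ;\ \Pi^{NS}
\end{aligned}
\]

The Write-Spec-Obj reduction rule is equivalent to either the NS-Write-ObjSpec nonspeculative rule or to the NS-Write rule, depending on the outcome of speculation $s$.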

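Rule Write-Spec-Merge. According to the definition of valid oracles, the oracle that we choose will provide the same outcome for the three speculations present in this rule, $s$, $s_P$, and $s_O$.

– if $\mathcal{O}(s) = \mathcal{O}(s_P) = \mathcal{O}(s_O) = c$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V,\ S : L_{s_O},\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_{s_P},\ A : \Gamma' \vdash e' \ \vdash \text{write}(V'', o_j); e \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V'',\ S : L_s = L_{s_O} \cup L_{s_P},\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_s = L_{s_O} \cup L_{s_P},\ A : \Gamma' \vdash e' \ \vdash e \,] ;\ \Pi^{NS}
\end{aligned}
\]

– if $\mathcal{O}(s) = \mathcal{O}(s_P) = \mathcal{O}(s_O) = a$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= o_j : [\, V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma' \vdash e' \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= o_j : [\, V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma' \vdash e' \,] ;\ \Pi^{NS}
\end{aligned}
\]

The Write-Spec-Merge reduction rule is equivalent to either the NS-Write-MergeHistSpec nonspeculative rule or to a no-op, depending on the outcome of speculation $s$.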

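Rule Comm-Peer. This rule is equivalent to either NS-Commit-Peer or to a no-op, depending on the final outcome of the speculation, as follows. The rule only changes the flag associated with the speculation. Applying the lowering function to both sides of the rule yields the following nonspeculative rules.

– if $\mathcal{O}(s) = c$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= \Theta^{NS} \ddagger [\, \Gamma,\ S : \ldots, p_i, \ldots,\ A : \Gamma' \vdash e' \ \vdash e \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= \Theta^{NS} \ddagger [\, \Gamma,\ S : \ldots,\ A : \Gamma' \vdash e' \ \vdash e \,] ;\ \Pi^{NS}
\end{aligned}
\]

– if $\mathcal{O}(s) = a$ then

\[
\begin{aligned}
\Downarrow(\Delta_{i-1}) &= \Theta^{NS} \ddagger [\, \Gamma' \vdash e' \,] ;\ \Pi^{NS} \\
\Downarrow(\Delta_i) &= \Theta^{NS} \ddagger [\, \Gamma' \vdash e' \,] ;\ \Pi^{NS}
\end{aligned}
\]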


For brevity, the equivalence of the rest of the rules is shown in the table below.

| Speculative rule | Nonspeculative rule |
|---|---|
| Read-NoSpec | NS-Read |
| Read-Same-Spec | NS-Read-SameSpec if speculation committed; no-op if speculation aborted |
| Read-Spec-Proc | NS-Read-NoObjSpec if speculation committed; no-op if speculation aborted |
| Write-Spec-Proc | NS-Write-NoObjSpec if speculation committed; no-op if speculation aborted |
| Write-NoSpec | NS-Write |
| Write-Same-Spec | NS-Write-SameSpec if speculation committed; no-op if speculation aborted |
| Ab-Owner | no-op |
| Ab-Proc | no-op |
| Ab-Fail | no-op |
| Ab-Object | no-op |
| Comm-Owner | NS-Commit |
| Comm-Obj | no-op |

All the abort rules are equivalent to no-ops when $\Downarrow(\cdot)$ is applied to the left and the right hand sides of the rules, since the oracle can only predict that the speculation has been aborted.
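The case analysis of this proof amounts to a lookup from a speculative rule and an oracle outcome to a nonspeculative counterpart. The dictionary below transcribes the correspondences stated in the proof; the encoding itself, the name `ns_counterpart`, and the convention that a plain string means "independent of the oracle" are assumptions of this sketch.

```python
from typing import Dict, Union

# A plain string means the counterpart does not depend on the oracle outcome.
COUNTERPART: Dict[str, Union[str, Dict[str, str]]] = {
    "Spec":             {"c": "NS-Spec-C-br",           "a": "NS-Spec-A-br"},
    "Read-NoSpec":      "NS-Read",
    "Read-Same-Spec":   {"c": "NS-Read-SameSpec",       "a": "no-op"},
    "Read-Spec-Proc":   {"c": "NS-Read-NoObjSpec",      "a": "no-op"},
    "Read-Spec-Obj":    {"c": "NS-Read-ObjSpec",        "a": "no-op"},
    "Read-Spec-Merge":  {"c": "NS-Read-MergeHistSpec",  "a": "no-op"},
    "Write-NoSpec":     "NS-Write",
    "Write-Same-Spec":  {"c": "NS-Write-SameSpec",      "a": "no-op"},
    "Write-Spec-Proc":  {"c": "NS-Write-NoObjSpec",     "a": "no-op"},
    "Write-Spec-Obj":   {"c": "NS-Write-ObjSpec",       "a": "NS-Write"},
    "Write-Spec-Merge": {"c": "NS-Write-MergeHistSpec", "a": "no-op"},
    "Comm-Owner":       "NS-Commit",
    "Comm-Peer":        {"c": "NS-Commit-Peer",         "a": "no-op"},
    "Comm-Obj":         "no-op",
    "Ab-Owner":         "no-op",
    "Ab-Proc":          "no-op",
    "Ab-Fail":          "no-op",
    "Ab-Object":        "no-op",
}

def ns_counterpart(rule: str, outcome: str = "c") -> str:
    """Name the nonspeculative rule (or no-op) matching a lowered speculative rule."""
    entry = COUNTERPART[rule]
    return entry if isinstance(entry, str) else entry[outcome]
```

Mapping each step of a speculative trace through `ns_counterpart` and dropping the no-ops gives the shape of the nonspeculative trace that the theorem promises, although the states themselves still have to be produced by lowering.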

This concludes our proof.

The next theorem relies on the definition of lifting a nonspeculative state to a speculative state (Definition 5.6).

Theorem 5.2 For any nonspeculative reduction rule such that

\[
\Delta^{NS}_{i-1} \Rightarrow_{NS} \Delta^{NS}_i,
\]

there is a sequence of speculative reduction rules such that

\[
\Uparrow(\Delta^{NS}_{i-1}) \Rightarrow^{*} \Uparrow(\Delta^{NS}_i)
\]

Proof The proof is done in a manner similar to the proof of Theorem 5.1, by considering each of the reduction rules in the nonspeculative operational semantics and showing that there exists an appropriate sequence of transformations in the speculative operational semantics that satisfies the condition stated in the theorem.

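Rule NS-Spec-C-br.

\[
\begin{aligned}
\Uparrow(\Delta^{NS}_{i-1}) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma \vdash \text{speculate}(e_1 \oplus e_2); e_3 \,] ;\ \Pi^{NS}) \\
&= \Sigma \ddagger \Theta \ddagger p_i : [\, \langle\,\rangle \mid \Gamma \vdash \text{speculate}(e_1 \oplus e_2); e_3 \,] ;\ \Pi \\
\Uparrow(\Delta^{NS}_i) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma,\ S : p_i,\ A : e_2; e_3 \ \vdash e_1; e_3 \,] ;\ \Pi^{NS}) \\
&= s : p_i ;\ \Sigma \ddagger \Theta \ddagger p_i : [\, \langle (S, \text{own}), \Gamma, e_2; e_3 \rangle \mid \Gamma \vdash e_1; e_3 \,] ;\ \Pi
\end{aligned}
\]

This rule is equivalent to the Spec reduction rule from the operational semantics of the speculative model.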

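Rule NS-Spec-A-br.

\[
\begin{aligned}
\Uparrow(\Delta^{NS}_{i-1}) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma \vdash \text{speculate}(e_1 \oplus e_2); e_3 \,] ;\ \Pi^{NS}) \\
&= \Sigma \ddagger \Theta \ddagger p_i : [\, \langle\,\rangle \mid \Gamma \vdash \text{speculate}(e_1 \oplus e_2); e_3 \,] ;\ \Pi \\
\Uparrow(\Delta^{NS}_i) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma \vdash e_2; e_3 \,] ;\ \Pi^{NS}) \\
&= \Sigma \ddagger \Theta[\text{aborted}] \ddagger p_i : [\, \langle\,\rangle \mid \Gamma \vdash e_2; e_3 \,] ;\ \Pi[\text{aborted}]
\end{aligned}
\]

Lifting the right and the left hand side of this rule presents us with an interesting case. There is no rule that perfectly matches $\Uparrow(\Delta^{NS}_{i-1}) \Rightarrow \Uparrow(\Delta^{NS}_i)$.

However, if we carefully look at the operational semantics for the speculative model we observe that $\Uparrow(\Delta^{NS}_{i-1})$ matches the left side of the Spec reduction rule.

After applying the Spec rule, we get the following state of the system:

\[
\Sigma \ddagger \Theta \ddagger p_i : [\, \langle (s, \text{own}), \Gamma, e_2; e_3 \rangle \mid \Gamma \vdash e_1; e_3 \,] ;\ \Pi
\]

To obtain $\Uparrow(\Delta^{NS}_i)$ from the above state, the speculation has to be aborted, so applying the Ab-Fail reduction rule would take us to the desired state.

In conclusion, this nonspeculative reduction rule is equivalent to a reduction trace composed of two rules.

\[
\Uparrow(\Delta^{NS}_{i-1}) \stackrel{\text{Spec}}{\Longrightarrow} \Delta \stackrel{\text{Ab-Fail}}{\Longrightarrow} \Uparrow(\Delta^{NS}_i).
\]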
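The two-step trace can be replayed on a single process state. The sketch below is an assumption-laden illustration: a process is a tuple `(checkpoint, env, expr)`, expressions are tuples of instructions, `spec` and `ab_fail` are hypothetical names standing for the effect of the Spec and Ab-Fail rules on the owning process only, and the substitution of the aborted constant in the remaining objects and processes is not modelled.

```python
from typing import Any, Dict, Optional, Tuple

Expr = Tuple[Any, ...]
Proc = Tuple[Optional[tuple], Dict[str, Any], Expr]   # (checkpoint, env, expr)

def spec(proc: Proc, s: str) -> Proc:
    """Enter speculation s: checkpoint the alternative branch and run e1."""
    checkpoint, env, expr = proc
    assert checkpoint is None, "at most one speculation per process"
    (tag, e1, e2), rest = expr[0], expr[1:]
    assert tag == "speculate"
    return (((s, "own"), env, e2 + rest), env, e1 + rest)

def ab_fail(proc: Proc) -> Proc:
    """Abort the current speculation: restore the checkpointed env and expression."""
    checkpoint, _env, _expr = proc
    assert checkpoint is not None, "no speculation to abort"
    _spec_id, saved_env, saved_expr = checkpoint
    return (None, saved_env, saved_expr)

start: Proc = (None, {"x": 1}, (("speculate", ("e1",), ("e2",)), "e3"))
after_spec = spec(start, "s")
after_abort = ab_fail(after_spec)
```

Under these assumptions `after_abort` is the process with an empty checkpoint running `e2; e3`, which is the lifted right-hand side of NS-Spec-A-br restricted to the process that speculated.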

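Rule NS-Commit.

\[
\begin{aligned}
\Uparrow(\Delta^{NS}_{i-1}) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma,\ S : p_i,\ A : \Gamma' \vdash e' \ \vdash \text{commit}(); e \,] ;\ \Pi^{NS}) \\
&= s : p_i ;\ \Sigma \ddagger \Theta \ddagger p_i : [\, \langle (s, \text{own}), \Gamma', e' \rangle \mid \Gamma \vdash \text{commit}(); e \,] ;\ \Pi \\
\Uparrow(\Delta^{NS}_i) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma \vdash e \,] ;\ \Pi^{NS}) \\
&= \Sigma \ddagger \Theta \ddagger p_i : [\, \langle\,\rangle \mid \Gamma \vdash e \,] ;\ \Pi
\end{aligned}
\]

This rule is equivalent to the Comm-Owner reduction rule from the operational semantics of the speculative model.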

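Rule NS-Commit-Peer.

\[
\begin{aligned}
\Uparrow(\Delta^{NS}_{i-1}) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma,\ S : \ldots p_i \ldots,\ A : \Gamma' \vdash e' \ \vdash \text{commit}(); e \,] ;\ \Pi^{NS}) \\
&= s : p_i \cup S ;\ \Sigma \ddagger \Theta \ddagger p_i : [\, \langle (s, \text{own}), \Gamma', e' \rangle \mid \Gamma \vdash \text{commit}(); e \,] ;\ \Pi \\
\Uparrow(\Delta^{NS}_i) &= \Uparrow(\Theta^{NS} \ddagger [\, \Gamma,\ S : \ldots,\ A : \Gamma' \vdash e' \ \vdash e \,] ;\ \Pi^{NS}) \\
&= s : S ;\ \Sigma \ddagger \Theta \ddagger p_i : [\, \langle\,\rangle \mid \Gamma \vdash e \,] ;\ \Pi
\end{aligned}
\]

This rule is equivalent to the Comm-Peer reduction rule from the operational semantics of the speculative model.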


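Rule NS-Read-MergeHistSpec.

\[
\begin{aligned}
\Uparrow(\Delta^{NS}_{i-1}) &= \Uparrow\left( o_j : [\, V,\ S : L_{s_O},\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_{s_P},\ A : \Gamma' \vdash e' \ \vdash \text{let } v = \text{read}(o_j) \text{ in } e[v] \,] ;\ \Pi^{NS} \right) \\
&= \Sigma \ddagger o_j : [\, \langle s_O, V' \rangle \mid V \,] ;\ \Theta \ddagger p_i : [\, \langle (s_P, fl), \Gamma', e' \rangle \mid \Gamma \vdash \text{let } v = \text{read}(o_j) \text{ in } e[v] \,] ;\ \Pi \\
\Uparrow(\Delta^{NS}_i) &= \Uparrow\left( o_j : [\, V,\ S : L_s,\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_s,\ A : \Gamma' \vdash e',\ v : V \ \vdash e \,] ;\ \Pi^{NS} \right) \\
&= \Sigma \ddagger o_j : [\, \langle s, V' \rangle \mid V \,] ;\ \Theta \ddagger p_i : [\, \langle (s, fl), \Gamma', e' \rangle \mid \Gamma,\ v : V \vdash e \,] ;\ \Pi
\end{aligned}
\]

where $L_s = L_{s_O} \cup L_{s_P}$.

This rule is equivalent to the Read-Spec-Merge reduction rule from the operational semantics of the speculative model.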
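The effect of the merge on the history variables is a set union. The following sketch shows it for a single object and the reading process only; the tuple encodings and the name `ns_read_merge` are assumptions made for this example, and the update of the other processes and objects that belong to the two merged speculations is deliberately left out.

```python
from typing import Any, Dict, FrozenSet, Tuple

NSObject = Tuple[Any, FrozenSet[str], Any]              # (V, S, A)
NSProcess = Tuple[Dict[str, Any], FrozenSet[str], Any]  # (env, S, A)

def ns_read_merge(obj: NSObject, proc: NSProcess, var: str) -> Tuple[NSObject, NSProcess]:
    """Read obj into variable var and merge the two owner lists."""
    value, obj_owners, saved = obj
    env, proc_owners, alt = proc
    merged = obj_owners | proc_owners   # L_s is the union of the two owner lists
    new_env = dict(env)
    new_env[var] = value                # bind v : V in the process environment
    return (value, merged, saved), (new_env, merged, alt)
```

After the step both the object and the process carry the same owner set, which is what allows the lifting function to name a single speculation for both of them.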

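Rule NS-Write-MergeHistSpec.

\[
\begin{aligned}
\Uparrow(\Delta^{NS}_{i-1}) &= \Uparrow\left( o_j : [\, V,\ S : L_{s_O},\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_{s_P},\ A : \Gamma' \vdash e' \ \vdash \text{write}(V'', o_j); e \,] ;\ \Pi^{NS} \right) \\
&= \Sigma \ddagger o_j : [\, \langle s_O, V' \rangle \mid V \,] ;\ \Theta \ddagger p_i : [\, \langle (s_P, fl), \Gamma', e' \rangle \mid \Gamma \vdash \text{write}(V'', o_j); e \,] ;\ \Pi \\
\Uparrow(\Delta^{NS}_i) &= \Uparrow\left( o_j : [\, V'',\ S : L_s,\ A : V' \,] ;\ \Theta^{NS} \ddagger [\, \Gamma,\ S : L_s,\ A : \Gamma' \vdash e' \ \vdash e \,] ;\ \Pi^{NS} \right) \\
&= \Sigma \ddagger o_j : [\, \langle s, V' \rangle \mid V'' \,] ;\ \Theta \ddagger p_i : [\, \langle (s, fl), \Gamma', e' \rangle \mid \Gamma \vdash e \,] ;\ \Pi
\end{aligned}
\]

where $L_s = L_{s_O} \cup L_{s_P}$.

This rule is equivalent to the Write-Spec-Merge reduction rule from the operational semantics of the speculative model.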

For brevity, the equivalence of the rest of the rules is shown in the table below.

| Nonspeculative rule | Speculative rule |
|---|---|
| NS-Read-NoSpec | Read-NoSpec |
| NS-Read-NoObjSpec | Read-Spec-Proc |
| NS-Read-ObjSpec | Read-Spec-Obj |
| NS-Read-SameSpec | Read-Same-Spec |
| NS-Write | Write-NoSpec |
| NS-Write-NoObjSpec | Write-Spec-Proc |
| NS-Write-ObjSpec | Write-Spec-Obj |
| NS-Write-SameSpec | Write-Same-Spec |

This concludes the proof of this theorem.

To conclude the equivalence proof, we need to show next that for any distributed system, composed of processes and shared objects, as defined in Sect. 3.2, the speculative and the nonspeculative operational semantics defined in Sects. 3 and 4 are equivalent.

To show the equivalence of the two operational semantics we need to show first that for any reduction trace that takes a distributed system from an initial speculative state $\Delta_0$ to a state $\Delta_n$, by only applying reduction rules in the speculative operational semantics, there is an equivalent nonspeculative reduction trace that takes the nonspeculative state $\Delta^{NS}_0$ to state $\Delta^{NS}_m$, using only reduction rules from the nonspeculative operational semantics. Furthermore, the initial and final states have to be equivalent under the lowering and lifting operations.

The converse of the above statement concludes the proof: for any nonspeculative reduction trace there is an equivalent speculative reduction trace such that the initial and the final states are equivalent under the lifting and lowering operations.

Formally, this is summarized by the following theorem.

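Theorem 5.3 The speculative and the nonspeculative operational semantics are equivalent under the lowering and lifting operations, as follows:

\[
\begin{aligned}
&\forall\, \overrightarrow{\Delta} ::= \Delta_0 \Rightarrow \cdots \Rightarrow \Delta_n,\ \forall\, \mathcal{O} \in \Omega(\overrightarrow{\Delta}), \\
&\exists\, \overrightarrow{\Delta}^{NS} ::= \Delta^{NS}_0 \Rightarrow_{NS} \cdots \Rightarrow_{NS} \Delta^{NS}_m, \text{ such that} \\
&\Downarrow(\Delta_0) = \Delta^{NS}_0 \ \wedge\ \Downarrow(\Delta_n) = \Delta^{NS}_m,
\end{aligned}
\]

and

\[
\begin{aligned}
&\forall\, \overrightarrow{\Delta}^{NS} ::= \Delta^{NS}_0 \Rightarrow_{NS} \cdots \Rightarrow_{NS} \Delta^{NS}_m, \\
&\exists\, \overrightarrow{\Delta} ::= \Delta_0 \Rightarrow \cdots \Rightarrow \Delta_n, \text{ such that} \\
&\Uparrow(\Delta^{NS}_0) = \Delta_0 \ \wedge\ \Uparrow(\Delta^{NS}_m) = \Delta_n
\end{aligned}
\]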

Proof The proof of this theorem follows from applying Theorems 5.1 and 5.2 to each reduction rule in the speculative and nonspeculative reduction traces, respectively.
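The first half of the argument can be spelled out as a chain of steps. The display below is a sketch rather than part of the original proof; it assumes that sequences of nonspeculative steps compose, and that a step which lowers to a no-op contributes nothing to the resulting trace.

```latex
\[
\Downarrow(\Delta_0)
\;\Rightarrow^{*}_{NS}\; \Downarrow(\Delta_1)
\;\Rightarrow^{*}_{NS}\; \cdots
\;\Rightarrow^{*}_{NS}\; \Downarrow(\Delta_n),
\qquad
\text{where each link } \Downarrow(\Delta_{i-1}) \Rightarrow^{*}_{NS} \Downarrow(\Delta_i)
\text{ is given by Theorem 5.1.}
\]
```

Taking $\Delta^{NS}_0 = \Downarrow(\Delta_0)$ and $\Delta^{NS}_m = \Downarrow(\Delta_n)$ gives the required nonspeculative trace. The second half is symmetric, with $\Uparrow(\cdot)$ and Theorem 5.2 in place of $\Downarrow(\cdot)$ and Theorem 5.1.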

6 Conclusion and future work

We have presented a model of speculative computation where computations include delayed verification of assumptions. For consistency, we have shown that speculative computations are equivalent to nonspeculative, nondeterministic computations with an oracle.

In this account, each process and object is assigned to at most one speculation, and communication between processes causes speculations to become merged. For long-running speculations, this may be overly pessimistic. As future work, we are investigating models that use nested speculations for processes and shared objects. In addition, we would like to explore a formal account of this work on a logical framework.

In addition, we would like to apply speculations to a wider set of real applications to expose and select frequent well-behaved patterns, and to discover and discourage wrong usage of the primitives. We believe speculations can be used in various ways to tackle the same problem and the goal is to be able to easily find the right way to solve it.

Acknowledgments The authors would like to thank Nathan Gray, David Noblet, and Aleksey Nogin for their careful reading of this paper and for their valuable feedback. We also thank the anonymous reviewers for their thoughtful and helpful comments.

References

1. Ananian, C.S., Asanovic, K., Kuszmaul, B.C., Leiserson, C.E.,Lie, S.: Unbounded transactional memory. In: Proceedings of the11th International Symposium on High-Performance ComputerArchitecture (HPCA’05), San Franscisco, California, pp. 316–327(2005)

2. Black, A.P., Cremet, V., Guerraoui, R., Odersky, M.: An equationaltheory for transactions. In: FST TCS 2003: Foundations of Soft-ware Technology and Theoretical Computer Science, pp. 38–49.Australian Computer Society, Inc., Queensland (2003)

3. Bruni, R., Butler, M.J., Ferreira, C., Hoare, C.A.R., Melgratti, H.C.,Montanari, U.: Comparing two approaches to compensable flowcomposition. In: Abadi, M., de Alfaro L. (eds.) CONCUR. Lec-ture Notes in Computer Science, vol. 3653, pp. 383–397. Springer,Heidelerg (2005)

4. Bruni, R., Melgratti, H.C., Montanari, U.: Nested commits formobile calculi: Extending join. In: Lévy, J.J., Mayr, E.W., Mitchell,J.C. (eds.) IFIP TCS, pp. 563–576. Kluwer, Dordercht (2004)

5. Busi, N., Zavattaro, G.: On the serializability of transactions in sha-red dataspaces with temporary data. In: SAC, pp. 359–366. ACM,New York (2002)

6. Chang, F., Gibson, G.A.: Automatic i/o hint generation throughspeculative execution. In: OSDI ’99: Proceedings of the Third Sym-posium on Operating Systems Design and Implementation (1999)

7. Chothia, T., Duggan, D.: Abstractions for fault-tolerant global com-puting. Theor. Comput. Sci. 322(3), 567–613 (2004)

8. Damani, O.P., Garg, V.K.: How to recover efficiently and asyn-chronously when optimism fails. In: International Conference onDistributed Computing Systems, pp. 108–115 (1996)

9. Garcia-Molina, H., Salem, K.: Sagas. In: SIGMOD ’87: Procee-dings of the 1987 ACM SIGMOD international conference onManagement of data, pp. 249–259. ACM Press, New York (1987).doi:10.1145/38713.38742

10. Gray, J., Reuter, A.: Transaction Processing: Concepts and Tech-niques. Morgan Kaufmann, Menlo Park (1994)

11. Haines, N., Kindred, D., Morrisett, J.G., Nettles, S.M., Wing, J.M.:Composing first-class transactions. ACM Transactions on Pro-gramming Languages and Systems. Short Communication (1994)

12. Harris, T., Fraser, K.: Language support for lightweight transac-tions. In: Object-Oriented Programming, Systems, Languages, andApplications, pp. 388–402 (2003)

13. Herlihy, M.: A methodology for implementing highly concur-rent data structures. In: PPOPP ’90: Proceedings of the secondACM SIGPLAN symposium on Principles & practice of paral-lel programming, pp. 197–206. ACM Press, New York (1990).doi:10.1145/99163.99185

14. Herlihy, M., Moss, J.E.B.: Transactional memory: Architecturalsupport for lock-free data structures. In: Proceedings of the 20thAnnual International Symposium on Computer Architecture, pp.289–300 (1993)

15. Hoare, C.: Communicating Sequential Processes. PrenticeHall, New Jersey (1985)

16. Jefferson, D.R.: Virtual time. ACM Trans. Program. Lang. Syst.7(3), 404–425 (1985). doi:10.1145/3916.3988

17. Johnson, D.B., Zwaenepoel, W.: Recovery in distributed sys-tems using asynchronous message logging and checkpointing. In:PODC, pp. 171–181 (1988)

18. Lai, A.C., Falsafi, B.: Memory sharing predictor: the key to a spe-culative coherent dsm. In: Proceedings of the 26th annual interna-tional symposium on Computer architecture, pp. 172–183. IEEE

Computer Society Press, New York (1999). doi:10.1145/300979.300994

19. Litzkow, M., Tannenbaum, T., Basney, J., Livny, M.: Checkpointand migration of unix processes in the condor distributed proces-sing system. Tech. Rep. 1346. Computer Sciences Department,University of Wisconsin (1997)

20. Marathe, V.J., Scherer III, W.N., Scott, M.L.: Adaptive softwaretransactional memory. In: Proceedings of the 19th InternationalSymposium on Distributed Computing, Cracow, Poland. Earlierbut expanded version available as TR 868, University of RochesterComputer Science Dept., May 2005 (2005)

21. Moss, E.B.: (1981) Nested transactions: An approach to reliabledistributed computing. Tech. rep., Cambridge, MA, USA

22. Neves, N., Castro, M., Guedes, P.: A checkpoint protocol for anentry consistent shared memory system. In: PODC, pp. 121–129(1994)

23. Nightingale, E.B., Chen, P.M., Flinn, J.: Speculative execution in adistributed file system. In: SOSP ’05: Proceedings of the twentiethACM symposium on Operating systems principles, pp. 191–205.ACM Press, New York (2005). doi:10.1145/1095810.1095829

24. Oplinger, J., et al.: Software and hardware for exploiting specula-tive parallelism with a multiprocessor. Tech. rep., Stanford, CA,USA (1997)

25. Prinz, A., Thalheim, B.: Operational semantics of transactions. In:CRPITS’17: Proceedings of the Fourteenth Australasian databaseconference on Database technologies 2003, pp. 169–179. Austra-lian Computer Society, Inc., Queensland (2003)

26. Qin, F., Tucek, J., Sundaresan, J., Zhou, Y.: Rx: treating bugs asallergies—a safe method to survive software failures. In: SOSP’05: Proceedings of the twentieth ACM symposium on Operatingsystems principles, pp. 235–248. ACM Press, New York (2005).doi:10.1145/1095810.1095833

27. Rajwar, R., Bernstein, P.A.: Atomic transactional execution inhardware: A new high-performance abstraction for databases. In:Position paper for the 10th International Workshop on High Per-formance Transaction Systems (2003)

28. Sistla, A.P., Welch, J.L.: Efficient distributed recovery using mes-sage logging. In: PODC, pp. 223–238 (1989)

29. Strom, R., Yemini, S.: Optimistic recovery in distributed systems.ACM Trans. Comput. Syst. 3(3), 204–226 (1985). doi:10.1145/3959.3962

30. Takahashi, T., Sumimoto, S., Hori, A., Harada, H., Ishikawa, Y.:Pm2: High performance communication middleware for heteroge-neous network environments. In: Proceedings of the IEEE/ACMSC2000 Conference (2000)

31. Tapus, C., Smith, J.D., Hickey, J.: Kernel level speculative DSM.In: IEEE International Symposium on Cluster Computing andthe Grid (CCGRID 2003), Tokyo, Japan (2003). http://www.cs.caltech.edu/~crt/publications/dsm2003.pdf. Workshop on Distri-buted Shared Memory (DSM)

32. Thain, D., Livny, M.: The ethernet approach to grid computing. In:HPDC ’03: Proceedings of the 12th IEEE International Symposiumon High Performance Distributed Computing (HPDC’03)

33. Wende, M., Schoettner, M., Goeckelmann, R., Bindhammer,T., Schulthess, P.: Optimistic synchronization and transactionalconsistency. In: CCGRID ’02: Proceedings of the 2nd IEEE/ACMInternational Symposium on Cluster Computing and the Grid,p. 331. IEEE Computer Society, Washington (2002)

34. Zhong, H., Nieh, J.: Crak: Linux checkpoint / restart as a ker-nel module. Tech. Rep. CUCS-014-01, Department of Com-puter Science, Columbia University (2002). http://www.ncl.cs.columbia.edu/research/migrate/crak.html

123