the parallel persistent memory model - arxiv · we consider a parallel computational model, the...

The Parallel Persistent Memory Model ∗

Guy E. BlellochCarnegie Mellon University

[email protected]

Phillip B. GibbonsCarnegie Mellon University

[email protected]

Yan GuCarnegie Mellon University

[email protected]

Charles McGuffeyCarnegie Mellon University

[email protected]

Julian ShunMIT CSAIL

[email protected]

Abstract

We consider a parallel computational model, the Parallel Persistent Memory model, comprised ofP processors, each with a fast local ephemeral memory of limited size, and sharing a large persistentmemory. The model allows for each processor to fault at any time (with bounded probability), andpossibly restart. When a processor faults, all of its state and local ephemeral memory is lost, but thepersistent memory remains. This model is motivated by upcoming non-volatile memories that are nearlyas fast as existing random access memory, are accessible at the granularity of cache lines, and have thecapability of surviving power outages. It is further motivated by the observation that in large parallelsystems, failure of processors and their caches is not unusual.

We present several results for the model, using an approach that breaks a computation into capsules,each of which can be safely run multiple times. For the single-processor version we describe how tosimulate any program in the RAM, the external memory model, or the ideal cache model with an expectedconstant factor overhead. For the multiprocessor version we describe how to efficiently implement awork-stealing scheduler within the model such that it handles both soft faults, with a processor restarting,and hard faults, with a processor permanently failing. For any multithreaded fork-join computation that israce free, write-after-read conflict free and has W work, D depth, and C maximum capsule work in theabsence of faults, the scheduler guarantees a time bound on the model ofO

(WPA

+ DPPA

⌈log1/(Cf)W

⌉)in

expectation, where P is the maximum number of processors, PA is the average number, and f ≤ 1/(2C)is the probability a processor faults between successive persistent memory accesses. Within the model,and using the proposed methods, we develop efficient algorithms for parallel prefix sums, merging, sorting,and matrix multiply.

1 Introduction

In this paper, we consider a parallel computational model, the Parallel Persistent Memory (Parallel-PM)model, that consists of P processors, each with a fast local ephemeral memory of limited size M , and sharinga large slower persistent memory. As in the external memory model [5, 4], each processor runs a standardinstruction set from its ephemeral memory and has instructions for transferring blocks of size B to and fromthe persistent memory. The cost of an algorithm is calculated based on the number of such transfers. A keydifference, however, is that the model allows for individual processors to fault at any time. If a processorfaults, all of its processor state and local ephemeral memory is lost, but the persistent memory remains. We

∗This paper is the full version of a paper at SPAA 2018 with the same name.

1

arX

iv:1

805.

0558

0v2

[cs

.DC

] 1

3 Ju

n 20

18

consider both the case where the processor restarts (soft faults) and the case where it never restarts (hardfaults).

The model is motivated by two complimentary trends. Firstly, it is motivated by upcoming non-volatilememories that are nearly as fast as existing random access memory (DRAM), are accessed via loads andstores at the granularity of cache lines, have large capacity (more bits per unit area than existing randomaccess memory), and have the capability of surviving power outages and other failures without losing data(the memory is non-volatile or persistent). For example, Intel’s 3D-Xpoint memory technology, currentlyavailable as an SSD, is scheduled to be available as such a random access memory in 2019. While suchmemories are expected to be the pervasive type of memory [52, 50, 56], each processor will still have a smallamount of cache and other fast memory implemented with traditional volatile memory technologies (SRAMor DRAM). Secondly, it is motivated by the fact that in current and upcoming large parallel systems theprobability that an individual processor faults is not negligible, requiring some form of fault tolerance [16].

In this paper, we first consider a single processor version of the model, the PM model, and give conditionsunder which programs are robust against faults. In particular, we identify that breaking a computationinto “capsules” that have no write-after-read conflicts (writing a location that was read earlier within thesame capsule) is sufficient, when combined with our approach to restarting faulting capsules from theirbeginning, due to its idempotent behavior. We then show that RAM algorithms, external memory algorithms,and cache-oblivious algorithms [31] can all be implemented asymptotically efficiently on the model. Thisinvolves a simulation that breaks the computations into capsules and buffers writes, which are handled inthe next capsule. However, the simulation is likely not practical. We therefore consider a programmingmethodology in which the algorithm designer can identify capsule boundaries, and ensure that the capsulesare free of write-after-read conflicts.

We then consider our multiprocessor counterpart, the Parallel-PM described above, and consider condi-tions under which programs are correct when the processors are interacting through the shared memory. Weidentify that if capsules are free of write-after-read conflicts and atomic, in a way that we define, then eachcapsule acts as if it ran once despite many possible restarts. Furthermore we identify that a compare-and-swap(CAS) instruction is not safe in the PM model, but that a compare-and-modify (CAM), which does not see itsresult, is safe.

The most significant result in the paper is a work-stealing scheduler that can be used on the Parallel-PM.Our scheduler is based on the scheduler of Arora, Blumofe, and Plaxton (ABP) [5]. The key challenges inadopting it to handle faults are (i) modifying it so that it only uses CAMs instead of CASs, (ii) ensuring thateach stolen task gets executed despite faults, (iii) properly handling hard faults, and (iv) ensuring its efficiencyin the presence of soft or hard faults. Without a CAS, and to avoid blocking, handling faults requires thatprocessors help the processor that is part way through a steal. Handling hard faults further requires beingable to steal a thread from a processor that was part way through executing the thread.

Based on the scheduler we show that any race-free, write-after-read conflict free multithreaded fork-joinprogram with work W , depth D, and maximum capsule work C will run in expected time:

O

(W

PA+D

(P

PA

)⌈log1/(Cf)W

⌉).

Here P is the maximum number of processors, PA the average number, and f ≤ 1/(2C) an upper bound onthe probability a processor faults between successive persistent memory accesses. This bound differs fromthe ABP result only in the log1/(Cf)W factor on the depth term, due to faults along the critical path.

Finally, we present Parallel-PM algorithms for prefix-sums, merging, sorting, and matrix multiply thatsatisfy the required conditions. The results for prefix-sums, merging, and sorting are work-optimal, matchinglower bounds for the external memory model. Importantly, these algorithms are only slight modifications

2

from known parallel I/O efficient algorithms [14]. The main change is ensuring that they write their partialresults to a separate location from where they read them so that they avoid write-after-read conflicts.

Related Work. Because of its importance to future computing, the computer systems community (includingcompanies such as Intel and HP) have been hard at work trying to solve the issues arising when fast nonvolatilememories (such as caches) sit between the processor and a large persistent memory [11, 28, 38, 39, 36, 18,45, 47, 37, 22, 55, 53, 19, 44, 51, 32, 33, 46, 20, 10, 30]. Standard caches are write-back, meaning that awrite to a memory location will make it only as far as the cache, until at some later point the updated cacheline gets flushed out to the persistent memory. Thus, when a processor crashes, some writes (those still in thecache) are lost while other writes are not. The above prior work includes schemes for encapsulating updatesto persistent memory in either transactions or lock-protected failure atomic sections and using various formsof (undo, redo, resume) logging to ensure correct recovery. The intermittent computing community works onthe related problem of small systems that will crash due to power loss [48, 25, 7, 24, 35, 54, 15, 49]. Luciaand Ransford [48] describe how faults and restarting lead to errors that will not occur in a faultless setting.Several of these works [48, 25, 24, 54, 49] break code into small chunks, referred to as tasks, and work toensure progress at that granularity. Avoiding write-after-read conflicts is often the key step towards ensuringthat tasks are idempotent. Because these works target intermittent computing systems, which are designed tobe small and energy efficient, they do not consider multithreaded programs, concurrency, or synchronization.In contrast to this flurry of recent systems research, there is relatively little work from the theory/algorithmscommunity aimed at this setting [27, 41, 40, 52]. David et al. [27] presents concurrent data structures (e.g.,for skip-lists) that avoid the overheads of logging. Izraelevitz et al. [41, 40] presents efficient techniquesfor ensuring that the data in persistent memory captures a consistent cut in the happens-before graph of theprogram’s execution, via the explicit use of instructions that flush cache lines to persistent memory (such asIntel’s CLFLUSH instruction [38]). Nawab et al. [52] defines periodically persistent data structures, whichcombine mechanisms for tracking proper write ordering with a periodic flush of all cache lines to persistentmemory. None of this work defines an algorithmic cost model, presents a work-stealing scheduler, or providesthe provable bounds in this paper.

There is a very large body of research on models and algorithms where processors and/or memory canfault, but to our knowledge, none of it (other than the works mentioned above) fits the setting we study withits two classes of memory (local volatile and shared nonvolatile). Papers focusing on memory faults (e.g.,[29, 1, 21] among a long list of such papers) consider models in which individual memory locations can fault.Papers focusing on processor faults (e.g., [6] among an even longer list of such papers) either do not considermemory faults or assume that all memory is volatile.

Write-back Caches. Note that while the PM models are defined using explicit external read and externalwrite instructions, they are also appropriate for modeling the (write-back) cache setting described above, asfollows. Explicit instructions, such as CLFLUSH, are used to ensure that an external write indeed writes tothe persistent memory. Writes that are intended to be solely in local memory, on the other hand, could end upbeing evicted from the cache and written back to persistent memory. However, for programs that are race-freeand well-formed, as defined in Section 3, our approach preserves its correctness properties.

2 The Persistent Memory Model

Single Processor. We assume a two-layer memory model with a small fast ephemeral memory of size M (inwords) and a large slower persistent memory of size Mp �M . The two memories are partitioned into blocksof B words. Instructions include standard RAM instructions that work on single words within the processorregisters (a processor has O(1) registers) and ephemeral memory, as well as two (external) memory transfer

3

instructions: an external read that transfers a block from persistent memory into ephemeral memory, andan external write that transfers a block from ephemeral memory to persistent memory. We assume that thewords contain Θ(logMp) bits. These assumptions are effectively the same as in the (M,B) external memorymodel [2].

We further assume that the processor can fault between any two instructions,1 and that after faulting, theprocessor restarts. On restart, the ephemeral memory and processor registers can be in an arbitrary state, butthe persistent memory is in the same state as immediately before the fault. To enable forward progress, weassume there is a fixed memory location in the persistent memory referred to as the restart pointer location,containing a restart pointer. On restart, the processor loads the restart pointer from the persistent memoryinto a register, which we refer to as the base register, and then loads the location pointed to by the restartpointer (the restart instruction) and jumps to that location, i.e., sets it as the program counter. The processorthen proceeds as normal. As it executes, the processor can update the restart pointer to be the current programcounter, at the cost of an external write, in order to limit how far the processor will fall back on a fault. Werefer to this model as the (single processor) (M,B) persistent memory (PM) model.

The basic model can be parameterized based on the cost of the various instructions. Throughout thispaper, and in the spirit of the external memory model [2] and the ideal cache model [31], we assume thatexternal reads and writes take unit cost and all other instructions have no cost.2 We further assume thatthe program is constant size and that either the program is loaded from persistent memory into ephemeralmemory at restart, or that there is a small cache for the program itself, which is also lost in the case of a fault.Thus, faulting and restarting (loading the base register and jumping to the restart instruction, and fetching thecode) takes a constant number of external memory transfers.

The processor’s computation can be viewed as partitioned into capsules: each capsule corresponds to amaximally contiguous sequence of instructions running on the processor while the restart pointer locationcontains the same restart pointer. The last instruction of every capsule is therefore a write of a new restartpointer. We refer to writing a new restart pointer as installing a capsule. We assume that the next instructionsafter this write, which are at the start of the next capsule, do exactly the same as a restart does—i.e., load therestart pointer into the base pointer, load the start instruction pointed to by base pointer, and jump to it. Thecapsule is active while its restart pointer is installed. Whenever the processor faults, it will restart using therestart pointer of the active capsule, i.e., the capsule will be restarted as it was the first time. We define thecapsule work to be the number of external reads and writes in the capsule, assuming no faults. Note that,akin to checkpointing, there is a tension between the desire for high work capsules that amortize the capsulestart/restart overheads and the desire for low work capsules that lessen the repeated work on restart.

In our analysis, we consider two ways to count the total cost. We say that the faultless work (or work), W ,is the number of external memory transfers assuming no faults. We say that the total work (or fault-tolerantwork), Wf , is the number of external transfers for an actual run including all transfers due to having to restart.Wf can only be defined with respect to an assumed fault model. In this paper, for analyzing costs, we assumethat the probability of faulting by a processor between any two consecutive non-zero cost instructions (i.e.,external reads or writes) is bounded by f ≤ 1/2, and that faults are independent events. We will specify f toensure that a maximum work capsule fails with at most constant probability.

We assume throughout the paper that instructions are deterministic, i.e., each instruction is a functionfrom the values of registers and memory locations that it reads to the registers and memory locations that itwrites.

1For simplicity, we assume that individual instructions are atomic.2The results in this paper can be readily extended to a setting (an Asymmetric PM model) where external writes are more costly

than external reads, as in prior work on algorithms for NVM [17, 9, 13, 12, 8, 42]; for simplicity, we study here the simpler PMmodel because such asymmetry is not the focus of this paper.

4

Persistent Memory

EphemeralMemory

CPU

EphemeralMemory

CPU

EphemeralMemory

CPU

EphemeralMemory

CPU

Figure 1: The Parallel Persistent Memory Model

Multiple Processors. The Parallel-PM consists of P processors each with its own fast local ephemeralmemory of size M , but sharing a single slower persistent memory of size Mp (see Figure 1). Each processorworks as in the single processor PM, and the processors run asynchronously. Any processor can fault betweentwo of its instructions, and each has its own restart pointer location in the persistent memory. When aprocessor faults, the processor restarts like it would in the single processor case. We refer to this as a softfault. We also allow for a hard fault, in which the processor faults and then never restarts—we say that sucha processor is dead. We assume that other processors can detect when a processor has hard faulted usinga liveness oracle isLive(procId). We allow for concurrent reads and writes to the shared persistentmemory, and assume that all instructions involving the persistent memory are sequentially consistent.

The Parallel-PM includes a compare-and-swap (CAS) instruction. The CAS takes a pointer to a locationof a word in persistent memory and two values in registers. If the first value equals the value at the location,it atomically swaps the value at the location and the value in the second register, and the CAS is successful.Otherwise, no swap occurs and the CAS is unsuccessful. Even though the persistent memory is organized inblocks, we assume that the CAS is on a single word within a block.

The (faultless) work W and the total work Wf are as defined in the sequential model but summed acrossall processors. The (faultless) time T (and the fault-tolerant or total time Tf ) is the maximum faultlesswork (total work, respectively) done by any one processor. Without faults, this is effectively the same as theparallel external memory model [4]. In analyzing correctness, we allow for arbitrary delays between any twosuccessive instructions by a processor. However, for our time bounds and our results on work stealing wemake similar assumptions as made in [5]. These are described in Section 6.

Multithreaded Computations. Our aim is to support multithreaded dynamic parallelism layered on top ofthe Parallel-PM. We consider the same form of multithreaded computations as considered by Arora, Blumofe,and Plaxton (ABP) [5]. In the model, a computation starts as a single thread. On each step, a thread can runan instruction, fork a new thread, or join with another thread. Such a computation can be viewed as a DAG,with an edge between instructions, a pair of out-edges at a fork, and a pair of in-edges at a join. As with ABP,we assume that each node in the DAG has out-degree at most two. In the multithreaded model, the (faultless)work W is the work summed across all threads in the absence of faults, and the total work Wf is the summedwork including faults. In addition, we define the (faultless) depth D (and the fault-tolerant or total depth Df )to be the maximum work (total work, respectively) along any path in the DAG. The goal of our work-stealingscheduler (Section 6) is to efficiently map computations in the multithreaded model into the Parallel-PM.

5

3 Robustness on a Single Processor

In this section, we discuss how to run programs on the single processor PM model so that they complete thecomputation properly.

Our goal is to structure the computation and its partitioning into capsules in a way that is sufficient toensure correctness regardless of faults. Specifically, our goal is that each capsule is a sequence of instructionsthat will look from an external view like it has been run exactly once after its completion, regardless of thenumber of times it was partially run due to faults and restarts. We say that a capsule is idempotent if, when itcompletes, regardless of how many times it faults and restarts, all modifications to the persistent memory areconsistent with running once from the initial state (i.e., the state of the persistent memory, the ephemeralmemory, and the registers at the start of the capsule).

There are various means to guarantee that a capsule is idempotent, and here we consider a natural one.We say that a capsule has a write-after-read conflict if the first transfer from a block in persistent memoryis a read (called an “exposed” read), and later there is a write to the same block. Avoiding such a conflictis important because if a location in the persistent memory is read and later written, then on restart thecapsule would see the new value instead of the old one. We say a capsule is well-formed if the first access toeach word in the registers or ephemeral memory is a write. Being well-formed means that a capsule willnot read the undefined values from registers and ephemeral memory after a fault. We say that a capsule iswrite-after-read conflict free if it is well-formed and had no write-after-read conflicts.

Theorem 3.1. With a single processor, all write-after-read conflict free capsules are idempotent.

Proof. On restarting, the capsule cannot read any persistent memory written by previous faults on the capsule,because we restart from the beginning of the capsule and the exposed read locations are disjoint from thewrite locations. Moreover, the capsule cannot read the state of the ephemeral memory because a write isrequired before a read (well-formedness). Therefore, the first time a capsule runs and every time a capsulerestarts it has the same visible state, and because the processor instructions are deterministic, will repeatexactly the same instructions with the same results.

An immediate question is whether a standard processing model such as the RAM can be simulatedefficiently on the PM model. The following theorem shows that the PM can simulate the RAM model withonly constant overheads.

Theorem 3.2. Any RAM computation taking t time can be simulated on the (O(1), B) PM model withf ≤ 1/c for some constant c ≥ 2, using O(t) expected total work, for any B (B = 1 is sufficient).

Proof. The simulation keeps all simulated memory in the persistent memory one word per block. It alsokeeps two copies of the registers in persistent memory, and the simulation swaps between the two. At theend of a capsule on one copy, it sets the restart pointer to a location just before the other copy. The codeat that location, run at the start of the next capsule, copies the other copy of the registers into the currentcopy, and then simulates one instruction given by the program counter, by reading from the other copy of theregisters, and writing to the current copy of registers (typically just a single register). The instruction mightinvolve a read or write to the simulated memory, and an update of the program counter either to the nextsimulated instruction, or if a jump, to some other instruction. Once the instruction is done, the other copy ofthe registers is installed. This repeats. The capsules are write-after-read conflict free because they only readfrom one set of registers and write to the other, and the simulated memory instructions do a single read orwrite. Every simulated step takes a constant number of reads and writes to the persistent memory. Since thecapsule work is constant, it can be bounded by some k. If kf ≤ 1/2 then the probability of a capsule faulting

6

is bounded by 1/2 and therefore the expected total work on any capsule is upper bounded by 2k. Settingc = 2k gives the stated bounds.

Although the RAM simulation is linear in the number of instructions, our goal is to create algorithmsthat require asymptotically fewer reads and writes to persistent memory. We therefore consider efficientlysimulating external memory algorithms in the model.

Theorem 3.3. Any (M,B) external memory computation with t external accesses can be simulated on the(O(M), B) PM model with f ≤ B/(cM) for some constant c ≥ 2, using O(t) expected total work.

Proof. The simulation consists of rounds each of which has a simulation capsule and a commit capsule.It maps the ephemeral memory of the source program to part of the ephemeral memory, and the externalmemory to the persistent memory. It keeps the registers in the ephemeral memory, and keeps space for twocopies of the simulated ephemeral memory and the registers in the persistent memory, which it swaps backand forth between.

The simulation capsule simulates some number of steps of the source program. It starts by reading in oneof the two copies of the ephemeral memory and registers. Then during the simulation all instructions areapplied within their corresponding memories, except for writes from the ephemeral memory to the persistentmemory. These writes, instead of being written immediately, are buffered in the ephemeral memory. Thismeans that all reads from the external memory have to first check the buffer. The simulation also maintains acount of the number of reads and writes to the external memory within a capsule. When this count reachesM/B, the simulation “closes” the capsule. The closing is done by writing out the simulated ephemeralmemory, the registers, and the write buffer to persistent memory. For ephemeral memory and registers, thisis the other copy from the one that is read. The capsule finishes by installing a commit capsule.

The commit capsule reads in the write buffer from the closed capsule to ephemeral memory, and appliesall the writes to their appropriate locations of the simulated external memory in the persistent memory. Whenthe commit capsule is done, it installs the next simulation capsule.

This simulation is write-after-read conflict free because the only writes during a simulation capsule are tothe copy of ephemeral memory, registers, and write buffer. The write buffer has no conflicts since it is notread, and the ephemeral memory and registers have no conflicts since they swap back and forth. There areno conflicts in the commit capsules because they read from write buffer and write to the simulated externalmemory. The simulation is therefore write-after-read conflict free.

To see the claimed time and space bounds, we note that the ephemeral memory need only be a constantfactor bigger than the simulated ephemeral memory because the write buffer can only contain M entries.Each round requires onlyO(M/B) reads and writes to the persistent memory because the simulating capsulesonly need the stored copy of the ephemeral memory, do at most M/B reads, and then do at most O(M/B)writes to the other stored copy. The commit capsule does at most M/B simulated writes, each requiringa read from and write to the persistent memory. Because each round simulates M/B reads and writes toexternal memory at the cost of O(M/B) reads and writes to persistent memory, the faultless work acrossall capsules is bounded by O(t). Because the probability that a capsule faults is bounded by the maximumcapsule work, O(M/B), when f ≤ B/(cM), there is a constant c such that the probability of a capsulefaulting is less than 1. Since the faults are independent, the expected total work is a constant factor greaterthan the faultless work, giving the stated bounds.

It is also possible to simulate the ideal cache model [31] in the PM model. The ideal cache model issimilar to the external memory model, but assumes that the fast memory is managed as a fully associativecache. It assumes a cache of size M is organized in blocks of size B and has an optimal replacement policy.

7

The ideal cache model makes it possible to design cache-oblivious algorithms [31]. Due to the followingresult, these algorithms are also efficient in the PM model.

Theorem 3.4. Any (M,B) ideal cache computation with t cache misses can be simulated on the (O(M), B)PM model with f ≤ B/(cM) for a constant c ≥ 2, using O(t) expected total work.

Proof. The simulation is similar to our external memory simulation, using rounds consisting of a simulationcapsule and a commit capsule. During each simulation capsule a simulated cache of size 2M/B blocksis maintained in the ephemeral memory. The capsule starts by loading the registers, and with an emptycache. During simulation, entries are never evicted, but instead the simulation stops when the cache runsout of space, i.e., after 2M/B distinct blocks are accessed. The capsule then writes out all dirty cache lines(together with the corresponding persistent memory address for each cache line) to a buffer in persistentmemory, saves the registers and installs the commit capsule. The commit capsule reads in the buffer, writesout all the dirty cache lines to their correct locations, and installs the next simulation capsule. The simulationis write-after-read conflict free.

We now consider the costs of the simulation. O(M) ephemeral memory is sufficient to simulate the cacheof size 2M . The total faultless work of a simulation capsule (run once) is bounded by O(M/B) becausewe only have 2M/B misses before ending, and then have to write out at most 2M/B dirty cache blocks.Accounting for faults, the total cost is still O(M/B)—given constant probability of faults. The size of thecache and cost are similarly bounded for the commit capsule. We now note that over the same simulatedinstructions, the ideal cache will suffer at least M/B cache misses. This is because the simulation roundaccesses 2M/B distinct locations, and only M/B of them could have been in the ideal-cache at the start ofthe round, in the best case. The other M/B must therefore suffer a miss. Therefore each simulation roundsimulates what were at least M/B misses in the ideal cache model with at most O(M/B) expected cost inthe PM model. As in the previous proofs, and since the capsule work is bounded by O(M/B), the probabilityof a capsule faulting can be bounded by 1/2 when f ≤ B/(cM), for some c. Hence the expected total workcan be bounded by twice the faultless work.

4 Programming for Robustness

This simulation of the external memory is not completely satisfactory because its overhead, although constant,could be significant. It can be more convenient and certainly more efficient to program directly for the model.Here we describe one protocol for this purpose. It can greatly reduce the overhead of using the PM model. Itis also useful in the context of the parallel model.

Our protocol is designed so capsules begin and end at the boundaries of certain function calls, which werefer to as persistent function calls. Non-persistent calls are ephemeral. We assume function calls can bemarked as persistent or ephemeral, by the user or possibly compiler. Once a persistent call is made, the calleewill never revert back further than the call itself, and after a return the caller will never revert back furtherthan the return. All persistent calls require a constant number of external reads and writes on the call andon the return. In an ephemeral function call a fault in a callee can roll back to before the call, and similarlya fault after a return can roll back to before the return. All ephemeral calls are handled completely in theephemeral memory and therefore by themselves do not require any external reads or writes. In addition to thepersistent function call we assume a commit command that forces a capsule boundary at that point. As witha persistent call, the commit requires a constant number of external reads and writes.

We assume that all user code between persistent boundaries is write-after-read conflict free, or otherwiseidempotent. This requires a style of programming in which results are copied instead of overwritten. Forsequential programs, this increases the space requirements of an algorithm by at most a factor of two.

8

Persistent counters can be implemented by placing a commit between reading the old value and writing thenew. In the algorithms that we describe in Section 7, this style is very natural.

4.1 Implementing Persistent Calls

The standard stack protocol for function calling is not write-after-read conflict free and therefore cannotbe used directly for persistent function calls. We therefore describe a simple mechanism based on closuresand continuation passing in functional programming [3]. The convention also serves to clearly delineate thecapsule boundaries, and will be useful in the discussion of the parallel model. We then discuss how this canbe implemented on a stack and can be used for loops.

We use a contiguous sequence of memory words, called a closure, to represent a capsule. The restartpointer for the capsule is the address of the first word of the closure. A closure consists of an instructionpointer in the first position (the start instruction), local state, arguments, and a pointer to another closure,which we refer to as the continuation. Typically a closure is constant size and points indirectly (via a pointer)to non-constant sized data, although this is not required. Once a closure is filled, it can be installed andstarted. Any faults while it is active will restart it. Since the base of the closure is loaded into the base pointerwhen starting, the instructions have access to the local state and arguments. A closure can be thought ofas a stack frame, but need not be allocated in a stack discipline. Indeed, as discussed later, allocating in astack discipline requires some extra care. The continuation can be thought of as a return pointer, except itdoes not point directly to an instruction, but rather another closure (perhaps the parent stack frame), with theinstruction to run stored in the first location.

A persistent function call consists of creating two closures, a continuation closure and a callee closure,and then installing and starting the callee closure. The continuation closure corresponds to what needs to berun when returning from the callee. It consists of a pointer to the first instruction to run on return, any localvariables that are needed on return, an empty slot for the return result of the callee, and the continuation of thecurrent closure. The callee closure consists of a pointer to the first instruction to run in the called function, anyarguments it needs, and a pointer to the continuation closure in its continuation. The return from a persistentcall consists of writing any results into the closure pointed to by the continuation (the continuation closure),and then installing and starting the continuation closure. Note that if the processor faults after installing thecontinuation closure, then a computation will only back up as far as the start of the continuation. Thereforepersistence occurs at the boundaries (in and out) of persistent function calls. Because a capsule correspondsto running a single installed closure, all capsules correspond to the code run between two persistent functionboundaries. We note that if a function call is in tail position (i.e., it cannot call another function), then thecontinuation closure is not necessary, and the continuation pointer of the current active capsule can be copieddirectly to the new callee closure before installing it (this is the standard tail call optimization).

Our calling mechanism is write-after-read conflict free. This is because we only ever write to a closurewhen it is being created, and read when it is being used in a future capsule. The one exception is writingresults into a closure, but in this case the callee does the writes, and the caller does the reads after the returnand in a different capsule. A loop can be made persistent by using tail recursive function calls. To avoidallocating a new closure for each, the implementation could use just two closures and swap back and forthbetween the two.

Closures can be implemented in a stack discipline by allocating both the callee and continuation closureson the top of the stack, and popping the callee closure when returning from the called function, and thecontinuation closure when returning from the current function. Standard stack-based conventions, however,will not be write-after-read conflict free because they are based on side-affecting the current stack, e.g. bychanging the value of local variables. Also the return address is typically stored in the child (callee) frame.Here it is important it is kept in the continuation closure so that the move to a new closure can be done

9

atomically by swinging the restart-pointer (changing the start instruction address and base pointer on thesame instruction). A commit command can be implemented in the compiler, or by hand, by putting the codeafter the commit into a separate function body, and making a tail call to it.

To implement closures we need memory allocation. This can be implemented in various ways in awrite-after-read conflict free manner. One way is for the memory for a capsule to be allocated starting at abase pointer that is stored in the closure. Memory is allocated one after the other, using a pointer kept inlocal memory (avoiding the need for a write-after-read conflict to persistent memory in order to update it).In this way, the allocations are the same addresses in memory each time the capsule restarts. At the end ofthe capsule, the final value of the pointer is stored in the closure for the next capsule. For the Parallel-PM,each processor allocates from its own pool of persistent memory, using this approach. In the case where aprocessor takes over for a hard-faulting processor, any allocations while the taking-over processor is executingon behalf of the faulted processor will be from the pool of the faulted processor.

5 Robustness on Multiple Processors

With multiple processors our previous definition of idempotent is inadequate since the other processors canread or write persistent memory locations while a capsule is running. For example, even though the finalvalues written by a capsule c might be idempotent, other processors can observe intermediate values while cis running and therefore act differently than if c was run just once. We therefore consider a stronger variant ofidempotency that in addition to requiring that its final effects on memory are if it ran once, requires that it actsas if it ran atomically. The requirement of atomicity is not necessary for correctness, but sufficient for whatwe need and allows a simple definition. We give an example of how it can be relaxed at the end of the section.

More formally we consider the history of a computation, which is an interleaving of the persistent memoryinstructions from each of the processors, and which abides by the sequential semantics of the memory. Thehistory includes the additional instructions due to faults (i.e., it is a complete trace of instructions that actuallyhappened). A capsule within a history is invoked at the instruction it is installed and responds at the instructionthat installs the next capsule on the processor. All instructions of a capsule, and possibly other instructionsfrom other processors, fall between the invocation and response.

We say that a capsule in a history is atomically idempotent if

1. (atomic) all its instructions can be moved in the history to be adjacent somewhere between its invocationand response without violating the memory semantics, and

2. (idempotent) the instructions are idempotent at the spot they are moved to—i.e., their effect on memoryis as if the capsule ran just once without fault.

As with a single processor, we now consider conditions under which capsules are ensured to be idempotent,in this case atomically. Akin to standard definitions of conflict, race, and race free, we say that two persistentmemory instructions on separate processors conflict if they are on the same block and one is a write. For acapsule within a history we say that one of its instructions has a race if it conflicts with another instructionthat is between the invocation and response of that capsule. A capsule in a history is race free if none of itsinstructions have a race.

Theorem 5.1. Any capsule that is write-after-read conflict free and race free in a history is atomicallyidempotent.

Proof. Because the capsule is race free we can move its instructions to be adjacent at any point between theinvocation and response without affecting the memory semantics. Once moved to that point, the idempotencefollows from Theorem 3.1 because the capsule is write-after-read conflict free.

10

This property is useful for user code if one can ensure that the capsules are race free via synchronization.We use this extensively in our algorithms. However the requirement of being race free is insufficient ingeneral because synchronizations themselves require races. In fact the only way to ensure race freedomthroughout a computation would be to have no processor ever write a location that another processor everreads or writes. We therefore consider some other conditions that are sufficient for atomic idempotence.

Racy Read Capsule. We first consider a racy read capsule, which reads one location from persistent memoryand writes its value to another location in persistent memory. The capsule can have other instructions, butnone of them can depend on the value that is read. A racy read capsule is atomically idempotent if allits instructions except for the read are race free. This is true because we can move all instructions of thecapsule, with possible repeats due to faults, to the position of the last read. The capsule will then properlyact like the read and write happened just once. Because races are allowed on the read location, there can bemultiple writes by other processors of different values to the read location, and different such values can beread anytime the racy read capsule is restarted. However, because the write location is race free, no otherprocessor can “witness” these possible writes of different values to the write location. Thus, the copy capsuleis atomically idempotent. A copy capsule is a useful primitive for copying from a volatile location that couldbe written at any point into a processor private location that will be stable once copied. Then when theprocessor private location is used in a future capsule, it will stay the same however many times the capsulefaults and restarts. We make significant use of this in the work-stealing scheduler.

Racy Write Capsule. We also consider a racy write capsule, for which the only instruction with a race is awrite instruction to persistent memory, and the instruction races only with either read instructions or otherwrite instructions, but not both kinds. Such a capsule can be shown to be atomically idempotent. In the formercase (races only with reads), then in any history, the value in the write location during the capsule transitionsfrom an old value to a new value exactly once no matter how many times the capsule is restarted. Thus, forthe purposes of showing atomicity, we can move all the instructions of the capsule to immediately before thefirst read that sees the new value, or to the end of the capsule if there is no such read. Although the first timethe new value is written (and read by other processors) may be part of a capsule execution that subsequentlyfaulted, the effect on memory is as if the capsule ran just once without fault (idempotency). In the latter case(races only with other writes), then if in the history the racy write capsule is the last writer before the end ofthe capsule, we can move all the instructions of the capsule to the end of the capsule, otherwise we can moveall the instructions to the beginning of the capsule, satisfying atomicity and idempotency.

Compare-and-Modify (CAM) Instruction. We now consider idempotency of the CAS instruction. Recallthat we assume that a CAS is part of the machine model. We cannot assume the CAS is race free becausethe whole purpose of the operation is to act atomically in the presence of a race. Unfortunately it seemshard to efficiently simulate a CAS at the user level when there are faults. The problem is that a CAS writestwo locations, the two that it swaps. In the standard non-faulty model one is local (a register) and thereforethe CAS involves a single shared memory modification and a local register update. Unfortunately in theParallel-PM model, the processor could fault immediately before or after the CAS instruction. On restart thelocal register is lost and therefore the information about whether it succeeded is lost. Looking at the sharedlocation does not help since identical CAS instructions from other processors might have been applied to thelocation, and the capsule cannot distinguish its success from their success.

Instead of using a CAS, here we show how to use a weaker instruction, a compare-and modify (CAM).A CAM is simply a CAS for which no subsequent instruction in the capsule reads the local result (i.e., theswapped value).3 Furthermore, we restrict the usage of a CAM. For a capsule within a history we say a writew (including a CAS or CAM) to persistent memory is non-reverting if no other conflicting write between

3Some CAS instructions in practice return a boolean to indicate success; in such cases, the boolean cannot be read either.

11

old new target 0xDEAD currjob

xxxx xxxx xxxx 0 xxxx

0

3

0xDEAD

0 → 3

0

3

0xDEAD

3

Processor 3

void claimOwnership(int jobId, int style) {

int old = defaults[style];int new = getProcNum();int* target = &jobOwners[jobId];CAM(target, old, new);currentJob = jobId;

}

void claimOwnership(int jobId, int style) {

int old = defaults[style];int new = getProcNum();int* target = &jobOwners[jobId];CAM(target, old, new);currentJob = jobId;

}

Figure 2: CAM Capsule Example. In CAM capsules, earlier faulting runs of the capsule may perform workthat is visible to the rest of the system.

w and the capsule’s response changes the value back to its value before w. We define a CAM capsule as acapsule that contains one non-reverting CAM and may contain other write-after-read conflict free and racefree instructions.

Theorem 5.2. A CAM capsule is atomically idempotent.

Proof. Assume that the CAM is non-reverting and all other instructions in the capsule are write-after-readconflict free and race free. Due to faults the CAM can repeat multiple times, but it can only succeed inchanging the target value at most once. This is because the CAM is non-reverting so once the target valueis changed, it could not be changed back. Therefore if the CAM ever succeeds, for the purpose of showingatomicity, in the history we move all the instructions of the capsule (including the instructions from faultyruns) to the point of the successful CAM. This does not affect the memory semantics because none of theother instructions have races, and any of the other CAMs were unsuccessful and therefore also have no affecton memory. At the point of the successful CAM the capsule acts like it ran once because it is write-after-readconflict free—other than the CAM, which succeeded just once. If the CAM never succeeds, the capsule isconflict free and race free because the CAM did not do any writes, so Theorem 5.1 applies.

The example CAM capsule in Figure 2 shows one of the interesting properties of idempotence: unliketransactions or checkpointing, earlier runs that faulted can make changes to the memory that are seen or usedby other processes. Similarly, these earlier runs can affect the results of the successful run, as long as theresult is equivalent to a non-faulty run.

A CAM can be used to implement a form of test-and-set in a constant number of instructions. In particular,we will assume a location can either be unset, or the value of a process identifier or other unique identifier. A

12

process can then use a CAM to conditionally swap such a location from unset to its unique identifier. Theprocess can then check if it “won” by seeing if its identifier is in the location. We make heavy use of this inthe work-stealing scheduler to atomically “steal” a job from another queue. It can also be used at the joinpoint of two threads in fork-join parallelism to determine who got there last (the one whose CAM from unsetwas unsuccessful) and hence needs to run the code after the join.

Racy Multiread Capsule. It is also possible to design capsules that are idempotent without the requirementof atomicity. By way of example, we discuss the racy multiread capsule. This capsule consists of multipleracy read capsules that have been combined together into a single capsule. Concurrent processes may writeto locations that the capsule is reading between reads, which violates atomicity. Despite this, a racy multireadcapsule is idempotent since the results of the final successful run of the capsule will overwrite any results ofpartial runs. We make use of the snapshot capsule in the work-stealing scheduler to reduce the number ofcapsules required. It is not needed for correctness.

6 Work Stealing

We show how to implement an efficient version of work stealing (WS) in the Parallel-PM model. Our resultsare based on the work-stealing scheduler of Arora, Blumofe, and Plaxton (ABP) [5] and therefore work ina multiprogrammed environment where the number of active processors can change. As in their work, werequire some assumptions about the machine, which we summarize here.

The schedule is a two-level scheduler in which the work-stealing scheduler, under our control, mapsthreads to processes, and an adversarial operating system scheduler maps processes to processors. TheOS scheduler can change the number of allocated processors and which processes are scheduled on thoseprocessors during the computation, perhaps giving processors to other users. The number of processes andthe maximum number of processors used is given by P . The average number that are allocated to the user isPA.

The quanta for scheduling is at least the time for two scheduling steps where each step takes a smallconstant number of instructions. In our case we cannot guarantee that the quanta is big enough to capturetwo steps since the processor could fault. However it is sufficient to show that with constant probability twoscheduling steps complete within the quanta, which we can show.

The available instruction set contains a yield-to-all instruction. This instruction tells the OS that it mustschedule all other processes that have not hard faulted before (or at the same time) as the process that executesthe instruction. It is used to ensure that processors that are doing useful work have preference over ones whorun out of work and need to steal.

Our schedule differs from the ABP scheduler in some crucial ways since our model allowing processorsto fault. First, our scheduler cannot use a CAS, for reasons described in Section 5, and instead must usea CAM. ABP uses a CAS and we see no direct translation to using a CAM. Second, our scheduler has tohandle soft faults anywhere in either the scheduler or the user program. This requires some care to maintainidempotence. Third, our scheduler has to handle hard faults. In particular it has to be able to steal from aprocessor that hard faults while it is running a thread. It cannot restart the thread from scratch, but needs tostart from the previous capsule boundary (a thread can consist of multiple capsules).

Our scheduler is also similar to the ABP scheduler in some crucial ways. In particular it uses a work-stealing double ended work queue and takes a constant number of instructions for the popTop, popBottom,and pushBottom functions. This is important in proving the performance bounds and allows us to leveragemuch of their analysis. An important difference in the performance analysis is that faults can increase boththe total work and the total depth. Because faults can happen anywhere this holds for the user work and

13

for the scheduler. The expected work is only increased by a constant factor, which is not a serious issue.However, for total depth, expectations cannot be carried through the maximum implied by parallel execution.We therefore need to consider high probability bounds.

6.1 The Scheduler Interface

For handling faults, and in particular hard faults, the interaction of the scheduler and threads is slightlydifferent from that of ABP. We assume that when a thread finishes it jumps to the scheduler.4 When athread forks another thread, it calls a fork function, which pushes the new thread on the bottom of the workqueue and returns to the calling thread. When the scheduler starts a thread it jumps to it (actually a capsulerepresenting the code to run for the thread). Recall that when the thread is done it jumps back to the scheduler.These are the only interactions of threads and the scheduler—i.e. jumping to a thread from the scheduler,forking a new thread within a thread, and jumping back to the scheduler from a thread on completion. All ofthese occur at capsule boundaries, but a thread itself can consist of many capsules. We assume that at a join(synchronization) point of threads whichever one arrives last continues the code after the join and thereforethat thread need not interact with the scheduler. The other threads that arrive at the join earlier finish andjump to the scheduler. In our setup, therefore, a thread is never blocked, assuming the fork function isnon-blocking.

6.2 WS-Deque

A work-stealing deque (WS-deque) is a concurrent deque supporting a limited interface. Here we used asimilar interface to ABP. In particular the interface supports popTop, pushBottom, and popBottom. Anynumber of concurrent processors can execute popTop, but only one process can execute either pushBottom orpopBottom. The idea is only the process owning the deque will work on the bottom. The deque is linearizableexcept that popTop can return empty even if the deque is not-empty. However this can only happen if anotherconcurrent popTop succeeds with a linearization point when the popTop is live, i.e., from invocation toresponse.

We provide an implementation of a idempotent WS-deque in Figure 3. Our implementation maintains anarray of tagged entries that refer to threads that the processor has either enabled or stolen while working onthe computation. The tag is simply a counter that is used to avoid the ABA problem [34]. An entry consistsof one of the following states:• empty: An empty entry is one that has not been associated with a thread yet. Newly created elements

in the array are initialized to empty.• local: A local entry refers to a thread that is currently being run by the processor that owns this

WS-Deque. We need to track local entries to deal with processors that have a hard fault (i.e., neverrestart).• job: A job entry is equivalent to the values found in the original implementation of the WS-Deque. It

contains a thread (i.e., a capsule to jump to start the thread).• taken: A taken entry refers to a thread that has already been or is in the process of being stolen. It

contains a pointer to the entry that the thief is using to hold the stolen thread, and the tag of that entryat the time of the steal.

4Note that jumping to a thread is the same as installing a capsule.

14

1 P = number of procs2 S = stack size

4 struct procState {5 union entry = empty6 | local7 | job of continuation8 | taken of 〈entry*,int〉

10 〈int,entry〉 stack[S];11 int top;12 int bot;13 int ownerID;

15 inline int getStep(i) { return stack[i].first; }

17 inline void clearBottom() {18 stack[bot] = 〈getStep(bot)+1, empty〉; }

20 void helpPopTop() {21 int t = top;22 switch(stack[t]) {23 case 〈_, taken(ps,i)〉:24 // Set thief state.25 CAM(ps, 〈i,empty〉, 〈i+1,local〉);26 CAM(&top, t, t+1); // Increment top.27 } }

29 // Steal from current process, if possible.30 // If a steal happens, location e is set to "local"31 // & a job is returned. Otherwise NULL is returned.32 continuation popTop(entry* e, int c) {33 helpPopTop();34 int i = top;35 〈int, entry〉 old = stack[i];36 commit;37 switch(old) {38 // No jobs to steal and no ongoing local work.39 case 〈j, empty〉: return NULL;40 // Someone else stole in meantime. Help it.41 case 〈j, taken(_)〉:42 helpPopTop(); return NULL;43 // Job available, try to steal it with a CAM.44 case 〈j, job(f)〉:45 〈int, entry〉 new = 〈j+1, taken(e,c)〉;46 CAM(&stack[i], old, new);47 helpPopTop();48 if (stack[i] != new) return NULL;49 return f;

15

50 // No jobs to steal, but there is local work.51 case 〈j, local〉:52 // Try to steal local work if process is dead.53 if (!isLive(ownerID) && stack[i] == old) {54 commit;55 〈int, entry〉 new = 〈j+1,taken(e,c)〉;56 stack[i+1] = 〈getStep(i+1)+1, empty〉;57 CAM(&stack[i], old, new);58 helpPopTop();59 if (stack[i] != new) return NULL;60 return getActiveCapsule(ownerID);61 }62 // Otherwise, return NULL.63 return NULL;64 } }

66 void pushBottom(continuation f) {67 int b = bot;68 int t1 = getStep(b+1);69 int t2 = getStep(b);70 commit;71 if (stack[b] == 〈t2, local〉) {72 stack[b+1] = 〈t1+1, local〉;73 bot = b + 1;74 CAM(&stack[b], 〈t2, local〉, 〈t2+1, job(f)〉75 } else if (stack[b+1].second == empty) {76 states[getProcNum()].pushBottom(f);77 }78 return;79 }

81 continuation popBottom() {82 int b = bot;83 〈int, entry〉 old = stack[b-1];84 commit;85 if (old == 〈j, job(f)〉) {86 CAM(&stack[b-1], old, 〈j+1,local〉);87 if (stack[b-1] == 〈j+1, local〉) {88 bot = b-1;89 return f;90 } }91 // If we fail to grab a job, return NULL.92 return NULL;93 }

95 ˆ findWork() {96 // Try to take from local stack first.97 continuation f = popBottom();98 if (f) GOTO(f);99 // If nothing locally, randomly steal.

100 while (true) {101 yield();102 int victim = rand(P);103 int i = getStep(bot);104 continuation g105 = states[victim].popTop(&stack[bot],i);106 if (g) GOTO(g);107 } } }

16

108 procState states[P]; // Stack for each process.

110 // User call to fork.111 void fork(continuation f) {112 // Pushes job onto the correct stack.113 states[getProcNum()].pushBottom(f);114 }

116 // Return to scheduler when any job finishes.117 ˆ scheduler() {118 // Mark the completion of local thread.119 states[getProcNum()].clearBottom();120 // Find work on the correct stack.121 GOTO(states[getProcNum()].findWork());122 }

Figure 3: Fault-tolerant WS-Deque Implementation. Jumps are marked as GOTO and functions that arejumped to and do not return (technically continuations) are marked with a ˆ. All CAM instructions occur inseparate capsules, similar to function calls.

The transition table for the entry states is shown in Figure 4.

New StateEmpty Local Job Taken

Old State

Empty - XLocal X - X XJob X - X

Taken -

Figure 4: Entry state transition diagram

In addition to this array of entries, we maintain pointers to the top and the bottom of the deque, which is acontiguous region of the array. As new threads are forked by the owner process, new entries will be added tothe bottom of the deque using the pushBottom function. The bottom pointer will be updated to these newentries. The top pointer will move down on the deque as threads are stolen. This implementation does notdelete elements at the top of the deque, even after steals. This means that we do not need to worry aboutentries being deleted in the process of a steal attempt, but does mean that maintaining P WS-Deques for acomputation with span T∞ requires O(PT∞) storage space.

Our implementation of the WS-Deque maintains a consistent structure that is useful for proving itscorrectness and efficiency. The elements of our WS-Deque are always ordered from the beginning to the endof the array as follows:

1. A non-negative number of taken entries. These entries refer to threads that have been stolen, or possiblyin the case of the last taken entry, to a thread that is in the process of being stolen.

2. A non-negative number of job entries. These entries refer to threads that the process has enabled that havenot been stolen or started since their enabling.

3. Zero, one, or two local entries. If a process has one local entry, it is the entry that the process is currentlyworking on. Processes can momentarily have two local entries during the pushBottom function, before the

17

earlier one is changed to a job. If a process has zero local entries, that means the process has completed theexecution of its local work and is in the process of acquiring more work through popBottom or stealing, orit is dead.

4. A non-negative number of empty entries. These entries are available to store new threads as they areforked during the computation.

We can also relate the top and bottom pointers of the WS-Deque (i.e. the range of the deque) to this arraystructure. The top pointer will point to the last taken entry in the array if a steal is in process. Otherwise, itwill point to the first entry after the taken entries. At the end of a capsule, the bottom pointer will point to thelocal entry if it exists, or the first empty entry after the jobs otherwise. The bottom pointer can also point tothe last job in the array or the earlier local entry during a call to pushBottom.

6.3 Algorithm Overview and Rationale

We now give an overview and rationale of correctness of our work-stealing scheduler under the Parallel-PM.Each process is initialized with an empty WS-Deque containing enough empty entries to complete

the computation. The top and bottom pointers of each WS-Deque are set to the first entry. One process isassigned the root thread. This process installs the first capsule of this thread, and sets its first entry to local.All other processes install the findWork capsule.

Once computation begins, the adversary chooses processes to schedule according to the rules of the yieldinstruction described in ABP, with the additional restriction that dead processes cannot be scheduled. When aprocess is scheduled, it continues running its code. This code may be scheduler code or user code.

If the process is running user code, this continues until the code calls fork or terminates. Calls to forkresult in the newly enabled thread being pushed onto the bottom of the process’ WS-Deque. When the usercode terminates, the process returns to the scheduler function.

The scheduler code works to find new threads for the process to work on. It begins by calling thepopBottom function to try and find a thread on the owner’s WS-Deque. If popBottom finds a thread, theprocess works on that thread as described above. Otherwise, the process begins to make steal attempts usingthe popTop function on random victim stacks. In a faultless setting, our work-stealing scheduler fuctionslike that of ABP. We use the additional information stored in the WS-Deques and the configuration of capsuleboundaries to provide fault tolerance.

We provide correctness in a setting with soft faults using idempotent capsules. Each capsule in thescheduler is an instance of one of the capsules discussed in Section 5. This means that processes can faultand restart without affecting the correctness of the scheduler.

Providing correctness in a setting with hard faults is more challenging. This requires the scheduler toensure that work being done by processes that hard fault is picked up in the same capsule that the fault ocurredduring by exactly one other process. We handle this by allowing thieves to steal local entries from deadprocesses. A process can check whether another process is dead using a liveness oracle isLive(procId).

The liveness oracle might be constructed by implementing a counter and a flag for each process. Eachprocess updates its counter after a constant number of steps (this does not have to be synchronized). If thetime since a counter has last updated passes some threshold, the process is considered dead and its flag is set.If the process restarts, it can notice that it was marked as dead, clear its flag, and enter the system with a newempty WS-Deque. Constructing such an oracle does not require a global clock or tight synchronization.

By handling these high level challenges, along with some of the more subtle challenges that occur whentrying to provide exactly-once semantics in the face of both soft and hard faults, we reach the following result.

18

Theorem 6.1. The implementation of work stealing provided in Figure 3 correctly schedules work accordingto the specification in Section 6.

The proof, appearing in Appendix A, deals with the many possible code interleavings that arise whenconsidering combinations of faulting and concurrency. We discuss our methods for ensuring that work isneither duplicated during capsule retries after soft faults or dropped due to hard faults. In particular, wespend considerable time ensuring that recovery from hard faults during interaction with the bottom of theWS-Deque happens correctly.

6.4 Time Bounds

We now analyze bounds on runtime based on the work-stealing scheduler under the assumptions mentionedat the start of the section (scheduled in fixed quanta, and supporting a yield-to-all instruction).

As with ABP, we consider the total amount of work done by a computation, and the depth of thecomputation, also called the critical path length. In our case we have W , the work assuming no faults, andWf , the work including faults. In algorithm analysis the user analyzes the first, but in determining the runtimewe care about the second. Similarly we have both D, a depth assuming no faults, and Df , a depth with faults.

For the time bounds we can leverage the proof of ABP. In particular as in their algorithm our popTop,popBottom, and pushBottom functions all take O(1) work without faults. With our deque, operations takeexpected O(1) work. Also as with their version, our popTop is unsuccessful (returns Null when there iswork) only if another popTop is successful during the attempt. The one place where their proof breaks downin our setup is the assumption that a constant sized quanta can always capture two steal attempts. Becauseour processors can fault multiple times, we cannot guarantee this. However in their proof this is needed toshow that for every P steal attempts, with probability at least 1/4, at least 1/4 of the non-empty dequesare successfully stolen from ([5], Lemma 8). In our case a constant fraction (1 − O(1) · f)2 of adjacentpairs of steal attempts will not fault at all and therefore count as a steal attempt. For analysis we can assumethat if either steals in a pair faults, then the steal is unsuccessful. This gives a similar result, only with adifferent constant, i.e., with probability at least 1/4, at least (1 − O(1) · f)2/4 of the non-empty dequesare successfully stolen from. We note that hard faults affect the average number of active processors PA.However they otherwise have no asymptotic affect in our bounds because a hard fault in our scheduler iseffectively the same as forking a thread onto the bottom of a work-queue and then finishing.

ABP show that their work-stealing scheduler runs in expected time O(W/PA +DP/PA). To apply theirresults we need to plug in Wf for W because that is the actual work done, and Df for D because that isactual depth. While bounding Wf to be within a constant factor of W is straightforward, bounding Df istrickier because we cannot sum expectations to get the depth bound (the depth is a maximum over pathslengths). Instead we show that with some high probability no capsule faults more than some number of timesl. We then simply multiply the depth by l. By making the probability sufficiently high, we can pessimisticallyassume that in the unlikely even that any capsule faults more than l times then, the depth is as large as thework. This idea leads to the following theorem.

Theorem 6.2. Consider any multithreaded computation with W work, D depth, and C maximum capsulework (all assuming no faults) for which all capsules are atomically idempotent. On the Parallel-PM withP processors, PA average number of active processors, and fault probability bounded by f ≤ 1/(2C), theexpected total time Tf for the computation is

O

(W

PA+D

(P

PA

)⌈log1/(Cf)W

⌉).

19

Proof. We must account for faults in both the computation and the work-stealing scheduler. The work-stealing scheduler has O(1) maximum capsule work, which we assume is at most C. Because we assume allfaults are independent, the probability that a capsule will run l or more times is upper bounded by (Cf)l.Therefore if there are κ capsules in the computation including the capsules executed as part of the scheduler,the probability that any one runs more than l times is upper bounded by κ(Cf)l (by the union bound). Ifwe want to bound this probability by some ε, we have κ(Cf)l ≤ ε. Solving for l and using κ ≤ 2W givesl ≤ dlog1/(Cf)(2W/ε)e. This means that with probability at most ε, Df ≤ D log1/(Cf)(2W/ε). If we setε = 2/W then Df ≤ 2D log1/(Cf)W . Now we assume that if any capsule faults l times or more that thedepth of the computation equals the work. This gives (P/PA)(2/W )W + (1− 2/W )2Ddlog1/(Cf)W e) asthe expected value of the second term of the ABP bound, which is bounded by O((P/PA)Ddlog1/(Cf)W e).Because the expected total work for the first term is Wf ≤ (1/(1 − Cf))W , and given Cf ≤ 1/2, thetheorem follows.

This time bound differs from the ABP bound only in the extra log1/(Cf)W factor. If we assume PA is aconstant fraction of P then the expected time simplifies to O(W/P +Ddlog1/(Cf)W e).

7 Fault-Tolerant Algorithms

In this section, we outline how to implement several algorithms for the Parallel-PM model. The algorithmsare all based on binary fork-join parallelism (i.e., nested parallelism), and hence fit within the multithreadedmodel. We state all results in terms of faultless work and depth. The results can be used with Theorem 6.2to derive bounds on time for the Parallel-PM. Recall that in the Parallel-PM model, external reads andwrites are unit cost, and all other instructions have no cost (accounting for other instructions would not behard). The algorithms that we use are already race-free. Making them write-after-read conflict free simplyinvolves ensuring that reads and writes are to different locations. All capsules of the algorithms are thereforeatomically idempodent. The base case for each of our variants of the algorithms is done sequentially withinthe ephemeral memory.

Prefix Sum. Given n elements {a1, · · · , an} and an associative operator “+”, the prefix sum algorithmcomputes a list of prefix sums {p1, · · · , pn} such that pi =

∑ij=1 aj . Prefix sum is one of the most

commonly-used building blocks in parallel algorithm design [43].We note that the standard prefix sum algorithm [43] works well in our setting. The algorithm consists

of two phases—the up-sweep phase and the down-sweep phase, both based on divide-and-conquer. Theup-sweep phase bisects the list, computes the sum of each sublist recursively, adds the two partial sums asthe sum of the overall list, and stores the sum in the persistent memory. After the up-sweep phase finishes,we run the down-sweep phase with the same bisection of the list and recursion. Each recursive call in thisphase has a temporary parameter t, which is initiated as 0 for the initial call. Then within each function, wepass t to the left recursive call and t+ LeftSum for the right recursive call, where LeftSum is the sum of theleft sublist computed from the up-sweep phase. In both sweeps the recursion stops when the sublist has nomore than B elements, and we sequentially process it using O(1) memory transfers. For the base case inthe down-sweep phase, we set the first element pi to be t+ ai, and then sequentially compute the rest of theprefix sums for this block. The correctness of pi follows from how t is computed along the path to ai.

This algorithm fits the Parallel-PM model in a straightforward manner. We can place the body of eachfunction call (without the recursive calls) in an individual capsule. In the up-sweep phase, a capsule readsfrom two memory locations and stores the sum back to another location. In the down-sweep phase, it readsfrom at most one memory location, updates t, and passes t to the recursive calls. Defining capsules in thisway provides write-after-read conflict-freedom and limits the maximum capsule work to a constant.

20

Theorem 7.1. The prefix sum of an array of size n can be computed in O(n/B) work, O(log n) depth, andO(1) maximum capsule work, using only atomically-idempotent capsules.

Merging. A merging algorithm takes the input of two sorted arrays A and B of size lA and lB (lA + lB = n),and returns a sorted array containing the elements in both input lists. We use an algorithm on the Parallel-PMmodel based on the classic divide-and-conquer algorithm [14].

The first step of the algorithm is to allocate the output array of size n. Then the algorithm conducts dualbinary searches of the arrays in parallel to find the elements ranked {n2/3, 2n2/3, 3n2/3, . . . , (n1/3− 1)n2/3}among the set of keys from both arrays, and recurses on each pair of subarrays until the base case whenthere are no more than B elements left (and we switch to a sequential version). We put each of the binarysearches into a capsule, as well as each base case. These capsules are write-after-read conflict free because theoutput of each capsule is written to a different subarray. Based on the analysis in [14] we have the followingtheorem.

Theorem 7.2. Merging two sorted arrays of overall size n can be done in O(n/B) work, O(log n) depth,and O(log n) maximum capsule work, using only atomically-idempotent capsules.

Sorting. Using the merging algorithm in Section 7, we can implement a fault-tolerant mergesort withO((n/B) log(n/M)) work and maximum capsule work O(log n). However, this is not optimal. We nowoutline a samplesort algorithm with improved work O(n/B · logM n), based on the algorithm in [14].

The sorting algorithm first splits the set of elements into√n subarrays of size

√n and recursively sorts

each of the subarrays. The recursion terminates when the subarray size is less than M , and the algorithmthen sequentially sorts within a single capsule. Then the algorithm samples every log n’th element from eachsubarray. These samples are sorted using mergesort, and

√n pivots are picked from the result using a fixed

stride. The next step is to merge each√n-size subarray with the sorted pivots to determine bucket boundaries

within each subarray. Once the subarrays have been split, prefix sums and matrix transposes are used todetermine the location in the buckets where each segment of the subarray is to be sent. After that, the keysneed to be moved to the buckets, using a bucket transpose algorithm. We can use our prefix sum algorithmand the divide-and-conquer bucket transpose algorithm from [14], where the base case is a matrix of size lessthan M , and in the base case the transpose is done sequentially within a single capsule (note that this assumesM > B2 to be efficient). The last step is to recursively sort the elements within each bucket. All steps can bemade write-after-read conflict free by writing to locations separate than those being read. By applying theanalysis in [14] with the change that the base cases (for the recursive sort and the transpose) are when the sizefits in the ephemeral memory, and that the base case is done sequentially, we obtain the following theorem.

Theorem 7.3. Sorting n elements can be done in O(n/B · logM n) work, O((M/B + log n) logM n) depth,and O(M/B) maximum capsule work, using only atomically-idempotent capsules.

It is possible that the log n term in the depth could be reduced using a sort by Cole and Ramachandran [23].

Matrix Multiplication. Consider multiplying two square matricesA andB of size n×n (assuming n2 > M )with the standard recursive matrix multiplication [26] based on the 8-way divide-and conquer approach.(

A11 A12

A21 A22

)×(B11 B12

B21 B22

)=

(A11B11 +A12B21 A11B12 +A12B22

A21B11 +A22B21 A21B12 +A22B22

)

21

Note that every pair of submatrix multiplications shares the same output location. This leads to write-after-readconflicts since a straightforward implementation will read the value from the output cell, add the computedvalue, and finally write the sum back. Therefore, the algorithm allocates two copies of temporary space forthe output in each recursive subtask, which allows applying computation for the matrix multiplication intwo subtasks on different output spaces (with no conflicts), and eventually adding computed values from thetemporary space back to the original output space.

If we stack-allocate the memory for each processor, a straightforward upper bound for the total extrastorage is O(pn2) on p processors using the standard space bound under work-stealing. A more carefulanalysis can tighten the bound to O(p1/3n2). This should be significantly better than the worst-case bound ofΘ(n3/(B

√M) when plugging in real-world parameters. This extra storage can be further limited to O(n2)

by slightly modifying the orders of the recursive calls, assuming the main memory size is larger than theoverall size of all private caches.

When this algorithm is scheduled by a randomized work-stealing scheduler, the whole computation israce-free. All multiplications that run at the same time have different output locations. The additions areindependent of each other, and applied after the associated multiplications. Therefore all operations arerace-free.

However, if we put each arithmetic operation in a separate capsule, the whole algorithm incurs O(n3)memory accesses, which is inefficient. Hence, we mark a capsule anytime the recursion reaches a subtaskthat can entirely fit into the ephemeral memory. This happens when the matrix size is smaller than c′

√M for

a constant c′ < 1. We then continue to run the algorithm sequentially within these capsules. For the matrixadditions, we similarly mark a capsule boundaries such that each capsule can fit into the ephemeral memory.This does not affect the overall work. We obtain the following theorem.

Theorem 7.4. Multiplying two square matrices of size n can be done in O(n3/B√M) work, O(M3/2 +

log2 n) depth, and O(M3/2) maximum capsule work, using only atomically-idempotent capsules.

We note that we can extend this result to non-square matrices using a similar approach to [31].

8 Conclusion

In this paper, we describe the Parallel Persistent Memory model, which characterizes faults as loss of data inindividual processors and their associated volatile memory. For this paper, we consider an external memorymodel view of algorithm cost, but the model could easily be adapted to support other traditional cost models.We also provide a general strategy for designing programs based on capsules that perform properly whenfaults occur. We specify a condition of being atomically idempotent that is sufficient for correctness, andprovide examples of atomic idempotent capsules that can be used to generate more complex programs.We use these capsules to build a work-stealing scheduler that can run programs in a parallel system whiletolerating both hard and soft faults with only a modest increase in the total cost of the computation. We alsoprovide several algorithms designed to support fault tolerance using our capsule methodology. We believethat the techniques in this paper can provide a practical way to provide the desirable quality of fault tolerancewithout requiring significant changes to hardware or software.

Acknowledgements

This work was supported in part by NSF grants CCF-1408940, CCF-1533858, and CCF-1629444.

22

References

[1] Y. Afek, D. S. Greenberg, M. Merritt, and G. Taubenfeld. Computing with faulty shared memory. InPODC, 1992.

[2] A. Aggarwal and J. S. Vitter. The Input/Output complexity of sorting and related problems. Communi-cations of the ACM, 31(9), 1988.

[3] A. W. Appel and T. Jim. Continuation-passing, closure-passing style. In POPL, 1989.

[4] L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In SPAA, 2008.

[5] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors.Theory of Computing Systems, 34(2), Apr 2001.

[6] Y. Aumann and M. Ben-Or. Asymptotically optimal PRAM emulation on faulty hypercubes. In FOCS,1991.

[7] D. Balsamo, A. S. Weddell, G. V. Merrett, B. M. Al-Hashimi, D. Brunelli, and L. Benini. Hibernus:Sustaining computation during intermittent supply for energy-harvesting systems. IEEE EmbeddedSystems Letters, 7(1), 2015.

[8] N. Ben-David, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, C. McGuffey, and J. Shun. Parallelalgorithms for asymmetric read-write costs. In SPAA, 2016.

[9] N. Ben-David, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, C. McGuffey, and J. Shun. Implicitdecomposition for write-efficient connectivity algorithms. In IPDPS, 2018.

[10] R. Berryhill, W. Golab, and M. Tripunitara. Robust shared objects for non-volatile main memory. InConf. on Principles of Distributed Systems (OPODIS), volume 46, 2016.

[11] K. Bhandari, D. R. Chakrabarti, and H.-J. Boehm. Makalu: Fast recoverable allocation of non-volatilememory. In OOPSLA, 2016.

[12] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, and J. Shun. Sorting with asymmetric read andwrite costs. In SPAA, 2015.

[13] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, and J. Shun. Efficient algorithms with asymmetricread and write costs. In ESA, 2016.

[14] G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low depth cache-oblivious algorithms. In SPAA,2010.

[15] M. Buettner, B. Greenstein, and D. Wetherall. Dewdrop: an energy-aware runtime for computationalRFID. In NSDI, 2011.

[16] F. Cappello, G. Al, W. Gropp, S. Kale, B. Kramer, and M. Snir. Toward exascale resilience: 2014update. Supercomput. Front. Innov.: Int. J., 1(1), Apr. 2014.

[17] E. Carson, J. Demmel, L. Grigori, N. Knight, P. Koanantakool, O. Schwartz, and H. V. Simhadri.Write-avoiding algorithms. In IPDPS, 2016.

23

[18] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari. Atlas: Leveraging locks for non-volatile memoryconsistency. In OOPSLA, 2014.

[19] H. Chauhan, I. Calciu, V. Chidambaram, E. Schkufza, O. Mutlu, and P. Subrahmanyam. NVMove:Helping programmers move to byte-based persistence. In INFLOW, 2016.

[20] S. Chen and Q. Jin. Persistent b+-trees in non-volatile main memory. Proceedings of the VLDBEndowment, 8(7), 2015.

[21] B. S. Chlebus, A. Gambin, and P. Indyk. PRAM computations resilient to memory faults. In ESA, 1994.

[22] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. NV-Heaps:Making persistent objects fast and safe with next-generation, non-volatile memories. In ASPLOS, 2011.

[23] R. Cole and V. Ramachandran. Resource oblivious sorting on multicores. ACM Transactions on ParallelComputing (TOPC), 3(4), 2017.

[24] A. Colin and B. Lucia. Chain: tasks and channels for reliable intermittent programs. OOPSLA, 2016.

[25] A. Colin and B. Lucia. Termination checking and task decomposition for task-based intermittentprograms. In International Conference on Compiler Construction, 2018.

[26] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd edition).MIT Press, 2009.

[27] T. David, A. Dragojevic, R. Guerraoui, and I. Zablotchi. Log-free concurrent data structures. EPFLTechnical Report, 2017.

[28] M. A. de Kruijf, K. Sankaralingam, and S. Jha. Static analysis and compiler design for idempotentprocessing. In PLDI, 2012.

[29] I. Finocchi and G. F. Italiano. Sorting and searching in the presence of memory faults (withoutredundancy). In STOC, 2004.

[30] M. Friedman, M. Herlihy, V. J. Marathe, and E. Petrank. A persistent lock-free queue for non-volatilememory. In ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), 2018.

[31] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In FOCS,1999.

[32] D. Grove, S. S. Hamouda, B. Herta, A. Iyengar, K. Kawachiya, J. Milthorpe, V. Saraswat, A. Shinnar,M. Takeuchi, and O. Tardieu. Failure recovery in resilient X10. Technical Report RC25660 (WAT1707-028), IBM Research, Computer Science, 2017.

[33] R. Guerraoui and R. R. Levy. Robust emulations of shared memory in a crash-recovery model. In Inter.Conference on Distributed Computing Systems (ICDCS). IEEE, 2004.

[34] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, 2012.

[35] J. Hester, K. Storer, and J. Sorber. Timely execution on intermittently powered batteryless sensors. InProc. ACM Conference on Embedded Network Sensor Systems, 2017.

24

[36] T. C.-H. Hsu, H. Bruegner, I. Roy, K. Keeton, and P. Eugster. NVthreads: Practical persistence formulti-threaded applications. In EuroSys, 2017.

[37] Intel. Intel NVM library. https://github.com/pmem/nvml/.

[38] Intel. Intel architecture instruction set extensions programming reference. Technical Report 3319433-029, Intel Corporation, April 2017.

[39] J. Izraelevitz, T. Kelly, and A. Kolli. Failure-atomic persistent memory updates via JUSTDO logging.In ASPLOS, 2016.

[40] J. Izraelevitz, H. Mendes, and M. L. Scott. Brief announcement: Preserving happens-before in persistentmemory. In SPAA, 2016.

[41] J. Izraelevitz, H. Mendes, and M. L. Scott. Linearizability of persistent memory objects under afull-system-crash failure model. In DISC, 2016.

[42] R. Jacob and N. Sitchinava. Lower bounds in the asymmetric external memory model. In SPAA, 2017.

[43] J. JaJa. Introduction to Parallel Algorithms. Addison-Wesley Professional, 1992.

[44] W.-H. Kim, J. Kim, W. Baek, B. Nam, and Y. Won. NVWAL: exploiting NVRAM in write-aheadlogging. In ASPLOS, 2016.

[45] A. Kolli, S. Pelley, A. Saidi, P. M. Chen, and T. F. Wenisch. High-performance transactions for persistentmemories. In ASPLOS, 2016.

[46] S. K. Lee, K. H. Lim, H. Song, B. Nam, and S. H. Noh. Wort: Write optimal radix tree for persistentmemory storage systems. In USENIX Conference on File and Storage Technologies (FAST), 2017.

[47] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren. DudeTM: Building durabletransactions with decoupling for persistent memory. In ASPLOS, 2017.

[48] B. Lucia and B. Ransford. A simpler, safer programming and execution model for intermittent systems.PLDI, 2015.

[49] K. Maeng, A. Colin, and B. Lucia. Alpaca: intermittent execution without checkpoints. OOPSLA, 2017.

[50] J. S. Meena, S. M. Sze, U. Chand, and T.-Y. Tseng. Overview of emerging nonvolatile memorytechnologies. Nanoscale Research Letters, 9, 2014.

[51] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan, K. Strauss, and S. Swanson.Atomic in-place updates for non-volatile main memories with Kamino-Tx. In EuroSys, 2017.

[52] F. Nawab, J. Izraelevitz, T. Kelly, C. B. Morrey III, and D. R. C. amd Michael L. Scott. Dali: Aperiodically persistent hash map. In DISC, 2017.

[53] S. Pelley, P. M. Chen, and T. F. Wenisch. Memory persistency. In ISCA, 2014.

[54] J. Van Der Woude and M. Hicks. Intermittent computation without hardware support or programmerintervention. In OSDI, 2016.

[55] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight persistent memory. In ASPLOS, 2011.

[56] Yole Developpement. Emerging non-volatile memory technologies, 2013.

25

A Proof of the Correctness of Work-Stealing

Throughout our proof of correctness, we will refer to the code of the work-stealing scheduler shown inFigure 3. We begin by stating some definitions and assumptions. We assume that at least one process will nothard fault during the computation. If this is not true, the computation will have no processes performing workand will never finish. The local continuation of a process can be queried using the function getActiveCapsule.This function may be persistent or ephemeral. Any process can query whether another process has hardfaulted through the ephemeral function isDead. We define the owner of a WS-Deque to be the process that hasthe same process number as the ownerID field of that WS-Deque. We consider a popBottom to be successfulif the CAM at Line 86 is successful. We consider a popTop to be successful if either of the CAM operationsat Lines 46 or 57 are successful.

The first property we prove about our implementation is that the bottom of a WS-Deque can only beoperated on by one process at any time.

Lemma A.1. For a given WS-Deque, only one process can call pushBottom, popBottom, or clearBottomat any time. This process is the owner of the WS-Deque unless the owner hard faults in the middle of apushBottom or popBottom invocation.

Proof. We first consider the pushBottom function. All calls to pushBottom are made from the fork function.These calls are always made to the WS-Deque chosen by the getProcNum function. Since this function returnsthe ID of the process that is running it and there is no capsule boundary between the call to getProcNum andthe call to pushBottom, the process running the fork will always invoke pushBottom on its own WS-Deque.Since the pushBottom function is part of the scheduler rather than the algorithm code, it is never pushed ontothe WS-Deque as a job. This means that it can only be stolen from the local state in the event of a processhard fault. Similarly, all calls to popBottom and clearBottom are made using the getProcNum function insidethe findWork and fork functions respectively. Therefore, the same argument holds. Since only the owner caninvoke these functions and it will run them to completion before calling any other functions, we know that atmost one of these function invocations can exist at any time.

From this property, we find the related result.

Corollary A.1. For a given WS-Deque, only one process can update the bottom pointer at any time. Thisprocess is the owner of the WS-Deque unless the owner hard faults in the middle of a pushBottom orpopBottom invocation.

Proof. The only functions that update the bottom pointer are pushBottom, popBottom, or clearBottom.Applying Lemma A.1 gives the desired result.

We then use this property about the bottom of WS-Deques to show that user level threads that are beingworked on by processes are tracked with local entries.

Lemma A.2. Every process that is working on user level threads will have a local entry that is pointed to bythe bottom pointer of their WS-Deque.

Proof. All user threads are initiated by the findWork function at Line 98 or Line 106.If the thread is started at Line 98, it means that popBottom returned that continuation. The if statement at

Line 87 requires a local entry to exist at stack[b-1] in order for a non-NULL return value. The bottom pointeris then set to this location before the return. Corollary A.1 tells us that bottom pointer will not be modified byany other process. The entry pointed to by the bottom pointer can only be modified from local by calls topushBottom or popTop. We know from Lemma A.1 that pushBottom cannot be running concurrently. We

26

show that popTop cannot concurrently modify the entry by observing that popTop will only modify a localentry for a process that hard faulted, and a process cannot return a value after it hard faults. Therefore, thevalues that exist at Line 87 must still exist upon jumping to the continuation.

If the thread is started at Line 106, it means that popTop returned that continuation. The popTop functioncan return a non-NULL value at Line 49 or Line 60. In either case, the return is preceded by a call to thehelpPopTop function. This function ensures that the entry pointed to by the newly taken entry is set to local.This newly taken entry was set by the CAM at Line 46 if the return happened at Line 49 or the CAM atLine 57 if the return happened at Line 60. Both of these CAMs set the entry pointer in the taken to theargument passed to popTop. Looking at Line 105, we see that this is the pointer to the bottom of the thief’sWS-Deque. Therefore, that is the entry that will be set to local. We know that the bottom entry and pointerwill not be modified between the call to helpPopTop and the jump to the continuation because the ownerprocess is the one running the calls to popTop and findWork and the jump to the thread, and can therefore nothard fault or make other calls to pushBottom, clearBottom, or popTop.

In both cases where user threads are started, a local entry exists on the bottom of the WS-Deque ownedby the process starting that thread. It then remains to show that this local entry is not deleted before theprocess ceases working on that thread. Local entries are only modified by the clearBottom, pushBottom,and popTop functions. We know from Lemma A.1 that unless the process hard faults, only the owner canrun pushBottom or clearBottom. If the owner calls clearBottom, it must have done so from the schedulerfunction. This function is only called when the user level thread completes, meaning the process is no longerworking on it. Calls to pushBottom may modify the local entry that existed prior to the call if the CAMLine 74 succeeds, but Lines 72 and 73 will create a local entry at the new bottom before this can happen.Calls to popTop will never modify a local entry unless the owner has hard faulted. In this case, the local entrywill be set to taken by the CAM at Line 57. Once this CAM is successful, the taken entry will point to thebottom of the thief’s WS-Deque, which will be an empty entry. The first helpPopTop call on the victim’sWS-Deque that resolves Line 25 will change the empty entry to local. Since the thief must complete a call tohelpPopTop between Line 74 and the return from popTop, the local entry will be created before the thiefbegins working on the thread.

Since a process can never work on multiple user level threads, we provide a lemma showing that there arenever multiple local entries visible to steal if the process crashes.

Lemma A.3. At most one local entry can be successfully targeted by a popTop on a WS-Deque. Calls to thepopTop function of that WS-Deque after the successful steal completes will target an empty entry.

Proof. In order for a local entry to be stolen the top pointer of the WS-Deque must point to that entry. Sincethis entry is a local entry, any thief will execute the case beginning at Line 51. In this case Line 56 willexecuted prior to any CAM operation. This will set the entry below the top pointer to empty. Once the localentry has been stolen, the top pointer will be changed to point to the empty entry by the helpPopTop function.No popTop targeting an empty entry will succeed, or perform any modifications to WS-Deque at all. As longas the entry remains empty, no popTop on that WS-Deque can succeed. Empty entries can only be modifiedby the pushBottom function. The code that performs this modification is enclosed in the if statement atLine 71. The condition in this if statement will always fail since the CAM at Line 57 removes the remaininglocal entry from the WS-Deque. Since the empty entry pointed to by the top pointer will never be modified,no further popTop calls can be successful.

Having completed these useful structural lemmas, we can begin to prove the correctness of our functions.We focus first on proving correctness in the face of soft faults and leave hard faults for later in the proof.

27

Lemma A.4. Any popBottom function targeting a job entry will be successful unless a concurrent popTopfunction targeting the same entry is successful or the process hard faults.

Proof. If at any point during the findWork function the process hard faults, then the lemma is vacuously true.This means that we can ignore hard faults for the sake of the proof.

The entry targeted by a popBottom invocation is the entry immediately above the bottom pointer. Ifthis entry is a job, the CAM at Line 86 will succeed unless the entry is changed before the CAM happens.Job entries are only modified by successful invocations of popBottom or popTop, so if neither of thesefunctions concurrently succeed on the target entry, the CAM will succeed, and therefore the popBottom willsucceed.

Lemma A.5. Any popTop function targeting a job entry or a local entry on a process that hard faulted willbe successful unless a concurrent popBottom or popTop function targeting the same entry is successful or theprocess hard faults.

Proof. If at any point during the findWork function the process hard faults, then the lemma is vacuously true.This means that we can ignore hard faults for the sake of the proof.

We first consider the case when the top pointer points to a job entry. In this case the CAM at Line 46 willsucceed unless the entry is changed before the CAM happens. Job entries are only modified by successfulinvocations of popBottom or popTop, so if neither of these functions concurrently succeed on the target entry,the CAM will succeed, and therefore the popTop will succeed.

We next consider the case when the victim has hard faulted and the top pointer points to a local entry.In this case, the CAM at Line 57 will succeed unless the entry is changed before the CAM happens. Localentries are only modified by successful invocations of popTop or invocations of pushBottom. We know fromLemma A.1 that pushBottom functions can only be run by the owner of the WS-Deque or a thief if the ownerof the WS-Deque hard faulted. The owner has hard-faulted, so it cannot run the pushBottom function. SincepushBottom is a scheduler function, it can only be stolen from a local entry, rather than a job entry. Byapplying Lemma A.3 we find that it is impossible for a popTop to target a local entry after the pushBottomfunction is stolen. By applying Lemma A.3 we find that if a popTop invocation targeting a local entry on aprocess that hard faulted is running concurrently with the pushBottom function for that process’ WS-Dequethen the entry targeted by the popTop invocation was the target of another successful popTop invocation thatran concurrently with the original popTop invocation. This means that a popTop invocation targeting a localentry on a process that hard faulted will either succeed or be concurrent with another popTop invocation thatsucceeds at targeting the same entry.

Since we have proven the lemma for both possible cases, the proof is complete.

When proving the correctness of pushBottom we consider how hard faults affect the implementation ofthe function and the interleavings that result. We also connect the user level interface function fork to thescheduler.

Lemma A.6. Every continuation will be added to a WS-Deque as a job exactly the number of times fork wascalled on it.

Proof. Job entries are only added to a WS-Deque via the pushBottom function. This function is only everinvoked by the fork function. Each call to fork directly calls pushBottom exactly once. It therefore suffices toshow that each call to pushBottom other than recursive calls result in the passed argument being added to aWS-Deque exactly once. We show that each call to pushBottom will have exactly one of the following results:the argument is added to the associated WS-Deque exactly once or the owner hard faults and pushBottom is

28

recursively called with the same argument on a different WS-Deque whose owner has not hard faulted. Sincewe know that not all processes can hard fault, this is sufficient.

We begin by assuming that the process does not hard fault while running pushBottom. We know that thebottom entry of the WS-Deque is a local entry by Lemma A.2 and that it cannot be concurrently modifiedby Lemma A.1. This means that all statements inside the if block that begins at Line 71 are executed atleast once and that the first execution of the CAM at Line 74 will succeed. This adds the continuation to theWS-Deque as a job entry. The tag before the entry prevents the CAM from succeeding more than once. Weare then left to show that soft faults will not result in additional calls to pushBottom being made. Until theCAM succeeds, we know that the capsule will always enter the if block at Line 71. In order for this CAM tosucceed, Line 72 must be completed, setting the entry below the old bottom pointer to local. Local entries canonly be modified by the pushBottom, clearBottom, or popTop functions. The current instance of pushBottomwill not change this entry, and Lemma A.1 states that no other instance of pushBottom or clearBottom can berunning concurrently. We observe that popTop will only modify a local entry on a process that hard faulted.This lets us conclude that the local entry will not be modified if the process does not hard fault, preventingthe process from executing the recursive pushBottom call. This means that the continuation is added to theWS-Deque exactly once.

We then consider the case when the process hard faults while running pushBottom. If the hard faultoccurs prior to the CAM at Line 74 has succeeded, then the CAM will not succeed on this invocation ofpushBottom and the thief that steals this thread will recursively call pushBottom on its own WS-Deque. Inthis case, the owner hard faulted, so the local entry at stack[b] will not be modified until it is stolen by acall to popTop. During this popTop, the top pointer will point to stack[b]. This means that the thief will setstack[b+1] to empty in Line 56 prior to completing the popTop. Furthermore, when the CAM at Line 57succeeds, it changes the entry at stack[b] from local to taken. When the thief begins runs the pushBottomcapsule, it will bypass the if block starting at Line 71 in favor of the else block. Since the if block is not taken,the CAM will never be tried. The else block recursively calls pushBottom with the same argument on thethief’s WS-Deque.

If the hard fault occurs after the CAM succeeds, then the continuation has been added to the WS-Dequeand it must not be added again. The tag before the entry prevents the CAM from succeeding more than once.Since the CAM was successful, the entry at stack[b] has been set to job. This means that in order for thepushBottom function to be restarted, a thief had to steal the local entry set at stack[b+1] during Line 72. Inorder for the steal to occur, the CAM at Line 57 had to succeed, which would change the entry from localto taken. Since taken entries are never modified, we know that stack[b+1] must be a taken entry for thepushBottom function to be resumed. This means that when the thief restarts the capsule, it can never reachthe invocation of the pushBottom function.

We have proven that each call to pushBottom will add the argument to the associated WS-Deque exactlyonce or recursively call pushBottom with the same argument on a different WS-Deque whose owner has nothard faulted. This proves that each call to fork results in the argument being added to a WS-Deque as a jobexactly once, completing the proof.

Now that we have shown that user work is correctly added to the scheduler, we show that each processwill try to perform the work that has been added.

Lemma A.7. Every call to findWork results in a successful popTop or a successful popBottom unless theprocess hard faults or the computation ends.

Proof. If at any point during the findWork function the process hard faults, then the lemma is vacuously true.This means that we can ignore hard faults by the process calling findWork for the sake of the proof.

29

The findWork function begins by calling the popBottom function. Lemma A.4 shows that the popBottomcall will be successful unless unless all job entries on the WS-Deque are stolen prior to the CAM at Line 86.

If the popBottom function is not successful then the findWork function will proceed to the while loopthat performs steal attempts. This loop selects a victim process at random, and the performs the popTopfunction on that victim. We know from Lemma A.5 that popTop will succeed if the top entry of the victim’sWS-Deque is a job entry or if the victim has crashed and the top entry of its WS-Deque is a local entry unlessa concurrent popTop targeting the same entry succeeds.

In order for the computation to complete, each user thread must be run to completion. This meansthat if the computation is not complete there is a positive number of user threads that have not been run tocompletion. Since user threads can only be enabled by other user threads, a at least one of these threads mustbe enabled. Lemma A.6 states that this thread had a job entry created for it. Since the thread has not beencompleted, it must have a process working on it or its job entry must be in a WS-Deque. If the job entry is in aWS-Deque, then the top pointer of that WS-Deque must point to that entry, or another job entry above it. Weknow from Lemma A.5 that if the thief calls popTop on this WS-Deque, it will succeed unless a concurrentcall to popBottom or popTop successfully targets that entry. If a process is working on the thread, LemmaA.2 states that WS-Deque has a local entry pointed to by the bottom. If the process does not hard fault, it willeventually call fork with a new continuation or complete its user level thread. If fork is called, it will result ina new job entry which may be targeted by popBottom or popTop. If the user level thread is completed, eitherthere exists another enabled user level thread that this analysis applies to, or the computation is complete. Ifthe process hard faults at any time, then its top entry is a valid popTop target and Lemma A.5 applies.

At any time, progress is being made towards the end of the computation or there exists a target that theprocess running findWork may call popBottom or popTop on successfully. Since the findWork function willmake attempts to call popTop repeatedly until it succeeds, it will eventually either succeed, or the computationwill finish.

We extend the proof of processor effort to show that between all of the available processes, all of thework that is added to the scheduler is found without inadvertently duplicating any of that work.

Lemma A.8. Each job entry in a WS-Deque will be the target of a successful popBottom or popTop exactlyonce.

Proof. To show that a job entry cannot be successfully popBottomed or popTopped more than once, we notethat either successful function is caused by a successful CAM operation on the associated entry. Such a CAMchanges the entry to either local or taken depending on which function was successful.

In order for the computation to complete, each user thread must be run to completion. This means thatwhile there are job entries in any WS-Deque, the scheduler will continue to run. Job entries are located in aWS-Deque above the bottom pointer of the WS-Deque and at or below the top pointer of the WS-Deque.The structure of a WS-Deque means that if a job entry is not directly above the bottom pointer, all entriesbetween it and the bottom entry are job entries. Similarly, if a job entry is not at the top pointer, then allentries between it and the top entry are job entries.

If a process that does not have user level work on its WS-Deque does not hard fault, it will perform onefailed popBottom call, and then repeatedly make popTop attempts until one is successful and it becomes aprocess that has user level work. These popTop attempts are made on random processes, ensuring that eachprocess will eventually e chosen as a victim.

If a process that has user level work on its WS-Deque does not hard fault, it will repeatedly run any localwork that it has, then call the popBottom function. Lemma A.4 tells us that this function will succeed unlessit is targeting a non-job entry or a concurrent popTop call targeting the same entry succeeds. If popBottomsucceeds, the bottom pointer is set to the next lowest entry and the process is repeated. If popBottom is not

30

targeting a job entry, the structure of a WS-Deque tells us that there are no job entries on that WS-Deque. If aconcurrent popTop call targeting the same entry succeeds that entry becomes taken and the top pointer of theWS-Deque must point to that entry. This also means that the WS-Deque contains no job entries. Once theWS-Deque contains no job entries and the local entry (if any) finishes, the process becomes a process thatdoes not have user level work.

If a process that has user level work on its WS-Deque does hard fault, it will have some non-negativenumber of job entries above the bottom pointer of its WS-Deque. We rely on thief processes to pop theseentries from the WS-Deque, and assume that they exist. Lemma A.5 tells us that every popTop attempt on theWS-Deque will be successful unless a concurrent popTop or popBottom is successful. Lemma A.1 statesthat popBottom cannot be run on a WS-Deque owned by a process that hard faulted unless it is stolen. SincepopBottom is a scheduler function, it can only be stolen by a local entry. The structures of the WS-Dequemeans that local entries cannot be stolen until there are no job entries on the WS-Deque. This means that thetop entry will be the target of a successful popTop call from a thief. Since a successful popTop call results inthe top pointer being lowered, this process will repeat until all job entries have been targeted by successfulpopTop calls.

As long as there exists at least one process that does not hard fault, it will switch between having userlevel work that it completes to not having user level work and making popTop attempts until it finds some.The end result of this process is that no job entries will remain in any WS-Deque. Since job entries are onlymodified by successful calls to popBottom or popTop, each job entry must have been removed by a successfulpopBottom or popTop call.

Once the scheduler has assigned threads to various processes, the processes must complete the work. Thefollowing two lemmas show that each thread that is assigned has computation begun on it, which is sufficientto show completion in the face of soft faults.

Lemma A.9. Every continuation that is successfully popTopped is jumped to at least once.

Proof. We consider a continuation to successfully popTopped if the WS-Deque entry associated with thatcontinuation is targeted by a successful CAM operation inside of the popTop function. We consider an entryand a continuation to be associated if the entry is a job containing the continuation or the entry is local whilethe continuation is being run by the process that owns the WS-Deque the entry resides in. After the successfulCAM, the target entry has been set to a taken entry that contains a pointer to the bottom of the thief. Sincetaken entries are never changed, we know that the if statement will succeed if and only if the CAM succeededduring the current capsule. Therefore if the process does not hard fault then the continuation will be returnedto findWork, which jumps to that continuation. Soft faults may cause some of the instructions to be re-run,but will not change the resulting memory state. If the process hard faults at any point between the successfulCAM and the jump, it relies on other thieves calling the helpPopTop function to ensure that there is a localentry at the location pointed to in the taken entry, which is the bottom of its WS-Deque. This entry willeventually be stolen by some other thief. That thief will restart the capsule that the original thief hard faultedduring. Since we know that not all processes hard fault, at some point a process will complete the popTopfunction and jump to the continuation inside the findWork function.

Lemma A.10. Every continuation that is successfully popBottomed is jumped to at least once.

Proof. We consider a continuation to successfully popBottomed if the WS-Deque job entry containingthat continuation is targeted by a successful CAM operation inside of the popBottom function. After theCAM is successful, the target entry has been set to local. This local entry can only be changed by a call toclearBottom, or a call to popTop after the process hard faults. By Lemma A.1, we know that clearBottom

31

cannot run concurrently with popBottom. This means that the if statement will succeed if and only if theCAM succeeded during the current capsule. Therefore if the process does not hard fault then the continuationwill be returned to findWork, which jumps to that continuation. Soft faults may cause some of the instructionsto be re-run, but will not change the resulting memory state. If the process hard faults at any point betweenthe successful CAM and the jump, then then a local entry will exist at the bottom of its WS-Deque until thatentry is stolen. Lemma A.9 shows that once the entry is stolen, it will be jumped to at least once. Jumping toeither popBottom or findWork during the specified window will maintain the local variables, including thecontinuation that will then be jumped to in findWork.

We now show that hard faults do not prevent any computation from being completed.

Lemma A.11. Any user thread on a process that hard faulted will be be the result of a successful popTopand the capsule that was in process will be restarted.

Proof. In order for the computation to complete, each user thread must be run to completion. This meansthat while there are unfinished user threads, the scheduler will continue to run. Lemma A.2 states that anyprocess that is working on a user level thread has a local entry that is pointed to by its bottom pointer. LemmaA.7 states that processes that run out of work will eventually perform a successful popBottom or popTopunless they hard fault or the computation ends. Using Lemma A.8, we know that the number of successfulcalls that target a job entry is limited. Since successful popBottoms can only target job entries and successfulpopTops can only target job or local entries, all other successful popTop calls must occur on user threads onprocesses that have hard faulted. We combine Lemma A.9 with the fact that popTop calls getActiveCapsulewhen stealing a local entry to finish the proof.

We have shown that all work created at the user level is completed and that no user level threads arecreated by the scheduler. We conclude the proof by showing that the scheduler does not over-execute userthreads.

Lemma A.12. No capsule in user level code will be run to completion more times than the number of timesit is invoked by user level code.

Proof. A capsule is considered run to completion when all of its instructions have been completed andthe restart pointer for the subsequent capsule has been installed. We assume that capsules are handled asdiscussed in Section sec-single-proc-robust, which describes how to ensure that soft faults during direct runswill not cause a capsule to run to completion multiple times. We are then left to show that the scheduler doesnot result in extra invocations of user level code. This might happen in two ways: threads might be added toWS-Deques more times than they were enabled or entries on WS-Deques may be run multiple times.

We first show that threads are not added to WS-Deques more times than they are enabled. Threads areonly enabled as job entries through calls to the fork function. We show in Lemma A.6 that the number of jobentries added for a thread is exactly the number of times fork is invoked on that thread.

We next show that although WS-Deque entries can be run multiple times, this will not result in anycapsule being run to completion multiple times. Lemma A.8 states that each job entry will be the result of asuccessful popBottom or popTop exactly once. In Lemmas A.10 and A.9, we prove that in either of thesecases we will jump to the beginning of the thread.

If the thread is run to completion without hard faulting, it will complete the user level work normally,possibly make calls to fork, and then call the scheduler function. We know that local work is not stolen froma process unless that process hard faulted, so we do not have to consider steal attempts at this time. The userlevel work does not interact with the scheduler, and therefore cannot affect the entries on the WS-Deque. IfpushBottom is run to completion without any hard faults, then the original entry that corresponded to the user

32

thread will be replaced with a job entry containing the newly enabled thread continuation. A new local entrycorresponding to the thread will be created below the original entry. The scheduler function calls clearBottom,sets the local entry at the bottom of the WS-Deque to empty. Once this has been completed there is no longeran entry corresponding to the thread, so it cannot be jumped to again unless it is later re-enabled. Since theprocess has not hard faulted, no thief will ever steal the local entry associated with the thread.

We then consider what happens if the process running the thread hard faults. If the process hard faultsduring the user level code, then it will be stolen regularly. Since the popTop function returns the activecapsule when stealing a local entry (Line 60) rather than the entire thread, the thief will start on the firstcapsule that has not been run to completion rather the beginning of the thread. The thief will also set thelocal entry that corresponded to the thread to empty during Line 57, preventing it from being stolen by anyother thief. These facts are true for any steal on a process that has hard faulted. We consider two cases for ahard fault during pushBottom: before the CAM at Line 74 is run at all, and after it has been run at least once.If the hard fault occurs before the CAM is run then the entry at stack[b] will remain local until it is stolen.During the popTop call when this entry is stolen, the thief will set stack[b+1] to empty during Line 56. Sincethis occurs before the CAM at Line 57, it will occur before the top pointer can be changed to stack[b+1].Since this entry will be set to empty before any popTop calls can see it, it will never be stolen. When the thiefrestarts the active capsule in pushBottom, the entry at stack[b] will have been set to taken and the entry atstack[b+1] will have been set to empty, so the thief will call pushBottom on its own WS-Deque. That call canbe analyzed in the same manner as the original call. If the hard fault occurs after the CAM at Line 74 hasbeen run then the entry at stack[b] has been set to job. In this case, the current thread will not be stolen untilthe top pointer is set to point to stack[b+1]. This entry was set to local by Line 72. When the thief restartsthe active capsule, the state of the WS-Deque will cause it to bypass both if clauses and immediately returnto the user thread without further modifying the WS-Deque. If the process hard faults during the schedulerfunction, the hard fault will either occur before clearBottom finishes, in which case it will be stolen andrestarted normally, or it will occur after the clearBottom finishes, in which case the entry will be set to emptyand therefore never stolen.

We have now proven that threads are not added to WS-Deques more times than they are enabled and thatWS-Deque entries being run multiple times will not cause a user level capsule inside threads associated withthose entries to be run to completion multiple times. This completes the proof.

Combining the various lemmas gives the following theorem.

Theorem A.1. The implementation of work stealing provided in Figure 3 correctly schedules work accordingto the specification in Section 6.

Proof. We know from Lemma A.6 and Lemma A.8 that every enabled user thread will be scheduled on to anactive process. Lemma A.9, Lemma A.10, and Lemma A.11 combine to prove that every scheduled thread isrun to completion. Lemma A.8 and Lemma A.12 show that no work is duplicated or re-executed. Since allwork is scheduled and run to completion following the computation dependencies, the implementation iscorrect.

33

the parallel persistent memory model - arxiv · we consider a parallel computational model, the...

Documents