
University of Massachusetts, Technical Report TR14-08

ZZ: Cheap Practical BFT using Virtualization

Timothy Wood, Rahul Singh, Arun Venkataramani, and Prashant Shenoy
Department of Computer Science, University of Massachusetts, Amherst, MA 01003
Email: {twood,rahul,arun,shenoy}@cs.umass.edu

Abstract

Despite numerous efforts to improve their performance and scalability, byzantine fault-tolerance (BFT) techniques remain expensive, and few commercial systems use BFT today. We present ZZ, a novel approach to construct general BFT services with a replication cost of practically f + 1, halving the 2f + 1 or higher cost incurred by state-of-the-art approaches. The key insight in ZZ is to use f + 1 execution replicas in the normal case and to activate additional replicas only upon failures. ZZ uses virtual machines for fast replica activation and several novel mechanisms for rapid recovery of these replicas, such as using filesystem snapshots to reduce checkpointing overhead, replaying state updates instead of full requests, and an amortized state transfer mechanism that fetches state on demand. We have implemented ZZ using the BASE library, Xen virtual machines, and the ZFS file system. Our experimental evaluation shows that the recovery time of ZZ replicas is independent of the application disk state, taking less than 4s for 400MB of disk state, at the expense of a small increase in request latency in the fault mode.

1 Introduction

Today's enterprises rely on data centers (server and storage farms) to run their critical business applications. As users have become increasingly dependent on online applications and services, any malfunction is highly problematic, resulting in financial losses, negative publicity, or inconvenience to users. Application malfunctions can be caused by hardware failures, software bugs, or malicious break-ins. Consequently, maintaining high availability of critical applications and services has become both a pressing need and a challenge.

Byzantine fault tolerance (BFT) is a powerful but expensive technique to maintain high availability while tolerating arbitrary (Byzantine) faults. BFT uses replication to mask faults. BFT state machine replication is a well-studied approach to construct general services so as to be robust to byzantine faults. This approach requires replicas to agree upon the order of incoming requests and process every request in the agreed-upon order. Despite numerous efforts to improve the performance or scalability of BFT systems [2, 5, 13, 21, 25, 1], BFT techniques remain expensive. Existing approaches for BFT require at least 2f + 1 replicas to execute each request in order to tolerate f faults [13, 29]. Our position is that this high cost has discouraged widespread adoption of BFT techniques for commercial services. On the other hand, data centers today do employ fail-stop fault-tolerance techniques or simple mirroring of data that require only f + 1 replicas to tolerate f faults. This suggests that reducing the replication cost of BFT may remove a significant barrier to its adoption.

To this end, we present a novel approach to construct general BFT services with a replication cost close to f + 1, halving the 2f + 1 or higher cost incurred by state-of-the-art approaches, effectively providing BFT at the cost of fail-stop fault tolerance. Like [29], our system still requires 3f + 1 agreement replicas, but it only requires f + 1 execution replicas in the common case when there are no faults. The key insight in ZZ is to use additional replicas that get activated only upon failures. Furthermore, since not all applications hosted by a data center are likely to experience faults simultaneously, ZZ can share a small pool of free servers across a large number of applications and multiplex these servers on-demand to house newly activated replicas, thereby reducing the total hardware provisioning cost for the data center to practically f + 1. ZZ draws inspiration from Cheap Paxos [17], which advocated the use of cheaper auxiliary nodes that are used only to handle crash failures of main nodes. Our contribution lies in adapting the approach to tolerate byzantine faults, demonstrating its feasibility through a prototyped system, and making it cheap and practical by leveraging a number of modern systems mechanisms.

We are able to circumvent the "lower bounds" on replication costs stated in previous work [13] because our approach relies on virtualization, a mechanism not considered in that work. Virtualization enables additional replicas to be quickly activated on-demand upon fault detection. Moreover, ZZ employs several mechanisms for rapid recovery so that a newly started replica can begin processing requests without waiting for a full state transfer. Our design and implementation of ZZ have resulted in the following contributions. First, ZZ employs virtualization to enable fast replica startup as well as to dynamically multiplex a small number of "auxiliary" servers across a larger number of applications. Since a newly-activated replica is out-of-date with the remaining replicas, it must catch up with the rest before it can process client requests. ZZ enables rapid recovery of standby replicas by enabling them to execute requests without a full state transfer. Specifically, ZZ employs an amortized state transfer mechanism that fetches and verifies state on demand, allowing for faster recovery. Second, ZZ employs mechanisms to optimize the replay of requests since the last checkpoint by only requiring state updates to be applied, eliminating the computational overheads of full request processing during replay. Third, ZZ exploits the snapshot feature of modern journaled file systems to obtain the benefits of incremental checkpointing without requiring modifications to application source code.

We have implemented a prototype of ZZ by enhancing the BASE library and combining it with the Xen virtual machine and the ZFS file system. We have evaluated our prototype using a canonical client-server application and a ZZ-based NFS file server. Our experimental results demonstrate that, by verifying state on demand, ZZ can obtain a recovery time that is independent of application state size: a newly activated ZZ replica with 400MB of disk state can recover in less than 4s, in contrast to over 60s for existing approaches. This recovery speed comes at the expense of a small increase in request latency in the fault mode due to the on-demand state verification.

The rest of this paper is structured as follows. Sections 2 and 3 present background and our system model. Sections 4 and 5 present the ZZ design, and Section 6 presents its implementation. We present an experimental evaluation of our prototype in Section 7, discuss related work in Section 8, and conclude in Section 9.

2 Background

2.1 From 3f + 1 to 2f + 1

In the traditional PBFT approach [2], during graceful execution a client sends a request Q to the replicas. The 3f + 1 (or more) replicas agree upon the sequence number corresponding to Q, execute it, and send responses back to the client. When the client receives f + 1 valid and matching responses R1, . . . , Rf+1 from different replicas, it knows that at least one correct replica executed Q in the correct order. Figure 1 illustrates how the principle of separating agreement from execution can reduce the number of execution replicas required to tolerate up to f faults from 3f + 1 to 2f + 1. In this separation approach [29], the client sends Q to a primary in the agreement cluster consisting of 3f + 1 lightweight machines that agree upon the sequence number i corresponding to Q and send [Q, i] to the execution cluster consisting of 2f + 1 replicas that store and process application state. When the agreement cluster receives f + 1 matching responses from the execution cluster, it forwards the response to the client, knowing that at least one correct execution replica executed Q in the correct order. For simplicity of exposition, we have omitted cryptographic operations above.

Figure 1: The PBFT approach versus the separation of agreement from execution.

2.2 State-of-the-art

The 2f + 1 replication cost is believed necessary [13, 5, 1] for BFT systems. For example, Kotla et al., in their paper describing Zyzzyva (see Table 1 in [13]), claim that 2f + 1 is a lower bound on the number of replicas with application state for state machine replication (SMR). However, more than a decade ago, Castro and Liskov concluded their original paper on PBFT [2] saying "it is possible to reduce the number of copies of the state to f + 1 but the details remain to be worked out". In this paper, we work out those details.

Table 1 shows the cost of existing BFT approaches, including non-SMR approaches based on quorums, in comparison to ZZ. All approaches require a total of at least 3f + 1 machines in order to tolerate up to f independent Byzantine failures, consistent with classical results that place a lower bound of 3f + 1 replicas on any safe byzantine consensus protocol that is live under weak synchrony assumptions [7].

The quorum-based approach, Q/U, requires an even higher number of replicas but provides better fault scalability compared to PBFT, i.e., better throughput as f increases provided there is low contention across requests. The hybrid quorum (HQ) approach attempts to achieve the best of PBFT and Q/U, i.e., lower replication cost as well as fault scalability, but is unable to leverage the throughput benefits of batching. Zyzzyva, based on the more traditional state machine approach, goes a step further to incorporate speculative execution, resulting in fewer cryptographic operations per request while retaining the throughput benefits of batching. Our implementation of ZZ incurs f + 1 additional cryptographic operations per batch compared to PBFT, but it is straightforward to match PBFT by more efficient piggybacking of protocol messages. In theory, ZZ's execution replicas can be interfaced with any agreement protocol, including Zyzzyva, to match its low cryptographic overhead as well as fault scalability.

                                        PBFT'99 [2]    Q/U'05 [1]   HQ'06 [5]   Zyzzyva'07 [13]   ZZ
Total machines                          3f + 1         5f + 1       3f + 1      3f + 1            3f + 1
Application state and execution
  replicas (with separation'03 [29])    2f + 1         5f + 1       3f + 1      2f + 1            f + 1
Batching b > 1 requests                 Y              N            N           Y                 Y
Crypto operations per request           2 + (8f+1)/b   2 + 8f       4 + 4f      2 + 3f/b          2 + (9f+2)/b

Table 1: Properties of various approaches.

Q/U incurs just two one-way network delays on the critical path when contention is low and is optimal in this respect. All approaches require four or more delays when contention is high or when faults occur. However, the dominant latency in practice is likely to be the wide-area latency from the client to the replicas and back, as the replicas in a data center are likely to be located on the same local area network. All approaches except HQ incur just one wide-area network round-trip latency.

The comparison above is for the performance of the different protocols during fault-free and timeout-free execution. When failures occur, the throughput and latency of all protocols can degrade significantly, and a thorough comparison is nontrivial and difficult to characterize concisely [23]. When failures occur, a recovering ZZ replica (the name denotes sleeping replicas, from the sleeping connotation of the term "zz..") incurs a higher latency to execute some requests until it has fully recovered, so it takes longer for a client to receive a response certificate for such requests. Our experiments suggest that this additional overhead is tolerable. Also, the client can tentatively accept a replica's response while waiting to collect a certificate for the response, and issue other requests that do not depend on this response. In a world where failures are rare, ZZ offers valuable savings in replication cost while minimally impacting performance.

3 System Model

We assume a byzantine failure model where faulty replicas or clients may behave arbitrarily. There are two kinds of replicas: 1) agreement replicas that assign an order to client requests and 2) execution replicas that maintain application state and execute client requests. Replicas fail independently, and we assume an upper bound g on the number of faulty agreement replicas and an upper bound f on the number of faulty execution replicas in a given window of vulnerability. An adversary may coordinate the actions of faulty nodes in an arbitrary malicious manner. However, the adversary cannot subvert standard cryptographic assumptions about collision-resistant hashes, encryption, and digital signatures. The notation 〈LABEL, X〉i denotes the message X of type LABEL digitally signed by node i. We indicate the digest of message X as X̄.

Our system uses the state machine replication (SMR) model to implement a BFT service. Replicas agree on an ordering of incoming requests, and each execution replica executes all requests in the same order. Like all previous SMR-based BFT systems, we assume that either the service is deterministic or the non-deterministic operations in the service can be transformed to deterministic ones via the agreement protocol [2, 13, 29, 21].

Our system ensures safety in an asynchronous network that can drop, delay, corrupt, or reorder messages. Liveness is guaranteed only under eventual synchrony, i.e., there is an unknown point in the execution after which either all messages are delivered within some constant time ∆, or all non-faulty clients have received replies to their requests. Our system model and assumptions are similar to those assumed by many existing BFT systems [2, 13, 29, 21].

4 ZZ Design

ZZ reduces the replication cost of BFT from 2f + 1 to f + 1 using virtualization. ZZ is based on two simple insights. First, if a system is designed to be correct in an asynchronous environment, it must be correct even if some or all replicas are arbitrarily slow. Second, during fault-free periods, a system designed to be correct despite f byzantine faults must be unaffected if up to f replicas are turned off. ZZ leverages the second insight to turn off f replicas during fault-free periods, requiring just f + 1 replicas to actively execute requests. When faults occur, ZZ leverages the first insight and behaves exactly as if the f standby replicas were just slow but correct replicas.

Given an ordered request, if the f + 1 active replicas return matching responses, at least one of these responses, and by implication all of the responses, must be correct. The problematic case is when the f + 1 responses do not match. In this case, ZZ starts up additional virtual machines hosting standby replicas.


Figure 2: Various scenarios in the ZZ system for f = 2 faults. Request 22 results in matching responses γ, but the mismatch in request 23 initiates new virtual machine replicas on demand.

For example, when f = 1, upon detecting a fault, ZZ starts up a third replica that executes the most recent request. Since at most one replica can be faulty, the third response must match one of the other two responses, and ZZ returns this matching response to the client. Figure 2 illustrates the high-level flow of control for the case when f = 2. Request 22 is executed successfully, generating the response γ, but request 23 results in a mismatch, waking up the two standby VM replicas.

Design Challenges. The high-level approach described above raises several challenges. First, how does a restored replica obtain the necessary application state required to execute the current request? In traditional BFT systems, each replica maintains an independent copy of the entire application state. Periodically, all replicas checkpoint their application state and discard the sequence of responses before the checkpoint. The checkpoint is also used to bring up to speed a very slow replica that has missed several previous checkpoints by transferring the entire application state to the slow replica. However, a restored ZZ replica may not have any previous version of the application state. It must be able to verify that the state is correct even though there may be only one correct execution replica (and f faulty ones); e.g., when f = 1, the third replica must be able to determine which of the two existing replicas' state is correct.

Second, transferring the entire application state can take an unacceptably long time. In existing BFT systems, a recovering replica may generate incorrect messages until it obtains a stable checkpoint. This inconsistent behavior during checkpoint transfer is treated like a fault and does not impede progress of request execution if there is a quorum of f + 1 correct execution replicas with a current copy of the application state. However, when a ZZ replica recovers, there may exist just one correct execution replica with a current copy of the application state. The traditional state transfer approach may stall request execution until f recovering replicas have obtained a stable checkpoint.

Third, ZZ's replication cost must be robust to faulty replica or client behavior. A faulty client must not be able to trigger recovery of standby replicas. A compromised replica must not be able to trigger additional recoveries if there are at least f + 1 correct and active replicas. If these conditions are not met, the replication cost savings would vanish and system performance could be worse than that of a traditional BFT system using 2f + 1 replicas.

5 ZZ Protocol

This section presents the ZZ protocol for (i) handling incoming client requests, (ii) graceful execution, (iii) fault detection, (iv) replica recovery, and (v) checkpointing. We also present our amortized state transfer mechanism and discuss the safety and liveness properties of ZZ.

5.1 Client Request

A client c sends a request 〈REQUEST, o, t, c〉c to the agreement cluster to submit an operation o to the state machine service with a timestamp t. The timestamps ensure exactly-once semantics for execution of client requests as they enforce a total order across all requests issued by the client. A correct client uses monotonically increasing timestamps, e.g., the value of the client's local clock, and a faulty client's behavior does not affect other clients' requests. After issuing the request, the client waits for a reply certificate certified by at least f + 1 execution replicas.
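The request path above can be sketched as the following illustrative Python stub; the sign() and send_to_agreement() helpers, and the class itself, are assumptions rather than part of the ZZ code base. It shows how monotonically increasing timestamps (here derived from the local clock) give exactly-once semantics:

    import time

    class ZZClient:
        """Illustrative client stub: builds REQUEST messages with
        monotonically increasing timestamps (exactly-once semantics)."""

        def __init__(self, client_id, sign, send_to_agreement):
            self.client_id = client_id        # c
            self.sign = sign                  # hypothetical signing helper
            self.send = send_to_agreement     # hypothetical transport helper
            self.last_t = 0

        def submit(self, operation):
            # A correct client never reuses or decreases a timestamp.
            t = max(self.last_t + 1, int(time.time() * 1e6))
            self.last_t = t
            request = {"type": "REQUEST", "o": operation,
                       "t": t, "c": self.client_id}
            self.send(self.sign(request))
            return t   # the client now waits for a reply certificate for t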

5.2 Graceful Execution

The agreement cluster 1) assigns an order to incoming client requests, 2) sends committed requests to the execution cluster, 3) receives execution reports back from the execution cluster, and 4) relays response certificates to the client when needed.

Upon receiving a client request Q = 〈REQUEST, o, t, c〉c, the agreement replicas execute an agreement protocol and commit a sequence number to the request. Each agreement replica j sends an order message 〈ORDER, v, n, Q〉j that includes the view v and the sequence number n to all execution replicas.

An execution replica i executes a request Q when it receives a request certificate 〈ORDER, v, n, Q〉A|2g+1 signed by at least 2g + 1 agreement replicas and it has executed all other requests with a lower sequence number. Let R denote the response obtained by executing Q. Replica i sends a message containing the digests of the response and the writes 〈EXEC-REPORT, n, R〉i to the agreement cluster. Replica i also sends a response message 〈REPLY, v, t, c, i, R〉i directly to the client.
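As a rough sketch of the execution condition just described (the message fields and the verify_sig() helper are assumptions, not ZZ's actual data structures):

    def ready_to_execute(order_msgs, n, last_executed, g, verify_sig):
        """Return True if request n has a valid request certificate
        (ORDER messages signed by at least 2g+1 distinct agreement
        replicas) and all lower sequence numbers have been executed."""
        signers = {m["j"] for m in order_msgs
                   if m["type"] == "ORDER" and m["n"] == n and verify_sig(m)}
        return len(signers) >= 2 * g + 1 and last_executed == n - 1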

In the normal case, the client receives f + 1 valid and matching response messages from execution replicas. Two response messages match if they have the same values for v, t, c, and R. The client determines a message from replica i to be valid if it has a valid signature. This collection of f + 1 valid and matching responses is called a response certificate. Since at most f replicas can be faulty, a client receiving a response certificate knows that the response is correct.

Request Timeout at Client: If a client does not receive a response certificate within a predetermined timeout, it retransmits the request message to all agreement replicas. If an agreement replica j receives such a retransmitted request and it has received f + 1 matching EXEC-REPORT messages, it sends the response message 〈REPLY-A, v, t, c, j, R〉j. A client considers a response R as correct when it receives g + 1 valid and matching REPLY-A messages from agreement replicas and at least one valid REPLY message from an execution replica.

As in PBFT [2], a client normally sends its request only to the primary agreement replica, which it learns based on previous replies. If many agreement replicas receive retransmissions of a client request for which a valid sequence number has not been assigned, they will eventually suspect the primary to be faulty and trigger a view change electing a new primary. Note that although clients can cause view changes in ZZ, they cannot trigger the recovery of standby execution replicas.
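The two ways a client accepts a response, described above, can be summarized in the following illustrative check; the message dictionaries and field names are assumptions, and signature checking is presumed to have already happened:

    from collections import Counter

    def have_reply_certificate(replies, replies_a, f, g):
        """Accept response R if (i) f+1 matching REPLY messages arrive
        from execution replicas, or (ii) g+1 matching REPLY-A messages
        arrive from agreement replicas plus at least one matching REPLY
        from an execution replica. Messages match on (v, t, c, R)."""
        def key(m):
            return (m["v"], m["t"], m["c"], m["R"])
        exec_counts = Counter(key(m) for m in replies)
        agree_counts = Counter(key(m) for m in replies_a)
        for k, cnt in exec_counts.items():
            if cnt >= f + 1:
                return k[-1]                  # case (i)
            if agree_counts.get(k, 0) >= g + 1:
                return k[-1]                  # case (ii)
        return None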

5.3 Fault Detection

The agreement cluster is responsible for detecting faults in the execution cluster. In the normal case, an agreement replica j waits for an execution report certificate 〈EXEC-REPORT, n, R〉E|f+1 for each request, i.e., valid and matching execution reports for the request from all f + 1 execution replicas. Replica j inserts this certificate into a local log ordered by the sequence number of requests. When j does not obtain an execution report certificate for a request within a predetermined timeout, j sends a recovery request 〈RECOVER, n〉j to a subset of the hypervisors controlling the f standby execution replicas. The number of recovery requests issued by an agreement replica is at least as large as f minus the size of the smallest set of valid and matching execution reports for the request that triggered the recovery.

When the hypervisor controlling an execution replica i receives a recovery certificate 〈RECOVER, n〉A|g+1, i.e., valid and matching recovery requests from g + 1 or more agreement replicas, it starts up the local execution replica. The hypervisor is assumed to be a trusted entity.
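A minimal sketch of the two quorum checks in this subsection, assuming hypothetical message dictionaries and a verify_sig() helper; it is not the actual agreement-node or hypervisor code:

    from collections import Counter

    def exec_report_certificate(reports, f):
        """An agreement replica logs an execution-report certificate only
        if all f+1 execution replicas returned valid, matching reports."""
        digests = Counter(r["R"] for r in reports)
        return len(reports) == f + 1 and len(digests) == 1

    def hypervisor_should_wake(recover_msgs, n, g, verify_sig):
        """The (trusted) hypervisor unpauses its standby VM only on a
        recovery certificate: valid RECOVER messages for sequence
        number n from g+1 distinct agreement replicas."""
        signers = {m["j"] for m in recover_msgs
                   if m["type"] == "RECOVER" and m["n"] == n and verify_sig(m)}
        return len(signers) >= g + 1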

5.4 Replica Recovery

When an execution replica k starts up, it has neither any application state nor a pending log of requests. Replica k's situation is similar to that of a long-lost replica that may have missed several previous checkpoints and wishes to catch up in existing BFT systems [2, 21, 29, 13]. Since these systems are designed for an asynchronous environment, they can correctly re-integrate up to f such recovering replicas provided there are at least f + 1 other correct execution replicas. The recovering replica must obtain the most recent checkpoint of the entire application state from existing replicas and verify that it is correct. Unfortunately, checkpoint transfer and verification can take an unacceptably long time for applications with a large application state. Worse, unlike previous BFT systems that can leverage copy-on-write techniques and incremental cryptography schemes to transfer only the pages modified since the last checkpoint, a recovering ZZ replica does not have any previous application state.

To reduce recovery latency, ZZ uses a novel recovery scheme that amortizes the cost of state transfer across many requests. A recovering execution replica k first obtains an ordered log of committed requests since the most recent checkpoint from the agreement cluster. Let m denote the sequence number corresponding to the most recent checkpoint and let n ≥ m + 1 denote the sequence number of the most recent committed request. Some of the requests in the interval [m + 1, n] involve writes to application state while others do not. Replica k begins to replay, in order, the requests in [m + 1, n] involving writes to application state.

5.5 Amortized State Transfer

How does replica k begin to execute requests without any application state? The key insight is to fetch and verify the state necessary to execute a request on demand. After fetching the pending log of committed requests from the agreement cluster, k obtains a checkpoint certificate 〈CHECKPOINT, n, C〉E|g+1 from the agreement cluster (as explained in the following subsection). Replica k further obtains valid digests for each page in the checkpoint from any execution replica. The checkpoint digest C is computed over the individual page digests, so k can verify that it has valid page digests by obtaining them from any correct execution replica. As in previous BFT systems [2, 21, 29, 13], execution replicas optimize the computation of checkpoint and page digests using copy-on-write and incremental cryptography schemes.

After obtaining the above checkpoint certificate and page digests, replica k begins to execute, in order, the committed requests in [m + 1, n] that involve writes. Let Q be the first request that reads from or writes to some page p since the most recent checkpoint. To execute Q, replica k fetches the page on demand from any execution replica that can provide a page consistent with p's digest, which k has already verified. Replica k continues executing requests in sequence number order and fetches pages on demand until it obtains a stable checkpoint.
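The fetch-and-verify step can be sketched as follows; SHA-256 stands in for whatever incremental digest scheme the replicas actually use, page_digests is a list of per-page digests indexed by page number, and fetch_page() is a hypothetical transfer helper:

    import hashlib

    def digest(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def verify_page_digests(page_digests, checkpoint_digest):
        # The checkpoint digest C is computed over the page digests, so a
        # recovering replica can accept a page-digest list obtained from
        # any single replica by checking it against the certified C.
        return digest(b"".join(page_digests)) == checkpoint_digest

    def fetch_page_on_demand(page_id, page_digests, replicas, fetch_page):
        # Fetch page page_id from some execution replica and accept it
        # only if it hashes to the already-verified digest for that page.
        for r in replicas:
            data = fetch_page(r, page_id)
            if data is not None and digest(data) == page_digests[page_id]:
                return data
        raise RuntimeError("no replica supplied a page matching its digest")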

5.6 Checkpointing

Execution replicas periodically construct checkpoints of application state in order to garbage collect their pending request logs. The checkpoints are constructed at predetermined request sequence numbers, e.g., when the sequence number is exactly divisible by 1024.

Unlike [29], the checkpoint protocol is not completely contained within the execution cluster and requires coordination with the agreement cluster. To appreciate why, consider that in existing BFT systems including [29], if digital signatures are used, it suffices for an execution replica to obtain a checkpoint certificate 〈CHECKPOINT, n, C〉E|f+1 signed by f + 1 execution replicas. It can use this certificate to prove to a recovering replica during a state transfer that the checkpoint is valid. However, if message authentication codes (MACs) are used, then f + 1 checkpoint messages are insufficient for a replica to prove to a recovering replica that the checkpoint is correct. With 2f + 1 execution replicas in [29], a recovering replica is guaranteed to get at least f + 1 valid and matching checkpoint messages from other execution replicas. However, with just f + 1 execution replicas in ZZ, a recovering replica may get only one checkpoint message from other execution replicas.

To address this problem, the execution cluster coordinates with the agreement cluster during checkpoints. Each execution replica i sends a checkpoint message 〈CHECKPOINT, n, C〉i to the agreement cluster. An agreement replica waits to receive f + 1 valid, matching checkpoint messages. If g + 1 or more agreement replicas fail to receive f + 1 valid and matching checkpoint messages, they issue a recovery request to a subset of the hypervisors controlling the standby execution replicas.
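A compact sketch of the checkpoint trigger and the agreement-side check described above; the message layout is an assumption, and 1024 is the example interval from the text:

    from collections import Counter

    CHECKPOINT_INTERVAL = 1024   # checkpoint at predetermined sequence numbers

    def is_checkpoint_seqno(n):
        return n % CHECKPOINT_INTERVAL == 0

    def checkpoint_certified(chkpt_msgs, n, f):
        # An agreement replica accepts the checkpoint for sequence number
        # n once f+1 valid CHECKPOINT messages carry the same digest C;
        # if g+1 agreement replicas time out waiting, they instead
        # trigger recovery of standby execution replicas.
        digests = Counter(m["C"] for m in chkpt_msgs
                          if m["type"] == "CHECKPOINT" and m["n"] == n)
        return any(count >= f + 1 for count in digests.values())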

5.7 Separating State Updates and Execution

A novel optimization enabled by ZZ's recovery approach is the separation of state updates from the rest of request execution to further speed up recovery. A ZZ replica need only apply state updates without actually re-executing requests since the last checkpoint.

To this end, execution replicas include information about state updates in their execution reports. During graceful execution, an execution replica i includes additional information in its execution report 〈EXEC-REPORT, n, W, Hn〉, where W identifies the writes performed during the execution of the request and Hn = [Hn−1, W] summarizes the write history. Let m be the sequence number of the last stable checkpoint and let n be the sequence number that caused a fault to be detected. Replica i also stores the list of writes Wm+1, . . . , Wn−1 performed while executing the requests with the corresponding sequence numbers.

Figure 3: ZZ separates request execution from updating state during recovery.

Upon recovery, an execution replica j obtains a checkpoint certificate 〈CHECKPOINT, m, C〉A|g+1 for the most recent successful checkpoint, and the execution report certificate 〈EXEC-REPORT, n, W, Hn−1〉 for the sequence number immediately preceding the fault. Replica j obtains the list of writes [Wm+1, . . . , Wn−1] matching the execution report certificate from any correct execution replica. Then, instead of actually executing the requests, j simply applies the writes in sequence number order while fetching the most recent checkpoint of the pages being written to on demand. This approach is illustrated in Figure 3. For compute-intensive applications with infrequent and small writes, this optimization can significantly reduce recovery latency.
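A sketch of the replay-by-state-updates optimization, assuming the write lists recorded in the execution reports are available as simple (object id, new value) pairs and that get_checkpoint_page() performs the on-demand fetch-and-verify described in Section 5.5:

    def recover_by_applying_writes(writes_by_seq, m, n, get_checkpoint_page,
                                   local_state):
        """Instead of re-executing requests m+1..n-1, a recovering
        replica applies only their recorded writes in sequence order,
        pulling the checkpointed version of each touched object on
        demand. writes_by_seq[k] is a list of (object_id, new_value)
        pairs taken from the write history in the EXEC-REPORTs."""
        for k in range(m + 1, n):
            for obj_id, new_value in writes_by_seq.get(k, []):
                if obj_id not in local_state:
                    # fetch and verify the checkpointed object on demand
                    local_state[obj_id] = get_checkpoint_page(obj_id)
                local_state[obj_id] = new_value
        return local_state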

5.8 Safety and Liveness Properties

We state the safety and liveness properties ensured by ZZ and outline the proofs. Due to space constraints, we defer formal proofs to a technical report [28].

ZZ ensures the safety property that if a client receives a reply 〈REPLY, v, n, t, c, i, R〉i from f + 1 execution replicas, or a 〈REPLY, v, n, t, c, i, R〉i from at least one execution replica and 〈REPLY-A, v, n, t, c, j, R〉j from g + 1 agreement replicas, then 1) the client issued a request 〈REQUEST, o, t, c〉c earlier, 2) all correct replicas agree on the order of all requests with sequence numbers in [1, n], and 3) the value of the reply R is the reply that a single correct replica would have produced if it started with a default initial state S0 and executed each operation oi, 1 ≤ i ≤ n, in that order, where oi denotes the operation requested by the request to which sequence number i was assigned.

The first claim follows from the fact that the agreement cluster only generates valid request certificates for valid client requests, and the second claim follows from the safety of the agreement protocol [3], which ensures that no two requests are assigned the same sequence number. To show the third claim, consider a single correct execution replica e that starts with a default state S0 and executes all operations oi, 1 ≤ i ≤ n, in order. We first show that the state S^i_n of at least one correct active execution replica i after executing o_n is equal to the corresponding state S^e_n of the single correct replica. The proof is an inductive argument over epochs, where a new epoch begins each time the agreement cluster detects a fault and starts up standby replicas. Just before a new epoch begins, there are at least f + 1 replicas with a full copy of the most recent stable checkpoint. The agreement cluster shuts down execution replicas that responded incorrectly. At least one replica that responded correctly is correct, as there can be at most f faults. Since we assume that the window of vulnerability is greater than the time required for f replicas to obtain a stable checkpoint, there are at least f + 1 replicas with the most recent stable checkpoint before the next epoch begins.

ZZ ensures the liveness property that if a client sends a request with a timestamp exceeding previous requests and repeatedly retransmits the request, then it will eventually receive a valid reply certificate. The liveness property relies on the eventual synchrony assumption in Section 3, which is slightly stronger than that of systems [2, 29, 13] that assume an infinite window of vulnerability, and is the same as that assumed by the proactive recovery approach described by Castro and Liskov [3]. An important difference between the proactive recovery protocol and ZZ is that the former requires recovering replicas to not lose state, whereas a recovering ZZ replica starts with no state and obtains certified pages on demand. Liveness is preserved because, unlike [3], a ZZ replica does not stop executing requests if its checkpoint is not stable after a predetermined number of requests. For the duration until its checkpoint becomes stable, a correct recovering replica correctly executes requests, and fetching pages on demand only causes higher request processing latency.

6 ZZ Implementation

We have implemented ZZ by enhancing the BASE library. Our implementation consists of (i) enhancements to the BASE library to implement our checkpointing, fault detection, rapid recovery, and fault-mode execution mechanisms, (ii) the use of virtual machines to house and run replicas, and (iii) the use of file system snapshots to assist checkpointing. We describe these mechanisms next.


Figure 4: An example server setup with three fault-tolerant applications, A, B, and C. The recovery pool stores a mix of paused and hibernated VMs.

6.1 Using Virtual Machine Replicas

Our implementation assumes an environment such as a data center that runs N independent applications; replicas belonging to each application are assumed to run inside a separate virtual machine. In the normal case, at least f + 1 execution replicas are assumed to be running on a mutually exclusive set of servers. The data center maintains a pool of free (or under-loaded) servers that are used on-demand to host newly activated replicas. In the case of a fault, an additional f execution VM replicas are started for that application. Assuming that not all of the N executing applications will see a fault simultaneously, this free server pool can be multiplexed across the N applications, allowing the pool to be much smaller than the worst-case needs of N(f + 1) additional execution servers. The number of free servers needed as a "backup" for fault-mode execution depends on the number of applications that experience simultaneous faults; at least f + 1 free servers are needed, and a larger pool allows multiple application faults to be handled without degrading execution speed. (Since each physical server can house multiple VMs, it is possible to run N(f + 1) concurrent VMs on as few as f + 1 servers even if all N applications experience a fault simultaneously. However, running N VMs per physical server will cause each newly activated replica to get only a fraction of the server resources, degrading its execution speed. A larger free server pool permits better multiplexing.) Thus, using virtualization to dynamically multiplex the free server pool allows us to reduce the BFT cost to practically f + 1.

Virtualization also enables fast replica startup if we assume a pool of pre-spawned replicas that are maintained in "dormant" mode. For instance, each pre-spawned VM can be maintained in hibernated mode on disk, allowing for a quick resume. For faster startup, the pre-spawned VM can be maintained in paused mode in memory, where it consumes no CPU and uses limited memory; unpausing a VM simply requires scheduling it on the CPU, causing it to become active within milliseconds. The size of the pre-spawned pool is limited by the size of disk or memory on the server, depending on whether VMs are hibernating or paused. The ZZ implementation uses the Xen platform and supports a mix, allowing for some (e.g., important) VMs to be paused and the rest to hibernate (see Figure 4).

Figure 5: For each checkpoint an execution replica (1) sends any modified memory state, (2) creates hashes for any modified disk files, (3) creates a ZFS snapshot, and (4) returns the list of hashes to the agreement cluster.

6.2 Exploiting File System Snapshots

At the end of each epoch, the BASE protocol requires all replicas to create a stable checkpoint of the current application state, which is then used by out-of-date replicas to catch up with the remaining replicas. The BASE library requires BFT applications to implement get() and put() functions which are used to gather the application state when creating a checkpoint and to update a recovering node's state. Instead, ZZ exploits the snapshot mechanism supported by many modern file systems to make efficient checkpoints without modifications to existing applications. Rather than including all of the application's disk state in the checkpoint, ZZ creates a separate file system snapshot of the application's disk and only includes meta-information about the disk contents in the BASE checkpoint. A new replica can subsequently use the checkpoint meta-data to access and verify the disk snapshot.

Creation of snapshots is very efficient in some modern file systems [30, 19]; in most cases, these file systems maintain a log of updates in a journal. Creating snapshots is efficient because copy-on-write techniques avoid the need to create duplicate disk blocks. Snapshot size and creation time are independent of the size of the data stored, since only meta-information needs to be updated.

ZZ relies on ZFS for snapshot support. ZFS is included with the Solaris operating system, which runs on commodity x86 hardware. ZZ works with both the native Solaris and the user-space Linux ZFS implementations; in either case, it primarily relies on the snapshot feature of ZFS during checkpointing, as discussed next.

6.3 Checkpointing

Our ZZ implementation is based on the Dec '07 release of the BASE library [21]. We first modify BASE to separate execution from agreement as described in [29]. To implement checkpointing in ZZ we rely on the existing mechanisms in the BASE library to save the protocol state of the agreement nodes and any memory state used by the application on the execution nodes. We have modified the BASE checkpointing code to have the file system build a full snapshot of the disk state, and to include only certain meta-information about this snapshot in the actual checkpoint (see Figure 5).

By not including the disk snapshot in the BASE checkpoint, ZZ significantly reduces the size of the checkpoint which must be exchanged and verified by the execution and agreement nodes. This also reduces the amount of data which must be transferred to additional recovery nodes after a fault is detected. However, since the disk snapshots are not explicitly verified by the BFT protocol, ZZ must include meta-information about the disk state in the BASE checkpoint so that the recovery nodes can validate the disk snapshots created by other execution nodes. To do so, an MD5 hash is created for each file in the disk snapshot. Hashes are computed only for those files that have been modified since the previous epoch; hashes from the previous epoch are reused for unmodified files to save computation overheads.
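The checkpointing step in Figure 5 might look roughly as follows; the ZFS dataset name, epoch numbering, and directory layout are assumptions, and `zfs snapshot` is invoked through the shell rather than through any library binding:

    import hashlib, os, subprocess

    def md5_of_file(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def take_checkpoint(dataset, epoch, app_dir, modified, prev_hashes):
        """Illustrative checkpoint step: snapshot the replica's ZFS
        dataset and record an MD5 hash per file, recomputing hashes only
        for files modified this epoch and reusing the previous epoch's
        hashes for the rest."""
        subprocess.run(["zfs", "snapshot", f"{dataset}@epoch{epoch}"],
                       check=True)
        hashes = dict(prev_hashes)        # reuse unmodified files' hashes
        for name in modified:
            hashes[name] = md5_of_file(os.path.join(app_dir, name))
        return hashes                     # meta-info placed in the BASE checkpoint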

6.4 Recovery

Fault Detection: Faults are detected by the agreement cluster when the output messages from the f + 1 active execution replicas are not identical for a request. We have modified the agreement node code so that when a mismatch is detected, each agreement replica sends a wakeup message to f servers in the free pool. If the hypervisor on a free-pool server receives g + 1 wakeup messages, it wakes up the pre-spawned replica VM for that application.

VM Wakeup: After being restored from a paused or sleeping state, recovery replicas immediately send status messages to the other active replicas to determine the current protocol state. Any non-faulty replicas will recognize that the recovery nodes have out-of-date state and will attempt to send the latest checkpoint. The recovery replicas receive the checkpoint and verify it against the digests stored by the agreement nodes. The recovery replicas load the BFT protocol information and any application memory state from the checkpoint and then attempt to mount, via NFS, the latest disk snapshot from each active execution node.

Replay: Once the recovery replicas have acquired the latest checkpoint, they must replay any requests which occurred between the checkpoint and the fault itself. There are several levels of optimization for replaying requests: (1) replay all, (2) replay only modify requests, or (3) apply state updates only. In case 1, all requests are replayed by the recovery replicas, regardless of whether they modify state or not. In case 2, only those requests which make state modifications are replayed; the recovery replica only needs to replay these requests in order to update its state to the time of the fault, so it can safely ignore any requests which do not modify state. In case 3, the recovery replica obtains a list of the modifications made to each state object since the last checkpoint and simply applies these state updates locally. By directly applying the state changes made by each request, the replica does not need to replay full requests.

Verify on Demand: In order to replay, the recovery replica needs access to the state objects used by each request. All memory state objects are included with the checkpoint transferred to the recovery replica. For disk state objects, the recovery replica starts with an empty disk that must be populated with the correct files from one of the disk snapshots it has access to. If a request accesses a disk object which the recovery replica has not yet verified, the library attempts to retrieve a copy of the file from the disk snapshot created by another replica. The correctness of this file must be verified by calculating its MD5 hash and comparing it to the one stored with the checkpoint meta-data. If the hashes do not match, the file is retrieved from one of the remaining f snapshots until a match is found. Once a valid file is retrieved, it is copied to the replica's local storage and all future accesses are directed to that version of the file.
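A sketch of the on-demand file verification just described, assuming the other replicas' snapshots are visible as NFS-mounted directories and that the checkpoint meta-data supplies the expected MD5 per file:

    import hashlib, os

    def fetch_and_verify(name, expected_md5, snapshot_dirs, local_dir):
        # Try each active replica's snapshot in turn; accept a copy only
        # if its MD5 matches the hash recorded in the checkpoint
        # meta-data, then install it locally so later accesses are local.
        for snap in snapshot_dirs:        # at most f of these can be bad
            try:
                with open(os.path.join(snap, name), "rb") as f:
                    data = f.read()
            except OSError:
                continue
            if hashlib.md5(data).hexdigest() == expected_md5:
                local_path = os.path.join(local_dir, name)
                parent = os.path.dirname(local_path)
                if parent:
                    os.makedirs(parent, exist_ok=True)
                with open(local_path, "wb") as out:
                    out.write(data)
                return local_path
        raise RuntimeError("no snapshot matched the certified hash")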

Although our current implementation treats disk state objects at the granularity of files, a simple optimization would be to allow files to be split into multiple data blocks so that only portions of files need to be verified. This would reduce the overhead of checking large files.

Eliminating Faulty Replicas: After the recovery replicas have replayed the faulty request, the agreement cluster can determine which of the original execution nodes were faulty. The agreement replicas will each receive 2f + 1 output messages for the faulty request, of which at least f + 1 must be identical correct responses. Each agreement node then sends a shutdown message to the replica control daemon listing which replicas it believes should be terminated. If the replica control daemon receives at least g + 1 shutdown messages for a given replica, it can safely terminate that virtual machine. Using the Xen hibernate command allows the faulty replicas to be saved to disk so that they can be examined later to determine the cause of the fault. If f > 1, there may be more than f + 1 execution replicas remaining after the faulty machines are removed. In this case we terminate the extra replicas to keep the operating cost at f + 1 active replicas. As an optimization, we prefer to terminate extra recovery replicas since they may not yet have obtained all state.
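The shutdown decision can be summarized as below; responses are assumed to be available as a mapping from replica id to response digest, and the SHUTDOWN message layout is hypothetical:

    from collections import Counter

    def faulty_replicas(responses, f):
        """With 2f+1 responses for the disputed request, at least f+1
        identical ones are correct; replicas that sent anything else are
        candidates for shutdown."""
        counts = Counter(responses.values())
        correct_digest, votes = counts.most_common(1)[0]
        assert votes >= f + 1, "impossible with at most f faulty replicas"
        return [i for i, d in responses.items() if d != correct_digest]

    def should_terminate(shutdown_msgs, replica_id, g):
        """The replica control daemon terminates a VM only after g+1
        agreement replicas have asked for it."""
        senders = {m["j"] for m in shutdown_msgs
                   if m["type"] == "SHUTDOWN" and m["target"] == replica_id}
        return len(senders) >= g + 1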

6.5 Tracking Disk State ChangesThe BASE original library requires all state, either ob-jects in memory or files on disk, to be registered withthe library. In ZZ we have simplified the tracking ofdisk state so that it can be handled transparently with-out modifications to the application. We define func-tions bft fopen() and bft fwrite() which replace the or-dinary fopen() and fwrite() calls in an application. Thebft fwrite() function invokes the modify() call of theBASE library which must be issued whenever a state ob-

NFS server 3

BFTWrapper ZFS

Kernel NFS client

NFSRelay

Andrew benchmark

NFS server 2

BFTWrapper ZFS f1

...fn

NFS server 1

BFTWrapper ZFS f1

...fn

Agreement Cluster

Figure 6: Experimetnal setup for NFS. The recoveryreplica obtains state on demand from the active servers.

ject is being edited. This ensures that any files which aremodified during an epoch will be rehashed during check-point creation.

For the initial execution replicas, the bft_fopen() call is identical to fopen(). However, for the additional replicas which are spawned after a fault, the bft_fopen() call is used to retrieve files from the disk snapshots and copy them to the replica's disk on demand. Retrieving disk state on demand allows us to significantly reduce the recovery time after a fault, at the expense of increasing the latency of some requests after the fault. When a recovering replica first tries to open some file it calls bft_fopen(foo), but the replica will not yet have a local copy of the file. The local copy is created by retrieving and verifying the file from another replica's disk snapshot as described previously.
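The wrappers described in this subsection could look roughly like the following Python analogue; the names and helpers (register_modify(), fetch_and_verify(), is_recovering()) are assumptions standing in for the BASE modify() upcall and ZZ's recovery path, not the library's actual C API:

    import os

    def make_bft_io(register_modify, fetch_and_verify, is_recovering, local_dir):
        # Illustrative analogues of ZZ's bft_fopen()/bft_fwrite() wrappers.
        def bft_fopen(name, mode="r"):
            path = os.path.join(local_dir, name)
            if is_recovering() and not os.path.exists(path):
                # recovery replica: pull a verified copy from a snapshot first
                fetch_and_verify(name)
            return open(path, mode)

        def bft_fwrite(f, data):
            # tell the BFT library this state object changed so that its
            # hash is recomputed at the next checkpoint
            register_modify(f.name)
            return f.write(data)

        return bft_fopen, bft_fwrite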

6.6 Applications

General client-server application. We have implemented a canonical ZZ-based client-server application that emulates applications with different amounts of disk and memory state and different request processing overheads. The client component sends requests to the server which either read or write a portion of the application state, either in memory or on disk. Each request can also trigger a configurable amount of request processing prior to reading or writing state. During checkpoint creation, the contents of all memory pages which were modified in the previous epoch are included in the checkpoint created by each execution replica. The disk state is maintained transparently to the application through the use of the bft_fopen() and bft_fwrite() calls.

After a fault, the additional replicas start with an empty disk and rebuild the application state from the latest checkpoint by first acquiring the memory state and the list of disk hashes as described in the BASE protocol. Requests since the last checkpoint are replayed prior to processing the faulty request; all three replay options (replay all, replay only modify requests, and replay state updates only) are supported. The disk is populated by fetching individual files on demand and verifying their correctness using the hash values.

NFS Server. For this application, each execution replica runs an unmodified NFS daemon and a conformance wrapper which handles the BFT protocol, state conversion, and checkpointing. Client machines run a special NFS relay which takes the commands from the ordinary NFS client, translates them into requests for the BFT library, and forwards them to the agreement nodes for processing. Once the replicated file system is mounted, the client machine makes ordinary accesses to the remote files as if they were served by any other NFS server. Figure 6 illustrates the system setup for an NFS server capable of handling one faulty replica.

To support the use of disk snapshots and state on demand, we have modified the recovery code within the NFS conformance wrapper. Our wrapper is capable of making calls to multiple NFS servers, allowing the recovery replicas to fetch files from the other active replicas. Our implementation retrieves the full contents of a file only when the NFS wrapper receives a read or write request, at which point the recovery replica sends RPC requests to each active execution replica's NFS server to obtain the full file, which is then validated against the previously obtained hash.

7 Experimental Evaluation

In this section, we evaluate our ZZ prototype under graceful execution and in the presence of faults. Specifically, we quantify the recovery times seen by a ZZ replica, the efficacy of obtaining state on demand, and the benefits of using virtualization and file system snapshots.

7.1 Experimental Setup

Our experimental setup is a mini fault-tolerant data center consisting of a cluster of 2.12 GHz 64-bit Xeon quad-core Dell servers, each with 4GB RAM. Each machine runs a Xen v3.1 hypervisor and Xen virtual machines. Both domain-0 (the controller domain in Xen) and the individual VMs run the CentOS 5.1 Linux distribution with the 2.6.18 Linux kernel. Each virtual machine accesses a ZFS-based file system. Each ZFS file system resides on a storage node, which is a 2.8GHz Pentium-4 system running Solaris Express Community Edition with a 120GB SATA Hitachi Deskstar hard drive. Each execution replica is given an independent ZFS file system on the storage node (note that we place all ZFS file systems on a single node due to constraints on the number of machines at our disposal; a true BFT system would use a separate storage node for each execution replica). All machines are interconnected over gigabit Ethernet.

We use f + 1 machines to house our execution replicas, each of which runs inside its own virtual machine. An additional f + 1 machines are assumed to belong to the free server pool; we experiment with both paused and hibernated pre-spawned execution replicas. The agreement replicas run on a separate set of machines from the execution nodes.


Figure 7: Recovery time for the client-server application with varying disk state sizes.

Our experiments involve both our canonical client-server application and the ZZ-based fault-tolerant NFS server; we use a combination of synthetic workloads and the Andrew filesystem benchmark to generate the workloads for our experiments. Since we focus on experiments investigating performance around faults, we use a batch size of one; increasing the batch size would provide throughput benefits similar to those seen in other systems [29, 13].

7.2 Recovery Time

Next we measure the time required to bring additional replicas online after a faulty request. We use our client-server application with different amounts of disk state and inject a fault so that it occurs immediately after a checkpoint, which causes a new virtual machine-based replica to be started up. The disk state at the server is varied between 1MB and 500MB by varying the sizes of 500 files. We compare the recovery time of our approach, which fetches state on demand, to the BASE approach, where an out-of-date replica must transfer and verify all state before it can process the faulty request. We define recovery time as the delay from when the agreement cluster detects a fault until the client receives the correct response.

Figure 7 compares the recovery time for a replica that performs full state transfer to one that fetches and verifies state on demand. As shown, the recovery time increases linearly with state size when transferring and verifying all state. In contrast, our approach only transfers and verifies state on demand, making the recovery time independent of the application's state size. Fetching and verifying 400MB of state can take up to 60 seconds in the naive approach, while it takes only 3.3 seconds for ZZ, since only meta-information is transferred and verified. The full recovery time also includes the VM startup time in both cases. For the full transfer case, this accounts for an additional 0.29 seconds. In addition to starting up a new VM, the state-on-demand approach in ZZ must make ZFS calls to obtain access to the last snapshot, taking an average of 0.60 seconds. In both cases, the VM startup time is independent of state size.



Figure 8: Request latency (msec) vs. request number around a fault; some requests after the fault see higher latency because of on-demand transfer and verification (recovery time = 3.09 sec).

Figure 9: Replay time (msec/req) vs. computation cost for Full Replay and Update Only; separating state from execution can reduce latency when replaying requests.

7.3 Fault Mode Latency

While obtaining state on demand significantly reduces the initial recovery time, it may increase the latency of subsequent requests since they must fetch and verify state. In this experiment we examine the latency of requests to the client-server application after a fault has occurred, assuming state is fetched and verified on demand. The client sends a series of read requests to the server, each of which accesses a 100KB file. There are a total of 500 files, and each client request targets a random file.

As shown in Figure 8, we inject a fault at request number 175, which experiences a latency of about 3 seconds because of the recovery time. The mean request latency prior to the fault is 5 milliseconds with very little variation. The latency of requests after the fault has a bimodal distribution depending on whether the request accesses a file that has already been fetched or one which needs to be fetched and verified. The long requests, which include state verification and transfer, take an average of 15 milliseconds. Requests to files that have already been verified see low latencies that are comparable to before the fault.

7.4 Replay vs. State Updates

In this experiment we motivate separating computation from state updates with a microbenchmark comparing the latency of requests when full requests need to be replayed versus when only state updates need to be applied.

Figure 10: (a) Loading Checkpoints: time (sec) vs. state size (MB) for Full Transfer and On Demand; the cost of full state transfer increases with state size. (b) Replaying Requests: replay time (msec/req) vs. state size (MB); On Demand incurs overhead when replaying requests since state objects must be verified.

We test this with a server application which, in the ordinary case, receives a request from a client, runs an encryption algorithm over a buffer, and then writes the result to disk. If full requests must be replayed, then recovery replicas will have to perform this potentially expensive computation. Figure 9 illustrates how the replay time of requests changes as the cost of computation increases. By having replicas record and verify state updates as they are performed, recovery replicas only need to apply the state update (a disk write).
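The difference between the two strategies can be illustrated with a minimal sketch; this is our own code, not code from the ZZ prototype. The repeated hashing merely stands in for the benchmark's encryption step, and the function names and the rounds parameter are illustrative assumptions.

    import hashlib

    def full_replay(request_payload, out_path, rounds):
        # Re-execute the original request: redo the (potentially expensive)
        # computation and then write the result to disk. 'rounds' plays the
        # role of the computation-cost knob on the x-axis of Figure 9.
        data = request_payload
        for _ in range(rounds):
            data = hashlib.sha256(data).digest()
        with open(out_path, "wb") as f:
            f.write(data)

    def apply_state_update(update_bytes, out_path):
        # Apply a previously recorded (and verified) state update directly;
        # the computation is skipped entirely and only the disk write remains.
        with open(out_path, "wb") as f:
            f.write(update_bytes)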

7.5 NFS Recovery Time

In practice, the recovery time of an application may be higher than shown in Section 7.2 because recovery replicas will need to rebuild application state and replay any requests which occurred between the last checkpoint and the fault. In this experiment we test the recovery time of the NFS application for a workload which creates 200 files before encountering a fault while reading back a file which has been corrupted by one of the replicas. We vary the size of the files to adjust the total state maintained by the application, which also impacts the number of requests which need to be replayed after the fault.

We have found that the full state transfer approach performs very poorly since the NFS conformance wrapper must both retrieve the full contents of each file and perform RPC calls to write out all of the files to the actual NFS server.



Phase    No Fault    With Fault
1        3.27        3.29
2        57.94       57.39
3        35.31       66.97
4        56.18       68.50
Total    152.70      196.15

Table 2: Completion times (seconds) for the Andrew benchmark. A fault is injected after phase 2.

To examine the cost of recovery, we split the recovery time into two portions: the time to verify a checkpoint and build the initial file system, and the time to replay requests which occurred between the last checkpoint and the faulty request. Figure 10(a) shows the time for processing the checkpoints when using full transfer or ZZ's on-demand approach. When transferring the full checkpoint, state sizes greater than a mere 25 megabytes can take longer than 60 seconds, after which point NFS requests typically will time out. In contrast, the on-demand approach has a constant overhead with an average of 1.4 seconds. Figure 10(b) shows the time required to replay requests. Since different state sizes may require different numbers of requests to be replayed, we report the average time per request replayed. The ZZ system experiences a higher replay cost because of the added overhead of fetching and verifying state on demand; it also has a higher variance since the first access to a file incurs more overhead than subsequent calls.
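This decomposition can be summarized with a simple estimate (the symbols are ours, not notation from the report):

    T_recovery ≈ T_checkpoint + N_replay · t_replay

where T_checkpoint is the time to verify the checkpoint and build the initial file system, N_replay is the number of requests issued between the last checkpoint and the fault, and t_replay is the average per-request replay time reported in Figure 10(b).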

7.6 NFS Performance

The Andrew filesystem benchmark emulates a software development workload of reading and modifying disk files. In this experiment we measure the time to complete the benchmark with and without faults. We use a scaled-up version of the first four phases of the benchmark, which include recursively creating source directories, copying source code files, examining file attributes, and reading files.3 We use a modified version of the benchmark that can be scaled: instead of creating a single source directory, our enhanced benchmark creates N such directories and performs file operations on each set.
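As a rough illustration of the scaled driver (our own approximation, not the actual benchmark script; the paths, helper names, and simplified phase bodies are assumptions), the four phases are simply repeated over N independent copies of a source tree:

    import os
    import shutil
    import glob

    def run_scaled_andrew(src_tree, n_copies, work_dir):
        """Repeat the first four Andrew-style phases over n_copies independent
        copies of a source tree: (1) create directories, (2) copy files,
        (3) examine attributes, (4) read files back."""
        for i in range(n_copies):
            dst = os.path.join(work_dir, "copy%d" % i)
            # Phase 1: recreate the directory hierarchy.
            for dirpath, _, _ in os.walk(src_tree):
                rel = os.path.relpath(dirpath, src_tree)
                os.makedirs(os.path.join(dst, rel), exist_ok=True)
            # Phase 2: copy every source file into the new tree.
            for dirpath, _, filenames in os.walk(src_tree):
                rel = os.path.relpath(dirpath, src_tree)
                for name in filenames:
                    shutil.copy(os.path.join(dirpath, name),
                                os.path.join(dst, rel, name))
            # Phase 3: examine file attributes.
            for path in glob.glob(os.path.join(dst, "**", "*"), recursive=True):
                os.stat(path)
            # Phase 4: read every file back.
            for path in glob.glob(os.path.join(dst, "**", "*"), recursive=True):
                if os.path.isfile(path):
                    with open(path, "rb") as f:
                        f.read()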

We run the benchmark with N = 50 and inject a fault during phase 2 (file creation). Table 2 compares the completion time of the benchmark with and without faults; we report the average of three runs. While the fault occurs earlier on, it is not detected until phase 3 when the file is accessed, at which point ZZ takes an average of 31.11 seconds for recovery. Since phase 3 only examines the attributes of files, no files are copied to the recovery replica until phase 4. The additional latency of fetching files on demand in this phase increases the completion time by 12 seconds.

3 Phase 5 of the benchmark, which compiles a set of source files, was omitted since it included old libraries that do not compile without substantial changes on modern Linux distributions.

Figure 11: Recovery time (sec) for f = 1, 2, and 3 faults; recovery time increases for larger f from message overhead and increased ZFS operations.

Figure 12: Restore time (sec) vs. hibernated VM memory size (128MB to 2048MB); restoring hibernated VMs incurs a latency dependent on the saved VM's memory size.

In total, the fault adds 43.4 seconds of latency, an overhead of 28% across the full run.

7.7 Impact of Multiple Faults

Here we examine the recovery latency of the client-server application for up to three faults. In each case, we maintain f paused recovery replicas. We inject faults into f of the active execution replicas and measure the recovery time to handle the faulty request. Figure 11 shows how the recovery time for f = 1 to f = 3 increases slightly due to increased message passing and because the ZFS server needs to export snapshots for a larger number of file systems. We believe that the communication costs could be decreased with hardware multicast and the use of multiple ZFS storage nodes.

7.8 VM Startup Time

In our previous experiments, the recovery VMs are kept in a paused state; here we evaluate the feasibility of keeping recovery VMs hibernated on disk for applications with less stringent latency requirements. Figure 12 shows the time to restore different size VM images from disk. In all cases, Domain-0 has 512MB of RAM and we ensure that the VM's disk file is not cached in memory prior to restoration. Restoring a VM from a paused state takes only 0.29 seconds, which is significantly faster since it simply needs to add the VM to the scheduling queue.

Restoring a VM from disk with 2GB of RAM can take more than 40 seconds; however, this time can be significantly reduced by adjusting the VM's memory allocation prior to the initial hibernation call.



To use a paged-out, hibernated VM approach, a VM with a 2GB allocation of RAM would be reduced down to 128MB of RAM prior to hibernation. If a fault occurs, the VM could be restored from disk in 5.78 seconds. After being restored, the memory allocation can be increased back to 2GB of RAM within an additional 0.1 seconds. Although the VM will immediately have access to its expanded memory allocation, there may be an application-dependent period of reduced performance if data needs to be paged in.
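One plausible way to script this balloon-then-hibernate workflow is with Xen's xm management toolstack, as in the sketch below; this is our illustration rather than ZZ's actual control code, and the domain name, image path, and helper function are assumptions.

    import subprocess

    def xm(*args):
        # Invoke a command of the Xen 3.x 'xm' management toolstack.
        subprocess.check_call(["xm"] + list(args))

    # Hypothetical domain name and image path; the memory sizes follow the
    # example in the text (a 2GB allocation ballooned down to 128MB).
    DOMAIN = "zz-recovery-1"
    IMAGE = "/var/lib/xen/zz-recovery-1.save"

    def hibernate_small():
        # Shrink the idle recovery VM before saving it, so the on-disk
        # hibernation image (and hence the restore time) stays small.
        xm("mem-set", DOMAIN, "128")
        xm("save", DOMAIN, IMAGE)

    def restore_on_fault():
        # On fault detection, restore the small image and grow memory back.
        xm("restore", IMAGE)
        xm("mem-set", DOMAIN, "2048")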

8 Related Work

This section discusses related work not covered elsewhere. Lamport, Shostak, and Pease [16] introduced the problem of byzantine agreement. Lamport also introduced the state machine replication approach [14] (with a popular tutorial by Schneider [22]) that relies on consensus to establish an order on requests. Consensus in the presence of asynchrony and faults has seen almost three decades of research. Dwork et al. [8] established a lower bound of 3f + 1 replicas for byzantine agreement given partial synchrony, i.e., an unknown but fixed upper bound on message delivery time. The classic FLP [9] result showed that no agreement protocol is guaranteed to terminate with even one (benignly) faulty node in an asynchronous environment. Viewstamped replication [18] and Paxos [15] describe agreement protocols for implementing an arbitrary state machine in an asynchronous benign environment that are safe with 2f + 1 replicas and live under synchrony assumptions.

Early BFT systems [20, 12] incurred a prohibitively high overhead and relied on failure detectors to exclude faulty replicas. However, accurate failure detectors are not achievable under asynchrony, thus these systems effectively relied on synchrony for safety. Castro and Liskov [2] introduced a BFT SMR-based system that relied on synchrony only for liveness. The three-phase protocol at the core of PBFT is similar to viewstamped replication [18] or Paxos [15] and incurs a replication cost of at least 3f + 1. More importantly, they showed that the latency and throughput overhead of BFT can be low enough to be practical.

Although virtualization has not been previously employed for BFT, there has been much research on this topic in recent years. Rapid activation of virtual machine replicas for dynamic capacity provisioning and orchestration has been studied in [11, 24]. In contrast, ZZ uses VM replicas for high availability rather than scale. Live migration of virtual machines was proposed in [4], and its use for load balancing in a data center has been studied in [27]. Such techniques can be employed by ZZ to intelligently manage its free server pool, although we leave an implementation to future work. Virtualization has also been employed for security. Potemkin dynamically invokes virtual machines to serve as honeypots for security attacks [26]. Terra is a virtual machine platform for trusted computing that employs a trusted hypervisor [10]; ZZ assumes a trusted hypervisor, which a system such as Terra can provide.

9 Conclusions and Future Work

In this paper, we presented ZZ, a novel approach to construct general BFT services with a replication cost of practically f + 1, halving the 2f + 1 or higher cost incurred by state-of-the-art approaches. Analogous to Remus [6], which uses virtualization to provide fail-stop fault tolerance at a minimal cost of 1 + ε, ZZ exploits virtualization to provide BFT at a low cost. Our key insight was to use f + 1 execution replicas in the normal case and to activate additional VM replicas only upon failures. ZZ uses a number of novel mechanisms for rapid recovery of replicas, including the use of filesystem snapshots for efficient, application-independent checkpoints, replaying state updates instead of full requests, and the use of an amortized state transfer mechanism that fetches state on demand. We implemented ZZ using the BASE library, Xen and ZFS, and demonstrated the efficacy of our approach via an experimental evaluation.

Our future work will focus on three practical issues. First, since ZZ cannot tolerate additional faults until a newly activated replica has rebuilt its state, and since fetching state purely on demand can increase this window of vulnerability, we plan to implement techniques to fetch missing state updates in the background at low priority, while the replica continues normal foreground processing. Second, we plan to better integrate ZZ's free server pool management with the data center's resource management policies. In particular, we will examine how live migration techniques employed by a data center for managing its servers can be extended to manage the free server pool. Third, we plan to employ ZZ to provide BFT in multi-tier web applications by deploying our techniques at each tier and measuring the end-to-end impact.

References

[1] Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter, and Jay J. Wylie. Fault-scalable Byzantine fault-tolerant services. SIGOPS Oper. Syst. Rev., 39(5):59–74, 2005.

[2] M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, February 1999.

[3] Miguel Castro and Barbara Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS), 20(4):398–461, November 2002.

[4] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proceedings of the Usenix Symposium on Networked Systems Design and Implementation (NSDI), May 2005.

[5] James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (OSDI), Seattle, Washington, November 2006.



[6] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield. Remus: High availability via asynchronous virtual machine replication. In Proceedings of the Usenix Symposium on Networked Systems Design and Implementation (NSDI), April 2008.

[7] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323, 1988.

[8] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323, 1988.

[9] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, 1985.

[10] Tal Garfinkel, Ben Pfaff, Jim Chow, Mendel Rosenblum, and Dan Boneh. Terra: A virtual machine-based platform for trusted computing. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 193–206, New York, NY, USA, 2003. ACM Press.

[11] Laura Grit, David Irwin, Aydan Yumerefendi, and Jeff Chase. Virtual machine hosting for networked clusters: Building the foundations for autonomic orchestration. In the First International Workshop on Virtualization Technology in Distributed Computing (VTDC), November 2006.

[12] Kim Potter Kihlstrom, L. E. Moser, and P. M. Melliar-Smith. The SecureRing protocols for securing group communication. In HICSS '98: Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences, Volume 3, page 317, Washington, DC, USA, 1998. IEEE Computer Society.

[13] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: Speculative Byzantine fault tolerance. In SOSP '07: Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, pages 45–58, New York, NY, USA, 2007. ACM.

[14] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690–691, September 1979.

[15] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2), May 1998.

[16] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.

[17] Leslie Lamport and Mike Massa. Cheap Paxos. In DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks, page 307, Washington, DC, USA, 2004. IEEE Computer Society.

[18] Brian M. Oki and Barbara H. Liskov. Viewstamped replication: A general primary copy. In PODC '88: Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, pages 8–17, New York, NY, USA, 1988. ACM.

[19] Z. Peterson and R. Burns. Ext3cow: A time-shifting file system for regulatory compliance. ACM Transactions on Storage, 1(2):190–212, 2005.

[20] Michael K. Reiter. The Rampart toolkit for building high-integrity services. In Selected Papers from the International Workshop on Theory and Practice in Distributed Systems, pages 99–110, London, UK, 1995. Springer-Verlag.

[21] Rodrigo Rodrigues, Miguel Castro, and Barbara Liskov. BASE: Using abstraction to improve fault tolerance. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 15–28, New York, NY, USA, 2001. ACM Press.

[22] F. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys, 22(3):299–319, September 1990.

[23] Atul Singh, Tathagata Das, Petros Maniatis, Peter Druschel, and Timothy Roscoe. BFT protocols under fire. In NSDI '08: Proceedings of the Usenix Symposium on Networked Systems Design and Implementation, 2008.

[24] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal. Dynamic provisioning for multi-tier internet applications. In Proceedings of the 2nd IEEE International Conference on Autonomic Computing (ICAC-05), Seattle, WA, June 2005.

[25] Ben Vandiver, Hari Balakrishnan, Barbara Liskov, and Sam Madden. Tolerating Byzantine faults in database systems using commit barrier scheduling. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), Stevenson, Washington, USA, October 2007.

[26] M. Vrable, J. Ma, J. Chen, D. Moore, E. Vandekieft, A. Snoeren, G. M. Voelker, and S. Savage. Scalability, fidelity, and containment in the Potemkin virtual honeyfarm. In Proceedings of SOSP, 2005.

[27] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif. Black-box and gray-box strategies for virtual machine migration. In Proceedings of the Usenix Symposium on Networked Systems Design and Implementation (NSDI), Cambridge, MA, April 2007.

[28] Timothy Wood, Rahul Singh, Arun Venkataramani, and Prashant Shenoy. ZZ: Cheap practical BFT using virtualization. Technical report, University of Massachusetts Dept. of Computer Science, May 2008.

[29] J. Yin, J.P. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin. Separating agreement from execution for Byzantine fault tolerant services. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.

[30] ZFS: The last word in file systems. http://www.sun.com/2004-0914/feature/.