
Proactive Recovery in a Byzantine-Fault-Tolerant System

Miguel Castro and Barbara Liskov
Laboratory for Computer Science, Massachusetts Institute of Technology,
545 Technology Square, Cambridge, MA 02139
{castro,liskov}@lcs.mit.edu

Abstract

This paper describes an asynchronous state-machine replication system that tolerates Byzantine faults, which can be caused by malicious attacks or software errors. Our system is the first to recover Byzantine-faulty replicas proactively and it performs well because it uses symmetric rather than public-key cryptography for authentication. The recovery mechanism allows us to tolerate any number of faults over the lifetime of the system provided fewer than 1/3 of the replicas become faulty within a window of vulnerability that is small under normal conditions. The window may increase under a denial-of-service attack but we can detect and respond to such attacks. The paper presents results of experiments showing that overall performance is good and that even a small window of vulnerability has little impact on service latency.

1 Introduction

This paper describes a new system for asynchronous state-machine replication [17, 28] that offers both integrity and high availability in the presence of Byzantine faults. Our system is interesting for two reasons: it improves security by recovering replicas proactively, and it is based on symmetric cryptography, which allows it to perform well so that it can be used in practice to implement real services.

Our system continues to function correctly even when some replicas are compromised by an attacker; this is worthwhile because the growing reliance on online information services makes malicious attacks more likely and their consequences more serious. The system also survives nondeterministic software bugs and software bugs due to aging (e.g., memory leaks). Our approach improves on the usual technique of rebooting the system because it refreshes state automatically, staggers recovery so that individual replicas are highly unlikely to fail simultaneously, and has little impact on overall system performance. Section 4.7 discusses the types of faults tolerated by the system in more detail.

Because of recovery, our system can tolerate any number of faults over the lifetime of the system, provided fewer than 1/3 of the replicas become faulty within a window of vulnerability. The best that could be guaranteed previously was correct behavior if fewer than 1/3 of the replicas failed during the lifetime of a system. Our previous work [6] guaranteed this and other systems [26, 16] provided weaker guarantees. Limiting the number of failures that can occur in a finite window is a synchrony assumption but such an assumption is unavoidable: since Byzantine-faulty replicas can discard the service state, we must bound the number of failures that can occur before recovery completes. But we require no synchrony assumptions to match the guarantee provided by previous systems. We compare our approach with other work in Section 7.

(This research was supported by DARPA under contract F30602-98-1-0237 monitored by the Air Force Research Laboratory.)

The window of vulnerability can be small (e.g., a few minutes) under normal conditions. Additionally, our algorithm provides detection of denial-of-service attacks aimed at increasing the window: replicas can time how long a recovery takes and alert their administrator if it exceeds some pre-established bound. Therefore, integrity can be preserved even when there is a denial-of-service attack.

The paper describes a number of new techniques needed to solve the problems that arise when providing recovery from Byzantine faults:

Proactive recovery. A Byzantine-faulty replica may appear to behave properly even when broken; therefore recovery must be proactive to prevent an attacker from compromising the service by corrupting 1/3 of the replicas without being detected. Our algorithm recovers replicas periodically independent of any failure detection mechanism. However, a recovering replica may not be faulty and recovery must not cause it to become faulty, since otherwise the number of faulty replicas could exceed the bound required to provide safety. In fact, we need to allow the replica to continue participating in the request processing protocol while it is recovering, since this is sometimes required for it to complete the recovery.

Fresh messages. An attacker must be prevented from impersonating a replica that was faulty after it recovers. This can happen if the attacker learns the keys used to authenticate messages. Furthermore, even if messages are signed using a secure cryptographic co-processor, an attacker might be able to authenticate bad messages while it controls a faulty replica; these messages could be replayed later to compromise safety. To solve this problem, we define a notion of authentication freshness


and replicas reject messages that are not fresh. However, this leads to a further problem, since replicas may be unable to prove to a third party that some message they received is authentic (because it may no longer be fresh). All previous state-machine replication algorithms [26, 16], including the one we described in [6], relied on such proofs. Our current algorithm does not, and this has the added advantage of enabling the use of symmetric cryptography for authentication of all protocol messages. This eliminates most use of public-key cryptography, the major performance bottleneck in previous systems.

Efficient state transfer. State transfer is harder in the presence of Byzantine faults and efficiency is crucial to enable frequent recovery with little impact on performance. To bring a recovering replica up to date, the state transfer mechanism checks the local copy of the state to determine which portions are both up-to-date and not corrupt. Then, it must ensure that any missing state it obtains from other replicas is correct. We have developed an efficient hierarchical state transfer mechanism based on hash chaining and incremental cryptography [1]; the mechanism tolerates Byzantine faults and state modifications while transfers are in progress.

Our algorithm has been implemented as a generic program library with a simple interface. This library can be used to provide Byzantine-fault-tolerant versions of different services. The paper describes experiments that compare the performance of a replicated NFS implemented using the library with an unreplicated NFS. The results show that the performance of the replicated system without recovery is close to the performance of the unreplicated system. They also show that it is possible to recover replicas frequently to achieve a small window of vulnerability in the normal case (2 to 10 minutes) with little impact on service latency.

The rest of the paper is organized as follows. Section 2 presents our system model and lists our assumptions; Section 3 states the properties provided by our algorithm; and Section 4 describes the algorithm. Our implementation is described in Section 5 and some performance experiments are presented in Section 6. Section 7 discusses related work. Our conclusions are presented in Section 8.

2 System Model and Assumptions

We assume an asynchronous distributed system where nodes are connected by a network. The network may fail to deliver messages, delay them, duplicate them, or deliver them out of order.

We use a Byzantine failure model, i.e., faulty nodes may behave arbitrarily, subject only to the restrictions mentioned below. We allow for a very strong adversary that can coordinate faulty nodes, delay communication, inject messages into the network, or delay correct nodes in order to cause the most damage to the replicated service. We do assume that the adversary cannot delay correct nodes indefinitely.

We use cryptographic techniques to establish session keys, authenticate messages, and produce digests. We use the SFS [21] implementation of a Rabin-Williams public-key cryptosystem with a 1024-bit modulus to establish 128-bit session keys. All messages are then authenticated using message authentication codes (MACs) [2] computed using these keys. Message digests are computed using MD5 [27].

We assume that the adversary (and the faulty nodes it controls) is computationally bound so that (with very high probability) it is unable to subvert these cryptographic techniques. For example, the adversary cannot forge signatures or MACs without knowing the corresponding keys, or find two messages with the same digest. The cryptographic techniques we use are thought to have these properties.

Previous Byzantine-fault-tolerant state-machine replication systems [6, 26, 16] also rely on the assumptions described above. We require no additional assumptions to match the guarantees provided by these systems, i.e., to provide safety if less than 1/3 of the replicas become faulty during the lifetime of the system. To tolerate more faults we need additional assumptions: we must mutually authenticate a faulty replica that recovers to the other replicas, and we need a reliable mechanism to trigger periodic recoveries. These could be achieved by involving system administrators in the recovery process, but such an approach is impractical given our goal of recovering replicas frequently. Instead, we rely on the following assumptions:

Secure Cryptography. Each replica has a secure cryptographic co-processor, e.g., a Dallas Semiconductors iButton, or the security chip in the motherboard of the IBM PC 300PL. The co-processor stores the replica's private key, and can sign and decrypt messages without exposing this key. It also contains a true random number generator, e.g., based on thermal noise, and a counter that never goes backwards. This enables it to append random numbers or the counter to messages it signs.

Read-Only Memory. Each replica stores the public keys for other replicas in some memory that survives failures without being corrupted (provided the attacker does not have physical access to the machine). This memory could be a portion of the flash BIOS. Most motherboards can be configured such that it is necessary to have physical access to the machine to modify the BIOS.

Watchdog Timer. Each replica has a watchdog timer that periodically interrupts processing and hands control to a recovery monitor, which is stored in the read-only memory. For this mechanism to be effective, an attacker should be unable to change the rate of watchdog interrupts without physical access to the machine. Some motherboards and extension cards offer the watchdog timer functionality but allow the timer to be reset without physical access to the machine. However, this is easy to fix by preventing write access to control registers unless some jumper switch is closed.

These assumptions are likely to hold when the attacker does not have physical access to the replicas, which we expect to be the common case. When they fail we can fall back on system administrators to perform recovery.


Note that all previous proactive security algorithms [24, 13, 14, 3, 10] assume the entire program run by a replica is in read-only memory so that it cannot be modified by an attacker. Most also assume that there are authenticated channels between the replicas that continue to work even after a replica recovers from a compromise. These assumptions would be sufficient to implement our algorithm but they are less likely to hold in practice. We only require a small monitor in read-only memory and use the secure co-processors to establish new session keys between the replicas after a recovery.

The only work on proactive security that does not assume authenticated channels is [3], but the best that a replica can do when its private key is compromised in their system is alert an administrator. Our secure cryptography assumption enables automatic recovery from most failures, and secure co-processors with the properties we require are now readily available, e.g., IBM is selling PCs with a cryptographic co-processor in the motherboard at essentially no added cost.

We also assume clients have a secure co-processor; this simplifies the key exchange protocol between clients and replicas but it could be avoided by adding an extra round to this protocol.

3 Algorithm Properties

Our algorithm is a form of state machine replication [17, 28]: the service is modeled as a state machine that is replicated across different nodes in a distributed system. The algorithm can be used to implement any replicated service with a state and some operations. The operations are not restricted to simple reads and writes; they can perform arbitrary computations.

The service is implemented by a set of n replicas and each replica is identified using an integer in {0, ..., n - 1}. Each replica maintains a copy of the service state and implements the service operations. For simplicity, we assume n = 3f + 1 where f is the maximum number of replicas that may be faulty. Service clients and replicas are non-faulty if they follow the algorithm and if no attacker can impersonate them (e.g., by forging their MACs).

Like all state machine replication techniques, we impose two requirements on replicas: they must start in the same state, and they must be deterministic (i.e., the execution of an operation in a given state and with a given set of arguments must always produce the same result). We can handle some common forms of non-determinism using the technique we described in [6].

Our algorithm ensures safety for an execution provided at most f replicas become faulty within a window of vulnerability of size Tv. Safety means that the replicated service satisfies linearizability [12, 5]: it behaves like a centralized implementation that executes operations atomically one at a time. Our algorithm provides safety regardless of how many faulty clients are using the service (even if they collude with faulty replicas).

We will discuss the window of vulnerability further in Section 4.7.

The algorithm also guarantees liveness: non-faulty clients eventually receive replies to their requests provided (1) at most f replicas become faulty within the window of vulnerability Tv; and (2) denial-of-service attacks do not last forever, i.e., there is some unknown point in the execution after which all messages are delivered (possibly after being retransmitted) within some constant time d, or all non-faulty clients have received replies to their requests. Here, d is a constant that depends on the timeout values used by the algorithm to refresh keys, and trigger view-changes and recoveries.

4 Algorithm

The algorithm works as follows. Clients send requests to execute operations to the replicas and all non-faulty replicas execute the same operations in the same order. Since replicas are deterministic and start in the same state, all non-faulty replicas send replies with identical results for each operation. The client waits for f + 1 replies from different replicas with the same result. Since at least one of these replicas is not faulty, this is the correct result of the operation.

The hard problem is guaranteeing that all non-faulty replicas agree on a total order for the execution of requests despite failures. We use a primary-backup mechanism to achieve this. In such a mechanism, replicas move through a succession of configurations called views. In a view one replica is the primary and the others are backups. We choose the primary of a view to be replica p such that p = v mod n, where v is the view number and views are numbered consecutively.

The primary picks the ordering for execution of operations requested by clients. It does this by assigning a sequence number to each request. But the primary may be faulty. Therefore, the backups trigger view changes to select a new primary when it appears that the current primary has failed. Viewstamped Replication [23] and Paxos [18] use a similar approach to tolerate benign faults.

To tolerate Byzantine faults, every step taken by a node in our system is based on obtaining a certificate. A certificate is a set of messages certifying some statement is correct and coming from different replicas. An example of a statement is: "the result of the operation requested by a client is r".

The size of the set of messages in a certificate is either f + 1 or 2f + 1, depending on the type of statement and step being taken. The correctness of our system depends on a certificate never containing more than f messages sent by faulty replicas. A certificate of size f + 1 is sufficient to prove that the statement is correct because it contains at least one message from a non-faulty replica. A certificate of size 2f + 1 ensures that it will also be possible to convince other replicas of the validity of the statement even when f replicas are faulty.
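To make the quorum arithmetic concrete, the following small C++ sketch (our illustration, not code from the paper) checks that with n = 3f + 1 any two certificates of size 2f + 1 intersect in at least f + 1 replicas, and therefore in at least one non-faulty replica; the names in it are hypothetical.

    #include <cassert>
    #include <cstdio>

    int main() {
        for (int f = 1; f <= 3; f++) {
            int n = 3 * f + 1;       // total number of replicas
            int weak = f + 1;        // certificate size that contains one non-faulty message
            int strong = 2 * f + 1;  // certificate size that can convince other replicas
            // Two sets of size 2f+1 drawn from 3f+1 replicas overlap in at least f+1
            // replicas, so at least one non-faulty replica is in both.
            assert(strong + strong - n >= f + 1);
            std::printf("f=%d n=%d weak=%d strong=%d\n", f, n, weak, strong);
        }
        return 0;
    }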

Our earlier algorithm [6] used the same basic ideas but it did not provide recovery. Recovery complicates the


construction of certificates; if a replica collects messages for a certificate over a sufficiently long period of time it can end up with more than f messages from faulty replicas. We avoid this problem by introducing a notion of freshness; replicas reject messages that are not fresh. But this raises another problem: the view change protocol in [6] relied on the exchange of certificates between replicas and this may be impossible because some of the messages in a certificate may no longer be fresh. Section 4.5 describes a new view change protocol that solves this problem and also eliminates the need for expensive public-key cryptography.

To provide liveness with the new protocol, a replica must be able to fetch missing state that may be held by a single correct replica whose identity is not known. In this case, voting cannot be used to ensure correctness of the data being fetched and it is important to prevent a faulty replica from causing the transfer of unnecessary or corrupt data. Section 4.6 describes a mechanism to obtain missing messages and state that addresses these issues and that is efficient enough to enable frequent recoveries.

The sections below describe our algorithm. Sections 4.2 and 4.3, which explain normal-case request processing, are similar to what appeared in [6]. They are presented here for completeness and to highlight some subtle changes.

4.1 Message Authentication

We use MACs to authenticate all messages. There is a pair of session keys for each pair of replicas i and j: k_{i,j} is used to compute MACs for messages sent from i to j, and k_{j,i} is used for messages sent from j to i.

Some messages in the protocol contain a single MAC computed using UMAC32 [2]; we denote such a message as ⟨m⟩_{μ_{i,j}}, where i is the sender, j is the receiver, and the MAC is computed using k_{i,j}. Other messages contain authenticators; we denote such a message as ⟨m⟩_{α_i}, where i is the sender. An authenticator is a vector of MACs, one per replica j (j ≠ i), where the MAC in entry j is computed using k_{i,j}. The receiver of a message verifies its authenticity by checking the corresponding MAC in the authenticator.

Replicas and clients refresh the session keys used to send messages to them by sending new-key messages periodically (e.g., every minute). The same mechanism is used to establish the initial session keys. The message has the form ⟨NEW-KEY, i, ..., {k_{j,i}}_{ε_j}, ..., t⟩_{σ_i}. The message is signed by the secure co-processor (using the replica's private key) and t is the value of its counter; the counter is incremented by the co-processor and appended to the message every time it generates a signature. (This prevents suppress-replay attacks [11].) Each k_{j,i} is the key replica j should use to authenticate messages it sends to i in the future; k_{j,i} is encrypted by j's public key, so that only j can read it. Replicas use timestamp t to detect spurious new-key messages: t must be larger than the timestamp of the last new-key message received from i.

Each replica shares a single secret key with each client; this key is used for communication in both directions. The key is refreshed by the client periodically, using the new-key message. If a client neglects to do this within some system-defined period, a replica discards its current key for that client, which forces the client to refresh the key.

When a replica or client sends a new-key message, it discards all messages in its log that are not part of a complete certificate and it rejects any messages it receives in the future that are authenticated with old keys. This ensures that correct nodes only accept certificates with equally fresh messages, i.e., messages authenticated with keys created in the same refreshment phase.
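The following C++ sketch illustrates the authenticator structure described above. It is our own illustration under stated assumptions: the replica count, the Key type, and the mac32 function are hypothetical stand-ins (the system itself uses UMAC32 keyed with the per-pair session keys k_{i,j}).

    #include <array>
    #include <cstdint>
    #include <string>
    #include <vector>

    constexpr int N = 4;                               // n = 3f + 1 with f = 1
    using Key = std::array<std::uint8_t, 16>;          // 128-bit session key

    // Stand-in for a real 32-bit MAC such as UMAC32 (assumption, not the authors' code).
    std::uint32_t mac32(const Key& k, const std::string& msg) {
        std::uint32_t h = 2166136261u;
        for (std::uint8_t b : k) h = (h ^ b) * 16777619u;
        for (char c : msg) h = (h ^ static_cast<std::uint8_t>(c)) * 16777619u;
        return h;
    }

    // An authenticator is a vector of MACs, one per replica j != i, where entry j is
    // computed with the key k_{i,j} that sender i uses for messages to j.
    std::vector<std::uint32_t> make_authenticator(int i, const std::array<Key, N>& keys_i_to,
                                                  const std::string& msg) {
        std::vector<std::uint32_t> auth(N);
        for (int j = 0; j < N; j++)
            auth[j] = (j == i) ? 0 : mac32(keys_i_to[j], msg);
        return auth;
    }

    // Receiver j checks only its own entry in the authenticator.
    bool verify_entry(int j, const Key& k_i_to_j, const std::string& msg,
                      const std::vector<std::uint32_t>& auth) {
        return auth[static_cast<std::size_t>(j)] == mac32(k_i_to_j, msg);
    }

    int main() {
        std::array<Key, N> keys{};   // zeroed demo keys, for illustration only
        auto a = make_authenticator(0, keys, "PRE-PREPARE,v,n,d");
        return verify_entry(1, keys[1], "PRE-PREPARE,v,n,d", a) ? 0 : 1;
    }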

4.2 Processing Requests

We use a three-phase protocol to atomically multicast requests to the replicas. The three phases are pre-prepare, prepare, and commit. The pre-prepare and prepare phases are used to totally order requests sent in the same view even when the primary, which proposes the ordering of requests, is faulty. The prepare and commit phases are used to ensure that requests that commit are totally ordered across views. Figure 1 shows the operation of the algorithm in the normal case of no primary faults.

[Figure 1: Normal Case Operation. Replica 0 is the primary and replica 3 is faulty. The figure shows the request, pre-prepare, prepare, commit, and reply phases exchanged between the client and replicas 0 to 3, and the evolution of the request status from unknown to pre-prepared, prepared, and committed.]

Each replica stores the service state, a log containing information about requests, and an integer denoting the replica's current view. The log records information about the request associated with each sequence number, including its status; the possibilities are: unknown (the initial status), pre-prepared, prepared, and committed. Figure 1 also shows the evolution of the request status as the protocol progresses. We describe how to truncate the log in Section 4.3.

A client c requests the execution of state machine operation o by sending a ⟨REQUEST, o, t, c⟩_{α_c} message to the primary. Timestamp t is used to ensure exactly-once semantics for the execution of client requests [6].

When the primary p receives a request m from a client, it assigns a sequence number n to m. Then it multicasts a pre-prepare message with the assignment to the backups, and marks m as pre-prepared with sequence number n. The message has the form ⟨PRE-PREPARE, v, n, d⟩_{α_p}, where v indicates the view in which the message is being sent, and d is m's digest.

Like pre-prepares, the prepare and commit messages


sent in the other phases also contain n and d. A replica only accepts one of these messages if it is in view v; it can verify the authenticity of the message; and n is between a low water mark, h, and a high water mark, H. The last condition is necessary to enable garbage collection and prevent a faulty primary from exhausting the space of sequence numbers by selecting a very large one. We discuss how h and H advance in Section 4.3.

A backup i accepts the pre-prepare message provided (in addition to the conditions above): it has not accepted a pre-prepare for view v and sequence number n containing a different digest; it can verify the authenticity of m; and d is m's digest. If i accepts the pre-prepare, it marks m as pre-prepared with sequence number n, and enters the prepare phase by multicasting a ⟨PREPARE, v, n, d, i⟩_{α_i} message to all other replicas.

When replica i has accepted a certificate with a pre-prepare message and 2f prepare messages for the same sequence number n and digest d (each from a different replica including itself), it marks the message as prepared. The protocol guarantees that other non-faulty replicas will either prepare the same request or will not prepare any request with sequence number n in view v.

Replica i multicasts ⟨COMMIT, v, n, d, i⟩_{α_i} saying it prepared the request. This starts the commit phase. When a replica has accepted a certificate with 2f + 1 commit messages for the same sequence number n and digest d from different replicas (including itself), it marks the request as committed. The protocol guarantees that the request is prepared with sequence number n in view v at f + 1 or more non-faulty replicas. This ensures information about committed requests is propagated to new views.

Replica i executes the operation requested by the client when m is committed with sequence number n and the replica has executed all requests with lower sequence numbers. This ensures that all non-faulty replicas execute requests in the same order as required to provide safety.

After executing the requested operation, replicas send a reply to the client c. The reply has the form ⟨REPLY, v, t, c, i, r⟩_{μ_{i,c}} where t is the timestamp of the corresponding request, i is the replica number, and r is the result of executing the requested operation. This message includes the current view number v so that clients can track the current primary.

The client waits for a certificate with f + 1 replies from different replicas and with the same t and r, before accepting the result r. This certificate ensures that the result is valid. If the client does not receive replies soon enough, it broadcasts the request to all replicas. If the request is not executed, the primary will eventually be suspected to be faulty by enough replicas to cause a view change and select a new primary.
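As an illustration of the certificates collected in the prepare and commit phases, the sketch below (hypothetical C++ types, not the authors' library) marks a request prepared after a matching pre-prepare plus 2f prepares, and committed after 2f + 1 matching commits.

    #include <set>

    constexpr int F = 1;   // assumed maximum number of faulty replicas

    struct Slot {
        bool has_preprepare = false;   // matching PRE-PREPARE for (v, n, d)
        std::set<int> prepares;        // replicas (including this one) with matching PREPAREs
        std::set<int> commits;         // replicas (including this one) with matching COMMITs
    };

    // prepared: a pre-prepare plus 2f prepares with the same v, n, d from distinct replicas.
    bool prepared(const Slot& s)  { return s.has_preprepare && s.prepares.size() >= 2 * F; }

    // committed: 2f + 1 matching commits from distinct replicas.
    bool committed(const Slot& s) { return s.commits.size() >= 2 * F + 1; }

    int main() {
        Slot s;
        s.has_preprepare = true;
        s.prepares = {0, 1};
        s.commits = {0, 1, 2};
        return (prepared(s) && committed(s)) ? 0 : 1;
    }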

4.3 Garbage Collection

Replicas can discard entries from their log once the corresponding requests have been executed by at least f + 1 non-faulty replicas; this many replicas are needed to ensure that the execution of that request will be known after a view change.

We can determine this condition by extra communication, but to reduce cost we do the communication only when a request with a sequence number divisible by some constant K (e.g., K = 128) is executed. We will refer to the states produced by the execution of these requests as checkpoints.

When replica i produces a checkpoint, it multicasts a ⟨CHECKPOINT, n, d, i⟩_{α_i} message to the other replicas, where n is the sequence number of the last request whose execution is reflected in the state and d is the digest of the state. A replica maintains several logical copies of the service state: the current state and some previous checkpoints. Section 4.6 describes how we manage checkpoints efficiently.

Each replica waits until it has a certificate containing 2f + 1 valid checkpoint messages for sequence number n with the same digest d sent by different replicas (including possibly its own message). At this point, the checkpoint is said to be stable and the replica discards all entries in its log with sequence numbers less than or equal to n; it also discards all earlier checkpoints.

The checkpoint protocol is used to advance the low and high water marks (which limit what messages will be added to the log). The low water mark h is equal to the sequence number of the last stable checkpoint and the high water mark is H = h + L, where L is the log size. The log size is obtained by multiplying K by a small constant factor (e.g., 2) that is big enough so that replicas do not stall waiting for a checkpoint to become stable.
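The water-mark arithmetic can be summarized by the following sketch (our illustration; the constants are just the example values mentioned in the text).

    constexpr int K = 128;            // checkpoint interval from the example above
    constexpr int LOG_SIZE = 2 * K;   // log size L, a small multiple of K

    struct WaterMarks { int h; int H; };

    // h is the sequence number of the last stable checkpoint; H = h + L.
    WaterMarks water_marks(int last_stable_checkpoint) {
        return { last_stable_checkpoint, last_stable_checkpoint + LOG_SIZE };
    }

    bool in_log_range(int n, const WaterMarks& wm) { return n > wm.h && n <= wm.H; }

    int main() {
        WaterMarks wm = water_marks(256);      // stable checkpoint at 256 -> H = 512
        return in_log_range(300, wm) ? 0 : 1;
    }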

4.4 Recovery

The recovery protocol makes faulty replicas behave correctly again to allow the system to tolerate more than f faults over its lifetime. To achieve this, the protocol ensures that after a replica recovers it is running correct code; it cannot be impersonated by an attacker; and it has correct, up-to-date state.

Reboot. Recovery is proactive: it starts periodically when the watchdog timer goes off. The recovery monitor saves the replica's state (the log and the service state) to disk. Then it reboots the system with correct code and restarts the replica from the saved state. The correctness of the operating system and service code is ensured by storing them in a read-only medium (e.g., the Seagate Cheetah 18LP disk can be write protected by physically closing a jumper switch). Rebooting restores the operating system data structures and removes any Trojan horses.

After this point, the replica's code is correct and it did not lose its state. The replica must retain its state and use it to process requests even while it is recovering. This is vital to ensure both safety and liveness in the common case when the recovering replica is not faulty; otherwise, recovery could cause the f+1st fault. But if the recovering replica was faulty, the state may be corrupt and the attacker may forge messages because it


knows the MAC keys used to authenticate both incoming and outgoing messages. The rest of the recovery protocol solves these problems.

The recovering replica i starts by discarding the keys it shares with clients and it multicasts a new-key message to change the keys it uses to authenticate messages sent by the other replicas. This is important if i was faulty because otherwise the attacker could prevent a successful recovery by impersonating any client or replica.

Run estimation protocol. Next, i runs a simple protocol to estimate an upper bound, H_M, on the high-water mark that it would have in its log if it were not faulty. It discards any entries with greater sequence numbers to bound the sequence number of corrupt entries in the log.

Estimation works as follows: i multicasts a ⟨QUERY-STABLE, i, r⟩_{α_i} message to all the other replicas, where r is a random nonce. When replica j receives this message, it replies ⟨REPLY-STABLE, c, p, i, j⟩_{μ_{j,i}}, where c and p are the sequence numbers of the last checkpoint and the last request prepared at j respectively. Replica i keeps retransmitting the query message and processing replies; it keeps the minimum value of c and the maximum value of p it received from each replica. It also keeps its own values of c and p.

The recovering replica uses the responses to select H_M as follows: H_M = L + c_M, where L is the log size and c_M is a value c received from replica j such that 2f replicas other than j reported values for c less than or equal to c_M and f replicas other than j reported values of p greater than or equal to c_M.

For safety, c_M must be greater than any stable checkpoint so that i will not discard log entries when it is not faulty. This is insured because if a checkpoint is stable it will have been created by at least f + 1 non-faulty replicas and it will have a sequence number less than or equal to any value of c that they propose. The test against p ensures that c_M is close to a checkpoint at some non-faulty replica since at least one non-faulty replica reports a p not less than c_M; this is important because it prevents a faulty replica from prolonging i's recovery. Estimation is live because there are 2f + 1 non-faulty replicas and they only propose a value of c if the corresponding request committed and that implies that it prepared at at least f + 1 correct replicas.
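The selection rule for H_M can be written as a small function; the sketch below is our illustration with hypothetical types (StableReply holds the minimum c and maximum p retained for each replica), not the paper's implementation.

    #include <optional>
    #include <vector>

    constexpr int F = 1;            // assumed fault bound
    constexpr int LOG_SIZE = 256;   // assumed log size L

    struct StableReply { int replica; int c; int p; };   // last checkpoint / last prepared

    // Pick c_M as a reported c_j such that 2f other replicas reported c <= c_j and
    // f other replicas reported p >= c_j; then H_M = L + c_M.
    std::optional<int> estimate_HM(const std::vector<StableReply>& replies) {
        for (const auto& cand : replies) {
            int c_le = 0, p_ge = 0;
            for (const auto& r : replies) {
                if (r.replica == cand.replica) continue;   // "other than j"
                if (r.c <= cand.c) c_le++;
                if (r.p >= cand.c) p_ge++;
            }
            if (c_le >= 2 * F && p_ge >= F)
                return LOG_SIZE + cand.c;
        }
        return std::nullopt;   // keep retransmitting QUERY-STABLE and waiting for replies
    }

    int main() {
        std::vector<StableReply> r = { {0, 128, 135}, {1, 128, 140}, {2, 0, 130}, {3, 256, 256} };
        return estimate_HM(r).has_value() ? 0 : 1;
    }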

After this point i participates in the protocol as if it were not recovering but it will not send any messages above H_M until it has a correct stable checkpoint with sequence number greater than or equal to H_M.

Send recovery request. Next i sends a recovery request with the form ⟨REQUEST, RECOVERY, t, i⟩_{σ_i}. This message is produced by the cryptographic co-processor and t is the co-processor's counter to prevent replays. The other replicas reject the request if it is a replay or if they accepted a recovery request from i recently (where recently can be defined as half of the watchdog period). This is important to prevent a denial-of-service attack where non-faulty replicas are kept busy executing recovery requests.

The recovery request is treated like any other request: it is assigned a sequence number n_R and it goes through the usual three phases. But when another replica executes the recovery request, it sends its own new-key message. Replicas also send a new-key message when they fetch missing state (see Section 4.6) and determine that it reflects the execution of a new recovery request. This is important because these keys are known to the attacker if the recovering replica was faulty. By changing these keys, we bound the sequence number of messages forged by the attacker that may be accepted by the other replicas: they are guaranteed not to accept forged messages with sequence numbers greater than the maximum high water mark in the log when the recovery request executes, i.e., H_R = ⌊n_R / K⌋ × K + L.

The reply to the recovery request includes the sequence number n_R. Replica i uses the same protocol as the client to collect the correct reply to its recovery request but waits for 2f + 1 replies. Then it computes its recovery point, H = max(H_M, H_R). It also computes a valid view (see Section 4.5); it retains its current view if there are f + 1 replies for views greater than or equal to it, else it changes to the median of the views in the replies.

Check and fetch state. While i is recovering, it uses the state transfer mechanism discussed in Section 4.6 to determine what pages of the state are corrupt and to fetch pages that are out-of-date or corrupt.
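A sketch of the recovery-point and view computations just described (hypothetical helpers; the input is the set of views reported in the 2f + 1 collected replies):

    #include <algorithm>
    #include <vector>

    constexpr int F = 1;   // assumed fault bound

    int recovery_point(int H_M, int H_R) { return std::max(H_M, H_R); }

    int select_view(int current_view, std::vector<int> reply_views) {
        int at_least = 0;
        for (int v : reply_views)
            if (v >= current_view) at_least++;
        if (at_least >= F + 1) return current_view;       // keep the current view
        std::sort(reply_views.begin(), reply_views.end());
        return reply_views[reply_views.size() / 2];       // otherwise take the median
    }

    int main() {
        std::vector<int> views = {2, 2, 3};
        return select_view(2, views) == 2 ? 0 : 1;
    }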

Replica i is recovered when the checkpoint with sequence number H is stable. This ensures that any state other replicas relied on i to have is actually held by f + 1 non-faulty replicas. Therefore if some other replica fails now, we can be sure the state of the system will not be lost. This is true because the estimation procedure run at the beginning of recovery ensures that while recovering i never sends bad messages for sequence numbers above the recovery point. Furthermore, the recovery request ensures that other replicas will not accept forged messages with sequence numbers greater than H_R.

Our protocol has the nice property that any replica knows that i has completed its recovery when checkpoint H is stable. This allows replicas to estimate the duration of i's recovery, which is useful to detect denial-of-service attacks that slow down recovery with low false positives.

4.5 View Change Protocol

The view change protocol provides liveness by allowing the system to make progress when the current primary fails. The protocol must preserve safety: it must ensure that non-faulty replicas agree on the sequence numbers of committed requests across views. In addition, the protocol must provide liveness: it must ensure that non-faulty replicas stay in the same view long enough for the system to make progress, even in the face of a denial-of-service attack.

The new view change protocol uses the techniques described in [6] to address liveness but uses a different approach to preserve safety. Our earlier approach relied


on certificates that were valid indefinitely. In the new protocol, however, the fact that messages can become stale means that a replica cannot prove the validity of a certificate to others. Instead the new protocol relies on the group of replicas to validate each statement that some replica claims has a certificate. The rest of this section describes the new protocol.

Data structures. Replicas record information about what happened in earlier views. This information is maintained in two sets, the PSet and the QSet. A replica also stores the requests corresponding to the entries in these sets. These sets only contain information for sequence numbers between the current low and high water marks in the log; therefore only limited storage is required. The sets allow the view change protocol to work properly even when more than one view change occurs before the system is able to continue normal operation; the sets are usually empty while the system is running normally.

The PSet at replica i stores information about requests that have prepared at i in previous views. Its entries are tuples ⟨n, d, v⟩ meaning that a request with digest d prepared at i with number n in view v and no request prepared at i in a later view.

The QSet stores information about requests that have pre-prepared at i in previous views (i.e., requests for which i has sent a pre-prepare or prepare message). Its entries are tuples ⟨n, {..., ⟨d_k, v_k⟩, ...}⟩ meaning that for each k, v_k is the latest view in which a request pre-prepared with sequence number n and digest d_k at i.

View-change messages. View changes are triggered when the current primary is suspected to be faulty (e.g., when a request from a client is not executed after some period of time; see [6] for details). When a backup i suspects the primary for view v is faulty, it enters view v + 1 and multicasts a ⟨VIEW-CHANGE, v + 1, h, C, P, Q, i⟩_{α_i} message to all replicas. Here h is the sequence number of the latest stable checkpoint known to i; C is a set of pairs with the sequence number and digest of each checkpoint stored at i; and P and Q are sets containing a tuple for every request that is prepared or pre-prepared, respectively, at i. These sets are computed using the information in the log, the PSet, and the QSet, as explained in Figure 2. Once the view-change message has been sent, i stores P in PSet, Q in QSet, and clears its log. The computation bounds the size of each tuple in QSet; it retains only pairs corresponding to f + 2 distinct requests (corresponding to possibly f messages from faulty replicas, one message from a good replica, and one special null message as explained below). Therefore the amount of storage used is bounded.

View-change-ack messages. Replicas collect view-change messages for v + 1 and send acknowledgments for them to v + 1's primary, p. The acknowledgments have the form ⟨VIEW-CHANGE-ACK, v + 1, i, j, d⟩_{μ_{i,p}} where i is the identifier of the sender, d is the digest of the view-change message being acknowledged, and j is the replica that sent that view-change message. These acknowledgments allow the primary to prove authenticity of view-change messages sent by faulty replicas as explained later.

let v be the view before the view change, L be the size of the log, and h be the log's low water mark

for all n such that h < n ≤ h + L do
    if request number n with digest d is prepared or committed in view v then
        add ⟨n, d, v⟩ to P
    else if ⟨n, d', v'⟩ ∈ PSet then
        add ⟨n, d', v'⟩ to P

    if request number n with digest d is pre-prepared, prepared or committed in view v then
        if ⟨n, D⟩ ∉ QSet then
            add ⟨n, {⟨d, v⟩}⟩ to Q
        else if ⟨d, v'⟩ ∈ D then
            add ⟨n, D ∪ {⟨d, v⟩} − {⟨d, v'⟩}⟩ to Q
        else if |D| ≥ f + 1 then
            remove the entry with the lowest view number from D
            add ⟨n, D ∪ {⟨d, v⟩}⟩ to Q
        else
            add ⟨n, D ∪ {⟨d, v⟩}⟩ to Q
    else if ⟨n, D⟩ ∈ QSet then
        add ⟨n, D⟩ to Q

Figure 2: Computing P and Q
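For concreteness, the PSet and QSet bookkeeping used by the view change protocol can be represented roughly as follows (a hypothetical C++ layout of our own, not the authors' data structures):

    #include <map>
    #include <string>

    using Digest = std::string;   // stand-in for an MD5 digest

    // PSet entry for sequence number n: a request with digest d prepared here in view v,
    // and no request prepared here for n in a later view.
    struct PInfo { Digest d; int v; };

    // QSet entry for sequence number n: for each digest, the latest view in which a
    // request with that digest pre-prepared here (bounded to a few distinct requests).
    using QInfo = std::map<Digest, int>;

    struct ViewChangeState {
        std::map<int, PInfo> PSet;   // keyed by sequence number n
        std::map<int, QInfo> QSet;   // keyed by sequence number n
    };

    int main() {
        ViewChangeState s;
        s.PSet[10] = PInfo{"d1", 3};
        s.QSet[10]["d1"] = 3;
        return 0;
    }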

New-view message construction. The new primary p collects view-change and view-change-ack messages (including messages from itself). It stores view-change messages in a set S. It adds a view-change message received from replica i to S after receiving 2f − 1 view-change-acks for i's view-change message from other replicas. Each entry in S is for a different replica.

let D = { ⟨n, d⟩ : ∃ 2f + 1 messages m ∈ S : m.h < n and
                   ∃ f + 1 messages m ∈ S : ⟨n, d⟩ ∈ m.C }

if ∃ ⟨h, d⟩ ∈ D : ∀ ⟨n', d'⟩ ∈ D : n' ≤ h then
    select checkpoint with digest d and number h
else exit

for all n such that h < n ≤ h + L do
    A. if ∃ m ∈ S with ⟨n, d, v⟩ ∈ m.P that verifies:
        A1. ∃ 2f + 1 messages m' ∈ S :
            m'.P has no entry for n or
            ⟨n, d', v'⟩ ∈ m'.P : v' < v or (v' = v and d' = d)
        A2. ∃ f + 1 messages m' ∈ S :
            ⟨n, {..., ⟨d', v'⟩, ...}⟩ ∈ m'.Q : v' ≥ v and d' = d
        A3. the primary has the request with digest d
       then select the request with digest d for number n
    B. else if ∃ 2f + 1 messages m ∈ S such that m.P has no entry for n
       then select the null request for number n

Figure 3: Decision procedure at the primary.

The new primary uses the information in S and the decision procedure sketched in Figure 3 to choose a checkpoint and a set of requests. This procedure runs each time the primary receives new information, e.g., when it adds a new message to S.

The primary starts by selecting the checkpoint that is going to be the starting state for request processing in the new view. It picks the checkpoint with the highest number h from the set of checkpoints that are known to be correct and that have numbers higher than the low water mark in the log of at least f + 1 non-faulty replicas. The last condition is necessary for safety; it ensures that the ordering information for requests that committed with numbers higher than h is still available.

Next, the primary selects a request to pre-prepare in the new view for each sequence number n between h and h + L (where L is the size of the log). For each number n that was assigned to some request m that committed in a previous view, the decision procedure selects m to pre-prepare in the new view with the same number. This ensures safety because no distinct request can commit with that number in the new view. For other numbers, the primary may pre-prepare a request that was in progress but had not yet committed, or it might select a special null request that goes through the protocol as a regular request but whose execution is a no-op.

We now argue informally that this procedure will select the correct value for each sequence number. If a request m committed at some non-faulty replica with number n, it prepared at at least f + 1 non-faulty replicas and the view-change messages sent by those replicas will indicate that m prepared with number n. Any set of at least 2f + 1 view-change messages for the new view must include a message from one of the non-faulty replicas that prepared m. Therefore, the primary for the new view will be unable to select a different request for number n because no other request will be able to satisfy conditions A1 or B (in Figure 3).

The primary will also be able to make the right decision eventually: condition A1 will be satisfied because there are 2f + 1 non-faulty replicas and non-faulty replicas never prepare different requests for the same view and sequence number; A2 is also satisfied since a request that prepares at a non-faulty replica pre-prepares at at least f + 1 non-faulty replicas. Condition A3 may not be satisfied initially, but the primary will eventually receive the request in a response to its status messages (discussed in Section 4.6). When a missing request arrives, this will trigger the decision procedure to run.

The decision procedure ends when the primary has selected a request for each number. This takes O(L × n³) local steps in the worst case but the normal case is much faster because most replicas propose identical values. After deciding, the primary multicasts a new-view message to the other replicas with its decision. The new-view message has the form ⟨NEW-VIEW, v + 1, V, X⟩_{α_p}. Here,

V contains a pair for each entry in S consisting of the identifier of the sending replica and the digest of its view-change message, and X identifies the checkpoint and request values selected.

New-view message processing. The primary updates its state to reflect the information in the new-view message. It records all requests in X as pre-prepared in view v + 1 in its log. If it does not have the checkpoint with sequence number h it also initiates the protocol to fetch the missing state (see Section 4.6.2). In any case the primary does not accept any prepare or commit messages with sequence number less than or equal to h and does not send any pre-prepare message with such a sequence number.

The backups for view v + 1 collect messages until they have a correct new-view message and a correct matching view-change message for each pair in V. If some replica changes its keys in the middle of a view change, it has to discard all the view-change protocol messages it already received with the old keys. The message retransmission mechanism causes the other replicas to re-send these messages using the new keys.

If a backup did not receive one of the view-change messages for some replica with a pair in V, the primary alone may be unable to prove that the message it received is authentic because it is not signed. The use of view-change-ack messages solves this problem. The primary only includes a pair for a view-change message in V after it collects 2f − 1 matching view-change-ack messages from other replicas. This ensures that at least f + 1 non-faulty replicas can vouch for the authenticity of every view-change message whose digest is in V. Therefore, if the original sender of a view-change is uncooperative, the primary retransmits that sender's view-change message and the non-faulty backups retransmit their view-change-acks. A backup can accept a view-change message whose authenticator is incorrect if it receives f view-change-acks that match the digest and identifier in V.

After obtaining the new-view message and the matching view-change messages, the backups check whether these messages support the decisions reported by the primary by carrying out the decision procedure in Figure 3. If they do not, the replicas move immediately to view v + 2. Otherwise, they modify their state to account for the new information in a way similar to the primary. The only difference is that they multicast a prepare message for view v + 1 for each request they mark as pre-prepared. Thereafter, the protocol proceeds as described in Section 4.2.

The replicas use the status mechanism in Section 4.6 to request retransmission of missing requests as well as missing view-change, view-change acknowledgment, and new-view messages.

4.6 Obtaining Missing Information

This section describes the mechanisms for message retransmission and state transfer. The state transfer mechanism is necessary to bring replicas up to date when some of the messages they are missing were garbage collected.

4.6.1 Message Retransmission

We use a receiver-based recovery mechanism similar to SRM [8]: a replica multicasts small status messages that summarize its state; when other replicas receive a status message they retransmit messages they have sent in the past that the sender of the status message is missing. Status messages are sent periodically and when the replica detects that it is missing information (i.e., they also function as negative acks).

If a replica j is unable to validate a status message from i, it sends its last new-key message to i. Otherwise, j sends messages it sent in the past that i may be missing. For example, if i is in a view less than j's, j sends its latest view-change message. In all cases, j authenticates messages it retransmits with the latest keys it received in a new-key message from i. This is important to ensure liveness with frequent key changes.

Clients retransmit requests to replicas until they receive enough replies. They measure response times to compute the retransmission timeout and use a randomized exponential backoff if they fail to receive a reply within the computed timeout.

4.6.2 State Transfer

A replica may learn about a stable checkpoint beyond the high water mark in its log by receiving checkpoint messages or as the result of a view change. In this case, it uses the state transfer mechanism to fetch modifications to the service state that it is missing.

It is important for the state transfer mechanism to be efficient because it is used to bring a replica up to date during recovery, and we perform proactive recoveries frequently. The key issues in achieving efficiency are reducing the amount of information transferred and reducing the burden imposed on replicas. This mechanism must also ensure that the transferred state is correct. We start by describing our data structures and then explain how they are used by the state transfer mechanism.

Data Structures. We use hierarchical state partitions to reduce the amount of information transferred. The root partition corresponds to the entire service state and each non-leaf partition is divided into s equal-sized, contiguous sub-partitions. We call leaf partitions pages and interior partitions meta-data. For example, the experiments described in Section 6 were run with a hierarchy with four levels, s equal to 256, and 4KB pages.

Each replica maintains one logical copy of the partition tree for each checkpoint. The copy is created when the checkpoint is taken and it is discarded when a later checkpoint becomes stable. The tree for a checkpoint stores a tuple ⟨lm, d⟩ for each meta-data partition and a tuple ⟨lm, d, p⟩ for each page. Here, lm is the sequence number of the checkpoint at the end of the last checkpoint interval where the partition was modified, d is the digest of the partition, and p is the value of the page.

The digests are computed efficiently as follows. For a page, d is obtained by applying the MD5 hash function [27] to the string obtained by concatenating the index of the page within the state, its value of lm, and p. For meta-data, d is obtained by applying MD5 to the string obtained by concatenating the index of the partition within its level, its value of lm, and the sum modulo a large integer of the digests of its sub-partitions. Thus, we apply AdHash [1] at each meta-data level. This construction has the advantage that the digests for a checkpoint can be obtained efficiently by updating the digests from the previous checkpoint incrementally.
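The incremental digest computation can be sketched as follows; hash64 and the 2^64 modulus are stand-ins of ours for the MD5-based construction in the paper, and all names are hypothetical.

    #include <cstdint>
    #include <string>
    #include <vector>

    std::uint64_t hash64(const std::string& s) {          // stand-in for an MD5-style hash
        std::uint64_t h = 1469598103934665603ull;
        for (char c : s) h = (h ^ static_cast<std::uint8_t>(c)) * 1099511628211ull;
        return h;
    }

    // Sum of sub-partition digests modulo a fixed large integer (here 2^64).
    std::uint64_t adhash_sum(const std::vector<std::uint64_t>& sub_digests) {
        std::uint64_t sum = 0;
        for (std::uint64_t d : sub_digests) sum += d;     // wraps mod 2^64
        return sum;
    }

    // Incremental update when a single sub-partition digest changes.
    std::uint64_t adhash_update(std::uint64_t old_sum, std::uint64_t old_sub,
                                std::uint64_t new_sub) {
        return old_sum - old_sub + new_sub;               // mod 2^64 arithmetic
    }

    // Meta-data digest: hash of the partition index, its lm value, and the sub-digest sum.
    std::uint64_t metadata_digest(int index, int lm, std::uint64_t sub_sum) {
        return hash64(std::to_string(index) + "|" + std::to_string(lm) + "|" +
                      std::to_string(sub_sum));
    }

    int main() {
        std::vector<std::uint64_t> subs = { hash64("page0"), hash64("page1") };
        std::uint64_t sum = adhash_sum(subs);
        std::uint64_t updated = adhash_update(sum, subs[1], hash64("page1-new"));
        return metadata_digest(0, 7, updated) != 0 ? 0 : 1;
    }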

The copies of the partition tree are logical because we use copy-on-write so that only copies of the tuples modified since the checkpoint was taken are stored. This reduces the space and time overheads for maintaining these checkpoints significantly.

Fetching State. The strategy to fetch state is to recurse down the hierarchy to determine which partitions are out of date. This reduces the amount of information about (both non-leaf and leaf) partitions that needs to be fetched.

A replica i multicasts ⟨FETCH, l, x, lc, c, k, i⟩_{α_i} to all replicas to obtain information for the partition with index x in level l of the tree. Here, lc is the sequence number of the last checkpoint i knows for the partition, and c is either -1 or it specifies that i is seeking the value of the partition at sequence number c from replica k.

When a replica i determines that it needs to initiate a state transfer, it multicasts a fetch message for the root partition with lc equal to its last checkpoint. The value of c is non-zero when i knows the correct digest of the partition information at checkpoint c, e.g., after a view change completes i knows the digest of the checkpoint that propagated to the new view but might not have it. i also creates a new (logical) copy of the tree to store the state it fetches and initializes a table LC in which it stores the number of the latest checkpoint reflected in the state of each partition in the new tree. Initially each entry in the table will contain lc.

If ⟨FETCH, l, x, lc, c, k, i⟩_{α_i} is received by the designated replier, k, and it has a checkpoint for sequence number c, it sends back ⟨META-DATA, c, l, x, P, k⟩, where P is a set with a tuple ⟨x', lm, d⟩ for each sub-partition of (l, x) with index x', digest d, and lm > lc. Since i knows the correct digest for the partition value at checkpoint c, it can verify the correctness of the reply without the need for voting or even authentication. This reduces the burden imposed on other replicas.

The other replicas only reply to the fetch message if they have a stable checkpoint greater than lc and c. Their replies are similar to k's except that c is replaced by the sequence number of their stable checkpoint and the message contains a MAC. These replies are necessary to guarantee progress when replicas have discarded a specific checkpoint requested by i.

Replica i retransmits the fetch message (choosing a different k each time) until it receives a valid reply from some k or f + 1 equally fresh responses with the same sub-partition values for the same sequence number cp (greater than lc and c). Then, it compares its digests for each sub-partition of (l, x) with those in the fetched information; it multicasts a fetch message for sub-partitions where there is a difference, and sets the value in LC to c (or cp) for the sub-partitions that are up to date. Since i learns the correct digest of each sub-partition at checkpoint c (or cp) it can use the optimized protocol to fetch them.

The protocol recurses down the tree until i sends fetch messages for out-of-date pages. Pages are fetched like other partitions except that meta-data replies contain the digest and last modification sequence number for the page rather than sub-partitions, and the designated replier sends back ⟨DATA, x, p⟩. Here, x is the page index and p is the page value. The protocol imposes little overhead on other replicas; only one replica replies with the full


page and it does not even need to compute a MAC for the message since i can verify the reply using the digest it already knows.

When i obtains the new value for a page, it updates the state of the page, its digest, the value of the last modification sequence number, and the value corresponding to the page in LC. Then, the protocol goes up to its parent and fetches another missing sibling. After fetching all the siblings, it checks if the parent partition is consistent. A partition is consistent up to sequence number c if c is the minimum of all the sequence numbers in LC for its sub-partitions, and c is greater than or equal to the maximum of the last modification sequence numbers in its sub-partitions. If the parent partition is not consistent, the protocol sends another fetch for the partition. Otherwise, the protocol goes up again to its parent and fetches missing siblings.

The protocol ends when it visits the root partition and determines that it is consistent for some sequence number c. Then the replica can start processing requests with sequence numbers greater than c.
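The consistency test applied while walking back up the partition tree can be sketched as follows (hypothetical types of our own; the LC entries and last-modification numbers are as described above).

    #include <algorithm>
    #include <vector>

    struct SubPartition { int lc_entry; int last_modified; };

    // A partition is consistent up to c if c is the minimum of its sub-partitions'
    // entries in LC and c is at least every sub-partition's last-modification number.
    bool consistent_up_to(int c, const std::vector<SubPartition>& subs) {
        if (subs.empty()) return false;   // nothing to check in this sketch
        int min_lc = subs.front().lc_entry;
        int max_lm = subs.front().last_modified;
        for (const auto& s : subs) {
            min_lc = std::min(min_lc, s.lc_entry);
            max_lm = std::max(max_lm, s.last_modified);
        }
        return c == min_lc && c >= max_lm;
    }

    int main() {
        std::vector<SubPartition> subs = { {384, 300}, {384, 384} };
        return consistent_up_to(384, subs) ? 0 : 1;
    }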

Since state transfer happens concurrently with request execution at other replicas and other replicas are free to garbage collect checkpoints, it may take some time for a replica to complete the protocol, e.g., each time it fetches a missing partition, it receives information about yet a later modification. This is unlikely to be a problem in practice (this intuition is confirmed by our experimental results). Furthermore, if the replica fetching the state ever is actually needed because others have failed, the system will wait for it to catch up.

4.7 Discussion

Our system ensures safety and liveness for an execution provided at most f replicas become faulty within a window of vulnerability of size Tv = 2Tk + Tr. The values of Tk and Tr are characteristic of each execution and unknown to the algorithm. Tk is the maximum key refreshment period in the execution for a non-faulty node, and Tr is the maximum time between when a replica fails and when it recovers from that fault in the execution.

The message authentication mechanism from Section 4.1 ensures non-faulty nodes only accept certificates with messages generated within an interval of size at most 2Tk.[1] The bound on the number of faults within Tv ensures there are never more than f faulty replicas within any interval of size at most 2Tk + Tr. Therefore, safety and liveness are provided because non-faulty nodes never accept certificates with more than f bad messages.

We have little control over the value of T_v because T_r may be increased by a denial-of-service attack, but we have good control over T_k and the maximum time between watchdog timeouts, T_w, because their values are determined by timer rates, which are quite stable. Setting these timeout values involves a tradeoff between security and performance: small values improve security by reducing the window of vulnerability but degrade performance by causing more frequent recoveries and key changes. Section 6 analyzes this tradeoff.

¹It would be T_k except that during view changes replicas may accept messages that are claimed authentic by f+1 replicas without directly checking their authentication token.

The value of T_w should be set based on R_n, the time it takes to recover a non-faulty replica under normal load conditions. There is no point in recovering a replica when its previous recovery has not yet finished; and we stagger the recoveries so that no more than f replicas are recovering at once, since otherwise service could be interrupted even without an attack. Therefore, we set T_w = 4 * s * R_n. Here, the factor 4 accounts for the staggered recovery of 3f+1 replicas, f at a time, and s is a safety factor to account for benign overload conditions (i.e., no attack).
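As a rough illustration of how these quantities combine, the following C sketch plugs in sample numbers. The symbols follow the discussion above; the numeric values and the approximation of T_r by roughly one watchdog period plus one recovery are our own assumptions, chosen only to make the arithmetic concrete (measured values appear in Section 6.3).

#include <stdio.h>

int main(void)
{
    /* Assumed sample values, for illustration only. */
    double Rn = 45.0;  /* seconds to recover a non-faulty replica under normal load */
    double s  = 1.2;   /* safety factor for benign overload conditions */
    double Tk = 15.0;  /* maximum key refreshment period, in seconds */

    /* T_w = 4 * s * R_n: the factor 4 staggers recovery of 3f+1 = 4 replicas,
     * f = 1 at a time, so at most f replicas are ever recovering at once. */
    double Tw = 4.0 * s * Rn;

    /* Rough bound on T_r under normal conditions: a faulty replica is recovered
     * within about one watchdog period plus one recovery (our simplification). */
    double Tr = Tw + Rn;
    double Tv = 2.0 * Tk + Tr;   /* window of vulnerability */

    printf("T_w = %.0f s, T_v = %.0f s (about %.1f minutes)\n", Tw, Tv, Tv / 60.0);
    return 0;
}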

Another issue is the bound on the number of faults. Our replication technique is not useful if there is a strong positive correlation between the failure probabilities of the replicas; the probability of exceeding the bound may not be lower than the probability of a single fault in this case. Therefore, it is important to take steps to increase diversity. One possibility is to have diversity in the execution environment: the replicas can be administered by different people; they can be in different geographic locations; and they can have different configurations (e.g., run different combinations of services, or run schedulers with different parameters). This improves resilience to several types of faults, for example, attacks involving physical access to the replicas, administrator attacks or mistakes, attacks that exploit weaknesses in other services, and software bugs due to race conditions. Another possibility is to have software diversity; replicas can run different operating systems and different implementations of the service code. There are several independent implementations available for operating systems and important services (e.g., file systems, databases, and WWW servers). This improves resilience to software bugs and attacks that exploit software bugs.

Even without taking any steps to increase diversity, our proactive recovery technique increases resilience to nondeterministic software bugs, to software bugs due to aging (e.g., memory leaks), and to attacks that take more time than T_v to succeed. It is possible to improve security further by exploiting software diversity across recoveries. One possibility is to restrict the service interface at a replica after its state is found to be corrupt. Another potential approach is to use obfuscation and randomization techniques [7, 9] to produce a new version of the software each time a replica is recovered. These techniques are not very resilient to attacks but they can be very effective when combined with proactive recovery because the attacker has a bounded time to break them.

5 Implementation

We implemented the algorithm as a library with a very simple interface (see Figure 4). Some components of the library run on clients and others at the replicas.

Client:
int Byz_init_client(char *conf);
int Byz_invoke(Byz_req *req, Byz_rep *rep, bool read_only);

Server:
int Byz_init_replica(char *conf, char *mem, int size, UC exec);
void Byz_modify(char *mod, int size);

Server upcall:
int execute(Byz_req *req, Byz_rep *rep, int client);

Figure 4: The replication library API.

On the client side, the library provides a procedure to initialize the client using a configuration file, which contains the public keys and IP addresses of the replicas. The library also provides a procedure, invoke, that is called to cause an operation to be executed. This procedure carries out the client side of the protocol and returns the result when enough replicas have responded.

On the server side, we provide an initialization procedure that takes as arguments a configuration file with the public keys and IP addresses of replicas and clients, the region of memory where the application state is stored, and a procedure to execute requests. When our system needs to execute an operation, it makes an upcall to the execute procedure. This procedure carries out the operation as specified for the application, using the application state. As the application performs the operation, each time it is about to modify the application state, it calls the modify procedure to inform us of the locations about to be modified. This call allows us to maintain checkpoints and compute digests efficiently as described in Section 4.6.2.
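To show how these pieces fit together, here is a hedged sketch of a trivial counter service written against this interface. The header name, the Byz_req/Byz_rep field names, and the error handling are our assumptions and may differ from the real library; the calling pattern (register the state and an execute upcall, report modified locations with Byz_modify, invoke from the client) is the one described above.

#include <string.h>
#include "byz.h"   /* hypothetical header exporting the API in Figure 4 */

static char state[4096];   /* application state: one 4 KB page holding a counter */

/* Upcall made by the library once a request has been totally ordered. */
static int execute(Byz_req *req, Byz_rep *rep, int client)
{
    /* Tell the library which locations are about to change so it can
     * maintain checkpoints and partition digests incrementally. */
    Byz_modify(state, sizeof(long));
    *(long *)state += 1;

    /* Assumed reply layout: contents buffer (provided by the library) plus a size field. */
    memcpy(rep->contents, state, sizeof(long));
    rep->size = sizeof(long);
    return 0;
}

int replica_main(char *conf_file)
{
    /* Registers keys/addresses from the configuration file, the state region,
     * and the execute upcall, then runs the replica side of the protocol. */
    return Byz_init_replica(conf_file, state, sizeof(state), execute);
}

int client_main(char *conf_file)
{
    Byz_req req;
    Byz_rep rep;

    Byz_init_client(conf_file);
    req.size = 0;                      /* the increment operation takes no argument */
    Byz_invoke(&req, &rep, 0 /* not read-only */);
    return 0;
}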

6 Performance Evaluation

This section has two parts. First, it presents results of experiments to evaluate the benefit of eliminating public-key cryptography from the critical path. Then, it presents an analysis of the cost of proactive recoveries.

6.1 Experimental Setup

All experiments ran with four replicas. Four replicas can tolerate one Byzantine fault; we expect this reliability level to suffice for most applications. Clients and replicas ran on Dell Precision 410 workstations with Linux 2.2.16-3 (uniprocessor). These workstations have a 600 MHz Pentium III processor, 512 MB of memory, and a Quantum Atlas 10K 18WLS disk. All machines were connected by a 100 Mb/s switched Ethernet and had 3Com 3C905B interface cards. The switch was an Extreme Networks Summit48 V4.1. The experiments ran on an isolated network.

The interval between checkpoints was 128 requests, which causes garbage collection to occur several times in each experiment. The size of the log was 256. The state partition tree had 4 levels, each internal node had 256 children, and the leaves had 4 KB.
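These parameters bound how much state the partition tree can cover; the short C sketch below works out the arithmetic (a back-of-the-envelope check of ours, not code from the library).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t levels    = 4;     /* root, two internal levels, and the leaves */
    const uint64_t branching = 256;   /* children per internal node */
    const uint64_t page_size = 4096;  /* bytes per leaf page */

    /* With 4 levels, the leaves sit 3 edges below the root: 256^3 pages. */
    uint64_t pages = 1;
    for (uint64_t level = 1; level < levels; level++)
        pages *= branching;

    printf("%llu pages, %llu GB of state covered\n",
           (unsigned long long)pages,
           (unsigned long long)((pages * page_size) >> 30));
    return 0;
}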

6.2 The Cost of Public-Key Cryptography

To evaluate the benefit of using MACs instead of public-key signatures, we implemented BFT-PK. Our previous algorithm [6] relies on the extra power of digital signatures to authenticate pre-prepare, prepare, checkpoint, and view-change messages but it can be easily modified to use MACs to authenticate other messages. To provide a fair comparison, BFT-PK is identical to the BFT library but it uses public-key signatures to authenticate these four types of messages. We ran a micro-benchmark and a file system benchmark to compare the performance of services implemented with the two libraries. There were no view changes, recoveries or key changes in these experiments.

6.2.1 Micro-Benchmark

The micro-benchmark compares the performance of two implementations of a simple service: one implementation uses BFT-PK and the other uses BFT. This service has no state and its operations have arguments and results of different sizes but they do nothing. We also evaluated the performance of NO-REP: an implementation of the service using UDP with no replication. We ran experiments to evaluate the latency and throughput of the service. The comparison with NO-REP shows the worst case overhead for our library; in real services, the relative overhead will be lower due to computation or I/O at the clients and servers.

Table 1 reports the latency to invoke an operation when the service is accessed by a single client. The results were obtained by timing a large number of invocations in three separate runs. We report the average of the three runs. The standard deviations were always below 0.5% of the reported value.

system    0/0      0/4      4/0
BFT-PK    59368    59761    59805
BFT         431      999     1046
NO-REP      106      625      630

Table 1: Micro-benchmark: operation latency in microseconds. Each operation type is denoted by a/b, where a and b are the sizes of the argument and result in KB.

BFT-PK has two signatures in the critical path and each of them takes 29.4 ms to compute. The algorithm described in this paper eliminates the need for these signatures. As a result, BFT is between 57 and 138 times faster than BFT-PK. BFT's latency is between 60% and 307% higher than NO-REP because of additional communication and computation overhead. For read-only requests, BFT uses the optimization described in [6] that reduces the slowdown for operations 0/0 and 0/4 to 93% and 25%, respectively.
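The latency numbers in Table 1 are consistent with this explanation; the small C sketch below does the arithmetic (our own sanity check on the reported figures, not a measurement from the paper).

#include <stdio.h>

int main(void)
{
    const double sig_us    = 29400.0;  /* one public-key signature, in microseconds */
    const double bft_us    = 431.0;    /* BFT latency for the 0/0 operation */
    const double bft_pk_us = 59368.0;  /* measured BFT-PK latency for 0/0 */

    /* Two signatures in the critical path plus roughly BFT's own cost
     * should account for most of BFT-PK's latency. */
    printf("predicted BFT-PK latency: %.0f us (measured %.0f us)\n",
           2.0 * sig_us + bft_us, bft_pk_us);
    printf("speedup from removing signatures: %.0fx\n", bft_pk_us / bft_us);
    return 0;
}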

We also measured the overhead of replication at the client. BFT increases CPU time relative to NO-REP by up to a factor of 5, but the CPU time at the client is only between 66 and 142 μs per operation. BFT also increases the number of bytes in Ethernet packets that are sent or received by the client: 405% for the 0/0 operation but only 12% for the other operations.

Figure 5: Micro-benchmark: throughput in operations per second. (The three panels plot 0/0, 0/4, and 4/0 operations per second against the number of clients for NO-REP, BFT, and BFT-PK.)

Figure 5 compares the throughput of the different implementations of the service as a function of the number of clients. The client processes were evenly distributed over 5 client machines² and each client process invoked operations synchronously, i.e., it waited for a reply before invoking a new operation. Each point in the graph is the average of at least three independent runs and the standard deviation for all points was below 4% of the reported value (except that it was as high as 17% for the last four points in the graph for BFT-PK operation 4/0). There are no points with more than 15 clients for NO-REP operation 4/0 because of lost request messages; NO-REP uses UDP directly and does not retransmit requests.

The throughput of both replicated implementations increases with the number of concurrent clients because the library implements batching [4]. Batching inlines several requests in each pre-prepare message to amortize the protocol overhead. BFT-PK performs 5 to 11 times worse than BFT because signing messages leads to a high protocol overhead and there is a limit on how many requests can be inlined in a pre-prepare message.

The bottleneck in operation 0/0 is the server's CPU; BFT's maximum throughput is 53% lower than NO-REP's due to extra messages and cryptographic operations that increase the CPU load. The bottleneck in operation 4/0 is the network; BFT's throughput is within 11% of NO-REP's because BFT does not consume significantly more network bandwidth in this operation. BFT achieves a maximum aggregate throughput of 26 MB/s in operation 0/4 whereas NO-REP is limited by the link bandwidth (approximately 12 MB/s). The throughput is better in BFT because of an optimization that we described in [6]: each client chooses one replica randomly; this replica's reply includes the 4 KB but the replies of the other replicas only contain small digests. As a result, clients obtain the large replies in parallel from different replicas. We refer the reader to [4] for a detailed analysis of these latency and throughput results.

²Two client machines had 700 MHz PIIIs but were otherwise identical to the other machines.
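A hedged sketch of the reply optimization described above follows. The struct layout and the toy digest function are our own illustration (the paper uses MD5 [27] and its own message formats); the point is simply that only the replica chosen by the client returns the full result, so large replies are spread across the replicas' links.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct reply {
    bool   full;        /* true if this reply carries the result bytes */
    char   digest[16];  /* digest of the result otherwise */
    char  *contents;    /* result bytes when full == true */
    size_t size;
};

/* Toy stand-in for the real digest function. */
static void compute_digest(const char *data, size_t n, char out[16])
{
    memset(out, 0, 16);
    for (size_t i = 0; i < n; i++)
        out[i % 16] ^= data[i];
}

/* Each client names one replica as its designated replier; only that replica
 * sends the full result, the others send digests the client checks it against. */
static void build_reply(struct reply *r, int my_id, int designated,
                        char *result, size_t n)
{
    memset(r, 0, sizeof *r);
    if (my_id == designated) {
        r->full = true;
        r->contents = result;
        r->size = n;
    } else {
        compute_digest(result, n, r->digest);
    }
}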

6.2.2 File System Benchmarks

We implemented the Byzantine-fault-tolerant NFS service that was described in [6]. The next set of experiments compares the performance of two implementations of this service: BFS, which uses BFT, and BFS-PK, which uses BFT-PK.

The experiments ran the modified Andrew benchmark [25, 15], which emulates a software development workload. It has five phases: (1) creates subdirectories recursively; (2) copies a source tree; (3) examines the status of all the files in the tree without examining their data; (4) examines every byte of data in all the files; and (5) compiles and links the files. Unfortunately, Andrew is so small for today's systems that it does not exercise the NFS service. So we increased the size of the benchmark by a factor of n as follows: phases 1 and 2 create n copies of the source tree, and the other phases operate in all these copies. We ran a version of Andrew with n equal to 100, Andrew100, and another with n equal to 500, Andrew500. BFS builds a file system inside a memory mapped file [6]. We ran Andrew100 in a file system file with 205 MB and Andrew500 in a file system file with 1 GB; both benchmarks fill 90% of these files. Andrew100 fits in memory at both the client and the replicas but Andrew500 does not.

We also compare BFS and the NFS implementation in Linux, NFS-std. The performance of NFS-std is a good metric of what is acceptable because it is used daily by many users. For all configurations, the actual benchmark code ran at the client workstation using the standard NFS client implementation in the Linux kernel with the same mount options. The most relevant of these options for the benchmark are: UDP transport, 4096-byte read and write buffers, allowing write-back client caching, and allowing attribute caching.

Tables 2 and 3 present the results for these experiments. We report the mean of 3 runs of the benchmark. The standard deviation was always below 1% of the reported averages except for phase 1 where it was as high as 33%. The results show that BFS-PK takes 12 times longer than BFS to run Andrew100 and 15 times longer to run Andrew500. The slowdown is smaller than the one observed with the micro-benchmarks because the client performs a significant amount of computation in this benchmark.

phase    BFS-PK      BFS    NFS-std
1          25.4      0.7        0.6
2        1528.6     39.8       26.9
3          80.1     34.1       30.7
4          87.5     41.3       36.7
5        2935.1    265.4      237.1
total    4656.7    381.3      332.0

Table 2: Andrew100: elapsed time in seconds

Both BFS and BFS-PK use the read-only optimization described in [6] for reads and lookups, and as a consequence do not set the time-last-accessed attribute when these operations are invoked. This reduces the performance difference between BFS and BFS-PK during phases 3 and 4 where most operations are read-only.

phase     BFS-PK       BFS    NFS-std
1           122.0       4.2        3.5
2          8080.4     204.5      139.6
3           387.5     170.2      157.4
4           496.0     262.8      232.7
5         23201.3    1561.2     1248.4
total     32287.2    2202.9     1781.6

Table 3: Andrew500: elapsed time in seconds

BFS-PK is impractical but BFS's performance is close to NFS-std: it performs only 15% slower in Andrew100 and 24% slower in Andrew500. The performance difference would be lower if Linux implemented NFS correctly. For example, we reported previously [6] that BFS was only 3% slower than NFS in Digital Unix, which implements the correct semantics. The NFS implementation in Linux does not ensure stability of modified data and meta-data as required by the NFS protocol, whereas BFS ensures stability through replication.

6.3 The Cost of Recovery

Frequent proactive recoveries and key changes improve resilience to faults by reducing the window of vulnerability, but they also degrade performance. We ran Andrew to determine the minimum window of vulnerability that can be achieved without overlapping recoveries. Then we configured the replicated file system to achieve this window, and measured the performance degradation relative to a system without recoveries.

The implementation of the proactive recovery mechanism is complete except that we are simulating the secure co-processor, the read-only memory, and the watchdog timer in software. We are also simulating fast reboots. The LinuxBIOS project [22] has been experimenting with replacing the BIOS by Linux. They claim to be able to reboot Linux in 35 s (0.1 s to get the kernel running and 34.9 s to execute scripts in /etc/rc.d) [22]. This means that in a suitably configured machine we should be able to reboot in less than a second. Replicas simulate a reboot by sleeping either 1 or 30 seconds and calling msync to invalidate the service-state pages (this forces reads from disk the next time they are accessed).
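A minimal sketch of that simulation, assuming the service state lives in a memory-mapped file and that the sleep time stands in for the reboot (the function and variable names are ours):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Simulate a reboot: pause for the assumed reboot time, then invalidate the
 * memory-mapped service-state pages so the next access re-reads them from disk. */
static void simulate_reboot(void *state, size_t state_size, unsigned reboot_seconds)
{
    sleep(reboot_seconds);                    /* 1 s or 30 s in the experiments */
    msync(state, state_size, MS_INVALIDATE);  /* drop cached copies of the pages */
}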

6.3.1 Recovery Time

The time to complete recovery determines the minimum window of vulnerability that can be achieved without overlaps. We measured the recovery time for Andrew100 and Andrew500 with 30s reboots and with the period between key changes, T_k, set to 15s.

Table 4 presents a breakdown of the maximum time to recover a replica in both benchmarks. Since the processes of checking the state for correctness and fetching missing updates over the network to bring the recovering replica up to date are executed in parallel, Table 4 presents a single line for both of them. The line labeled restore state only accounts for reading the log from disk; the service state pages are read from disk on demand when they are checked.

                 Andrew100    Andrew500
save state            2.84         6.3
reboot               30.05        30.05
restore state         0.09         0.30
estimation            0.21         0.15
send new-key          0.03         0.04
send request          0.03         0.03
fetch and check       9.34       106.81
total                42.59       143.68

Table 4: Andrew: recovery time in seconds.

The most significant components of the recovery time are the time to save the replica's log and service state to disk, the time to reboot, and the time to check and fetch state. The other components are insignificant. The time to reboot is the dominant component for Andrew100 and checking and fetching state account for most of the recovery time in Andrew500 because the state is bigger.

Given these times, we set the period between watchdog timeouts, T_w, to 3.5 minutes in Andrew100 and to 10 minutes in Andrew500. These settings correspond to a minimum window of vulnerability of 4 and 10.5 minutes, respectively. We also ran the experiments for Andrew100 with a 1s reboot and the maximum time to complete recovery in this case was 13.3s. This enables a window of vulnerability of 1.5 minutes with T_w set to 1 minute.

Recovery must be fast to achieve a small window of vulnerability. While the current recovery times are low, it is possible to reduce them further. For example, the time to check the state can be reduced by periodically backing up the state onto a disk that is normally write-protected and by using copy-on-write to create copies of modified pages on a writable disk. This way only the modified pages need to be checked. If the read-only copy of the state is brought up to date frequently (e.g., daily), it will be possible to scale to very large states while achieving even lower recovery times.
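A rough sketch of that incremental checking idea, assuming a dirty-page bitmap maintained via copy-on-write and an already-verified read-only backup (the interfaces below are entirely our illustration; the paper only outlines the approach):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interfaces: a bitmap of pages modified since the last backup,
 * set by the copy-on-write machinery, and a per-page digest check. */
extern bool page_dirty(size_t page_index);
extern bool check_page(size_t page_index);

/* Only pages modified since the backup was taken need to be checked during
 * recovery; clean pages are covered by the verified read-only copy. */
static size_t check_state_incrementally(size_t num_pages)
{
    size_t checked = 0;
    for (size_t i = 0; i < num_pages; i++) {
        if (page_dirty(i)) {
            check_page(i);
            checked++;
        }
    }
    return checked;   /* number of pages actually checked */
}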


6.3.2 Recovery Overhead

We also evaluated the impact of recovery on performance in the experimental setup described in the previous section. Table 5 shows the results. BFS-rec is BFS with proactive recoveries. The results show that adding frequent proactive recoveries to BFS has a low impact on performance: BFS-rec is 16% slower than BFS in Andrew100 and 2% slower in Andrew500. In Andrew100 with 1s reboot and a window of vulnerability of 1.5 minutes, the time to complete the benchmark was 482.4s; this is only 27% slower than the time without recoveries even though every 15s one replica starts a recovery.

The results also show that the period between key changes, T_k, can be small without impacting performance significantly. T_k could be smaller than 15s but it should be substantially larger than 3 message delays under normal load conditions to provide liveness.

system     Andrew100    Andrew500
BFS-rec        443.5       2257.8
BFS            381.3       2202.9
NFS-std        332.0       1781.6

Table 5: Andrew: recovery overhead in seconds.

There are several reasons why recoveries have a low impact on performance. The most obvious is that recoveries are staggered such that there is never more than one replica recovering; this allows the remaining replicas to continue processing client requests. But it is necessary to perform a view change whenever recovery is applied to the current primary, and the clients cannot obtain further service until the view change completes. These view changes are inexpensive because a primary multicasts a view-change message just before its recovery starts and this causes the other replicas to move to the next view immediately.

7 Related Work

Most previous work on replication techniques assumed benign faults, e.g., [17, 23, 18, 19] or a synchronous system model, e.g., [28]. Earlier Byzantine-fault-tolerant systems [26, 16, 20], including the algorithm we described in [6], could guarantee safety only if fewer than 1/3 of the replicas were faulty during the lifetime of the system. This guarantee is too weak for long-lived systems. Our system improves this guarantee by recovering replicas proactively and frequently; it can tolerate any number of faults if fewer than 1/3 of the replicas become faulty within a window of vulnerability, which can be made small under normal load conditions with low impact on performance.

In a previous paper [6], we described a system that tolerated Byzantine faults in asynchronous systems and performed well. This paper extends that work by providing recovery, a state transfer mechanism, and a new view change mechanism that enables both recovery and an important optimization: the use of MACs instead of public-key cryptography.

Rampart [26] and SecureRing [16] provide group membership protocols that can be used to implement recovery, but only in the presence of benign faults. These approaches cannot be guaranteed to work in the presence of Byzantine faults for two reasons. First, the system may be unable to provide safety if a replica that is not faulty is removed from the group to be recovered. Second, the algorithms rely on messages signed by replicas even after they are removed from the group and there is no way to prevent attackers from impersonating removed replicas that they controlled.

The problem of efficient state transfer has not been addressed by previous work on Byzantine-fault-tolerant replication. We present an efficient state transfer mechanism that enables frequent proactive recoveries with low performance degradation.

Public-key cryptography was the major performance bottleneck in previous systems [26, 16] despite the fact that these systems include sophisticated techniques to reduce the cost of public-key cryptography at the expense of security or latency. They cannot use MACs instead of signatures because they rely on the extra power of digital signatures to work correctly: signatures allow the receiver of a message to prove to others that the message is authentic, whereas this may be impossible with MACs. The view change mechanism described in this paper does not require signatures. It allows public-key cryptography to be eliminated, except for obtaining new secret keys. This approach improves performance by up to two orders of magnitude without losing security.

The concept of a system that can tolerate more than f faults provided no more than f nodes in the system become faulty in some time window was introduced in [24]. This concept has previously been applied in synchronous systems to secret-sharing schemes [13], threshold cryptography [14], and more recently secure information storage and retrieval [10] (which provides single-writer single-reader replicated variables). But our algorithm is more general; it allows a group of nodes in an asynchronous system to implement an arbitrary state machine.

8 Conclusions

This paper has described a new state-machine replication system that offers both integrity and high availability in the presence of Byzantine faults. The new system can be used to implement real services because it performs well, works in asynchronous systems like the Internet, and recovers replicas to enable long-lived services.

The system described here improves the security and robustness against software errors of previous systems by recovering replicas proactively and frequently. It can tolerate any number of faults provided fewer than 1/3 of the replicas become faulty within a window of vulnerability. This window can be small (e.g., a few minutes) under normal load conditions and when the attacker does not corrupt replicas' copies of the service state. Additionally, our system provides intrusion detection; it detects denial-of-service attacks aimed at increasing the window and detects the corruption of the state of a recovering replica.

Recovery from Byzantine faults is harder than recovery from benign faults for several reasons: the recovery protocol itself needs to tolerate other Byzantine-faulty replicas; replicas must be recovered proactively; and attackers must be prevented from impersonating recovered replicas that they controlled. For example, the last requirement prevents signatures in messages from being valid indefinitely. However, this leads to a further problem, since replicas may be unable to prove to a third party that some message they received is authentic (because its signature is no longer valid). All previous state-machine replication algorithms relied on such proofs. Our algorithm does not rely on these proofs and has the added advantage of enabling the use of symmetric cryptography for authentication of all protocol messages. This eliminates the use of public-key cryptography, the major performance bottleneck in previous systems.

The algorithm has been implemented as a generic program library with a simple interface that can be used to provide Byzantine-fault-tolerant versions of different services. We used the library to implement BFS, a replicated NFS service, and ran experiments to determine the performance impact of our techniques by comparing BFS with an unreplicated NFS. The experiments show that it is possible to use our algorithm to implement real services with performance close to that of an unreplicated service. Furthermore, they show that the window of vulnerability can be made very small: 1.5 to 10 minutes with only 2% to 27% degradation in performance.

Acknowledgments

We would like to thank Kyle Jamieson, Rodrigo Rodrigues, Bill Weihl, and the anonymous referees for their helpful comments on drafts of this paper. We also thank the Computer Resource Services staff in our laboratory for lending us a switch to run the experiments and Ted Krovetz for the UMAC code.

References

[1] M. Bellare and D. Micciancio. A New Paradigm for Collision-free Hashing: Incrementality at Reduced Cost. In Advances in Cryptology - EUROCRYPT, 1997.

[2] J. Black et al. UMAC: Fast and Secure Message Authentication. In Advances in Cryptology - CRYPTO, 1999.

[3] R. Canetti, S. Halevi, and A. Herzberg. Maintaining Authenticated Communication in the Presence of Break-ins. In ACM Conference on Computers and Communication Security, 1997.

[4] M. Castro. Practical Byzantine Fault Tolerance. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 2000. In preparation.

[5] M. Castro and B. Liskov. A Correctness Proof for a Practical Byzantine-Fault-Tolerant Replication Algorithm. Technical Memo MIT/LCS/TM-590, MIT Laboratory for Computer Science, 1999.

[6] M. Castro and B. Liskov. Practical Byzantine Fault Tolerance. In USENIX Symposium on Operating Systems Design and Implementation, 1999.

[7] C. Collberg and C. Thomborson. Watermarking, Tamper-Proofing, and Obfuscation - Tools for Software Protection. Technical Report 2000-03, University of Arizona, 2000.

[8] S. Floyd et al. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. IEEE/ACM Transactions on Networking, 5(6), 1995.

[9] S. Forrest et al. Building Diverse Computer Systems. In Proceedings of the 6th Workshop on Hot Topics in Operating Systems, 1997.

[10] J. Garay et al. Secure Distributed Storage and Retrieval. Theoretical Computer Science, to appear.

[11] L. Gong. A Security Risk of Depending on Synchronized Clocks. Operating Systems Review, 26(1):49–53, 1992.

[12] M. Herlihy and J. Wing. Axioms for Concurrent Objects. In ACM Symposium on Principles of Programming Languages, 1987.

[13] A. Herzberg et al. Proactive Secret Sharing, Or: How To Cope With Perpetual Leakage. In Advances in Cryptology - CRYPTO, 1995.

[14] A. Herzberg et al. Proactive Public Key and Signature Systems. In ACM Conference on Computers and Communication Security, 1997.

[15] J. Howard et al. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems, 6(1), 1988.

[16] K. Kihlstrom, L. Moser, and P. Melliar-Smith. The SecureRing Protocols for Securing Group Communication. In Hawaii International Conference on System Sciences, 1998.

[17] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), 1978.

[18] L. Lamport. The Part-Time Parliament. Technical Report 49, DEC Systems Research Center, 1989.

[19] B. Liskov et al. Replication in the Harp File System. In ACM Symposium on Operating System Principles, 1991.

[20] D. Malkhi and M. Reiter. Secure and Scalable Replication in Phalanx. In IEEE Symposium on Reliable Distributed Systems, 1998.

[21] D. Mazieres et al. Separating Key Management from File System Security. In ACM Symposium on Operating System Principles, 1999.

[22] Ron Minnich. The Linux BIOS Home Page. http://www.acl.lanl.gov/linuxbios, 2000.

[23] B. Oki and B. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. In ACM Symposium on Principles of Distributed Computing, 1988.

[24] R. Ostrovsky and M. Yung. How to Withstand Mobile Virus Attacks. In ACM Symposium on Principles of Distributed Computing, 1991.

[25] J. Ousterhout. Why Aren't Operating Systems Getting Faster as Fast as Hardware? In USENIX Summer, 1990.

[26] M. Reiter. The Rampart Toolkit for Building High-Integrity Services. Theory and Practice in Distributed Systems (LNCS 938), 1995.

[27] R. Rivest. The MD5 Message-Digest Algorithm. Internet RFC-1321, 1992.

[28] F. Schneider. Implementing Fault-Tolerant Services Using The State Machine Approach: A Tutorial. ACM Computing Surveys, 22(4), 1990.