![Page 1: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/1.jpg)
1
Principles of Reliable Distributed Systems
Lecture 11: Disk Paxos,
Quorum Systems, and Frangipani
Spring 2008
Prof. Idit Keidar
![Page 2: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/2.jpg)
2
Today’s Material
• Shared memory Paxos from Sec. 5 of: Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory, Abraham, Chockler, Keidar, & Malkhi, PODC 2004
• Disk Paxos, Gafni & Lamport, DISC 2000
• Frangipani: A Scalable Distributed File System, Thekkath, Mann, & Lee, SOSP 1997
![Page 3: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/3.jpg)
3
Reminder: Asynchronous R/W Shared Memory Model
• Shared memory registers
  – Simple read/write (R/W) objects
• Accessed by processes with ids 1, 2, …
• All communication through shared memory!
• Algorithms must be wait-free
  – Must tolerate any number of process (client) failures
  – Possible thanks to reliable shared memory
![Page 4: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/4.jpg)
4
Consensus in Shared Memory
• A shared object supporting a method decide(v_i), invoked by process i and returning a value d_i
• Satisfying:
  – Agreement: for all i and j, d_i = d_j
  – Validity: d_i = v_j for some j
  – Termination: decide returns
![Page 5: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/5.jpg)
5
Solving Consensus in/with Shared Memory
• Assume asynchronous shared memory system with atomic R/W registers
• Can we solve consensus?
  – Consensus is not solvable if even one process can fail. Shared-memory version of [FLP]: write stands for send, read for receive.
  – Yes, if no process can fail
  – Yes, with eventual synchrony or Ω
![Page 6: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/6.jpg)
6
Shared Memory (SM) Paxos
• Consensus
  – In asynchronous shared memory
  – Using wait-free regular R/W registers
  – And Ω (why?)
• Wait-free
  – Any number of processes may fail (t < n)
  – Unlike message-passing model (why?)
• Only the leader takes steps
![Page 7: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/7.jpg)
7
Regular Registers
• SM Paxos can use registers that provide weaker semantics than atomicity
• SWMR regular register: a read returns
  – Either a value written by an overlapping write, or
  – The register’s value before the first write that overlaps the read
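The definition above can be sketched as a function that lists every value a single read is allowed to return. This is a minimal illustration, assuming one writer whose writes are given as (start, end, value) intervals sorted by time; the names `Write` and `regular_read_values` are hypothetical.

```python
from typing import List, Set, Tuple

Write = Tuple[float, float, int]  # (start time, end time, value written)

def regular_read_values(initial: int, writes: List[Write],
                        read_start: float, read_end: float) -> Set[int]:
    """Values a read may return from a SWMR regular register.

    Assumes a single writer, so writes are sequential and sorted by time.
    """
    allowed = set()
    last_before = initial
    for start, end, value in writes:
        if end < read_start:
            last_before = value      # write completed before the read began
        elif start <= read_end:
            allowed.add(value)       # write overlaps the read
    allowed.add(last_before)         # value before the first overlapping write
    return allowed

# write(0) completes, then write(1) runs from time 2 to time 10.
history = [(0.0, 1.0, 0), (2.0, 10.0, 1)]
```

Each read is judged on its own, so two successive reads that both overlap write(1) may return 1 and then 0. An atomic register would forbid that order.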
![Page 8: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/8.jpg)
8
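Regular versus Atomic
[Figure: timeline. write(0) completes, then write(1) runs concurrently with two successive reads. The first read returns 1 (write(1) already happened); the second read returns 0. A regular register can return 0 here, but the execution is not linearizable.]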
![Page 9: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/9.jpg)
9
Variables
• Reminder: Paxos variables are:
  – BallotNum, AcceptVal, AcceptNum
• SM version uses shared SWMR regular registers:
  – x_i = ⟨bal, val, num, decision⟩_i for each process i
  – Initially ⟨⟨0,0⟩, ⊥, ⟨0,0⟩, ⊥⟩
  – Writeable by i, readable by all
• Each process keeps local variables b, v, n
  – Initially ⟨0,0⟩, ⊥, ⟨0,0⟩
![Page 10: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/10.jpg)
10
Reminder: Paxos Phase I
• if leader (by Ω) then
    BallotNum ← choose new unique ballot
    send (“prepare”, BallotNum) to all
• Upon receive (“prepare”, bal) from i
    if bal ≥ BallotNum then
      BallotNum ← bal
      send (“ack”, bal, AcceptNum, AcceptVal) to i
• Upon receive (“ack”, BallotNum, num, val) from n-t
    if all vals = ⊥ then myVal ← initial value
    else myVal ← received val with highest num
Note: the n-t that ack must not have moved on
![Page 11: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/11.jpg)
11
SM Paxos: Phase I
if leader (by Ω) then
  b ← choose new unique ballot
  write ⟨b, v, n, ⊥⟩ to x_i    (write is like sending to all)
  read all x_j’s    (read instead of waiting for acks)
  if some x_j.bal > b then start over    (no ack: someone moved on!)
  if all read x_j.val’s = ⊥ then v ← my initial value
  else v ← read val with highest num
Only b changed in this phase
![Page 12: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/12.jpg)
12
Phase I Summary
• Classical Paxos:
  – Leader chooses new ballot, sends to all
  – Others ack if they did not move on to a later ballot
  – If leader cannot get a majority, try again
  – Otherwise, move to Phase 2
• SM Paxos:
  – Leader chooses new ballot, writes its variable
  – Leader reads to check if anyone moved on to a later ballot
  – If anyone did move on, try again
  – Otherwise, move to Phase 2
![Page 13: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/13.jpg)
13
Reminder: Paxos Phase II
send (“accept”, BallotNum, myVal) to all
Upon receive (“accept”, b, v) with b ≥ BallotNum
  AcceptNum ← b; AcceptVal ← v
  send (“accept”, b, v) to all (first time only)
Upon receive (“accept”, b, v) from n-t
  decide v
  send (“decide”, v) to all
Note: accept messages change AcceptNum and AcceptVal, but only if the receiver did not move on yet
![Page 14: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/14.jpg)
14
SM Paxos: Phase II
Leader Cont’d
  n ← b
  write ⟨b, v, n, ⊥⟩ to x_i    (like sending “accept” to all; v, n changed in this phase)
  read all x_j’s    (read to see if all would have accepted this proposal; when don’t they?)
  if some x_j.bal > b then start over
  write ⟨b, v, n, v⟩ to x_i    (decide)
  return v
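The two leader phases can be put together in a single-threaded sketch. It is only an illustration under stated assumptions: the registers are a plain Python list, `None` stands for ⊥, ballots are (counter, process id) pairs, and the names `Block`, `RestartRound`, and `leader_attempt` are hypothetical. In a real run these steps interleave with those of other processes.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Ballot = Tuple[int, int]  # (counter, process id); tuple order gives ballot order

@dataclass(frozen=True)
class Block:
    bal: Ballot = (0, 0)
    val: Any = None        # None stands for "no value yet"
    num: Ballot = (0, 0)
    decision: Any = None

class RestartRound(Exception):
    """Raised where the slides say 'start over'."""

def leader_attempt(x: List[Block], i: int, counter: int, my_value: Any) -> Any:
    b = (counter, i)                       # new unique ballot
    mine = x[i]
    # Phase I: write the ballot, then read all registers.
    x[i] = Block(b, mine.val, mine.num, None)
    snapshot = list(x)
    if any(blk.bal > b for blk in snapshot):
        raise RestartRound                 # someone moved on to a later ballot
    accepted = [blk for blk in snapshot if blk.val is not None]
    v = max(accepted, key=lambda blk: blk.num).val if accepted else my_value
    # Phase II: write the proposal, read again, then write the decision.
    x[i] = Block(b, v, b, None)
    if any(blk.bal > b for blk in x):
        raise RestartRound
    x[i] = Block(b, v, b, v)
    return v
```

With three fresh registers, process 0 decides its own value, and a later leader with a higher ballot adopts that value instead of its own.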
![Page 15: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/15.jpg)
15
Why Read Twice?
[Figure: timelines of a leader with ballot b (write(b), read, write, read) and a competing leader (write(b’ > b), read). Annotations: “write(b’) did not complete” and “read does not see b’”.]
![Page 16: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/16.jpg)
16
Adding The Non-Leader Code
while (true)    (“start over” means go here)
  if leader (by Ω) then
    [ leader code from previous slides ]
  else
    read x_ld, where ld is leader
    if x_ld.decision ≠ ⊥ then
      return x_ld.decision
![Page 17: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/17.jpg)
17
Liveness
• The shared memory is reliable
• The non-leaders don’t write
  – They don’t even need to be “around”
• The leader only fails if another leader competes with it
  – Contention
  – By Ω, eventually only one leader will compete
  – In shared memory systems, Ω is called a contention manager
![Page 18: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/18.jpg)
18
Validity
• Leader always proposes its own value or one previously proposed by an earlier leader
  – Regular registers suffice
![Page 19: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/19.jpg)
19
Agreement
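[Figure: timeline of a leader with ballot b: write(b), read, write(v), read, write decision. Its read does not see any b’ > b, so no write(b’) for b’ > b completed before it. A later leader performs write(b’ > b), then a read that sees ⟨b, v⟩, and so it writes v.]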
![Page 20: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/20.jpg)
20
Agreement Proof Idea
• Look at the lowest ballot, b, in which some process decides a value, v
• By uniqueness of b, no other value is decided with b
• Prove by induction that every decision with a ballot b’ > b is v
• Homework: complete the proof
  – See argument in previous slide
  – See Byzantine Disk Paxos paper
![Page 21: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/21.jpg)
21
Termination
• When one correct leader exists
  – It eventually chooses a higher b than all those written before
  – No other process writes a higher ballot
  – So it does not start over, and hence decides
• Any number of processes can fail
• How can it be possible? Didn’t we show a majority of correct processes is needed?
![Page 22: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/22.jpg)
22
Optimization
• As in the message passing case….
• The first write does not write consensus values
• A leader running multiple consensus instances can perform the first write once and for all and then perform only the second write for each consensus instance
![Page 23: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/23.jpg)
23
Leases
• We need eventually accurate leader (Ω)
  – But what does this mean in shared memory?
• We would like to have mutual exclusion
  – Not fault-tolerant!
• Lease: fault-tolerant, time-based mutual exclusion
  – Live but not safe in eventual synchrony model
![Page 24: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/24.jpg)
24
Using Leases
• A client that has something to write tries to obtain the lease
  – Lease holder = leader
  – May fail…
• Example implementation:
  – Upon failure, backoff period
• Leases have limited duration, expire
• When is mutual exclusion guaranteed?
![Page 25: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/25.jpg)
25
Lock versus Lease
• Lock is blocking
  – Using locks is not wait-free
  – If lock holder fails, we’re in trouble
• Lease is non-blocking
  – Lease expires regardless of whether holder fails
• Lock is always safe
  – Never two lock-holders
• Lease is not
  – Two lease-holders possible due to asynchrony
  – OK for indulgent algorithms, like Paxos
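A lease can be sketched as a lock with an expiry time. This is a minimal illustration, not any particular system's lease protocol: the class `Lease` and its methods are hypothetical, and the current time is passed in explicitly so that the role of the clock is visible.

```python
from typing import Optional

class Lease:
    """Time-based mutual exclusion: the holder's right expires on its own."""

    def __init__(self, duration: float) -> None:
        self.duration = duration
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, client: str, now: float) -> bool:
        # Free, expired, or renewed by the current holder.
        if self.holder is None or now >= self.expires_at or self.holder == client:
            self.holder = client
            self.expires_at = now + self.duration
            return True
        return False   # caller should back off and retry later

    def is_held_by(self, client: str, now: float) -> bool:
        return self.holder == client and now < self.expires_at
```

Safety rests on every party agreeing on `now`. A holder whose clock runs slow, or whose steps are delayed, may still believe it holds a lease that has already been granted to someone else, which is the two-lease-holder case on the slide.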
![Page 26: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/26.jpg)
26
Disk Paxos
[Gafni,Lamport 00]
![Page 27: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/27.jpg)
27
Data-Centric Replication
• A fixed collection of persistent data items accessed by transient clients
• Data items have limited functionality
  – E.g., R/W registers, or
  – An object of a certain type
• Data items can fail
• Cannot communicate with one another
![Page 28: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/28.jpg)
28
System Model: Fault-Prone Memory
• n fault-prone shared-memory objects
  – Called base objects
  – Can be n servers or disks storing base objects
  – t out of n can fail
• m processes (clients)
  – Any number can fail (wait-free)
![Page 29: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/29.jpg)
29
What Is It Good For?
• Storage Area Networks (SAN)
  – “Brick” storage
  – Disk functionality is limited (R/W)
  – Disks cannot communicate with each other
  – Disks and disk servers can fail
• Large-scale client/server systems
  – Simple servers that do not communicate with each other scale better, manage load better
  – Servers can fail
![Page 30: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/30.jpg)
30
Disk Paxos
• Consensus using n ≥ 2t+1 fault-prone disks
  – Disks can incur crash failures
• Solution combines:
  – m-process shared memory Paxos and
  – ABD-like emulation of shared registers from fault-prone ones
![Page 31: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/31.jpg)
31
Disk Paxos Setting
[Figure: client processes issue R/W operations to a replicated data store.]
![Page 32: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/32.jpg)
32
Disk Paxos Data Structures
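[Figure: a grid of blocks, m processes (numbered 1 to 3) by n disks (numbered 1 to 5). Each block holds ⟨b, v, n, d⟩; the blocks of process 2 are marked x_2.]
Process i can write block[i][j], for each disk j, and can read all blocks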
![Page 33: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/33.jpg)
33
Read Emulation
• In order to read x_i
  – Issue read block[i][j], for each disk j
  – Wait for majority of disks to respond
  – Choose block with largest ⟨b, n⟩
• Is this enough?
• How did ABD’s read emulation work?
![Page 34: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/34.jpg)
34
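One Read Round Enough for Regular
[Figure: timeline. write(0) completes, then write(1) runs concurrently with two successive reads. The first read finds a copy that was written and returns 1; the second does not find a written copy and returns 0. Returning 0 is OK for regular.]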
![Page 35: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/35.jpg)
35
Write Emulation
• In order to write x_i
  – Issue write block[i][j], for each disk j
  – Wait for majority of disks to respond
• Is this enough?
• Homework: put everything together
  – Write complete Disk Paxos pseudo-code based on SM Paxos and R/W emulations
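The read and write emulations can be sketched over in-memory "disks". This is an illustration only, not the homework solution: `Disk`, `write_x`, and `read_x` are hypothetical names, a crashed disk answers immediately with a failure instead of staying silent, and each call contacts all disks synchronously rather than waiting for the first majority.

```python
from typing import Dict, List, Optional, Tuple

Ballot = Tuple[int, int]
Block = Tuple[Ballot, Optional[str], Ballot, Optional[str]]   # (b, v, n, d)
EMPTY: Block = ((0, 0), None, (0, 0), None)

class Disk:
    def __init__(self) -> None:
        self.blocks: Dict[int, Block] = {}   # process id -> that process's block
        self.crashed = False

    def write(self, i: int, block: Block) -> bool:
        if self.crashed:
            return False
        self.blocks[i] = block
        return True

    def read(self, i: int) -> Optional[Block]:
        return None if self.crashed else self.blocks.get(i, EMPTY)

def write_x(disks: List[Disk], i: int, block: Block) -> bool:
    """Write process i's block to every disk; succeed on a majority of acks."""
    acks = sum(1 for disk in disks if disk.write(i, block))
    return acks > len(disks) // 2

def read_x(disks: List[Disk], i: int) -> Optional[Block]:
    """Read process i's block from a majority; return the one with largest (b, n)."""
    replies = [blk for blk in (disk.read(i) for disk in disks) if blk is not None]
    if len(replies) <= len(disks) // 2:
        return None                          # no majority responded
    return max(replies, key=lambda blk: (blk[0], blk[2]))
```

Because any two majorities of disks share at least one disk, a read that follows a completed write finds at least one written copy even if the two operations reached different disks.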
![Page 36: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/36.jpg)
36
Quorum Systems
Generalization of Majority
![Page 37: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/37.jpg)
37
Why Majority?
• In indulgent algorithms (e.g., Paxos) we assumed a majority of the processes are correct
• But what we really need is:
  If Q1, Q2 are sets of processes s.t. liveness is guaranteed whenever all processes in P-Q1 or P-Q2 crash, then Q1 and Q2 intersect.
![Page 38: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/38.jpg)
38
1st Generalization: Weighted Voting [Gifford 79]
• Each process has a weight
  – Like share-holders in a corporation
• In order to make progress, need “votes” from a set of processes that have a majority of the weights (shares)
• Special cases:
  – Each process has weight 1 – majority
  – One process has all the weights – singleton
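The voting rule and its two special cases can be checked with a few lines of code. The function name `has_weighted_majority` and the sample weights are hypothetical.

```python
from typing import Dict, Set

def has_weighted_majority(weights: Dict[str, int], voters: Set[str]) -> bool:
    """True if the voters together hold more than half of the total weight."""
    total = sum(weights.values())
    return 2 * sum(weights[p] for p in voters) > total

equal = {"a": 1, "b": 1, "c": 1}        # weight 1 each: plain majority
dictator = {"a": 1, "b": 0, "c": 0}     # one process holds all the weight: singleton
```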
![Page 39: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/39.jpg)
39
Definition of Quorum System
• A quorum system over a universe U of n processes is a collection of subsets of U (called quorums) such that every two quorums intersect
• Examples:
  – Singleton: QS = {{p_i}}
  – Majority: QS = {Q ⊆ U : |Q| > n/2}
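The definition translates directly into a brute-force check that every two quorums intersect. The helper names `is_quorum_system` and `majority_quorums` are hypothetical, and enumerating all majorities is only practical for small n.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List

def is_quorum_system(quorums: Iterable[Iterable[int]]) -> bool:
    """True if all quorums are non-empty and every two of them intersect."""
    qs = [frozenset(q) for q in quorums]
    return all(qs) and all(a & b for a, b in combinations(qs, 2))

def majority_quorums(universe: Iterable[int]) -> List[FrozenSet[int]]:
    """All subsets Q of the universe with |Q| > n/2."""
    members = list(universe)
    n = len(members)
    return [frozenset(c)
            for size in range(n // 2 + 1, n + 1)
            for c in combinations(members, size)]
```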
![Page 40: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/40.jpg)
40
The Grid Quorum System
• A quorum consists of one row plus one cell from each row above it
p1 p2 p3 p4 p5
p6 p7 p8 p9 p10
p11 p12 p13 p14 p15
p16 p17 p18 p19 p20
p21 p22 p23 p24 p25
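The grid rule can be enumerated and checked by brute force. This sketch assumes the row-major numbering p1 to p25 shown above; `grid_quorums` and `pairwise_intersecting` are hypothetical names.

```python
from itertools import combinations, product
from typing import FrozenSet, List

def grid_quorums(rows: int, cols: int) -> List[FrozenSet[int]]:
    """All quorums: one full row plus one cell from each row above it."""
    def cell(r: int, c: int) -> int:
        return r * cols + c + 1            # p1..p(rows*cols), row-major

    quorums = []
    for r in range(rows):
        full_row = {cell(r, c) for c in range(cols)}
        # One column choice for each of the r rows above row r.
        for picks in product(range(cols), repeat=r):
            above = {cell(row, col) for row, col in enumerate(picks)}
            quorums.append(frozenset(full_row | above))
    return quorums

def pairwise_intersecting(quorums: List[FrozenSet[int]]) -> bool:
    return all(a & b for a, b in combinations(quorums, 2))
```

Two quorums that use the same full row share that row. Otherwise the quorum with the lower full row holds one cell from the other quorum's full row, so the two still meet.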
![Page 41: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/41.jpg)
41
Advantages of Quorum Systems
• Availability
  – Allow faulty/slow servers to be avoided (up to a certain threshold)
• Load balancing
  – Each server participates only in a fraction of quorums and therefore is accessed only a fraction of overall accesses
• Fundamental tradeoff: load vs. availability
![Page 42: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/42.jpg)
42
Coteries and Domination
• A coterie is a quorum system in which no quorum is a subset of another quorum
  – Obtained from a quorum system by removing supersets and keeping only minimal quorums
• A coterie QS dominates a coterie QS’ if QS ≠ QS’ and every quorum Q’ ∈ QS’ is a superset of some quorum Q ∈ QS
• A non-dominated coterie is one that no other coterie dominates
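Both notions can be written as small set computations. The names `to_coterie` and `dominates` are hypothetical, and the check that the two coteries differ follows the usual definition of domination.

```python
from typing import FrozenSet, Iterable, Set

def to_coterie(quorums: Iterable[Iterable[int]]) -> Set[FrozenSet[int]]:
    """Keep only minimal quorums: drop any quorum that strictly contains another."""
    qs = {frozenset(q) for q in quorums}
    return {q for q in qs if not any(other < q for other in qs)}

def dominates(qs: Iterable[Iterable[int]], qs_prime: Iterable[Iterable[int]]) -> bool:
    """True if qs differs from qs_prime and every quorum of qs_prime
    is a superset of some quorum of qs."""
    a = {frozenset(q) for q in qs}
    b = {frozenset(q) for q in qs_prime}
    return a != b and all(any(q <= q_prime for q in a) for q_prime in b)
```

For example, the coterie {{2}} dominates {{1,2},{2,3}}: whenever a quorum of the second system is alive, process 2 is alive, so the first system is at least as available.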
![Page 43: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/43.jpg)
43
Quorum Sizes
• Majority: O(n)
• Grid: O(Sqrt(n))
• Primary Copy: O(1)
• Weighted Majority: varies
![Page 44: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/44.jpg)
44
The Load of a Quorum System
• The probability of accessing the busiest server in the best case, i.e., using a strategy that minimizes the load, and when no failures occur
• An access strategy for QS is a probability distribution for accessing the quorums in QS
• The load of a server under a strategy is the probability that this server is in the accessed quorum
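The load induced by one given strategy can be computed directly from these definitions. The sketch below uses exact fractions; `strategy_load` and `uniform` are hypothetical names. The load of the quorum system itself is the minimum of this quantity over all strategies, which the code does not search for.

```python
from fractions import Fraction
from itertools import combinations
from typing import Dict, FrozenSet, List

def strategy_load(quorums: List[FrozenSet[int]],
                  probabilities: List[Fraction]) -> Fraction:
    """Load of the busiest server when quorums[k] is picked with probabilities[k]."""
    load: Dict[int, Fraction] = {}
    for quorum, p in zip(quorums, probabilities):
        for server in quorum:
            load[server] = load.get(server, Fraction(0)) + p
    return max(load.values())

def uniform(quorums: List[FrozenSet[int]]) -> List[Fraction]:
    return [Fraction(1, len(quorums))] * len(quorums)

# Minimal majorities of 5 servers: all 3-element subsets.
minimal_majorities = [frozenset(c) for c in combinations(range(5), 3)]
```

Picking uniformly among the minimal majorities of 5 servers puts each server in 6 of the 10 quorums, a load of 3/5. A singleton system always has load 1.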
![Page 45: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/45.jpg)
45
Availability of a Quorum System
• The resilience f of QS is the number of failures QS is guaranteed to survive
  – After f failures there is always a live quorum
• Failure probability
  – Assume that each server fails independently with probability p
  – F_p(QS) is the probability that all quorums in QS are hit, i.e., no quorum survives
![Page 46: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/46.jpg)
46
Examples
• Majority
  – Best availability (smallest failure probability) for p < ½
  – Worst availability for p > ½
  – Load is close to ½
• Singleton
  – F_p = p (optimal when p > ½)
  – Load is 1
• Grid
  – Load O(1/Sqrt(n))
  – Resilience of Sqrt(n)-1
  – Failure probability goes to 1 as n grows
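The majority and singleton claims can be checked numerically. For the majority system no quorum survives exactly when at most n/2 servers remain alive, which gives a binomial sum; the function name `majority_failure_probability` is hypothetical.

```python
from math import comb

def majority_failure_probability(n: int, p: float) -> float:
    """F_p for the majority system over n servers, each failing independently
    with probability p: the chance that at most n // 2 servers survive."""
    return sum(comb(n, failed) * p**failed * (1 - p)**(n - failed)
               for failed in range(n + 1)
               if n - failed <= n // 2)

def singleton_failure_probability(p: float) -> float:
    return p
```

With n = 3, the majority system fails with probability 0.028 when p = 0.1 (better than the singleton's 0.1) and with probability 0.972 when p = 0.9 (worse than the singleton's 0.9).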
![Page 47: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/47.jpg)
47
Quorum Replication
• Each operation accesses a quorum of replicas
• Generalization: Byzantine Quorums
  – Larger intersection
![Page 48: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/48.jpg)
48
Frangipani File System
Thekkath, Mann, and Lee, SOSP 1997
![Page 49: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/49.jpg)
49
Frangipani
• Scalable file system built at SRC-DEC
• Published in SOSP’97
• Uses failure detection, Paxos, leases,…
• Two layers:
  – Petal: virtual disk from many “storage bricks”
  – Frangipani file system and lock service
![Page 50: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/50.jpg)
50
Motivation
• Large-scale distributed file systems are hard to administer
• Hard to add/remove machines (servers)
• Hard to add/remove disks (storage space)
• Hard to manage set of current components
• Hard to manage locks
![Page 51: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/51.jpg)
51
Petal: Distributed Virtual Disks
C. A. Thekkath and E. K. Lee
Systems Research Center
Digital Equipment Corporation
ASPLOS’96
![Page 52: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/52.jpg)
52
Client’s View
![Page 53: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/53.jpg)
53
Petal Overview
• Petal provides virtual disks
  – Large (2^64 bytes), sparse virtual space
  – Disk storage allocated on demand
  – Accessible to all file servers over a network
• Virtual disks implemented by
  – Cooperating CPUs executing Petal software
  – Ordinary disks attached to the CPUs
  – A scalable interconnection network
![Page 54: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/54.jpg)
54
Petal Prototype
![Page 55: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/55.jpg)
55
Global State Management
• Uses Paxos
  – Global state is replicated across all servers
    • Metadata (disk allocation) only!
  – Consistent in the face of server and network failures
  – A majority is needed to update the global state
  – Any server can be added/removed in the presence of failed servers
![Page 56: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/56.jpg)
56
Key Petal Features
• Storage is incrementally expandable
• Data is optionally mirrored over multiple servers
• Metadata is replicated on all servers
• Transparent addition and deletion of servers
• Supports read-only snapshots of virtual disks
• Client API looks like block-level disk device
• Throughput
  – Scales linearly with additional servers
  – Degrades gracefully with failures
![Page 57: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/57.jpg)
57
Frangipani: A Scalable Distributed File System
C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center
Digital Equipment Corporation
SOSP’97
![Page 58: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/58.jpg)
58
Frangipani Features
• Behaves like a local file system
  – Multiple machines cooperatively manage a Petal disk
  – Users on any machine see a consistent view of data
• Exhibits good performance, scaling, and load balancing
• Easy to administer
![Page 59: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/59.jpg)
59
Ease of Administration
• Frangipani machines are modular
  – Can be added and deleted transparently
• Common free space pool
  – Users don’t have to be moved
• Automatically recovers from crashes
• Consistent backup without halting the system
![Page 60: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/60.jpg)
60
Frangipani Structure
• Distributed file system built atop a shared virtual disk (Petal)
• Frangipani servers do not communicate with each other directly
  – Only through Petal
• Simplifies management
  – Addition/removal of servers
![Page 61: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/61.jpg)
61
Frangipani Layering
![Page 62: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/62.jpg)
62
Standard Organization
![Page 63: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/63.jpg)
63
Components of Frangipani
• File system core
  – Implements the file system (FS) interface
  – Uses FS mechanisms (buffer cache etc.)
  – Exploits Petal’s large virtual space
• Locks with leases
  – Granted for finite time, must be refreshed
• Write-ahead redo log
  – Performance optimization + failure recovery
![Page 64: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/64.jpg)
64
Locks
• Multiple reader/single writer
• Granularity: lock per entire file or directory
• A lock is really a lease – it expires
  – After 30 seconds in their implementation
• Assumption?
![Page 65: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/65.jpg)
65
Using Locks
• Frangipani servers are clients of lock service
• Dirty data is written to disk (Petal) before the lock is given to another machine
• Locks are cached by servers that acquire them
  – Soft state: no need to explicitly release locks
  – Uses lease timeouts for lock recovery
![Page 66: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/66.jpg)
66
Distributed Lock Management
• A set of lock servers collaboratively manage locks
  – Run Paxos among them
  – Consensus on global state: set of locks each server is responsible for, list of current lock servers, lock allocation to clients
  – Need majority to make progress
• Using leases requires assuming loosely synchronized clocks
  – Expired leases should not be accepted
• Why Paxos then?
  – To overcome network partitions
![Page 67: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/67.jpg)
67
Logging
• Frangipani uses a write-ahead redo log for metadata
  – Log records are kept on Petal (why?)
• Data is written to Petal
  – On sync, fsync, or every 30 seconds
  – On lock revocation or when the log wraps
• Each server has a separate log
  – Reduces contention
  – Independent recovery
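The idea of a per-server write-ahead redo log kept on shared storage can be sketched as follows. This is not Frangipani's actual log format: `Server`, `recover`, the dictionary standing in for the shared disk, and the per-block version numbers used to make replay idempotent are all assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Versioned = Tuple[int, str]                  # (version, data)
LogRecord = Tuple[str, int, str]             # (block id, version, data)

class Server:
    def __init__(self, store: Dict[str, Versioned], log: List[LogRecord]) -> None:
        self.store = store                   # shared disk: block id -> (version, data)
        self.log = log                       # this server's redo log, also on shared disk
        self.cache: Dict[str, Versioned] = {}   # dirty metadata not yet written back

    def update(self, block_id: str, data: str) -> None:
        current = self.cache.get(block_id, self.store.get(block_id, (0, "")))
        version = current[0] + 1
        self.log.append((block_id, version, data))   # log first (write-ahead)
        self.cache[block_id] = (version, data)       # update in place later

    def flush(self) -> None:
        self.store.update(self.cache)
        self.cache.clear()

def recover(log: List[LogRecord], store: Dict[str, Versioned]) -> None:
    """Redo logged updates that never reached the store; safe to run twice."""
    for block_id, version, data in log:
        if store.get(block_id, (0, ""))[0] < version:
            store[block_id] = (version, data)
```

Because the log lives on the shared disk rather than on the failed machine, any other server can run `recover` on its behalf.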
![Page 68: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/68.jpg)
68
Recovery
• Recovery initiated due to failure detection
  – By the lock service
  – Failure detection implemented using heartbeats
• Any server can recover operations for a failed server
  – Log is available via Petal
![Page 69: 1 Principles of Reliable Distributed Systems Lecture 11: Disk Paxos, Quorum Systems, and Frangipani Spring 2008 Prof. Idit Keidar](https://reader034.vdocument.in/reader034/viewer/2022042702/56649d7a5503460f94a5f296/html5/thumbnails/69.jpg)
69
Conclusions
• Fault-tolerance in the real world
• Overcome crashes and network partitions using consensus-based replication
  – Paxos
• Un-contended good performance
  – Using locks
• Implement locks as leases for robustness• Logging for recovery