Post on 20-Dec-2015
Fault Tolerant Storage And Quorum Systems in Dynamic
Environments
Uri Nadav, Master's thesis
Advisor: Moni Naor
The Weizmann Institute of Science
Slide - 3
Goal
Distributed file storage system
Peer-to-peer environment: processors join and leave the system
Partial solutions:
Distributed file-sharing applications [Gnutella, Kazaa]
Distributed hash tables [DH, Chord, Viceroy]: store (key, value) pairs and perform lookup on key
Slide - 4
Fault-Tolerant Storage System
Censor: aims to eliminate access to some files
Design Goal:
A reader should be able to reconstruct each file with high probability after faults have been caused
Probability taken over coins of the writer and reader
Slide - 5
Adversarial Model
Adversary chooses the set of processors to crash
Fail-stop failures; we do not consider Byzantine failures
Different degrees of adaptiveness:
Non-adaptive adversary: the choice of faulty processors is not based on their content
Adversary with a limited number of queries: may query some processors
Slide - 6
Other Fault Models
Random faults model:
Examples: Distance Halving DHT, Chord
Standard technique: replication to log(n) processors assures survival with high probability
Adversarial faults [Fiat, Saia]:
A large fraction remains accessible after the adversary crashes a linear fraction of the processors
Still, a censor can target a specific file
Slide - 7
Measures of Quality
Read/write complexity: average number of processors accessed during a read/write operation
Number of rounds: number of rounds required from an adaptive reader
Blowup ratio: ratio between the total number of bits used for the storage of a file and its size
Slide - 8
Quorum Systems
Formal definition: $U$ is the universe and $F \subseteq 2^U$.
If $\forall A, B \in F:\ A \cap B \neq \emptyset$, then $F$ is called a quorum system and $A, B$ are called quorum sets.
Probabilistic $\varepsilon$-intersecting quorum system [Malkhi et al]: a strategy (distribution) $w$ over $2^U$.
Two sets $A, B$ drawn from the strategy $w$ intersect with probability at least $1 - \varepsilon$.
A quorum system is an intersecting family of sets over some universe
The set of processors from which a file is read must intersect the set of processors to which a file was written
Slide - 9
Storage system example
The $\varepsilon$-intersecting quorum system [Malkhi et al]
The quorum set is made of all sets of size -----
Pick one quorum uniformly at random
Intersection follows from the birthday paradox
Storage System:
Storage: A file is replicated to all members of a quorum set
Retrieval: Choose a quorum set and probe its members
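A minimal sketch of the example above. The quorum size did not survive on the slide, so the code assumes quorums of size c·√n for a hypothetical constant c, which is the regime where the birthday paradox applies. Function names are illustrative only. It computes the exact probability that a random read quorum misses a random write quorum and compares it with a Monte Carlo estimate.

```python
import math
import random

def exact_miss_probability(n, size):
    """Probability that two independent uniform size-`size` subsets of n items are disjoint."""
    return math.comb(n - size, size) / math.comb(n, size)

def estimate_miss_probability(n, size, trials, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        write_set = set(rng.sample(range(n), size))  # processors holding a replica
        read_set = rng.sample(range(n), size)        # processors probed by the reader
        if write_set.isdisjoint(read_set):
            misses += 1
    return misses / trials

n = 2500
c = 3                        # hypothetical constant, not taken from the slides
size = c * math.isqrt(n)     # assumed quorum size c * sqrt(n)
print(exact_miss_probability(n, size))      # bounded above by exp(-c * c)
print(estimate_miss_probability(n, size, trials=2000))
```

Each of the `size` factors in the exact expression is at most 1 - size/n, which gives the exp(-size²/n) bound used in the comment.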
Slide - 10
Properties of the Probabilistic Storage System
Pros:
Simplicity
Resilient against a linear number of faults, even if the processors are chosen by the adversary adaptively
Adapted to a dynamic environment [Abraham, Malkhi]
Cons:
High read/write complexity ( )
High blowup ratio ( )
Target: come up with a storage system with better parameters
Slide - 11
The Model
Non-adaptive adversary: chooses a set of processors (of linear size) without accessing any processor first.
Non-adaptive reader: processors are chosen without accessing any processor.
Theorem: A fault-tolerant storage system in the non-adaptive reader model that is resilient against $\Theta(n)$ faults cannot do better than the $\varepsilon$-intersecting storage system example.
Slide - 12
Lower Bounds for the Non-Adaptive Reader Model
Lower bound on the blowup ratio
Theorem: A system which tolerates $\Theta(n)$ faults, with ----- read complexity, has Blowup Ratio = -----
Lower bound on the read/write complexity
Theorem: A system which tolerates $\Theta(n)$ faults has Read Complexity $\cdot$ Write Complexity = ----
Formal definitions of the non-adaptive storage system
Slide - 13
Slightly Adaptive Adversary Model
Reader can make adaptive queries
Wish to have a small number of rounds, to shorten the time complexity of the read operation
Slightly adaptive adversary: the adversary has fewer adaptive queries than the reader
Fail-stop and not Byzantine faults
Slide - 14
Generic Storage Scheme
Storing a file:
Encode the file using a coding scheme with a constant blowup ratio (Reed-Solomon, IDA [Rabin])
Distribute to a set chosen by a write strategy with load ------ (optimal)
Retrieval of a file, after faults:
Find enough processors from the 'write set'
Decode the file using the coding scheme
Fault tolerance, with high probability:
The adversary doesn't find any element during the adaptive queries phase
At least half of the processors in a chosen write set survive
Any half of the processors in the write set can reconstruct the file
To instantiate a storage system, plug in a write strategy and a read algorithm
Load: maximal probability of a processor being chosen
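The "any half of the write set can reconstruct the file" step can be illustrated with a toy Reed-Solomon-style erasure code. This is only a sketch under stated assumptions: the field size `P = 257`, the systematic layout, and the names `encode` and `decode` are choices made here for illustration, not the scheme used in the thesis. A file of k bytes is treated as the values of a degree < k polynomial at the points 0..k-1 and is extended to 2k shares, one per processor in the write set.

```python
P = 257  # assumed prime field size, just large enough to hold byte values

def _lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through `points`, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, m):
    """Extend k data bytes to m <= P shares (x, y); the first k shares are the data itself."""
    points = list(enumerate(data))
    return [(x, _lagrange_eval(points, x)) for x in range(m)]

def decode(shares, k):
    """Rebuild the k data bytes from any k surviving shares."""
    points = shares[:k]
    return bytes(_lagrange_eval(points, x) for x in range(k))

data = b"quorum"
shares = encode(data, 2 * len(data))   # blowup ratio 2, ignoring the slightly larger symbols
survivors = shares[len(data):]         # suppose the first half of the write set crashed
print(decode(survivors, len(data)))
```

Any other choice of k surviving shares works the same way, because k points determine a polynomial of degree below k.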
Slide - 15
Choosing a write strategy
What about the random strategy of the example?
Not a good candidate: a read algorithm that finds a constant fraction of the surviving processors requires access to $\Omega(n)$ processors
We will present a strategy with ---------- read complexity and a logarithmic number of rounds, using the And-Or tree.
Slide - 16
The And-Or Tree Structure
Complete binary tree
Leaves represent processors
Inner nodes are AND/OR gates, in alternating layers
[Figure: And-Or tree over processors 1-16. The root is an AND gate, followed by layers of 2 OR, 4 AND, and 8 OR gates. Legend: AND/OR gate, processor.]
Slide - 17
The And-Or Tree Structure
Recursive definition of the ANDset and ORset collections
Recursive procedure for selecting a set
Write strategy: pick a set from the ANDset collection uniformly at random
Intersection property: a set from the ANDset collection and a set from the ORset collection intersect
[Figure: And-Or tree over processors 1-16 with the set {1, 3, 13, 16} marked. Layers alternate AND, OR, AND, OR. Legend: AND/OR gate, processor.]
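The recursive definition itself is not spelled out on the slide. The sketch below assumes the usual one for an And-Or tree: an ANDset takes both children at an AND gate and exactly one child at an OR gate, and an ORset does the opposite. The function name `select_sets` and the parameter `both_at` are illustrative. On the 16-processor tree with an AND root it enumerates both collections, checks the intersection property exhaustively, and computes the load of the uniform write strategy.

```python
from itertools import product

def select_sets(leaves, both_at, gate="AND"):
    """Enumerate leaf sets of a complete And-Or tree whose root is `gate`.

    At gates of type `both_at` a set continues into both children;
    at the other gate type it continues into exactly one child.
    """
    if len(leaves) == 1:
        return [frozenset(leaves)]
    mid = len(leaves) // 2
    child_gate = "OR" if gate == "AND" else "AND"
    left = select_sets(leaves[:mid], both_at, child_gate)
    right = select_sets(leaves[mid:], both_at, child_gate)
    if gate == both_at:
        return [a | b for a, b in product(left, right)]
    return left + right

processors = list(range(1, 17))
and_sets = select_sets(processors, both_at="AND")  # candidate write sets
or_sets = select_sets(processors, both_at="OR")    # candidate read sets

# Every ANDset meets every ORset.
intersecting = all(a & o for a in and_sets for o in or_sets)

# Load of the uniform strategy over ANDsets: the most likely processor's hit probability.
load = max(sum(p in a for a in and_sets) for p in processors) / len(and_sets)
print(len(and_sets), len(or_sets), intersecting, load)
```

Under this definition every set in either collection has 4 = √16 processors, and the uniform write strategy touches each processor with probability 1/4.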
Slide - 18
Adaptive Read Algorithm
Pick a set from the ORset collection to find an element from the write set
[Figure: And-Or tree over processors 1-16 with the write set and the set {1, 2, 5, 6} marked. Legend: AND/OR gate, processor.]
Slide - 19
Read Algorithm - Pruning the Tree…
To find the remaining items, the algorithm is recursively applied to the remaining subtrees
Total of ----- processor accesses during the read algorithm
[Figure: And-Or tree over processors 1-16 with the write set marked. Legend: AND/OR gate, processor.]
Slide - 20
Properties of the And-Or Storage System
Constant blowup ratio, write complexity and ---------- read complexity
Logarithmic number of rounds
Resilient against $\Theta(n)$ faults of a slightly adaptive adversary
Cannot expect anything much better in terms of read/write tradeoff!
Slide - 21
Early Stopping When Fewer Faults Occur
Drawback: the read complexity is high even when no faults occur
Dynamic read complexity: when up to t faults occur, the read complexity is -----
Price: a logarithmic instead of a constant blowup ratio
Slide - 22
Dynamically Adjusting to the Number of Faults
[Figure: And-Or tree with alternating AND and OR layers.]
Each Node represents a processor, not only leaves
Redefine the ANDset collection so that each set includes all the visited nodes
The size of a set in the collection remains ------
Slide - 23
Where do we stand?
And-Or for static network
Ignored routing scheme
Adaptation of the And-Or storage system
Storage coupled with the routing
Use the distance-halving network [Naor-Wieder]
Next: Dynamic Environment
Slide - 24
Dynamic Hash Tables
The continuous space is partitioned locally (on the fly) into cells corresponding to processors
Each point in [0,1) is covered by exactly one processor
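A minimal sketch of such a partition, assuming each cell is a half-open segment that starts at its processor's point and ends at the next processor's point. The representation (a sorted list of left endpoints) and the names `owner` and `join` are hypothetical. It shows that a lookup is a binary search and that a join changes only the cell it splits.

```python
import bisect

def owner(starts, point):
    """Index of the processor whose cell [starts[i], starts[i+1]) contains `point`."""
    return bisect.bisect_right(starts, point) - 1

def join(starts, new_start):
    """A joining processor splits the cell that contains `new_start`; other cells are untouched."""
    bisect.insort(starts, new_start)

starts = [0.0, 0.25, 0.5]       # three processors covering [0, 1)
before = owner(starts, 0.45)    # covered by the processor whose cell starts at 0.25
join(starts, 0.4)               # a newcomer takes over [0.4, 0.5)
after = owner(starts, 0.45)     # now covered by the newcomer
print(before, after, starts)
```

Indices are positions in the sorted list, so they shift when a processor joins; a real system would key cells by processor identity instead.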
Slide - 25
The Distance Halving Network [Naor, Wieder]
Continuous graph
Nodes: the [0,1) interval
Edges: Left and right outgoing edges
Each point is the root of a binary tree subgraph
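The tree rooted at a point can be sketched as follows, assuming the usual distance-halving edge maps, left(x) = x/2 and right(x) = x/2 + 1/2 (the slide names the edges but does not give the formulas). Exact fractions are used so the spacing can be checked without rounding error; `tree_level` is an illustrative name.

```python
from fractions import Fraction

def left(x):
    """Left outgoing edge of the continuous graph (assumed form)."""
    return x / 2

def right(x):
    """Right outgoing edge of the continuous graph (assumed form)."""
    return x / 2 + Fraction(1, 2)

def tree_level(root, depth):
    """All points at the given depth of the binary tree rooted at `root`."""
    level = [root]
    for _ in range(depth):
        level = [child for x in level for child in (left(x), right(x))]
    return level

leaves = sorted(tree_level(Fraction(1, 3), 3))
gaps = {b - a for a, b in zip(leaves, leaves[1:])}
print(leaves)
print(gaps)   # a single gap value: the leaves are evenly spaced
```

At depth d the tree has 2^d points spaced exactly 2^-d apart, whatever the root, so a tree of depth log n spreads its leaves evenly over [0,1).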
Slide - 26
The Distance Halving Network [Naor, Wieder]
Connect two processors if their respective cells are connected in the continuous graph
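As a sketch of this discretization, assume cells are half-open segments given by a sorted list of left endpoints, and that the continuous edges are x → x/2 and x → x/2 + 1/2. The helper names are hypothetical, and only outgoing edges are computed. A processor's neighbors are then the processors whose cells intersect the images of its own cell.

```python
import bisect

def covering(starts, lo, hi):
    """Indices of the cells [starts[i], starts[i+1]) that intersect [lo, hi)."""
    first = bisect.bisect_right(starts, lo) - 1   # cell containing lo
    last = bisect.bisect_left(starts, hi) - 1     # last cell starting strictly below hi
    return set(range(first, last + 1))

def out_neighbors(starts, i):
    """Processors reached from cell i by the left and right continuous edges."""
    a = starts[i]
    b = starts[i + 1] if i + 1 < len(starts) else 1.0
    left_image = covering(starts, a / 2, b / 2)
    right_image = covering(starts, a / 2 + 0.5, b / 2 + 0.5)
    return left_image | right_image

starts = [0.0, 0.25, 0.5, 0.75]   # four equal cells, a perfectly balanced network
for i in range(len(starts)):
    print(i, sorted(out_neighbors(starts, i)))
```

With balanced cells each image of a cell is half as long as a cell, so it overlaps only a constant number of other cells and the out-degree stays constant.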
Slide - 27
Embedding the Storage System
The binary tree: a subgraph of the continuous graph
Well defined for each point
Depth log n
Edges covered by network connections
Gossip protocols
Each file has a different tree
Slide - 28
Storage Through Gossip
Data percolate using DH edges for log(n) steps
After a single write operation the writer is done, and the file can already be retrieved
Fault tolerance is built during gossip: when messages reach the nodes in the $i$th level, the file is $\Theta(2^i)$ fault-tolerant
Slide - 29
Retrieval
Uses the routing protocol of the DH network
Routing dilation is $O(\log n)$
Total time for retrieval is $O(\log^2 n)$
Read complexity can be dynamically adjusted to the number of faults
Store in every processor visited
Slide - 30
Fault-Tolerance
Balanced network: a processor covers a segment of size $O(1/n)$
Various balancing techniques (Manku; Karger and Ruhl; Naor and Wieder; Abraham et al)
Theorem: When the network is balanced, the system is $\Theta(n)$ fault-tolerant
Slide - 31
Open Questions
Do the lower bounds shown when both the reader and the adversary are non-adaptive hold when both are adaptive?
Is there a fault-tolerant storage system in the adaptive reader model with o(log(n)) rounds?
Slide - 32
Summary
The probabilistic solution is optimal in the non-adaptive reader model
The And-Or storage system: constant blowup ratio, almost optimal read/write complexity
Adaptation of the storage system to a dynamic environment: storage uses the network topology; when the system is balanced it maintains fault tolerance
Slide - 33
Agenda
Fault-Tolerant Storage System: fighting censors
The And-Or Quorum System: static case, dynamic networks
Quorum systems are important beyond their application in storage (mutual exclusion, load balancing, access control…)
Slide - 34
Measures of Quality
Load:
Load of a strategy: maximal probability of a processor being chosen
Load of a system: minimum over all strategies
Availability: probability that all quorums are hit under random faults
Probe complexity: number of probes required to obtain a live quorum w.h.p.
Slide - 35
The And-Or Quorum System
Known properties [Naor, Wool]: optimal load, high availability
Our contribution:
Static network: optimal non-adaptive algorithm, optimal adaptive algorithm
Construction in a dynamic network
Slide - 38
Dynamic Quorum System
The universe constantly changes
Two challenges:
Integrity: intersection property; combinatorial structure and properties
Locality: local way to access a quorum
Slide - 39
Dynamic And-Or
Embedding of a binary tree
DH graph: left and right children define a tree on each point
Leaves divide [0,1) equally
A quorum of processors is the set of processors that covers the points in a quorum
Slide - 40
Dynamic And-Or
Locality: natural gossip protocol
Integrity: when the network grows or shrinks, members of quorums gossip themselves to their children/parent in the continuous graph
Network connections cover edges in the continuous graph
Slide - 41
Load
A processor is chosen when the leaves it covers are chosen
Optimal load on leaves
Balanced network
Induced optimal load on processors
Slide - 42
Availability of the Dynamic Quorum
Static case:
Global failure probability decays exponentially
Processor fails with probability < 0.25
Dynamic case:
Problem in the analysis: faults are not independent
When the network is balanced, two leaves are dependent only if they are covered by the same processor
Constant number of dependent faults
Domination by a product measure
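For the static case, the decay can be checked numerically with a short sketch. It assumes a complete And-Or tree whose leaves fail independently with probability p, and takes "failure" to mean that the tree evaluates to 0, so that no fully live ANDset exists. The function name and the default AND root are illustrative choices.

```python
def failure_probability(p, height, root_gate="AND"):
    """Probability that the tree evaluates to 0 when each leaf fails independently with probability p."""
    other = {"AND": "OR", "OR": "AND"}
    q = p  # failure probability of a single leaf
    for depth in range(height - 1, -1, -1):   # gate layers, from the bottom up
        gate = root_gate if depth % 2 == 0 else other[root_gate]
        if gate == "OR":
            q = q * q                # an OR gate fails only if both children fail
        else:
            q = 1 - (1 - q) ** 2     # an AND gate fails if either child fails
    return q

for height in (4, 8, 12):
    print(2 ** height, failure_probability(0.2, height))
```

The dynamic case does not satisfy the independence assumption made here, which is why the slides turn to domination by a product measure.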
Slide - 43
Domination by a Product Measure
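Finite set $S$
Space of configurations: $\Omega = \{0,1\}^S$
Partial order: for $\omega_1, \omega_2 \in \Omega$, $\omega_1 \geq \omega_2$ if $\forall s \in S$, $\omega_1(s) \geq \omega_2(s)$
A function $f$ is increasing if $\omega_1 \geq \omega_2 \Rightarrow f(\omega_1) \geq f(\omega_2)$
Product measure $\pi_p$: $\forall s \in S$, $\Pr[\omega(s) = 0] = p$, with $\omega(s)$ independent of all others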
Slide - 44
Domination by a Product Measure
$\mu, \nu$: probability measures on $\Omega$
$\mu$ dominates $\nu$ ($\nu \preceq \mu$) if for every increasing $f$, $E_\nu(f) \leq E_\mu(f)$
[Liggett et al]: If $\forall s \in S$, $\Pr[\omega(s) = 0] < p$ and this event is dependent on at most $k$ other such events (where $k$ is a constant), then $\exists p' < p$ s.t. $\pi_{p'} \preceq \mu$
By decreasing $p$, $p'$ can be made arbitrarily close to 0
Slide - 45
Availability of the Dynamic And-Or
$S$ is the set of leaves, $\Omega$ the configurations
$\mu$: probability measure induced by random faults on processors
Balanced network: limited independence
$\mu$ dominates a product measure $\pi_{p'}$
When $p' < 0.25$, $F_{p'} \leq O(\exp(-n^{0.5}))$
Slide - 46
Probe Complexity of the Dynamic And-Or
Nonadaptive
Subtrees are not independent
Positively correlated
Adaptive
Expected constant height for local subtrees
Expected number of probes
Markov: optimal probe complexity with probability $1 - o(1)$
Slide - 47
Other Dynamic Quorum Systems
Dynamic Probabilistic QS [Abraham, Malkhi]
Random walk
Very high availability, for arbitrary failure probability
Higher load
Dynamic Paths [Naor, Wieder]
Emulate the Paths quorum system
Voronoi diagram
High availability, for failure probability < 0.5
Slower adaptive algorithm
Slide - 48
Summary
Non-adaptive and adaptive algorithms for And-Or
Optimal
Adaptive case: excellent time complexity
Adaptation over a dynamic overlay network
Optimal load and probe complexity, and high availability
Domination by a product measure
Slide - 49
Open Questions (on Quorum Systems)
Lower bound to the adaptive algorithmic probe complexity
Better analysis of the adaptive algorithm for dynamic network
Slide - 50
Elementary Storage System
Write strategy $w$: distribution on $\mathbb{N}^n$
Encoder: $E(f, q_w) \to (x_1, \ldots, x_n)$, with $q_w$ chosen by $w$
Read strategy $r$: distribution on $\{0,1\}^n$
Decoder: $D(x_1, \ldots, x_n) \to \{0,1\}^k$
Slide - 51
Reconstruction
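Decoding a previously encoded file: $D(\Pi(E(f, q_w), q_r)) = f$
$(\varepsilon, k)$-storage system: $\forall f \in \{0,1\}^k$, $\Pr[D(\Pi(E(f, q_w), q_r)) = f] > 1 - \varepsilon$, with $q_w, q_r$ chosen from $w, r$
$\Pi$: projection that masks unread processors, e.g. $\Pi((x_1, x_2, x_3, x_4), (1, 1, 0, 1)) = (x_1, x_2, \bot, x_4)$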
Slide - 52
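$(\varepsilon, k)$-Intersection Property
Write strategy $w$, read strategy $r$
The pair $(w, r)$ satisfies the $(\varepsilon, k)$-intersection property if $\Pr[\langle q_w, q_r \rangle > k] > 1 - \varepsilon$, where $\langle q_w, q_r \rangle$ is the number of bits read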
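The quantities in this definition can be sketched in a few lines. The sketch assumes that entry i of the write vector q_w is the number of bits stored on processor i and that the read vector q_r is a 0/1 mask over processors. The function names and the use of `None` for a masked entry are illustrative choices, not notation from the slides.

```python
import random

def project(memories, q_r):
    """Mask unread processors: keep x_i where q_r[i] == 1, otherwise None."""
    return tuple(x if bit else None for x, bit in zip(memories, q_r))

def bits_read(q_w, q_r):
    """Inner product <q_w, q_r>: total bits stored on the processors that are read."""
    return sum(w * r for w, r in zip(q_w, q_r))

def intersection_probability(sample_w, sample_r, k, trials, seed=0):
    """Monte Carlo estimate of Pr[<q_w, q_r> > k] for samplers taking a random.Random."""
    rng = random.Random(seed)
    hits = sum(bits_read(sample_w(rng), sample_r(rng)) > k for _ in range(trials))
    return hits / trials

masked = project(("x1", "x2", "x3", "x4"), (1, 1, 0, 1))
total = bits_read((8, 0, 8, 8), (1, 1, 0, 1))
print(masked, total)
```

A decoder can only succeed on a k-bit file if it reads enough bits, which is the intuition behind requiring the inner product to exceed a threshold with high probability.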
Slide - 53
Storage System Characterization
Theorem: Let $S = (w, E, r, D)$ be an $(\varepsilon, k)$-storage system. Then $w, r$ maintain the $(2\varepsilon, k)$-intersection property.
Slide - 54
Error Correcting Codes
View storage-system as coding scheme:
Message: files concatenated
Codeword: Processors’ memories concatenated
Worst case Faults-Model
Adversary “knows” the content of each processor
Slide - 55
Locally Decodable Codes
Decode a single symbol instead of the whole message
No need to read the whole codeword(?)
Rates:
No linear-rate code for a constant number of queries [Katz, Trevisan]
Exponential lower bound for 2 queries [Goldreich et al],[Wolf, Ker’]
Linear rate for a polynomial number of queries: multivariate code [Reed-Muller]