What’s the Difference? Efficient Set Reconciliation without Prior Context
Frank Uyeda, University of California, San Diego
David Eppstein, Michael T. Goodrich & George Varghese
Motivation
• Distributed applications often need to compare remote state.
• Must solve the Set-Difference Problem!
[Figure: replicas R1 and R2 compare state when a partition heals.]
What is the Set-Difference problem?
• What objects are unique to host 1?
• What objects are unique to host 2?
[Figure: Host 1 holds {A, B, E, F}; Host 2 holds {A, C, D, F}.]
Example 1: Data Synchronization
• Identify missing data blocks
• Transfer blocks to synchronize sets
[Figure: Host 1 holds {A, B, E, F} and receives C and D; Host 2 holds {A, C, D, F} and receives B and E.]
Example 2: Data De-duplication
• Identify all unique blocks.
• Replace duplicate data with pointers
[Figure: Host 1 holds {A, B, E, F}; Host 2 holds {A, C, D, F}.]
Set-Difference Solutions
• Trade a sorted list of objects.
  – O(n) communication, O(n log n) computation
• Approximate Solutions:
  – Approximate Reconciliation Tree (Byers)
    • O(n) communication, O(n log n) computation
• Polynomial Encodings (Minsky & Trachtenberg)
  – Let “d” be the size of the difference
  – O(d) communication, O(dn + d^3) computation
• Invertible Bloom Filter
  – O(d) communication, O(n + d) computation
Difference Digests
• Efficiently solves the set-difference problem.
• Consists of two data structures:
  – Invertible Bloom Filter (IBF)
    • Efficiently computes the set difference.
    • Needs the size of the difference
  – Strata Estimator
    • Approximates the size of the set difference.
    • Uses IBF’s as a building block.
Invertible Bloom Filters (IBF)
• Encode local object identifiers into an IBF.
[Figure: Host 1 encodes {A, B, E, F} into IBF 1; Host 2 encodes {A, C, D, F} into IBF 2.]
IBF Data Structure
• Array of IBF cells
  – For a set difference of size d, require αd cells (α > 1)
• Each ID is assigned to many IBF cells
• Each IBF cell contains:
  – idSum: XOR of all ID’s in the cell
  – hashSum: XOR of hash(ID) for all ID’s in the cell
  – count: Number of ID’s assigned to the cell
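The three cell fields can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Cell` class, the `check_hash` helper, the use of BLAKE2b, and the 64-bit integer IDs are all assumptions made for the example.

```python
from dataclasses import dataclass
import hashlib


def check_hash(key: int) -> int:
    """Hypothetical 64-bit H(ID); any well-mixed hash shared by all hosts would do."""
    digest = hashlib.blake2b(key.to_bytes(8, "big"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


@dataclass
class Cell:
    id_sum: int = 0    # XOR of all IDs assigned to the cell
    hash_sum: int = 0  # XOR of H(ID) for all IDs assigned to the cell
    count: int = 0     # number of IDs assigned to the cell

    def add(self, key: int) -> None:
        self.id_sum ^= key
        self.hash_sum ^= check_hash(key)
        self.count += 1


# Two IDs land in the same cell: the sums mix, the count records how many.
cell = Cell()
for key in (0xA, 0xB):
    cell.add(key)
```

Because XOR is its own inverse, adding the same ID a second time would restore `id_sum` and `hash_sum` to their previous values, which is what later makes subtraction possible.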
IBF Encode
• Assign ID to many cells (Hash1, Hash2, Hash3).
• “Add” ID to cell: idSum ⊕ A, hashSum ⊕ H(A), count++
• The IBF has αd cells. Not O(n), like Bloom Filters!
• All hosts use the same hash functions
[Figure: ID A is hashed by Hash1, Hash2 and Hash3 into three cells of the IBF; IDs B and C follow.]
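A sketch of the encode step in Python. The helper names (`_h`, `cell_indices`, `encode`), the BLAKE2b hash, and the choice to give each hash function its own slice of the table (so the K cells of an ID are always distinct) are assumptions of this example, not details taken from the slides.

```python
import hashlib

K = 3  # number of hash functions (assumed here, matching Hash1..Hash3)


def _h(key, salt):
    """Hypothetical salted 64-bit hash; every host must use the same one."""
    digest = hashlib.blake2b(key.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def cell_indices(key, n_cells):
    """Pick K distinct cells: hash function i only maps into the i-th slice."""
    width = n_cells // K
    return [i * width + _h(key, b"idx%d" % i) % width for i in range(K)]


def encode(keys, n_cells):
    """Build an IBF as three parallel arrays of length n_cells."""
    id_sum = [0] * n_cells
    hash_sum = [0] * n_cells
    count = [0] * n_cells
    for key in keys:
        check = _h(key, b"check")
        for i in cell_indices(key, n_cells):
            id_sum[i] ^= key       # idSum ⊕ ID
            hash_sum[i] ^= check   # hashSum ⊕ H(ID)
            count[i] += 1          # count++
    return id_sum, hash_sum, count


# The table size follows the expected difference d, not the set size n:
# 1000 IDs are folded into only 30 cells.
id_sum, hash_sum, count = encode(range(1, 1001), n_cells=30)
```

With 1000 IDs in 30 cells the structure is far too crowded to decode on its own; it only becomes useful after subtracting a remote IBF that shares most of the same IDs.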
Invertible Bloom Filters (IBF)
• Trade IBF’s with remote host
[Figure: Host 1 and Host 2 exchange IBF 1 and IBF 2.]
Invertible Bloom Filters (IBF)
• “Subtract” IBF structures
  – Produces a new IBF containing only unique objects
[Figure: IBF 2 minus IBF 1 yields IBF (2 - 1).]
IBF Subtract: Timeout for Intuition
• After subtraction, all elements common to both sets have disappeared. Why?
  – Any common element (e.g., W) is assigned to the same cells on both hosts (assume same hash functions on both sides)
  – On subtraction, W XOR W = 0. Thus, W vanishes.
• While elements in the set difference remain, they may be randomly mixed, so we need a decode procedure.
Invertible Bloom Filters (IBF)
• Decode resulting IBF
  – Recover object identifiers from IBF structure.
[Figure: decoding IBF (2 - 1) recovers B and E (unique to Host 1) and C and D (unique to Host 2).]
IBF Decode
• Test for Purity: H( idSum ) = hashSum
  – Pure cell holding only V: H(V) = H(V)
  – Impure cell holding V, X and Z: H(V ⊕ X ⊕ Z) ≠ H(V) ⊕ H(X) ⊕ H(Z)
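The purity test drives a peeling loop: find a pure cell, emit its ID, remove that ID from all of its cells, and repeat. The sketch below is one way to write that loop in Python; the class layout, method names, hash helper, and the requirement that a pure cell also has a count of +1 or -1 are assumptions of this example. The 300-cell table is deliberately far larger than a real αd table so that the toy run does not hinge on hash collisions.

```python
import hashlib

K = 3  # number of hash functions (assumed)


def _h(key, salt):
    """Hypothetical salted 64-bit hash shared by both hosts."""
    digest = hashlib.blake2b(key.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


class IBF:
    def __init__(self, n_cells):
        self.n = n_cells
        self.id_sum = [0] * n_cells
        self.hash_sum = [0] * n_cells
        self.count = [0] * n_cells

    def indices(self, key):
        width = self.n // K
        return [i * width + _h(key, b"idx%d" % i) % width for i in range(K)]

    def update(self, key, delta):
        for i in self.indices(key):
            self.id_sum[i] ^= key
            self.hash_sum[i] ^= _h(key, b"check")
            self.count[i] += delta

    def subtract(self, other):
        out = IBF(self.n)
        out.id_sum = [a ^ b for a, b in zip(self.id_sum, other.id_sum)]
        out.hash_sum = [a ^ b for a, b in zip(self.hash_sum, other.hash_sum)]
        out.count = [a - b for a, b in zip(self.count, other.count)]
        return out

    def is_pure(self, i):
        """Purity test: H(idSum) == hashSum, with exactly one ID left in the cell."""
        return (self.count[i] in (1, -1)
                and _h(self.id_sum[i], b"check") == self.hash_sum[i])

    def decode(self):
        """Peel pure cells in place; returns (success, only_in_self, only_in_other)."""
        only_self, only_other = set(), set()
        pending = [i for i in range(self.n) if self.is_pure(i)]
        while pending:
            i = pending.pop()
            if not self.is_pure(i):      # already emptied by an earlier peel
                continue
            key, sign = self.id_sum[i], self.count[i]
            (only_self if sign == 1 else only_other).add(key)
            self.update(key, -sign)      # remove the key from all of its cells
            pending.extend(j for j in self.indices(key) if self.is_pure(j))
        success = not (any(self.id_sum) or any(self.hash_sum) or any(self.count))
        return success, only_self, only_other


host1 = [ord(c) for c in "ABEF"]
host2 = [ord(c) for c in "ACDF"]
ibf1, ibf2 = IBF(300), IBF(300)
for key in host1:
    ibf1.update(key, +1)
for key in host2:
    ibf2.update(key, +1)
ok, only_host2, only_host1 = ibf2.subtract(ibf1).decode()
print(ok, sorted(map(chr, only_host1)), sorted(map(chr, only_host2)))
```

A positive count marks an ID present only in the IBF on the left of the subtraction and a negative count marks one present only on the right, which is how a single decode separates the two directions of the difference.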
How many IBF cells?
[Figure: space overhead needed to decode at >99% versus set difference, for hash counts 3 and 4. Small diffs: 1.4x – 2.3x. Large differences: 1.25x – 1.4x.]
How many hash functions?
• 1 hash function produces many pure cells initially but nothing to undo when an element is removed.
• Many (say 10) hash functions: too many collisions.
• We find by experiment that 3 or 4 hash functions work well. Is there some theoretical reason?
[Figure: IDs A, B and C assigned to cells under one, many, and a few hash functions.]
Theory
• Let d = difference size, k = # hash functions.
• Theorem 1: With (k + 1) d cells, failure probability falls exponentially.
  – For k = 3, implies a 4x tax on storage, a bit weak.
• [Goodrich, Mitzenmacher]: Failure is equivalent to finding a 2-core (loop) in a random hypergraph
• Theorem 2: With c_k d cells, failure probability falls exponentially
  – c_4 = 1.3x tax, agrees with experiments
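As a worked instance of the two bounds, take d = 100. This value is chosen here purely for illustration; the constant 1.3 for four hash functions is the one quoted in Theorem 2.

```latex
\begin{aligned}
\text{Theorem 1, } k = 3:&\quad (k + 1)\,d = 4 \times 100 = 400 \text{ cells} \\
\text{Theorem 2, } k = 4:&\quad c_4\,d = 1.3 \times 100 = 130 \text{ cells}
\end{aligned}
```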
Connection to Coding
• Mystery: IBF decode similar to peeling procedure used to decode Tornado codes. Why?
• Explanation: Set Difference is equivalent to coding with insert-delete channels
• Intuition: Given a code for set A, send codewords only to B. Think of B’s set as a corrupted form of A’s.
• Reduction: If code can correct D insertions/deletions, then B can recover A and the set difference.
Reed Solomon <---> Polynomial Methods
LDPC (Tornado) <---> Difference Digest
Difference Digests
• Consists of two data structures:
  – Invertible Bloom Filter (IBF)
    • Efficiently computes the set difference.
    • Needs the size of the difference
  – Strata Estimator
    • Approximates the size of the set difference.
    • Uses IBF’s as a building block.
Strata Estimator
• Consistent partitioning: divide keys into partitions, with partition k containing ~1/2^k of the keys.
• Encode each partition into an IBF of fixed size
  – log(n) IBF’s of ~80 cells each
[Figure: keys A, B and C are partitioned into strata holding ~1/2, ~1/4, ~1/8 and 1/16 of the keys, stored as IBF 1 to IBF 4 of the estimator.]
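One way to obtain a consistent partitioning with these fractions is to count the trailing zero bits of a hash of the key, so both hosts place the same key in the same stratum. The sketch below assumes that scheme; the function names, the BLAKE2b hash, and the 0-based stratum index (stratum 0 corresponds to IBF 1 on the slide) are choices made for this example.

```python
import hashlib
from collections import Counter


def _h(key, salt):
    """Hypothetical salted 64-bit hash shared by both hosts."""
    digest = hashlib.blake2b(key.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def stratum(key, n_strata):
    """Stratum index = number of trailing zero bits of the key's hash.

    Stratum 0 receives about 1/2 of the keys, stratum 1 about 1/4, and so on;
    the last stratum absorbs the remaining tail.
    """
    hv = _h(key, b"strata")
    trailing_zeros = (hv & -hv).bit_length() - 1 if hv else n_strata - 1
    return min(trailing_zeros, n_strata - 1)


# Count how many of 100,000 keys fall into each of 16 strata.
sizes = Counter(stratum(key, 16) for key in range(100_000))
```

Because the stratum depends only on the key, a key that exists on both hosts lands in the same IBF on both sides and cancels on subtraction.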
Strata Estimator
• Attempt to subtract & decode IBF’s at each level.
• If level k decodes, then return: 2^k x (the number of ID’s recovered)
• What about the other strata?
[Figure: Host 1 and Host 2 exchange Estimator 1 and Estimator 2 (IBF 1 to IBF 4 each); the decoded level is scaled by 4x.]
Strata Estimator
• Observation: Extra partitions hold useful data
• Sum elements from all decoded strata & return: 2^(k-1) x (the number of ID’s recovered)
[Figure: every stratum from the decoded level onward is subtracted and decoded between Host 1 and Host 2; the summed count is scaled by 2x.]
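Putting the pieces together, the estimator walks from the sparsest stratum toward the densest, sums what it can decode, and scales up when a stratum fails. The Python sketch below is an illustration under assumptions: the `IBF` class, the trailing-zero partitioning, the 0-based stratum indices, and all names are hypothetical, not the authors' code.

```python
import hashlib

K = 3  # number of hash functions (assumed)


def _h(key, salt):
    digest = hashlib.blake2b(key.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


class IBF:
    def __init__(self, n_cells):
        self.n = n_cells
        self.id_sum, self.hash_sum, self.count = ([0] * n_cells for _ in range(3))

    def indices(self, key):
        width = self.n // K
        return [i * width + _h(key, b"idx%d" % i) % width for i in range(K)]

    def update(self, key, delta):
        for i in self.indices(key):
            self.id_sum[i] ^= key
            self.hash_sum[i] ^= _h(key, b"check")
            self.count[i] += delta

    def subtract(self, other):
        out = IBF(self.n)
        out.id_sum = [a ^ b for a, b in zip(self.id_sum, other.id_sum)]
        out.hash_sum = [a ^ b for a, b in zip(self.hash_sum, other.hash_sum)]
        out.count = [a - b for a, b in zip(self.count, other.count)]
        return out

    def is_pure(self, i):
        return self.count[i] in (1, -1) and _h(self.id_sum[i], b"check") == self.hash_sum[i]

    def decode(self):
        """Peel pure cells; returns (success, number of IDs recovered)."""
        recovered = 0
        pending = [i for i in range(self.n) if self.is_pure(i)]
        while pending:
            i = pending.pop()
            if not self.is_pure(i):
                continue
            key = self.id_sum[i]
            self.update(key, -self.count[i])
            recovered += 1
            pending.extend(j for j in self.indices(key) if self.is_pure(j))
        return not (any(self.id_sum) or any(self.count)), recovered


def stratum(key, n_strata):
    hv = _h(key, b"strata")
    tz = (hv & -hv).bit_length() - 1 if hv else n_strata - 1
    return min(tz, n_strata - 1)


def build_estimator(keys, n_strata=16, cells=80):
    strata = [IBF(cells) for _ in range(n_strata)]
    for key in keys:
        strata[stratum(key, n_strata)].update(key, +1)
    return strata


def estimate_difference(est_a, est_b):
    recovered = 0
    for i in reversed(range(len(est_a))):        # sparsest stratum first
        ok, found = est_a[i].subtract(est_b[i]).decode()
        if not ok:
            # Strata i+1 and above decoded; together they sample ~1/2^(i+1) of the keys.
            return 2 ** (i + 1) * recovered
        recovered += found
    return recovered                              # every stratum decoded: exact count


est1 = build_estimator(range(0, 5000))
est2 = build_estimator(range(200, 5200))          # true difference: 400 IDs
print(estimate_difference(est1, est2))
```

With the slide's 1-based level k, the stratum that fails here is level k - 1 and the scale factor `2 ** (i + 1)` equals 2^(k-1).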
Estimation Accuracy
[Figure: Average Estimation Error (15.3 KBytes); relative error in estimation (%) versus set difference. Strata good for small differences. Min-Wise good for large differences.]
Hybrid Estimator
• Combine Strata and Min-Wise Estimators.
  – Use IBF strata for small differences.
  – Use Min-Wise for large differences.
[Figure: the Strata estimator (IBF 1 to IBF 4) next to the Hybrid estimator, which combines IBF strata with a Min-Wise estimator.]
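The slides do not spell out the Min-Wise estimator, so the sketch below uses the standard min-wise hashing estimate of the Jaccard similarity J and converts it to a difference size with d ≈ (1 - J) / (1 + J) · (|A| + |B|). The function names, the BLAKE2b hash, and the 64-hash signature length are assumptions of this example.

```python
import hashlib


def _h(key, salt):
    """Hypothetical salted 64-bit hash shared by both hosts."""
    digest = hashlib.blake2b(key.to_bytes(8, "big"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minwise_signature(keys, n_hashes=64):
    """For each hash function, keep only the minimum hash value over the set."""
    keys = list(keys)
    return [min(_h(key, b"mw%d" % i) for key in keys) for i in range(n_hashes)]


def minwise_difference(sig_a, size_a, sig_b, size_b):
    """Estimate |A symmetric-difference B| from two signatures and the set sizes."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    jaccard = matches / len(sig_a)   # fraction of agreeing minima estimates J
    return (1 - jaccard) / (1 + jaccard) * (size_a + size_b)


a = range(0, 10_000)
b = range(4_000, 14_000)             # true difference: 8,000 IDs
estimate = minwise_difference(minwise_signature(a), len(a),
                              minwise_signature(b), len(b))
print(round(estimate))
```

The formula follows from |A ∩ B| = J · |A ∪ B|: the sizes add up to (1 + J) · |A ∪ B| while the difference is (1 - J) · |A ∪ B|.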
Hybrid Estimator Accuracy
[Figure: Average Estimation Error (15.3 KBytes); relative error in estimation (%) versus set difference. Hybrid matches Strata for small differences and converges with Min-Wise for large differences.]
Application: KeyDiff Service
• Promising Applications:
  – File Synchronization
  – P2P file sharing
  – Failure Recovery
• Service operations: Add( key ), Remove( key ), Diff( host1, host2 )
[Figure: three Applications, each attached to a Key Service.]
Difference Digests Summary
• Strata & Hybrid Estimators
  – Estimate the size of the Set Difference.
  – For 100K sets, 15KB estimator has <15% error
  – O(log n) communication, O(log n) computation.
• Invertible Bloom Filter
  – Identifies all ID’s in the Set Difference.
  – 16 to 28 Bytes per ID in Set Difference.
  – O(d) communication, O(n+d) computation.
• Implemented in KeyDiff Service
Conclusions: Got Diffs?
• New randomized algorithm (difference digests) for set difference or insertion/deletion coding
• Could it be useful for your system? Need:
  – Large but roughly equal size sets
  – Small set differences (less than 10% of set size)
Extra Slides
Comparison to Logs
• IBF’s work with no prior context.
• Logs work with prior context, BUT
  – Redundant information when sync’ing with multiple parties.
  – Logging must be built into system for each write.
  – Logging adds overhead at runtime.
  – Logging requires non-volatile storage.
    • Often not present in network devices.
• IBF’s may out-perform logs when:
  – Synchronizing multiple parties
  – Synchronizations happen infrequently