DE-DUPLICATION ALGORITHMS FOR HIGH BANDWIDTH DATA DE-DUPLICATION OF LARGE SCALE DATA SETS
From Virtualization to Cloud (Spring 2011) – Ariel Szapiro, Leeor Peled
Method of Deduplication
As stated in class, there are three main methods for activating a deduplication engine:
SBA – user-side responsibility to deduplicate
ILA – inline deduplication as the data is processed
PPA – batch operation in the background on the server side
Papers
High Throughput Data Redundancy Removal Algorithm with Scalable Performance – Bhattacherjee, Narang, Garg (IBM)
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality – Lillibridge, Eshghi, Bhagwat, Deolalikar, Trezise (HP)
Bloom Filters 101
Basic problem – given a set S = {x1, x2, …, xn}, answer: is y ∊ S?
Demands & relaxations:
Efficient lookup time
False positives allowed, false negatives are not!
Implementation – map each item into k locations of an m-bit array using k different hash functions.
Insertion – set the k bits to 1.
Lookup – return true iff all k bits are set.
Delete – impossible! (why?)
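A minimal sketch of this construction in Python (the class name, parameters, and the salted-SHA-1 way of deriving k hash functions are our own illustration, not from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                # number of bits in the filter
        self.k = k                # number of hash functions
        self.bits = [0] * m

    def _locations(self, item):
        # Derive k hash locations by salting a single digest; any k
        # independent hash functions would do.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item):
        for loc in self._locations(item):
            self.bits[loc] = 1    # set all k bits

    def lookup(self, item):
        # True iff all k bits are set; may be a false positive,
        # never a false negative.
        return all(self.bits[loc] for loc in self._locations(item))
```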
Bloom Filter example
Start with an m bit array, filled with 0s.
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at Hi(y). All k values must be 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive: all k values are 1, but y is not in S.
n items, m = cn bits, k hash functions
Slide by Prof Michael Mitzenmacher, Harvard
False Positive Probability
Pr(a specific bit of the filter is 0) is p' = (1 − 1/m)^kn ≈ e^(−kn/m) = p
If r is the fraction of 0 bits in the filter, then the false positive probability is (1 − r)^k ≈ (1 − p')^k ≈ (1 − p)^k = (1 − e^(−k/c))^k
Approximations valid as r is concentrated around E[r]; a martingale argument suffices.
Find the optimum at k = (ln 2)·m/n by calculus. So the optimal false positive probability is about (0.6185)^(m/n).
n items, m = cn bits, k hash functions
Slide by Prof Michael Mitzenmacher, Harvard
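As a quick numeric sanity check of these formulas (our own worked example; the choice of c = 8 bits per item is arbitrary):

```python
import math

def bloom_fpp(c, k):
    # False positive probability (1 - e^{-k/c})^k for m = c*n bits and k hash functions.
    return (1 - math.exp(-k / c)) ** k

c = 8                        # bits per stored item, c = m/n (illustrative value)
k_opt = math.log(2) * c      # optimal k = (ln 2) * m/n  ->  about 5.55
print(k_opt, bloom_fpp(c, round(k_opt)))   # ~5.55, false positive rate ~0.022
```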
Counting Bloom Filters
Suggested by Fan et al. (1998). Handles the deletion problem:
Instead of a single bit, keep a counter (usually 4 bits) at each location.
Increment/decrement all k locations on insertion/deletion.
Watch out for overflows! (probability on the order of 6e-17…)
What happened to the false deletion problem?
Further upgrades: double/triple hashing, compressed BF, hierarchical BF, space-code BF, spectral BF, …
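A minimal sketch of the counting variant (our own illustration; capped Python ints stand in for the 4-bit counters):

```python
import hashlib

class CountingBloomFilter:
    MAX = 15                                  # 4-bit counter ceiling

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _locations(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item):
        for loc in self._locations(item):
            if self.counters[loc] < self.MAX:   # mind for overflows; real
                self.counters[loc] += 1         # implementations pin saturated counters

    def delete(self, item):
        # Only safe for items that were actually inserted: deleting an item
        # that merely *looks* present (a false positive) corrupts other items.
        for loc in self._locations(item):
            if self.counters[loc] > 0:
                self.counters[loc] -= 1

    def lookup(self, item):
        return all(self.counters[loc] > 0 for loc in self._locations(item))
```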
Proposed Parallel-BF (Bhattacherjee)
Deletion is still slow (k re-hashes). Hard to parallelize. New upgrade:
Streaming De-Duplication
We want to stream over windows of ω sets of chunks.
When done with group i, we only delete the occurrences in the first set (Si), and add the items in the next (Si+ω+1).
For fast delete – instead of a counter, keep an array of ω bits at each location, one bit for each set in the currently observed group.
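A rough sketch of this structure (the name BloomFilterArray and the use of Python ints as ω-bit rows are our own; the paper's buffered, parallel implementation is not reproduced):

```python
import hashlib

class BloomFilterArray:
    """m hash locations, each holding one membership bit per set in the window."""

    def __init__(self, m, k, omega):
        self.m, self.k, self.omega = m, k, omega   # omega = number of window slots
        self.rows = [0] * m                        # rows[loc] is a bitmask over the slots

    def _locations(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item, slot):
        # Mark `item` as belonging to the set occupying window slot `slot`.
        for loc in self._locations(item):
            self.rows[loc] |= 1 << slot

    def contains(self, item):
        # Duplicate candidate iff every hash location has some set's bit on.
        return all(self.rows[loc] != 0 for loc in self._locations(item))

    def delete_set(self, slot):
        # Fast delete of an entire set: clear its bit column, no per-item re-hashing.
        mask = ~(1 << slot)
        for loc in range(self.m):
            self.rows[loc] &= mask
```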
Process Flow
Divide the data flow into batches of records. For each batch:
Pre-process: divide into chunks and remove internal duplications (parallel); merge chunks and remove duplications.
Process (FE and BE decoupled): compute k hashes on the record signatures (parallel); add to the BFA while removing duplications (buffered + parallel).
Continue streaming over the next batches while removing inter-batch duplications (ω-wide).
Delete sets from the BFA as the window advances.
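A schematic of the window bookkeeping, reusing the BloomFilterArray sketch above (the parallel FE/BE pipeline is elided and `signature` is a stand-in for the paper's record signatures):

```python
import hashlib

def signature(record: bytes) -> str:
    # Stand-in record signature; in practice a strong chunk/record hash.
    return hashlib.sha1(record).hexdigest()

def stream_dedup(batches, bfa):
    """Yield only records whose signatures are new within the current window."""
    for i, batch in enumerate(batches):
        slot = i % bfa.omega            # window slot this batch will occupy
        bfa.delete_set(slot)            # drop the batch falling out of the window
        seen = set()                    # intra-batch (pre-processing) dedup
        for record in batch:
            sig = signature(record)
            if sig in seen:
                continue                # duplicate inside the current batch
            seen.add(sig)
            if bfa.contains(sig):
                continue                # probable duplicate from a recent batch
            bfa.add(sig, slot)
            yield record
```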
Process Flow Diagram
Pre-processing stage
BFA Visualization
[BFA visualization: an m × ω bit matrix – m hash locations (k hashes per item), with ω set-identifier bits per location. Adding ε ∊ Si sets the Si bit at each of ε's hash locations; adding ε' ∊ Si+1 sets the Si+1 bits; deleting Si−ω clears that set's bits. Window = {Si, Si+1, …, Si+ω}.]
Scalability Results
Conclusion
Since the overall thread count is constant, there is a trade-off between PP and FE+BE threads. The paper analyses a queueing model to find the sweet spot:
The PP stage behaves like M/M/k1.
The FE+BE stages behave like G/M/k2.
There is no mention of the FE/BE trade-off.
The algorithm scales well with the number of threads, the number of records and the record size. Experimental throughput – 0.81 GB/s.
Papers
High Throughput Data Redundancy Removal Algorithm with Scalable Performance – Bhattacherjee, Narang, Garg (IBM)
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality – Lillibridge, Eshghi, Bhagwat, Deolalikar, Trezise (HP)
ILA method – inline process
ILA method pros: simplicity; avoids the complications of batch (post-process) mode.
ILA method cons: a full chunk index requires either huge RAM (impractical) or a small RAM cache that hurts memory/disk bandwidth (the chunk-lookup disk bottleneck).
This paper addresses the main weakness of the ILA method, the chunk-lookup disk bottleneck, by using a sparse index and picking only a few candidate segments for deduplication, i.e. approximate deduplication.
Few Words on Data Segmentation
The data stream input to the storage device is divided into large pieces called segments. Each segment is built from, and can be divided into, two parts:
Chunks – blocks of real data
Manifest – a structure that holds, for each chunk and in the order the chunks appear in the original segment, a pointer to the chunk and its hash value
Few Words on Data Segmentation – Quick Example
Data segment: Chunk A, Chunk B, Chunk A, Chunk B, Chunk A, Chunk C
Chunk container (address – raw data):
0x1 – A
0x2 – B
0x3 – C
Manifest (address – hash value):
0x1 – 0x234
0x2 – 0x017
0x1 – 0x234
0x2 – 0x017
0x1 – 0x234
0x3 – 0x459
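A small sketch of these structures (the field names, the toy address allocation and the truncated hashes are our own illustration):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ChunkContainer:
    chunks: dict = field(default_factory=dict)   # address -> raw chunk data

    def store(self, data: bytes) -> int:
        addr = len(self.chunks) + 1               # toy address allocation
        self.chunks[addr] = data
        return addr

@dataclass
class Manifest:
    # Ordered (address, chunk-hash) pairs describing one segment.
    entries: list = field(default_factory=list)

def build_segment(chunk_data, container):
    """Store any new chunks and build the manifest for one data segment."""
    manifest = Manifest()
    known = {v: k for k, v in container.chunks.items()}   # data -> address
    for data in chunk_data:
        addr = known.get(data) or container.store(data)
        known[data] = addr
        chunk_hash = hashlib.sha1(data).hexdigest()[:3]    # truncated, for illustration
        manifest.entries.append((addr, chunk_hash))
    return manifest

# Example: segment A B A B A C
container = ChunkContainer()
m = build_segment([b"A", b"B", b"A", b"B", b"A", b"C"], container)
print(m.entries)   # duplicate chunks repeat the same address/hash pair
```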
One or more champions are picked from the sparse index in RAM, according to a most-similar-segment policy.
The RAM stores only the pointer to each manifest, so a read request to the disk is needed.
After retrieving the champion manifest from the disk, the deduplication process starts. At the end of this process the new manifest, the new sparse-index entries and the new chunks are stored on disk.
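A hedged sketch of champion selection (a plain overlap count over sampled hooks; the paper's exact most-similar policy and index layout differ in detail):

```python
def pick_champions(incoming_hooks, sparse_index, max_champions=2):
    """
    incoming_hooks: sampled chunk hashes of the incoming segment.
    sparse_index:   dict hook -> list of manifest pointers (segments containing it).
    Returns up to max_champions manifest pointers, most-similar first.
    """
    votes = {}
    for hook in incoming_hooks:
        for manifest_ptr in sparse_index.get(hook, ()):
            votes[manifest_ptr] = votes.get(manifest_ptr, 0) + 1
    ranked = sorted(votes, key=votes.get, reverse=True)
    return ranked[:max_champions]   # their manifests are then read from disk
```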
The data stream arrives and is divided into chunks using the Two-Threshold Two-Divisor (TTTD) chunking algorithm.
The data chunks are grouped into segments using either a fixed-size or a variable-size segmentation algorithm.
The incoming chunks of the data segment are sampled; this can be done by keeping only the chunks whose hash values share a common prefix. The sampling rate falls off exponentially with the prefix length.
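A minimal sketch of prefix-based sampling, assuming hex chunk hashes (our own illustration; a prefix of p hex characters keeps roughly one chunk in 16^p):

```python
def sample_hooks(chunk_hashes, prefix="00"):
    """Keep only chunk hashes (hex strings) starting with `prefix`.

    A prefix of p hex characters samples roughly one chunk in 16**p, so the
    sampling rate drops exponentially with the prefix length.
    """
    return [h for h in chunk_hashes if h.startswith(prefix)]
```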
Proposed Flow
[Proposed flow diagram: the incoming data segment (A B A B A C) from the data stream is deduplicated against a champion manifest, located via its manifest pointer; a new manifest and a new sparse-index entry are produced.]
Assumptions Used
In the proposed flow an approximation is used to avoid the main downside of inline deduplication, the chunk-lookup disk bottleneck.
The approximation is the use of a sparse index, which implies that not all possible duplicate chunks are deduplicated.
The assumptions are:
Locality of chunks – if a champion segment shares a few chunks with the incoming segment, it is likely to share many other chunks with it as well.
Locality of segments – most of the deduplication possible for a given segment can be obtained by deduplicating it against a small number of prior segments.
Simulation Results
In the graph below, the full flow is as shown previously, with sparse indexing of the chunks.
SMB – synthetic data set that represents a small or medium business server backed up to virtual tape.
Workgroup – synthetic data set that represents a small corporate workgroup backed up via tar directly to a NAS interface.
The mean segment size is fixed at 10 MB.
Simulation Results
Deduplication factor = original size/deduplicated size
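For example (our own illustration, not a number from the paper): 100 GB of original backup data that occupies 10 GB after deduplication gives a deduplication factor of 10.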
Conclusions
The proposed method presented in the paper has the following strengths:
Simple flow – an inline flow that uses known algorithms for the preprocessing stage (chunk and segment partitioning).
Very small RAM – only ~10-15 prior segments are represented; in fact only the sparse index and the manifest pointers are kept in RAM.
Used in industry – even though not all the details of the flow are presented, the fact that this system is in real use is a major strength.
Conclusions (2)
The proposed method presented in the paper has the following weaknesses:
The efficiency of the flow depends crucially on the data set – as seen in the simulations, the deduplication factor ranges from about 2.3 to 13 across the different sets.
Both of the data sets used for the evaluation are synthetic – since the flow is so sensitive to its data set, an evaluation on real data is badly needed.
Main Differences
                           Stream based (sparse indexing)     Bloom Filter based
Processing                 inline                             inline
Deduplicating              “similar” chunks (sampled)         consecutive segments
Approximation              chunk sampling, sparse indexing    BF (false positives) + window limits
Throughput                 250 MB/s - 2.5 GB/s                0.81 GB/s
Purpose (proposed)         D2D                                storage saving
Scalability / parallelism  bounded by #champions              overall threads (internal division is optimized)
A note on the comparison
Trade-offs between bandwidth, quality (level of deduplication) and space (RAM/disk).
Neither approach limits the data-set size (usually 10-100 TB), since both are inline.
Sparse indexing provides flexibility – it can support higher rates by doing a worse deduplication job.
The BF approach provides guaranteed deduplication within a given window of segments, but limits the bandwidth. Inherent problem – a BF's strength is having no false negatives, while for deduplication we require no false positives.
Proposal
Augmenting the BF-based approach: instead of an ω-wide window (which assumes temporal locality), generalize to a generic ω-way set “cache”. This is also a bit similar to the champion approach from the second paper.
Maintain sets that are either recent (from the window) or have been hit by some BF lookups.
This can be implemented via an LRU-like score based on the number of hits in the last de-duplication stage – each cycle, throw away the lowest-scoring set from the BFA and fill in the new set.
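A rough sketch of this eviction policy (entirely our own illustration, reusing the BloomFilterArray sketch above; the hit-scoring details are placeholders):

```python
def evict_and_fill(bfa, slot_scores, new_batch_items):
    """
    slot_scores: dict slot -> number of BF hits credited to that set in the last
                 de-duplication stage (recent sets can be given a bonus score).
    Each cycle: throw away the lowest-scoring set and load the new one in its place.
    """
    victim = min(slot_scores, key=slot_scores.get)   # LRU-like: lowest score
    bfa.delete_set(victim)                           # clear the victim's bit column
    for item in new_batch_items:
        bfa.add(item, victim)                        # reuse the freed slot
    slot_scores[victim] = 0                          # reset the score for the new set
    return victim
```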