DE-DUPLICATION ALGORITHMS FOR HIGH BANDWIDTH DATA DE-DUPLICATION OF LARGE SCALE DATA SETS
From Virtualization to Cloud (Spring 2011) – Ariel Szapiro, Leeor Peled
Method of Deduplication
As stated in class, there are three main methods for activating a deduplication engine:
SBA – user-side responsibility to deduplicate
ILA – inline deduplication as the data is processed
PPA – batch operation in the background on the server side
Papers
High Throughput Data Redundancy Removal Algorithm with Scalable Performance – Bhattacherjee, Narang, Garg (IBM)
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality – Lillibridge, Eshghi, Bhagwat, Deolalikar, Trezise (HP)
Bloom Filters 101
Basic problem – given a set S = {x1, x2, …, xn}, answer: is y ∊ S?
Demands & relaxations:
Efficient lookup time
False positives allowed, false negatives are not!
Implementation – map each item into k locations of an m-bit array using k different hash functions.
Insertion – set the k bits to 1.
Lookup – return true iff all k bits are set.
Delete – impossible! (why?)
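A minimal sketch of this construction in Python (the class name, parameters, and the salted-SHA-1 way of deriving k hash functions are our own illustration, not from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                # number of bits in the filter
        self.k = k                # number of hash functions
        self.bits = [0] * m

    def _locations(self, item):
        # Derive k hash locations by salting a single digest; any k
        # independent hash functions would do.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item):
        for loc in self._locations(item):
            self.bits[loc] = 1    # set all k bits

    def lookup(self, item):
        # True iff all k bits are set; may be a false positive,
        # never a false negative.
        return all(self.bits[loc] for loc in self._locations(item))
```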
Bloom Filter example
Start with an m bit array, filled with 0s.
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at Hi(y). All k values must be 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive: all k values are 1, but y is not in S.
n items, m = cn bits, k hash functions
Slide by Prof Michael Mitzenmacher, Harvard
False Positive Probability
Pr(a specific bit of the filter is 0) is p' = (1 − 1/m)^kn ≈ e^(−kn/m) = p
If r is the fraction of 0 bits in the filter, then the false positive probability is (1 − r)^k ≈ (1 − p')^k ≈ (1 − p)^k = (1 − e^(−k/c))^k
Approximations valid as r is concentrated around E[r]; a martingale argument suffices.
Find the optimum at k = (ln 2)·m/n by calculus. So the optimal false positive probability is about (0.6185)^(m/n).
n items, m = cn bits, k hash functions
Slide by Prof Michael Mitzenmacher, Harvard
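As a quick numeric sanity check of these formulas (our own worked example; the choice of c = 8 bits per item is arbitrary):

```python
import math

def bloom_fpp(c, k):
    # False positive probability (1 - e^{-k/c})^k for m = c*n bits and k hash functions.
    return (1 - math.exp(-k / c)) ** k

c = 8                        # bits per stored item, c = m/n (illustrative value)
k_opt = math.log(2) * c      # optimal k = (ln 2) * m/n  ->  about 5.55
print(k_opt, bloom_fpp(c, round(k_opt)))   # ~5.55, false positive rate ~0.022
```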
Counting Bloom Filters
Suggested by Fan et al. (1998). Handles the deletion problem:
Instead of a single bit, keep a counter (usually 4 bits) at each location.
Increment/decrement all k locations on insertion/deletion.
Watch out for overflows! (probability on the order of 6e-17…)
What happened to the false deletion problem?
Further upgrades: double/triple hashing, compressed BF, hierarchical BF, space-code BF, spectral BF, …
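A minimal sketch of the counting variant (our own illustration; capped Python ints stand in for the 4-bit counters):

```python
import hashlib

class CountingBloomFilter:
    MAX = 15                                  # 4-bit counter ceiling

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _locations(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, item):
        for loc in self._locations(item):
            if self.counters[loc] < self.MAX:   # mind for overflows; real
                self.counters[loc] += 1         # implementations pin saturated counters

    def delete(self, item):
        # Only safe for items that were actually inserted: deleting an item
        # that merely *looks* present (a false positive) corrupts other items.
        for loc in self._locations(item):
            if self.counters[loc] > 0:
                self.counters[loc] -= 1

    def lookup(self, item):
        return all(self.counters[loc] > 0 for loc in self._locations(item))
```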
Proposed Parallel-BF (Bhattacherjee)
Deletion is still slow (k re-hashes). Hard to parallelize. New upgrade:
Streaming De-Duplication
We want to stream over windows of ω sets of chunks.
When done with group i, we only delete the occurrences in the first set (Si), and add the items in the next (Si+ω+1).
For fast delete – instead of a counter, keep an array of ω bits at each location, one bit for each set in the currently observed group.
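A rough sketch of this structure (the name BloomFilterArray and the use of Python ints as ω-bit rows are our own; the paper's buffered, parallel implementation is not reproduced):

```python
import hashlib

class BloomFilterArray:
    """m hash locations, each holding one membership bit per set in the window."""

    def __init__(self, m, k, omega):
        self.m, self.k, self.omega = m, k, omega   # omega = number of window slots
        self.rows = [0] * m                        # rows[loc] is a bitmask over the slots

    def _locations(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item, slot):
        # Mark `item` as belonging to the set occupying window slot `slot`.
        for loc in self._locations(item):
            self.rows[loc] |= 1 << slot

    def contains(self, item):
        # Duplicate candidate iff every hash location has some set's bit on.
        return all(self.rows[loc] != 0 for loc in self._locations(item))

    def delete_set(self, slot):
        # Fast delete of an entire set: clear its bit column, no per-item re-hashing.
        mask = ~(1 << slot)
        for loc in range(self.m):
            self.rows[loc] &= mask
```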
Process Flow
Divide the data flow into batches of records. For each batch:
Pre-process: divide into chunks and remove internal duplications (parallel); merge chunks and remove duplications.
Process (FE and BE decoupled): compute k hashes on the record signatures (parallel); add to the BFA while removing duplications (buffered + parallel).
Continue streaming over the next batches while removing inter-batch duplications (ω-wide).
Delete sets from the BFA as the window advances.
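A schematic of the window bookkeeping, reusing the BloomFilterArray sketch above (the parallel FE/BE pipeline is elided and `signature` is a stand-in for the paper's record signatures):

```python
import hashlib

def signature(record: bytes) -> str:
    # Stand-in record signature; in practice a strong chunk/record hash.
    return hashlib.sha1(record).hexdigest()

def stream_dedup(batches, bfa):
    """Yield only records whose signatures are new within the current window."""
    for i, batch in enumerate(batches):
        slot = i % bfa.omega            # window slot this batch will occupy
        bfa.delete_set(slot)            # drop the batch falling out of the window
        seen = set()                    # intra-batch (pre-processing) dedup
        for record in batch:
            sig = signature(record)
            if sig in seen:
                continue                # duplicate inside the current batch
            seen.add(sig)
            if bfa.contains(sig):
                continue                # probable duplicate from a recent batch
            bfa.add(sig, slot)
            yield record
```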
Process Flow Diagram
Pre-processing stage
BFA Visualization
[BFA visualization: an m × ω bit matrix – m hash locations (k hashes per item), with ω set-identifier bits per location. Adding ε ∊ Si sets the Si bit at each of ε's hash locations; adding ε' ∊ Si+1 sets the Si+1 bits; deleting Si−ω clears that set's bits. Window = {Si, Si+1, …, Si+ω}.]
Scalability Results
Conclusion
Since the overall thread count is constant, there is a trade-off between PP and FE+BE threads. The paper analyses a queueing model to find the sweet spot:
The PP stage behaves like M/M/k1.
The FE+BE stages behave like G/M/k2.
There is no mention of the FE/BE trade-off.
The algorithm scales well with the number of threads, the number of records and the record size. Experimental throughput – 0.81 GB/s.
Papers
High Throughput Data Redundancy Removal Algorithm with Scalable Performance – Bhattacherjee, Narang, Garg (IBM)
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality – Lillibridge, Eshghi, Bhagwat, Deolalikar, Trezise (HP)
ILA method – inline process
ILA method pros: simplicity; avoids the complications of batch (post-process) mode.
ILA method cons: a full chunk index requires either huge RAM (impractical) or a small RAM cache that hurts memory/disk bandwidth (the chunk-lookup disk bottleneck).
This paper addresses the main weakness of the ILA method, the chunk-lookup disk bottleneck, by using a sparse index and picking only a few candidate segments for deduplication, i.e. approximate deduplication.
Few Words on Data Segmentation
The data stream input to the storage device is divided into large pieces called segments. Each segment is built from, and can be divided into, two parts:
Chunks – blocks of real data
Manifest – a structure that holds, for each chunk and in the order the chunks appear in the original segment, a pointer to the chunk and its hash value
Few Words on Data Segmentation – Quick Example
Data segment: Chunk A, Chunk B, Chunk A, Chunk B, Chunk A, Chunk C
Chunk container (address – raw data):
0x1 – A
0x2 – B
0x3 – C
Manifest (address – hash value):
0x1 – 0x234
0x2 – 0x017
0x1 – 0x234
0x2 – 0x017
0x1 – 0x234
0x3 – 0x459
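A small sketch of these structures (the field names, the toy address allocation and the truncated hashes are our own illustration):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class ChunkContainer:
    chunks: dict = field(default_factory=dict)   # address -> raw chunk data

    def store(self, data: bytes) -> int:
        addr = len(self.chunks) + 1               # toy address allocation
        self.chunks[addr] = data
        return addr

@dataclass
class Manifest:
    # Ordered (address, chunk-hash) pairs describing one segment.
    entries: list = field(default_factory=list)

def build_segment(chunk_data, container):
    """Store any new chunks and build the manifest for one data segment."""
    manifest = Manifest()
    known = {v: k for k, v in container.chunks.items()}   # data -> address
    for data in chunk_data:
        addr = known.get(data) or container.store(data)
        known[data] = addr
        chunk_hash = hashlib.sha1(data).hexdigest()[:3]    # truncated, for illustration
        manifest.entries.append((addr, chunk_hash))
    return manifest

# Example: segment A B A B A C
container = ChunkContainer()
m = build_segment([b"A", b"B", b"A", b"B", b"A", b"C"], container)
print(m.entries)   # duplicate chunks repeat the same address/hash pair
```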
One or more champions are picked from the sparse index in RAM, according to a most-similar-segment policy.
The RAM stores only the pointer to each manifest, so a read request to the disk is needed.
After retrieving the champion manifest from the disk, the deduplication process starts. At the end of this process the new manifest, the new sparse-index entries and the new chunks are stored on disk.
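A hedged sketch of champion selection (a plain overlap count over sampled hooks; the paper's exact most-similar policy and index layout differ in detail):

```python
def pick_champions(incoming_hooks, sparse_index, max_champions=2):
    """
    incoming_hooks: sampled chunk hashes of the incoming segment.
    sparse_index:   dict hook -> list of manifest pointers (segments containing it).
    Returns up to max_champions manifest pointers, most-similar first.
    """
    votes = {}
    for hook in incoming_hooks:
        for manifest_ptr in sparse_index.get(hook, ()):
            votes[manifest_ptr] = votes.get(manifest_ptr, 0) + 1
    ranked = sorted(votes, key=votes.get, reverse=True)
    return ranked[:max_champions]   # their manifests are then read from disk
```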
The data stream arrives and is divided into chunks using the Two-Threshold Two-Divisor (TTTD) chunking algorithm.
The data chunks are grouped into segments using either a fixed-size or a variable-size segmentation algorithm.
The incoming chunks of the data segment are sampled; this can be done by keeping only the chunks whose hash values share a common prefix. The sampling rate falls off exponentially with the prefix length.
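A minimal sketch of prefix-based sampling, assuming hex chunk hashes (our own illustration; a prefix of p hex characters keeps roughly one chunk in 16^p):

```python
def sample_hooks(chunk_hashes, prefix="00"):
    """Keep only chunk hashes (hex strings) starting with `prefix`.

    A prefix of p hex characters samples roughly one chunk in 16**p, so the
    sampling rate drops exponentially with the prefix length.
    """
    return [h for h in chunk_hashes if h.startswith(prefix)]
```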
Proposed Flow
[Proposed flow diagram: the incoming data segment (A B A B A C) from the data stream is deduplicated against a champion manifest, located via its manifest pointer; a new manifest and a new sparse-index entry are produced.]
Assumptions Used
In the proposed flow an approximation is used to avoid the main downside of inline deduplication, the chunk-lookup disk bottleneck.
The approximation is the use of a sparse index, which implies that not all possible duplicate chunks are deduplicated.
The assumptions are:
Locality of chunks – if a champion segment shares a few chunks with the incoming segment, it is likely to share many other chunks with it as well.
Locality of segments – most of the deduplication possible for a given segment can be obtained by deduplicating it against a small number of prior segments.
Simulation Results
In the graph below, the full flow is as shown previously, with sparse indexing of the chunks.
SMB – synthetic data set that represents a small or medium business server backed up to virtual tape.
Workgroup – synthetic data set that represents a small corporate workgroup backed up via tar directly to a NAS interface.
The mean segment size is fixed at 10 MB.
Simulation Results
Deduplication factor = original size/deduplicated size
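For example (our own illustration, not a number from the paper): 100 GB of original backup data that occupies 10 GB after deduplication gives a deduplication factor of 10.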
Conclusions
The proposed method presented in the paper has the following strengths:
Simple flow – an inline flow that uses known algorithms for the preprocessing stage (chunk and segment partitioning).
Very small RAM – only ~10-15 prior segments are represented; in fact only the sparse index and the manifest pointers are kept in RAM.
Used in industry – even though not all the details of the flow are presented, the fact that this system is in real use is a major strength.
Conclusions (2)
The proposed method presented in the paper has the following weaknesses:
The efficiency of the flow depends crucially on the data set – as seen in the simulations, the deduplication factor ranges from about 2.3 to 13 across the different sets.
Both of the data sets used for the evaluation are synthetic – since the flow is so sensitive to its data set, an evaluation on real data is badly needed.
Main Differences
                           Stream based (sparse indexing)     Bloom Filter based
Processing                 inline                             inline
Deduplicating              “similar” chunks (sampled)         consecutive segments
Approximation              chunk sampling, sparse indexing    BF (false positives) + window limits
Throughput                 250 MB/s - 2.5 GB/s                0.81 GB/s
Purpose (proposed)         D2D                                storage saving
Scalability / parallelism  bounded by #champions              overall threads (internal division is optimized)
A note on the comparison
Trade-offs between bandwidth, quality (level of deduplication) and space (RAM/disk).
Neither approach limits the data-set size (usually 10-100 TB), since both are inline.
Sparse indexing provides flexibility – it can support higher rates by doing a worse deduplication job.
The BF approach provides guaranteed deduplication within a given window of segments, but limits the bandwidth. Inherent problem – a BF's strength is having no false negatives, while for deduplication we require no false positives.
Proposal
Augmenting the BF-based approach: instead of an ω-wide window (which assumes temporal locality), generalize to a generic ω-way set “cache”. This is also a bit similar to the champion approach from the second paper.
Maintain sets that are either recent (from the window) or have been hit by some BF lookups.
This can be implemented via an LRU-like score based on the number of hits in the last de-duplication stage – each cycle, throw away the lowest-scoring set from the BFA and fill in the new set.
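A rough sketch of this eviction policy (entirely our own illustration, reusing the BloomFilterArray sketch above; the hit-scoring details are placeholders):

```python
def evict_and_fill(bfa, slot_scores, new_batch_items):
    """
    slot_scores: dict slot -> number of BF hits credited to that set in the last
                 de-duplication stage (recent sets can be given a bonus score).
    Each cycle: throw away the lowest-scoring set and load the new one in its place.
    """
    victim = min(slot_scores, key=slot_scores.get)   # LRU-like: lowest score
    bfa.delete_set(victim)                           # clear the victim's bit column
    for item in new_batch_items:
        bfa.add(item, victim)                        # reuse the freed slot
    slot_scores[victim] = 0                          # reset the score for the new set
    return victim
```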