Lazy Exact Deduplication. Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie Ren, Gang Wang, Xiaoguang Liu. College of Computer and Control Engineering, Nankai University, China. 5 May 2016


Page 1: Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie …storageconference.us/2016/Slides/RebeccaStones.pdf · 2016-05-24 · Lazy Exact Deduplication Jingwei Ma, Rebecca

Lazy Exact Deduplication

Jingwei Ma, Rebecca J. Stones, Yuxiang Ma, Jingui Wang, Junjie Ren, Gang Wang, Xiaoguang Liu

College of Computer and Control Engineering, Nankai University, China.

5 May 2016


Lazy exact deduplication

Lead author: Jingwei Ma, PhD student at Nankai University (supervisor: Prof. Gang Wang). Couldn’t get a USA visa in time ⇒ I will present this work.

Credit where credit is due: Jingwei Ma did the lion’s share of this work (development, implementation, experimentation, etc.).

Lazy deduplication: ‘Lazy’ in the sense that we postpone disk lookups until we can do them as a batch. (Lazy is exact.)


Deduplication: What usually happens...

We have a large amount of data, with lots of duplicate data (e.g. weekly backups).

We read through the data, and if we see something we’ve seen before, we replace it with an index entry (saving disk space).

The data is broken up into chunks (Rabin Hash).

The chunks are fingerprinted (SHA1): same fingerprint ⇒ duplicate chunk.
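
The chunk-and-fingerprint steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple polynomial rolling hash as a stand-in for a true Rabin hash, and the window size, mask, and chunk-size bounds are illustrative values that do not come from the talk.

```python
import hashlib

WINDOW = 16          # bytes in the rolling window (assumed)
BASE = 257           # polynomial base (assumed)
MOD = (1 << 31) - 1  # modulus for the rolling hash (assumed)
MASK = 0x3F          # cut when hash & MASK == 0 (assumed)
MIN_CHUNK, MAX_CHUNK = 32, 256  # chunk-size bounds in bytes (assumed)


def chunk(data: bytes):
    """Split data at content-defined boundaries using a rolling hash."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # add the newest byte
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0                     # restart for the next chunk
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks


def deduplicate(data: bytes):
    """Return (store, recipe): unique chunks keyed by SHA-1, and the fingerprint sequence."""
    store, recipe = {}, []
    for c in chunk(data):
        fp = hashlib.sha1(c).digest()  # same fingerprint => duplicate chunk
        store.setdefault(fp, c)        # keep only the first copy of each chunk
        recipe.append(fp)              # the index entry stands in for the data
    return store, recipe
```

The original data can be rebuilt with `b"".join(store[fp] for fp in recipe)`, which is what makes replacing duplicates by index entries safe.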


Deduplication: What usually happens...

Disk bottleneck: Most fingerprints are stored on disk ⇒ lots of disk reads (“have I seen this before?”) ⇒ slow.

Caching and prefetching reduce the disk bottleneck problem:

[Figure, first panel: the incoming fingerprints · · · fA fB fC fD · · · are checked against the cache and the disk; labelled “cache miss”. The disk holds fA fB fC fD. Caption: The first time we see fingerprints fA, fB, ...]

[Figure, second panel: the same fingerprints · · · fA fB fC fD · · · arrive again; labelled “cache miss”, “prefetching” (disk to cache) and “cache hit”. The disk holds fA fB fC fD. Caption: The second time we see fingerprints fA, fB, ...]
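
A toy model of this eager lookup path may help make the two panels concrete. It is only a sketch under stated assumptions: the "disk" is a Python list of fingerprints in stored order, the cache is a small LRU, and the class name `EagerIndex`, its fields, and the prefetch length are all hypothetical.

```python
from collections import OrderedDict


class EagerIndex:
    """Toy fingerprint index: LRU cache in memory, ordered fingerprint log 'on disk'."""

    def __init__(self, cache_size=1024, prefetch=3):
        self.cache = OrderedDict()  # fingerprint -> None, kept in LRU order
        self.cache_size = cache_size
        self.prefetch = prefetch    # number of subsequent fingerprints to load
        self.disk_log = []          # fingerprints in the order they were stored
        self.disk_pos = {}          # fingerprint -> position in disk_log
        self.disk_lookups = 0       # counts simulated disk lookups

    def _cache_put(self, fp):
        self.cache[fp] = None
        self.cache.move_to_end(fp)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry

    def is_duplicate(self, fp):
        if fp in self.cache:                # cache hit: no disk access
            self.cache.move_to_end(fp)
            return True
        self.disk_lookups += 1              # cache miss: look on disk
        pos = self.disk_pos.get(fp)
        if pos is None:                     # unique: store it on disk
            self.disk_pos[fp] = len(self.disk_log)
            self.disk_log.append(fp)
            return False
        # Duplicate found on disk: prefetch it together with its successors.
        for nxt in self.disk_log[pos:pos + 1 + self.prefetch]:
            self._cache_put(nxt)
        return True
```

Feeding fA, fB, fC, fD through `is_duplicate` twice costs four disk lookups on the first pass and only one on the second, since the miss on fA prefetches the other three.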


Lazy deduplication...

Bloom filter: identifies many uniques (not all). [Commonly used.]

buffer: stores fingerprints in hash buckets; searched later on disk (“lazy”). When full, whole buckets are searched in one go (stored on-disk in hash buckets).

post-lookup: searching the cache after buffering (maybe multiple times)

pre-lookup: searching the cache before buffering [not shown]

prefetching: bidirectional; triggers post-lookup
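
For readers unfamiliar with the first component, here is a minimal Bloom filter sketch. It shows why the filter can identify many uniques but not all: a negative answer is definite, while a positive answer may be a false positive and still needs a real lookup. The bit-array size, the number of hashes, and the salted-SHA-1 position scheme are assumptions for illustration only.

```python
import hashlib


class BloomFilter:
    """Set-membership filter with no false negatives and occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive the bit positions from salted SHA-1 digests (an assumed scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely unique"; True means "possibly seen before".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

A fingerprint for which `might_contain` returns False can be treated as unique immediately, with no buffering or disk lookup.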

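[Figure: the incoming fingerprints · · · fA fB fC fD · · · pass the Bloom filter and the cache and are placed in the buffer, shown as hash buckets holding fC; fA; fB, fD. The disk holds fA fB fC fD. Arrows are labelled “prefetching” (disk to cache) and “post-lookup” (buffer against cache).]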
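
The buffering idea can be sketched as below. This is a simplified illustration rather than the system from the talk: it assumes a flush is triggered when a single bucket reaches a fixed capacity, chooses the bucket from the first byte of the fingerprint, and simulates the on-disk buckets with in-memory sets. The class and field names are hypothetical.

```python
class LazyBuffer:
    """Buffers fingerprints per hash bucket; a full bucket is checked on disk in one go."""

    def __init__(self, num_buckets=4, bucket_capacity=3):
        self.num_buckets = num_buckets
        self.bucket_capacity = bucket_capacity
        self.buffer = [[] for _ in range(num_buckets)]   # in-memory buckets
        self.disk = [set() for _ in range(num_buckets)]  # simulated on-disk buckets
        self.disk_reads = 0                              # one per batched lookup
        self.results = []                                # (fingerprint, is_duplicate)

    def _bucket(self, fp: bytes) -> int:
        return fp[0] % self.num_buckets  # assumed bucket choice: first byte

    def add(self, fp: bytes):
        b = self._bucket(fp)
        self.buffer[b].append(fp)        # postpone the disk lookup ("lazy")
        if len(self.buffer[b]) >= self.bucket_capacity:
            self._flush(b)

    def _flush(self, b: int):
        self.disk_reads += 1             # a single read serves the whole batch
        on_disk = self.disk[b]
        for fp in self.buffer[b]:
            self.results.append((fp, fp in on_disk))
            on_disk.add(fp)              # later repeats in this batch count as duplicates
        self.buffer[b].clear()

    def flush_all(self):
        for b in range(self.num_buckets):
            if self.buffer[b]:
                self._flush(b)
```

Because every fingerprint still gets checked against the on-disk bucket before a decision is recorded, no duplicate is missed; only the timing of the lookup changes.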


Prefetching...

Ordinarily, we prefetch the subsequent on-disk fingerprints after a duplicate is found on disk; these will probably be the next incoming fingerprints. But this doesn’t work with the lazy method (where fingerprints are buffered).

To overcome this obstacle, each buffered fingerprint is given a...

rank, used to determine the on-disk search range; and a
buffer cycle, indicating where duplicates might be on-disk.

It looks like this:

[Figure: a row of incoming fingerprints labelled with rank r = 0, 1, 2, ..., 8, above a row of fingerprints stored on disk. Labels: “on-disk lookup”, “2048 fingerprints”, “r”. Legend: incoming unique; on-disk unique; buffered / on-disk match.]
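
One possible reading of the figure, offered as an assumption rather than the authors' exact rule: a buffered fingerprint with rank r was preceded by r other buffered fingerprints, so when it is matched on disk at some position, the duplicates of its neighbours plausibly lie in a range that begins r entries earlier. The sketch below computes such a range. The function name, the clamping, and the use of 2048 (the figure's label) as the range length are illustrative choices; see the paper for the actual scheme.

```python
def prefetch_window(match_pos: int, rank: int, disk_len: int, window: int = 2048):
    """Return the half-open range [start, end) of on-disk fingerprints to prefetch.

    Hypothetical rule: reach back `rank` entries from the match, then read
    forward for up to `window` entries, staying inside the on-disk sequence.
    """
    start = max(0, match_pos - rank)     # backward part of the bidirectional prefetch
    end = min(disk_len, start + window)  # forward part, capped by the data on disk
    return start, end
```

For example, `prefetch_window(5000, 3, 100000)` gives `(4997, 7045)`: the range reaches 3 entries behind the match and covers 2048 entries in total.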


Experimental results...

(See our paper for the details and further experiments.)

The time it takes to deduplicate a dataset (on SSD):

             Vm (220GB)    Src (343GB)    FSLHomes (3.58TB)
eager way    282 sec.      476 sec.       5824 sec.
lazy way     151 sec.      226 sec.       3939 sec.

(eager = non-lazy [exact] way, i.e., no buffering before accessing the disk)

Conclusion: Lazy is faster.
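
To put a number on “faster”, the speedup of lazy over eager can be computed directly from the reported times. The snippet only restates the figures from the slide; the variable names are arbitrary.

```python
# Deduplication times in seconds (eager, lazy), as reported on the slide.
times = {
    "Vm": (282, 151),
    "Src": (476, 226),
    "FSLHomes": (5824, 3939),
}

# Speedup = eager time / lazy time, rounded to two decimals.
speedup = {name: round(eager / lazy, 2) for name, (eager, lazy) in times.items()}
print(speedup)  # {'Vm': 1.87, 'Src': 2.11, 'FSLHomes': 1.48}
```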


On-disk lookups...

Disk access time (sec.) on SSD:

                         Vm              Src            FSLHomes
                     eager   lazy    eager   lazy    eager   lazy
on-disk lookup         176     20      325     45     4598   1639
prefetching             46     60       52     68      298    655
other                   59     71       99    113      928   1645
total disk access      222     80      377    113     4896   2294
total dedup.           282    151      476    226     5824   3939

Conclusion: Lazy reduces the disk bottleneck.
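
One way to see the reduced bottleneck is to compute what share of the total deduplication time is spent on disk access, using the “total disk access” and “total dedup.” rows. The snippet uses only the reported numbers; the dictionary layout is an arbitrary choice.

```python
# (total disk access, total dedup.) in seconds, as reported on the slide.
rows = {
    ("Vm", "eager"): (222, 282),
    ("Vm", "lazy"): (80, 151),
    ("Src", "eager"): (377, 476),
    ("Src", "lazy"): (113, 226),
    ("FSLHomes", "eager"): (4896, 5824),
    ("FSLHomes", "lazy"): (2294, 3939),
}

# Percentage of the total run time spent on disk access, rounded to whole percent.
disk_share = {key: round(100 * disk / total) for key, (disk, total) in rows.items()}
for key, pct in disk_share.items():
    print(key, f"{pct}%")
```

On all three datasets the disk share drops from roughly 80% with the eager method to between 50% and 58% with the lazy method.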


Throughput...

[Figure: two plots of throughput (MB/sec.) against data size (GB, axis from 20 to 420), each with an eager curve and a lazy curve. First panel: Src on SSD, with labelled values 656 and 397. Second panel: Src on HDD, with labelled values 69 and 151.]

Conclusion: Lazy has better throughput on both SSD and HDD, but more so on the slower HDD.
