© Copyright 2017 Dell Inc.
The Logic of Physical Garbage Collection in Deduplicating Storage
Fred Douglis
Abhinav Duggal
Philip Shilane
Tony Wong
Dell EMC
Shiqin Yan
University of Chicago
Fabiano Botelho
Rubrik
Deduplication in Data Domain Filesystem (DDFS)
[Figure: write path for two files]
• File 1 is split into variable sized chunks R S T W; File 2 into W X Y Z
• Fingerprints are generated for each chunk (Rfp Sfp Tfp Wfp; Wfp Xfp Yfp Zfp)
• Containers holding chunks: C1 = R S, C2 = T W, C3 = X Y, C4 = Z
Fingerprint Index
fp   CID
R    C1
S    C1
T    C2
W    C2
X    C3
Y    C3
Z    C4
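A minimal sketch of the write path in the figure, assuming SHA-1 fingerprints and a fixed capacity of two chunks per container. The `DedupStore` class and its fields are hypothetical names for illustration, not DDFS code.

```python
import hashlib

CONTAINER_CAPACITY = 2  # assumption: tiny containers, to mirror the figure


class DedupStore:
    """Toy deduplicating store: fingerprint index plus append-only containers."""

    def __init__(self):
        self.containers = [[]]  # containers[i] holds chunks; its CID is i + 1
        self.index = {}         # fingerprint -> container ID (CID)

    def write_file(self, chunks):
        """Store only chunks not seen before; return the file's fingerprint list."""
        recipe = []
        for chunk in chunks:
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in self.index:
                if len(self.containers[-1]) == CONTAINER_CAPACITY:
                    self.containers.append([])  # current container is full
                self.containers[-1].append(chunk)
                self.index[fp] = len(self.containers)
            recipe.append(fp)
        return recipe


store = DedupStore()
file1 = store.write_file([b"R", b"S", b"T", b"W"])
file2 = store.write_file([b"W", b"X", b"Y", b"Z"])
```

In this sketch, File 2 adds only X, Y and Z because W is already in the index, so the store ends with four containers, as in the figure.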
File Representation in DDFS
[Figure: Merkle tree with levels L6, L5, …, L2, L1 above the L0 chunks (R, S, …, Y); a second L6 root labeled COPY, with L2 … and L1: Rfp Sfp Zfp]
• Files represented as a Merkle tree of fingerprints
• Lp chunks (metadata), e.g. L1: Rfp Sfp Tfp Ufp Vfp Wfp Xfp Yfp
• L0: Chunks stored on disk in containers
• “fastcopy” creates new root into same tree
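A rough sketch of the idea, assuming SHA-1 fingerprints and a fan-out of four. All names here (`chunk_store`, `put_l0`, `put_lp`, `build_tree`, `namespace`) are hypothetical. The tree grows only as tall as needed rather than always having levels L1 to L6, and fastcopy is modeled as a second namespace entry pointing at the same root, which is a simplification of "new root into same tree".

```python
import hashlib

chunk_store = {}  # fingerprint -> bytes (L0 data) or tuple of child fingerprints (Lp)


def fingerprint(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def put_l0(data: bytes) -> str:
    """Store a data chunk and return its fingerprint."""
    fp = fingerprint(data)
    chunk_store[fp] = data
    return fp


def put_lp(children) -> str:
    """Store a metadata chunk whose content is its list of child fingerprints."""
    fp = fingerprint("".join(children).encode())
    chunk_store[fp] = tuple(children)
    return fp


def build_tree(l0_fps, fanout=4) -> str:
    """Stack Lp levels over the L0 fingerprints until a single root remains."""
    level = list(l0_fps)
    while True:
        level = [put_lp(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        if len(level) == 1:
            return level[0]


namespace = {}
l0 = [put_l0(c) for c in (b"R", b"S", b"T", b"U", b"V", b"W", b"X", b"Y")]
namespace["file1"] = build_tree(l0)
chunks_before_copy = len(chunk_store)
namespace["file1.copy"] = namespace["file1"]  # "fastcopy": no chunks are rewritten
```

The copy costs no new chunks, which is why heavy use of fastcopy raises the deduplication ratio seen by GC.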
Deduplication Workloads on Data Domain
• Traditional backups
– Weekly full and daily incremental backups
› Full backups tend to be very large: 100GBs to TBs
› Much content in full backups repeats previous full
– Typically, 10-20x total compression (TC)
› 20x TC = 10x dedup and 2x compression
• New workloads
– “Synthetic” full backups
› Send changes and a recipe to create a single full backup from some previous backup
› Daily fulls
› High TC (100x-400x or higher)
– High file count
› 100M to 1 billion small files
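The relation "20x TC = 10x dedup and 2x compression" is a product of two factors. A small worked example, with hypothetical function names and a made-up 100 TB logical size:

```python
def total_compression(dedup_factor: float, local_compression_factor: float) -> float:
    """TC is the product of the deduplication and local compression factors."""
    return dedup_factor * local_compression_factor


def physical_tb(logical_tb: float, tc: float) -> float:
    """Physical space needed for a given logical (pre-deduplication) size."""
    return logical_tb / tc


tc = total_compression(10, 2)      # the slide's example: 10x dedup, 2x compression
stored = physical_tb(100.0, tc)    # assumption: 100 TB of logical backup data
```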
Garbage Collection in a Deduplication Filesystem
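[Figure: Files 1 to 3, the containers holding their chunks, and the Fingerprint Index]
• File 1 and File 2 reference chunks in containers C1 = R S, C2 = T W, C3 = X Y, C4 = Z
• Shared chunk: a chunk referenced by more than one file
• Duplicate chunk: File 3 writes Q and Y into a new container C5 = Q Y, although Y is already stored in C3
• Duplicates are sometimes written to improve throughput
Fingerprint Index
fp   CID
R    C1
S    C1
T    C2
W    C2
X    C3
Y    C3
Z    C4
Q    C5
Y    C5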
Evolution of GC in DDFS
• Logical GC (LGC)
– Depth-first traversal of per-file Merkle tree on disk to mark live chunks in memory
– In-memory data structures may not allow system to track all chunks, so an extra mark phase (“pre-phases”) is used when necessary
• Physical GC (PGC)
– Breadth-first traversal of the physical layout of Merkle trees to mark live chunks in memory
– Similar to LGC, pre-phases may be needed
• Phase-optimized Physical GC (PGC+)
– Improvement over PGC by removing pre-phases, plus other optimizations
Logical GC Phases
• Merge
– Merge the in-memory index into the on-disk index
• Enumeration
– Depth-first walk that marks live chunks in an in-memory Bloom filter called the live vector
• Filter
– Create the live instance vector (also a Bloom filter) from the live vector to remove the duplicates
• Select
– Select best containers to compact
• Copy
– Copy live chunks from selected containers into new containers and delete old containers
Mark phase
Sweep phase
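A minimal sketch of marking into a Bloom filter live vector and then sweeping, under several assumptions: chunk names stand in for fingerprints, File 1 has been deleted, the Merge, Filter and Select phases are omitted, and every container is treated as selected. The `BloomFilter` class is a generic textbook construction, not the DDFS one.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: k hash positions per key over a fixed bit array."""

    def __init__(self, num_bits: int, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


containers = {"C1": [b"R", b"S"], "C2": [b"T", b"W"], "C3": [b"X", b"Y"], "C4": [b"Z"]}
live_files = {"File 2": [b"W", b"X", b"Y", b"Z"]}  # assumption: File 1 was deleted

# Mark: enumerate every live file and record its chunks in the live vector.
live = BloomFilter(num_bits=1024)
for fps in live_files.values():
    for fp in fps:
        live.add(fp)

# Sweep: keep only the chunks the live vector reports as live.
survivors = {cid: [fp for fp in fps if fp in live] for cid, fps in containers.items()}
```

A Bloom filter never reports a live chunk as dead, but it can report a dead chunk as live, so a sweep like this is safe but may retain some garbage.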
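Enumeration Phase (Logical GC)
[Figure: two files, F1 and F1’, whose Lp chunks (L6, L2, L1 and L6’, L2’, L1’, L1’’) sit above the L0 chunks; some chunks are marked “shared”]
Only Lp chunks are traversed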
Logical GC → Physical GC
• Logical enumeration performance is sensitive to the following parameters
– Total compression factor
– Number of small files
– Spatial locality of Lp
Physical GC addresses these performance issues
Physical GC (PGC)
• Uses breadth-first walk instead of per-file depth-first walk during enumeration
• Uses a Perfect Hash Vector (PHV) to store Lp chunks for assisting the breadth-first walk
– Uses less memory
– Needed for doing checksums to prevent corruption
• New analysis phase to build Perfect Hash Functions for Lp chunks
• Remaining phases are same as logical GC
[Figure: in-memory vectors]
LGC: live vector and live instance vector (Bloom filters)
PGC: live vector and live instance vector (Bloom filters), plus walk vector (PHV)
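A toy comparison of the two walks, counting how many Lp chunks each one reads. The `lp_chunks` table, its fingerprints and levels are hypothetical, a Python set stands in for the walk vector, and the logical walk is assumed not to skip subtrees it has already marked.

```python
from collections import defaultdict

# Hypothetical Lp chunks: fingerprint -> (level, child fingerprints).
# Each L1 is shared by two files, as happens with high deduplication.
lp_chunks = {
    "rootA": (6, ["l1a", "l1b"]),
    "rootB": (6, ["l1a", "l1c"]),
    "rootC": (6, ["l1b", "l1c"]),
    "l1a": (1, ["R", "S"]),
    "l1b": (1, ["T", "W"]),
    "l1c": (1, ["X", "Y"]),
}
file_roots = ["rootA", "rootB", "rootC"]


def logical_enumeration(roots):
    """Per-file depth-first walk: shared Lp chunks are read once per file."""
    live, lp_reads = set(), 0
    for root in roots:
        stack = [root]
        while stack:
            fp = stack.pop()
            live.add(fp)
            if fp in lp_chunks:
                lp_reads += 1
                stack.extend(lp_chunks[fp][1])
    return live, lp_reads


def physical_enumeration(roots):
    """Level-by-level walk (L6 down to L1): each live Lp chunk is read once."""
    by_level = defaultdict(list)
    for fp, (level, _) in lp_chunks.items():
        by_level[level].append(fp)
    walk = set(roots)          # stand-in for the walk vector
    live, lp_reads = set(roots), 0
    for level in sorted(by_level, reverse=True):
        for fp in by_level[level]:
            if fp in walk:
                lp_reads += 1
                children = lp_chunks[fp][1]
                live.update(children)
                walk.update(c for c in children if c in lp_chunks)
    return live, lp_reads


logical_live, logical_reads = logical_enumeration(file_roots)
physical_live, physical_reads = physical_enumeration(file_roots)
```

Both walks mark the same live set, but in this toy the logical walk reads nine Lp chunks and the physical walk reads six, and the gap grows with the amount of sharing.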
Collision Free - Perfect Hashing Vector (PHvec)
[Figure: a perfect hash function (PHF) maps fingerprint set S = {s1, s2, …, sn} (indices 0 to n - 1) to distinct positions 0 to m - 1 of a bit vector, with m ≥ n]
Collision-free hash function which maps a fingerprint to a unique position in a bit vector
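A toy perfect hash vector, assuming a brute-force seed search over a salted SHA-1. This is a stand-in technique that only works for small sets; it is not the construction DDFS uses, and `slot`, `build_phf` and `PerfectHashVector` are hypothetical names.

```python
import hashlib
from itertools import count


def slot(seed: int, fp: bytes, m: int) -> int:
    """Position of fingerprint fp in an m-slot vector under the given seed."""
    digest = hashlib.sha1(seed.to_bytes(4, "big") + fp).digest()
    return int.from_bytes(digest[:8], "big") % m


def build_phf(fingerprints, load: float = 0.7):
    """Try seeds until all n fingerprints land in distinct slots (m >= n)."""
    n = len(fingerprints)
    m = max(n, int(n / load))
    for seed in count():
        if len({slot(seed, fp, m) for fp in fingerprints}) == n:
            return seed, m


class PerfectHashVector:
    """Bit vector addressed through a perfect hash of a fixed fingerprint set."""

    def __init__(self, fingerprints):
        self.seed, self.m = build_phf(fingerprints)
        self.bits = bytearray(self.m)  # one byte per bit, for readability

    def set(self, fp: bytes) -> None:
        self.bits[slot(self.seed, fp, self.m)] = 1

    def test(self, fp: bytes) -> bool:
        return self.bits[slot(self.seed, fp, self.m)] == 1


S = [hashlib.sha1(name).digest() for name in (b"R", b"S", b"T", b"W", b"X", b"Y", b"Z")]
phv = PerfectHashVector(S)
phv.set(S[0])
phv.set(S[3])
marked = [phv.test(fp) for fp in S]
```

Unlike a Bloom filter, members of S never collide with each other, so a set bit identifies exactly one fingerprint. A fingerprint outside S still maps to some slot, so the function must be built over the complete set first.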
Analysis Phase
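[Figure: the on-disk container index is read to build in-memory perfect hash functions for the Lp fingerprints]
On-disk container index
FP    CID   type
fp1   10    L0
fp2   5     LP
fp3   30    LP
…     …     …
fpn   40
In-memory Perfect Hash functions of Lp (entries numbered 1, 2, 3, 4, …, #fps)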
Benefits & Costs of Physical Enumeration
• Pro: Sequential scan of containers on disk
– All L6, then all L5, down to L1s
– Relatively few containers store high-level metadata
– No need to keep revisiting same Lp containers due to fastcopy (high deduplication)
• Con: extra analysis cost doesn’t help “traditional” workloads
• … and due to pre-phases we may have to run analysis twice!
LGC and PGC phases (including pre-phases)
• Physical GC
1. Pre-merge
2. Pre-analysis
3. Pre-enumeration
4. Pre-filter
5. Pre-select
6. Merge
7. Analysis
8. Candidate
9. Enumeration
10. Filter
11. Copy
12. Summary
• Logical GC
1. Pre-merge
2. Pre-enumeration
3. Pre-filter
4. Pre-select
5. Candidate
6. Enumeration
7. Merge
8. Filter
9. Copy
10. Summary
Pre-phases / sampling phases: Physical GC 1-5, Logical GC 1-4
Physical GC → Phase-optimized Physical GC
• Limitations of Physical GC
– Adds 2 extra phases (pre-analysis and analysis)
– Slightly degrades GC performance for customers with traditional backup workloads
• Motivation for Phase-optimized Physical GC (PGC+)
– Avoid pre-phases by representing all chunks in memory
– Can we use a perfect hash as a live vector?
› Need only 2.7 bits per fingerprint instead of 6 bits in a Bloom filter
– Can we maintain duplicate recipe without using a Bloom filter?
› Get 50% memory back
[Figure: in-memory vectors]
PGC: live vector and live instance vector (Bloom filters), plus walk vector (PHV)
PGC+: walk vector (PHV) and live vector (PHV)
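A back-of-the-envelope calculation for the per-fingerprint figures on this slide (6 bits in a Bloom filter versus 2.7 bits in a perfect hash vector). The fingerprint count of 50 billion is a made-up example, not a number from the talk, and `vector_gib` is a hypothetical helper.

```python
def vector_gib(num_fingerprints: int, bits_per_fp: float) -> float:
    """Memory for one live vector, in GiB, at a given cost per fingerprint."""
    return num_fingerprints * bits_per_fp / 8 / 2**30


NUM_FPS = 50 * 10**9  # assumption: 50 billion fingerprints on the system

bloom_gib = vector_gib(NUM_FPS, 6.0)   # Bloom filter live vector
phv_gib = vector_gib(NUM_FPS, 2.7)     # perfect hash live vector
saving = 1 - phv_gib / bloom_gib       # fraction of that one vector saved
```

The saving for this one vector is 1 - 2.7/6 = 55% regardless of the fingerprint count. The smaller footprint is what makes it plausible to represent all chunks in memory and avoid the pre-phases.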
Phase-optimized Physical GC (PGC+) Phases
1. Merge
2. Analysis
3. Enumeration
4. Select
5. Copy
6. Summary
PGC+ Analysis and Enumeration
• Replace Bloom filter with Perfect Hash vector for tracking live and dead chunks
• In analysis phase build two Perfect hash vectors
– Lp vector called the walk vector (similar to PGC)
– Perfect Hash vector over all fingerprints (Lp + L0), called the live vector
• Perfect hashing optimizations
– NUMA-aware Perfect Hashing
– Cache prefetching of Perfect hash functions and values in the Perfect Hash Vector
PGC+ Copy phase
[Figure: containers C1 = fp1, fp2 and C2 = fp1, fp3, with the live vector bits for fp1 fp2 fp3 shown at each step]
Initial state: live vector 111
Process C2: live vector 010
Process C1: live vector 000
Dynamically remove duplicates during Copy phase
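One reading of the figure, sketched under assumptions: a chunk's live bit is cleared as soon as it is copied, so a later container holding a duplicate of that chunk skips it. A plain dict stands in for the perfect hash live vector, and `copy_container` and `snapshot` are hypothetical names.

```python
containers = {"C1": ["fp1", "fp2"], "C2": ["fp1", "fp3"]}
live = {"fp1": 1, "fp2": 1, "fp3": 1}  # stand-in for the live vector
copied = []                             # chunks written to new containers


def snapshot(vector) -> str:
    """Render the live bits in fp1 fp2 fp3 order, as drawn on the slide."""
    return "".join(str(vector[fp]) for fp in ("fp1", "fp2", "fp3"))


def copy_container(cid: str) -> None:
    """Copy chunks that are still marked live, clearing each bit once copied."""
    for fp in containers[cid]:
        if live[fp]:
            copied.append(fp)
            live[fp] = 0  # any later duplicate of fp is now treated as dead


states = [snapshot(live)]          # initial state
for cid in ("C2", "C1"):           # same processing order as the slide
    copy_container(cid)
    states.append(snapshot(live))
```

In this sketch the states are 111, 010 and 000, matching the slide: fp1 is copied from C2, so the duplicate fp1 in C1 is dropped without needing a separate live instance vector.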
Evaluation
• Deployed systems
– Comparison of GC runs for systems upgraded from LGC to PGC
• Controlled experiments on 4 systems
– Comparison of LGC vs PGC vs PGC+
› One phase versus two phase GC
– DD860 used as default for all experiments
– Workload used was a synthetic dataset similar to some past deduplication work (e.g., Botelho, et al., FAST 2012)
Systems                  DD2500   DD860     DD890    DD990
CPU (cores*GHz)          8*2.2    16*2.53   24*2.8   40*2.4
Mem (GB)                 64       70        94       256
Physical capacity (TB)   122      126       167      319
Deployed System Results: LGC vs PGC
• For high TC workloads, PGC improved over LGC by up to 20x
• For high file count workload, PGC improved over LGC by 7x
• 75% of systems upgraded from LGC to PGC suffered from some degradation but usually not much
– Hard to compare LGC vs PGC systems because of some other performance changes introduced with PGC
• Lab experiments to compare all GC variants with same performance parameters
GC on Different Platforms (36.6x TC)
At this TC, LGC2 is slightly better than PGC2, but PGC+ is better than both LGC2 and PGC2
High Total compression Workload
[Figure: bar chart of GC duration (hours, axis 0 to 120) for LGC, PGC and PGC+ at each total compression factor (TC): 36.6x, 73.2x, 147x, 293x, 586x, 1170x, 2340x. Legend: LGC2, LGC1, PGC2, PGC1, PGC+. A value label of 250 appears above the axis range.]
LGC duration scales with TC
PGC/PGC+ remain flat
High file Count Workload
[Figure: bar chart of GC duration (hours, axis 0 to 100) for LGC, PGC and PGC+. Legend: LGC2, LGC1, PGC2, PGC1, PGC+. A value label of 187 appears above the axis range.]
High file count (900M)
LGC1/LGC2 is orders of magnitude slower than PGC
Conclusions
• Shift in workloads required moving from depth-first based mark phase to breadth-first based mark phase
• PGC works better than LGC for very high TC datasets and large number of small files
• Due to extra phases and performance constraints introduced in PGC, PGC is not uniformly faster than LGC
• PGC+ uses various optimizations to improve over PGC, primarily by avoiding multiple mark phases
• PGC+ is significantly faster than LGC when 2 mark phases are required and orders of magnitude faster for problematic workloads