SMORE: A Cold Data Object Store for SMR Drives
TRANSCRIPT

Peter Macko, Xiongzi Ge, John Haskins Jr.*, James Kelley, David Slik, Keith A. Smith, and Maxim G. Smith
Advanced Technology Group, NetApp, Inc.
* Qualcomm (work performed while at NetApp)
© 2017 NetApp, Inc. All rights reserved.
Motivation
Growing demand for storing large, cold files:
- Media files, virtual machine libraries, backups
- A good fit for SMR drives: high-capacity, low-cost storage with large sequential reads and writes
Media files image: https://commons.wikimedia.org/wiki/File:Video-film.svg
Shingled Magnetic Recording (SMR)
Conventional HDDs Tracks do not overlap A guard band lies between adjacent tracks
SMR HDDs Tracks partially overlap, leaving enough width for
the read head (smaller than the write head) No overwrite in place
SMR in practice A zone can only be written sequentially or erased
(no random writes) Supports random reads Split into 256 MB sequential-only zones
A quick review
© 2017 NetApp, Inc. All rights reserved.3
http://www.seagate.com/tech-insights/breaking-areal-density-barriers-with-seagate-smr-master-ti/
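The zone rules above can be captured in a few lines. The sketch below is illustrative (the struct and its names are not from SMORE or libzbc): each zone tracks a write pointer, writes must land exactly at the pointer, reads may target any written offset, and reclaiming space means resetting the whole zone.

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative model of an SMR sequential-only zone.
struct SmrZone {
    static constexpr uint64_t kZoneSize = 256ull << 20;  // 256 MB zones
    uint64_t write_pointer = 0;

    // Writes must start exactly at the write pointer (sequential only).
    void write(uint64_t offset, uint64_t length) {
        if (offset != write_pointer)
            throw std::runtime_error("random writes are not allowed");
        if (write_pointer + length > kZoneSize)
            throw std::runtime_error("write past end of zone");
        write_pointer += length;
    }

    // Random reads are fine, as long as the data has been written.
    bool can_read(uint64_t offset, uint64_t length) const {
        return offset + length <= write_pointer;
    }

    // The only way to reclaim space: erase (reset) the entire zone.
    void reset() { write_pointer = 0; }
};
```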
Introduction
SMORE (SMR Object Repository)
- An object store for an array of SMR drives; supports PUT, GET, and DELETE
- Optimized for large, cold objects
- Assumes that frequently accessed or updated data is cached at higher layers of the storage stack
Requirements for SMORE: ingest and read at disk speed, low write amplification, reliable storage
S'more image: https://commons.wikimedia.org/wiki/File:Smores-Microwave.jpg
Agenda
1) Introduction
2) SMORE Architecture
3) Results
4) Conclusion
SMORE Architecture: High-level overview
Architecture overview
[Figure: three SMR disks, each divided into zones, with a superblock zone per disk and a B+ tree index on flash.]
- Flash holds a B+ tree index: a read-optimized version of the metadata, giving good lookup performance
- The SMR zones hold data plus a log of metadata updates, intermixed in a single stream, giving good write performance
- Long runs of data on disk enable fast reads
Superblock zones
- Write the superblock in a log-structured manner
- Replicate each superblock version to 2 or more drives
- Erase a superblock zone when full, always preserving the replication count of the latest superblock
[Figure: superblock versions 1-8, each appended to the superblock zones of two of the three disks in rotation.]
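The erase rule above can be sketched as a small invariant check. This is a hedged, simplified model (class and method names are invented for illustration): each superblock version is appended to the superblock zones of `replicas` disks in rotation, and a zone may only be erased if the latest version keeps its full replica count on the remaining disks.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of log-structured, replicated superblock writes.
class SuperblockLog {
public:
    SuperblockLog(int num_disks, int replicas)
        : copies_(num_disks), replicas_(replicas) {}

    // Append version v to `replicas_` superblock zones, chosen round-robin.
    void write(uint64_t version) {
        int n = static_cast<int>(copies_.size());
        for (int i = 0; i < replicas_; ++i)
            copies_[(next_ + i) % n].push_back(version);
        next_ = (next_ + replicas_) % n;
        latest_ = version;
    }

    // A superblock zone may be erased only if the latest superblock still
    // has `replicas_` copies on the other disks.
    bool can_erase(int disk) const {
        int survivors = 0;
        for (int d = 0; d < static_cast<int>(copies_.size()); ++d) {
            if (d == disk) continue;
            for (uint64_t v : copies_[d])
                if (v == latest_) ++survivors;
        }
        return survivors >= replicas_;
    }

private:
    std::vector<std::vector<uint64_t>> copies_;  // per-disk zone contents
    int replicas_;
    int next_ = 0;
    uint64_t latest_ = 0;
};
```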
Architecture overview
[Figure: the remaining zones on each disk are grouped into zone sets (Zone Sets 1-4) spanning all three disks, alongside the per-disk superblock zones and the flash B+ tree index.]
Architecture overview
[Figure: an incoming object is split into segments, staged in per-zone-set FIFO buffers, and written with a header and erasure coding (H+EC) across the zones of a zone set.]
- FIFO buffers consolidate small writes and improve performance (backed by NVRAM, optional)
- Segment size is on the order of tens of MB (configurable)
- Data is protected with n+1 erasure coding (configurable)
- Index entry example: Object 1, Segment 1 (zone set = 1, offset = 0, length = 16)
Architecture overview
[Figure (continued): as the object streams in, further segments are appended, filling one zone set and spilling into the next. Index entries: Object 1, Segment 1 (zone set = 1, offset = 0, length = 16); Object 1, Segment 2 (zone set = 1, offset = 16, length = 16); Object 1, Segment 3 (zone set = 2, offset = 0, length = 15).]
Zone sets
- A zone set is any n zones from n different disks, written and erased together; data is erasure-coded across the zones
- Write pointers always advance together
- A segment's location is (zone set ID, offset), which decreases the size of the index and makes it easy to move the contents of a zone
[Figure: Segments 1 and 2 stored as strips (1a/1b/1c, 2a/2b/2c) across Zone 2 of Disks 1-3.]
Zone sets
- Each segment is prefixed with a layout marker block, which uniquely identifies the segment and acts as a redo log entry for the index
- Objects are deleted by writing tombstone records, which are special layout marker blocks
- A digest, a summary of all layout marker blocks, is written at the end of each full zone set (helps with recovery)
[Figure: segment strips, each prefixed with a layout marker block, followed by tombstone records (T), with a digest at the end of each zone.]
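The redo-log role of layout markers can be sketched as follows. The types are simplified stand-ins (not SMORE's actual structures): replaying a zone set's markers in write order reconstructs its index entries, and a tombstone removes a previously indexed object.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative layout marker: one per segment, or a tombstone for a delete.
struct LayoutMarker {
    std::string object_id;
    uint32_t segment_no;
    bool tombstone;  // true for a tombstone record
};

using Index = std::map<std::string, std::vector<uint32_t>>;

// Replay markers in write order to rebuild index entries.
void replay(Index& index, const std::vector<LayoutMarker>& markers) {
    for (const auto& m : markers) {
        if (m.tombstone)
            index.erase(m.object_id);
        else
            index[m.object_id].push_back(m.segment_no);
    }
}
```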
Index recovery: a recovery-oriented design
1. Read the most recent superblock
2. Restore the index from the latest consistent snapshot
3. Identify which zone sets could have changed since the snapshot was taken
4. Read and replay the digests from full zone sets and individual layout marker blocks from incomplete zone sets
5. Optionally, take an index snapshot
[Figure: flash holds the index snapshot and the operational index; Zone Sets 1-5 on disk, with digests at the ends of the full zone sets; the animation highlights the zone sets replayed since the snapshot.]
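Step 4 can be sketched as a single loop over the changed zone sets. The types below are illustrative stand-ins, not SMORE's actual structures: a full zone set is replayed from its digest (one sequential read), while an incomplete zone set is scanned marker by marker.

```cpp
#include <map>
#include <string>
#include <vector>

struct Marker { std::string object_id; bool tombstone; };

// Illustrative view of one zone set during recovery.
struct ZoneSetView {
    bool full;                    // closed zone sets have a digest
    std::vector<Marker> digest;   // summary written at zone-set close
    std::vector<Marker> markers;  // individual layout marker blocks
};

// Replay only the zone sets that could have changed since the snapshot.
std::map<std::string, bool> replay_changed(const std::vector<ZoneSetView>& changed) {
    std::map<std::string, bool> live;  // object id -> alive?
    for (const auto& zs : changed)
        for (const auto& m : (zs.full ? zs.digest : zs.markers))
            live[m.object_id] = !m.tombstone;
    return live;
}
```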
Zone sets and recovery: the general case
- Any zones from different drives can form one zone set
- After a drive goes offline, reconstruct its zones into free zones on the other drives
- This is possible without having to update the index!
[Figure: a six-disk layout before and after a disk failure; the failed disk's members of Zone Sets 2, 4, and 5 are rebuilt into free zones on the surviving disks, with the segment strips intact.]
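The "no index update" property follows from the indirection. In this hedged sketch (names invented for illustration), the object index stores only (zone set ID, offset) locations, while a separate, much smaller map records which physical zone on which disk backs each member of a zone set; rebuilding a failed disk's zone touches only that small map.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct PhysicalZone { int disk; int zone; };

// Maps each zone set ID to the physical zones that back it.
struct ZoneSetMap {
    std::map<uint32_t, std::vector<PhysicalZone>> members;

    // Point the failed member of `set_id` at a freshly rebuilt spare zone.
    // All (zone set ID, offset) entries in the object index stay valid.
    void repair(uint32_t set_id, int failed_disk, PhysicalZone spare) {
        for (auto& pz : members[set_id])
            if (pz.disk == failed_disk)
                pz = spare;
    }
};
```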
Garbage collection: the dead space algorithm
Reclaim space using garbage collection, with the "dead space" algorithm:
1. Maintain an accurate measure of dead space for each zone set
2. Choose the zone set with the most dead space
3. Relocate live data to the currently open zone set
4. Erase the zone set
This works reasonably well for cold data (no need to maintain hot-cold separation).
[Figure: live segments 3 and 4 relocated out of a mostly-dead zone set before it is erased.]
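Step 2 of the algorithm is a simple greedy selection. A minimal sketch (the function name and representation are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Pick the next cleaning candidate: the zone set with the most dead bytes.
// Returns -1 if no zone set has any dead space.
int pick_gc_candidate(const std::vector<uint64_t>& dead_bytes) {
    int best = -1;
    uint64_t most = 0;
    for (int i = 0; i < static_cast<int>(dead_bytes.size()); ++i) {
        if (dead_bytes[i] > most) {
            most = dead_bytes[i];
            best = i;
        }
    }
    return best;
}
```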
SMORE Results
Experimental setup (1/2): hardware platform
- 32-core Intel Xeon 2.90 GHz, 128 GB RAM; the experiments used only 1.5 GB RAM
- 6x HGST Ultrastar Archive Ha10, 10 TB each
- Used only every 60th zone to reduce the capacity while preserving the full seek profile
- 5+1 parity within each zone set
- Total system capacity after format: 766 GB
- Confirmed that the results are representative by running selected workloads on a full 50 TB system
Experimental setup (2/2): software & workload
- SMORE is implemented as a C++ library (~19k LOC) with a custom I/O stack built on top of libzbc
- Synthetic workload generator parameterized by object size, target utilization, read-write ratio, etc.
- Random object deletion, which produces the worst GC behavior
- Mixed object sizes follow the distribution from the European Centre for Medium-Range Weather Forecasts
- Unless stated otherwise: 80% target utilization (amount of live data as a percentage of storage capacity), 6 clients
Initial ingest
- Average ingest rate: 282 MB/s, about 100% of the maximum disk bandwidth
- Mostly independent of the number of clients: the same results for 3-12 clients, a slightly lower rate for 24 clients
[Chart: aggregate bandwidth (MB/s, 0-350) per object size (1 MB, 10 MB, 100 MB, 1 GB, mixed), with a line marking the theoretical best ingest performance.]
Steady-state write amplification
For write-only workloads, overwriting random objects (theoretical best: 1.20)
[Charts: write amplification (0-6) vs. object size (1 MB, 10 MB, 100 MB, 1 GB, mixed) at 70%, 80%, and 90% utilization, and vs. number of clients (3, 6, 12, 24) for 1 MB and 1 GB objects, with a line marking the theoretical best write amplification.]
Steady-state read performance
For read-only workloads, reading random objects (theoretical best: 590 MB/s)
[Charts: read performance (MB/s, 0-600) vs. object size (1 MB, 10 MB, 100 MB, 1 GB, mixed) at 70%, 80%, and 90% utilization, and vs. number of clients (3, 6, 12, 24) for 1 MB and 1 GB objects, with a line marking the theoretical best read performance.]
Time to create and to recover an index
[Charts: time to create an index snapshot (seconds, 0-0.6) per object size, split into "snapshot to flash" and "copy to SMR"; and time to recover (seconds, 0-8) vs. the number of examined zone sets (0-700), corresponding to 0-4.5 hours since the index snapshot was taken.]
Conclusion: SMORE
- Fast ingest: large segments of data plus a log of metadata updates on SMR drives
- Fast streaming reads: fast lookup using a read-optimized version of the metadata index on flash, and large, (almost) contiguous segments of data on SMR drives
- Efficient recovery: efficient index rebuild through zone set digests, and data relocation supported by indirection through zone sets
S'more image: https://commons.wikimedia.org/wiki/File:Smores-Microwave.jpg
Thank you

Backup slides: additional information
Zone sets: the state transition diagram
States:
- EMPTY – the zone set is empty and not currently used by SMORE
- AVAILABLE – the zone set is empty, but is available for opening
- OPEN – the zone set can receive writes
- CLOSED – the zone set is full (does not accept any more writes)
- INDEXED – the zone set was CLOSED when the index snapshot was taken (it does not need to be examined during index recovery)
Transitions:
- EMPTY → AVAILABLE: allocate*
- AVAILABLE → OPEN: open
- OPEN → CLOSED: write the digest, close
- CLOSED → INDEXED: snapshot the index*
- INDEXED → EMPTY: clean, trim the zone set
* Involves writing the superblock
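The state machine above can be sketched directly; the transition names follow the diagram, while the enforcement logic is illustrative.

```cpp
#include <stdexcept>

enum class ZoneSetState { EMPTY, AVAILABLE, OPEN, CLOSED, INDEXED };

struct ZoneSet {
    ZoneSetState state = ZoneSetState::EMPTY;

    void allocate() { expect(ZoneSetState::EMPTY);     state = ZoneSetState::AVAILABLE; }  // writes the superblock
    void open()     { expect(ZoneSetState::AVAILABLE); state = ZoneSetState::OPEN; }
    void close()    { expect(ZoneSetState::OPEN);      state = ZoneSetState::CLOSED; }     // writes the digest
    void index()    { expect(ZoneSetState::CLOSED);    state = ZoneSetState::INDEXED; }    // index snapshot, writes the superblock
    void clean()    { expect(ZoneSetState::INDEXED);   state = ZoneSetState::EMPTY; }      // GC: trim the zones

private:
    void expect(ZoneSetState s) {
        if (state != s) throw std::logic_error("invalid zone set transition");
    }
};
```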