smore: a cold data object store for smr...

32
SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John Haskins Jr.*, James Kelley, David Slik, Keith A. Smith, and Maxim G. Smith Advanced Technology Group NetApp, Inc. * Qualcomm (work performed while at NetApp) © 2017 NetApp, Inc. All rights reserved. 1

Upload: others

Post on 28-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

SMORE: A Cold Data Object Store for SMR DrivesPeter Macko, Xiongzi Ge, John Haskins Jr.*, James Kelley, David Slik, Keith A. Smith, and Maxim G. Smith

Advanced Technology GroupNetApp, Inc.

* Qualcomm (work performed while at NetApp)© 2017 NetApp, Inc. All rights reserved.1

Page 2: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Storage for large, cold files

Demand for storage large, cold files

Motivation

© 2017 NetApp, Inc. All rights reserved.2

Media Files: https://commons.wikimedia.org/wiki/File:Video-film.svg

Media Files Virtual Machine Libraries Backups

Good fit for SMR drives! High-capacity, low-cost storage

Large sequential reads and writes

Page 3: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Shingled Magnetic Recording (SMR)

Conventional HDDs Tracks do not overlap A guard band lies between adjacent tracks

SMR HDDs Tracks partially overlap, leaving enough width for

the read head (smaller than the write head) No overwrite in place

SMR in practice A zone can only be written sequentially or erased

(no random writes) Supports random reads Split into 256 MB sequential-only zones

A quick review

© 2017 NetApp, Inc. All rights reserved.3

http://www.seagate.com/tech-insights/breaking-areal-density-barriers-with-seagate-smr-master-ti/

Page 4: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

SMORE (SMR Object Repository)

Object store for an array of SMR drives Supports PUT, GET, and DELETE

Optimized for large, cold objects

Assume that frequently accessed or updated data are cached at higher layers of the storage stack

Requirements for SMORE: Ingest & read at disk speed Low write amplification Reliable storage

Introduction

© 2017 NetApp, Inc. All rights reserved.4

S'more: https://commons.wikimedia.org/wiki/File:Smores-Microwave.jpg

Page 5: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

SMORE: A Cold Data Object Store for SMR Drives

1) Introduction

2) SMORE Architecture

3) Results

4) Conclusion

Agenda

© 2017 NetApp, Inc. All rights reserved.5

Page 6: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

High-level overviewSMORE Architecture

© 2017 NetApp, Inc. All rights reserved.6

Page 7: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

© 2017 NetApp, Inc. All rights reserved.7

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 1

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 2

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 3

Superblock Superblock Superblock

Flash (B+ tree index)

Data + log of metadata updates(intermixed in a single stream)

Read-optimized versionof the metadata

Good lookup performance

Good writeperformance

Long runsof data

fast reads

Page 8: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

Write the superblock in a log-structured manner

Replicate the superblock to 2 or more drives

Erase a zone when full, always preserving the replication count of the latest superblock

Superblock zones

© 2017 NetApp, Inc. All rights reserved.8

Disk 1, Zone 1 1

1Disk 2, Zone 1

Disk 3, Zone 1

4

4

7

72

2

3

3

5

5

6

6

8

8

Page 9: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

Write the superblock in a log-structured manner

Replicate the superblock to 2 or more drives

Erase a zone when full, always preserving the replication count of the latest superblock

Superblock zones

© 2017 NetApp, Inc. All rights reserved.9

Disk 1, Zone 1

1Disk 2, Zone 1

Disk 3, Zone 1

4 72

2 3

5

5 6

8

8 9

9

Page 10: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

© 2017 NetApp, Inc. All rights reserved.10

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 1

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 2

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1SuperblockSuperblockSuperblock

Disk 3

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock

Flash (B+ tree index)

Page 11: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

© 2017 NetApp, Inc. All rights reserved.11

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 1

……

FIFO Buffers

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 2

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1SuperblockSuperblockSuperblock

Disk 3

Segments

H+EC

Object

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock……

FIFO Buffers

……

FIFO Buffers

Flash (B+ tree index)

Object 1, Segment 1 (zone set = 1, offset = 0, length = 16)

Consolidate small writes & improve performance (backed by NVRAM, optional)

Size on the order of 10’s of MB (configurable)

n+1 EC(configurable)

Page 12: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

© 2017 NetApp, Inc. All rights reserved.12

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 1

……

FIFO Buffers

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 2

Zone 5

Zone 4

Zone 3

Zone 2

Zone 1

Disk 3

H+EC H+EC

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock

Zone Set 4

Zone Set 3

Zone Set 2

Zone Set 1

Superblock……

FIFO Buffers

……

FIFO Buffers

Flash (B+ tree index)

Object 1, Segment 1 (zone set = 1, offset = 0, length = 16)Object 1, Segment 2 (zone set = 1, offset = 16, length = 16)Object 1, Segment 3 (zone set = 2, offset = 0, length = 15)

SegmentsObject

Page 13: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

Zone set = any n zones from n different disks Written and erased together Data is erasure-coded across the zones

Write pointers always advance together

Zone sets

© 2017 NetApp, Inc. All rights reserved.13

Disk 1, Zone 2

Disk 2, Zone 2

Disk 3, Zone 2

Location of a segment of data:(zone set ID, offset) Decreases the size of the index Easily move contents of a zone

Segment 1a

Segment 1b

Segment 1c

Segment 2a

Segment 2b

Segment 2c

Page 14: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Architecture overview

Each segment is prefixed with a layout marker block Uniquely identifies the segment and acts as a redo log entry for the index Delete objects by writing tombstone records, which are special layout marker blocks

Write a digest, a summary of all layout marker blocks, at the end of each full zone set (helps with recovery)

Zone sets

© 2017 NetApp, Inc. All rights reserved.14

Disk 1, Zone 2

Disk 2, Zone 2

Disk 3, Zone 2

Segment 1a

Segment 1b

Segment 1c

Segment 2a

Segment 2b

Segment 2c

T

T

T

Digest

Digest

Digest

Layout marker blockTombstone record Digest

Page 15: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Index recovery

1. Read the most recent superblock

2. Restore the index from the latest consistent snapshot

3. Identify which zone sets could have changed since the snapshot was taken

4. Read and replay the digests from full zone sets and individual layout marker blocks from incomplete zone sets

5. Optionally, take an index snapshot

A recovery-oriented design

© 2017 NetApp, Inc. All rights reserved.15

Flash

Superblock

Zone Set 1

Zone Set 2

Zone Set 3

Zone Set 4

Zone Set 5

Index Snapshot

Digest

Digest

Page 16: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Index recovery

1. Read the most recent superblock

2. Restore the index from the latest consistent snapshot

3. Identify which zone sets could have changed since the snapshot was taken

4. Read and replay the digests from full zone sets and individual layout marker blocks from incomplete zone sets

5. Optionally, take an index snapshot

A recovery-oriented design

© 2017 NetApp, Inc. All rights reserved.16

Flash

Superblock

Zone Set 1

Zone Set 2

Zone Set 3

Zone Set 4

Zone Set 5

Index Snapshot

Digest

Digest

Index Snapshot

Operational Index

Page 17: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Index recovery

1. Read the most recent superblock

2. Restore the index from the latest consistent snapshot

3. Identify which zone sets could have changed since the snapshot was taken

4. Read and replay the digests from full zone sets and individual layout marker blocks from incomplete zone sets

5. Optionally, take an index snapshot

A recovery-oriented design

© 2017 NetApp, Inc. All rights reserved.17

Flash

Superblock

Zone Set 1

Zone Set 2

Zone Set 3

Zone Set 4

Zone Set 5

Index Snapshot

Digest

Digest

Operational Index

Zone Set 3*

Zone Set 4*

Page 18: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Index recovery

1. Read the most recent superblock

2. Restore the index from the latest consistent snapshot

3. Identify which zone sets could have changed since the snapshot was taken

4. Read and replay the digests from full zone sets and individual layout marker blocks from incomplete zone sets

5. Optionally, take an index snapshot

A recovery-oriented design

© 2017 NetApp, Inc. All rights reserved.18

Flash

Superblock

Zone Set 1

Zone Set 2

Zone Set 3

Zone Set 4

Zone Set 5

Index Snapshot

Digest

Digest

Operational Index

Zone Set 3*

Zone Set 4*

Page 19: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Index recovery

1. Read the most recent superblock

2. Restore the index from the latest consistent snapshot

3. Identify which zone sets could have changed since the snapshot was taken

4. Read and replay the digests from full zone sets and individual layout marker blocks from incomplete zone sets

5. Optionally, take an index snapshot

A recovery-oriented design

© 2017 NetApp, Inc. All rights reserved.19

Flash

Superblock

Zone Set 1

Zone Set 2

Zone Set 3

Zone Set 4

Zone Set 5

Index Snapshot

Digest

Digest

Operational IndexOperational Index

Index Snapshot

Page 20: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Zone sets and recovery

The general case: Any zones from different drives can form one zone set

Reconstruct the zones from the offline disk into the free zones on other drives

We can do this without having to update the index!

The general case

© 2017 NetApp, Inc. All rights reserved.20

Zone 6Zone 5Zone 4Zone 3Zone 2Zone 1

Zone Set 5Zone Set 3Zone Set 1Superblock

Zone 6Zone 5Zone 4Zone 3Zone 2Zone 1

Zone Set 5Zone Set 3Zone Set 1Superblock

Zone 6Zone 5Zone 4Zone 3Zone 2Zone 1

Zone Set 6Zone Set 3Zone Set 1Superblock

Zone 6Zone 5Zone 4Zone 3Zone 2Zone 1

Zone Set 6Zone Set 4Zone Set 2Superblock

Zone 6Zone 5Zone 4Zone 3Zone 2Zone 1

Zone Set 6Zone Set 4Zone Set 2Superblock

Zone 6Zone 5Zone 4Zone 3Zone 2Zone 1

Zone Set 5Zone Set 4Zone Set 2Superblock

Zone 6Zone Set 2

Zone 4Zone 3Zone 2Zone 1

Zone Set 5Zone Set 3Zone Set 1Superblock

Zone 6Zone Set 5

Zone 4Zone 3Zone 2Zone 1

Zone Set 6Zone Set 4Zone Set 2Superblock

Zone 6Zone Set 4

Zone 4Zone 3Zone 2Zone 1

Zone Set 5Zone Set 3Zone Set 1Superblock

Page 21: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Segment 1a Segment 2a D.Segment 3a

Segment 1b Segment 2b D.Segment 3b

Segment 1c Segment 2c D.Segment 3c

Segment 1a Segment 2a D.Segment 3a

Segment 1b Segment 2b D.Segment 3b

Segment 1c Segment 2c D.Segment 3c

Garbage collection

Reclaim space using garbage collection

Use the “dead space” algorithm:1. Maintain an accurate measure of dead space for each zone set2. Choose the zone set with most dead space3. Relocate live data to the currently open zone set4. Erase the zone set

Works reasonably well for cold data (no need to maintain hot-cold separation)

The dead space algorithm

© 2017 NetApp, Inc. All rights reserved.21

Segment 4a

Segment 4b

Segment 4c

Segment 3a

Segment 3b

Segment 3c

Page 22: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

SMOREResults

© 2017 NetApp, Inc. All rights reserved.22

Page 23: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Experimental setup (1/2)

32-core Intel Xeon 2.90 GHz, 128 GB RAM The experiments used only 1.5 GB RAM

6x HGST Ultrastar Archive Ha10, 10 TB each Used only every 60th zone to reduce the capacity while preserving full seek profile 5+1 parity within each zone set Total system capacity after format: 766 GB Confirmed that the results are representative by running selected workloads on a full 50 TB system

Hardware Platform

© 2017 NetApp, Inc. All rights reserved.23

Page 24: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Experimental setup (2/2)

SMORE Implemented as a C++ library ~19k LOC Custom I/O stack built on top of libzbc

Workload generator Synthetic workload generator parameterized by object size, target utilization, read-write ratio, etc. Random object deletion, which produces the worst GC behavior Mixed object sizes follow the distribution from the European Center for Medium-Range Weather Forecasts

Unless stated otherwise: 80% target utilization (amount of live data as a percentage of storage capacity) 6 clients

Software & Workload

© 2017 NetApp, Inc. All rights reserved.24

Page 25: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Average ingest rate: 282 MB/s About 100% of max disk bandwidth

Mostly independent of the number of clients The same results for 3-12 clients Slightly lower rate for 24 clients

Initial ingest

0

50

100

150

200

250

300

350

1 MB 10 MB 100 MB 1 GB Mixed

Aggr

egat

e Ba

ndw

idth

(MB/

s)

Object size

Per object size

© 2017 NetApp, Inc. All rights reserved.25

Theoretical best ingest performance

Page 26: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

0

1

2

3

4

5

6

1 MB 10 MB 100 MB 1 GB Mixed

Writ

e Am

plifi

catio

n

Object Size

Varying object size and utilization

70% 80% 90%

Steady-state write amplification

0

1

2

3

4

5

6

3 6 12 24

Writ

e Am

plifi

catio

n

Number of Clients

Varying number of clients and object size

1 MB objects 1 GB objects

For write-only workloads, overwriting random objects (theoretical best: 1.20)

© 2017 NetApp, Inc. All rights reserved.26

Theoretical best write amplification

Page 27: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

0

100

200

300

400

500

600

1 MB 10 MB 100 MB 1 GB Mixed

Rea

d Pe

rform

ance

(MB/

s)

Object Size

Varying object size and utilization

70% 80% 90%

Steady-state read performance

0

100

200

300

400

500

600

3 6 12 24

Rea

d Pe

rform

ance

(MB/

s)

Number of Clients

Varying number of clients and object size

1 MB objects 1 GB objects

For read-only workloads, reading random objects (theoretical best: 590 MB/s)

© 2017 NetApp, Inc. All rights reserved.27

Theoretical best read performance

Page 28: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

0

0.1

0.2

0.3

0.4

0.5

0.6

1 MB 10 MB 100 MB 1 GB Mixed

Tim

e (s

econ

ds)

Object Size

Time to create an index snapshot

Snapshot to Flash Copy to SMR

Index recovery

0

1

2

3

4

5

6

7

8

0 100 200 300 400 500 600 700

Tim

e (s

econ

ds)

Number of Examined Zone Sets

Time to recover

Time to create and to recover an index

© 2017 NetApp, Inc. All rights reserved.28

(0 – 4.5 hours since the index snapshot was taken)

Page 29: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Conclusion

Fast ingest Large segments of data + log of metadata

updates on SMR drives

Fast streaming reads Fast lookup using a read-optimized version of the

metadata index on a flash Large (almost) contiguous segments of data on

SMR drives

Efficient recovery Efficient index rebuild through zone set digests Data relocation supported by indirection through

zone sets

SMORE

© 2017 NetApp, Inc. All rights reserved.29

S'more: https://commons.wikimedia.org/wiki/File:Smores-Microwave.jpg

Page 30: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

© 2017 NetApp, Inc. All rights reserved.30

Thank you

Page 31: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Additional informationBack up slides

© 2017 NetApp, Inc. All rights reserved.31

Page 32: SMORE: A Cold Data Object Store for SMR Drivesstorageconference.us/2017/Presentations/ColdDataObject...SMORE: A Cold Data Object Store for SMR Drives Peter Macko, Xiongzi Ge, John

Zone setsThe state transition diagram

© 2017 NetApp, Inc. All rights reserved.32

EMPTY

AVAILABLE

OPEN

CLOSED

INDEXED

Allocate*

Open

Write the digest, close

Snapshot the index*

Clean, trimthe zone set

* Involves writing the superblock

– the zone set is empty and not currently used by SMORE

– the zone set is empty, but is available for opening

– the zone set can receive writes

– the zone set is full (does not accept any more writes)

– the zone set was CLOSED when the index snapshot was taken(it does not need to be examined during index recovery)