pelican: a building block for exascale cold data storage · 2017-07-14 · pelican: a building...

27
Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, Dave Harper, Sergey Legtchenko , Aaron Ogus, Eric Peterson, Antony Rowstron Microsoft Research 1

Upload: others

Post on 21-Jan-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican: A building block for exascale cold data storage

Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, Dave Harper, Sergey Legtchenko, Aaron

Ogus, Eric Peterson, Antony Rowstron

Microsoft Research

1

Page 2: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Background: cold data in the cloud

• Cold data: “written once – read rarely” access pattern

• Large fraction of stored data

Hot

Warm

Cold

Archival

?

Hot tier• Provisioned for peak• High throughput• Low latency• High cost

Archive tier• Low cost• High latency (hours)

$$$$$

$$$$

$$$

$

2

Page 3: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Background: cold data in the cloud

• Cold data: “written once – read rarely” access pattern

• Large fraction of stored data

Hot

Warm

Cold

Archival

Hot tier• Provisioned for peak• High throughput• Low latency• High cost

$$$$$

$$$$

$$$

3

$

Pelican vs. Tape:• Better performance• Similar cost

Cold tier:• High density• Low hardware cost• Low operating cost• Latency lower than

tape

Page 4: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Right-provisioning

• Provision resources just for the cold data workload:–Disks:

• Archival and SMR instead of commodity–Power

–Cooling

–Bandwidth

Enough for bandwidth required by workload instead of

for all disks spinning

–Servers:

• Enough for data management instead of 1 server/ 40 disks

• Benefits of removing unnecessary resources:–High density of storage

– Low hardware cost

– Low operating cost (capped performance)4

Page 5: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican: rack-scale appliance for cold data

• Converged design:

– Power, cooling, mechanical, storage & software co-designed

• Right-provisioned for cold data workload:

– Resources for just workload requirements

• At most 8% disks spun up

• 2 servers

• No Top of Rack switch

– 4x 10Gbps uplinks from the servers

• 1,152 disks in 52U: 22 disks/U

• 5+ PB of raw storage

Pelican rack prototype

Other disk-based storage:Up to 15/U

5

Page 6: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican: rack-scale appliance for cold data

• Converged design:

– Power, cooling, mechanical, storage & software co-designed

• Right-provisioned for cold data workload:

– Resources for just workload requirements

• At most 8% disks spun up

• 2 servers

• No Top of Rack switch

– 4x 10Gbps uplinks from the servers

• 1,152 disks in 52U: 22 disks/U

• 5+ PB of raw storage

Pelican rack prototype

Other disk-based storage:Up to 15/U

• Total cost of ownership comparable to tape• Lower latency than tape

Challenging resource limitations managed in software

6

Page 7: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican storage stack: handling right-provisioning

• Co-designed with hardware

• Constraints over sets of active disks:

–Hard: power, cooling, failure domains

– Soft: bandwidth, vibration

Placement

IOs to disks

*.sys

kerneluserspace

requests Blob

store API

Scheduler

• Software challenges:

– Data placement: concurrency of requests

– IO scheduling: minimize spin ups, fairness

– Recovery: minimize window of vulnerability

In this talk

7

Page 8: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Impact of right provisioning on resources

• Systems provisioned for peak performance:

– Any disk can be active at any time

• Right-provisioned system:

– Disk part of a domain for each resource

– Domain supplies limited resources

– Disk active if enough resources in all its domains

• Pelican domains:

– power, cooling, vibration, bandwidth

• Resource limitations:

– 2 active out of 16 per power domain

– 1 active out of 12 per cooling domain

– 1 active out of 2 per vibration domain

Disk d

Power domain of d

Cooling domain of d

Rack: 3D array of disks

8

Page 9: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Data placement: maximizing request concurrency

• Blob erasure-encoded on a set of concurrently active disks

• In fully provisioned systems:

– Any two sets can be active

– No impact of placement on concurrency

• In right-provisioned systems:

– Sets can conflict in resource requirements

– Conflicting cannot be concurrently active

– Challenge: form sets that minimize P Disks of blob 1

Rack: 3D array of disks

First approach: random placement

Disks of blob 2

ConflictConflict

9

Page 10: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Data placement: maximizing request concurrency

• Blob erasure-encoded on a set of concurrently active disks

• In fully provisioned systems:

– Any two sets can be active

– No impact of placement on concurrency

• In right-provisioned systems:

– Sets can conflict in resource requirements

– Conflicting cannot be concurrently active

– Challenge: form sets that minimize P Disks of blob 1

Rack: 3D array of disks

First approach: random placement

Disks of blob 2

ConflictConflict

Random placement: Storing blobs on n disks,

P O(n²)Conflict

10

Page 11: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican data placement

• Intuition: concentrate all conflicts over a few sets of disks

• Statically partition disks in groups in which disks can be concurrently active

• Property:

– Either fully conflicting

– Or fully independent

Schematic side-view of the rack

Power domain

Co

olin

g d

om

ain

• Blob is stored in one group

– P O(n)Conflict

• Groups encapsulate constraints:

– Unit of IO scheduling

– No constraint management at runtime

11

Page 12: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican data placement

• Intuition: concentrate all conflicts over a few sets of disks

• Statically partition disks in groups in which disks can be concurrently active

• Property:

– Either fully conflicting

– Or fully independent

Schematic side-view of the rack

Power domain

Co

olin

g d

om

ain

• Blob is stored in one group

– P O(n)Conflict

• Groups encapsulate constraints:

– Unit of IO scheduling

– No constraint management at runtime

12

Page 13: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican data placement

• Intuition: concentrate all conflicts over a few sets of disks

• Statically partition disks in groups in which disks can be concurrently active

• Property:

– Either fully conflicting

– Or fully independent

Schematic side-view of the rack

Power domain

Co

olin

g d

om

ain

Class: 12 fully-conflicting groups

• Blob is stored in one group

– P O(n)Conflict

• Groups encapsulate constraints:

– Unit of IO scheduling

– No constraint management at runtime

13

Page 14: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Pelican data placement

• Intuition: concentrate all conflicts over a few sets of disks

• Statically partition disks in groups in which disks can be concurrently active

• Property:

– Either fully conflicting

– Or fully independent

Schematic side-view of the rack

Power domain

Co

olin

g d

om

ain

Class: 12 fully-conflicting groups

• 48 groups of 24 disks

– 4 classes of 12 fully-conflicting groups

– Class is independent: concurrency = 4

• Blob is stored over 18 disks

– 15+3 erasure coding• Blob is stored in one group

– P O(n)Conflict

• Groups encapsulate constraints:

– Unit of IO scheduling

– No constraint management at runtime

14

Page 15: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

IO Scheduling: “spin up is the new seek”

Four independent schedulers

Each scheduler: 12 groups, only one can be active

• Naïve scheduler: FIFO

– Avg. group activation time: 14.2 sec

– High probability of spinup after each request

– Time is spent doing spinups!

Time

Spin upSpin upSpin up… …

• Pelican scheduler: Request batching

– Limit on maximum re-ordering

– Trade-off between throughput and fairness

– Weighted fair-share between client and rebuild traffic

Time

Spin up

IO batch

Spin up ……IO batch

15

Page 16: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Outline: challenges of right-provisioning

1. Challenge: conflicts in domains reduce concurrency

Solution: constraint-aware data placement

2. Challenge: “spinup is the new seek”

Solution: IO scheduler that amortizes spinup latency

Last part of the talk:

Performance impact of right-provisioning

16

Page 17: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Evaluating impact of right-provisioning

• Pelican vs. rack with all disks active (called FP)

• Cross-validated discrete-event simulator

• Metrics (more in the paper):

– Rack throughput

– Latency (time to first byte)

– Power consumption

• Open loop workload:

– Poisson arrival process

– Read requests on 1GB blobs

– Varying workload rate up to 8 requests/s

17

Page 18: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

First step: simulator cross-validation

• Burst workload, varying burst intensity

Simulator accurately predicts real system behaviour for all metrics.See paper for more results.

0

2

4

6

8

10

8 32 128 512 2048

Thro

ugh

pu

t (G

bp

s)

Simulator

Rack

1

10

100

1000

8 32 128 512 2048Ti

me

to f

irst

byt

e (s

)

Simulator

Rack

Burst intensity (#req/burst) Burst intensity (#req/burst)

18

Page 19: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Rack throughput

0

10

20

30

40

0.0625 0.125 0.25 0.5 1 2 4 8Avg

. th

rou

ghp

ut

(Gb

ps)

Workload rate (req/s)

Random placement

19

Page 20: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Rack throughput

0

10

20

30

40

0.0625 0.125 0.25 0.5 1 2 4 8Avg

. th

rou

ghp

ut

(Gb

ps)

Workload rate (req/s)

FP

Random placement

20

Page 21: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Rack throughput

0

10

20

30

40

0.0625 0.125 0.25 0.5 1 2 4 8Avg

. th

rou

ghp

ut

(Gb

ps)

Workload rate (req/s)

FP

Pelican

Random placement

21

Page 22: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Time to first byte

0.0001

0.001

0.01

0.1

1

10

100

1000

10000

0.0625 0.125 0.25 0.5 1 2 4 8

Tim

e to

fir

st b

yte

(sec

)

Workload rate (req/s)

FP Pelican

14.2 seconds: average time to activate group

22

Page 23: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Power consumption

0

2

4

6

8

10

12

0.0625 0.125 0.25 0.5 1 2 4 8

Agg

rega

te d

isk

po

wer

d

raw

(kW

)

Workload rate (req/s)

All disks spun down

1.8kW

23

Page 24: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Power consumption

0

2

4

6

8

10

12

0.0625 0.125 0.25 0.5 1 2 4 8

Agg

rega

te d

isk

po

wer

d

raw

(kW

)

Workload rate (req/s)

All disks spun down All disks active

1.8kW

10.8kW

24

Page 25: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Power consumption

0

2

4

6

8

10

12

0.0625 0.125 0.25 0.5 1 2 4 8

Agg

rega

te d

isk

po

wer

d

raw

(kW

)

Workload rate (req/s)

Pelican average All disks spun down

All disks active

1.8kW

10.8kW

25

Page 26: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Power consumption: 3x lower peak

0

2

4

6

8

10

12

0.0625 0.125 0.25 0.5 1 2 4 8

Agg

rega

te d

isk

po

wer

d

raw

(kW

)

Workload rate (req/s)

Pelican average All disks spun down

All disks active Pelican peak

1.8kW

10.8kW

3.7kW

26

Page 27: Pelican: A building block for exascale cold data storage · 2017-07-14 · Pelican: A building block for exascale cold data storage Shobana Balakrishnan, Richard Black, Austin Donnelly,

Conclusion

• Rack-scale hardware/software co-design– Storage right-provisioned for cold data workload

– Efficient constraint-aware software storage stack

• Prototype rack storing 5+ PB of raw data in 52U

• Challenging design process:

– Many constraints to handle manually

– Sensitive to hardware changes

• Follow up work:– “Flamingo: Synthesizing cold storage stacks for Pelican-like systems”

– See our poster in tonight’s session

27