Mantle: A Programmable Metadata Load Balancer for the Ceph File System

Michael A. Sevilla, Noah Watkins, Carlos Maltzahn, Ike Nassi, Scott A. Brandt, Sage A. Weil*, Greg Farnum*, Sam Fineberg^

UC Santa Cruz, *Red Hat, ^HP Storage

Published at Supercomputing 2015

Michael Sevilla, Mantle, Symposium '15


Outline
1. FS Metadata Mngmt
2. CephFS Background
3. Complexity of DSP
4. Mantle
5. Evaluation


Separating Metadata & Data IO

File System


Separating Metadata & Data IO

Distributed File System
• metadata service
• object store


History: A Simple Solution

• 1 MDS (metadata server) is insufficient [McKusick et al., login; '10], [Beaver et al., OSDI '10], [Thusoo et al., SIGMOD '10]

• How do we distribute metadata?


History: Scalable Solutions

1. Hash file identifier
2. Subtree partitioning
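The two placement strategies named on this slide can be contrasted with a minimal Python sketch. The function names, the example subtree table, and the choice of SHA-1 are illustrative assumptions; they are not taken from the talk or from any particular file system.

```python
import hashlib


def mds_by_hash(path: str, num_mds: int) -> int:
    """Hash the file identifier (here, the full path) to pick a metadata server."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_mds


def mds_by_subtree(path: str, subtree_owner: dict[str, int]) -> int:
    """Pick the server that owns the deepest subtree containing the path."""
    parts = path.strip("/").split("/")
    # Walk from the deepest ancestor up to the root until an owner is found.
    for depth in range(len(parts), -1, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in subtree_owner:
            return subtree_owner[prefix]
    raise KeyError(f"no subtree covers {path}")


# Hypothetical assignment of subtrees to three metadata servers.
owners = {"/": 0, "/home": 1, "/home/alice/src": 2}

for p in ["/home/alice/src/a.c", "/home/alice/src/b.c", "/home/bob/notes.txt", "/etc/hosts"]:
    print(p, "hash ->", mds_by_hash(p, 3), "subtree ->", mds_by_subtree(p, owners))
```

Under subtree partitioning, files in the same directory resolve to the same server, which preserves locality. Hashing ignores where a file sits in the namespace, so neighboring files generally end up on different servers.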


Outline

1. File System Metadata Management
2. CephFS Background
3. Complexity of Dynamic Subtree Partitioning
4. Mantle
5. Evaluation


CephFS Background


Example File System Workload

• Linux kernel compile locality

• Shade of Red: locality

[Figure: inode reads/writes over time; legend ranges from fewer inode reads/writes to many inode reads/writes]


CephFS Hotspot Detection!

Migration!


Does CephFS work?

[Figure annotations: "what we want" (once), "bad" (three times)]


Complexity of Dynamic Subtree Partitioning


MDS Cluster

[Figure labels: recv HB, rebalance, migrate?, partition cluster, partition namespace, migrate, fragment, RADOS, Hierarchical Namespace]

Why not?

Migration Policies
• How to calculate load?
• When to move load?
• Where to move load?
• How much to move?
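One way to read the four migration-policy questions is as four pluggable functions that a fixed rebalancing mechanism calls in order. The sketch below is a hypothetical Python illustration of that split; the class, field, and function names are assumptions and do not reflect the CephFS or Mantle interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BalancerPolicy:
    """The four policy questions, each as a pluggable function (hypothetical names)."""
    load: Callable[[dict], float]                 # how to calculate load?
    when: Callable[[int, list[float]], bool]      # when to move load?
    where: Callable[[int, list[float]], int]      # where to move load?
    howmuch: Callable[[int, list[float]], float]  # how much to move?


def rebalance(me: int, metrics: list[dict], policy: BalancerPolicy) -> dict[int, float]:
    """Mechanism: evaluate the policy for server `me`; return {target_mds: load_to_send}."""
    loads = [policy.load(m) for m in metrics]
    if not policy.when(me, loads):
        return {}
    return {policy.where(me, loads): policy.howmuch(me, loads)}


# A toy policy: send the excess over the mean to the least-loaded server.
toy = BalancerPolicy(
    load=lambda m: m["requests"],
    when=lambda me, loads: loads[me] > sum(loads) / len(loads),
    where=lambda me, loads: loads.index(min(loads)),
    howmuch=lambda me, loads: loads[me] - sum(loads) / len(loads),
)

metrics = [{"requests": 90.0}, {"requests": 20.0}, {"requests": 10.0}]
print(rebalance(0, metrics, toy))  # {2: 50.0}
print(rebalance(1, metrics, toy))  # {}
```

Swapping in a different `BalancerPolicy` changes the balancing behavior without touching `rebalance`, which is the separation of policy from mechanism that the talk argues for.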


CephFS’s Policies

“weighted ∑ o…” (summand garbled in the source)

“weighted ∑ m…” (summand garbled in the source)

“greater than average”

“underload MDS”

“equal load across cluster”
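The quoted policies can be made concrete with a small worked example. This is a sketch under stated assumptions: the counter names and weights are placeholders chosen for illustration, not CephFS's actual values, and only one of the two weighted sums is modeled.

```python
# Placeholder weights for illustration only; not CephFS's actual values.
OP_WEIGHTS = {"inode_reads": 1.0, "inode_writes": 2.0}


def weighted_sum(counters, weights):
    """Load as a weighted sum of per-operation counters."""
    return sum(weights[k] * counters.get(k, 0.0) for k in weights)


def plan(me, loads):
    """Return {target_mds: amount} for MDS `me`, or {} if it should not migrate."""
    average = sum(loads) / len(loads)
    if loads[me] <= average:                # "greater than average"
        return {}
    excess = loads[me] - average            # aim for "equal load across cluster"
    targets = {}
    for mds, load in enumerate(loads):
        if load < average and excess > 0:   # "underload MDS"
            amount = min(average - load, excess)
            targets[mds] = amount
            excess -= amount
    return targets


counters = [
    {"inode_reads": 100, "inode_writes": 50},
    {"inode_reads": 40, "inode_writes": 10},
    {"inode_reads": 20, "inode_writes": 10},
]
loads = [weighted_sum(c, OP_WEIGHTS) for c in counters]
print(loads)           # [200.0, 60.0, 40.0]
print(plan(0, loads))  # {1: 40.0, 2: 60.0}
```

With an average load of 100, MDS 0 is the only server above average, so it ships 40 units to MDS 1 and 60 units to MDS 2, leaving all three at the average.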


Different Balancers for Different Workloads

• Which heuristics should we use? [Weil et al., SuperComputing '04], [Patil et al., FAST '11], [Pai et al., ASPLOS '98]

Good for mixed workloads

Good for create-heavy workloads

Simple implementation


Mantle

http://synapostasy.blogspot.com/2007/10/cephalopod-awareness-day.html


Different Balancers for Different Workloads

• Which heuristics should we use? [Weil et al., SuperComputing '04], [Patil et al., FAST '11], [Pai et al., ASPLOS '98]

MDS Cluster

Mantle API


Implementation: API + Environment

MDS Cluster

rebalance


Balancers

• Greedy Spill Balancer

• Fill & Spill Balancer

• Adaptable Balancer
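The slide only names the balancers. As a rough sketch, assume Greedy Spill hands half of its load to an idle neighbor as soon as it has any, and Fill & Spill keeps everything local until the server passes a capacity threshold and then spills a fraction. Those behaviors, the thresholds, and the function names are assumptions made for illustration; the Adaptable Balancer is not sketched here.

```python
def greedy_spill(my_load, neighbor_load, idle=0.01):
    """Assumed behavior: if the neighbor is idle and this server is not, send half."""
    if my_load > idle and neighbor_load < idle:
        return my_load / 2
    return 0.0


def fill_and_spill(my_load, capacity, spill_fraction=0.25):
    """Assumed behavior: keep load local until capacity is exceeded, then spill a fraction."""
    if my_load > capacity:
        return my_load * spill_fraction
    return 0.0


print(greedy_spill(100.0, 0.0))      # 50.0
print(fill_and_spill(100.0, 150.0))  # 0.0
print(fill_and_spill(200.0, 150.0))  # 50.0
```

The contrast is in when each one acts: the greedy variant distributes immediately, while the fill-and-spill variant needs an estimate of server capacity before it moves anything.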


Evaluation


Evaluation: Creates Workload

• % of total load:

25, 25, 25, 25

25, 0, 0, 75

25, 13, 13, 50


Workload: Creates in Same Directory

[Figure labels: "best speedup", "distribution not worthwhile", "stable", "overloaded MDS", "better than 1 MDS", "worse than 1 MDS", "Strategy"]


Workload: Compiling Code

[Figure labels: "system not saturated", "best speedup", "most stable", "better than 1 MDS", "worse than 1 MDS", "Adaptable Balancer", "too aggressive = bad perf."]


Conclusion: Separate Policy and Mechanism

• Benefits of understanding server capacity
  • less resource utilization
  • better performance/stability

• Distribution can hurt performance/stability

• Being too aggressive thrashes workload


Thanks! Questions?

Acknowledgements:

Co-authors: Noah Watkins, Carlos Maltzahn, Ike Nassi, Scott A. Brandt, Sage A. Weil*, Greg Farnum*, Sam Fineberg^

Collaborators: Ivo Jimenez, Adam Crume

Funding: HP Enterprise, storage division


Extra Slides


Why is Locality Important?


More Recent History: Distributed Metadata

Mechanisms for migrating load

Heuristics for migrating resources


Evaluation: Compile Workload


Background CephFS

• Why layering a file system over RADOS is effective
  • Random access
  • Significant engineering effort
  • Specialized subsystem for handling the namespace
