modeling, estimating, and predicting ceph (linux foundation - vault 2015)

45
Modeling, estimating, and predicting Ceph Lars Marowsky-Brée Distinguished Engineer, Architect Storage & HA [email protected]

Upload: lars-marowsky-bree

Post on 14-Jul-2015

353 views

Category:

Software


0 download

TRANSCRIPT

Page 1: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

Modeling, estimating, and predicting Ceph

Lars Marowsky-BréeDistinguished Engineer, Architect Storage & HA

[email protected]

Page 2: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

What is Ceph?

Page 3: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

3

From 10,000 Meters

[1] http://www.openstack.org/blog/2013/11/openstack-user-survey-statistics-november-2013/

• Open Source Distributed Storage solution

• Most popular choice of distributed storage for

OpenStack[1]

• Lots of goodies‒ Distributed Object Storage

‒ Redundancy

‒ Efficient Scale-Out

‒ Can be built on commodity hardware

‒ Lower operational cost

Page 4: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

4

From 1,000 Meters

Object Storage(Like Amazon S3) Block Device File System

Unified Data Handling for 3 Purposes

●RESTful Interface●S3 and SWIFT APIs

●Block devices●Up to 16 EiB●Thin Provisioning●Snapshots

●POSIX Compliant●Separate Data and Metadata●For use e.g. with Hadoop

Autonomous, Redundant Storage Cluster

Page 5: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

Unreasonable expectations

Page 6: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

6

Unreasonable questions customers want answers to

• How to build a storage system that meets my requirements?

• If I build a system like this, what will its characteristics be?

• If I change XY in my existing system, how will its characteristics change?

Page 7: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

7

How to design a Ceph cluster today

• Understand your workload (?)

• Make a best guess, based on the desirable properties and factors

• Build a 1-10% pilot / proof of concept‒ Preferably using loaner hardware from a vendor to avoid early

commitment. (The outlook of selling a few PB of storage and compute nodes makes vendors very cooperative.)

• Refine and tune this until desired performance is achieved

• Scale up‒ Ceph retains most characteristics at scale or even improves (until it

doesn’t), so re-tune

Page 8: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

Software Defined Storage

Page 9: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

9

Legacy Storage Arrays

• Limits:‒ Tightly controlled

environment

‒ Limited scalability

‒ Few options

‒ Only certain approved drives

‒ Constrained number of disk slots

‒ Few memory variations

‒ Only very few networking choices

‒ Typically fixed controller and CPU

• Benefits:‒ Reasonably easy to

understand

‒ Long-term experience and “gut instincts”

‒ Somewhat deterministic behavior and pricing

Page 10: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

10

Software Defined Storage (SDS)

• Limits:‒ ?

• Benefits:‒ Infinite scalability

‒ Infinite adaptability

‒ Infinite choices

‒ Infinite flexibility

‒ ... right.

Page 11: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

11

Properties of a SDS System

• Throughput

• Latency

• IOPS

• Single client or aggregate

• Availability

• Reliability

• Safety

• Cost

• Adaptability

• Flexibility

• Capacity

• Density

Page 12: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

12

Architecting a SDS system

• These goals often conflict:‒ Availability versus Density

‒ IOPS versus Density

‒ Latency versus Throughput

‒ Everything versus Cost

• Many hardware options

• Software topology offers many configuration choices

• There is no one size fits all

Page 13: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

A multitude of choices

Page 14: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

14

Network

• Choose the fastest network you can afford

• Switches should be low latency with fully meshed backplane

• Separate public and cluster network

• Cluster network should typically be twice the public bandwidth‒ Incoming writes are replicated over the cluster network

‒ Re-balancing and re-mirroring utilize the cluster network

Page 15: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

15

Networking (Public and Internal)

• Ethernet (1, 10, 40 GbE)‒ Can easily be bonded for availability

‒ Use jumbo frames if you can

‒ Various bonding modes to try

• Infiniband‒ High bandwidth, low latency

‒ Typically more expensive

‒ RDMA support

‒ Replacing Ethernet since the year of the Linux desktop

Page 16: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

16

Storage Node

• CPU‒ Number and speed of cores

‒ One or more sockets?

• Memory

• Storage controller‒ Bandwidth, performance, cache size

• SSDs for OSD journal‒ SSD to HDD ratio

• HDDs‒ Count, capacity, performance

Page 17: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

17

Adding More Nodes

• Capacity increases

• Total throughput increases

• IOPS increase

• Redundancy increases

• Latency unchanged

• Eventually: network topology limitations

• Temporary impact during re-balancing

Page 18: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

18

Adding More Disks to a Node

• Capacity increases

• Redundancy increases

• Throughput might increase

• IOPS might increase

• Internal node bandwidth is consumed

• Higher CPU and memory load

• Cache contention

• Latency unchanged

Page 19: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

19

OSD File System

• btrfs‒ Typically better write

throughput performance

‒ Higher CPU utilization

‒ Feature rich

‒ Compression, checksums, copy on write

‒ The choice for the future!

• XFS‒ Good all around choice

‒ Very mature for data partitions

‒ Typically lower CPU utilization

‒ The choice for today!

Page 20: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

20

Impact of Caches

• Cache on the client side‒ Typically, biggest impact on performance

‒ Does not help with write performance

• Server OS cache‒ Low impact: reads have already been cached on the client

‒ Still, helps with readahead

• Caching controller, battery backed:‒ Significant benefit for writes

Page 21: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

21

Impact of SSD Journals

• SSD journals accelerate bursts and random write IO

• For sustained writes that overflow the journal, performance degrades to HDD levels

• SSDs help very little with read performance

• SSDs are very costly‒ ... and consume storage slots -> lower density

• A large battery-backed cache on the storage controller is highly recommended if not using SSD journals

Page 22: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

22

Hard Disk Parameters

• Capacity matters‒ Often, highest density is not

most cost effective

‒ On-disk cache matters less

• Reliability advantage of Enterprise drives typically marginal compared to cost‒ Buy more drives instead

‒ Consider validation matrices for small/medium NAS servers as a guide

• RPM:‒ Increase IOPS & throughput

‒ Increases power consumption

‒ 15k drives quite expensive still

• SMR/Shingled drives‒ Density at the cost of write

speed

Page 23: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

23

Other parameters

• IO Schedulers‒ Deadline, CFQ, noop

• Let’s not even talk about the various mkfs options

• Encryption

Page 24: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

24

Impact of Redundancy Choices

• Replication:‒ n number of exact, full-size

copies

‒ Potentially increased read performance due to striping

‒ More copies lower throughput, increase latency

‒ Increased cluster network utilization for writes

‒ Rebuilds can leverage multiple sources

‒ Significant capacity impact

• Erasure coding:‒ Data split into k parts plus m

redundancy codes

‒ Better space efficiency

‒ Higher CPU overhead

‒ Significant CPU and cluster network impact, especially during rebuild

‒ Cannot directly be used with block devices (see next slide)

Page 25: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

25

Cache Tiering

• Multi-tier storage architecture:‒ Pool acts as a transparent write-back overlay for another

‒ e.g., SSD 3-way replication over HDDs with erasure coding

‒ Can flush either on relative or absolute dirty levels, or age

‒ Additional configuration complexity and requires workload-specific tuning

‒ Also available: read-only mode (no write acceleration)

‒ Some downsides (no snapshots), memory consumption for HitSet

• A good way to combine the advantages of replication and erasure coding

Page 26: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

26

Number of placement groups

● Number of hash buckets per pool● Data is chunked &

distributed across nodes

● Typically approx. 100 per OSD/TB

• Too many:‒ More peering

‒ More resources used

• Too few:‒ Large amounts of data per

group

‒ More hotspots, less striping

‒ Slower recovery from failure

‒ Slower re-balancing

Page 27: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

From components to system

Page 28: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

28

More than the sum of its parts

• If this seems straightforward enough, it is because it isn’t.

• The interactions are not trivial or humanly predictable.

• http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/

Page 29: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

29

Hiding in a corner, weeping:

Page 30: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

When anecdotes are not enough

Page 31: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

31

The data drought of big data

• Isolated benchmarks

• No standardized work loads

• Huge variety of reporting formats (not parsable)

• Published benchmarks usually skewed to highlight superior performance of product X

• Not enough data points to make sense of the “why”

• Not enough data about the environment

• Rarely from production systems

Page 32: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

32

Existing efforts

• ceph-brag‒ Last update: ~1 year ago

• cephperf‒ Talk proposal for Vancover OpenStack Summit, presented at

Ceph Infernalis Developer Summit

‒ Mostly focused on object storage

Page 33: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

33

Measuring Ceph performance(you were in the previous session by Adolfo, right?)

• rados bench‒ Measures backend performance of the RADOS store

• rados load-gen‒ Generate configurable load on the cluster

• ceph tell osd.XX

• fio rbd backend‒ Swiss army knife of IO benchmarking on Linux

‒ Can also compare in-kernel rbd with user-space librados

• rest-bench‒ Measures S3/radosgw performance

Page 34: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

34

Standard benchmark suite

• Gather all relevant and anonymized data in JSON format

• Evaluate all components individually, as well as combined

• Core and optional tests‒ Must be fast enough to run on a (quiescent) production cluster

‒ Also means: non-destructive tests in core only

• Share data to build up a corpus on big data performance

Page 35: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

35

Block benchmarks – how?

• fio‒ JSON output, bandwidth, latency

• Standard profiles‒ Sequential and random IO

‒ 4k, 16k, 64k, 256k block sizes

‒ Read versus write

‒ Mixed profiles

‒ Don’t write zeros!

Page 36: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

36

Block benchmarks – where?

• Individual OSDs

• All OSDs in a node‒ On top of OSD fs, and/or raw disk?

‒ Cache baseline/destructive tests?

• Journal disks‒ Read first, write back, for non-destructive tests

• RBD performance: rbd.ko, user-space

• One client, multiple clients

• Identify similar machines and run a random selection

Page 37: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

37

Network benchmarks

• Latency (various packet sizes): ping

• Bandwidth: iperf3

• Relevant links:‒ OSD – OSD, OSD – Monitor

‒ OSD – Rados Gateway, OSD - iSCSI

‒ OSD – Client, Monitor – Client, Client – RADOS Gateway

Page 38: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

38

Other benchmark data

• CPU loads:‒ Encryption

‒ Compression

• Memory access speeds

• Benchmark during re-balancing an OSD?

• CephFS (FUSE/cephfs.ko)

Page 39: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

39

Environment description

• Hardware setup for all nodes‒ Not always trivial to discover

• kernel, glibc, ceph versions

• Network layout metadata

Page 40: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

40

Different Ceph configurations?

• Different node counts,

• replication sizes,

• erasure coding options,

• PG sizes,

• differently filled pools, ...

• Not necessarily non-destructive, but potentially highly valuable.

Page 41: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

41

Performance monitoring during runs

• Performance Co-Pilot?

• Gather network, CPU, IO, ... metrics

• While not fine grained enough to debug everything, helpful to spot anomalies and trends‒ e.g., how did network utilization of the whole run change over

time? Is CPU maxed out on a MON for long?

• Also, pretty pictures (such as those missing in this presentation)

• LTTng instrumentation?

Page 42: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

42

Summary of goals

• Holistic exploration of Ceph cluster performance

• Answer those pesky customer questions

• Identify regressions and brag about improvements

• Use machine learning to spot correlations that human brains can’t grasp‒ Recurrent neural networks?

• Provide students with a data corpus to answer questions we didn’t even think of yet

Page 43: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

Thank you.

43

Questions and Answers?

Page 44: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

Corporate HeadquartersMaxfeldstrasse 590409 NurembergGermany

+49 911 740 53 0 (Worldwide)www.suse.com

Join us on:www.opensuse.org

44

Page 45: Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)

Unpublished Work of SUSE LLC. All Rights Reserved.This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General DisclaimerThis document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.