
Page 1: SUSE Enterprise Storage. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose.

SUSE® Enterprise Storage
Design and Performance for Ceph

Lars Marowsky-Brée
Distinguished Engineer

[email protected]

Page 2

What is Ceph?

Page 3

From 10,000 meters

[1] http://www.openstack.org/blog/2013/11/openstack-user-survey-statistics-november-2013/

• Open source, distributed storage solution

• Most popular choice of distributed storage for OpenStack [1]

• Lots of goodies
‒ Distributed Object Storage

‒ Redundancy

‒ Efficient Scale-Out

‒ Can be built on commodity hardware

‒ Lower operational cost

Page 4

From 1,000 meters

• Three interfaces rolled into one
‒ Object Access (like Amazon S3)

‒ Block Access

‒ (Distributed File System)

• Sitting on top of a Storage Cluster
‒ Self Healing

‒ Self Managed

‒ No Bottlenecks

Page 5

From 1,000 meters

Unified Data Handling for 3 Purposes:

• Object Storage (like Amazon S3)
‒ RESTful interface
‒ S3 and Swift APIs

• Block Device
‒ Block devices up to 16 EiB
‒ Thin provisioning
‒ Snapshots

• File System
‒ POSIX compliant
‒ Separate data and metadata
‒ For use e.g. with Hadoop

All three sit on top of an autonomous, redundant storage cluster.

Page 6

Component Names

• radosgw ‒ Object Storage gateway

• RBD ‒ Block Device

• CephFS ‒ File System

• librados ‒ direct application access to RADOS

• RADOS ‒ the underlying storage cluster

Page 7

How does Ceph work?

Page 8

For a Moment, Zooming to the Atom Level

• OSD (Object Storage Daemon) ‒ one per physical disk

• Each OSD stores its data in a local file system (btrfs, xfs) on that disk

• OSDs serve storage objects to clients

• OSDs peer with each other to perform replication and recovery

Page 9

Put Several of These in One Node

[Diagram: a single storage node running six OSDs, each with its own file system and physical disk]

Page 10

Mix in a Few Monitor Nodes

• Monitors (M) are the brain cells of the cluster
‒ Cluster membership
‒ Consensus for distributed decision making

• Monitors do not serve stored objects to clients

Page 11

Voilà, a Small RADOS Cluster

[Diagram: a small RADOS cluster of OSD nodes plus three monitor (M) nodes]

Page 12

Several Ingredients

• Basic Idea
‒ Coarse-grained partitioning of storage supports policy-based mapping (don't put all copies of my data in one rack)
‒ Topology map and rules allow clients to "compute" the exact location of any storage object

• Three conceptual components
‒ Pools
‒ Placement groups
‒ CRUSH: deterministic, decentralized placement algorithm

Page 13

Pools

• A pool is a logical container for storage objects

• A pool has a set of parameters
‒ a name

‒ a numerical ID (internal to RADOS)

‒ number of replicas OR erasure encoding settings

‒ number of placement groups

‒ placement rule set

‒ owner

• Pools support certain operations
‒ create/remove/read/write entire objects

‒ snapshot of the entire pool

Page 14

Placement Groups

• Placement groups help balance data across OSDs

• Consider a pool named “swimmingpool”
‒ with a pool ID of 38 and 8192 placement groups (PGs)

• Consider object “rubberduck” in “swimmingpool”
‒ hash(“rubberduck”) % 8192 = 0xb0b

‒ The resulting PG is 38.b0b

• One PG typically spans several OSDs
‒ for balancing

‒ for replication

• One OSD typically serves many PGs
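The object-to-PG mapping above can be sketched in a few lines of Python. This is a toy illustration only: real Ceph hashes object names with the rjenkins hash and uses a "stable mod" so pg_num need not be a power of two; MD5 is used here just to get a deterministic hash.

```python
import hashlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Map an object name to a PG ID of the form <pool>.<pg_hex>.

    Toy sketch: Ceph uses rjenkins, not MD5, but the principle is
    identical -- any client can compute the PG without asking a server.
    """
    # Take 4 bytes of the digest as an unsigned integer hash value.
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    pg = h % pg_num
    return f"{pool_id}.{pg:x}"

# Object "rubberduck" in pool 38 ("swimmingpool") with 8192 PGs:
print(object_to_pg(38, "rubberduck", 8192))
```

The printed PG will differ from the slide's 38.b0b (different hash function), but the shape and the determinism are the point.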

Page 15

CRUSH

• CRUSH uses a map of all OSDs in your cluster

‒ includes physical topology, like row, rack, host

‒ includes rules describing which OSDs to consider for what type of pool/PG

• This map is maintained by the monitor nodes

‒ Monitor nodes use standard cluster algorithms for consensus building, etc
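CRUSH itself is too involved for a slide, but its key property, that every client computes the same placement from the same map with no lookup table, can be demonstrated with a simpler cousin, rendezvous (highest-random-weight) hashing. The OSD names below are made up, and this sketch ignores CRUSH's topology rules and weights.

```python
import hashlib

def hrw_place(pg_id: str, osds: list, replicas: int) -> list:
    """Pick `replicas` OSDs for a PG by highest-random-weight hashing.

    Not CRUSH: no failure domains, no weights. But any client holding
    the same OSD list computes the same placement, deterministically.
    """
    def score(osd: str) -> str:
        # Each (PG, OSD) pair gets a pseudo-random score; take the top N.
        return hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
print(hrw_place("38.b0b", osds, 3))
```

Adding or removing an OSD only disturbs the placements that scored it highly, which is the same "minimal data movement" goal CRUSH pursues with its hierarchy-aware rules.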

Page 16

CRUSH in Action: reading

[Diagram: a client computes PG 38.b0b for swimmingpool/rubberduck and reads from one of its OSDs]

Reads can be serviced by any of the replicas (parallel reads improve throughput).

Page 17

CRUSH in Action: writing

[Diagram: a client writes swimmingpool/rubberduck to the primary OSD of PG 38.b0b]

Writes go to one OSD, which then propagates the changes to the other replicas.

Page 18

Software Defined Storage

Page 19

Legacy Storage arrays

• Limits:
‒ Tightly controlled environment
‒ Limited scalability
‒ Few options
‒ Only certain approved drives
‒ Constrained number of disk slots
‒ Few memory variations
‒ Only very few networking choices
‒ Typically fixed controller and CPU

• Benefits:
‒ Reasonably easy to understand
‒ Long-term experience and “gut instincts”
‒ Somewhat deterministic behavior and pricing

Page 20

Software Defined Storage (SDS)

• Limits:
‒ ?

• Benefits:
‒ Infinite scalability
‒ Infinite adaptability
‒ Infinite choices
‒ Infinite flexibility
‒ ... right.

Page 21

Properties of a SDS system

• Throughput

• Latency

• IOPS

• Availability

• Reliability

• Cost

• Capacity

• Density

Page 22

Architecting a SDS system

• These goals often conflict:
‒ Availability versus Density

‒ IOPS versus Density

‒ Everything versus Cost

• Many hardware options

• Software topology offers many configuration choices

• There is no one size fits all

Page 23

Setup choices

Page 24

Networking (public and private)

• Ethernet (1, 10, 40 GbE)
‒ Reasonably inexpensive (except for 40 GbE)

‒ Can easily be bonded for availability

‒ Use jumbo frames

• Infiniband
‒ High bandwidth

‒ Low latency

‒ Typically more expensive

‒ No support for RDMA yet in Ceph, need to use IPoIB

Page 25

Network

• Choose the fastest network you can afford

• Switches should be low latency with fully meshed backplane

• Separate public and cluster network

• Cluster network should typically be twice the public bandwidth

‒ Incoming writes are replicated over the cluster network

‒ Re-balancing and re-mirroring also utilize the cluster network
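The "twice the public bandwidth" rule of thumb follows from simple arithmetic, sketched below. The figures are illustrative assumptions, not measurements.

```python
def cluster_traffic(public_write_gbps: float, replicas: int) -> float:
    """Replication traffic generated on the cluster network.

    Each write arriving over the public network must be forwarded by
    the primary OSD to the other (replicas - 1) OSDs over the cluster
    network. Re-balancing traffic comes on top of this.
    """
    return public_write_gbps * (replicas - 1)

# Hypothetical example: 10 Gbit/s of incoming writes, 3-way replication.
print(cluster_traffic(10.0, 3))  # 20.0 -> cluster network needs ~2x public
```

With the common 3-way replication, every byte written by a client becomes two bytes on the cluster network, hence the 2x sizing guideline above.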

Page 26

Different access modes

• radosgw:
‒ An additional gateway in front of your RADOS cluster
‒ Little impact on throughput, but it does affect latency

• User-space RADOS access:
‒ More feature-rich than the in-kernel rbd.ko module
‒ Typically provides higher performance

Page 27

Storage node

• CPU
‒ Number and speed of cores

• Memory

• Storage controller
‒ Bandwidth, performance, cache size

• SSDs for OSD journal
‒ SSD-to-HDD ratio

• HDDs
‒ Count, capacity, performance

Page 28

Adding more nodes

• Capacity increases

• Total throughput increases

• IOPS increase

• Redundancy increases

• Latency unchanged

• Eventually: network topology limitations

• Temporary impact during re-balancing

Page 29

Adding more disks to a node

• Capacity increases

• Redundancy increases

• Throughput might increase

• IOPS might increase

• Internal node bandwidth is consumed

• Higher CPU and memory load

• Cache contention

• Latency unchanged

Page 30

OSD file system

• btrfs
‒ Typically better write throughput performance
‒ Higher CPU utilization
‒ Feature rich: compression, checksums, copy-on-write
‒ The choice for the future!

• XFS
‒ Good all-around choice
‒ Very mature for data partitions
‒ Typically lower CPU utilization
‒ The choice for today!

Page 31

Impact of caches

• Cache on the client side
‒ Typically the biggest impact on performance
‒ Does not help with write performance

• Server OS cache
‒ Low impact: reads have already been cached on the client
‒ Still helps with readahead

• Caching controller, battery-backed
‒ Significant impact for writes

Page 32

Impact of SSD journals

• SSD journals accelerate bursts and random write IO

• For sustained writes that overflow the journal, performance degrades to HDD levels

• SSDs help very little with read performance

• SSDs are very costly
‒ ... and consume storage slots, lowering density

• A large battery-backed cache on the storage controller is highly recommended if not using SSD journals

Page 33

Hard disk parameters

• Capacity matters
‒ Often, the highest density is not the most cost effective

• On-disk cache matters less

• Reliability advantage of enterprise drives is typically marginal compared to their cost
‒ Buy more drives instead

• RPM:
‒ Higher RPM increases IOPS & throughput
‒ ... but also increases power consumption
‒ 15k drives are still quite expensive

Page 34

Impact of redundancy choices

• Replication:
‒ n exact, full-size copies
‒ Potentially increased read performance due to striping
‒ Increased cluster network utilization for writes
‒ Rebuilds can leverage multiple sources
‒ Significant capacity impact

• Erasure coding:
‒ Data split into k parts plus m redundancy codes
‒ Better space efficiency
‒ Higher CPU overhead
‒ Significant CPU and cluster network impact, especially during rebuild
‒ Cannot directly be used with block devices (see next slide)
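The space-efficiency trade-off above is easy to quantify. The sketch below compares raw bytes consumed per usable byte; the k=4, m=2 profile is just an example, not a recommendation.

```python
def raw_per_usable(replicas: int = None, k: int = None, m: int = None) -> float:
    """Raw storage consumed per byte of usable data.

    Replication stores `replicas` full copies; a k+m erasure code
    stores k data chunks plus m coding chunks for every k data chunks.
    """
    if replicas is not None:
        return float(replicas)
    return (k + m) / k

# 3-way replication vs. an example k=4, m=2 erasure code:
print(raw_per_usable(replicas=3))  # 3.0 -> 200% overhead, tolerates 2 failures
print(raw_per_usable(k=4, m=2))    # 1.5 -> 50% overhead, tolerates 2 failures
```

Both examples survive two simultaneous failures, yet the erasure-coded pool needs half the raw capacity, at the price of the CPU and rebuild costs listed above.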

Page 35

Cache tiering

• Multi-tier storage architecture:
‒ One pool acts as a transparent write-back overlay for another
‒ e.g., SSD 3-way replication in front of HDDs with erasure coding
‒ Can flush based on relative or absolute dirty levels, or age
‒ Additional configuration complexity; requires workload-specific tuning
‒ Also available: read-only mode (no write acceleration)
‒ Some downsides (no snapshots)

• A good way to combine the advantages of replication and erasure coding
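The flush behavior described above can be sketched as a simple threshold policy. The thresholds below are hypothetical defaults, loosely modeled on Ceph's cache_target_dirty_ratio and cache_target_full_ratio knobs; consult the actual configuration reference before tuning.

```python
def should_flush(dirty_bytes: int, tier_bytes: int,
                 dirty_ratio: float = 0.4, full_ratio: float = 0.8) -> str:
    """Toy flush policy for a write-back cache tier.

    Below dirty_ratio the tier idles; between the two thresholds it
    flushes dirty objects to the backing pool in the background; at
    full_ratio it must flush aggressively to accept new writes.
    """
    used = dirty_bytes / tier_bytes
    if used >= full_ratio:
        return "flush aggressively"
    if used >= dirty_ratio:
        return "flush in background"
    return "idle"

print(should_flush(30, 100))  # idle
print(should_flush(50, 100))  # flush in background
print(should_flush(90, 100))  # flush aggressively
```

The workload-specific tuning mentioned above is exactly the choice of these thresholds: too low and the SSD tier flushes constantly, too high and bursts stall waiting for aggressive flushes.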

Page 36

Conclusion

Page 37

Thank you.


Questions and answers?

Page 38

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany

+49 911 740 53 0 (Worldwide)
www.suse.com

Join us on:www.opensuse.org


Page 39

Unpublished Work of SUSE LLC. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer. This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.