Ceph Overview for the Distributed Computing Denver Meetup

TRANSCRIPT

CEPH: A MASSIVELY SCALABLE DISTRIBUTED STORAGE SYSTEM

Ken Dreyer, Software Engineer, Apr 23 2015

Hitler finds out about software-defined storage

If you are a proprietary storage vendor...

THE FUTURE OF STORAGE

Traditional Storage: complex, proprietary silos. Users and admins work through a custom GUI on top of proprietary software and proprietary hardware.

Open Software-Defined Storage: standardized, unified, open platforms. Users and admins work through a control plane (API, GUI) on top of open source software (Ceph) running on commodity hardware: standard computers and disks.

THE JOURNEY

Open Software-Defined Storage is a fundamental reimagining of how storage infrastructure works.

It provides substantial economic and operational advantages, and it has quickly become ideally suited for a growing number of use cases.

TODAY → EMERGING → FUTURE: cloud infrastructure, cloud-native apps, analytics, hyper-convergence, containers, ???

HISTORICAL TIMELINE

2004: Project starts at UCSC
2006: Open sourced
2010: Mainline Linux kernel support
2011: OpenStack integration
May 2012: Launch of Inktank
2012: CloudStack integration
Sept 2012: Production-ready Ceph
2013: Xen integration
Oct 2013: Inktank Ceph Enterprise launch
Feb 2014: RHEL-OSP certification
Apr 2014: Inktank acquired by Red Hat

10 years in the making

ARCHITECTURE

ARCHITECTURAL COMPONENTS

RGW: A web services gateway for object storage, compatible with S3 and Swift.

LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP).

RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors.

RBD: A reliable, fully-distributed block device with cloud platform integration.

CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management.

In the diagram, apps sit on LIBRADOS and RGW, hosts/VMs on RBD, and clients on CEPHFS.

OBJECT STORAGE DAEMONS

Diagram: each OSD (object storage daemon) runs on top of a local filesystem (btrfs, xfs, ext4, zfs?) on its own disk; a small number of monitors (M) run alongside the OSDs.

RADOS CLUSTER

Diagram: an application talks directly to the RADOS cluster, a collection of OSD nodes plus a handful of monitors (M).

RADOS COMPONENTS

OSDs: 10s to 10,000s in a cluster; one per disk (or one per SSD, RAID group, …); serve stored objects to clients; intelligently peer for replication and recovery.

Monitors: maintain cluster membership and state; provide consensus for distributed decision-making; a small, odd number of them; they do not serve stored objects to clients.

WHERE DO OBJECTS LIVE?

Diagram: an application holds an object and needs to know which OSDs in the cluster should store it.

A METADATA SERVER?

Diagram: one option is a lookup against a central metadata server (step 1) before the application contacts the storage node (step 2).

CALCULATED PLACEMENT

Diagram: calculated placement with a fixed function F that maps object names into static ranges (A-G, H-N, O-T, U-Z), so the application can compute the location itself.

EVEN BETTER: CRUSH!

Diagram: objects are hashed into placement groups (PGs), and the PGs are mapped onto the cluster.

CRUSH IS A QUICK CALCULATION

Diagram: for any given object, the client computes its placement group and that PG's OSDs directly; no lookup table is consulted.

CRUSH: DYNAMIC DATA PLACEMENT

CRUSH: a pseudo-random placement algorithm.
Fast calculation, no lookup; repeatable, deterministic.
Statistically uniform distribution; stable mapping with limited data migration on change.
Rule-based configuration: infrastructure topology aware, adjustable replication, weighting.

Diagram: an object is mapped to a placement group by hash(object name) % num_pg, and the placement group is mapped to a set of OSDs by CRUSH(pg, cluster state, rule set).
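To make the two-step mapping concrete, here is a conceptual sketch in Python. It is not Ceph's real implementation (Ceph uses its own rjenkins hashing, CRUSH bucket types, and the live cluster map); the PG count, replica count, and OSD names below are made-up values.

# Conceptual sketch of the two-step mapping above. NOT Ceph's real code:
# Ceph uses rjenkins hashing and CRUSH buckets over the live cluster map.
import hashlib

NUM_PGS = 128                                   # assumed pg_num for the pool
REPLICA_COUNT = 3                               # assumed pool size
OSDS = ["osd.%d" % i for i in range(12)]        # toy cluster map

def object_to_pg(object_name):
    """Step 1: hash(object name) % num_pg."""
    digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return digest % NUM_PGS

def pg_to_osds(pg):
    """Step 2: a stand-in for CRUSH(pg, cluster state, rule set):
    deterministically pick REPLICA_COUNT distinct OSDs for this PG."""
    chosen = []
    seed = pg
    while len(chosen) < REPLICA_COUNT:
        digest = int(hashlib.md5(str(seed).encode()).hexdigest(), 16)
        osd = OSDS[digest % len(OSDS)]
        if osd not in chosen:
            chosen.append(osd)
        seed += 1
    return chosen

name = "my-object"
pg = object_to_pg(name)
print(name, "-> pg", pg, "->", pg_to_osds(pg))

The property the slides emphasize holds in the sketch too: any client with the same cluster map computes the same placement, so there is no central lookup service in the data path.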


ARCHITECTURAL COMPONENTS


ACCESSING A RADOS CLUSTER

Diagram: an application links LIBRADOS and talks to the RADOS cluster directly over a socket to read and write objects.

LIBRADOS: RADOS ACCESS FOR APPS

LIBRADOS: direct access to RADOS for applications; bindings for C, C++, Python, PHP, Java, and Erlang; direct access to the storage nodes; no HTTP overhead.
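A minimal sketch using the python-rados binding (one of the language bindings listed above): connect, open an I/O context on a pool, and store an object. The pool name and ceph.conf path are assumptions.

# Minimal sketch with the python-rados binding (librados). The pool name
# "data" and the ceph.conf path are assumptions; substitute your own.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')                    # I/O context on a pool
    try:
        ioctx.write_full('hello-object', b'hello ceph')   # store an object
        print(ioctx.read('hello-object'))                 # read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()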

ARCHITECTURAL COMPONENTS


THE RADOS GATEWAY

Diagram: applications speak REST to one or more RADOSGW instances, and each gateway uses LIBRADOS over a socket to store objects in the RADOS cluster.

RADOSGW MAKES RADOS WEBBY

RADOSGW: a REST-based object storage proxy; uses RADOS to store objects; API supports buckets and accounts; usage accounting for billing; compatible with S3 and Swift applications.
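Because the gateway is S3-compatible, an ordinary S3 client works against it. A sketch with boto3; the endpoint URL and credentials are placeholders (RGW users and keys are created separately, e.g. with radosgw-admin).

# Sketch: talking to RADOSGW through its S3-compatible API with boto3.
# Endpoint URL and credentials are placeholders, not real values.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',   # assumed RGW endpoint
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'hello from rgw')
obj = s3.get_object(Bucket='demo-bucket', Key='hello.txt')
print(obj['Body'].read())

Swift clients can use the same gateway through its Swift-compatible API.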

ARCHITECTURAL COMPONENTS


STORING VIRTUAL DISKS

Diagram: a VM's virtual disk is served by its hypervisor through LIBRBD and stored in the RADOS cluster.

SEPARATE COMPUTE FROM STORAGE

Diagram: because the disk image lives in RADOS, the VM can run on any hypervisor with LIBRBD; compute is decoupled from storage.

KRBD - KERNEL MODULE

Diagram: a Linux host maps an RBD image directly with the in-kernel KRBD module, with no hypervisor involved.

RBD STORES VIRTUAL DISKS

RADOS BLOCK DEVICE: storage of disk images in RADOS; decouples VMs from hosts; images are striped across the cluster (pool); snapshots; copy-on-write clones.

Support in: mainline Linux kernel (2.6.39+) and RHEL 7; Qemu/KVM (native Xen coming soon); OpenStack, CloudStack, Nebula, Proxmox.
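A minimal sketch with the python-rbd and python-rados bindings: create a block device image in a pool and write to it. The pool name, image name, and size are illustrative assumptions.

# Sketch using the python-rados and python-rbd bindings.
# Pool name, image name, and size are illustrative assumptions.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                    # assumed pool name

rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)   # create a 10 GiB image
image = rbd.Image(ioctx, 'vm-disk-1')
image.write(b'data written at offset 0', 0)          # striped across the pool
image.close()

ioctx.close()
cluster.shutdown()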

RBD SNAPSHOTS

Export snapshots to geographically dispersed data centers to institute disaster recovery.
Export incremental snapshots to minimize network bandwidth by sending only the changes.
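A sketch of snapshots and copy-on-write clones with the python-rbd binding. Names are illustrative, and the parent is assumed to be a format-2 image with layering enabled; incremental export for disaster recovery is done with the rbd command-line tool rather than this binding.

# Sketch: point-in-time snapshot plus a copy-on-write clone (python-rbd).
# Assumes 'vm-disk-1' is a format-2 image with layering enabled.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

image = rbd.Image(ioctx, 'vm-disk-1')
image.create_snap('before-upgrade')      # point-in-time snapshot
image.protect_snap('before-upgrade')     # protect it so it can be cloned
image.close()

# Copy-on-write clone of the protected snapshot into a new image
rbd.RBD().clone(ioctx, 'vm-disk-1', 'before-upgrade', ioctx, 'vm-disk-1-clone')

ioctx.close()
cluster.shutdown()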

ARCHITECTURAL COMPONENTS


SEPARATE METADATA SERVER

Diagram: a Linux host mounts CephFS with the kernel module; file data flows directly between the host and the RADOS cluster, while metadata is handled by the metadata server.

SCALABLE METADATA SERVERS

METADATA SERVER: manages metadata for a POSIX-compliant shared filesystem, including the directory hierarchy and file metadata (owner, timestamps, mode, etc.); stores its metadata in RADOS; does not serve file data to clients; only required for the shared filesystem.
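A small sketch, assuming the python-cephfs (libcephfs) binding and an existing CephFS filesystem with a running metadata server; paths and the ceph.conf location are illustrative.

# Sketch assuming the python-cephfs (libcephfs) binding and a filesystem
# with a running MDS. Paths and conf location are illustrative.
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()

fs.mkdir('/reports', 0o755)                     # directory ops go through the MDS
fd = fs.open('/reports/today.txt', 'w', 0o644)  # create a file
fs.write(fd, b'file data is stored as objects in RADOS', 0)
fs.close(fd)

fs.unmount()
fs.shutdown()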

CALAMARI

CALAMARI ARCHITECTURE

Diagram: a Calamari master on the admin node communicates with minion agents on every node in the Ceph storage cluster, including the monitors.

USE CASES

WEB APPLICATION STORAGE

Diagram: web application servers talk S3/Swift to Ceph Object Gateway (RGW) instances, which store the data in the Ceph Storage Cluster (RADOS).

MULTI-SITE OBJECT STORAGE

Diagram: two sites, US-EAST and EU-WEST, each with its own web application servers, Ceph Object Gateway (RGW), and Ceph Storage Cluster.

ARCHIVE / COLD STORAGE

Diagram: the application writes to a replicated cache pool sitting in front of an erasure-coded backing pool, both in the Ceph Storage Cluster.

ERASURE CODING

Diagram: in a replicated pool the cluster stores full copies of the object; in an erasure-coded pool the object is split into data chunks (1, 2, 3, 4) plus parity chunks (X, Y).

Replicated pool: full copies of stored objects; very high durability; quicker recovery.

Erasure-coded pool: one copy plus parity; cost-effective durability; more expensive recovery.
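A toy illustration of the erasure-coding idea in Python: split an object into k data chunks and add parity so a lost chunk can be rebuilt. Ceph's erasure-coded pools use pluggable codes (Reed-Solomon via the jerasure plugin by default); a single XOR parity chunk like this only tolerates losing one chunk.

# Toy single-parity erasure code: k data chunks plus one XOR parity chunk.
from functools import reduce
from operator import xor

def encode(data, k=4):
    """Split data into k equal data chunks plus one XOR parity chunk."""
    padded = data.ljust(-(-len(data) // k) * k, b'\0')  # pad to a multiple of k
    size = len(padded) // k
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(xor, column) for column in zip(*chunks))
    return chunks, parity

def recover(chunks, parity, lost):
    """Rebuild one missing data chunk from the survivors plus the parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return bytes(reduce(xor, column) for column in zip(parity, *survivors))

chunks, parity = encode(b'an object stored in an erasure coded pool')
assert recover(chunks, parity, lost=2) == chunks[2]

The trade-off on the slide falls out of this: parity adds far less overhead than full replicas, but rebuilding a lost chunk requires reading the surviving chunks, which makes recovery more expensive.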

ERASURE CODING: HOW DOES IT WORK?

Diagram: the data chunks (1-4) and parity chunks (X, Y) of an object are each written to a different OSD in the erasure-coded pool.

CACHE TIERING

Diagram: the Ceph client reads and writes through a cache pool in writeback mode, which sits in front of a replicated backing pool in the Ceph Storage Cluster.

WEBSCALE APPLICATIONS

Diagram: web application servers talk to the Ceph Storage Cluster (RADOS) directly over the native protocol (LIBRADOS).

ARCHIVE / COLD STORAGE

Diagram: the application uses a Ceph Block Device (RBD) backed by a replicated cache pool in front of an erasure-coded backing pool in the Ceph Storage Cluster.

DATABASES

Diagram: MySQL / MariaDB servers use the Linux kernel RBD client to store their data in the Ceph Storage Cluster (RADOS) over the native protocol.

Future Ceph Roadmap

CEPH ROADMAP

Hammer (current release), Infernalis, J-Release.

Planned and candidate items across these releases: NewStore, object expiration, object versioning, stable CephFS?, an alternative web server for RGW, performance improvements in each release, and items still to be determined (???).

NEXT STEPS

WHAT NOW?

Getting Started with Ceph

• Read about the latest version of Ceph: http://ceph.com/docs
• Deploy a test cluster using ceph-deploy: http://ceph.com/qsg
• Deploy a test cluster on the AWS free tier using Juju: http://ceph.com/juju
• Ansible playbooks for Ceph: https://www.github.com/alfredodeza/ceph-ansible

Getting Involved with Ceph

• Most discussion happens on the ceph-devel and ceph-users mailing lists. Join or view archives at http://ceph.com/list
• IRC is a great place to get help (or to help others!): #ceph and #ceph-devel. Details and logs at http://ceph.com/irc
• Download the code: http://www.github.com/ceph
• The tracker manages bugs and feature requests. Register and start looking around at http://tracker.ceph.com
• Doc updates and suggestions are always welcome. Learn how to contribute docs at http://ceph.com/docwriting

Thank You

extras

● metrics.ceph.com
● http://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-object-storage-at