OpenNebulaConf 2014 - Using Ceph to provide scalable storage for OpenNebula - John Spray



Using Ceph with OpenNebula

John Spray ([email protected])


Agenda

● What is it?

● Architecture

● Integration with OpenNebula

● What's new?


What is Ceph?


What is Ceph?

● Highly available resilient data store

● Free Software (LGPL)

● 10 years since inception

● Flexible object, block and filesystem interfaces

● Especially popular in private clouds as a VM image service and as an S3-compatible object storage service


Interfaces to storage

● OBJECT STORAGE (RGW): native API, S3 & Swift compatible, multi-tenant, Keystone integration, geo-replication
● BLOCK STORAGE (RBD): snapshots, clones, Linux kernel client, iSCSI, OpenStack integration
● FILE SYSTEM (CephFS): POSIX, Linux kernel client, CIFS/NFS, HDFS, distributed metadata


Ceph Architecture


Architectural Components

● RGW: A web services gateway for object storage, compatible with S3 and Swift

● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

● RBD: A reliable, fully-distributed block device with cloud platform integration

● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management


Object Storage Daemons

[Diagram: each OSD daemon serves one disk through a local filesystem (btrfs, xfs, or ext4); a small number of monitor (M) daemons run alongside the OSDs]


RADOS Components

OSDs:
– 10s to 10000s in a cluster
– One per disk (or one per SSD, RAID group…)
– Serve stored objects to clients
– Intelligently peer for replication & recovery

Monitors:
– Maintain cluster membership and state
– Provide consensus for distributed decision-making
– Small, odd number
– These do not serve stored objects to clients


RADOS Cluster

[Diagram: an application communicating with a RADOS cluster made up of monitors (M) and OSD nodes]


Where do objects live?

[Diagram: an application holding an object, asking which node in the cluster should store it]


A Metadata Server?

[Diagram: (1) the application asks a central metadata server where the object lives, then (2) contacts the node that stores it]


Calculated placement

[Diagram: the application applies a function F to the object name to pick a server, with each server owning a static hash range (A-G, H-N, O-T, U-Z)]


Even better: CRUSH

[Diagram: CRUSH maps an object onto a set of OSDs across the RADOS cluster]


CRUSH is a quick calculation

[Diagram: the same mapping, computed directly by the client without consulting a lookup table]


CRUSH: Dynamic data placement

● CRUSH: Pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
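As a concrete illustration of "fast calculation, no lookup", the cluster's CRUSH map can be pulled out and exercised offline with crushtool; a minimal sketch, assuming access to a running cluster (rule id 0, the input range and the file names are illustrative):

ceph osd getcrushmap -o crushmap.bin        # fetch the cluster's compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt   # decompile it into readable rules and topology
crushtool --test -i crushmap.bin --rule 0 --num-rep 3 --min-x 0 --max-x 9 --show-mappings
                                            # simulate placements; rerunning yields identical mappings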


Architectural Components



RBD: Virtual disks in Ceph

RADOS BLOCK DEVICE:
● Storage of disk images in RADOS
● Decouples VMs from host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in:
– Mainline Linux kernel (2.6.39+)
– Qemu/KVM
– OpenStack, CloudStack, OpenNebula, Proxmox
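A minimal rbd CLI sketch of the snapshot/clone workflow above; the pool name "one" and the image names are illustrative, and format 2 images are required for cloning:

rbd create --size 10240 --image-format 2 one/base-image   # 10 GB image, format 2 for clone support
rbd snap create one/base-image@golden                     # point-in-time snapshot
rbd snap protect one/base-image@golden                    # protect it so it can be cloned
rbd clone one/base-image@golden one/vm-42-disk            # thin, copy-on-write clone
rbd ls -l one                                             # list images, snapshots and clones in the pool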


Storing virtual disks

[Diagram: a VM on a hypervisor accesses its virtual disk through librbd, which talks to the RADOS cluster]


Using Ceph with OpenNebula


Storage in OpenNebula deployments

OpenNebula Cloud Architecture Survey 2014 (http://c12g.com/resources/survey/)


RBD and libvirt/qemu

● librbd (user space) client integration with libvirt/qemu

● Support for live migration, thin clones

● Get recent versions!

● Directly supported in OpenNebula since 4.0 with the Ceph Datastore (wraps `rbd` CLI)

More info online:

http://ceph.com/docs/master/rbd/libvirt/
http://docs.opennebula.org/4.10/administration/storage/ceph_ds.html
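For reference, a sketch of what an OpenNebula Ceph Datastore definition can look like; attribute names follow the 4.10 guide linked above, while the pool, monitor hosts, user and bridge list are illustrative, and CEPH_SECRET must be the UUID of the libvirt secret that holds the Ceph key:

cat > ceph.ds <<'EOF'
NAME        = ceph_ds
DS_MAD      = ceph
TM_MAD      = ceph
DISK_TYPE   = RBD
POOL_NAME   = one
CEPH_HOST   = "ceph-mon1:6789 ceph-mon2:6789 ceph-mon3:6789"
CEPH_USER   = libvirt
CEPH_SECRET = "<uuid-of-libvirt-secret>"
BRIDGE_LIST = "ceph-frontend"
EOF
onedatastore create ceph.ds      # register the datastore with the frontend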


Other hypervisors

● OpenNebula is flexible, so can we also use Ceph with non-libvirt/qemu hypervisors?

● Kernel RBD: can present RBD images in /dev/ on hypervisor host for software unaware of librbd

● Docker: can exploit RBD volumes with a local filesystem for use as data volumes – maybe CephFS in future...?

● For unsupported hypervisors, can adapt to Ceph using e.g. iSCSI for RBD, or NFS for CephFS (but test re-exports carefully!)
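A sketch of the kernel RBD path mentioned above, for software that is unaware of librbd; the pool "one", image name and mount point are illustrative:

rbd map one/vm-42-disk           # kernel client exposes the image, e.g. as /dev/rbd0
mkfs.xfs /dev/rbd0               # put a local filesystem on it (first use only)
mount /dev/rbd0 /mnt/vm-42       # now usable by Docker data volumes, legacy hypervisors, etc.
umount /mnt/vm-42
rbd unmap /dev/rbd0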


Choosing hardware

Testing/benchmarking/expert advice is needed, but there are general guidelines:

● Prefer many cheap nodes to few expensive nodes (10 is better than 3)

● Include small but fast SSDs for OSD journals

● Don't simply buy the biggest drives: consider the IOPS/capacity ratio

● Provision network and IO capacity sufficient for your workload plus recovery bandwidth from node failure.


What's new?


Ceph releases

● Ceph 0.80 firefly (May 2014)
– Cache tiering & erasure coding
– Key/value OSD backends
– OSD primary affinity
● Ceph 0.87 giant (October 2014)
– RBD cache enabled by default
– Performance improvements
– Locally recoverable erasure codes
● Ceph x.xx hammer (2015)


Additional components

● Ceph FS – scale-out POSIX filesystem service, currently being stabilized

● Calamari – monitoring dashboard for Ceph

● ceph-deploy – easy SSH-based deployment tool

● Puppet, Chef modules
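A minimal ceph-deploy sketch of standing up a small cluster; three hosts reachable over SSH and a spare /dev/sdb on each are assumptions, and the node names are illustrative:

ceph-deploy new node1 node2 node3                      # write an initial ceph.conf with these monitors
ceph-deploy install node1 node2 node3                  # install Ceph packages over SSH
ceph-deploy mon create-initial                         # bring up the monitor quorum
ceph-deploy osd create node1:sdb node2:sdb node3:sdb   # prepare and activate one OSD per host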


Get involved

Evaluate the latest releases:

http://ceph.com/resources/downloads/

Mailing list, IRC:

http://ceph.com/resources/mailing-list-irc/

Bugs:

http://tracker.ceph.com/projects/ceph/issues

Online developer summits:

https://wiki.ceph.com/Planning/CDS


Questions?


Spare slides


Ceph FS


CephFS architecture

● Dynamically balanced scale-out metadata

● Inherit flexibility/scalability of RADOS for data

● POSIX compatibility

● Beyond POSIX: Subtree snapshots, recursive statistics

Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, 2006.
http://ceph.com/papers/weil-ceph-osdi06.pdf


Components

● Client: kernel, fuse, libcephfs

● Server: MDS daemon

● Storage: RADOS cluster (mons & OSDs)


Components

[Diagram: a Linux host mounts CephFS via the ceph.ko kernel client; metadata traffic goes to the MDS and data traffic to the OSDs among the Ceph server daemons (monitors, OSDs, MDS)]


From application to disk

[Diagram: Application → client (libcephfs, ceph-fuse, or kernel client) → client network protocol → ceph-mds and RADOS → disk]


Scaling out FS metadata

● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation


Dynamic subtree placement


Dynamic subtree placement

● Locality: get the dentries in a dir from one MDS

● Support read heavy workloads by replicating non-authoritative copies (cached with capabilities just like clients do)

● In practice, work at the directory-fragment level in order to handle large directories


Data placement

● Stripe file contents across RADOS objects
– get full RADOS cluster bandwidth from clients
– fairly tolerant of object losses: reads of lost objects return zeros
● Control striping with layout vxattrs
– layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files 'stray'; RADOS delete ops are sent by the MDS
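A sketch of the layout vxattrs mentioned above, assuming CephFS is mounted at /mnt/ceph; a file's layout can only be set while it is still empty, and "fs_data2" stands in for a pool that has already been added to the filesystem as a data pool:

getfattr -n ceph.file.layout /mnt/ceph/existing-file                # show current striping and pool
touch /mnt/ceph/new-file
setfattr -n ceph.file.layout.stripe_count -v 4 /mnt/ceph/new-file   # stripe writes across 4 objects at a time
setfattr -n ceph.file.layout.pool -v fs_data2 /mnt/ceph/new-file    # place this file's data in another pool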


Clients

● Two implementations: ceph-fuse/libcephfs and the kernel client (kclient)

● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats etc.)

● Client performance matters for single-client workloads

● A slow client can hold up others if it's hogging metadata locks: include clients in troubleshooting

● Future: more per-client performance stats, and perhaps per-client metadata QoS; clients probably group into jobs or workloads

● Future: may want to tag client I/O with a job ID (e.g. HPC workload, Samba client ID, container/VM ID)


Journaling and caching in MDS

● Metadata ops initially journaled to striped journal "file" in the metadata pool.

– I/O latency on metadata ops is sum of network latency and journal commit latency.

– Metadata remains pinned in in-memory cache until expired from journal.


Journaling and caching in MDS

● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.

● Control cache size with mds_cache_size

● Cache eviction relies on client cooperation

● MDS journal replay not only recovers data but also warms up cache. Use standby replay to keep that cache warm.
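A sketch of adjusting the cache size at runtime; the option counts inodes rather than bytes, "mds.a" and the value are illustrative, and the ceph daemon form has to run on the host with that MDS's admin socket:

ceph daemon mds.a config show | grep mds_cache_size   # current limit
ceph daemon mds.a config set mds_cache_size 500000    # raise it at runtime
# or persistently, in ceph.conf:
#   [mds]
#   mds cache size = 500000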


Lookup by inode

● Sometimes we need an inode → path mapping:
– Hard links
– NFS handles

● Costly to store this: mitigate by piggybacking paths (backtraces) onto data objects
– Con: storing metadata to the data pool
– Con: extra IOs to set backtraces
– Pro: disaster recovery from the data pool

● Future: improve backtrace writing latency


CephFS in practice

ceph-deploy mds create myserver

ceph osd pool create fs_data 64        # pg_num of 64 is illustrative; size it for your cluster

ceph osd pool create fs_metadata 64

ceph fs new myfs fs_metadata fs_data

mount -t ceph x.x.x.x:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client, with cephx enabled


Managing CephFS clients

● New in giant: see hostnames of connected clients

● Client eviction is sometimes important:
– Skip the wait during reconnect phase on MDS restart
– Allow others to access files locked by crashed client

● Use OpTracker to inspect ongoing operations
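A sketch of the corresponding admin-socket commands; the MDS name "a" and the client ID are illustrative:

ceph daemon mds.a session ls            # connected clients, their state and (in giant) hostnames
ceph daemon mds.a dump_ops_in_flight    # OpTracker view of ongoing operations
ceph daemon mds.a session evict 4305    # forcibly evict a misbehaving or dead client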


CephFS tips

● Choose MDS servers with lots of RAM

● Investigate clients when diagnosing stuck/slow access

● Use recent Ceph and recent kernel

● Use a conservative configuration:
– Single active MDS, plus one standby
– Dedicated MDS server
– Kernel client
– No snapshots, no inline data


Towards a production-ready CephFS

● Focus on resilience:

1. Don't corrupt things

2. Stay up

3. Handle the corner cases

4. When something is wrong, tell me

5. Provide the tools to diagnose and fix problems

● Achieve this first within a conservative single-MDS configuration


Giant->Hammer timeframe

● Initial online fsck (a.k.a. forward scrub)

● Online diagnostics (`session ls`, MDS health alerts)

● Journal resilience & tools (cephfs-journal-tool)

● flock in the FUSE client

● Initial soft quota support

● General resilience: full OSDs, full metadata cache
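For the initial soft quota support, the interface is vxattr-based and enforcement happens in the client; a minimal sketch, assuming a CephFS mount at /mnt/ceph (directory and limits are illustrative):

setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/ceph/projects   # ~10 GiB soft limit on the subtree
setfattr -n ceph.quota.max_files -v 100000 /mnt/ceph/projects        # cap the number of files
getfattr -n ceph.quota.max_bytes /mnt/ceph/projects                  # read the limit back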


FSCK and repair

● Recover from damage:
– Loss of data objects (which files are damaged?)
– Loss of metadata objects (what subtree is damaged?)

● Continuous verification:
– Are recursive stats consistent?
– Does metadata on disk match cache?
– Does file size metadata match data on disk?

● Repair:
– Automatic where possible
– Manual tools to enable support


Client management

● Current eviction is not 100% safe against rogue clients
– Update to client protocol to wait for OSD blacklist

● Client metadata
– Initially domain name, mount point
– Extension to other identifiers?


Online diagnostics

● Bugs exposed relate to failures of one client to release resources for another client: “my filesystem is frozen”. Introduce new health messages:
– “client xyz is failing to respond to cache pressure”
– “client xyz is ignoring capability release messages”
– Add client metadata to allow us to give domain names instead of IP addrs in messages

● Opaque behavior in the face of dead clients. Introduce `session ls`:
– Which clients does the MDS think are stale?
– Identify clients to evict with `session evict`


Journal resilience

● Bad journal prevents MDS recovery: “my MDS crashes on startup”
– Data loss
– Software bugs

● Updated on-disk format to make recovery from damage easier

● New tool: cephfs-journal-tool
– Inspect the journal, search/filter
– Chop out unwanted entries/regions
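A sketch of how cephfs-journal-tool is typically driven against a damaged journal; run it while the MDS rank is down, export a backup before changing anything, and treat the file name as illustrative:

cephfs-journal-tool journal inspect                  # report on journal integrity
cephfs-journal-tool journal export backup.bin        # keep a copy before any repair
cephfs-journal-tool event get list                   # search/filter the journal's events
cephfs-journal-tool event recover_dentries summary   # salvage metadata into the backing store
cephfs-journal-tool journal reset                    # last resort: erase the damaged journal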


Handling resource limits

● Write a test, see what breaks!

● Full MDS cache:
– Require some free memory to make progress
– Require client cooperation to unpin cache objects
– Anticipate tuning required for cache behaviour: what should we evict?

● Full OSD cluster:
– Require explicit handling to abort with -ENOSPC

● MDS → RADOS flow control:
– Contention between I/O to flush cache and I/O to journal


Test, QA, bug fixes

● The answer to “Is CephFS production ready?”

● teuthology test framework:
– Long running/thrashing tests
– Third party FS correctness tests
– Python functional tests

● We dogfood CephFS internally
– Various kclient fixes discovered
– Motivation for new health monitoring metrics

● Third party testing is extremely valuable


What's next?

● You tell us!

● Recent survey highlighted:
– FSCK hardening
– Multi-MDS hardening
– Quota support

● Which use cases will the community test with?
– General purpose
– Backup
– Hadoop


Reporting bugs

● Does the most recent development release or kernel fix your issue?

● What is your configuration? MDS config, Ceph version, client version, kclient or fuse

● What is your workload?

● Can you reproduce with debug logging enabled?

http://ceph.com/resources/mailing-list-irc/

http://tracker.ceph.com/projects/ceph/issues

http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
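A sketch of turning up debug logging for a reproduction, following the log-and-debug guide above; the MDS name "a" and the levels are illustrative, and these settings are verbose, so revert them afterwards:

ceph daemon mds.a config set debug_mds 20   # verbose MDS logging at runtime
ceph daemon mds.a config set debug_ms 1     # log messenger traffic
# persistently, in ceph.conf:
#   [mds]
#   debug mds = 20
#   debug ms = 1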


Future

● Ceph Developer Summit:
– When: 8 October
– Where: online

● Post-Hammer work:
– Recent survey highlighted multi-MDS, quota support
– Testing with clustered Samba/NFS?