OpenNebulaConf 2014 - Using Ceph to provide scalable storage for OpenNebula - John Spray



Using Ceph with OpenNebula

John Spray ([email protected])


Agenda

● What is it?

● Architecture

● Integration with OpenNebula

● What's new?


What is Ceph?


What is Ceph?

● Highly available resilient data store

● Free Software (LGPL)

● 10 years since inception

● Flexible object, block and filesystem interfaces

● Especially popular in private clouds as a VM image service and as an S3-compatible object storage service


Interfaces to storage

● OBJECT STORAGE (RGW): native API, S3 & Swift compatible, multi-tenant, Keystone integration, geo-replication
● BLOCK STORAGE (RBD): snapshots, clones, Linux kernel client, iSCSI, OpenStack integration
● FILE SYSTEM (CephFS): POSIX, Linux kernel client, CIFS/NFS, HDFS, distributed metadata


Ceph Architecture


Architectural Components

● RGW: A web services gateway for object storage, compatible with S3 and Swift

● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)

● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors

● RBD: A reliable, fully-distributed block device with cloud platform integration

● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management


Object Storage Daemons

[Diagram: each OSD daemon serves one disk through a local filesystem (btrfs, xfs, or ext4); a small number of monitor (M) daemons run alongside the OSDs]


RADOS Components

OSDs:
– 10s to 10000s in a cluster
– One per disk (or one per SSD, RAID group…)
– Serve stored objects to clients
– Intelligently peer for replication & recovery

Monitors:
– Maintain cluster membership and state
– Provide consensus for distributed decision-making
– Small, odd number
– These do not serve stored objects to clients


RADOS Cluster

[Diagram: an application communicating with a RADOS cluster made up of monitors (M) and OSD nodes]


Where do objects live?

[Diagram: an application holding an object, asking which node in the cluster should store it]


A Metadata Server?

[Diagram: (1) the application asks a central metadata server where the object lives, then (2) contacts the node that stores it]


Calculated placement

[Diagram: the application applies a function F to the object name to pick a server, with each server owning a static hash range (A-G, H-N, O-T, U-Z)]


Even better: CRUSH

[Diagram: CRUSH maps an object onto a set of OSDs across the RADOS cluster]


CRUSH is a quick calculation

[Diagram: the same mapping, computed directly by the client without consulting a lookup table]


CRUSH: Dynamic data placement

● CRUSH: Pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
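As a concrete illustration of "fast calculation, no lookup", the cluster's CRUSH map can be pulled out and exercised offline with crushtool; a minimal sketch, assuming access to a running cluster (rule id 0, the input range and the file names are illustrative):

ceph osd getcrushmap -o crushmap.bin        # fetch the cluster's compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt   # decompile it into readable rules and topology
crushtool --test -i crushmap.bin --rule 0 --num-rep 3 --min-x 0 --max-x 9 --show-mappings
                                            # simulate placements; rerunning yields identical mappings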


Architectural Components



RBD: Virtual disks in Ceph

RADOS BLOCK DEVICE:
● Storage of disk images in RADOS
● Decouples VMs from host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in:
– Mainline Linux kernel (2.6.39+)
– Qemu/KVM
– OpenStack, CloudStack, OpenNebula, Proxmox
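A minimal rbd CLI sketch of the snapshot/clone workflow above; the pool name "one" and the image names are illustrative, and format 2 images are required for cloning:

rbd create --size 10240 --image-format 2 one/base-image   # 10 GB image, format 2 for clone support
rbd snap create one/base-image@golden                     # point-in-time snapshot
rbd snap protect one/base-image@golden                    # protect it so it can be cloned
rbd clone one/base-image@golden one/vm-42-disk            # thin, copy-on-write clone
rbd ls -l one                                             # list images, snapshots and clones in the pool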


Storing virtual disks

[Diagram: a VM on a hypervisor accesses its virtual disk through librbd, which talks to the RADOS cluster]


Using Ceph with OpenNebula


Storage in OpenNebula deployments

OpenNebula Cloud Architecture Survey 2014 (http://c12g.com/resources/survey/)


RBD and libvirt/qemu

● librbd (user space) client integration with libvirt/qemu

● Support for live migration, thin clones

● Get recent versions!

● Directly supported in OpenNebula since 4.0 with the Ceph Datastore (wraps `rbd` CLI)

More info online:

http://ceph.com/docs/master/rbd/libvirt/
http://docs.opennebula.org/4.10/administration/storage/ceph_ds.html
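For reference, a sketch of what an OpenNebula Ceph Datastore definition can look like; attribute names follow the 4.10 guide linked above, while the pool, monitor hosts, user and bridge list are illustrative, and CEPH_SECRET must be the UUID of the libvirt secret that holds the Ceph key:

cat > ceph.ds <<'EOF'
NAME        = ceph_ds
DS_MAD      = ceph
TM_MAD      = ceph
DISK_TYPE   = RBD
POOL_NAME   = one
CEPH_HOST   = "ceph-mon1:6789 ceph-mon2:6789 ceph-mon3:6789"
CEPH_USER   = libvirt
CEPH_SECRET = "<uuid-of-libvirt-secret>"
BRIDGE_LIST = "ceph-frontend"
EOF
onedatastore create ceph.ds      # register the datastore with the frontend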


Other hypervisors

● OpenNebula is flexible, so can we also use Ceph with non-libvirt/qemu hypervisors?

● Kernel RBD: can present RBD images in /dev/ on hypervisor host for software unaware of librbd

● Docker: can exploit RBD volumes with a local filesystem for use as data volumes – maybe CephFS in future...?

● For unsupported hypervisors, can adapt to Ceph using e.g. iSCSI for RBD, or NFS for CephFS (but test re-exports carefully!)
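A sketch of the kernel RBD path mentioned above, for software that is unaware of librbd; the pool "one", image name and mount point are illustrative:

rbd map one/vm-42-disk           # kernel client exposes the image, e.g. as /dev/rbd0
mkfs.xfs /dev/rbd0               # put a local filesystem on it (first use only)
mount /dev/rbd0 /mnt/vm-42       # now usable by Docker data volumes, legacy hypervisors, etc.
umount /mnt/vm-42
rbd unmap /dev/rbd0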


Choosing hardware

Testing/benchmarking/expert advice is needed, but there are general guidelines:

● Prefer many cheap nodes to few expensive nodes (10 is better than 3)

● Include small but fast SSDs for OSD journals

● Don't simply buy the biggest drives: consider the IOPS/capacity ratio

● Provision network and IO capacity sufficient for your workload plus recovery bandwidth from node failure.


What's new?


Ceph releases

● Ceph 0.80 firefly (May 2014)
– Cache tiering & erasure coding
– Key/value OSD backends
– OSD primary affinity
● Ceph 0.87 giant (October 2014)
– RBD cache enabled by default
– Performance improvements
– Locally recoverable erasure codes
● Ceph x.xx hammer (2015)


Additional components

● Ceph FS – scale-out POSIX filesystem service, currently being stabilized

● Calamari – monitoring dashboard for Ceph

● ceph-deploy – easy SSH-based deployment tool

● Puppet, Chef modules
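A minimal ceph-deploy sketch of standing up a small cluster; three hosts reachable over SSH and a spare /dev/sdb on each are assumptions, and the node names are illustrative:

ceph-deploy new node1 node2 node3                      # write an initial ceph.conf with these monitors
ceph-deploy install node1 node2 node3                  # install Ceph packages over SSH
ceph-deploy mon create-initial                         # bring up the monitor quorum
ceph-deploy osd create node1:sdb node2:sdb node3:sdb   # prepare and activate one OSD per host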


Get involved

Evaluate the latest releases:

http://ceph.com/resources/downloads/

Mailing list, IRC:

http://ceph.com/resources/mailing-list-irc/

Bugs:

http://tracker.ceph.com/projects/ceph/issues

Online developer summits:

https://wiki.ceph.com/Planning/CDS


Questions?


Spare slides


Ceph FS


CephFS architecture

● Dynamically balanced scale-out metadata

● Inherit flexibility/scalability of RADOS for data

● POSIX compatibility

● Beyond POSIX: Subtree snapshots, recursive statistics

Weil, Sage A., et al. "Ceph: A scalable, high-performance distributed file system." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, 2006.
http://ceph.com/papers/weil-ceph-osdi06.pdf


Components

● Client: kernel, fuse, libcephfs

● Server: MDS daemon

● Storage: RADOS cluster (mons & OSDs)


Components

[Diagram: a Linux host mounts CephFS via the ceph.ko kernel client; metadata traffic goes to the MDS and data traffic to the OSDs among the Ceph server daemons (monitors, OSDs, MDS)]


From application to disk

[Diagram: Application → client (libcephfs, ceph-fuse, or kernel client) → client network protocol → ceph-mds and RADOS → disk]


Scaling out FS metadata

● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation


Dynamic subtree placement


Dynamic subtree placement

● Locality: get the dentries in a dir from one MDS

● Support read heavy workloads by replicating non-authoritative copies (cached with capabilities just like clients do)

● In practice, work at the directory-fragment level in order to handle large directories


Data placement

● Stripe file contents across RADOS objects
– get full RADOS cluster bandwidth from clients
– fairly tolerant of object losses: reads of lost objects return zeros
● Control striping with layout vxattrs
– layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files 'stray'; RADOS delete ops are sent by the MDS
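A sketch of the layout vxattrs mentioned above, assuming CephFS is mounted at /mnt/ceph; a file's layout can only be set while it is still empty, and "fs_data2" stands in for a pool that has already been added to the filesystem as a data pool:

getfattr -n ceph.file.layout /mnt/ceph/existing-file                # show current striping and pool
touch /mnt/ceph/new-file
setfattr -n ceph.file.layout.stripe_count -v 4 /mnt/ceph/new-file   # stripe writes across 4 objects at a time
setfattr -n ceph.file.layout.pool -v fs_data2 /mnt/ceph/new-file    # place this file's data in another pool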


Clients

● Two implementations: ceph-fuse/libcephfs and the kernel client (kclient)

● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats etc.)

● Client performance matters for single-client workloads

● A slow client can hold up others if it's hogging metadata locks: include clients in troubleshooting

● Future: more per-client performance stats, and perhaps per-client metadata QoS; clients probably group into jobs or workloads

● Future: may want to tag client I/O with a job ID (e.g. HPC workload, Samba client ID, container/VM ID)


Journaling and caching in MDS

● Metadata ops initially journaled to striped journal "file" in the metadata pool.

– I/O latency on metadata ops is sum of network latency and journal commit latency.

– Metadata remains pinned in in-memory cache until expired from journal.


Journaling and caching in MDS

● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.

● Control cache size with mds_cache_size

● Cache eviction relies on client cooperation

● MDS journal replay not only recovers data but also warms up cache. Use standby replay to keep that cache warm.
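A sketch of adjusting the cache size at runtime; the option counts inodes rather than bytes, "mds.a" and the value are illustrative, and the ceph daemon form has to run on the host with that MDS's admin socket:

ceph daemon mds.a config show | grep mds_cache_size   # current limit
ceph daemon mds.a config set mds_cache_size 500000    # raise it at runtime
# or persistently, in ceph.conf:
#   [mds]
#   mds cache size = 500000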


Lookup by inode

● Sometimes we need an inode → path mapping:
– Hard links
– NFS handles

● Costly to store this: mitigate by piggybacking paths (backtraces) onto data objects
– Con: storing metadata to the data pool
– Con: extra IOs to set backtraces
– Pro: disaster recovery from the data pool

● Future: improve backtrace writing latency


CephFS in practice

ceph-deploy mds create myserver

ceph osd pool create fs_data 64        # pg_num of 64 is illustrative; size it for your cluster

ceph osd pool create fs_metadata 64

ceph fs new myfs fs_metadata fs_data

mount -t ceph x.x.x.x:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client, with cephx enabled


Managing CephFS clients

● New in giant: see hostnames of connected clients

● Client eviction is sometimes important:
– Skip the wait during reconnect phase on MDS restart
– Allow others to access files locked by crashed client

● Use OpTracker to inspect ongoing operations
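A sketch of the corresponding admin-socket commands; the MDS name "a" and the client ID are illustrative:

ceph daemon mds.a session ls            # connected clients, their state and (in giant) hostnames
ceph daemon mds.a dump_ops_in_flight    # OpTracker view of ongoing operations
ceph daemon mds.a session evict 4305    # forcibly evict a misbehaving or dead client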


CephFS tips

● Choose MDS servers with lots of RAM

● Investigate clients when diagnosing stuck/slow access

● Use recent Ceph and recent kernel

● Use a conservative configuration:
– Single active MDS, plus one standby
– Dedicated MDS server
– Kernel client
– No snapshots, no inline data


Towards a production-ready CephFS

● Focus on resilience:

1. Don't corrupt things

2. Stay up

3. Handle the corner cases

4. When something is wrong, tell me

5. Provide the tools to diagnose and fix problems

● Achieve this first within a conservative single-MDS configuration


Giant->Hammer timeframe

● Initial online fsck (a.k.a. forward scrub)

● Online diagnostics (`session ls`, MDS health alerts)

● Journal resilience & tools (cephfs-journal-tool)

● flock in the FUSE client

● Initial soft quota support

● General resilience: full OSDs, full metadata cache
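For the initial soft quota support, the interface is vxattr-based and enforcement happens in the client; a minimal sketch, assuming a CephFS mount at /mnt/ceph (directory and limits are illustrative):

setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/ceph/projects   # ~10 GiB soft limit on the subtree
setfattr -n ceph.quota.max_files -v 100000 /mnt/ceph/projects        # cap the number of files
getfattr -n ceph.quota.max_bytes /mnt/ceph/projects                  # read the limit back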


FSCK and repair

● Recover from damage:
– Loss of data objects (which files are damaged?)
– Loss of metadata objects (what subtree is damaged?)

● Continuous verification:
– Are recursive stats consistent?
– Does metadata on disk match cache?
– Does file size metadata match data on disk?

● Repair:
– Automatic where possible
– Manual tools to enable support


Client management

● Current eviction is not 100% safe against rogue clients
– Update to client protocol to wait for OSD blacklist

● Client metadata
– Initially domain name, mount point
– Extension to other identifiers?


Online diagnostics

● Bugs exposed relate to failures of one client to release resources for another client: “my filesystem is frozen”. Introduce new health messages:
– “client xyz is failing to respond to cache pressure”
– “client xyz is ignoring capability release messages”
– Add client metadata to allow us to give domain names instead of IP addrs in messages

● Opaque behavior in the face of dead clients. Introduce `session ls`:
– Which clients does the MDS think are stale?
– Identify clients to evict with `session evict`


Journal resilience

● Bad journal prevents MDS recovery: “my MDS crashes on startup”
– Data loss
– Software bugs

● Updated on-disk format to make recovery from damage easier

● New tool: cephfs-journal-tool
– Inspect the journal, search/filter
– Chop out unwanted entries/regions
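A sketch of how cephfs-journal-tool is typically driven against a damaged journal; run it while the MDS rank is down, export a backup before changing anything, and treat the file name as illustrative:

cephfs-journal-tool journal inspect                  # report on journal integrity
cephfs-journal-tool journal export backup.bin        # keep a copy before any repair
cephfs-journal-tool event get list                   # search/filter the journal's events
cephfs-journal-tool event recover_dentries summary   # salvage metadata into the backing store
cephfs-journal-tool journal reset                    # last resort: erase the damaged journal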


Handling resource limits

● Write a test, see what breaks!

● Full MDS cache:
– Require some free memory to make progress
– Require client cooperation to unpin cache objects
– Anticipate tuning required for cache behaviour: what should we evict?

● Full OSD cluster:
– Require explicit handling to abort with -ENOSPC

● MDS → RADOS flow control:
– Contention between I/O to flush cache and I/O to journal


Test, QA, bug fixes

● The answer to “Is CephFS production ready?”

● teuthology test framework:
– Long running/thrashing tests
– Third party FS correctness tests
– Python functional tests

● We dogfood CephFS internally
– Various kclient fixes discovered
– Motivation for new health monitoring metrics

● Third party testing is extremely valuable


What's next?

● You tell us!

● Recent survey highlighted:
– FSCK hardening
– Multi-MDS hardening
– Quota support

● Which use cases will the community test with?
– General purpose
– Backup
– Hadoop


Reporting bugs

● Does the most recent development release or kernel fix your issue?

● What is your configuration? MDS config, Ceph version, client version, kclient or fuse

● What is your workload?

● Can you reproduce with debug logging enabled?

http://ceph.com/resources/mailing-list-irc/

http://tracker.ceph.com/projects/ceph/issues

http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
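A sketch of turning up debug logging for a reproduction, following the log-and-debug guide above; the MDS name "a" and the levels are illustrative, and these settings are verbose, so revert them afterwards:

ceph daemon mds.a config set debug_mds 20   # verbose MDS logging at runtime
ceph daemon mds.a config set debug_ms 1     # log messenger traffic
# persistently, in ceph.conf:
#   [mds]
#   debug mds = 20
#   debug ms = 1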


Future

● Ceph Developer Summit:
– When: 8 October
– Where: online

● Post-Hammer work:
– Recent survey highlighted multi-MDS, quota support
– Testing with clustered Samba/NFS?