Ceph Block Devices: A Deep Dive
Josh Durgin, RBD Lead
June 24, 2015


TRANSCRIPT

Page 1:

Ceph Block Devices: A Deep Dive
Josh Durgin
RBD Lead
June 24, 2015

Page 2:

Ceph Motivating Principles

● All components must scale horizontally
● There can be no single point of failure
● The solution must be hardware agnostic
● Should use commodity hardware
● Self-manage wherever possible
● Open Source (LGPL)
● Move beyond legacy approaches

– client/cluster instead of client/server
– Ad hoc HA

Page 3:

Ceph Components

● RADOS – A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW – A web services gateway for object storage, compatible with S3 and Swift
● RBD – A reliable, fully-distributed block device with cloud platform integration
● CEPHFS – A distributed file system with POSIX semantics and scale-out metadata management

(Architecture diagram: APP, HOST/VM, and CLIENT consumers sit above these layers)

Page 4:

Ceph Components (architecture overview repeated; see Page 3)

Page 5:

Storing Virtual Disks

(Diagram: a VM running on a hypervisor uses LIBRBD, which talks directly to the RADOS cluster and its monitors)

Page 6:

Kernel Module

(Diagram: a Linux host uses KRBD, the in-kernel RBD client, which talks directly to the RADOS cluster and its monitors)

Page 7:

RBD

● Stripe images across entire cluster (pool)
● Read-only snapshots
● Copy-on-write clones
● Broad integration

– QEMU, libvirt
– Linux kernel
– iSCSI (STGT, LIO)
– OpenStack, CloudStack, OpenNebula, Ganeti, Proxmox, oVirt

● Incremental backup (relative to snapshots)
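A minimal sketch of these features from the rbd CLI (the image and snapshot names are hypothetical, and the commands assume the default 'rbd' pool):

    # create a 10 GiB image and take a point-in-time snapshot
    rbd create vm-disk --size 10240
    rbd snap create rbd/vm-disk@backup1

    # full export up to the first snapshot, then an incremental diff
    rbd export-diff rbd/vm-disk@backup1 backup1.diff
    rbd snap create rbd/vm-disk@backup2
    rbd export-diff --from-snap backup1 rbd/vm-disk@backup2 backup2.diff

    # the diffs can be replayed elsewhere with: rbd import-diff <diff> <image>

The incremental diff contains only the extents that changed between the two snapshots, which is what keeps the backups cheap.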

Page 8:

Page 9:

Ceph Components (architecture overview repeated; see Page 3)

Page 10:

RADOS

● Flat namespace within a pool
● Rich object API

– Bytes, attributes, key/value data
– Partial overwrite of existing data
– Single-object compound atomic operations
– RADOS classes (stored procedures)

● Strong consistency (CP system)
● Infrastructure aware, dynamic topology
● Hash-based placement (CRUSH)
● Direct client to server data path
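The same object API that librbd builds on can be exercised directly with the rados CLI; a small sketch against a hypothetical object in the 'rbd' pool:

    # write bytes, an xattr, and omap (key/value) data to one object
    echo "hello" > /tmp/payload
    rados -p rbd put demo-obj /tmp/payload
    rados -p rbd setxattr demo-obj owner josh
    rados -p rbd setomapval demo-obj state ready

    # read them back
    rados -p rbd getxattr demo-obj owner
    rados -p rbd listomapvals demo-obj

librbd combines several such operations on a single object into compound atomic updates (for example when changing an image header), and RADOS classes push richer logic onto the OSDs themselves.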

Page 11:

RADOS Components

OSDs:
● 10s to 1000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery

Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number (e.g., 5)
● Not part of data path

Page 12:

Ceph Components (architecture overview repeated; see Page 3)

Page 13:

Metadata

● rbd_directory
– Maps image name to id, and vice versa

● rbd_children
– Lists clones in a pool, indexed by parent

● rbd_id.$image_name
– The internal id, locatable using only the user-specified image name

● rbd_header.$image_id
– Per-image metadata
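These are ordinary RADOS objects, so they can be inspected with the rados CLI. A sketch (the image name and internal id are hypothetical, and the layout described applies to format 2 images):

    # name <-> id mapping for every image in the pool
    rados -p rbd listomapvals rbd_directory

    # the internal id of one image, reachable from its name alone
    rados -p rbd get rbd_id.vm-disk /dev/stdout

    # per-image metadata (size, object prefix, features, snapshots, ...)
    # is stored as omap key/value pairs on the header object
    rados -p rbd listomapvals rbd_header.101274b0dc51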

Page 14:

Data

● rbd_data.* objects
– Named based on offset in image
– Non-existent to start with
– Plain data in each object
– Snapshots handled by rados
– Often sparse
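A sketch of how an image offset maps to a data object name (the block name prefix shown is hypothetical; rbd info prints the real one):

    # the prefix comes from the image header
    rbd info rbd/vm-disk | grep block_name_prefix
    #   block_name_prefix: rbd_data.101274b0dc51

    # with the default 4 MiB objects, byte offset 100 MiB falls in
    # object number 104857600 / 4194304 = 25 = 0x19, i.e. the object
    #   rbd_data.101274b0dc51.0000000000000019

    # only objects that have actually been written exist in the pool
    rados -p rbd ls | grep '^rbd_data\.101274b0dc51' | sort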

Page 15:

Striping

● Objects are uniformly sized
– Default is simple 4MB divisions of device

● Randomly distributed among OSDs by CRUSH
● Parallel work is spread across many spindles
● No single set of servers responsible for image
● Small objects lower OSD usage variance
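CRUSH placement can be checked per object; a sketch showing how adjacent data objects of one image map to placement groups and OSDs (object prefix hypothetical, as above):

    ceph osd map rbd rbd_data.101274b0dc51.0000000000000000
    ceph osd map rbd rbd_data.101274b0dc51.0000000000000001
    # each invocation reports the placement group and the acting set
    # of OSDs for that object name

Because the object name feeds the hash, consecutive 4MB chunks of an image end up scattered across the whole pool.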

Page 16:

I/O

(Diagram: the VM's I/O flows through the hypervisor's LIBRBD and its cache to the RADOS cluster and its monitors)

Page 17:

Snapshots

● Object granularity
● Snapshot context [list of snap ids, latest snap id]

– Stored in rbd image header (self-managed)
– Sent with every write

● Snapshot ids managed by monitors
● Deleted asynchronously
● RADOS keeps per-object overwrite stats, so diffs are easy
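A sketch of the snapshot workflow from the CLI (names hypothetical):

    rbd snap create rbd/vm-disk@before-upgrade
    rbd snap ls rbd/vm-disk

    # per-object overwrite tracking makes changed-extent listing cheap
    rbd diff --from-snap before-upgrade rbd/vm-disk

    # removal returns immediately; the snapshotted object data is
    # reclaimed asynchronously by the OSDs
    rbd snap rm rbd/vm-disk@before-upgrade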

Page 18:

Snapshots

(Diagram: LIBRBD, whose snap context is ([], 7), sends a write to an object)

Page 19:

Snapshots

(Diagram: LIBRBD requests "Create snap 8")

Page 20:

Snapshots

(Diagram: LIBRBD sets its snap context to ([8], 8))

Page 21:

Snapshots

(Diagram: LIBRBD sends a write carrying snap context ([8], 8))

Page 22:

Snapshots

(Diagram: on the first write after the snapshot, the OSD preserves the object's prior contents as snap 8, then applies the write)

Page 23:

Watch/Notify

● establish stateful 'watch' on an object

– client interest persistently registered with object
– client keeps connection to OSD open

● send 'notify' messages to all watchers
– notify message (and payload) sent to all watchers
– notification (and reply payloads) on completion

● strictly time-bounded liveness check on watch
– no notifier falsely believes we got a message
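librbd registers a watch on each open image's header object, which can be observed with the rados CLI; a sketch (the header object name is hypothetical):

    # list the clients currently watching an image header
    rados -p rbd listwatchers rbd_header.101274b0dc51

    # the rados tool also offers bare watch/notify subcommands, handy for
    # seeing the mechanism in isolation (run in two separate shells):
    rados -p rbd watch demo-obj
    rados -p rbd notify demo-obj hello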

Page 24:

Watch/Notify

(Diagram: two clients, a hypervisor's LIBRBD and /usr/bin/rbd's LIBRBD, each hold a watch on the image's header object)

Page 25:

Watch/Notify

(Diagram: /usr/bin/rbd adds a snapshot, modifying the header object)

Page 26:

Watch/Notify

(Diagram: a notify is sent to every watcher of the header object)

Page 27:

Watch/Notify

(Diagram: each watcher sends a notify ack; the notifier receives notify complete)

Page 28:

Watch/Notify

(Diagram: on notification, the hypervisor's LIBRBD re-reads the image metadata)

Page 29:

Clones

● Copy-on-write (and optionally read, in Hammer)
● Object granularity
● Independent settings

– striping, feature bits, object size can differ
– can be in different pools

● Clones are based on protected snapshots
– 'protected' means they can't be deleted

● Can be flattened
– Copy all data from parent
– Remove parent relationship
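A sketch of the clone lifecycle from the CLI (image names hypothetical; cloning requires format 2 images):

    # protect the golden snapshot, then clone it (possibly into another pool)
    rbd snap protect rbd/vm-disk@gold
    rbd clone rbd/vm-disk@gold rbd/vm-clone
    rbd children rbd/vm-disk@gold

    # flatten copies the remaining parent data into the clone and
    # removes the parent relationship
    rbd flatten rbd/vm-clone

    # once no clones reference it, the snapshot can be unprotected
    rbd snap unprotect rbd/vm-disk@gold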

Page 30:

Clones - read

(Diagram: a read of an object not yet present in the clone falls through to the clone's parent snapshot)

Page 31:

Clones - write

(Diagram: LIBRBD issues a write to an object that does not yet exist in the clone)

Page 32:

Clones - write

(Diagram: the missing object triggers a read of the corresponding data from the clone's parent)

Page 33:

Clones - write

(Diagram: the parent data is copied up into the clone's object, then the write is applied)

Page 34:

Ceph and OpenStack

Page 35:

Virtualized Setup

● Secret key stored by libvirt
● XML defining the VM is fed in, and includes:
– Monitor addresses
– Client name
– QEMU block device cache setting (writeback recommended)
– Bus on which to attach the block device (virtio-blk/virtio-scsi recommended; IDE OK for legacy systems)
– Discard options (ide/scsi only) and I/O throttling, if desired
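In the usual OpenStack/libvirt setup the cephx key is stored as a libvirt secret and the disk appears in the domain XML, but the moving parts can be sketched by hand (the secret UUID, user, pool and image names are hypothetical; libvirt wires the secret to qemu differently):

    # store the cephx key where libvirt can hand it to qemu
    virsh secret-define --file ceph-secret.xml
    virsh secret-set-value --secret <uuid> --base64 "$(ceph auth get-key client.libvirt)"

    # a hand-rolled equivalent of the resulting qemu disk configuration
    qemu-system-x86_64 -enable-kvm -m 2048 \
        -drive format=raw,file=rbd:rbd/vm-disk:id=libvirt,cache=writeback,if=virtio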

Page 36:

Virtualized Setup

(Diagram: libvirt reads vm.xml and launches 'qemu -enable-kvm ...'; the VM's block I/O passes through QEMU and LIBRBD with its cache)

Page 37:

Kernel RBD

● rbd map sets everything up
● /etc/ceph/rbdmap is like /etc/fstab
● udev adds handy symlinks:

– /dev/rbd/$pool/$image[@snap]

● striping v2 and later feature bits not supported yet
● Can be used to back LIO, NFS, SMB, etc.
● No specialized cache, page cache used by filesystem on top
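A sketch of mapping an image on a plain Linux host (image name hypothetical):

    rbd map rbd/vm-disk          # prints the device, e.g. /dev/rbd0
    ls -l /dev/rbd/rbd/vm-disk   # udev symlink to the same device

    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt

    rbd showmapped               # list mapped images
    umount /mnt
    rbd unmap /dev/rbd0

Entries in /etc/ceph/rbdmap plus the rbdmap service repeat the same mapping at boot.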

Page 38:

Page 39:

What's new in Hammer

● Copy-on-read
● Librbd cache enabled in safe mode by default
● Readahead during boot
● LTTng tracepoints
● Allocation hints
● Cache hints
● Exclusive locking (off by default for now)
● Object map (off by default for now)

Page 40:

Infernalis

● Easier space tracking (rbd du)
● Faster differential backups
● Per-image metadata

– Can persist rbd options

● RBD journaling
● Enabling new features on the fly
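A sketch of how these surface in the CLI from Infernalis onward (image name hypothetical):

    # fast, object-map-assisted space accounting
    rbd du rbd/vm-disk

    # per-image metadata; keys with a conf_ prefix override config
    # options for that image, which is how rbd options are persisted
    rbd image-meta set rbd/vm-disk conf_rbd_cache true
    rbd image-meta list rbd/vm-disk

    # enable a feature on an existing image instead of only at create time
    rbd feature enable rbd/vm-disk exclusive-lock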

Page 41:

Future Work

● Kernel client catch up
● RBD mirroring
● Consistency groups
● QoS for rados (policy at rbd level)
● Active/Active iSCSI
● Performance improvements

– NewStore OSD backend
– Improved cache tiering
– Finer-grained caching
– Many more

Page 42:

Questions?

Josh Durgin

[email protected]