Ceph Day New York 2014: Future of CephFS

Future of CephFS
Sage Weil


DESCRIPTION

Presented by Sage Weil, Red Hat

TRANSCRIPT

Page 1: Ceph Day New York 2014:  Future of CephFS

Future of CephFS
Sage Weil

Page 2: Ceph Day New York 2014:  Future of CephFS

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

[Stack diagram: APP on LIBRADOS, APP on RADOSGW, HOST/VM on RBD, CLIENT on CEPH FS, all layered over RADOS]

Page 3: Ceph Day New York 2014:  Future of CephFS

[Diagram: a client sends metadata operations to the metadata server cluster (M) and file data I/O directly to RADOS objects]

Page 4: Ceph Day New York 2014:  Future of CephFS

[Diagram: the metadata server cluster]

Page 5: Ceph Day New York 2014:  Future of CephFS

Metadata Server

● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Stores metadata in RADOS
● Does not serve file data to clients
● Only required for shared filesystem

Page 6: Ceph Day New York 2014:  Future of CephFS

legacy metadata storage

● a scaling disaster
● name → inode → block list → data
● no inode table locality
● fragmentation
  – inode table
  – directory
● many seeks
● difficult to partition

[Diagram: a conventional tree (etc, usr, var, home, ...; vmlinuz, hosts, mtab, passwd, ...) stored via a separate inode table and block lists]

Page 7: Ceph Day New York 2014:  Future of CephFS

ceph fs metadata storage

● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object

[Diagram: the same tree with inodes (1, 100, 102) embedded directly in their parent directory objects]

Page 8: Ceph Day New York 2014:  Future of CephFS

controlling metadata io

● view ceph-mds as cache
● reduce reads
  – dir+inode prefetching
● reduce writes
  – consolidate multiple writes
● large journal or log
  – stripe over objects
● two tiers
  – journal for short term
  – per-directory for long term
● fast failure recovery

[Diagram: journal and per-directory objects stored in RADOS]

Page 9: Ceph Day New York 2014:  Future of CephFS

one tree

three metadata servers

??

Page 10: Ceph Day New York 2014:  Future of CephFS

load distribution

● coarse (static subtree)
  – preserve locality
  – high management overhead
● fine (hash)
  – always balanced
  – less vulnerable to hot spots
  – destroy hierarchy, locality
● can a dynamic approach capture benefits of both extremes?

[Spectrum: static subtree (good locality) → hash directories → hash files (good balance)]

Page 11: Ceph Day New York 2014:  Future of CephFS
Page 12: Ceph Day New York 2014:  Future of CephFS
Page 13: Ceph Day New York 2014:  Future of CephFS
Page 14: Ceph Day New York 2014:  Future of CephFS
Page 15: Ceph Day New York 2014:  Future of CephFS

DYNAMIC SUBTREE PARTITIONING

Page 16: Ceph Day New York 2014:  Future of CephFS

dynamic subtree partitioning

● scalable
  – arbitrarily partition metadata
● adaptive
  – move work from busy to idle servers
  – replicate hot metadata
● efficient
  – hierarchical partition preserves locality
● dynamic
  – daemons can join/leave
  – take over for failed nodes

Page 17: Ceph Day New York 2014:  Future of CephFS

Dynamic partitioning

[Diagram: dynamic partitioning under a "many directories" workload vs. a "same directory" workload]

Page 18: Ceph Day New York 2014:  Future of CephFS

Failure recovery

Page 19: Ceph Day New York 2014:  Future of CephFS

Metadata replication and availability

Page 20: Ceph Day New York 2014:  Future of CephFS

Metadata cluster scaling

Page 21: Ceph Day New York 2014:  Future of CephFS

client protocol

● highly stateful
  – consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
  – when client traverses hierarchy
  – when metadata is migrated between servers
● direct access to OSDs for file I/O

Page 22: Ceph Day New York 2014:  Future of CephFS

an example

● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
● ls -al
  – open
  – readdir
    – 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT

[Diagram: round trips (RT) from the client to ceph-mon, ceph-mds, and ceph-osd]

Page 23: Ceph Day New York 2014:  Future of CephFS

recursive accounting

● ceph-mds tracks recursive directory stats
  – file sizes
  – file and directory counts
  – modification time
● virtual xattrs present full stats (see the sketch below)
● efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
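These recursive stats are exposed to every client as virtual extended attributes on directories (ceph.dir.rbytes, ceph.dir.rfiles, ceph.dir.rsubdirs, ceph.dir.rctime). A minimal sketch reading them with Python's os.getxattr against a kernel-mounted tree; the mount point /mnt/cephfs is illustrative:

import os

# Directory under an illustrative CephFS kernel mount point.
path = "/mnt/cephfs/home"

# Virtual xattrs maintained recursively by ceph-mds; values come back as bytes.
for attr in ("ceph.dir.rbytes", "ceph.dir.rfiles",
             "ceph.dir.rsubdirs", "ceph.dir.rctime"):
    print(attr, "=", os.getxattr(path, attr).decode())

These counters are what make the directory sizes in the listing above meaningful: each directory reports the recursive byte count of everything beneath it.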

Page 24: Ceph Day New York 2014:  Future of CephFS

snapshots

● volume or subvolume snapshots unusable at petabyte scale
● snapshot arbitrary subdirectories
● simple interface
  – hidden '.snap' directory
  – no special tools

$ mkdir foo/.snap/one         # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776            # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one         # remove snapshot
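Since a snapshot is just a mkdir inside the hidden '.snap' directory, any client and any language can drive it; a small sketch with plain os calls, assuming an illustrative kernel mount at /mnt/cephfs:

import os

# Snapshots are ordinary directory operations under the hidden .snap directory.
snapdir = "/mnt/cephfs/foo/.snap"          # illustrative mount point and subtree

os.mkdir(os.path.join(snapdir, "one"))     # take a snapshot of foo/
print(os.listdir(snapdir))                 # ['one']
os.rmdir(os.path.join(snapdir, "one"))     # drop the snapshot again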

Page 25: Ceph Day New York 2014:  Future of CephFS

multiple client implementations

● Linux kernel client
  – mount -t ceph 1.2.3.4:/ /mnt
  – export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  – your app (see the sketch below)
  – Samba (CIFS)
  – Ganesha (NFS)
  – Hadoop (map/reduce)

[Diagram: the kernel client and ceph-fuse on one side; libcephfs linked into your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop on the other]
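As a concrete illustration of the libcephfs path, the python-cephfs bindings wrap libcephfs.so, so an application can reach the filesystem with no kernel or FUSE mount at all. A sketch assuming those bindings are installed and that /etc/ceph/ceph.conf points at the cluster; treat the exact signatures as approximate:

import os
import cephfs  # python-cephfs, a thin wrapper around libcephfs.so

fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")   # illustrative config path
fs.mount()                      # talks to ceph-mon/ceph-mds directly

fs.mkdir("/demo", 0o755)
fd = fs.open("/demo/hello.txt", os.O_CREAT | os.O_WRONLY, 0o644)
fs.write(fd, b"written via libcephfs\n", 0)
fs.close(fd)

fs.unmount()
fs.shutdown()

Samba, Ganesha, and the Hadoop bindings in the diagram sit on this same library.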

Page 26: Ceph Day New York 2014:  Future of CephFS

Recent work

● Groundwork for hosting multiple file systems
● Admin-ability
  – Improved health reporting, logging
  – cephfs-journal-tool, improved journal flexibility
  – Ability to query, manipulate cluster state
● Consistency checking
  – forward scrub: verify correctness / integrity
  – backward scrub: recovery from corruption, data loss
● Performance

Page 27: Ceph Day New York 2014:  Future of CephFS
Page 28: Ceph Day New York 2014:  Future of CephFS

Dog food

● Using CephFS for internal build/test lab
  – 80 TB (80 x 1 TB HDDs, 10 hosts)
  – Old, crummy hardware with lots of failures
  – Linux kernel clients (ceph.ko, bleeding edge kernels)
● Lots of good lessons
  – Several kernel bugs found
  – Recovery performance issues
  – Lots of painful admin processes identified
  – Several fat fingers, facepalms

Page 29: Ceph Day New York 2014:  Future of CephFS

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

[Stack diagram, revisited: RADOS, LIBRADOS, RBD, and RADOSGW labeled AWESOME; CEPH FS labeled NEARLY AWESOME]

Page 30: Ceph Day New York 2014:  Future of CephFS

Path forward

● Testing
  – Various workloads
  – Multiple active MDSs
● Test automation
  – Simple workload generator scripts
  – Bug reproducers
● Hacking
  – Bug squashing
  – Long-tail features
● Integrations
  – Ganesha, Samba, *stacks

Page 31: Ceph Day New York 2014:  Future of CephFS
Page 32: Ceph Day New York 2014:  Future of CephFS

librados

Page 33: Ceph Day New York 2014:  Future of CephFS

object model

● pools
  – 1s to 100s
  – independent namespaces or object collections
  – replication level, placement policy
● objects
  – bazillions
  – blob of data (bytes to gigabytes)
  – attributes (e.g., “version=12”; bytes to kilobytes)
  – key/value bundle (bytes to gigabytes)
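A minimal sketch of this object model through the Python librados bindings (python-rados); the conffile path and the pool name demo-pool are illustrative, and the pool is assumed to already exist:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative config path
cluster.connect()

ioctx = cluster.open_ioctx("demo-pool")        # a pool: its own namespace and policy

ioctx.write_full("greeting", b"hello rados")   # object byte payload (the data blob)
ioctx.set_xattr("greeting", "version", b"12")  # small attribute alongside the data

print(ioctx.read("greeting"))                  # b'hello rados'
print(ioctx.get_xattr("greeting", "version"))  # b'12'

ioctx.close()
cluster.shutdown()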

Page 34: Ceph Day New York 2014:  Future of CephFS

atomic transactions

● client operations sent to the OSD cluster
  – operate on a single object
  – can contain a sequence of operations, e.g.
    – truncate object
    – write new object data
    – set attribute
● atomicity
  – all operations commit or do not commit atomically
● conditional
  – 'guard' operations can control whether the operation is performed
    – verify xattr has specific value
    – assert object is a specific version
  – allows atomic compare-and-swap etc. (see the sketch below)
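A hedged sketch of bundling several updates into one operation with python-rados; it assumes a bindings version whose WriteOpCtx exposes write_full() (the guard operations mentioned above live in the librados C API, e.g. rados_write_op_cmpxattr, and are not shown). Pool and object names are illustrative:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")

# Several updates to one object, applied by the OSD as a single atomic transaction.
with rados.WriteOpCtx() as op:
    op.write_full(b"new object body")            # replace the byte payload ...
    ioctx.set_omap(op, ("state",), (b"clean",))  # ... and update a key/value pair
    ioctx.operate_write_op(op, "greeting")       # everything commits, or nothing does

ioctx.close()
cluster.shutdown()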

Page 35: Ceph Day New York 2014:  Future of CephFS

key/value storage

● store key/value pairs in an object
  – independent from object attrs or byte data payload
● based on google's leveldb
  – efficient random and range insert/query/removal
  – based on BigTable SSTable design
● exposed via key/value API
  – insert, update, remove
  – individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
  – e.g., file system directory objects (see the sketch below)
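A sketch of that key/value API using python-rados omap calls: store a few directory-entry-like pairs in one object, then read them back as an ordered range. Names are illustrative, and the WriteOpCtx/ReadOpCtx usage assumes a reasonably recent bindings version:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")

# Insert key/value pairs into the object's omap (separate from xattrs and byte data).
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ("bar", "baz", "foo"),
                       (b"inode 100", b"inode 101", b"inode 102"))
    ioctx.operate_write_op(op, "dir.1")

# Range query: up to 100 keys, starting from the beginning, no prefix filter.
with rados.ReadOpCtx() as op:
    entries, ret = ioctx.get_omap_vals(op, "", "", 100)
    ioctx.operate_read_op(op, "dir.1")
    for key, value in entries:
        print(key, value)

ioctx.close()
cluster.shutdown()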

Page 36: Ceph Day New York 2014:  Future of CephFS

watch/notify

● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps session to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) is distributed to all watchers
  – variable timeout
  – notification on completion
    – all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
  – locking, distributed coordination (ala ZooKeeper), etc.
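The librados C API exposes this as rados_watch2 / rados_notify2; newer python-rados builds (which postdate this talk) wrap it as Ioctx.watch() and Ioctx.notify(). A hedged sketch assuming such a build; the callback signature in particular is an assumption:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")
ioctx.write_full("channel", b"")              # any object can act as the channel

def on_notify(notify_id, notifier_id, watch_id, data):
    # Runs for each notify delivered on "channel"; e.g. invalidate a local cache here.
    print("notify received:", data)

watch = ioctx.watch("channel", on_notify)     # register persistent interest
ioctx.notify("channel", "please invalidate")  # returns once watchers ack or time out

watch.close()
ioctx.close()
cluster.shutdown()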

Page 37: Ceph Day New York 2014:  Future of CephFS

[Sequence diagram: clients #1, #2, and #3 each watch an object on an OSD (watch, ack/commit); a notify is fanned out to all watchers, each acks, and the notifier receives a final completion]

Page 38: Ceph Day New York 2014:  Future of CephFS

watch/notify example

● radosgw cache consistency
  – radosgw instances watch a single object (.rgw/notify)
  – locally cache bucket metadata
  – on bucket metadata changes (removal, ACL changes)
    – write change to relevant bucket object
    – send notify with bucket name to other radosgw instances
  – on receipt of notify
    – invalidate relevant portion of cache

Page 39: Ceph Day New York 2014:  Future of CephFS

rados classes

● dynamically loaded .so
  – /var/lib/rados-classes/*
  – implement new object “methods” using existing methods
  – part of I/O pipeline
  – simple internal API
● reads
  – can call existing native or class methods
  – do whatever processing is appropriate
  – return data
● writes
  – can call existing native or class methods
  – do whatever processing is appropriate
  – generates a resulting transaction to be applied atomically
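From the client side a class method is invoked like any other object operation; python-rados exposes this as Ioctx.execute(). A sketch calling the say_hello method of the example 'hello' class that ships with Ceph; if your build lacks that class, read the class and method names as placeholders:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")
ioctx.write_full("greeting", b"")   # the method runs on the OSD that holds this object

# Ioctx.execute(object, class, method, input) returns (retval, method output).
ret, out = ioctx.execute("greeting", "hello", "say_hello", b"New York")
print(out)                          # e.g. b'Hello, New York!'

ioctx.close()
cluster.shutdown()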

Page 40: Ceph Day New York 2014:  Future of CephFS

class examples

● grep
  – read an object, filter out individual records, and return those
● sha1
  – read object, generate fingerprint, return that
● images
  – rotate, resize, crop image stored in object
  – remove red-eye
● crypto
  – encrypt/decrypt object data with provided key

Page 41: Ceph Day New York 2014:  Future of CephFS

ideas

● distributed key/value table
  – aggregate many k/v objects into one big 'table'
  – working prototype exists (thanks, Eleanor!)

Page 42: Ceph Day New York 2014:  Future of CephFS

ideas

● lua rados class
  – embed lua interpreter in a rados class
  – ship semi-arbitrary code for operations
● json class
  – parse, manipulate json structures

Page 43: Ceph Day New York 2014:  Future of CephFS

ideas

● rados mailbox (RMB?)
  – plug librados backend into dovecot, postfix, etc.
  – key/value object for each mailbox
    – key = message id
    – value = headers
  – object for each message or attachment
  – watch/notify for delivery notification

Page 44: Ceph Day New York 2014:  Future of CephFS
Page 45: Ceph Day New York 2014:  Future of CephFS

hard links?

● rare
● useful locality properties
  – intra-directory
  – parallel inter-directory
● on miss, file objects provide per-file backpointers
  – degenerates to log(n) lookups
  – optimistic read complexity