Ceph Day New York 2014: Future of CephFS

Future of CephFS
Sage Weil


DESCRIPTION

Presented by Sage Weil, Red Hat

TRANSCRIPT

Page 1: Ceph Day New York 2014:  Future of CephFS

Future of CephFS
Sage Weil

Page 2: Ceph Day New York 2014:  Future of CephFS

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

[Stack diagram: APP on LIBRADOS, APP on RADOSGW, HOST/VM on RBD, CLIENT on CEPH FS, all layered over RADOS]

Page 3: Ceph Day New York 2014:  Future of CephFS

[Diagram: a client sends metadata operations to the metadata server cluster (M) and file data I/O directly to RADOS objects]

Page 4: Ceph Day New York 2014:  Future of CephFS

[Diagram: the metadata server cluster]

Page 5: Ceph Day New York 2014:  Future of CephFS

Metadata Server

● Manages metadata for a POSIX-compliant shared filesystem
  – Directory hierarchy
  – File metadata (owner, timestamps, mode, etc.)
● Stores metadata in RADOS
● Does not serve file data to clients
● Only required for shared filesystem

Page 6: Ceph Day New York 2014:  Future of CephFS

legacy metadata storage

● a scaling disaster
● name → inode → block list → data
● no inode table locality
● fragmentation
  – inode table
  – directory
● many seeks
● difficult to partition

[Diagram: a conventional tree (etc, usr, var, home, ...; vmlinuz, hosts, mtab, passwd, ...) stored via a separate inode table and block lists]

Page 7: Ceph Day New York 2014:  Future of CephFS

ceph fs metadata storage

● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object

[Diagram: the same tree with inodes (1, 100, 102) embedded directly in their parent directory objects]

Page 8: Ceph Day New York 2014:  Future of CephFS

controlling metadata io

● view ceph-mds as cache
● reduce reads
  – dir+inode prefetching
● reduce writes
  – consolidate multiple writes
● large journal or log
  – stripe over objects
● two tiers
  – journal for short term
  – per-directory for long term
● fast failure recovery

[Diagram: journal and per-directory objects stored in RADOS]

Page 9: Ceph Day New York 2014:  Future of CephFS

one tree

three metadata servers

??

Page 10: Ceph Day New York 2014:  Future of CephFS

load distribution

● coarse (static subtree)
  – preserve locality
  – high management overhead
● fine (hash)
  – always balanced
  – less vulnerable to hot spots
  – destroy hierarchy, locality
● can a dynamic approach capture benefits of both extremes?

[Spectrum: static subtree (good locality) → hash directories → hash files (good balance)]

Page 11: Ceph Day New York 2014:  Future of CephFS
Page 12: Ceph Day New York 2014:  Future of CephFS
Page 13: Ceph Day New York 2014:  Future of CephFS
Page 14: Ceph Day New York 2014:  Future of CephFS
Page 15: Ceph Day New York 2014:  Future of CephFS

DYNAMIC SUBTREE PARTITIONING

Page 16: Ceph Day New York 2014:  Future of CephFS

dynamic subtree partitioning

● scalable
  – arbitrarily partition metadata
● adaptive
  – move work from busy to idle servers
  – replicate hot metadata
● efficient
  – hierarchical partition preserves locality
● dynamic
  – daemons can join/leave
  – take over for failed nodes

Page 17: Ceph Day New York 2014:  Future of CephFS

Dynamic partitioning

[Diagram: dynamic partitioning under a "many directories" workload vs. a "same directory" workload]

Page 18: Ceph Day New York 2014:  Future of CephFS

Failure recovery

Page 19: Ceph Day New York 2014:  Future of CephFS

Metadata replication and availability

Page 20: Ceph Day New York 2014:  Future of CephFS

Metadata cluster scaling

Page 21: Ceph Day New York 2014:  Future of CephFS

client protocol

● highly stateful
  – consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
  – when client traverses hierarchy
  – when metadata is migrated between servers
● direct access to OSDs for file I/O

Page 22: Ceph Day New York 2014:  Future of CephFS

an example

● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
● ls -al
  – open
  – readdir
    – 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT

[Diagram: round trips (RT) from the client to ceph-mon, ceph-mds, and ceph-osd]

Page 23: Ceph Day New York 2014:  Future of CephFS

recursive accounting

● ceph-mds tracks recursive directory stats
  – file sizes
  – file and directory counts
  – modification time
● virtual xattrs present full stats (see the sketch below)
● efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
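These recursive stats are exposed to every client as virtual extended attributes on directories (ceph.dir.rbytes, ceph.dir.rfiles, ceph.dir.rsubdirs, ceph.dir.rctime). A minimal sketch reading them with Python's os.getxattr against a kernel-mounted tree; the mount point /mnt/cephfs is illustrative:

import os

# Directory under an illustrative CephFS kernel mount point.
path = "/mnt/cephfs/home"

# Virtual xattrs maintained recursively by ceph-mds; values come back as bytes.
for attr in ("ceph.dir.rbytes", "ceph.dir.rfiles",
             "ceph.dir.rsubdirs", "ceph.dir.rctime"):
    print(attr, "=", os.getxattr(path, attr).decode())

These counters are what make the directory sizes in the listing above meaningful: each directory reports the recursive byte count of everything beneath it.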

Page 24: Ceph Day New York 2014:  Future of CephFS

snapshots

● volume or subvolume snapshots unusable at petabyte scale
● snapshot arbitrary subdirectories
● simple interface
  – hidden '.snap' directory
  – no special tools

$ mkdir foo/.snap/one         # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776            # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one         # remove snapshot
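Since a snapshot is just a mkdir inside the hidden '.snap' directory, any client and any language can drive it; a small sketch with plain os calls, assuming an illustrative kernel mount at /mnt/cephfs:

import os

# Snapshots are ordinary directory operations under the hidden .snap directory.
snapdir = "/mnt/cephfs/foo/.snap"          # illustrative mount point and subtree

os.mkdir(os.path.join(snapdir, "one"))     # take a snapshot of foo/
print(os.listdir(snapdir))                 # ['one']
os.rmdir(os.path.join(snapdir, "one"))     # drop the snapshot again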

Page 25: Ceph Day New York 2014:  Future of CephFS

multiple client implementations

● Linux kernel client
  – mount -t ceph 1.2.3.4:/ /mnt
  – export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  – your app (see the sketch below)
  – Samba (CIFS)
  – Ganesha (NFS)
  – Hadoop (map/reduce)

[Diagram: the kernel client and ceph-fuse on one side; libcephfs linked into your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop on the other]
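As a concrete illustration of the libcephfs path, the python-cephfs bindings wrap libcephfs.so, so an application can reach the filesystem with no kernel or FUSE mount at all. A sketch assuming those bindings are installed and that /etc/ceph/ceph.conf points at the cluster; treat the exact signatures as approximate:

import os
import cephfs  # python-cephfs, a thin wrapper around libcephfs.so

fs = cephfs.LibCephFS(conffile="/etc/ceph/ceph.conf")   # illustrative config path
fs.mount()                      # talks to ceph-mon/ceph-mds directly

fs.mkdir("/demo", 0o755)
fd = fs.open("/demo/hello.txt", os.O_CREAT | os.O_WRONLY, 0o644)
fs.write(fd, b"written via libcephfs\n", 0)
fs.close(fd)

fs.unmount()
fs.shutdown()

Samba, Ganesha, and the Hadoop bindings in the diagram sit on this same library.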

Page 26: Ceph Day New York 2014:  Future of CephFS

Recent work

● Groundwork for hosting multiple file systems
● Admin-ability
  – Improved health reporting, logging
  – cephfs-journal-tool, improved journal flexibility
  – Ability to query, manipulate cluster state
● Consistency checking
  – forward scrub: verify correctness / integrity
  – backward scrub: recovery from corruption, data loss
● Performance

Page 27: Ceph Day New York 2014:  Future of CephFS
Page 28: Ceph Day New York 2014:  Future of CephFS

Dog food

● Using CephFS for internal build/test lab
  – 80 TB (80 x 1 TB HDDs, 10 hosts)
  – Old, crummy hardware with lots of failures
  – Linux kernel clients (ceph.ko, bleeding edge kernels)
● Lots of good lessons
  – Several kernel bugs found
  – Recovery performance issues
  – Lots of painful admin processes identified
  – Several fat fingers, facepalms

Page 29: Ceph Day New York 2014:  Future of CephFS

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

[Stack diagram, revisited: RADOS, LIBRADOS, RBD, and RADOSGW labeled AWESOME; CEPH FS labeled NEARLY AWESOME]

Page 30: Ceph Day New York 2014:  Future of CephFS

Path forward

● Testing
  – Various workloads
  – Multiple active MDSs
● Test automation
  – Simple workload generator scripts
  – Bug reproducers
● Hacking
  – Bug squashing
  – Long-tail features
● Integrations
  – Ganesha, Samba, *stacks

Page 31: Ceph Day New York 2014:  Future of CephFS
Page 32: Ceph Day New York 2014:  Future of CephFS

librados

Page 33: Ceph Day New York 2014:  Future of CephFS

object model

● pools
  – 1s to 100s
  – independent namespaces or object collections
  – replication level, placement policy
● objects
  – bazillions
  – blob of data (bytes to gigabytes)
  – attributes (e.g., “version=12”; bytes to kilobytes)
  – key/value bundle (bytes to gigabytes)
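A minimal sketch of this object model through the Python librados bindings (python-rados); the conffile path and the pool name demo-pool are illustrative, and the pool is assumed to already exist:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative config path
cluster.connect()

ioctx = cluster.open_ioctx("demo-pool")        # a pool: its own namespace and policy

ioctx.write_full("greeting", b"hello rados")   # object byte payload (the data blob)
ioctx.set_xattr("greeting", "version", b"12")  # small attribute alongside the data

print(ioctx.read("greeting"))                  # b'hello rados'
print(ioctx.get_xattr("greeting", "version"))  # b'12'

ioctx.close()
cluster.shutdown()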

Page 34: Ceph Day New York 2014:  Future of CephFS

atomic transactions

● client operations sent to the OSD cluster
  – operate on a single object
  – can contain a sequence of operations, e.g.
    – truncate object
    – write new object data
    – set attribute
● atomicity
  – all operations commit or do not commit atomically
● conditional
  – 'guard' operations can control whether the operation is performed
    – verify xattr has specific value
    – assert object is a specific version
  – allows atomic compare-and-swap etc. (see the sketch below)
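A hedged sketch of bundling several updates into one operation with python-rados; it assumes a bindings version whose WriteOpCtx exposes write_full() (the guard operations mentioned above live in the librados C API, e.g. rados_write_op_cmpxattr, and are not shown). Pool and object names are illustrative:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")

# Several updates to one object, applied by the OSD as a single atomic transaction.
with rados.WriteOpCtx() as op:
    op.write_full(b"new object body")            # replace the byte payload ...
    ioctx.set_omap(op, ("state",), (b"clean",))  # ... and update a key/value pair
    ioctx.operate_write_op(op, "greeting")       # everything commits, or nothing does

ioctx.close()
cluster.shutdown()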

Page 35: Ceph Day New York 2014:  Future of CephFS

key/value storage

● store key/value pairs in an object
  – independent from object attrs or byte data payload
● based on google's leveldb
  – efficient random and range insert/query/removal
  – based on BigTable SSTable design
● exposed via key/value API
  – insert, update, remove
  – individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
  – e.g., file system directory objects (see the sketch below)
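A sketch of that key/value API using python-rados omap calls: store a few directory-entry-like pairs in one object, then read them back as an ordered range. Names are illustrative, and the WriteOpCtx/ReadOpCtx usage assumes a reasonably recent bindings version:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")

# Insert key/value pairs into the object's omap (separate from xattrs and byte data).
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ("bar", "baz", "foo"),
                       (b"inode 100", b"inode 101", b"inode 102"))
    ioctx.operate_write_op(op, "dir.1")

# Range query: up to 100 keys, starting from the beginning, no prefix filter.
with rados.ReadOpCtx() as op:
    entries, ret = ioctx.get_omap_vals(op, "", "", 100)
    ioctx.operate_read_op(op, "dir.1")
    for key, value in entries:
        print(key, value)

ioctx.close()
cluster.shutdown()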

Page 36: Ceph Day New York 2014:  Future of CephFS

watch/notify

● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps session to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) is distributed to all watchers
  – variable timeout
  – notification on completion
    – all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
  – locking, distributed coordination (ala ZooKeeper), etc.
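The librados C API exposes this as rados_watch2 / rados_notify2; newer python-rados builds (which postdate this talk) wrap it as Ioctx.watch() and Ioctx.notify(). A hedged sketch assuming such a build; the callback signature in particular is an assumption:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")
ioctx.write_full("channel", b"")              # any object can act as the channel

def on_notify(notify_id, notifier_id, watch_id, data):
    # Runs for each notify delivered on "channel"; e.g. invalidate a local cache here.
    print("notify received:", data)

watch = ioctx.watch("channel", on_notify)     # register persistent interest
ioctx.notify("channel", "please invalidate")  # returns once watchers ack or time out

watch.close()
ioctx.close()
cluster.shutdown()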

Page 37: Ceph Day New York 2014:  Future of CephFS

[Sequence diagram: clients #1, #2, and #3 each watch an object on an OSD (watch, ack/commit); a notify is fanned out to all watchers, each acks, and the notifier receives a final completion]

Page 38: Ceph Day New York 2014:  Future of CephFS

watch/notify example

● radosgw cache consistency
  – radosgw instances watch a single object (.rgw/notify)
  – locally cache bucket metadata
  – on bucket metadata changes (removal, ACL changes)
    – write change to relevant bucket object
    – send notify with bucket name to other radosgw instances
  – on receipt of notify
    – invalidate relevant portion of cache

Page 39: Ceph Day New York 2014:  Future of CephFS

rados classes

● dynamically loaded .so
  – /var/lib/rados-classes/*
  – implement new object “methods” using existing methods
  – part of I/O pipeline
  – simple internal API
● reads
  – can call existing native or class methods
  – do whatever processing is appropriate
  – return data
● writes
  – can call existing native or class methods
  – do whatever processing is appropriate
  – generates a resulting transaction to be applied atomically
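From the client side a class method is invoked like any other object operation; python-rados exposes this as Ioctx.execute(). A sketch calling the say_hello method of the example 'hello' class that ships with Ceph; if your build lacks that class, read the class and method names as placeholders:

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # illustrative
cluster.connect()
ioctx = cluster.open_ioctx("demo-pool")
ioctx.write_full("greeting", b"")   # the method runs on the OSD that holds this object

# Ioctx.execute(object, class, method, input) returns (retval, method output).
ret, out = ioctx.execute("greeting", "hello", "say_hello", b"New York")
print(out)                          # e.g. b'Hello, New York!'

ioctx.close()
cluster.shutdown()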

Page 40: Ceph Day New York 2014:  Future of CephFS

class examples

● grep
  – read an object, filter out individual records, and return those
● sha1
  – read object, generate fingerprint, return that
● images
  – rotate, resize, crop image stored in object
  – remove red-eye
● crypto
  – encrypt/decrypt object data with provided key

Page 41: Ceph Day New York 2014:  Future of CephFS

ideas

● distributed key/value table
  – aggregate many k/v objects into one big 'table'
  – working prototype exists (thanks, Eleanor!)

Page 42: Ceph Day New York 2014:  Future of CephFS

ideas

● lua rados class
  – embed lua interpreter in a rados class
  – ship semi-arbitrary code for operations
● json class
  – parse, manipulate json structures

Page 43: Ceph Day New York 2014:  Future of CephFS

ideas

● rados mailbox (RMB?)
  – plug librados backend into dovecot, postfix, etc.
  – key/value object for each mailbox
    – key = message id
    – value = headers
  – object for each message or attachment
  – watch/notify for delivery notification

Page 44: Ceph Day New York 2014:  Future of CephFS
Page 45: Ceph Day New York 2014:  Future of CephFS

hard links?

● rare
● useful locality properties
  – intra-directory
  – parallel inter-directory
● on miss, file objects provide per-file backpointers
  – degenerates to log(n) lookups
  – optimistic read complexity