![Page 1: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/1.jpg)
ceph – a unified distributed storage system
sage weil
VAC – september 27, 2012
![Page 2: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/2.jpg)
outline
● why you should care
● what is it, what it does
● how it works
  ● architecture
● how you can use it
  ● librados
  ● radosgw
  ● RBD
  ● file system
● who we are, why we do this
![Page 3: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/3.jpg)
why should you care about another storage system?
![Page 4: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/4.jpg)
requirements
● diverse storage needs
  ● object storage
  ● block devices (for VMs) with snapshots, cloning
  ● shared file system with POSIX, coherent caches
  ● structured data... files, block devices, or objects?
● scale
  ● terabytes, petabytes, exabytes
  ● heterogeneous hardware
  ● reliability and fault tolerance
![Page 5: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/5.jpg)
time
● ease of administration
● no manual data migration, load balancing
● painless scaling
  ● expansion and contraction
  ● seamless migration
![Page 6: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/6.jpg)
cost
● linear function of size or performance
● incremental expansion
  ● no fork-lift upgrades
● no vendor lock-in
  ● choice of hardware
  ● choice of software
● open
![Page 7: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/7.jpg)
what is ceph?
![Page 8: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/8.jpg)
unified storage system
● objects
  ● native
  ● RESTful
● block
  ● thin provisioning, snapshots, cloning
● file
  ● strong consistency, snapshots
![Page 9: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/9.jpg)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
[diagram: APP, APP, HOST/VM, and CLIENT sit on top of these interfaces]
![Page 10: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/10.jpg)
open source
● LGPLv2
  ● copyleft
  ● ok to link to proprietary code
● no copyright assignment
  ● no dual licensing
  ● no “enterprise-only” feature set
● active community
● commercial support
![Page 11: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/11.jpg)
distributed storage system
● data center scale
  ● 10s to 10,000s of machines
  ● terabytes to exabytes
● fault tolerant
  ● no single point of failure
  ● commodity hardware
● self-managing, self-healing
![Page 12: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/12.jpg)
ceph object model
● pools
  ● 1s to 100s
  ● independent namespaces or object collections
  ● replication level, placement policy
● objects
  ● bazillions
  ● blob of data (bytes to gigabytes)
  ● attributes (e.g., “version=12”; bytes to kilobytes)
  ● key/value bundle (bytes to gigabytes)
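The object model above can be pictured with a small in-memory sketch. This is only an illustration of the shape of the data, not Ceph code: the class names `Pool` and `RadosObject` and the field names `xattrs` and `omap` are assumptions chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RadosObject:
    """One object: a byte blob, small attributes, and a key/value bundle."""
    name: str
    data: bytes = b""
    xattrs: dict = field(default_factory=dict)  # e.g. {"version": b"12"}
    omap: dict = field(default_factory=dict)    # key/value bundle


@dataclass
class Pool:
    """An independent namespace with its own replication level."""
    name: str
    replicas: int
    objects: dict = field(default_factory=dict)

    def put(self, obj):
        self.objects[obj.name] = obj

    def get(self, name):
        return self.objects[name]


# Two pools hold an object with the same name without colliding.
images = Pool("images", replicas=3)
logs = Pool("logs", replicas=2)
images.put(RadosObject("a", b"image bytes", xattrs={"version": b"12"}))
logs.put(RadosObject("a", b"log bytes"))
```

Each pool carries its own replication level, and object names only need to be unique within a pool.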
![Page 13: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/13.jpg)
why start with objects?
● more useful than (disk) blocks
  ● names in a single flat namespace
  ● variable size
  ● simple API with rich semantics
● more scalable than files
  ● no hard-to-distribute hierarchy
  ● update semantics do not span objects
  ● workload is trivially parallel
![Page 14: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/14.jpg)
[diagram: one HUMAN, one COMPUTER, several DISKs]
![Page 15: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/15.jpg)
[diagram: several HUMANs, one COMPUTER, several DISKs]
![Page 16: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/16.jpg)
[diagram: many HUMANs, many DISKs, one (COMPUTER)]
(actually more like this…)
![Page 17: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/17.jpg)
[diagram: a few HUMANs, many COMPUTERs, many DISKs]
![Page 18: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/18.jpg)
[diagram: each OSD sits on a local file system (btrfs, xfs, or ext4) on its own DISK; three monitors (M) alongside]
![Page 19: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/19.jpg)
Monitors:
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients
Object Storage Daemons (OSDs):
• At least three in a cluster
• One per disk or RAID group
• Serve stored objects to clients
• Intelligently peer to perform replication tasks
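The "small, odd number" advice for monitors follows from majority-based consensus arithmetic. A minimal sketch, assuming a simple strict-majority quorum rule (the function names are made up for this example):

```python
def quorum_size(monitors):
    """Smallest number of monitors that forms a strict majority."""
    return monitors // 2 + 1


def tolerated_failures(monitors):
    """How many monitors can fail while a majority still remains."""
    return monitors - quorum_size(monitors)


for count in (1, 3, 4, 5):
    print(count, "monitors: quorum", quorum_size(count),
          "tolerates", tolerated_failures(count), "failure(s)")
```

Going from three to four monitors raises the quorum size without tolerating any additional failure, which is why an even count buys nothing.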
![Page 20: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/20.jpg)
[diagram: three monitors (M) and a HUMAN]
![Page 21: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/21.jpg)
data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  ● ceph-osds on hosts in racks in rows in data centers
● three approaches
  ● pick a spot; remember where you put it
  ● pick a spot; write down where you put it
  ● calculate where to put it, where to find it
![Page 22: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/22.jpg)
CRUSH
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Ensures even distribution
• Stable mapping
• Limited data migration
• Rule-based configuration
• specifiable replication
• infrastructure topology aware
• allows weighting
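The "stable mapping, limited data migration" and "allows weighting" properties can be illustrated with weighted rendezvous (highest-random-weight) hashing. This is a swapped-in technique, not the CRUSH algorithm itself, and it ignores infrastructure topology. The OSD names and weights are assumptions for the example.

```python
import hashlib
import math


def score(pg, osd, weight):
    """Deterministic pseudo-random score; larger weights tend to score higher."""
    digest = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    k = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    u = (k + 0.5) / 2**52                        # strictly inside (0, 1)
    return -weight / math.log(u)


def place(pg, osds, replicas=3):
    """Pick the `replicas` highest-scoring OSDs: a calculation, not a lookup."""
    ranked = sorted(osds, key=lambda osd: score(pg, osd, osds[osd]), reverse=True)
    return ranked[:replicas]


osds = {f"osd.{i}": 1.0 for i in range(5)}
bigger = dict(osds)
bigger["osd.5"] = 2.0  # add one new, larger device

before = {pg: place(pg, osds) for pg in range(64)}
after = {pg: place(pg, bigger) for pg in range(64)}
moved = sum(1 for pg in before if before[pg] != after[pg])
print(moved, "of", len(before), "placement groups changed their OSD set")
```

When a device is added, a placement group either keeps its OSD set or swaps one member for the new device. Nothing shuffles between the existing devices.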
![Page 23: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/23.jpg)
[diagram: objects are grouped into placement groups, which are then placed on OSDs]
hash(object name) % num pg
CRUSH(pg, cluster state, policy)
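The two steps above can be sketched end to end. The hash function and the `crush_stand_in` ranking are assumptions for illustration (a generic SHA-256 ranking stands in for CRUSH). Here "cluster state" is reduced to an up/down flag per OSD and "policy" to a replica count.

```python
import hashlib


def stable_hash(text):
    """Hash that gives the same result on every client and every run."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_name, num_pg):
    # Step 1: hash(object name) % num pg
    return stable_hash(object_name) % num_pg


def crush_stand_in(pg, cluster_state, replicas):
    # Step 2: stand-in for CRUSH(pg, cluster state, policy).
    up = [osd for osd, is_up in cluster_state.items() if is_up]
    return sorted(up, key=lambda osd: stable_hash(f"{pg}:{osd}"))[:replicas]


def locate(object_name, num_pg, cluster_state, replicas=3):
    pg = object_to_pg(object_name, num_pg)
    return pg, crush_stand_in(pg, cluster_state, replicas)


cluster_state = {"osd.0": True, "osd.1": True, "osd.2": False,
                 "osd.3": True, "osd.4": True}
pg, osds = locate("my-object", num_pg=128, cluster_state=cluster_state)
print("placement group", pg, "->", osds)
```

Any client holding the same cluster state computes the same answer, so no central lookup table is consulted.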
![Page 24: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/24.jpg)
![Page 25: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/25.jpg)
RADOS
● monitors publish osd map that describes cluster state
  ● ceph-osd node status (up/down, weight, IP)
  ● CRUSH function specifying desired data distribution
● object storage daemons (OSDs)
  ● safely replicate and store objects
  ● migrate data as the cluster changes over time
  ● coordinate based on shared view of reality
● decentralized, distributed approach allows
  ● massive scales (10,000s of servers or more)
  ● the illusion of a single copy with consistent behavior
![Page 26: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/26.jpg)
CLIENT
??
![Page 27: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/27.jpg)
![Page 28: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/28.jpg)
![Page 29: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/29.jpg)
CLIENT
??
![Page 30: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/30.jpg)
[architecture overview repeated: RADOS, LIBRADOS, RBD, CEPH FS, RADOSGW; next up: LIBRADOS]
![Page 31: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/31.jpg)
[diagram: an APP linked with LIBRADOS talks natively to the cluster and its monitors (M)]
![Page 32: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/32.jpg)
LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead
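A minimal sketch of direct access through the Python binding follows. It assumes the `rados` module (python-rados) is installed, a reachable cluster, and a readable config file. The pool name, object name, and config path are placeholders, and the function `store_and_fetch` is a name made up for this example.

```python
def store_and_fetch(pool, name, payload, conffile="/etc/ceph/ceph.conf"):
    """Write one object with an attribute, then read it back."""
    # Imported here so this file still loads where python-rados is absent.
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(name, payload)          # replace the whole object
            ioctx.set_xattr(name, "version", b"12")  # small attribute
            # read() returns at most 8192 bytes unless a length is passed
            return ioctx.read(name)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call such as `store_and_fetch("data", "greeting", b"hello")` needs a live cluster, so treat this as a shape to adapt rather than something to paste and run.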
![Page 33: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/33.jpg)
[architecture overview repeated: RADOS, LIBRADOS, RBD, CEPH FS, RADOSGW; next up: RADOSGW]
![Page 34: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/34.jpg)
[diagram: an APP speaks REST to RADOSGW, which uses LIBRADOS to talk natively to the cluster and its monitors (M)]
![Page 35: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/35.jpg)
RADOS Gateway:
• REST-based interface to RADOS
• Supports buckets, accounting
• Compatible with S3 and Swift applications
![Page 36: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/36.jpg)
[architecture overview repeated: RADOS, LIBRADOS, RBD, CEPH FS, RADOSGW; next up: RBD]
![Page 37: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/37.jpg)
[diagram: many COMPUTERs and many DISKs]
![Page 38: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/38.jpg)
[diagram: many COMPUTERs and many DISKs, now with VMs]
![Page 39: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/39.jpg)
[diagram: a VM in a VIRTUALIZATION CONTAINER that uses LIBRBD on top of LIBRADOS to reach the cluster and its monitors (M)]
![Page 40: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/40.jpg)
[diagram: two CONTAINERs, each with LIBRBD on top of LIBRADOS, sharing one cluster and its monitors (M); one VM]
![Page 41: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/41.jpg)
[diagram: a HOST with KRBD (KERNEL MODULE) and LIBRADOS, talking to the cluster and its monitors (M)]
![Page 42: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/42.jpg)
RADOS Block Device:
• Storage of virtual disks in RADOS
• Decouples VMs and containers
  • Live migration!
• Images are striped across the cluster
• Snapshots!
• Support in
  • Qemu/KVM
  • OpenStack, CloudStack
  • Mainline Linux kernel
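"Images are striped across the cluster" means a virtual disk is cut into many objects, so different byte ranges land on different OSDs. A rough sketch of the offset arithmetic, assuming simple fixed-size chunks (the 4 MiB `OBJECT_SIZE` is an assumption of this example, as are the function names):

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed chunk size for this sketch


def locate(offset):
    """Map a byte offset in the image to (object index, offset in object)."""
    return offset // OBJECT_SIZE, offset % OBJECT_SIZE


def split_io(offset, length):
    """Break one image I/O into per-object pieces."""
    pieces = []
    while length > 0:
        index, inner = locate(offset)
        chunk = min(length, OBJECT_SIZE - inner)  # stop at the object boundary
        pieces.append((index, inner, chunk))
        offset += chunk
        length -= chunk
    return pieces


# A 30-byte write that straddles the boundary between objects 0 and 1.
print(split_io(OBJECT_SIZE - 10, 30))
```

Each piece can then be sent to whichever OSDs hold that object, so one large I/O fans out across the cluster.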
![Page 43: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/43.jpg)
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
![Page 44: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/44.jpg)
144 + 0 + 0 + 0 + 0 = 144
instant copy
![Page 45: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/45.jpg)
[diagram: a CLIENT issues four writes to the copy]
144 + 4 = 148
![Page 46: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/46.jpg)
[diagram: a CLIENT issues reads through the copy]
144 + 4 = 148
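The 144 / 148 arithmetic on these slides can be reproduced with a toy copy-on-write layering model. This is a sketch and not librbd: the `Image` class is made up for the example, and writes replace whole objects for simplicity.

```python
class Image:
    """Toy layered image: a clone stores only the objects it has written."""

    def __init__(self, objects=None, parent=None):
        self.objects = dict(objects or {})
        self.parent = parent

    def clone(self):
        # Instant copy: no data is duplicated, only a parent link is kept.
        return Image(parent=self)

    def write(self, index, data):
        self.objects[index] = data

    def read(self, index):
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)  # fall through to the parent
        return None


parent = Image({i: b"base" for i in range(144)})
child = parent.clone()
objects_after_clone = len(parent.objects) + len(child.objects)  # 144 + 0

for i in range(4):
    child.write(i, b"new")
objects_after_writes = len(parent.objects) + len(child.objects)  # 144 + 4
```

Reads of untouched regions are served from the parent, which is why a thousand clones of one golden image cost almost nothing until they diverge.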
![Page 47: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/47.jpg)
[architecture overview repeated: RADOS, LIBRADOS, RBD, CEPH FS, RADOSGW; next up: CEPH FS]
![Page 48: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/48.jpg)
[diagram: a CLIENT with separate metadata and data paths into the cluster; monitors (M) alongside]
![Page 49: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/49.jpg)
[diagram: cluster with three monitors (M)]
![Page 50: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/50.jpg)
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
![Page 51: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/51.jpg)
one tree
three metadata servers
??
![Page 52: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/52.jpg)
![Page 53: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/53.jpg)
![Page 54: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/54.jpg)
![Page 55: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/55.jpg)
![Page 56: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/56.jpg)
DYNAMIC SUBTREE PARTITIONING
![Page 57: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/57.jpg)
recursive accounting
● ceph-mds tracks recursive directory stats
  ● file sizes
  ● file and directory counts
  ● modification time
● virtual xattrs present full stats
● efficient

    $ ls -alSh | head
    total 0
    drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
    drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
    drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
    drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
    drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
    drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
    drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
    drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
    drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
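Recursive accounting means each directory carries totals for everything beneath it, so a listing like the one above needs no tree walk. A toy sketch of the bookkeeping: the `Dir` class and its `rbytes`, `rfiles`, and `rsubdirs` fields are names chosen for this example, and the totals are propagated eagerly for simplicity, which is not a claim about how ceph-mds does it.

```python
class Dir:
    """Directory that keeps running totals for its whole subtree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.rbytes = 0    # total file bytes below this directory
        self.rfiles = 0    # total files below this directory
        self.rsubdirs = 0  # total directories below this directory
        if parent is not None:
            parent._bump(subdirs=1)

    def _bump(self, nbytes=0, files=0, subdirs=0):
        # Walk up to the root, updating every ancestor's totals.
        node = self
        while node is not None:
            node.rbytes += nbytes
            node.rfiles += files
            node.rsubdirs += subdirs
            node = node.parent

    def add_file(self, size):
        self._bump(nbytes=size, files=1)


root = Dir("/")
a = Dir("a", root)
b = Dir("b", a)
b.add_file(100)
a.add_file(5)
print(root.rbytes, root.rfiles, root.rsubdirs)
```

Reading a directory's size is then a single field lookup, regardless of how deep the tree is.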
![Page 58: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/58.jpg)
snapshots
● volume or subvolume snapshots unusable at petabyte scale
  ● snapshot arbitrary subdirectories
● simple interface
  ● hidden '.snap' directory
  ● no special tools

    $ mkdir foo/.snap/one      # create snapshot
    $ ls foo/.snap
    one
    $ ls foo/bar/.snap
    _one_1099511627776         # parent's snap name is mangled
    $ rm foo/myfile
    $ ls -F foo
    bar/
    $ ls -F foo/.snap/one
    myfile bar/
    $ rmdir foo/.snap/one      # remove snapshot
![Page 59: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/59.jpg)
multiple protocols, implementations
● Linux kernel client
  ● mount -t ceph 1.2.3.4:/ /mnt
  ● export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  ● your app
  ● Samba (CIFS)
  ● Ganesha (NFS)
  ● Hadoop (map/reduce)
[diagram: the kernel client next to ceph-fuse, your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop, each of the latter built on libcephfs]
![Page 60: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/60.jpg)
[architecture overview repeated with status labels: CEPH FS is NEARLY AWESOME; RADOS, LIBRADOS, RBD, and RADOSGW are AWESOME]
![Page 61: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/61.jpg)
why we do this
● limited options for scalable open source storage
● proprietary solutions
  ● expensive
  ● don't scale (well or out)
  ● marry hardware and software
● users hungry for alternatives
  ● scalability
  ● cost
  ● features
![Page 62: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/62.jpg)
who we are
● Ceph created at UC Santa Cruz (2007)
● supported by DreamHost (2008-2011)
● Inktank (2012)
  ● Los Angeles, Sunnyvale, San Francisco, remote
● growing user and developer community
  ● Linux distros, users, cloud stacks, SIs, OEMs
http://ceph.com/
![Page 65: ceph – a unified distributed storage system...RADOS monitors publish osd map that describes cluster state ceph-osd node status (up/down, weight, IP) CRUSH function specifying desired](https://reader035.vdocument.in/reader035/viewer/2022081617/60852aef1bd2d67cc22eb8d1/html5/thumbnails/65.jpg)
why we like btrfs
● pervasive checksumming
● snapshots, copy-on-write
● efficient metadata (xattrs)
● inline data for small files
● transparent compression
● integrated volume management
  ● software RAID, mirroring, error recovery
  ● SSD-aware
● online fsck
● active development community