Datacenter Storage with Ceph OSDC.de 2015
Agenda
● What is Ceph?
● How does Ceph store your data?
● Interfaces to Ceph: RBD, RGW, CephFS
● Latest development updates
What is Ceph?
● Highly available, resilient data store
● Free Software (LGPL)
● 10 years since inception
● Flexible object, block and filesystem interfaces
● Especially popular in private clouds as a VM image service and as an S3-compatible object storage service
A general purpose storage system
● You feed it commodity disks and Ethernet
● In return, it gives your apps a storage service
● It doesn't lose your data
● It doesn't need babysitting
● It's portable
Interfaces to storage
● OBJECT STORAGE (RGW): native API, S3 & Swift compatibility, multi-tenant, Keystone integration, geo-replication
● BLOCK STORAGE (RBD): snapshots, clones, OpenStack integration, Linux kernel and iSCSI access
● FILE SYSTEM (CephFS): POSIX, distributed metadata, Linux kernel client, CIFS/NFS and HDFS gateways
Ceph Architecture (how your data is stored)
Components
● RADOS: a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● LIBRADOS: a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RGW: a web services gateway for object storage, compatible with S3 and Swift
● RBD: a reliable, fully-distributed block device with cloud platform integration
● CEPHFS: a distributed file system with POSIX semantics and scale-out metadata management
RADOS: Reliable, Autonomous, Distributed Object Store
RADOS Components
OSDs:
● 10s to 10,000s in a cluster
● One per disk (or one per SSD, RAID group…)
● Serve stored objects to clients
● Intelligently peer for replication & recovery
Monitors:
● Maintain cluster membership and state
● Provide consensus for distributed decision-making
● Small, odd number
● Do not serve stored objects to clients
Object Storage Daemons
[Diagram: each OSD daemon stores its objects in a local filesystem (xfs, ext4, or btrfs) on its own disk; monitors (M) run alongside the OSDs]
RADOS Cluster
[Diagram: an application talks directly to the RADOS cluster, a collection of many OSDs plus a small number of monitors (M)]
Where do objects live?
[Diagram: an application holds an object; which OSDs in the cluster should store it?]
A Metadata Server?
[Diagram: one option is a lookup service: the application asks a metadata server for the object's location (1), then contacts the right OSD (2)]
Calculated placement
[Diagram: another option is static partitioning: OSDs own fixed name ranges (A-G, H-N, O-T, U-Z), and the application computes that object "F" belongs to the A-G range]
CRUSH: Dynamic data placement
● Pseudo-random placement algorithm
– Fast calculation, no lookup
– Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
– Limited data migration on change
● Rule-based configuration
– Infrastructure topology aware
– Adjustable replication
– Weighting
CRUSH: Replication
[Diagram: data entering the cluster is hashed by CRUSH, and each replica is written to a different OSD]
CRUSH: Topology-aware placement
[Diagram: replicas of each object are placed in separate failure domains, here RACK A and RACK B]
CRUSH is a quick calculation
[Diagram: the client runs CRUSH itself to locate an object's OSDs and talks to them directly; no lookup service sits in the data path]
CRUSH rules
● Simple language
● Specify which copies go where (across racks, servers, datacenters, disk types)

rule <rule name> {
    ruleset <ruleset id>
    type <replicated | erasure>
    step take <bucket>
    step [choose|chooseleaf] [firstn|indep] <N> <bucket type>
    step emit
}
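For instance, a rule that keeps each replica in a different rack could look like this (an illustrative sketch; the rule name and the "default" root bucket are assumptions about your CRUSH map):

# Sketch: one replica per rack
rule replicated_racks {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default                       # start at the root of the hierarchy
    step chooseleaf firstn 0 type rack      # firstn 0 = as many replicas as the pool's size
    step emit
}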
Pools and Placement Groups
● Trick: apply CRUSH placement to a fixed number of placement groups (PGs) instead of N objects
● Manage recovery/backfill at PG granularity: less per-object metadata
● Typically a few hundred PGs per OSD
● A pool is a logical collection of PGs, using a particular CRUSH rule
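The mapping is easy to inspect from the CLI (a sketch; pool and object names are invented):

ceph osd pool create demo 128        # a pool with 128 placement groups
ceph osd map demo someobject         # print the PG this object hashes to, and its OSDs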
Recovering from failures
● OSDs notice when their peers stop responding, and report this to the monitors
● Monitors decide that an OSD is now "down"
● Peers continue to serve data, but it's in a degraded state
● After some time, monitors mark the OSD "out"
● New peers are selected by CRUSH, and data is re-replicated across the whole cluster
● Faster than a RAID rebuild, because the load is shared
● Does not require administrator intervention (a maintenance sketch follows)
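When you do want to intervene, e.g. for a planned reboot, re-replication can be suppressed temporarily (a minimal sketch using the standard cluster flags):

ceph osd set noout       # down OSDs will not be marked "out" while the flag is set
# ...perform maintenance, let the OSDs rejoin...
ceph osd unset noout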
RADOS advanced features
● Not just puts and gets!
● More feature-rich than typical object stores
● Partial object updates & appends
● Key-value stores (OMAPs) within objects
● Copy-on-write snapshots
● Watch/notify for pushing events
● Extensible with Object Classes: perform arbitrary transactions on OSDs
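Several of these can be tried with the stock rados CLI (a sketch; pool, object, and key names are invented):

rados -p demo put obj1 ./data.bin           # whole-object write
rados -p demo append obj1 ./more.bin        # partial update: append to the object
rados -p demo setomapval obj1 colour blue   # key-value pair stored inside the object
rados -p demo listomapvals obj1             # list the object's OMAP entries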
Choosing hardware
● Cheap hardware mitigates the cost of replication
● OSD data journalling: a separate SSD is useful, approx. 1 SSD for every 4 OSD disks (see the ceph.conf sketch below)
● OSDs are more CPU/RAM intensive than legacy storage: approx. 8 or so per host
● Many cheaper servers beat a few expensive ones: distribute the load of rebalancing
● Consider your bandwidth/capacity ratio and your read/write ratio
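Journal placement is configured per OSD in ceph.conf (an illustrative snippet; the device path is made up):

[osd]
    osd journal size = 10240         # journal size in MB
[osd.0]
    osd journal = /dev/ssd0-part1    # journal on a dedicated SSD partition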
Interfaces to applications: RGW, RBD, and CephFS
RBD: Virtual disks in Ceph
RADOS Block Device:
● Storage of disk images in RADOS
● Decouples VMs from host
● Images are striped across the cluster (pool)
● Snapshots
● Copy-on-write clones
● Support in: mainline Linux kernel (2.6.39+), Qemu/KVM, OpenStack, CloudStack, OpenNebula, Proxmox
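A typical image lifecycle from the CLI (a sketch; pool and image names are invented):

rbd create demo/vm-disk --size 10240 --image-format 2   # 10 GiB image; format 2 enables cloning
rbd snap create demo/vm-disk@base                       # point-in-time snapshot
rbd snap protect demo/vm-disk@base                      # protect it so it can be cloned
rbd clone demo/vm-disk@base demo/vm-clone               # copy-on-write clone
rbd map demo/vm-disk                                    # attach via the mainline kernel module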
Storing virtual disks
[Diagram: a VM's hypervisor links LIBRBD, which stripes the virtual disk's data across the RADOS cluster]
RGW: HTTP object store
RADOSGW:
● REST-based object storage proxy
● Compatible with S3 and Swift applications
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
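Users and the usage accounting are managed with radosgw-admin (a sketch; the uid is invented):

radosgw-admin user create --uid=demo --display-name="Demo User"   # also generates S3 credentials
radosgw-admin usage show --uid=demo                               # per-user usage, for billing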
RADOS Gateway
[Diagram: applications speak REST to RADOSGW instances (via a web server socket), and each RADOSGW stores objects in the RADOS cluster through LIBRADOS]
Ceph Filesystem (CephFS)
POSIX-compliant shared filesystem
Client:
● Userspace (FUSE) or kernel
● Looks like a local filesystem
● Sends data directly to RADOS
Metadata server:
● Manages filesystem metadata:
– Directory hierarchy
– Inode metadata (owner, timestamps, mode)
● Stores metadata in RADOS
● Does not serve file data to clients
Storing Data and Metadata
[Diagram: a Linux host using the kernel module writes file data straight to the RADOS cluster, while metadata operations go through the MDS, which itself stores metadata in RADOS]
CephFS
● Advanced features:
– Subdirectory snapshots
– Recursive statistics (see the xattr example after this list)
– Multiple metadata servers
● Coming soon:
– Online consistency checking
– Scalable repair/recovery tools
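Recursive statistics, for example, surface as virtual extended attributes on directories (a sketch; the mount path is illustrative):

getfattr -n ceph.dir.rbytes /mnt/ceph/some/dir     # total bytes under this subtree
getfattr -n ceph.dir.rentries /mnt/ceph/some/dir   # total entries under this subtree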
CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data 128        # pg_num is required; 128 is an example
ceph osd pool create fs_metadata 128
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
Beyond Replication: Erasure Coding and Cache Tiering
Cache Tiering
● Use one Ceph pool as a cache for another:
– e.g. a flash cache pool in front of a spinning-disk pool
● Configurable policies for eviction based on capacity, object count, lifetime
● Configurable mode:
– writeback: all client I/O goes to the cache
– readonly: client writes go to the backing pool, reads come from the cache
Erasure Coding
● Split objects into k data chunks and m parity chunks, with configurable k and m
● An alternative to replication, providing a different set of tradeoffs:
– Consumes less storage capacity
– Consumes less write bandwidth
– Reads scattered across OSDs
– Modifications are expensive
● Plugin interface for encoding schemes
Cache Tiering + EC
[Diagram: client I/O enters the 3-replica pool 'cache' (replicas 1-3, 200% overhead); a writeback cache policy flushes to the erasure-coded pool 'cold', which stores five chunks per object, 3 data + 2 parity (66% overhead)]
Example: Cache Tiering & EC
# ceph osd erasure-code-profile set ecdemo k=3 m=2
# ceph osd pool create cold 384 384 erasure ecdemo
# ceph osd pool create cache 384 384
# ceph osd tier add cold cache
# ceph osd tier cache-mode cache writeback
# ceph osd tier set-overlay cold cache
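The cache pool's eviction limits are then expressed as pool properties (values here are arbitrary examples):

# ceph osd pool set cache target_max_bytes 100000000000
# ceph osd pool set cache target_max_objects 1000000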
What's new?
Ceph 0.94 "Hammer"
Release line: Emperor, Firefly, Giant, Hammer (0.94, current), Infernalis, Jewel (upcoming)
RADOS
● Performance:
– more IOPs
– exploit flash backends
– exploit many-cored machines
● CRUSH straw2 algorithm:
– reduced data migration on changes
● Cache tiering:
– read performance, reduce unnecessary promotions
RBD
● Object maps:
– per-image metadata, identifies which extents are allocated.
– optimisation for clone/export/delete.
● Mandatory locking:
– prevent multiple clients writing to same image
● Copy-on-read:
– improve performance for some workloads
RGW
● S3 object versioning API
– when enabled, all objects maintain history
– GET ID+version to see an old version
● Bucket sharding
– spread bucket index across multiple RADOS objects
– avoid oversized OMAPs
– avoid hotspots
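Sharding of new buckets is controlled by an RGW config option (an illustrative snippet; the client section name depends on your deployment):

[client.radosgw.gateway]
    rgw override bucket index max shards = 8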
CephFS
● Diagnostics & health checks
● Journal recovery tools
● Initial online metadata scrub
● Refined ENOSPC handling
● Soft client quotas
● General hardening and resilience
Finally...
Get involved
Evaluate the latest releases:
http://ceph.com/resources/downloads/
Mailing list, IRC:
http://ceph.com/resources/mailing-list-irc/
Bugs:
http://tracker.ceph.com/projects/ceph/issues
Online developer summits:
https://wiki.ceph.com/Planning/CDS
Ceph Days
Ceph Day Berlin is next Tuesday!
http://ceph.com/cephdays/ceph-day-berlin/
Axica Convention Center
April 28 2015