Let's talk about Ceph
TERENA TF-Storage Meeting – February 12th 2015 – University of Vienna
TRANSCRIPT
AGENDA

- First part (13:00 to 15:00): Introduction to Ceph; technical deep dive: why is Ceph so interesting and popular; Q&A about Ceph
- Coffee break
- Second part (15:30 to 17:30): Notable use cases, deployment options, management and monitoring
Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE FORECAST
By 2020 over 15 ZB of data will be stored; 1.5 ZB are stored today.
THE PROBLEM
- Existing systems don't scale
- Increasing cost and complexity
- Need to invest in new platforms ahead of time

(chart: growth of data outpacing the IT storage budget, 2010–2020)
THE SOLUTION
PAST: SCALE UP
FUTURE: SCALE OUT
EXAMPLES IN NATURE
DO YOU STILL REMEMBER?
STORAGE APPLIANCES
END OF RAID AS WE KNOW IT

- RAID: Redundant Array of Inexpensive Disks
- Enhanced reliability
  - RAID-1 mirroring
  - RAID-5/6 parity (reduced overhead)
  - Automated recovery
- Enhanced performance
  - RAID-0 striping
  - SAN interconnects
  - Enterprise SAS drives
  - Proprietary H/W RAID controllers
- Economical storage solutions
  - Software RAID implementations
  - iSCSI and JBODs
- Enhanced capacity
  - Logical volume concatenation
RAID CHALLENGES: CAPACITY/SPEED

- Storage economies in disks come from more GB per spindle
- NRE rates are flat (typically estimated at 10^-15 per bit)
  - ~4% chance of an NRE while recovering a 4+1 RAID-5 set, and it goes up with the number of volumes in the set; many RAID controllers fail the recovery after an NRE
- Access speed has not kept up with density increases
  - 27 hours to rebuild a 4+1 RAID-5 set at 20 MB/s, during which time a second drive can fail
- Managing the risk of second failures requires hot spares, defeating some of the savings from parity redundancy
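The NRE figure above can be sanity-checked with a small calculation. A sketch, under my own assumptions (a 4+1 RAID-5 set, an NRE rate of 10^-15 per bit, and independent bit errors), not a claim about any particular drive:

```python
# Probability of at least one non-recoverable read error (NRE) while
# rebuilding a 4+1 RAID-5 set: all four surviving drives must be read
# in full, so with independent errors P = 1 - (1 - p)^bits_read.

def nre_probability(surviving_drives: int, drive_tb: float, nre_per_bit: float = 1e-15) -> float:
    """Chance of at least one NRE during a full rebuild read."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8  # TB -> bits
    return 1.0 - (1.0 - nre_per_bit) ** bits_read

for tb in (1.0, 2.0, 4.0):
    p = nre_probability(surviving_drives=4, drive_tb=tb)
    print(f"{tb:.0f} TB drives: {p:.1%} chance of an NRE during rebuild")
```

At roughly 1 TB per drive this lands near the ~4% the slide quotes; at larger capacities the risk grows accordingly, which is the point being made.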
RAID CHALLENGES: EXPANSION

- The next generation of disks will be larger and cost less per GB; we would like to use these as we expand
- Most RAID replication schemes require identical disks, meaning new disks cannot be added to an old set and failed disks must be replaced with identical units
- Proprietary appliances may require replacements from the manufacturer (at much higher than commodity prices)
- Many storage systems reach a limit beyond which they cannot be further expanded (forcing a fork-lift upgrade)
- Re-balancing existing data over new volumes is non-trivial
RELIABILITY/AVAILABILITY

- RAID-5 can only survive a single disk failure
  - The odds of an NRE during recovery are significant
  - The odds of a second failure during recovery are non-negligible
  - Annual petabyte durability for RAID-5 is only 3 nines
- RAID-6 redundancy protects against two disk failures
  - The odds of an NRE during recovery are still significant
  - Client data access will be starved out during recovery
  - Throttling recovery increases the risk of data loss
- Even RAID-6 can't protect against:
  - Server failures
  - NIC failures
  - Switch failures
  - OS crashes
  - Facility or regional disasters
RAID CHALLENGES: EXPENSE

- Capital expenses... good RAID costs
  - Significant mark-up for enterprise hardware
  - High-performance RAID controllers can add $50-100 per disk
  - SANs increase costs further: expensive equipment, much of which is often poorly utilized
  - Software RAID is much less expensive, and much slower
- Operating expenses... RAID doesn't manage itself
  - RAID group, LUN and pool management
  - Lots of application-specific tunable parameters
  - Difficult expansion and migration
  - When a recovery goes bad, it goes very bad
  - Don't even think about putting off replacing a failed drive
WHAT IS CEPH?

- Ceph is a massively scalable, open source, software-defined storage system that runs on commodity hardware.
- Ceph delivers object, block, and file system storage in one self-managing, self-healing platform with no single point of failure.
- Ceph ideally replaces legacy storage and provides a single solution for the cloud.
CEPH UNIFIED STORAGE

- OBJECT STORAGE: multi-tenant S3 & Swift APIs, Keystone integration, geo-replication, erasure coding
- BLOCK STORAGE: OpenStack and Linux kernel clients, tiering, clones, snapshots
- FILE SYSTEM: CIFS/NFS and HDFS gateways, distributed metadata, Linux kernel client, POSIX semantics
TRADITIONAL STORAGE VS. CEPH

(diagram: traditional enterprise storage compared with Ceph)
VISION

"Run storage the way Google does"
THE CEPH PHILOSOPHY

- Failure is normal
- Scale out on commodity hardware
- Everything runs on software
- Solution must be 100% open source
HISTORICAL TIMELINE

- 2004: Project starts at UCSC
- 2006: Open sourced
- 2010: Mainline Linux kernel support
- 2011: OpenStack integration
- MAY 2012: Launch of Inktank
- 2012: CloudStack integration
- SEPT 2012: Production-ready Ceph
- 2013: Xen integration
- OCT 2013: Inktank Ceph Enterprise launch
- FEB 2014: RHEL-OSP certification
- APR 2014: Inktank acquired by Red Hat
STRONG & GROWING COMMUNITY

Community metrics dashboard: METRICS.CEPH.COM
ARCHITECTURE
ARCHITECTURAL COMPONENTS
RADOS COMPONENTS
OSDs:
- 10s to 10,000s in a cluster
- One per disk (or one per SSD, RAID group, ...)
- Serve stored objects to clients
- Intelligently peer for replication & recovery

Monitors:
- Maintain cluster membership and state
- Provide consensus for distributed decision-making
- Small, odd number
- Do not serve stored objects to clients
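Why a small, odd number of monitors? They reach consensus by strict-majority quorum, so adding a monitor to make the count even raises the quorum size without letting the cluster survive any more failures. A quick illustration of that arithmetic:

```python
# Majority-quorum arithmetic: with n monitors, a strict majority of
# n//2 + 1 must agree, so at most n - (n//2 + 1) monitors may fail.

def tolerated_failures(monitors: int) -> int:
    quorum = monitors // 2 + 1          # strict majority
    return monitors - quorum            # monitors that may be down

for n in range(1, 8):
    print(f"{n} monitors -> quorum {n // 2 + 1}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 monitors tolerate no more failures than 3, and 6 no more than 5, which is why even counts are avoided.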
OBJECT STORAGE DAEMONS
(OSDs store their objects on a local filesystem: btrfs, xfs, or ext4)
WHERE DO OBJECTS LIVE?
A METADATA SERVER?
CALCULATED PLACEMENT?
(diagram: objects statically hashed to servers by name ranges A-G, H-N, O-T, U-Z)
EVEN BETTER: CRUSH!
CRUSH: DYNAMIC DATA PLACEMENT
CRUSH:
- Pseudo-random placement algorithm
  - Fast calculation, no lookup
  - Repeatable, deterministic
- Statistically uniform distribution
- Stable mapping
  - Limited data migration on change
- Rule-based configuration
  - Infrastructure topology aware
  - Adjustable replication
  - Weighting
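The properties above can be illustrated with a toy stand-in. This is NOT the real CRUSH algorithm (it has no bucket hierarchy, rules, or weights); it is a rendezvous-hashing sketch that shows the key ideas: placement is computed rather than looked up, it is deterministic, and a topology change remaps only a small share of objects.

```python
# Toy computed placement via rendezvous (highest-random-weight) hashing:
# each OSD gets a score from hash(object, osd); the top-scoring OSDs
# hold the replicas. No lookup table exists anywhere.
import hashlib

def place(obj: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Rank OSDs by a hash of (object, osd); take the top `replicas`."""
    score = lambda osd: hashlib.sha256(f"{obj}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(10)]
assert place("myobject", osds) == place("myobject", osds)  # repeatable

# Adding an OSD remaps only the objects that now rank it in their top 3.
objects = [f"obj-{i}" for i in range(1000)]
before = {o: place(o, osds) for o in objects}
after = {o: place(o, osds + ["osd.10"]) for o in objects}
moved = sum(before[o] != after[o] for o in objects)
print(f"{moved} of {len(objects)} objects remapped after adding one OSD")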
DATA IS ORGANIZED INTO POOLS
(diagram: cluster pools containing placement groups - Pool A, Pool B, Pool C, Pool D)
ACCESSING A RADOS CLUSTER
LIBRADOS: RADOS ACCESS FOR APPS
LIBRADOS:
- Direct access to RADOS for applications
- C, C++, Python, PHP, Java, Erlang bindings
- Direct access to storage nodes
- No HTTP overhead
THE RADOS GATEWAY
RADOSGW MAKES RADOS WEBBY
RADOSGW:
- REST-based object storage proxy
- Uses RADOS to store objects
- API supports buckets and accounts
- Usage accounting for billing
- Compatible with S3 and Swift applications
RBD STORES VIRTUAL DISKS
RADOS BLOCK DEVICE:
- Storage of disk images in RADOS
- Decouples VMs from hosts
- Images are striped across the cluster (pool)
- Snapshots
- Copy-on-write clones
- Support in:
  - Mainline Linux kernel (2.6.39+)
  - Qemu/KVM; native Xen coming soon
  - OpenStack, CloudStack, Nebula, Proxmox
STORING VIRTUAL DISKS
SEPARATE COMPUTE FROM STORAGE
KERNEL MODULE FOR MAX FLEXIBILITY
SCALABLE METADATA SERVERS
METADATA SERVER:
- Manages metadata for a POSIX-compliant shared filesystem
  - Directory hierarchy
  - File metadata (owner, timestamps, mode, etc.)
- Stores metadata in RADOS
- Does not serve file data to clients
- Only required for the shared filesystem
SEPARATE METADATA SERVER
CEPH ADVANTAGE: DECLUSTERED PLACEMENT
- Consider a failed 2TB RAID mirror
  - We must copy 2TB from the survivor to the successor
  - Survivor and successor are likely in the same failure zone
- Consider 2TB of RADOS objects clustered on the same primary
  - Surviving copies are declustered (on different secondaries)
  - New copies will be declustered (on different successors)
  - Copy 10GB from each of 200 survivors to 200 successors
  - Survivors and successors are in different failure zones
- Benefits
  - Recovery is parallel and 200x faster
  - Service can continue during the recovery process
  - Exposure to second failures is reduced by 200x
  - Zone-aware placement protects against higher-level failures
  - Recovery is automatic and does not await new drives
  - No idle hot spares are required
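The "200x faster" claim above is simple arithmetic. A back-of-envelope sketch, assuming (my figure, not the slides') that each device sustains roughly 100 MB/s of recovery traffic:

```python
# Compare rebuilding a 2TB mirror through a single survivor/successor
# pair against 200 survivors each streaming ~10 GB in parallel.

DISK_MB_S = 100          # assumed per-device recovery throughput
failed_tb = 2.0

# RAID mirror: one survivor streams the whole 2 TB to one successor.
raid_hours = failed_tb * 1e6 / DISK_MB_S / 3600

# Declustered: 200 survivors each send ~10 GB, in parallel.
survivors = 200
per_node_gb = failed_tb * 1000 / survivors      # 10 GB each
ceph_hours = per_node_gb * 1e3 / DISK_MB_S / 3600

print(f"RAID mirror rebuild:  {raid_hours:.1f} h")
print(f"Declustered recovery: {ceph_hours * 60:.1f} min")
```

The ratio is exactly the declustering factor (200 here) regardless of the assumed throughput, since both paths are limited by the same per-device rate.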
CEPH ADVANTAGE: OBJECT GRANULARITY
- Consider a failed 2TB RAID mirror
  - To recover it we must read and write (at least) 2TB
  - The successor must be the same size as the failed volume
  - An error in recovery will probably lose the file system
- Consider a failed RADOS OSD
  - To recover it we must read and write thousands of objects
  - Successor OSDs must each have some free space
  - An error in recovery will probably lose one object
- Benefits
  - Heterogeneous commodity disks are easily supported
  - Better and more uniform space utilization
  - Per-object updates always preserve causality ordering
  - Object updates are more easily replicated over WAN links
  - Greatly reduced data loss if errors do occur
CEPH ADVANTAGE: INTELLIGENT STORAGE
- Intelligent OSDs automatically rebalance data
  - When new nodes are added
  - When old nodes fail or are decommissioned
  - When placement policies are changed
- The resulting rebalancing is very good:
  - Even distribution of data across all OSDs
  - Uniform mix of old and new data across all OSDs
  - Moves only as much data as required
- Intelligent OSDs continuously scrub their objects to detect and correct silent write errors before another failure
- This architecture scales from petabytes to exabytes
  - A single pool of thin-provisioned, self-managing storage
  - Serving a wide range of block, object, and file clients
QUESTIONS?
USING CEPH: USE CASES
DOES ONE SOLUTION FIT ALL?

- Remember this? "Ceph ideally replaces legacy storage and provides a single solution for the cloud."
- One simple rule: if the performance is not high enough, add more nodes...
CLOUD INFRASTRUCTURE WITH OPENSTACK

The most widely deployed[1] technology for OpenStack storage

Features:
- Full integration with Nova, Cinder & Glance
- Single storage for images and ephemeral & persistent volumes
- Copy-on-write provisioning
- Swift-compatible object storage gateway
- Full integration with Red Hat Enterprise Linux OpenStack Platform

Benefits:
- Provides both volume storage and object storage for tenant applications
- Reduces provisioning time for new virtual machines
- No data transfer of images between storage and compute nodes required
- Unified installation experience with Red Hat Enterprise Linux OpenStack Platform

[1] http://superuser.openstack.org/articles/openstack-user-survey-insights-november-2014
OPENSTACK SURVEY NOVEMBER 2014
source: http://superuser.openstack.org/articles/openstack-user-survey-insights-november-2014
CEPH AND OPENSTACK
WEB-SCALE OBJECT STORAGE FOR APPLICATIONS

S3-compatible, on-premise cloud storage with uncompromised cost and performance

Features:
- Compatibility with S3 APIs
- Fully-configurable replicated storage backend
- Pluggable erasure-coded storage backend
- Cache tiering pools
- Multi-site failover

Benefits:
- Supports a broad ecosystem of tools and applications built for the S3 API
- Provides a modern hot/warm/cold storage topology that offers cost-efficient performance
- Advanced durability at optimal price
- Delivers a foundation for the next generation of cloud-native applications using object storage
WEB APPLICATION STORAGE
MULTI-SITE OBJECT STORAGE
ARCHIVE / COLD STORAGE
(diagram: archive/cold storage replicated between Site A and Site B, one Ceph storage cluster per site)
WEBSCALE APPLICATIONS
DATABASES
WHAT ABOUT iSCSI?

- Why do people ask about iSCSI?
- There is currently NO supported iSCSI solution for Ceph
- We've tested a TGT proxy with mixed results
- The goal is a hardened iSCSI solution
CEPH FOR HADOOP

- CephFS can be used as a drop-in replacement for HDFS
- Overcomes limitations of HDFS
- Very experimental at this time
- Not supported
VMWARE

- Customers often ask about VMware compatibility
- We have tried via third-party iSCSI target software (stgt), but results were mixed
- Pre-acquisition, some large banks asked VMware to incorporate Ceph support into their hypervisor
- VMware has indicated they will evaluate incorporating Ceph support into their next major release (scheduled 2015)
- If you encounter companies that want this, please let them know they should request it from VMware
- We will be looking at VVOL support in a later version as well
CEPH ON DRIVES
CLOSER LOOK AT LIBRADOS

Object model:
- Pools
  - 1s to 100s per cluster
  - Independent namespaces for object collections
  - Replication level, placement policy
- Objects
  - Bazillions
  - Blob of data (bytes to gigabytes)
  - Attributes (e.g. "version=12")
  - Key/value bundle
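The object model above can be sketched as a toy in-memory structure. This is purely illustrative (it is not the librados API, and the pool/object names are made up); it just shows the three kinds of state every RADOS object carries:

```python
# Minimal in-memory model of the RADOS object abstraction: named pools
# holding objects, each with a data blob, attributes (xattrs), and a
# key/value bundle (omap).

class Obj:
    def __init__(self):
        self.data = b""        # byte payload (bytes to gigabytes)
        self.xattrs = {}       # small attributes, e.g. {"version": "12"}
        self.omap = {}         # key/value bundle

class Pool:
    """An independent namespace for a collection of objects."""
    def __init__(self, replicas: int = 3):
        self.replicas = replicas          # per-pool replication level
        self.objects: dict[str, Obj] = {}

    def write_full(self, name: str, data: bytes) -> Obj:
        obj = self.objects.setdefault(name, Obj())
        obj.data = data
        return obj

cluster = {"rbd": Pool(replicas=3), "archive": Pool(replicas=2)}
obj = cluster["rbd"].write_full("disk-image-0001", b"\x00" * 4096)
obj.xattrs["version"] = "12"
print(len(obj.data), obj.xattrs)
```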
ATOMIC TRANSACTIONS
- Client operations are sent to the OSD cluster
- Operate on a single object
- Can contain a sequence of operations, e.g.:
  - Truncate object
  - Write new object data
  - Set attribute
- Atomicity
  - All operations commit or do not commit atomically
- Conditional
  - 'Guard' operations can control whether the operation is performed
    - Verify an xattr has a specific value
    - Assert the object is a specific version
  - Allows atomic compare-and-swap, etc.
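The guarded, all-or-nothing behavior described above can be sketched in a few lines. This is toy code, not the librados transaction API: operations are staged on a copy of the object, a guard aborts the whole batch, and the copy is installed only if every operation succeeds.

```python
# Sketch of a guarded atomic transaction on a single object: all ops
# commit together or not at all, and guards can veto the whole batch.
import copy

class GuardFailed(Exception):
    pass

def apply_transaction(obj: dict, ops: list) -> bool:
    """Apply ops to a staged copy; install it only if every op succeeds."""
    staged = copy.deepcopy(obj)
    try:
        for op in ops:
            op(staged)          # a guard op raises GuardFailed to abort
    except GuardFailed:
        return False            # nothing was applied
    obj.clear()
    obj.update(staged)          # atomic install of the whole batch
    return True

def guard_xattr(key, expected):
    """Guard: proceed only if the xattr still has the expected value."""
    def op(o):
        if o["xattrs"].get(key) != expected:
            raise GuardFailed()
    return op

obj = {"data": b"old", "xattrs": {"version": "12"}}
# Compare-and-swap: only rewrite the object if version is still 12.
ok = apply_transaction(obj, [
    guard_xattr("version", "12"),
    lambda o: o.update(data=b"new"),
    lambda o: o["xattrs"].update(version="13"),
])
print(ok, obj)
```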
KEY/VALUE STORAGE

- Store key/value pairs in an object
  - Independent of object attrs and the byte-data payload
- Based on Google's LevelDB
  - Efficient random and range insert/query/removal
  - Based on the BigTable SSTable design
- Exposed via a key/value API
  - Insert, update, remove
  - Individual keys or ranges of keys
- Avoids the read/modify/write cycle for updating complex objects
  - e.g., file system directory objects
WATCH/NOTIFY

- Establish a stateful 'watch' on an object
  - Client interest is persistently registered with the object
  - Client keeps a session to the OSD open
- Send 'notify' messages to all watchers
  - The notify message (and payload) is distributed to all watchers
  - Variable timeout
  - Notification on completion: all watchers got and acknowledged the notify
- Use any object as a communication/synchronization channel
  - Locking, distributed coordination (à la ZooKeeper), etc.
![Page 82: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/82.jpg)
WATCH/NOTIFY
![Page 83: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/83.jpg)
WATCH/NOTIFY EXAMPLE
! radosgw cache consistency ! radosgw instances watch a single object (.rgw/notify) ! locally cache bucket metadata
! on bucket metadata changes (removal, ACL changes) ! write change to relevant bucket object ! send notify with bucket name to other radosgw instances
! on receipt of notify ! invalidate relevant portion of cache
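A minimal sketch of the radosgw pattern above, with invented class names: every gateway instance registers as a watcher, and a notify carrying the bucket name invalidates that bucket's cache entry everywhere.

```python
class Gateway:
    def __init__(self):
        self.metadata_cache = {}                 # bucket name -> metadata
    def on_notify(self, bucket):
        self.metadata_cache.pop(bucket, None)    # drop just that bucket

watchers = [Gateway(), Gateway()]                # two radosgw instances
for gw in watchers:
    gw.metadata_cache["photos"] = {"acl": "public"}

def notify(bucket):                  # stands in for the RADOS notify call
    for gw in watchers:              # delivered to every registered watcher
        gw.on_notify(bucket)

notify("photos")                     # e.g. after an ACL change
# every gateway's cached entry for "photos" is now invalidated
```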
![Page 84: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/84.jpg)
RADOS CLASSES
! dynamically loaded .so ! /var/lib/rados-classes/* ! implement new object “methods” using existing methods ! part of I/O pipeline ! simple internal API
![Page 85: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/85.jpg)
RADOS CLASSES
! reads ! can call existing native or class methods ! do whatever processing is appropriate ! return data
! writes ! can call existing native or class methods ! do whatever processing is appropriate ! generate a resulting transaction to be applied atomically
![Page 86: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/86.jpg)
CLASS EXAMPLES ! Grep
! read an object, filter out individual records, and return those
! Sha1 ! read object, generate fingerprint, return that
! Images ! rotate, resize, crop image stored in object ! remove red-eye
! crypto ! encrypt/decrypt object data with provided key
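The Sha1 class example boils down to computing a digest server-side and returning the fingerprint instead of the object data; a sketch of the computation it would perform:

```python
import hashlib

def sha1_method(object_data: bytes) -> str:
    # what a server-side "sha1" class method would return to the client,
    # instead of shipping the whole object payload over the network
    return hashlib.sha1(object_data).hexdigest()

fingerprint = sha1_method(b"hello ceph")   # 40 hex characters
```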
![Page 87: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/87.jpg)
IDEAS
! rados mailbox (RMB?) ! plug librados backend into dovecot, postfix, etc. ! key/value object for each mailbox ! key = message id ! value = headers
! object for each message or attachment ! watch/notify for delivery notification
![Page 88: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/88.jpg)
IDEAS ! distributed key/value table
! aggregate many k/v objects into one big 'table' ! working prototype exists
! lua rados class ! embed lua interpreter in a rados class ! ship semi-arbitrary code for operations
! json class ! parse, manipulate json structures
![Page 89: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/89.jpg)
● RBD is a relatively thin layer on top of RADOS ● many RBD problems are RADOS problems ● Most RBD-specific issues are configuration/integration with other software (libvirt, OpenStack, CloudStack, QEMU, etc.)
● Most of these issues are during initial setup
CLOSER LOOK AT RBD
![Page 90: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/90.jpg)
To connect, need: ● monitor addresses ● client name (--name client.foo, or --id foo) ● secret key (from keyring, plain key file, or command line)
Capabilities needed after connecting: ● mon 'allow r' (every client) ● osd 'allow rwx pool foo' (RBD rw access to images in pool foo) ● osd 'allow r class-read pool bar' (RBD ro access to pool bar) ● osd 'allow class-read object_prefix rbd_children' o for RBD format 2 snapshot unprotection ● ceph-authtool man page has more
RADOS AUTH
![Page 91: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/91.jpg)
● hang connecting o is ceph.conf being read for the monitor addresses? o correct keyring being read? o are monitors up and accessible from this machine? o --debug-ms 1 will show if it's getting any response o strace useful for checking the correct files are being read
● error connecting o permission denied ! is keyring and ceph.conf readable? ! are capabilities correct? (ceph auth list) ! is the right client name being used?
COMMON RADOS CLIENT PROBLEMS
![Page 92: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/92.jpg)
Format 2 is not supported yet: $ sudo modprobe rbd $ rbd create -s 20 --format 2 foo $ sudo rbd map foo rbd: add failed: (2) No such file or directory $ dmesg | tail [17246768.484421] libceph: client0 fsid c78b75ac-c0a1-4a2a-9401-f1cc74f541f0 [17246768.486924] libceph: mon0 10.214.131.34:6789 session established
Providing a secret that does not match the rados user (client.admin):
$ sudo rbd map foo --keyfile keyfile rbd: add failed: (1) Operation not permitted $ dmesg | tail [17247003.169925] libceph: client0 fsid c78b75ac-c0a1-4a2a-9401-f1cc74f541f0 [17247003.170480] libceph: auth method 'x' error -1
KERNEL RBD
![Page 93: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/93.jpg)
What if the cluster stops? $ mkfs.xfs /dev/rbd0 & $ ps aux | grep mkfs root 32059 0.0 0.0 18800 1084 pts/0 D 16:17 0:00 mkfs.xfs /dev/rbd0
$ dmesg | tail [17174479.039082] libceph: mon0 10.214.131.34:6789 socket closed [17174479.049735] libceph: mon0 10.214.131.34:6789 session lost, hunting for new mon [17174479.049803] libceph: mon0 10.214.131.34:6789 connection failed [17174511.047618] libceph: osd1 10.214.131.34:6803 socket closed [17174511.058221] libceph: osd1 10.214.131.34:6803 connection failed
Newer kernels will not let /dev/rbd0 be unmapped at this point, since it is still in use by mkfs.xfs. rbd unmap will show 'device is busy'.
KERNEL RBD
![Page 94: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/94.jpg)
CRUSH tunables are supported in 3.6+. Before then: $ rbd map foo & (hangs) $ dmesg | tail [17248284.205022] libceph: mon0 10.214.131.34:6789 feature set mismatch, my 2 < server's 2040002, missing 2040000 [17248284.219332] libceph: mon0 10.214.131.34:6789 missing required protocol features
KERNEL RBD
![Page 95: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/95.jpg)
Most errors should give a useful error message, e.g.: $ rbd snap rm foo@snap rbd: snapshot 'snap' is protected from removal. 2013-02-22 13:08:02.603716 7fde490e3780 -1 librbd: removing snapshot from header failed: (16) Device or resource busy
$ rbd rm foo 2013-02-22 13:06:52.836368 7fd46ae54780 -1 librbd: image has snapshots - not removing Removing image: 0% complete...failed. rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
USERSPACE RBD
![Page 96: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/96.jpg)
Same problems with auth can occur: $ rbd --auth-supported none create -s 1 image 2013-02-22 13:14:29.796404 7ffa74c26780 0 librados: client.admin authentication error (95) Operation not supported rbd: couldn't connect to the cluster!
$ rbd --name client.foo create -s 1 image 2013-02-22 13:14:01.944670 7fda89844780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication 2013-02-22 13:14:01.944675 7fda89844780 0 librados: client.foo initialization error (2) No such file or directory rbd: couldn't connect to the cluster!
Use --debug-ms 1, --debug-rbd 20 to see lots of detail.
USERSPACE RBD
![Page 97: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/97.jpg)
Libvirt is often used to configure QEMU to use librbd. ● same issues, info harder to reach ● check instance logs in /var/log/libvirt/qemu/* ● logging, admin socket often complicated by apparmor, selinux, and unix permissions ● QEMU will read /etc/ceph/ceph.conf by default, but it must be accessible
MOVING UP THE STACK
![Page 98: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/98.jpg)
● userspace only ● useful for showing current config ● unlike other parts of ceph, provides help for every command automatically ● if there are performance issues, can help debug with perf dump ● disabled for clients (like librbd) by default ● enable by adding to the client section in ceph.conf, or in libvirt by appending to the disk name (e.g. name="pool/image:admin_socket=/path/")
ADMIN SOCKET
![Page 99: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/99.jpg)
Error handling is still not easy for users - you need to look at the service logs. The request path is:
cinder-api → cinder-scheduler → cinder-volume
Nova uses libvirt to configure QEMU, so the same issues apply there (but the QEMU log may be in Nova's instance directory).
OPENSTACK
![Page 100: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/100.jpg)
● check rados performance ● is rbd_cache enabled (and QEMU configured to send flushes)? ● perf dump from admin socket may give coarse clues ● more measurement infrastructure is needed, and should be added soon
PERFORMANCE ISSUES
![Page 101: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/101.jpg)
QUESTIONS?
![Page 102: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/102.jpg)
RELIABILITY BEST PRACTICES
USING CEPH
![Page 103: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/103.jpg)
! Block & Object are rock-solid and are in use at many important production sites
! CephFS cannot be used reliably yet as it is still under heavy development
! As features are added new bugs might be introduced
HOW RELIABLE IS CEPH?
![Page 104: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/104.jpg)
Striping, erasure coding, or replication across nodes ! Enjoy data durability, high availability, and high performance
Dynamic block resizing ! Expand or shrink block devices with no or minimal downtime
Data placement ! No need for lookups - every client can calculate where data is stored ! Policy-defined SLAs, performance tiers, and failure domains
Automatic failover ! Prevent failures from impacting integrity or availability
WHY IS CEPH RELIABLE?
RELIABILITY & AVAILABILITY
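The "no lookups" point can be illustrated with a toy placement function: a stable hash lets every client independently compute the same object-to-OSD mapping. CRUSH itself is far more elaborate; this only shows the principle, and the pool parameters are invented.

```python
import hashlib

PG_NUM = 64                # invented pool parameters for the sketch
OSDS = list(range(12))

def place(obj_name: str, replicas: int = 3):
    # stable hash: object name -> placement group -> ordered OSD set
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    pg = h % PG_NUM
    first = pg % len(OSDS)
    return [OSDS[(first + i) % len(OSDS)] for i in range(replicas)]

# every client computes the same answer with no central directory
replica_set = place("vm-disk-1")
```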
![Page 105: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/105.jpg)
! Use multiple RADOSGW instances behind HAProxy ! Monitor the mailing lists for critical updates and fixes (or purchase Ceph support from Red Hat and run Inktank Ceph Enterprise)
! Deploy Ceph in an automated fashion, using a configuration management/scripting system like Puppet
! Monitor all Ceph nodes like you would monitor any Linux server; just the responses might be different
RELIABILITY BEST PRACTICES
![Page 106: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/106.jpg)
HARD- & SOFTWARE ARCHITECTURE
USING CEPH
![Page 107: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/107.jpg)
! Good news: Ceph can run on any hardware ! Bad news: Ceph can run on any hardware ! Paying attention to hardware up front can help eliminate problems later
! You do not need, and should not use, a RAID controller ! Always check if the controller supports pass-through (JBOD)
CEPH HARDWARE REQUIREMENTS
![Page 108: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/108.jpg)
STORAGE NODE SIZING
Block Storage • Smaller nodes, fewer than 24 HDDs
Object Storage • Larger nodes, up to 72 HDDs
![Page 109: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/109.jpg)
! Storage Node Processor Requirements ! 1 GHz of one core per OSD ! Example: a single-socket 8-core 2.0 GHz processor will support 16 OSDs
! Storage Node Memory Requirements ! 2-3 GB per OSD
! Monitor Requirements ! Lower Requirements ! E5-2603, 16GB RAM, 1GbE
MEMORY AND PROCESSOR RULES
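The sizing rules above as a quick calculation (the 3 GB/OSD figure is the conservative end of the slide's 2-3 GB range):

```python
def max_osds(cores: int, ghz_per_core: float) -> int:
    # rule of thumb: 1 GHz of one core per OSD
    return int(cores * ghz_per_core)

def ram_needed_gb(osds: int, gb_per_osd: float = 3.0) -> float:
    # rule of thumb: 2-3 GB of RAM per OSD (conservative end used here)
    return osds * gb_per_osd

# the slide's example: single-socket 8-core 2.0 GHz processor
osds = max_osds(8, 2.0)        # 16 OSDs
ram = ram_needed_gb(osds)      # 48.0 GB
```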
![Page 110: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/110.jpg)
! To improve write performance, moving write journals to SSD drives is recommended
! One SSD drive per 5-6 spinning drives ! A write journal requires approx. 10 GB per journal, so 80-120 GB drives are sufficient
! A typical Ceph storage node has 12 spinning disks and 2 SSDs, for example a Dell R515
SSD JOURNALS
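The journal sizing arithmetic, as a sketch (ratios and sizes taken from the slide above):

```python
import math

def ssds_needed(spinners: int, spinners_per_ssd: int = 6) -> int:
    # one SSD per 5-6 spinning drives; 6 used here
    return math.ceil(spinners / spinners_per_ssd)

def ssd_capacity_gb(spinners_per_ssd: int, gb_per_journal: int = 10) -> int:
    # ~10 GB per journal, so an SSD holds journals for its spinners
    return spinners_per_ssd * gb_per_journal

# the slide's typical node: 12 spinning disks
ssds = ssds_needed(12)           # 2, matching the 2-SSD example
per_ssd = ssd_capacity_gb(6)     # 60 GB -> an 80-120 GB SSD is ample
```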
![Page 111: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/111.jpg)
! Want to know what hardware to select for validating Ceph in your environment? This document provides high-level guidelines and recommendations for selecting commodity hardware infrastructure from major server vendors to enable you to quickly deploy Ceph-based storage solutions.
Download at: https://engage.redhat.com/inktank-hardware-selection-guide-s-201409080912
HARDWARE SELECTION GUIDE
![Page 112: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/112.jpg)
NETWORK ARCHITECTURE
USING CEPH
![Page 113: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/113.jpg)
BASIC NETWORK CONSIDERATIONS
! Separate client and backend network
! Backend network preferably 10 GbE
! Client facing network speed depends on requirements
![Page 114: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/114.jpg)
! qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 -v tcp_lat tcp_bw rc_rdma_write_lat rc_rdma_write_bw rc_rdma_write_poll_lat
! fio --filename=$RBD|--directory=$MDIR --direct=1 --rw=$io --bs=$bs --size=10G --numjobs=$threads --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
! RBD=/dev/rbdX, MDIR=/cephfs/fio-test-dir
! io=write,randwrite,read,randread
! bs=4k,8k,4m,8m
! threads=1,64,128
Latency is defined by the network
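The fio parameter sweep above expands to a matrix of runs; a small generator makes the combinations explicit (paths are placeholders, exactly as in the slide):

```python
from itertools import product

ios = ["write", "randwrite", "read", "randread"]
sizes = ["4k", "8k", "4m", "8m"]
threads = [1, 64, 128]

def fio_cmds(target="/dev/rbdX"):
    # one fio invocation per (io pattern, block size, thread count) combo
    for io, bs, t in product(ios, sizes, threads):
        yield (f"fio --filename={target} --direct=1 --rw={io} --bs={bs} "
               f"--size=10G --numjobs={t} --runtime=60 --group_reporting "
               f"--name=file1 --output=fio_{io}_{bs}_{t}")

cmds = list(fio_cmds())   # 4 * 4 * 3 = 48 runs in total
```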
![Page 115: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/115.jpg)
qperf - latency
![Page 116: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/116.jpg)
qperf - bandwidth
![Page 117: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/117.jpg)
qperf - explanation ! tcp_lat almost the same for blocks <= 128 bytes ! tcp_lat very similar between 40GbE, 40Gb IPoIB and 56Gb IPoIB ! Significant difference between 1/10GbE only for blocks >= 4k ! Better latency on IB can only be achieved with rdma_write / rdma_write_poll ! Bandwidth on 1/10GbE very stable at the possible maximum ! 40GbE implementation (MLX) shows unexpected fluctuation ! Under IB the maximum transfer rate can only be achieved with RDMA ! Use sockets over RDMA ! Options without big code changes are: SDP, rsocket, SMC-R
![Page 118: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/118.jpg)
PERFORMANCE EXPECTATIONS
USING CEPH
![Page 119: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/119.jpg)
WHAT PERFORMANCE TO EXPECT?
![Page 120: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/120.jpg)
! Not easy to calculate - really dependent on workload and chosen hardware components
! But: performance can be enhanced by adding more nodes
! Generally speaking: to calculate IOPS: # drives * IOPS/drive * 80%
! Many factors affect performance
WHAT PERFORMANCE TO EXPECT?
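The IOPS rule of thumb as code (the 75 IOPS/drive figure is an illustrative assumption for a 7.2k-rpm HDD, not from the slide):

```python
def estimated_iops(drives: int, iops_per_drive: int) -> float:
    # slide's rule of thumb: # drives * IOPS/drive * 80%
    return drives * iops_per_drive * 0.80

# e.g. a 72-HDD object storage node at an assumed 75 IOPS per drive
estimate = estimated_iops(72, 75)   # 4320.0
```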
![Page 121: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/121.jpg)
MY CUSTOMER REQUIRES X IOPS/GB
! Some customers have very specific requirements ! X IOPS/GB is a common requirement ! Although the total number of IOPS increases when a cluster grows, the size grows as well, so your IOPS/GB does not go up
! Not for all GBs anyway ! In these cases, recommend small and fast hard drives (or better: SSDs)
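Why growing the cluster does not raise IOPS/GB: adding identical drives multiplies IOPS and capacity by the same factor. A sketch with invented drive figures:

```python
def iops_per_gb(drives, iops_per_drive, gb_per_drive):
    # usable IOPS (80% rule) divided by raw capacity
    return (drives * iops_per_drive * 0.80) / (drives * gb_per_drive)

small = iops_per_gb(12, 150, 1000)     # 12 x 1 TB fast spinners
grown = iops_per_gb(48, 150, 1000)     # same drives, 4x the count
# growth alone leaves the ratio unchanged; swapping in faster,
# smaller drives (or SSDs) is what actually raises IOPS/GB:
ssd = iops_per_gb(12, 20000, 1000)
```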
![Page 122: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/122.jpg)
![Page 123: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/123.jpg)
! Ceph relies heavily on a tuned OS ! Network subsystem tuning ! Disk I/O subsystem tuning
! Bottlenecks can ! Jeopardize Ceph’s overall performance ! Jeopardize overall balance
WHAT IS AT STAKE?
![Page 124: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/124.jpg)
! Assuming tuning has been achieved:
! How can we check the path is clear for performance?
! How can we check Ceph’s performance?
WHAT IS AT STAKE?
![Page 125: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/125.jpg)
! Fast journal drives – preferably SSDs ! Data drive choice must match SLAs
! Life can’t be reduced to just costs ! Business drives both costs and SLAs
! 10GbE or Infiniband cluster network ! 2-3GB RAM per OSD daemon ! 1 GHz of a CPU core per OSD daemon
GENERAL BEST PRACTICES
![Page 126: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/126.jpg)
SYSTEM METRICS
![Page 127: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/127.jpg)
! Network subsystem ! Public side network must be monitored ! Cluster side network must be monitored
! Metrics ! Frames per second ! Bandwidth
SYSTEM METRICS
![Page 128: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/128.jpg)
! Disk IO subsystem ! Journal drives must be monitored ! Data drives must be equally monitored
! Metrics ! General IO wait time ! Device IO wait queue ! Device utilization rate & IOPS ! Device bandwidth usage ! Device error rates
SYSTEM METRICS
![Page 129: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/129.jpg)
CLUSTER HEALTH
![Page 130: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/130.jpg)
! General cluster health ! OSDs status ! PG status ! Cluster health
CLUSTER HEALTH
![Page 131: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/131.jpg)
! General cluster health ! ceph health and ceph health detail ! ceph -s and ceph status ! ceph -w
CLUSTER HEALTH
![Page 132: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/132.jpg)
CEPH METRICS
![Page 133: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/133.jpg)
! Data drive usage ! ceph health detail
! Check for x nearfull osds ! Check for x full osds
! df output ! Important to remember
! Writes will get denied on full OSD ! Backfilling will get denied on full OSD ! Recovery will get denied on full OSD
CEPH METRICS
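A sketch of the fullness checks behind these warnings. The 0.85/0.95 thresholds are Ceph's historical defaults (tunable via mon_osd_nearfull_ratio and mon_osd_full_ratio), stated here as an assumption:

```python
NEARFULL, FULL = 0.85, 0.95   # assumed default ratios, both tunable

def osd_state(used_bytes: int, total_bytes: int) -> str:
    ratio = used_bytes / total_bytes
    if ratio >= FULL:
        return "full"       # writes, backfill and recovery are refused
    if ratio >= NEARFULL:
        return "nearfull"   # health warning: add capacity or rebalance
    return "ok"

state = osd_state(860, 1000)   # an OSD at 86% usage -> "nearfull"
```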
![Page 134: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/134.jpg)
! Admin socket (Check primary PG assignment) ! ceph daemon osd.$id perf dump
! numpg Total PGs this OSD ! numpg_primary Primary PGs this OSD ! numpg_replica Replica PGs this OSD
CEPH METRICS
![Page 135: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/135.jpg)
! Admin socket (Check OSD usage) ! ceph daemon osd.$id perf dump
! stat_bytes Total bytes this OSD ! stat_bytes_used Bytes used this OSD ! stat_bytes_avail Bytes available this OSD
CEPH METRICS
![Page 136: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/136.jpg)
! Admin socket (Checking ops and subops live) ! ceph daemon osd.$id dump_ops_in_flight
! "flag_point" : "waiting for subops" ! "events"
! "time": "2014-03-17 17:55:57.274529", ! "event": "xxxxxxxxxxxxxxx"
CEPH METRICS
![Page 137: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/137.jpg)
! Admin socket (Checking ops and subops live) ! waiting_for_osdmap ! reached_pg ! started ! waiting for subops from [x] Only on primary OSD ! commit_queued_for_journal_write ! write_thread_in_journal_buffer ! journaled_completion_queued ! op_commit ! op_applied or sub_op_applied ! sub_op_commit_rec Only on primary OSD ! commit_sent ! done
CEPH METRICS
![Page 138: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/138.jpg)
PERFORMANCE OPTIMIZATION
USING CEPH
![Page 139: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/139.jpg)
GEO-REPLICATION
USING CEPH
![Page 140: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/140.jpg)
LET’S START WITH AN EXAMPLE • CLIMB: a collaboration between Birmingham, Cardiff, Swansea and Warwick Universities, to develop and deploy a world-leading cyber infrastructure for microbial bioinformatics, supported by three world-class Medical Research Fellows, a comprehensive training program and two newly refurbished bioinformatics facilities in Swansea and Warwick.
![Page 141: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/141.jpg)
QUOTE FROM WWW.CLIMB.AC.UK ! “For longer term data storage, to share datasets and VMs, and to provide block storage for running VMs, we will be deploying a storage solution based on Ceph. Each site has 27 Dell R730XD servers, with each server containing 16x 4TB HDDs, giving a total raw storage capacity of 6912TB. All data stored in this system is replicated 3x, which gives us a usable capacity of 2304TB.”
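The quoted figures check out; the arithmetic (four sites, as listed on the previous slide):

```python
# 4 sites x 27 servers x 16 drives x 4 TB, replicated 3x
sites, servers, drives, tb_per_drive = 4, 27, 16, 4

raw_tb = sites * servers * drives * tb_per_drive   # 6912 TB raw
usable_tb = raw_tb // 3                            # 2304 TB usable at 3x
```

Both values match the quote exactly.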
![Page 142: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/142.jpg)
TIME FOR OTHER EXAMPLES ! A large American cable company streams all their Video on Demand content from 17 datacenters across the United States with Ceph.
! Another large American media sharing service has the largest “single” Ceph cluster that we currently know about: 300(!) PB. The cluster spans multiple datacenters.
! And there are so many more …
![Page 143: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/143.jpg)
SECURITY CONSIDERATIONS
USING CEPH
![Page 144: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/144.jpg)
! The weakest link will fail eventually ! Although Ceph has no single point of failure, the administrator can cause serious trouble
! The storage system’s operating system should not be accessed by ‘normal’ administrators ! Consider storage nodes as ‘appliances’
SECURE YOUR SYSTEM LIKE ANY OTHER
![Page 145: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/145.jpg)
MONITORING & MAINTENANCE
USING CEPH
![Page 146: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/146.jpg)
! Ceph was designed to be self healing ! Cluster rebalancing is automatic ! Drives do not need to be swapped immediately ! Data distribution is handled automatically
! Calamari monitoring allows you to keep an eye on the cluster ! New Calamari management features allow cluster management through a GUI ! More management features are coming in the future
HOW DIFFICULT IS IT TO MANAGE?
![Page 147: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/147.jpg)
CALAMARI
![Page 148: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/148.jpg)
COMMERCIAL BREAK
USING CEPH
![Page 149: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/149.jpg)
INKTANK CEPH ENTERPRISE – WHAT’S INSIDE?
DEPLOYMENT TOOLS ! New “Wizard” for easier installation of Calamari ! Local repository with dependencies ! Cluster bootstrapping tool
CALAMARI ! On-premise, web-based application ! Cluster monitoring and management ! RESTful API ! Unlimited instances
CEPH FOR OBJECT & BLOCK
SUPPORT SERVICES ! SLA-backed Technical Support ! Bug Escalation ! “Hot Patches” ! Roadmap Input
![Page 150: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/150.jpg)
BENEFITS
TCO FUTURE PROOFED EXPERTISE ENTERPRISE READY
" Very Low TCO " Operational Efficiency
" Scalable Tools
" Long-Term Support " Single, predictable price
" Use-Case Flexibility
" Backed by the Ceph experts
" Support from the developers
" Use Existing Infrastructure
" Extend use-case options
![Page 151: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/151.jpg)
RELEASE SCHEDULE
(release timeline chart spanning Q3 2013 – Q2 2015)
![Page 152: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/152.jpg)
SUPPORT MATRIX
(matrix legend: SUPPORTED / SUPPORTED FUTURE)
![Page 153: Let's talk about Ceph Terena TF-Storage Meeting – February 12th](https://reader030.vdocument.in/reader030/viewer/2022020411/586d01061a28ab070b8b9b78/html5/thumbnails/153.jpg)
QUESTIONS?