Let’s talk about Ceph – TERENA TF-Storage Meeting – February 12th, 2015 – University of Vienna
! First part: 13:00 to 15:00
  ! Introduction to Ceph
  ! Technical deep dive: why is Ceph so interesting and popular
  ! Q&A about Ceph
! Coffee break
! Second part: 15:30 to 17:30
  ! Notable use cases, deployment options, management and monitoring
AGENDA
Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE FORECAST
By 2020, over 15 ZB of data will be stored; 1.5 ZB are stored today.
5 Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE PROBLEM
! Existing systems don’t scale ! Increasing cost and complexity
! Need to invest in new platforms ahead of time
(Chart: growth of data vs. IT storage budget, 2010 to 2020)
6 Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE SOLUTION
PAST: SCALE UP
FUTURE: SCALE OUT
7 Copyright © 2015 Red Hat, Inc. | Private and Confidential
EXAMPLES IN NATURE
DO YOU STILL REMEMBER?
Copyright © 2015 Red Hat, Inc. | Private and Confidential
STORAGE APPLIANCES
! RAID: Redundant Array of Inexpensive Disks ! Enhanced Reliability
! RAID-1 mirroring ! RAID-5/6 parity (reduced overhead) ! Automated recovery
! Enhanced Performance ! RAID-0 striping ! SAN interconnects ! Enterprise SAS drives ! Proprietary H/W RAID controllers
END OF RAID AS WE KNOW IT
! Economical Storage Solutions ! Software RAID implementations ! iSCSI and JBODs
! Enhanced Capacity ! Logical volume concatenation
RAID CHALLENGES: CAPACITY/SPEED
! Storage economies in disks come from more GB per spindle ! NRE rates are flat (typically estimated at 10^-15 per bit)
! ~4% chance of an NRE while recovering a 4+1 RAID-5 set, and it goes up with the number of volumes in the set; many RAID controllers fail the recovery after an NRE
! Access speed has not kept up with density increases ! 27 hours to rebuild a 4+1 RAID-5 set at 20 MB/s, during which time a second drive can fail
! Managing the risk of second failures requires hot-spares ! Defeating some of the savings from parity redundancy
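To see where figures like these come from (a rough sketch, assuming 1.5 TB drives): recovering a 4+1 RAID-5 set means reading the four surviving drives, about 4 x 1.5 TB ≈ 4.8 x 10^13 bits; at an NRE rate of 10^-15 per bit the expected number of unrecoverable read errors during the rebuild is roughly 0.048, i.e. about a 4-5% chance of hitting one. Likewise, rebuilding 2 TB at 20 MB/s takes about 10^5 seconds, roughly 27 hours.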
RAID CHALLENGES: EXPANSION
! The next generation of disks will be larger and cost less per GB; we would like to use these as we expand
! Most RAID replication schemes require identical disks, meaning new disks cannot be added to an old set and failed disks must be replaced with identical units
! Proprietary appliances may require replacements from manufacturer (at much higher than commodity prices)
! Many storage systems reach a limit beyond which they cannot be further expanded (e.g. fork-lift upgrade)
! Re-balancing existing data over new volumes is non-trivial
RELIABILITY/AVAILABILITY
! RAID-5 can only survive a single disk failure ! The odds of an NRE during recovery are significant ! Odds of a second failure during recovery are non-negligible ! Annual petabyte durability for RAID-5 is only 3 nines
! RAID-6 redundancy protects against two disk failures ! Odds of an NRE during recovery are still significant ! Client data access will be starved out during recovery ! Throttling recovery increases the risk of data loss
RELIABILITY/AVAILABILITY
! Even RAID-6 can’t protect against: ! Server failures ! NIC failures ! Switch failures ! OS crashes ! Facility or regional disasters
RAID CHALLENGES: EXPENSE
! Capital Expenses … good RAID costs ! Significant mark-up for enterprise hardware ! High performance RAID controllers can add $50-100/disk ! SANs further increase costs ! Expensive equipment, much of which is often poorly utilized ! Software RAID is much less expensive, and much slower
RAID CHALLENGES: EXPENSE
! Operating Expenses … RAID doesn’t manage itself ! RAID group, LUN and pool management ! Lots of application-specific tunable parameters ! Difficult expansion and migration ! When a recovery goes bad, it goes very bad ! Don’t even think about putting off replacing a failed drive
! Ceph is a massively scalable, open source, software-defined storage system that runs on commodity hardware.
! Ceph delivers object, block, and file system storage in one self-managing, self-healing platform with no single point of failure.
! Ceph ideally replaces legacy storage and provides a single solution for the cloud.
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHAT IS CEPH?
22
CEPH UNIFIED STORAGE
FILE SYSTEM: Linux Kernel, POSIX, Distributed Metadata, CIFS/NFS, HDFS
BLOCK STORAGE: OpenStack, Linux Kernel, Tiering, Clones, Snapshots
OBJECT STORAGE: Multi-tenant S3 & Swift, Keystone, Geo-Replication, Erasure Coding
Copyright © 2015 Red Hat, Inc. | Private and Confidential
TRADITIONAL STORAGE VS. CEPH
TRADITIONAL ENTERPRISE STORAGE
24 Copyright © 2015 Red Hat, Inc. | Private and Confidential
“Run storage the way Google does”
Copyright © 2015 Red Hat, Inc. | Private and Confidential
VISION
25
! Failure is normal ! Scale out on commodity hardware ! Everything runs on software ! Solution must be 100% open source
Copyright © 2015 Red Hat, Inc. | Private and Confidential
THE CEPH PHILOSOPHY
26
HISTORICAL TIMELINE
Copyright © 2015 Red Hat, Inc. | Private and Confidential 27
2004 Project starts at UCSC
2006 Open sourced
2010 Mainline Linux kernel
2011 OpenStack integration
MAY 2012 Launch of Inktank
2012 CloudStack integration
SEPT 2012 Production-ready Ceph
2013 Xen integration
OCT 2013 Inktank Ceph Enterprise launch
FEB 2014 RHEL-OSP certification
APR 2014 Inktank acquired by Red Hat
STRONG & GROWING COMMUNITY
28 Copyright © 2015 Red Hat, Inc. | Private and Confidential
COMMUNITY METRICS DASHBOARD
METRICS.CEPH.COM
ARCHITECTURE
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
30
APP HOST/VM CLIENT
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
31
APP HOST/VM CLIENT
RADOS COMPONENTS
32
OSDs: ! 10s to 10000s in a cluster ! One per disk (or one per SSD, RAID group…)
! Serve stored objects to clients ! Intelligently peer for replication & recovery
Monitors: ! Maintain cluster membership and state ! Provide consensus for distributed decision-making
! Small, odd number ! These do not serve stored objects to clients
Copyright © 2015 Red Hat, Inc. | Private and Confidential
OBJECT STORAGE DAEMONS
33
btrfs xfs ext4
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHERE DO OBJECTS LIVE?
34
Copyright © 2015 Red Hat, Inc. | Private and Confidential
A METADATA SERVER?
35
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CALCULATED PLACEMENT?
36
(Diagram: objects mapped to storage targets by name range: A-G, H-N, O-T, U-Z)
Copyright © 2015 Red Hat, Inc. | Private and Confidential
EVEN BETTER: CRUSH!
37
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CRUSH: DYNAMIC DATA PLACEMENT
38
CRUSH: ! Pseudo-random placement algorithm
! Fast calculation, no lookup
! Repeatable, deterministic ! Statistically uniform distribution ! Stable mapping
! Limited data migration on change ! Rule-based configuration
! Infrastructure topology aware ! Adjustable replication ! Weighting
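As a concrete illustration of rule-based, topology-aware placement, the default replicated rule in a decompiled CRUSH map of this era looks roughly like the sketch below (rule name, number, and the "default" root are per-cluster and shown here only as examples):

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default                     # start at the root of the hierarchy
    step chooseleaf firstn 0 type host    # pick one OSD from each of N distinct hosts
    step emit
}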
Copyright © 2015 Red Hat, Inc. | Private and Confidential
DATA IS ORGANIZED INTO POOLS
39
CLUSTER POOLS (CONTAINING PGs): POOL A, POOL B, POOL C, POOL D
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
40
APP HOST/VM CLIENT
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ACCESSING A RADOS CLUSTER
41
RADOS CLUSTER
socket
Copyright © 2015 Red Hat, Inc. | Private and Confidential
LIBRADOS: RADOS ACCESS FOR APPS
42
LIBRADOS: ! Direct access to RADOS for applications ! C, C++, Python, PHP, Java, Erlang ! Direct access to storage nodes ! No HTTP overhead
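A minimal sketch of what direct RADOS access looks like in practice, using the rados CLI (which is built on librados) against a hypothetical pool called app-data; an application would issue the same operations through one of the language bindings listed above:

$ ceph osd pool create app-data 128                # pool with 128 placement groups
$ echo "hello rados" > /tmp/hello.txt
$ rados -p app-data put greeting /tmp/hello.txt    # store an object named "greeting"
$ rados -p app-data ls                             # list objects in the pool
$ rados -p app-data get greeting /tmp/out.txt      # read it back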
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
43
APP HOST/VM CLIENT
THE RADOS GATEWAY
44
RADOS CLUSTER
socket
REST
Copyright © 2015 Red Hat, Inc. | Private and Confidential
RADOSGW MAKES RADOS WEBBY
45
RADOSGW: ! REST-based object storage proxy ! Uses RADOS to store objects ! API supports buckets, accounts ! Usage accounting for billing ! Compatible with S3 and Swift applications
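A minimal sketch of exercising the S3-compatible API, assuming a running radosgw and an s3cmd already configured to point at the gateway (the user name and bucket below are made up for illustration):

$ radosgw-admin user create --uid=demo --display-name="Demo User"   # returns an access/secret key pair
$ s3cmd mb s3://backups                     # create a bucket through the gateway
$ s3cmd put database.dump s3://backups/     # upload an object with any S3 tool
$ s3cmd ls s3://backups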
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
46
APP HOST/VM CLIENT
RBD STORES VIRTUAL DISKS
47
RADOS BLOCK DEVICE: ! Storage of disk images in RADOS ! Decouples VMs from host ! Images are striped across the cluster (pool) ! Snapshots ! Copy-on-write clones ! Support in:
! Mainline Linux Kernel (2.6.39+) ! Qemu/KVM, native Xen coming soon ! OpenStack, CloudStack, Nebula, Proxmox
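A small sketch of the block workflow with the rbd CLI (pool and image names are arbitrary; clones require format 2 images):

$ rbd create vms/disk1 --size 10240 --image-format 2    # 10 GB image in pool "vms"
$ rbd snap create vms/disk1@base
$ rbd snap protect vms/disk1@base
$ rbd clone vms/disk1@base vms/disk1-clone              # copy-on-write clone
$ sudo rbd map vms/disk1                                # expose through the kernel module as /dev/rbdX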
Copyright © 2015 Red Hat, Inc. | Private and Confidential
STORING VIRTUAL DISKS
48
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SEPARATE COMPUTE FROM STORAGE
49
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
KERNEL MODULE FOR MAX FLEXIBILITY
50
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHITECTURAL COMPONENTS
51
APP HOST/VM CLIENT
SCALABLE METADATA SERVERS
52
METADATA SERVER ! Manages metadata for a POSIX-compliant shared
filesystem ! Directory hierarchy ! File metadata (owner, timestamps, mode, etc.)
! Stores metadata in RADOS ! Does not serve file data to clients ! Only required for shared filesystem
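For completeness, mounting the shared filesystem with the kernel client looks roughly like this (the monitor address, mount point and credential paths are placeholders):

$ sudo mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret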
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SEPARATE METADATA SERVER
53
RADOS CLUSTER
data metadata
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH ADVANTAGE: DECLUSTERED PLACEMENT
! Consider a failed 2TB RAID mirror ! We must copy 2TB from the survivor to the successor ! Survivor and successor are likely in same failure zone
! Consider two RADOS objects clustered on the same primary ! Surviving copies are declustered (on different secondaries) ! New copies will be declustered (on different successors) ! Copy 10GB from each of 200 survivors to 200 successors ! Survivors and successors are in different failure zones
CEPH ADVANTAGE: DECLUSTERED PLACEMENT
! Benefits ! Recovery is parallel and 200x faster ! Service can continue during the recovery process ! Exposure to 2nd failures is reduced by 200x ! Zone aware placement protects against higher level failures ! Recovery is automatic and does not await new drives ! No idle hot-spares are required
CEPH ADVANTAGE: OBJECT GRANULARITY
! Consider a failed 2TB RAID mirror ! To recover it we must read and write (at least) 2TB ! Successor must be same size as failed volume ! An error in recovery will probably lose the file system
! Consider a failed RADOS OSD ! To recover it we must read and write thousands of objects ! Successor OSDs must each have some free space ! An error in recovery will probably lose one object
CEPH ADVANTAGE: OBJECT GRANULARITY
! Benefits ! Heterogeneous commodity disks are easily supported ! Better and more uniform space utilization ! Per-object updates always preserve causality ordering ! Object updates are more easily replicated over WAN links ! Greatly reduced data loss if errors do occur
CEPH ADVANTAGE: INTELLIGENT STORAGE
! Intelligent OSDs automatically rebalance data ! When new nodes are added ! When old nodes fail or are decommissioned ! When placement policies are changed
! The resulting rebalancing is very good: ! Even distribution of data across all OSDs ! Uniform mix of old and new data across all OSDs ! Moves only as much data as required
CEPH ADVANTAGE: INTELLIGENT STORAGE
! Intelligent OSDs continuously scrub their objects ! To detect and correct silent write errors before another failure
! This architecture scales from Petabytes to exabytes ! A single pool of thin provisioned, self-managing storage ! Serving a wide range of block, object, and file clients
QUESTIONS?
USE CASES
USING CEPH
DOES ONE SOLUTION FIT ALL?
! Remember this?
Ceph ideally replaces legacy storage and provides a single solution for the cloud.
! One simple rule: if the performance is not high enough, add more nodes …
Copyright © 2015 Red Hat, Inc. | Private and Confidential
The most widely deployed technology for OpenStack storage [1]
CLOUD INFRASTRUCTURE WITH OPENSTACK
63
Features:
• Full integration with Nova, Cinder & Glance
• Single storage for images and ephemeral & persistent volumes
• Copy-on-write provisioning
• Swift-compatible object storage gateway
• Full integration with Red Hat Enterprise Linux OpenStack Platform
Benefits:
• Provides both volume storage and object storage for tenant applications
• Reduces provisioning time for new virtual machines
• No data transfer of images between storage and compute nodes required
• Unified installation experience with Red Hat Enterprise Linux OpenStack Platform
[1] http://superuser.openstack.org/articles/openstack-user-survey-insights-november-2014
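As an illustration of the Cinder and Glance integration, the relevant configuration is only a few lines; the pool and user names below (volumes, images, cinder, glance) are the conventional examples from the upstream documentation, not fixed values, and option names reflect Juno-era OpenStack:

# cinder.conf
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
rbd_secret_uuid = <libvirt secret uuid>

# glance-api.conf
default_store = rbd
rbd_store_pool = images
rbd_store_user = glance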
Copyright © 2015 Red Hat, Inc. | Private and Confidential
OPENSTACK SURVEY NOVEMBER 2014
source: http://superuser.openstack.org/articles/openstack-user-survey-insights-november-2014
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH AND OPENSTACK
65
RADOS CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
S3-compatible, on-premise cloud storage with uncompromised cost and performance
WEB-SCALE OBJECT STORAGE FOR APPLICATIONS
66
Features:
• Compatibility with S3 APIs
• Fully-configurable replicated storage backend
• Pluggable erasure coded storage backend
• Cache tiering pools
• Multi-site failover
Benefits:
• Supports broad ecosystem of tools and applications built for the S3 API
• Provides a modern hot/warm/cold storage topology that offers cost-efficient performance
• Advanced durability at optimal price
• Delivers a foundation for the next generation of cloud-native applications using object storage
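The erasure-coded backend and cache tiering mentioned above are set up with a handful of commands; a rough sketch (pool names and PG counts are illustrative only):

$ ceph osd pool create cold-ec 256 256 erasure     # erasure-coded base pool
$ ceph osd pool create hot-cache 128               # replicated cache pool
$ ceph osd tier add cold-ec hot-cache
$ ceph osd tier cache-mode hot-cache writeback
$ ceph osd tier set-overlay cold-ec hot-cache      # client I/O is redirected through the cache tier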
S3 API
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WEB APPLICATION STORAGE
67
S3/Swift
Copyright © 2015 Red Hat, Inc. | Private and Confidential
MULTI-SITE OBJECT STORAGE
68 Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHIVE / COLD STORAGE
69
CEPH STORAGE CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ARCHIVE / COLD STORAGE
70
Site A Site B
CEPH STORAGE CLUSTER CEPH STORAGE CLUSTER
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WEBSCALE APPLICATIONS
71
Native Protocol
Copyright © 2015 Red Hat, Inc. | Private and Confidential
DATABASES
72
Native Protocol
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Why do people ask about iSCSI?
! There is currently NO supported iSCSI solution for Ceph
! We’ve tested a TGT proxy with mixed results
! Goal is a hardened iSCSI solution
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHAT ABOUT iSCSI?
CEPH FOR HADOOP
! CephFS can be used as a drop-in replacement for HDFS
! Overcomes limitations of HDFS
! Very experimental at this time
! Not supported
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Customers often ask about VMware compatibility ! We have tried via third-party iSCSI target software (stgt), but results were mixed
! Pre-acquisition, some large banks requested that VMware incorporate Ceph support into their hypervisor ! VMware has indicated they will evaluate incorporating Ceph support into their next major release (scheduled 2015)
! If you encounter companies using VMware that want this, please let them know they should request it from VMware
! We will be looking at VVOL support in a later version as well
Copyright © 2014 Red Hat, Inc. | Private and Confidential
VMWARE
75
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH ON DRIVES
76
! Object model ! Pools
! 1s to 100s ! Independent namespaces for object collections ! Replication level, placement policy
! Objects ! Bazillions ! Blob of data (bytes to gigabytes) ! Attributes (e.g. “version=12”) ! Key/value bundle
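A quick sketch of the attribute part of that model, continuing with the hypothetical app-data pool and greeting object from the earlier librados example:

$ rados -p app-data setxattr greeting version 12    # attach an attribute to the object
$ rados -p app-data getxattr greeting version
$ rados -p app-data listxattr greeting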
CLOSER LOOK AT LIBRADOS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
ATOMIC TRANSACTIONS
! Client operations sent to the OSD cluster ! Operate on a single object ! Can contain a sequence of operations; e.g.
! Truncate object ! Write new object data ! Set attribute
ATOMIC TRANSACTIONS
! Atomicity ! All operations commit or do not commit atomically
! Conditional ! ‘guard’ operations can control
whether operation is performed ! Verify xattr has specific value ! Assert object is a specific version
! Allows atomic compare-and-swap etc.
KEY/VALUE STORAGE ! store key/value pairs in an object
! independent from object attrs or byte data payload
! based on google's leveldb ! efficient random and range insert/query/removal ! based on BigTable SSTable design
! exposed via key/value API ! insert, update, remove ! individual keys or ranges of keys
! avoid read/modify/write cycle for updating complex objects ! e.g., file system directory objects
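The rados CLI exposes the same key/value (omap) API; a small sketch against the hypothetical greeting object, with made-up keys:

$ rados -p app-data setomapval greeting owner alice        # set a key/value pair
$ rados -p app-data setomapval greeting mtime 1423742400
$ rados -p app-data listomapkeys greeting
$ rados -p app-data getomapval greeting owner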
WATCH/NOTIFY ! establish stateful 'watch' on an object
! client interest persistently registered with object ! client keeps session to OSD open
! send 'notify' messages to all watchers ! notify message (and payload) is distributed to all watchers ! variable timeout ! notification on completion
! all watchers got and acknowledged the notify ! use any object as a communication/synchronization channel
! locking, distributed coordination (ala ZooKeeper), etc.
WATCH/NOTIFY
WATCH/NOTIFY EXAMPLE
! radosgw cache consistency ! radosgw instances watch a single object (.rgw/notify) ! locally cache bucket metadata ! on bucket metadata changes (removal, ACL changes) ! write change to relevant bucket object ! send notify with bucket name to other radosgw
instances ! on receipt of notify
! invalidate relevant portion of cache
RADOS CLASSES
! dynamically loaded .so ! /var/lib/rados-classes/* ! implement new object “methods” using existing methods ! part of I/O pipeline ! simple internal API
RADOS CLASSES
! reads ! can call existing native or class methods ! do whatever processing is appropriate ! return data
! writes ! can call existing native or class methods ! do whatever processing is appropriate ! generates a resulting transaction to be applied
atomically
CLASS EXAMPLES ! Grep
! read an object, filter out individual records, and return those
! Sha1 ! read object, generate fingerprint, return that
! Images ! rotate, resize, crop image stored in object ! remove red-eye
! crypto ! encrypt/decrypt object data with provided key
IDEAS
! rados mailbox (RMB?) ! plug librados backend into dovecot, postfix, etc. ! key/value object for each mailbox
! key = message id ! value = headers
! object for each message or attachment ! watch/notify to delivery notification
IDEAS ! distributed key/value table
! aggregate many k/v objects into one big 'table’ ! working prototype exists
! lua rados class ! embed lua interpreter in a rados class ! ship semi-arbitrary code for operations
! json class ! parse, manipulate json structures
● RBD is a relatively thin layer on top of RADOS ● many RBD problems are RADOS problems ● Most RBD-specific issues are configuration/
integration with other software (libvirt, OpenStack, CloudStack, QEMU, etc.)
● Most of these issues are during initial setup
CLOSER LOOK AT RBD
To connect, need: ● monitor addresses ● client name (--name client.foo, or --id foo) ● secret key (from keyring, plain key file, or command
line) Capabilities needed after connecting: ● mon 'allow r' (every client) ● osd 'allow rwx pool foo' (RBD rw access to images in pool foo) ● osd 'allow r class-read pool bar' (RBD ro access to pool bar) ● osd 'allow class-read object_prefix rbd_children'
o for RBD format 2 snapshot unprotection ● ceph-authtool man page has more
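Creating a client with exactly those capabilities is one command; a sketch using a hypothetical client.foo and pool names:

$ ceph auth get-or-create client.foo \
      mon 'allow r' \
      osd 'allow rwx pool=foo, allow r class-read pool=bar' \
      -o /etc/ceph/ceph.client.foo.keyring
$ ceph auth list                                   # verify what was granted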
RADOS AUTH
● hang connecting o is ceph.conf being read for the monitor addresses? o correct keyring being read? o are monitors up and accessible from this machine? o --debug-ms 1 will show if it's getting any response o strace useful for checking the correct files are being
read ● error connecting
o permission denied ! is keyring and ceph.conf readable? ! are capabilities correct? (ceph auth list) ! is the right client name being used?
COMMON RADOS CLIENT PROBLEMS
Format 2 is not supported yet:
$ sudo modprobe rbd
$ rbd create -s 20 --format 2 foo
$ sudo rbd map foo
rbd: add failed: (2) No such file or directory
$ dmesg | tail
[17246768.484421] libceph: client0 fsid c78b75ac-c0a1-4a2a-9401-f1cc74f541f0
[17246768.486924] libceph: mon0 10.214.131.34:6789 session established
Providing a secret that does not match the rados user (client.admin):
$ sudo rbd map foo --keyfile keyfile
rbd: add failed: (1) Operation not permitted
$ dmesg | tail
[17247003.169925] libceph: client0 fsid c78b75ac-c0a1-4a2a-9401-f1cc74f541f0
[17247003.170480] libceph: auth method 'x' error -1
KERNEL RBD
What if the cluster stops?
$ mkfs.xfs /dev/rbd0 &
$ ps aux | grep mkfs
root 32059 0.0 0.0 18800 1084 pts/0 D 16:17 0:00 mkfs.ext4 /dev/rbd/rbd/test
$ dmesg | tail
[17174479.039082] libceph: mon0 10.214.131.34:6789 socket closed
[17174479.049735] libceph: mon0 10.214.131.34:6789 session lost, hunting for new mon
[17174479.049803] libceph: mon0 10.214.131.34:6789 connection failed
[17174511.047618] libceph: osd1 10.214.131.34:6803 socket closed
[17174511.058221] libceph: osd1 10.214.131.34:6803 connection failed
Newer kernels will not let /dev/rbd0 be unmapped at this point, since it is still in use by mkfs.xfs. rbd unmap will show 'device is busy'.
KERNEL RBD
Crush tunables are supported in 3.6+. Before then:
$ rbd map foo &
(hangs)
$ dmesg | tail
[17248284.205022] libceph: mon0 10.214.131.34:6789 feature set mismatch, my 2 < server's 2040002, missing 2040000
[17248284.219332] libceph: mon0 10.214.131.34:6789 missing required protocol features
KERNEL RBD
Most errors should give a useful error message, i.e.:
$ rbd snap rm foo@snap
rbd: snapshot 'snap' is protected from removal.
2013-02-22 13:08:02.603716 7fde490e3780 -1 librbd: removing snapshot from header failed: (16) Device or resource busy
$ rbd rm foo
2013-02-22 13:06:52.836368 7fd46ae54780 -1 librbd: image has snapshots - not removing
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.
USERSPACE RBD
Same problems with auth can occur:
$ rbd --auth-supported none create -s 1 image
2013-02-22 13:14:29.796404 7ffa74c26780 0 librados: client.admin authentication error (95) Operation not supported
rbd: couldn't connect to the cluster!

$ rbd --name client.foo create -s 1 image
2013-02-22 13:14:01.944670 7fda89844780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2013-02-22 13:14:01.944675 7fda89844780 0 librados: client.foo initialization error (2) No such file or directory
rbd: couldn't connect to the cluster!
Use --debug-ms 1, --debug-rbd 20 to see lots of detail.
USERSPACE RBD
Libvirt is often used to configure QEMU to use librbd. ● same issues, info harder to reach ● check instance logs in /var/log/libvirt/qemu/* ● logging, admin socket often complicated by
apparmor, selinux, and unix permissions ● QEMU will read /etc/ceph/ceph.conf by default, but
it must be accessible
MOVING UP THE STACK
● userspace only ● useful for showing current config ● unlike other parts of ceph, provides help for every command
automatically ● if there are performance issues, can help debug with perf dump ● disabled for clients (like librbd) by default ● enable by adding to client section in ceph.conf, or in libvirt
appending to disk name (i.e. name="pool/image:admin_socket=/path/")
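A few concrete invocations (the daemon id and socket path are examples; the path follows whatever admin socket setting is in ceph.conf):

$ ceph daemon osd.3 help                                            # list every command the socket supports
$ ceph daemon osd.3 perf dump                                       # performance counters
$ ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok config show     # same mechanism via the socket path

To enable it for librbd clients, something like this in ceph.conf:
[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.asok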
ADMIN SOCKET
Error handling is still not easy for users - you need to look at the service logs. The request path is:
cinder-api -> cinder-scheduler -> cinder-volume
Nova uses libvirt to configure QEMU, so same issues here (but QEMU log may be in nova's instance directory).
OPENSTACK
● check rados performance ● is rbd_cache enabled (and QEMU configured to send flushes)? ● perf dump from admin socket may give coarse clues ● more infrastructure to measure needed,
should be added soon
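Checking the first item is a two-line ceph.conf sketch; the second option guards against guests that never send flushes (the values shown are the common recommendation, not a universal tuning):

[client]
    rbd cache = true
    rbd cache writethrough until flush = true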
PERFORMANCE ISSUES
QUESTIONS?
RELIABILITY BEST PRACTICES
USING CEPH
! Block & Object are rock-solid and are in use at many important production sites
! CephFS cannot be used reliably yet as it is still under heavy development
! As features are added new bugs might be introduced
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HOW RELIABLE IS CEPH?
103
Striping, erasure coding, or replication across nodes
! Enjoy data durability, high availability, and high performance
Dynamic block resizing
! Expand or shrink block devices with no or minimal downtime
Data placement
! No need for lookups - every client can calculate where data is stored
! Policy-defined SLAs, performance tiers, and failure domains
Automatic failover
! Prevent failures from impacting integrity or availability
WHY IS CEPH RELIABLE?
104
RELIABILITY & AVAILABILITY
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Use multiple RADOSGWs behind HAProxy (see the sketch below)
! Monitor the mailing lists for critical updates and fixes (or purchase support for Ceph from Red Hat and run Inktank Ceph Enterprise)
! Deploy Ceph in an automated fashion, using a configuration management/scripting system like Puppet
! Monitor all Ceph nodes like you would monitor any Linux server; only the response procedures might differ
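A minimal HAProxy sketch for fronting two gateways; the addresses and the civetweb port 7480 are assumptions, adjust to your deployment:

frontend rgw_frontend
    bind *:80
    mode http
    default_backend rgw_backend
backend rgw_backend
    mode http
    balance roundrobin
    option httpchk GET /
    server rgw1 10.0.0.11:7480 check
    server rgw2 10.0.0.12:7480 check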
RELIABILITY BEST PRACTICES
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HARD- & SOFTWARE ARCHITECTURE
USING CEPH
! Good news: Ceph can run on any hardware ! Bad news: Ceph can run on any hardware ! Paying attention to hardware up front can help eliminate problems later
! You do not need, and should not use, a RAID controller ! Always check if the controller supports pass-through (JBOD)
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH HARDWARE REQUIREMENTS
107
Copyright © 2015 Red Hat, Inc. | Private and Confidential
STORAGE NODE SIZING
108
Block Storage • Smaller nodes, <24 HDDs
Object Storage • Larger nodes, up to 72 HDDs
! Storage Node Processor Requirements ! 1 GHz of 1 core per OSD ! Example: a single-socket 8-core 2.0 GHz processor will support 16 OSDs
! Storage Node Memory Requirements ! 2-3 GB per OSD
! Monitor Requirements ! Lower Requirements ! E5-2603, 16GB RAM, 1GbE
Copyright © 2015 Red Hat, Inc. | Private and Confidential
MEMORY AND PROCESSOR RULES
109
! To improve write performance, moving write journals to SSD drives is recommended
! One SSD drive per 5-6 spinning drives ! A write journal requires approx. 10 GB per journal, so
80-120 GB drives are sufficient
! A typical Ceph storage node has 12 spinning disks and 2 SSD disks (for example, a Dell R515)
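With ceph-deploy the disk-to-journal pairing is expressed directly; a sketch assuming data disks sdb and sdc and a shared journal SSD sdm on a node called node1:

$ ceph-deploy osd create node1:sdb:/dev/sdm node1:sdc:/dev/sdm
# each OSD keeps its data on a spinning disk and gets a journal partition on the shared SSD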
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SSD JOURNALS
110
! Want to know what hardware to select for validating Ceph in your environment? This document provides high-level guidelines and recommendations for selecting commodity hardware infrastructure from major server vendors, enabling you to quickly deploy Ceph-based storage solutions.
Download at: https://engage.redhat.com/inktank-hardware-selection-guide-s-201409080912
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HARDWARE SELECTION GUIDE
111
NETWORK ARCHITECTURE
USING CEPH
BASIC NETWORK CONSIDERATIONS
! Separate client and backend network
! Backend network preferably 10 GbE
! Client facing network speed depends on requirements
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 -v tcp_lat tcp_bw rc_rdma_write_lat rc_rdma_write_bw rc_rdma_write_poll_lat
! fio --filename=$RBD|--directory=$MDIR --direct=1 --rw=$io --bs=$bs --size=10G --numjobs=$threads --runtime=60 --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}
! RBD=/dev/rbdX, MDIR=/cephfs/fio-test-dir
! io=write,randwrite,read,randread
! bs=4k,8k,4m,8m
! threads=1,64,128
Latency is defined by the network
Copyright © 2015 Red Hat, Inc. | Private and Confidential
qperf - latency
Copyright © 2015 Red Hat, Inc. | Private and Confidential
qperf - bandwidth
Copyright © 2015 Red Hat, Inc. | Private and Confidential
qperf - explanation ! tcp_lat almost the same for blocks <= 128 bytes ! tcp_lat very similar between 40GbE, 40Gb IPoIB and 56Gb IPoIB ! Significant difference between 1 / 10GbE only for blocks >= 4k ! Better latency on IB can only be achieved with rdma_write /
rdma_write_poll ! Bandwidth on 1 / 10GbE very stable on possible maximum ! 40GbE implementation (MLX) with unexpected fluctuation ! Under IB only with RDMA the maximum transfer rate can be achieved ! Use Socket over RDMA ! Options without big code changes are: SDP, rsocket, SMC-R
Copyright © 2015 Red Hat, Inc. | Private and Confidential
PERFORMANCE EXPECTATIONS
USING CEPH
Copyright © 2015 Red Hat, Inc. | Private and Confidential
WHAT PERFORMANCE TO EXPECT?
119
! Not easy to calculate; it really depends on the workload and the chosen hardware components
! But: performance can be enhanced by adding more nodes
! Generally speaking: to calculate IOPS: # drives * IOPS/drive * 80%
! Many factors affect performance
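As a rough illustration with assumed numbers: 72 SATA drives at ~100 IOPS each gives 72 * 100 * 0.8 ≈ 5,760 IOPS for the cluster, before accounting for replication overhead on writes.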
Copyright © 2014 Red Hat, Inc. | Private and Confidential
WHAT PERFORMANCE TO EXPECT?
120
MY CUSTOMER REQUIRES X IOPS/GB
! Some customers have very specific requirements ! X IOPS/GB is a common requirement ! Although the total number of IOPS increases when a cluster grows, the capacity grows as well, so your IOPS/GB does not go up
! Not for all GBs anyway ! In these cases, recommend small and fast hard drives (or better: SSDs)
! Ceph relies heavily on a tuned OS ! Network subsystem tuning ! Disk I/O subsystem tuning
! Bottlenecks can ! Jeopardize Ceph’s overall performance ! Jeopardize overall balance
WHAT IS AT STAKE?
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Assuming tuning has been achieved:
! How can we check the path is clear for performance?
! How can we check Ceph’s performance?
WHAT IS AT STAKE?
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Fast journal drives – preferably SSDs ! Data drive choice must match SLAs
! Life can’t be reduced to just costs ! Business drives both costs and SLAs
! 10GbE or Infiniband cluster network ! 2-3GB RAM per OSD daemon ! 1 GHz of a CPU core per OSD daemon
GENERAL BEST PRACTICES
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SYSTEM METRICS
! Network subsystem ! Public side network must be monitored ! Cluster side network must be monitored
! Metrics ! Frames per second ! Bandwidth
SYSTEM METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Disk IO subsystem ! Journal drives must be monitored ! Data drives must be equally monitored
! Metrics ! General IO wait time ! Device IO wait queue ! Device utilization rate & IOPS ! Device bandwidth usage ! Device error rates
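Standard Linux tools cover most of these metrics; for example (device names are placeholders):

$ sar -n DEV 5            # per-interface frames/packets and bandwidth
$ iostat -x 5             # per-device utilization, wait queue, IOPS and throughput
$ smartctl -a /dev/sdb    # drive error counters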
SYSTEM METRICS
CLUSTER HEALTH
! General cluster health ! OSDs status ! PG status ! Cluster health
CLUSTER HEALTH
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! General cluster health ! ceph health and ceph health detail ! ceph -s and ceph status ! ceph -w
CLUSTER HEALTH
Copyright © 2015 Red Hat, Inc. | Private and Confidential
CEPH METRICS
! Data drive usage ! ceph health detail
! Check for x nearfull osds ! Check for x full osds
! df output ! Important to remember
! Writes will get denied on full OSD ! Backfilling will get denied on full OSD ! Recovery will get denied on full OSD
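A quick way to check those numbers on a live cluster (the OSD mount path shown is the default location and may differ in your deployment):

$ ceph health detail            # calls out any nearfull or full OSDs
$ ceph df                       # global and per-pool usage
$ df -h /var/lib/ceph/osd/*     # raw filesystem usage on an OSD node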
CEPH METRICS
! Admin socket (Check primary PG assignment) ! ceph daemon osd.$id perf dump
! numpg Total PGs on this OSD ! numpg_primary Primary PGs on this OSD ! numpg_replica Secondary (replica) PGs on this OSD
CEPH METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Admin socket (Check OSD usage) ! ceph daemon osd.$id perf dump
! stat_bytes Total bytes this OSD ! stat_bytes_used Bytes used this OSD ! stat_bytes_avail Bytes available this OSD
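Both sets of counters come from the same perf dump; a small sketch that pretty-prints the JSON and pulls out just the fields mentioned above:

$ ceph daemon osd.0 perf dump | python -m json.tool | grep -E 'numpg|stat_bytes'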
CEPH METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Admin socket (Checking ops and subops live) ! ceph daemon osd.$id dump_ops_in_flight
! "flag_point" : "waiting for subops" ! "events"
! "time": "2014-03-17 17:55:57.274529", ! "event": "xxxxxxxxxxxxxxx"
CEPH METRICS
Copyright © 2015 Red Hat, Inc. | Private and Confidential
! Admin socket (Checking ops and subops live) ! waiting_for_osdmap ! reached_pg ! started ! waiting for subops from [x] Only on primary OSD ! commit_queued_for_journal_write ! write_thread_in_journal_buffer ! journaled_completion_queued ! op_commit ! op_applied or sub_op_applied ! sub_op_commit_rec Only on primary OSD ! commit_sent ! done
CEPH METRICS
PERFORMANCE OPTIMIZATION
USING CEPH
GEO-REPLICATION
USING CEPH
LET’S START WITH AN EXAMPLE • CLIMB: a collaboration between
Birmingham, Cardiff, Swansea and Warwick Universities, to develop and deploy a world leading cyber infrastructure for microbial bioinformatics, supported by three world class Medical Research Fellows, a comprehensive training program and two newly refurbished Bioinformatics facilities in Swansea and Warwick.
Copyright © 2015 Red Hat, Inc. | Private and Confidential
QUOTE FROM WWW.CLIMB.AC.UK ! “For longer term data storage, to share datasets and VMs, and
to provide block storage for running VMs, we will be deploying a storage solution based on Ceph. Each site has 27 Dell R730XD servers, with each server containing 16x 4TB HDDs, giving a total raw storage capacity of 6912TB. All data stored in this system is replicated 3x, which gives us a usable capacity of 2304TB.”
Copyright © 2015 Red Hat, Inc. | Private and Confidential
TIME FOR OTHER EXAMPLES ! A large American cable company streams all their Video on
Demand content from 17 datacenters across the United States with Ceph.
! Another large American media sharing service has the largest “single” Ceph cluster that we know about: 300(!) PB. The cluster spans multiple datacenters.
! And there are so many more …
Copyright © 2015 Red Hat, Inc. | Private and Confidential
SECURITY CONSIDERATIONS
USING CEPH
! The weakest link will fail eventually ! Although Ceph has no single point of failure, the administrator can cause serious trouble
! The storage system’s operating system should not be accessed by ‘normal’ administrators
! Consider storage nodes as ‘appliances’
SECURE YOUR SYSTEM LIKE ANY OTHER
MONITORING & MAINTENANCE
USING CEPH
! Ceph was designed to be self healing ! Cluster rebalancing is automatic ! Drives do not need to be swapped immediately ! Data distribution is handled automatically
! Calamari monitoring allows you to keep an eye on the cluster ! New Calamari management features allow cluster management through a GUI ! More management features are coming in the future
Copyright © 2015 Red Hat, Inc. | Private and Confidential
HOW DIFFICULT IS IT TO MANAGE?
146
CALAMARI
Copyright © 2015 Red Hat, Inc. | Private and Confidential 147
COMMERCIAL BREAK
USING CEPH
Copyright © 2015 Red Hat, Inc. | Private and Confidential 149
INKTANK CEPH ENTERPRISE: WHAT’S INSIDE?
! DEPLOYMENT TOOLS: New “Wizard” for easier installation of Calamari; Local repository with dependencies; Cluster bootstrapping tool
! CALAMARI: On-premise, web-based application; Cluster monitoring and management; RESTful API; Unlimited Instances
! CEPH FOR OBJECT & BLOCK
! SUPPORT SERVICES: SLA-backed Technical Support; Bug Escalation; “Hot Patches”; Roadmap Input
BENEFITS
TCO FUTURE PROOFED EXPERTISE ENTERPRISE READY
" Very Low TCO " Operational
Efficiency
" Scalable Tools
150
" Long-Term Support " Single, predictable
price
" Use-Case Flexibility
" Backed by the Ceph experts
" Support from the developers
" Use Existing Infrastructure
" Extend use-case options
Copyright © 2015 Red Hat, Inc. | Private and Confidential
RELEASE SCHEDULE
Copyright © 2015 Red Hat, Inc. | Private and Confidential 151
(Release timeline by quarter, Q3 2013 through Q2 2015)
SUPPORT MATRIX
152
(Support matrix: currently supported and future supported platforms)
Copyright © 2015 Red Hat, Inc. | Private and Confidential
QUESTIONS?