Ceph Performance and Optimization - Ceph Day Frankfurt
Posted on 18-Oct-2014
DESCRIPTION
Sébastien Han, eNovance
TRANSCRIPT
Ceph performance
Ceph Days Frankfurt 2014
Whoami
💥 Sébastien Han
💥 French Cloud Engineer working for eNovance
💥 Daily job focused on Ceph and OpenStack
💥 Blogger
Personal blog: http://www.sebastien-han.fr/blog/
Company blog: http://techs.enovance.com/
Last Cephdays presentation
How does Ceph perform?
42*
*The Hitchhiker's Guide to the Galaxy
The Good: Ceph IO pattern
CRUSH: deterministic object placement
As soon as a client writes into Ceph, the placement is computed on the client side: the client itself determines which OSDs the object belongs to.
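Because placement is a pure function of the cluster map, it can even be computed offline; a minimal sketch with standard Ceph tooling (the object name and pool id below are hypothetical):

```
# Grab the current cluster map and ask where an object would land,
# without touching any OSD: placement is computed, not looked up.
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-object myobject --pool 0
# -> prints the PG and the acting set of OSDs for "myobject"
```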
Aggregation: cluster level
As soon as you write into Ceph, all the objects get evenly spread across the entire cluster, meaning across machines and disks.
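A minimal sketch of that spreading (pool and image names are hypothetical): an RBD image is striped into 4 MB objects, and each backing object can land on a different set of OSDs.

```
rbd info rbd/myimage | grep -E 'order|prefix'    # object size and object name prefix
rados -p rbd ls | head -5                        # a few of the backing objects
ceph osd map rbd "$(rados -p rbd ls | head -1)"  # PG and OSDs for one of them
```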
Aggregation: OSD level
As soon as an IO goes into an OSD, no matter what the original pattern was, it becomes sequential.
The Bad: Ceph IO pattern
Journaling
As soon as an IO goes into an OSD, it gets written twice.
Journal and OSD data on the same disk
Journal penalty on the disk
Since we write twice, if the journal is stored on the same disk as the OSD data, this results in the following:
Device             wMB/s
sdb1 (journal)     50.11
sdb2 (osd_data)    40.25
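A hedged sketch of the usual mitigation: put the journal on a separate (ideally SSD) partition, set per OSD in ceph.conf. The device path below is only an example.

```
[osd.0]
    osd journal = /dev/ssd0p1    # dedicated journal partition on another device
    osd journal size = 10240     # journal size in MB (relevant when the journal is a file)
```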
Filesystem fragmentation
• Objects are stored as files on the OSD filesystem
• Several IO patterns with different block sizes increase filesystem fragmentation
• Possible root cause: image sparseness
• A one-year-old cluster ends up with (see the allocsize mount option for XFS):
$ sudo xfs_db -c frag -r /dev/sdd
actual 196334, ideal 122582, fragmentation factor 37.56%
• RADOS hints: fadvise-like hints that help filesystem allocation
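A hedged example of the allocsize hint mentioned above, set through the standard ceph.conf mount option for XFS-backed OSDs:

```
[osd]
    osd mount options xfs = rw,noatime,inode64,allocsize=4M
```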
No parallelized reads
• Ceph will always serve the read request from the primary OSD
• Room for an N× speed-up, where N is the replica count
Blueprint from Sage for the Giant release
Scrubbing impact
• Consistency check of objects at the PG level
• Compares replica versions against each other (an fsck for objects)
• Light scrubbing (daily) checks the object size and attributes
• Deep scrubbing (weekly) reads the data and uses checksums to ensure data integrity
• Corruption exists: even with ECC memory, enterprise disks quote an unrecoverable bit error rate of about 1 in 10^15, i.e. roughly one bad read per ~113 TB
• No pain, no gain
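When scrubbing hurts too much during peak hours, it can be paused and resumed with standard cluster flags; a minimal sketch:

```
ceph osd set noscrub           # pause light scrubbing
ceph osd set nodeep-scrub      # pause deep scrubbing
# ... peak hours pass ...
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```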
The Ugly: Ceph IO pattern
IOs to the OSD disk
One IO into Ceph leads to 2 writes, and well… the second write is the worst!
The problem
• Several objects map to the same physical disks
• Sequential streams all get mixed together
• Result: the disk seeks like hell
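The seeking is easy to observe on the OSD data disks (device name below is an example): high await and %util with modest throughput is the signature of a seek-bound disk.

```
# Watch the OSD data disk under load; a seek-bound disk shows high await/%util
# while w/s and wMB/s stay modest.
iostat -x sdb 1
```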
Even worse with erasure coding? This is just an assumption!
• Since erasure coding splits objects into even more chunks (chunks of chunks), this phenomenon could be amplified
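For reference, this is the chunking in question; a hedged sketch (profile and pool names are hypothetical) where every object is split into k data chunks plus m coding chunks, each stored on a different OSD:

```
ceph osd erasure-code-profile set myprofile k=4 m=2
ceph osd pool create ecpool 128 128 erasure myprofile
```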
CLUSTER
How to build it?
How to start?
Things that you must consider:
• Use case
  • IO profile: bandwidth? IOPS? mixed?
  • How many IOPS or how much bandwidth per client do I want to deliver?
  • Do I use Ceph standalone or is it combined with another software solution?
• Amount of data (usable, not raw; a rough sizing sketch follows below)
  • Replica count
  • Do I have a data growth plan?
• Leftover
  • How much data am I willing to lose if a node fails? (%)
  • Am I ready to be annoyed by the scrubbing process?
• Budget :-)
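A hypothetical sizing sketch for the data and leftover points above: raw capacity follows from the usable target, the replica count and the free-space headroom kept for failures and re-balancing.

```
usable_tb=100       # usable capacity to offer
replicas=3          # replica count
headroom=0.70       # fill the cluster to at most ~70%
echo "raw TB needed: $(echo "$usable_tb * $replicas / $headroom" | bc -l)"
# -> roughly 429 TB of raw disk for 100 TB usable in this example
```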
Things that you must not do
• Don't put a RAID underneath your OSDs
  • Ceph already manages replication
  • A degraded RAID hurts performance
  • It reduces the usable space of the cluster
• Don't build high-density nodes with a tiny cluster
  • Failure considerations and data to re-balance
  • Potential full cluster
• Don't run Ceph on your hypervisors (unless you're broke)
  • Well, maybe…
Firefly: Interesting things going on
Object store multi-backend
• ObjectStore is born
• Aims to support several backends:
  • LevelDB (default)
  • RocksDB
  • Fusion-io NVMKV
  • Seagate Kinetic
  • Yours!
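The backend is selected per OSD through the osd objectstore option in ceph.conf; a hedged sketch below, where keyvaluestore-dev is assumed to be the experimental key/value backend name around the Firefly time frame (check your release), FileStore remaining the default:

```
[osd]
    osd objectstore = keyvaluestore-dev   # experimental key/value backend (assumed name); default is filestore
```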
Why is it so good?
• No more journal! Yay!
• Object backends have built-in atomic functions
Firefly: LevelDB
• Relatively new
• Needs to be tested with your workload first
• Tends to be more efficient with small objects
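A minimal "test with your workload first" sketch using rados bench with small objects (pool name and sizes are examples):

```
rados bench -p testpool 60 write -b 4096 -t 16 --no-cleanup   # 60 s of 4 KB writes, 16 concurrent ops
rados bench -p testpool 60 rand                               # random reads against the same objects
rados -p testpool cleanup                                     # remove the benchmark objects
```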