Your 1st Ceph cluster
TRANSCRIPT
Your 1st Ceph cluster
Rules of thumb
and magic numbers
Agenda
• Ceph overview
• Hardware planning
• Lessons learned
Who are we?
• Udo Seidel
• Greg Elkinbard
• Dmitriy Novakovskiy
Ceph overview
What is Ceph?
• Ceph is a free & open distributed storage
platform that provides unified object, block,
and file storage.
Object Storage
• Native RADOS objects:
– hash-based placement
– synchronous replication
– radosgw provides S3 and Swift compatible
REST access to RGW objects (striped over
multiple RADOS objects).
Block storage
• RBD block devices are thinly provisioned
over RADOS objects and can be accessed
by QEMU via the librbd library.
File storage
• CephFS metadata servers (MDS) provide
a POSIX-compliant file system backed by
RADOS.
– …still experimental?
How does Ceph fit with OpenStack?
• RBD drivers for OpenStack tell libvirt to
configure the QEMU interface to librbd
• Ceph benefits:
– Multi-node striping and redundancy for block
storage (Cinder volumes, Nova ephemeral
drives, Glance images)
– Copy-on-write cloning of images to volumes
– Unified pool of storage nodes for all types of
data (objects, block devices, files)
– Live migration of Ceph-backed VMs
Hardware planning
Questions
• How much net storage, in what tiers?
• How many IOPS?
– Aggregated
– Per VM (average)
– Per VM (peak)
• What to optimize for?
– Cost
– Performance
Rules of thumb and magic numbers #1
• Ceph-OSD sizing:
– Disks:
• 8-10 SAS HDDs per 1 x 10G NIC
• ~12 SATA HDDs per 1 x 10G NIC
• 1 x SSD for write journal per 4-6 OSD drives
– RAM:
• 1 GB of RAM per 1 TB of OSD storage space
– CPU:
• 0.5 CPU cores / 1 GHz of a core per OSD disk (1-2 CPU cores for SSD drives)
• Ceph-mon sizing:
– 1 ceph-mon node per 15-20 OSD nodes (min 3 per cluster)
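These rules of thumb translate directly into a back-of-the-envelope calculator. A minimal sketch, assuming HDD-backed OSDs; the function and parameter names are illustrative, not part of any Ceph tooling:

```python
import math

def osd_node_requirements(osd_disks, disk_tb, disk_type="sas"):
    """Rough per-node sizing from the rules of thumb above."""
    disks_per_10g = {"sas": 9, "sata": 12}[disk_type]  # 8-10 SAS, ~12 SATA per 10G NIC
    return {
        "nics_10g": math.ceil(osd_disks / disks_per_10g),
        "journal_ssds": math.ceil(osd_disks / 5),  # 1 SSD per 4-6 OSD drives
        "ram_gb": osd_disks * disk_tb,             # 1 GB RAM per TB of OSD space
        "cpu_cores": osd_disks * 0.5,              # 0.5 core per HDD-backed OSD
    }

# e.g. a node with 20 x 1.8 TB SAS OSD drives:
print(osd_node_requirements(20, 1.8))
```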
Rules of thumb and magic numbers #2
• For a typical “unpredictable” workload we usually assume:
– 70/30 read/write IOPS proportion
– ~4-8 KB random read pattern
• To estimate Ceph IOPS efficiency we usually take*:
– 4-8K random read: 0.88
– 4-8K random write: 0.64
* Based on benchmark data and semi-empirical evidence
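As a sketch, these efficiency factors scale an aggregate raw drive rating down to expected Ceph IOPS (names are illustrative; the 0.88/0.64 factors and the 70/30 mix are the assumptions above):

```python
EFFICIENCY = {"read": 0.88, "write": 0.64}  # 4-8K random, semi-empirical

def ceph_cluster_iops(per_drive_iops, total_drives, op):
    """Raw aggregate drive rating scaled by the Ceph efficiency factor."""
    return per_drive_iops * total_drives * EFFICIENCY[op]

def mixed_iops(per_drive_iops, total_drives, read_fraction=0.7):
    """Blend read and write ratings with the assumed 70/30 proportion."""
    return (read_fraction * ceph_cluster_iops(per_drive_iops, total_drives, "read")
            + (1 - read_fraction) * ceph_cluster_iops(per_drive_iops, total_drives, "write"))
```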
Sizing task #1
• 1 Petabyte usable (net) storage
• 500 VMs
• 100 IOPS per VM
• 50000 IOPS aggregated
Sizing example #1 – Ceph-mon
• Ceph monitors
– Sample spec:
• HP DL360 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD: 1 x 1 TB SATA
• NIC: 1 x 1 Gig, 1 x 10 Gig
– How many?
• 1 x ceph-mon node per 15-20 OSD nodes
Sizing example #2 – Ceph-OSD
• Lower density, performance optimized
– Sample spec:
• HP DL380 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch – HGST C10K1800)
• SSD (write journals): 4 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig, 4 x 10 Gig (2 bonds – 1 for ceph-public, 1 for ceph-replication)
– 1000 / (20 * 1.8 * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
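The server-count arithmetic generalizes to a one-line formula. A sketch, where the 0.85 fill ratio and 3x replication are the assumptions used on this slide and the names are illustrative:

```python
def servers_for_net_capacity(net_tb, disks_per_node, disk_tb,
                             fill_ratio=0.85, replicas=3):
    """How many OSD nodes are needed to serve net_tb of usable capacity."""
    raw_needed = net_tb * replicas                      # replication multiplies raw need
    raw_per_node = disks_per_node * disk_tb * fill_ratio
    return raw_needed / raw_per_node

# 1 PB net on nodes with 20 x 1.8 TB SAS drives -> ~98 servers
print(round(servers_for_net_capacity(1000, 20, 1.8)))
```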
Expected performance level
• Conservative drive rating: 250 IOPS
• Cluster read rating:
– 250 * 20 * 98 * 0.88 = 431,200 IOPS
– Approx. 800 IOPS per VM capacity is available
• Cluster write rating:
– 250 * 20 * 98 * 0.64 = 313,600 IOPS
– Approx. 600 IOPS per VM capacity is available
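The ratings above all follow one pattern: per-drive IOPS x drives per node x nodes x efficiency factor. A sketch reproducing the numbers for this example (names illustrative):

```python
def cluster_rating(drive_iops, drives_per_node, nodes, efficiency):
    """Aggregate cluster IOPS rating from a conservative per-drive rating."""
    return drive_iops * drives_per_node * nodes * efficiency

reads = cluster_rating(250, 20, 98, 0.88)   # read rating, ~431,200 IOPS
writes = cluster_rating(250, 20, 98, 0.64)  # write rating, ~313,600 IOPS
print(reads / 500, writes / 500)            # per-VM capacity across 500 VMs
```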
Sizing example #3 – Ceph-OSD
• Higher density, cost optimized
– Sample spec:
• SuperMicro 6048R
• CPU: 2 x 2630v3 (16 cores)
• RAM: 192 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 28 x SATA 4 TB (7.2k RPM, 3.5 inch – HGST 7K4000)
• SSD (write journals): 6 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig, 4 x 10 Gig (2 bonds – 1 for ceph-public, 1 for ceph-replication)
– 1000 / (28 * 4 * 0.85) * 3 ≈ 32 servers to serve 1 petabyte net
Expected performance level
• Conservative drive rating: 150 IOPS
• Cluster read rating:
– 150 * 32 * 28 * 0.88 = 118,272 IOPS
– Approx. 200 IOPS per VM capacity is available
• Cluster write rating:
– 150 * 32 * 28 * 0.64 = 86,016 IOPS
– Approx. 150 IOPS per VM capacity is available
What about SSDs?
• In Firefly (0.8x) an OSD is limited to ~5k
IOPS, so all-SSD Ceph cluster performance
disappoints
• In Giant (0.9x) ~30k IOPS per OSD should be
reachable, but this has yet to be verified
Lessons learned
How we deploy (with Fuel)
Select storage options ⇒ Assign roles to nodes ⇒ Allocate disks.
What we now know #1
• RBD is not suitable for IOPS-heavy workloads:
– Realistic expectations:
• 100 IOPS per VM (OSD with HDD disks)
• 10K IOPS per VM (OSD with SSD disks, pre-Giant Ceph)
• 10-40 ms latency expected and accepted
• High CPU load on OSD nodes at high IOPS (bumping up CPU requirements)
• Object storage for tenants and applications has limitations
– Swift API gaps in radosgw:
• Bulk operations
• Custom account metadata
• Static website
• Expiring objects
• Object versioning
What we now know #2
• NIC bonding
– Do Linux bonds, not OVS bonds. OVS underperforms and eats CPU.
• Ceph and OVS
– DON’T. Bridges for Ceph networks should be connected directly to the NIC bond, not via the OVS integration bridge.
• Nova evacuate doesn’t work with Ceph RBD-backed VMs
– Fixed in Kilo
• Ceph network co-location (with management, public etc.)
– DON’T. Use dedicated NICs/NIC pairs for Ceph-public and Ceph-internal (replication).
• Tested ceph.conf
– Enable RBD cache and others – https://bugs.launchpad.net/fuel/+bug/1374969
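For orientation, the RBD cache settings in question look roughly like this in ceph.conf. This is a sketch of commonly used Firefly-era client options, not the exact tested file from the bug report:

```ini
[client]
# Enable client-side RBD caching for QEMU/librbd guests
rbd cache = true
# Stay in writethrough mode until the guest issues its first flush
rbd cache writethrough until flush = true
```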
Questions?
Thank you!