Your 1st Ceph cluster
TRANSCRIPT
Your 1st Ceph cluster
Rules of thumb
and magic numbers
Agenda
• Ceph overview
• Hardware planning
• Lessons learned
Who are we?
• Udo Seidel
• Greg Elkinbard
• Dmitriy Novakovskiy
Ceph overview
What is Ceph?
• Ceph is a free & open distributed storage
platform that provides unified object, block,
and file storage.
Object Storage
• Native RADOS objects:
– hash-based placement
– synchronous replication
– radosgw provides S3 and Swift compatible
REST access to RGW objects (striped over
multiple RADOS objects).
Block storage
• RBD block devices are thinly provisioned
over RADOS objects and can be accessed
by QEMU via the librbd library.
File storage
• CephFS metadata servers (MDS) provide
a POSIX-compliant file system backed by
RADOS.
– …still experimental?
How does Ceph fit with OpenStack?
• RBD drivers for OpenStack tell libvirt to
configure the QEMU interface to librbd
• Ceph benefits:
– Multi-node striping and redundancy for block
storage (Cinder volumes, Nova ephemeral
drives, Glance images)
– Copy-on-write cloning of images to volumes
– Unified pool of storage nodes for all types of
data (objects, block devices, files)
– Live migration of Ceph-backed VMs
Hardware planning
Questions
• How much net storage, in what tiers?
• How many IOPS?
– Aggregated
– Per VM (average)
– Per VM (peak)
• What to optimize for?
– Cost
– Performance
Rules of thumb and magic numbers #1
• Ceph-OSD sizing:
– Disks:
• 8-10 SAS HDDs per 1 x 10G NIC
• ~12 SATA HDDs per 1 x 10G NIC
• 1 x SSD for write journal per 4-6 OSD drives
– RAM:
• 1 GB of RAM per 1 TB of OSD storage space
– CPU:
• 0.5 CPU cores / 1 GHz of a core per OSD disk (1-2 CPU cores for SSD drives)
• Ceph-mon sizing:
– 1 ceph-mon node per 15-20 OSD nodes (min 3 per cluster)
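These rules of thumb translate directly into a back-of-the-envelope calculator. A minimal sketch, assuming HDD-backed OSDs; the function and parameter names are illustrative, not part of any Ceph tooling:

```python
import math

def osd_node_requirements(osd_disks, disk_tb, disk_type="sas"):
    """Rough per-node sizing from the rules of thumb above."""
    disks_per_10g = {"sas": 9, "sata": 12}[disk_type]  # 8-10 SAS, ~12 SATA per 10G NIC
    return {
        "nics_10g": math.ceil(osd_disks / disks_per_10g),
        "journal_ssds": math.ceil(osd_disks / 5),  # 1 SSD per 4-6 OSD drives
        "ram_gb": osd_disks * disk_tb,             # 1 GB RAM per TB of OSD space
        "cpu_cores": osd_disks * 0.5,              # 0.5 core per HDD-backed OSD
    }

# e.g. a node with 20 x 1.8 TB SAS OSD drives:
print(osd_node_requirements(20, 1.8))
```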
Rules of thumb and magic numbers #2
• For a typical “unpredictable” workload we usually assume:
– 70/30 read/write IOPS proportion
– ~4-8 KB random read pattern
• To estimate Ceph IOPS efficiency we usually take*:
– 4-8K random read: 0.88
– 4-8K random write: 0.64
* Based on benchmark data and semi-empirical evidence
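As a sketch, these efficiency factors scale an aggregate raw drive rating down to expected Ceph IOPS (names are illustrative; the 0.88/0.64 factors and the 70/30 mix are the assumptions above):

```python
EFFICIENCY = {"read": 0.88, "write": 0.64}  # 4-8K random, semi-empirical

def ceph_cluster_iops(per_drive_iops, total_drives, op):
    """Raw aggregate drive rating scaled by the Ceph efficiency factor."""
    return per_drive_iops * total_drives * EFFICIENCY[op]

def mixed_iops(per_drive_iops, total_drives, read_fraction=0.7):
    """Blend read and write ratings with the assumed 70/30 proportion."""
    return (read_fraction * ceph_cluster_iops(per_drive_iops, total_drives, "read")
            + (1 - read_fraction) * ceph_cluster_iops(per_drive_iops, total_drives, "write"))
```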
Sizing task #1
• 1 Petabyte usable (net) storage
• 500 VMs
• 100 IOPS per VM
• 50000 IOPS aggregated
Sizing example #1 – Ceph-mon
• Ceph monitors
– Sample spec:
• HP DL360 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD: 1 x 1 TB SATA
• NIC: 1 x 1 Gig, 1 x 10 Gig
– How many?
• 1 x ceph-mon node per 15-20 OSD nodes
Sizing example #2 – Ceph-OSD
• Lower density, performance optimized
– Sample spec:
• HP DL380 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch – HGST C10K1800)
• SSD (write journals): 4 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig, 4 x 10 Gig (2 bonds – 1 for ceph-public, 1 for ceph-replication)
– 1000 / (20 * 1.8 * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
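The server-count arithmetic generalizes to a one-line formula. A sketch, where the 0.85 fill ratio and 3x replication are the assumptions used on this slide and the names are illustrative:

```python
def servers_for_net_capacity(net_tb, disks_per_node, disk_tb,
                             fill_ratio=0.85, replicas=3):
    """How many OSD nodes are needed to serve net_tb of usable capacity."""
    raw_needed = net_tb * replicas                      # replication multiplies raw need
    raw_per_node = disks_per_node * disk_tb * fill_ratio
    return raw_needed / raw_per_node

# 1 PB net on nodes with 20 x 1.8 TB SAS drives -> ~98 servers
print(round(servers_for_net_capacity(1000, 20, 1.8)))
```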
Expected performance level
• Conservative drive rating: 250 IOPS
• Cluster read rating:
– 250 * 20 * 98 * 0.88 = 431,200 IOPS
– Approx. 800 IOPS per VM capacity is available
• Cluster write rating:
– 250 * 20 * 98 * 0.64 = 313,600 IOPS
– Approx. 600 IOPS per VM capacity is available
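The ratings above all follow one pattern: per-drive IOPS x drives per node x nodes x efficiency factor. A sketch reproducing the numbers for this example (names illustrative):

```python
def cluster_rating(drive_iops, drives_per_node, nodes, efficiency):
    """Aggregate cluster IOPS rating from a conservative per-drive rating."""
    return drive_iops * drives_per_node * nodes * efficiency

reads = cluster_rating(250, 20, 98, 0.88)   # read rating, ~431,200 IOPS
writes = cluster_rating(250, 20, 98, 0.64)  # write rating, ~313,600 IOPS
print(reads / 500, writes / 500)            # per-VM capacity across 500 VMs
```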
Sizing example #3 – Ceph-OSD
• Higher density, cost optimized
– Sample spec:
• SuperMicro 6048R
• CPU: 2 x 2630v3 (16 cores)
• RAM: 192 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 28 x SATA 4 TB (7.2k RPM, 3.5 inch – HGST 7K4000)
• SSD (write journals): 6 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig, 4 x 10 Gig (2 bonds – 1 for ceph-public, 1 for ceph-replication)
– 1000 / (28 * 4 * 0.85) * 3 ≈ 32 servers to serve 1 petabyte net
Expected performance level
• Conservative drive rating: 150 IOPS
• Cluster read rating:
– 150 * 32 * 28 * 0.88 = 118,272 IOPS
– Approx. 200 IOPS per VM capacity is available
• Cluster write rating:
– 150 * 32 * 28 * 0.64 = 86,016 IOPS
– Approx. 150 IOPS per VM capacity is available
What about SSDs?
• In Firefly (0.8x) an OSD is limited to ~5k
IOPS, so all-SSD Ceph cluster performance
disappoints
• In Giant (0.9x) ~30k IOPS per OSD should be
reachable, but this has yet to be verified
Lessons learned
How we deploy (with Fuel)
Select storage options ⇒ Assign roles to nodes ⇒ Allocate disks.
What we now know #1
• RBD is not suitable for IOPS-heavy workloads:
– Realistic expectations:
• 100 IOPS per VM (OSD with HDD disks)
• 10K IOPS per VM (OSD with SSD disks, pre-Giant Ceph)
• 10-40 ms latency expected and accepted
• High CPU load on OSD nodes at high IOPS (bumping up CPU requirements)
• Object storage for tenants and applications has limitations
– Swift API gaps in radosgw:
• Bulk operations
• Custom account metadata
• Static website
• Expiring objects
• Object versioning
What we now know #2
• NIC bonding
– Do Linux bonds, not OVS bonds. OVS underperforms and eats CPU.
• Ceph and OVS
– DON’T. Bridges for Ceph networks should be connected directly to the NIC bond, not via the OVS integration bridge.
• Nova evacuate doesn’t work with Ceph RBD-backed VMs
– Fixed in Kilo
• Ceph network co-location (with management, public etc.)
– DON’T. Use dedicated NICs/NIC pairs for Ceph-public and Ceph-internal (replication).
• Tested ceph.conf
– Enable RBD cache and others – https://bugs.launchpad.net/fuel/+bug/1374969
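For orientation, the RBD cache settings in question look roughly like this in ceph.conf. This is a sketch of commonly used Firefly-era client options, not the exact tested file from the bug report:

```ini
[client]
# Enable client-side RBD caching for QEMU/librbd guests
rbd cache = true
# Stay in writethrough mode until the guest issues its first flush
rbd cache writethrough until flush = true
```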
Questions?
Thank you!