Posted on 20-Jul-2015
Your 1st Ceph cluster
Rules of thumb and magic numbers
Agenda
• Ceph overview
• Hardware planning
• Lessons learned
Who are we?
• Udo Seidel
• Greg Elkinbard
• Dmitriy Novakovskiy
Ceph overview
What is Ceph?
• Ceph is a free & open distributed storage
platform that provides unified object, block,
and file storage.
Object Storage
• Native RADOS objects:
– hash-based placement
– synchronous replication
– radosgw provides S3 and Swift compatible
REST access to RGW objects (striped over
multiple RADOS objects).
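For illustration, a minimal sketch of writing and reading a native RADOS object through the python-rados bindings; the pool name, object name, and ceph.conf path are placeholders, not anything prescribed by this deck:

```python
# Minimal RADOS object write/read via the python-rados bindings.
# Pool "data" must already exist; CRUSH places the object by hash.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')
    ioctx.write_full('hello-object', b'hello RADOS')  # replicated synchronously
    print(ioctx.read('hello-object'))
    ioctx.close()
finally:
    cluster.shutdown()
```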
Block storage
• RBD block devices are thinly provisioned
over RADOS objects and can be accessed
by QEMU via the librbd library.
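QEMU attaches to such images through librbd at run time; as a rough sketch of the provisioning side only, the python-rbd bindings can create a thin-provisioned image like this (pool, image name, and size are placeholder values):

```python
# Create a thin-provisioned RBD image; space is consumed only as extents
# are actually written.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')
    rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024 ** 3)  # 10 GiB logical size
    image = rbd.Image(ioctx, 'vm-disk-1')
    image.write(b'\x00' * 4096, 0)   # only the written 4 KB is now allocated
    image.close()
    ioctx.close()
finally:
    cluster.shutdown()
```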
File storage
• CephFS metadata servers (MDS) provide
a POSIX-compliant file system backed by
RADOS.
– …still experimental?
How does Ceph fit with OpenStack?
• RBD drivers for OpenStack tell libvirt to
configure the QEMU interface to librbd
• Ceph benefits:
– Multi-node striping and redundancy for block
storage (Cinder volumes, Nova ephemeral
drives, Glance images)
– Copy-on-write cloning of images to volumes
– Unified pool of storage nodes for all types of
data (objects, block devices, files)
– Live migration of Ceph-backed VMs
Hardware planning
Questions
• How much net storage, in what tiers?
• How many IOPS?
– Aggregated
– Per VM (average)
– Per VM (peak)
• What to optimize for?
– Cost
– Performance
Rules of thumb and magic numbers #1
• Ceph-OSD sizing:
– Disks:
• 8-10 SAS HDDs per 1 x 10G NIC
• ~12 SATA HDDs per 1 x 10G NIC
• 1 x SSD for write journals per 4-6 OSD drives
– RAM:
• 1 GB of RAM per 1 TB of OSD storage space
– CPU:
• 0.5 CPU core / 1 GHz of a core per OSD disk (1-2 CPU cores for SSD drives)
• Ceph-mon sizing:
– 1 ceph-mon node per 15-20 OSD nodes (minimum 3 per cluster)
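As a rough sketch, the ratios above can be folded into a back-of-the-envelope per-node check; the ceiling divisions and the 9/12-disks-per-NIC midpoints below are a simplification of the ranges given, not validated limits:

```python
# Back-of-the-envelope OSD node check built from the rules of thumb above.
def osd_node_sizing(hdd_count, hdd_tb, sas=True):
    hdds_per_10g_nic = 9 if sas else 12           # 8-10 SAS or ~12 SATA per 10G NIC
    return {
        "10G NICs":     -(-hdd_count // hdds_per_10g_nic),  # ceiling division
        "journal SSDs": -(-hdd_count // 5),       # 1 SSD per 4-6 OSD drives
        "RAM (GB)":     round(hdd_count * hdd_tb),  # 1 GB RAM per 1 TB of OSD space
        "CPU cores":    0.5 * hdd_count,          # ~0.5 core / 1 GHz per OSD disk
    }

# e.g. the 20 x 1.8 TB SAS node from sizing example #2 later in the deck:
print(osd_node_sizing(20, 1.8))   # 3 NICs, 4 SSDs, 36 GB RAM, ~10 cores
```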
Rules of thumb and magic numbers #2
• For a typical “unpredictable” workload we usually assume:
– 70/30 read/write IOPS proportion
– ~4-8 KB random read pattern
• To estimate Ceph IOPS efficiency we usually take*:
– 4-8K random read: 0.88
– 4-8K random write: 0.64
* Based on benchmark data and semi-empirical evidence
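A small sketch of how these factors get applied in the cluster ratings later in the deck (raw backend IOPS multiplied by the read or write efficiency); the optional 70/30 blend is an addition, not something the deck computes:

```python
# Ceph IOPS efficiency factors from the slide (4-8K random workload).
READ_EFF, WRITE_EFF = 0.88, 0.64

def ceph_effective_iops(raw_backend_iops):
    """Separate read and write estimates, as the later cluster ratings do."""
    return raw_backend_iops * READ_EFF, raw_backend_iops * WRITE_EFF

def blended_iops(raw_backend_iops, read_share=0.70):
    """One way to blend the factors for the assumed 70/30 read/write mix."""
    return raw_backend_iops * (read_share * READ_EFF + (1 - read_share) * WRITE_EFF)

print([round(x) for x in ceph_effective_iops(100_000)])  # [88000, 64000]
print(round(blended_iops(100_000)))                      # 80800
```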
Sizing task #1
• 1 Petabyte usable (net) storage
• 500 VMs
• 100 IOPS per VM
• 50,000 IOPS aggregated
Sizing example #1 – Ceph-mon
• Ceph monitors
– Sample spec:
• HP DL360 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD: 1 x 1 TB SATA
• NIC: 1 x 1 Gig NIC, 1 x 10 Gig
– How many?
• 1 x ceph-mon node per 15-20 OSD nodes
Sizing example #2 – Ceph-OSD
• Lower density, performance optimized
– Sample spec:
• HP DL380 Gen9
• CPU: 2620v3 (6 cores)
• RAM: 64 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 20 x SAS 1.8 TB (10k RPM, 2.5 inch – HGST C10K1800)
• SSD (write journals): 4 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig NIC, 4 x 10 Gig (2 bonds – 1 for ceph-public, 1 for ceph-replication)
– 1000 / (20 * 1.8 * 0.85) * 3 ≈ 98 servers to serve 1 petabyte net
Expected performance level
• Conservative drive rating: 250 IOPS
• Cluster read rating:
– 250 * 20 * 98 * 0.88 = 431,200 IOPS
– Approx. 800 IOPS per VM capacity is available
• Cluster write rating:
– 250 * 20 * 98 * 0.64 = 313,600 IOPS
– Approx. 600 IOPS per VM capacity is available
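The same numbers worked out as a short script, assuming the 0.85 factor is a usable-capacity fill ratio and 3 is the replica count (matching the server-count formula above):

```python
# Sizing example #2 (performance-optimized) worked out end to end.
NET_TB, REPLICAS, FILL = 1000, 3, 0.85          # 1 PB net, 3x replication, 85% fill
DRIVES, DRIVE_TB, DRIVE_IOPS, VMS = 20, 1.8, 250, 500

servers = round(NET_TB / (DRIVES * DRIVE_TB * FILL) * REPLICAS)   # 98
read_iops = DRIVE_IOPS * DRIVES * servers * 0.88                  # ≈ 431,200
write_iops = DRIVE_IOPS * DRIVES * servers * 0.64                 # ≈ 313,600

print(servers, round(read_iops), round(write_iops))               # 98 431200 313600
print(round(read_iops / VMS), round(write_iops / VMS))            # 862 627 per VM
```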
Sizing example #3 – Ceph-OSD
• Higher density, cost optimized
– Sample spec:
• SuperMicro 6048R
• CPU: 2 x 2630v3 (16 cores)
• RAM: 192 GB
• HDD (OS drives): 2 x SATA 500 GB
• HDD (OSD drives): 28 x SATA 4 TB (7.2k RPM, 3.5 inch – HGST 7K4000)
• SSD (write journals): 6 x SSD 128 GB Intel S3700
• NIC: 1 x 1 Gig NIC, 4 x 10 Gig (2 bonds – 1 for ceph-public, 1 for ceph-replication)
– 1000 / (28 * 4 * 0.85) * 3 ≈ 32 servers to serve 1 petabyte net
Expected performance level
• Conservative drive rating: 150 IOPS
• Cluster read rating:
– 150 * 32 * 28 * 0.88 = 118,272 IOPS
– Approx. 200 IOPS per VM capacity is available
• Cluster write rating:
– 150 * 32 * 28 * 0.64 = 86,016 IOPS
– Approx. 150 IOPS per VM capacity is available
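Plugging the cost-optimized spec into the same arithmetic; the per-VM figures land at roughly 237 and 172, which the deck rounds down to ~200 and ~150:

```python
# Sizing example #3 (cost-optimized), same arithmetic as example #2.
DRIVES, DRIVE_TB, DRIVE_IOPS = 28, 4, 150
servers = round(1000 / (DRIVES * DRIVE_TB * 0.85) * 3)          # 32 for 1 PB net
read_iops = DRIVE_IOPS * DRIVES * servers * 0.88                # ≈ 118,272
write_iops = DRIVE_IOPS * DRIVES * servers * 0.64               # ≈ 86,016
print(servers, round(read_iops / 500), round(write_iops / 500))  # 32 237 172 per VM
```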
What about SSDs?
• In Firefly (0.8x) an OSD is limited to ~5k IOPS, so all-SSD Ceph cluster performance is disappointing
• In Giant (0.9x) ~30k IOPS per OSD should be reachable, but this has yet to be verified
Lessons learned
How we deploy (with Fuel)
Select storage options ⇒ Assign roles to nodes ⇒ Allocate disks
What we now know #1
• RBD is not suitable for IOPS-heavy workloads:
– Realistic expectations:
• 100 IOPS per VM (OSD with HDD disks)
• 10K IOPS per VM (OSD with SSD disks, pre-Giant Ceph)
• 10-40 ms latency expected and accepted
• High CPU load on OSD nodes at high IOPS (bumping up CPU requirements)
• Object storage for tenants and applications has limitations
– Swift API gaps in radosgw:
• bulk operations
• custom account metadata
• static website
• expiring objects
• object versioning
What we now know #2
• NIC bonding
– Use Linux bonds, not OVS bonds. OVS underperforms and eats CPU
• Ceph and OVS
– DON’T. Bridges for Ceph networks should be connected directly to the NIC bond, not via the OVS integration bridge
• Nova evacuate doesn’t work with Ceph RBD-backed VMs
– Fixed in Kilo
• Ceph network co-location (with Management, Public, etc.)
– DON’T. Use dedicated NICs/NIC pairs for Ceph-public and Ceph-internal (replication)
• Tested ceph.conf
– Enable RBD cache and other settings – https://bugs.launchpad.net/fuel/+bug/1374969
Questions?
Thank you!