Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
TRANSCRIPT
Yuming Ma, Architect
Cisco Cloud Services
Ceph Day, Portland, Oregon, May 25th, 2016
Stabilizing Petabyte Ceph Cluster in OpenStack Cloud
Highlights
1. What are we doing with Ceph?
2. What did we start with?
3. We need a bigger boat
4. Getting better and sleeping through the night
5. Lessons learned
Background
Cisco Cloud Services provides an OpenStack platform to Cisco SaaS applications and tenants through a worldwide deployment of datacenters.

SaaS cases:
• Collaboration
• IoT
• Security
• Analytics
• "Unknown projects"

Swift:
• Database (Trove) backups
• Static content
• Cold/offline data for Hadoop

Cinder:
• Generic/magnetic volumes
• Low performance
How Do We Use Ceph?
• Boot volumes for all VM flavors except those with ephemeral (local) storage
• Glance image store
• Generic Cinder volumes
• Swift object store
• In production since March 2014
• 13 clusters in production in two years
• Each cluster is 1800TB raw over 45 nodes and 450 OSDs

[Diagram: Cisco UCS Ceph high-performance platform serving generic and provisioned-IOPS volumes through the Cinder API and objects through the Swift API]
Growth: It will happen, just not sure when
• Nice consistent growth…
• Your users will not warn you before:
  • "Going live"
  • Migrating out of S3
  • Backing up a Hadoop HDFS
• Stability problems started after 50% used
CCS Ceph 1.0

[Diagram: three racks of 15 UCS C240 nodes; each OSD node runs 10 HDDs (slots 1-10) behind an LSI 9271 HBA, each HDD carrying an XFS data partition and a journal partition for OSD1…OSD10, with the OS on a RAID1 mirror (slots 11-12); 2x10Gb private cluster network and 2x10Gb public network; OpenStack (Keystone, Swift, Cinder, Glance and Nova APIs) reaches the cluster through the RADOS Gateway and the Ceph block device (RBD) via libvirt/KVM, all built on the Ceph libRADOS API; monitors spread across the racks.]

OSD: 45 x UCS C240 M3
• 2x E5-2690 v2, 40 HT cores
• 64GB RAM
• 2x10Gb for public
• 2x10Gb for cluster
• 3x replication
• LSI 9271 HBA
• 10 x 4TB HDD, 7200 RPM
• 10GB journal partition from HDD
• RHEL 3.10.0-229.1.2.el7.x86_64

NOVA: UCS C220
• Ceph 0.94.1
• RHEL 3.10.0-229.4.2.el7.x86_64

MON/RGW: UCS C220 M3
• 2x E5-2680 v2, 40 HT cores
• 64GB RAM
• 2x10Gb for public
• 4x 3TB HDD, 7200 RPM
• RHEL 3.10.0-229.4.2.el7.x86_64

Started with Cuttlefish/Dumpling
Initial Design Considerations
• Get to MVP and keep costs down
• High capacity, hence C240 M3 LFF for 4TB HDDs
• Tradeoff was that the C240 M3 LFF could not also accommodate SSDs, so journals were collocated on the OSD data disks
• Monitors were on HDD-based systems as well
Major Stability Problems: Monitors

Problem: MON election storm impacting client IO
Impact: Monmap changes due to a flaky NIC or chatty messaging between MON and client caused an unstable quorum and an election storm between MON hosts. Results: blocked and slowed client IO requests.

Problem: LevelDB inflation
Impact: LevelDB size grows to XXGB over time, which prevents the MON daemon from serving OSD requests. Results: blocked IO and slow requests.

Problem: DDoS due to chatty client message attack
Impact: Slow responses from MON to clients, caused by LevelDB or an election storm, trigger a message-flood attack from clients. Results: failed client operations, e.g. volume creation, RBD connection.
Major Stability Problems: Cluster

Problem: Backfill & recovery impacting client IO
Impact: Osdmap changes due to the loss of a disk result in PG peering and backfilling. Results: clients receive blocked and slow IO.

Problem: Unbalanced data distribution
Impact: Data on OSDs isn't evenly distributed; the cluster may be 50% full, but some OSDs are at 90%. Results: backfill isn't always able to complete.

Problem: Slow disk impacting client IO
Impact: A single slow (sick, not dead) OSD can severely impact many clients until it's ejected from the cluster. Results: clients have slow or blocked IO.
Stability Improvement Strategy

Strategy: Client IO throttling*
Improvement: Rate-limit IOPS at the Nova host to 250 IOPS per volume.

Strategy: Backfill and recovery throttling
Improvement: Reduced IO consumption by backfill and recovery processes so they yield to client IO.

Strategy: Retrofit with NVMe (PCIe) journals
Improvement: Increased overall IOPS of the cluster.

Strategy: Upgrade to 1.2.3/1.3.2
Improvement: Overall stability, and hardened MONs that prevent election storms.

Strategy: LevelDB on SSD (replaced entire MON node)
Improvement: Faster cluster map queries.

Strategy: Re-weight by utilization
Improvement: Balanced data distribution.

*Client is the RBD client, not the tenant.
Client IO Throttling
• Limit max/cap IO consumption at the qemu layer:
  • iops (IOPS, read and write): 250
  • bps (bytes per second, read and write): 100 MB/s
• Predictable and controlled IOPS capacity
• NO min/guaranteed IOPS -> future Ceph feature
• NO burst map -> qemu feature:
  • iops_max: 500
  • bps_max: 120 MB/s

[Charts: per-volume IOPS swing ~100% before throttling vs. ~12% after]
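The deck doesn't show how these qemu-layer caps were delivered; a minimal sketch, assuming they are expressed as Cinder front-end QoS specs attached to a volume type (the spec name and the QoS-spec approach are assumptions, not confirmed by the slides):

# Front-end QoS spec: enforced by qemu/libvirt on the Nova host.
# total_iops_sec / total_bytes_sec are standard Cinder QoS keys.
cinder qos-create rbd-throttle consumer=front-end \
    total_iops_sec=250 total_bytes_sec=104857600   # 250 IOPS, 100 MB/s
# Bind the spec to the volume type backing the generic Ceph volumes
cinder qos-associate <qos-spec-id> <volume-type-id>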
Backfill and Recovery Throttling
• Problem:
  • Blocked IO during peering
  • Slow requests during backfill
  • Both could cause client IO stalls and vCPU soft lockups
• Solution: throttle backfill and recovery (see the runtime sketch below):
  osd recovery max active = 3 (default: 15)
  osd recovery op priority = 3 (default: 10)
  osd max backfills = 1 (default: 10)
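These values can also be pushed into a running cluster without restarting daemons; a small sketch using injectargs with the settings from this slide:

# Apply the throttles to every OSD at runtime
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 3'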
NVMe Journaling: Performance Testing Setup

[Diagram: test topology with three racks of OSD nodes (Rack-1: 6 nodes, Rack-2: 5 nodes, Rack-3: 6 nodes; osd1…osd10 per node) and Nova hosts nova1…nova10, each running vm1…vm20]

[Diagram: per-node disk layout with 10 HDDs, each a single-disk RAID0 volume behind the LSI 9271 HBA carrying an XFS data partition for OSD1…OSD10; 10 journal partitions on the NVMe device, starting at 4MB (sector 1024), 10GB each with a 4MB offset in between, leaving ~300GB free; OS on a RAID1 mirror (slots 11-12)]

OSD: C240 M3
• 2x E5-2690 v2, 40 HT cores
• 64GB RAM
• 2x10Gb for public
• 2x10Gb for cluster
• 3x replication
• Intel P3700 400GB NVMe
• LSI 9271 HBA
• 10x 4TB HDD, 7200 RPM

Nova: C220
• 2x E5-2680 v2, 40 HT cores
• 380GB RAM
• 2x10Gb for public
• 3.10.0-229.4.2.el7.x86_64
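The deck doesn't show how the NVMe was carved up; a sketch of one way to produce that layout with parted (the device name is an assumption):

# GPT label, then 10 journal partitions: the first starts at 4MiB,
# each is 10GiB long with a 4MiB gap before the next
parted -s /dev/nvme0n1 mklabel gpt
for i in $(seq 0 9); do
    start=$(( 4 + i * (10240 + 4) ))   # offsets in MiB
    end=$(( start + 10240 ))
    parted -s /dev/nvme0n1 mkpart journal-$i ${start}MiB ${end}MiB
done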
NVMe Journaling: Performance Tuning

OSD host iostat:
• Both NVMe and HDD disk %util are low most of the time, with spikes every ~45s.
• Both NVMe and HDD have a very low queue size (iodepth) while the frontend VM pushes a queue depth of 16 to FIO.
• CPU %used is reasonable, converging at <30%, but iowait is low, which corresponds to the low disk activity.
NVMe Journaling: Performance Tuning
Tuning directions, to increase disk %util (an illustrative config point follows below):
• Disk threads: 4, 16, 32
• Filestore max sync interval: 0.1, 0.2, 0.5, 1, 5, 10, 20
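One point in that sweep, written as ceph.conf settings; mapping "disk threads" to osd_disk_threads is my interpretation of the slide, and the values are picked from the ranges above purely for illustration:

[osd]
osd disk threads = 4               # threads for disk-intensive background OSD work
filestore max sync interval = 5    # max seconds between filestore syncs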
NVMe Performance Tuning

• These two tunings showed no impact:
  filestore_wbthrottle_xfs_ios_start_flusher: default 500 vs. 10
  filestore_wbthrottle_xfs_inodes_start_flusher: default 500 vs. 10

• Final config:
  osd_journal_size = 10240 (default: 5120)
  journal_max_write_entries = 1000 (default: 100)
  journal_max_write_bytes = 1048576000 (default: 10485760)
  journal_queue_max_bytes = 1048576000 (default: 10485760)
  filestore_queue_max_bytes = 1048576000 (default: 10485760)
  filestore_queue_committing_max_bytes = 1048576000 (default: 10485760)
  filestore_wbthrottle_xfs_bytes_start_flusher = 4194304 (default: 10485760)

• Linear tuning of filestore_wbthrottle_xfs_bytes_start_flusher, filestore_wbthrottle_xfs_inodes_start_flusher and filestore_wbthrottle_xfs_ios_start_flusher
NVMe Stability Improvement Analysis

Backfill and recovery config:
  osd recovery max active = 3 (default: 15)
  osd max backfills = 1 (default: 10)
  osd recovery op priority = 3 (default: 10)

Server impact:
• Shorter recovery time

Client impact:
• <10% impact (tested without IO throttling; should be less with throttling)
MON LevelDB Issues
• LevelDB:
  • Key-value store for cluster metadata, e.g. osdmap, pgmap, monmap, clientID, authID, etc.
  • Not in the data path
  • Impactful to IO operations: IO is blocked by DB queries
  • Larger size means longer query times, hence longer IO waits -> slow requests
• Solution:
  • LevelDB on SSD to increase disk IO rate
  • Upgrade to Hammer to reduce DB size
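Not on the slide, but related: LevelDB growth on a monitor can also be trimmed by compaction, either on demand or at daemon start; a hedged sketch:

# One-off compaction of a monitor's LevelDB store
ceph tell mon.<id> compact

# Or compact automatically at every monitor start (ceph.conf)
[mon]
mon compact on start = true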
MON LevelDB on SSD
New BOM:
• UCS C220 M4 with 120GB SSD

[Charts: write wait time with LevelDB on HDD vs. write wait time with LevelDB on SSD]
MON Cluster Hardening
• Problem:
  • Election storms & LevelDB inflation
• Solutions:
  • Upgrade to 1.2.3 to fix election storms
  • Upgrade to 1.3.2 to fix LevelDB inflation
  • Configuration changes:

  [mon]
  mon_lease = 20 (default: 5)
  mon_lease_renew_interval = 12 (default: 3)
  mon_lease_ack_timeout = 40 (default: 10)
  mon_accept_timeout = 40 (default: 10)

  [client]
  mon_client_hunt_interval = 40 (default: 3)
Data Distribution and Balance
• Problem:
  • High skew of disk %used that prevents data intake even when cluster capacity allows
• Impact:
  • Unbalanced PG distribution impacts performance
  • Rebalancing is impactful as well
• Solution:
  • Upgrade to Hammer 1.3.2+patch
  • Re-weight by utilization: >10% delta (see the sketch below)

[Chart: us-internal-1 disk %used across ~436 OSDs]
Cluster: 67.9% full. OSDs: min 47.2%, max 83.5%, mean 69.6%, stddev 6.5.
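A sketch of that re-weight step; in Hammer, ceph osd reweight-by-utilization takes a threshold in percent of mean utilization, so 110 corresponds to the >10% delta above:

# Reduce the override weight of any OSD more than 10% above mean utilization
ceph osd reweight-by-utilization 110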
Proactive Detection of Slow Disks
• Problem:
  • RBD image data is distributed across all disks, so a single slow disk can impact critical data IO
• Solution: proactively detect slow disks (a monitoring sketch follows below)
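The deck doesn't show the detection mechanism; a minimal sketch, polling per-OSD latency with ceph osd perf and flagging outliers (the 100ms threshold is an assumption):

# Output columns: osd, fs_commit_latency(ms), fs_apply_latency(ms).
# Skip the header line, then flag OSDs with high commit latency.
ceph osd perf | awk 'NR > 1 && $2 > 100 { print "slow OSD:", $1, "commit latency", $2 "ms" }'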
Lessons Learned
• Set clear stability goals
• You can plan for everything except how tenants will use it
• Monitor everything, but not "everything"
  • Turn down logging… it is possible to send 900k logs in 30 minutes
• Look for issues in services that consume storage
  • Had 50TB of "deleted volumes" that weren't
Last Lesson…
• DevOps
  • It's not just technology, it's how your team operates as a team
  • Share knowledge
  • Manage your backlog, and manage your management
• Consistent performance and stability modeling
  • Rigorous testing
  • Determine requirements and architect to them
  • Balance performance, cost and time
• Automate builds and rebuilds
  • Shortcuts create technical debt