
99.999% Available OpenStack Cloud - A Builder's Guide

Danny Al-Gaaf (Deutsche Telekom)
OpenStack Summit 2015 - Tokyo

Overview

● Motivation
● Availability and SLAs
● Data centers
  ○ Setup and failure scenarios
● OpenStack and Ceph
  ○ Architecture and Critical Components
  ○ HA setup
  ○ Quorum?
● OpenStack and Ceph == HA?
  ○ Failure scenarios
  ○ Mitigation
● Conclusions

Motivation

NFV Cloud @ Deutsche Telekom

● Datacenter design
  ○ Backend DCs
    ■ Few but classic DCs
    ■ High SLAs for infrastructure and services
    ■ For private/customer data and services
  ○ Frontend DCs
    ■ Small but many
    ■ Near to the customer
    ■ Lower SLAs, can fail at any time
    ■ NFVs:
      ● Spread over many FDCs
      ● Failures are handled by the services, not the infrastructure
● Run telco core services @ OpenStack/KVM/Ceph

Availability

● Measured relative to "100% operational"

  availability    downtime/year     classification
  99.9%           8.76 hours        high availability
  99.99%          52.6 minutes      very high availability
  99.999%         5.26 minutes      highest availability
  99.9999%        0.526 minutes     disaster tolerant
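These downtime budgets follow directly from the availability figure; a minimal sketch of the arithmetic, assuming a 365-day year:

```python
# Convert an availability percentage into the allowed downtime per year.
HOURS_PER_YEAR = 24 * 365  # 8760

def downtime_per_year(availability_percent: float) -> str:
    """Yearly downtime budget implied by an availability percentage."""
    down_hours = (1 - availability_percent / 100) * HOURS_PER_YEAR
    if down_hours >= 1:
        return f"{down_hours:.2f} hours"
    return f"{down_hours * 60:.2f} minutes"

for a in (99.9, 99.99, 99.999, 99.9999):
    print(f"{a}% -> {downtime_per_year(a)}")
# 99.9%    -> 8.76 hours
# 99.99%   -> 52.56 minutes
# 99.999%  -> 5.26 minutes
# 99.9999% -> 0.53 minutes
```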

High Availability

● Continuous system availability in case of component failures
● Which availability?
  ○ Server
  ○ Network
  ○ Datacenter
  ○ Cloud
  ○ Application/Service
● End-to-end availability most interesting

High Availability

● Calculation
  ○ Each component contributes to the service availability
    ■ Infrastructure
    ■ Hardware
    ■ Software
    ■ Processes
  ○ Likelihood of disaster and failure scenarios
  ○ Model can get very complex (see the sketch below)
● SLAs
  ○ ITIL (IT Infrastructure Library)
  ○ Planned maintenance may be excluded, depending on the SLA
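To make the calculation concrete: in the simplest model, components in series multiply their availabilities, while redundant components fail only if all of them fail. A minimal sketch with made-up component numbers, not the actual model:

```python
from math import prod

def serial(*avail: float) -> float:
    """All components must be up: availabilities multiply."""
    return prod(avail)

def parallel(*avail: float) -> float:
    """The service fails only if every redundant component fails."""
    return 1 - prod(1 - a for a in avail)

# Made-up example: a server chain behind a redundant pair of links.
chain = serial(0.999, 0.9995)   # hardware * software ~ 99.85%
links = parallel(0.999, 0.999)  # two redundant links  ~ 99.9999%
print(f"end-to-end availability: {serial(chain, links):.6f}")
```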

Data centers

Failure scenarios

● Power outage
  ○ External
  ○ Internal
  ○ Backup: UPS/generator
● Network outage
  ○ External connectivity
  ○ Internal
    ■ Cables
    ■ Switches, routers
● Failure of a server or a component
● Failure of a software service

Failure scenarios

● Human error still often the leading cause of outages
  ○ Misconfiguration
  ○ Accidents
  ○ Emergency power-off
● Disaster
  ○ Fire
  ○ Flood
  ○ Earthquake
  ○ Plane crash
  ○ Nuclear accident

Data Center Tiers


Mitigation

● Identify potential SPoFs
● Use redundant components
● Careful planning
  ○ Network design (external/internal)
  ○ Power management (external/internal)
  ○ Fire suppression
  ○ Disaster management
  ○ Monitoring
● 5-nines on the DC/HW level is hard to achieve
  ○ Tier IV is usually too expensive (compared with Tier III or III+)
  ○ Requires an HA concept on the cloud and application level

Example: Network

● Spine/leaf architecture
● Redundant
  ○ DC-R
  ○ Spine switches
  ○ Leaf switches (ToR)
  ○ OAM switches
  ○ Firewall
● Server
  ○ Redundant NICs
  ○ Redundant power lines and supplies

Ceph and OpenStack

Architecture: Ceph


Architecture: Ceph Components

● OSDs
  ○ 10s - 1000s per cluster
  ○ One per device (HDD/SSD/RAID group, SAN, …)
  ○ Store objects
  ○ Handle replication and recovery
● MONs
  ○ Maintain cluster membership and state
  ○ Use the PAXOS protocol to establish quorum consensus
  ○ Small, lightweight
  ○ Odd number
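Clients learn the cluster layout from the MON quorum and then talk to the OSDs directly. A minimal sketch using the python-rados binding; the config path and pool name are assumptions:

```python
import rados

# Connect via the MONs listed in ceph.conf; the quorum hands the client
# the current cluster maps, after which I/O goes directly to the OSDs.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
print("cluster fsid:", cluster.get_fsid())

# Write and read one object in a (hypothetical) pool named 'rbd'.
ioctx = cluster.open_ioctx('rbd')
ioctx.write_full('hello-object', b'stored and replicated by the OSDs')
print(ioctx.read('hello-object'))
ioctx.close()
cluster.shutdown()
```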

Architecture: Ceph and OpenStack


HA - Critical Components

Which services need to be HA?

● Control plane
  ○ Provisioning, management
  ○ API endpoints and services
  ○ Admin nodes
  ○ Control nodes
● Data plane
  ○ Steady states
  ○ Storage
  ○ Network

HA Setup

● Stateless services
  ○ No dependencies between requests
  ○ After the reply, no further attention is required
  ○ API endpoints (e.g. nova-api, glance-api, …) or nova-scheduler
● Stateful services
  ○ An action typically consists of multiple requests
  ○ Subsequent requests depend on the results of earlier requests
  ○ Databases, RabbitMQ

HA Setup

● Stateless
  ○ active/passive: load balance redundant services
  ○ active/active: load balance redundant services
● Stateful
  ○ active/passive: bring a replacement resource online
  ○ active/active: redundant services, all with the same state; state changes are passed to all instances

OpenStack HA


Quorum?

● Required to decide which cluster partition/member is primary, to prevent data/service corruption
● Examples:
  ○ Databases
    ■ MariaDB/Galera, MongoDB, Cassandra
  ○ Pacemaker/Corosync
  ○ Ceph monitors
    ■ Paxos
    ■ Odd number of MONs required (see the sketch below)
    ■ At least 3 MONs for HA; simple majority (2:3, 3:5, 4:7, …)
    ■ Without quorum:
      ● No changes to cluster membership (e.g. adding new MONs/OSDs)
      ● Clients can't connect to the cluster
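The majority rule is plain arithmetic: a partition keeps quorum only if it holds strictly more than half of all members, which is also why an even MON count buys no extra failure tolerance. A minimal sketch:

```python
def has_quorum(alive: int, total: int) -> bool:
    """Simple majority: strictly more than half of all members are alive."""
    return alive > total // 2

# An even count tolerates no more failures than the odd count below it.
for total in (3, 4, 5):
    tolerated = max(f for f in range(total) if has_quorum(total - f, total))
    print(f"{total} MONs tolerate {tolerated} failure(s)")
# 3 MONs tolerate 1 failure(s)
# 4 MONs tolerate 1 failure(s)
# 5 MONs tolerate 2 failure(s)
```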

OpenStack and Ceph == HA ?

SPoF

● OpenStack HA
  ○ No SPoF assumed
● Ceph
  ○ No SPoF assumed
  ○ Availability of RBDs is critical to VMs
  ○ Availability of the RadosGW can easily be managed via HAProxy
● What about failures on a higher level?
  ○ Data center cores or fire compartments
  ○ Network
    ■ Physical
    ■ Misconfiguration
  ○ Power

Setup - Two Rooms


Failure scenarios - FC fails


Failure scenarios - Split brain

● Ceph
  ○ Quorum selects B
  ○ Storage in A stops
● OpenStack HA
  ○ Selects B
  ○ VMs in B still running
● Best-case scenario

Failure scenarios - Split brain

● Ceph
  ○ Quorum selects B
  ○ Storage in A stops
● OpenStack HA
  ○ Selects A
  ○ VMs in A and B stop working
● Worst-case scenario
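The underlying problem is that the Ceph MON quorum and the OpenStack HA cluster (e.g. Pacemaker/Corosync) vote independently, so after a partition they may survive in different rooms. A minimal sketch of that reasoning, with made-up member placements:

```python
def surviving_room(members_per_room):
    """Return the room whose partition keeps a strict majority, or None."""
    total = sum(members_per_room.values())
    for room, members in members_per_room.items():
        if members > total // 2:
            return room
    return None  # no partition has quorum

# Hypothetical two-room layout with independently placed quorums.
ceph_mons = {"A": 1, "B": 2}  # Ceph MON majority sits in room B
pacemaker = {"A": 2, "B": 1}  # OpenStack HA majority sits in room A

storage = surviving_room(ceph_mons)  # -> 'B'
control = surviving_room(pacemaker)  # -> 'A'
if storage != control:
    # Worst case: VMs are kept where there is no storage, so all stop.
    print(f"split decision: storage in {storage}, control in {control}")
```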

Other issues

● Replica distribution
  ○ Two-room setup:
    ■ 2 or 3 replicas carry the risk of having only one replica left
    ■ Would require 4 replicas (2:2)
      ● Reduced performance
      ● Increased traffic and costs
  ○ Alternative: erasure coding
    ■ Reduced performance, less space required (see the overhead sketch below)
● Spare capacity
  ○ The remaining room requires spare capacity to restore
  ○ Depends on
    ■ Failure/restore scenario
    ■ Replication vs erasure coding
  ○ Costs
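For the capacity argument, the raw-space overhead is easy to quantify. A minimal sketch comparing 4-way replication with an erasure-coding profile of k=4 data and m=2 coding chunks (illustrative numbers, not a recommendation):

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per usable byte with n-way replication."""
    return float(replicas)

def ec_overhead(k: int, m: int) -> float:
    """Raw bytes stored per usable byte with erasure coding (k data, m coding)."""
    return (k + m) / k

print(f"4 replicas (2:2 rooms): {replication_overhead(4):.1f}x raw capacity")
print(f"EC k=4, m=2:            {ec_overhead(4, 2):.1f}x raw capacity")
# 4 replicas (2:2 rooms): 4.0x raw capacity
# EC k=4, m=2:            1.5x raw capacity
```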

Mitigation - Three FCs

● Third FC/failure zone hosting all services
● Usually higher costs
● More resistant to failures
● Better replica distribution
● More east/west traffic

Mitigation - Quorum Room

● Most DCs have backup rooms
● Only a few servers are needed to host the quorum-related services
● Less cost-intensive
● Can mitigate split brain between FCs (depending on network layout)
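The same majority arithmetic shows why a small quorum room helps: with, say, two MONs per main room and a fifth in the quorum room, losing either main room still leaves a majority (placement numbers are illustrative):

```python
# Hypothetical placement: 2 MONs per main room plus 1 in the quorum room.
mons = {"A": 2, "B": 2, "Q": 1}
total = sum(mons.values())  # 5

for failed in ("A", "B"):
    alive = total - mons[failed]
    status = "quorum kept" if alive > total // 2 else "quorum lost"
    print(f"room {failed} fails: {alive}/{total} MONs alive, {status}")
# room A fails: 3/5 MONs alive, quorum kept
# room B fails: 3/5 MONs alive, quorum kept
```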

Mitigation - Pets vs Cattle

● NO pets allowed !!!
● Only cloud-ready applications

Mitigation - Failure tolerant applications

● The tier level is not the most relevant layer
● Applications must build their own cluster mechanisms on top of the DC
  → increases availability significantly
● Data replication must be done across multiple regions
● In case of a disaster, route traffic to a different DC
● Many VNFs (virtual network functions) already support such setups

Mitigation - Federated Object Stores

● The best way to synchronize and replicate data across multiple DCs is to use object storage
● Sync is done asynchronously

Open issues:
● Doesn't solve replication of databases
● Many applications don't support object storage and need to be adapted (see the sketch below)
● Applications also need to support regions/zones
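Adapting an application usually means writing through an S3- or Swift-compatible API instead of a local filesystem. A minimal sketch against a RadosGW S3 endpoint using boto3; endpoint, credentials, and bucket name are all made up:

```python
import boto3

# Hypothetical RadosGW endpoint; in a federated setup the gateways
# replicate objects asynchronously to the other data centers.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.dc1.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="app-data")
s3.put_object(Bucket="app-data", Key="state/config.json",
              Body=b'{"active_dc": "dc1"}')
print(s3.get_object(Bucket="app-data", Key="state/config.json")["Body"].read())
```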

Mitigation - Outlook

● "OpenStack follows storage"
  ○ Use RBDs as fencing devices
  ○ Extend Ceph MONs
    ■ Include information about physical placement, similar to the CRUSH map
    ■ Enable the HA setup to query quorum decisions and map the quorum to the physical layout
● Passive standby Ceph MONs to ease deployment of MONs if quorum fails
  ○ http://tracker.ceph.com/projects/ceph/wiki/Passive_monitors
● Generic quorum service/library?

Conclusions

● OpenStack and Ceph provide HA if carefully planned
  ○ Be aware of potential failure scenarios!
  ○ All quorums must be kept in sync
  ○ A third room must be used
  ○ Replica distribution and spare capacity must be considered
  ○ Ceph needs more extended quorum information
● The target for five 9's is E2E
  ○ Five 9's on the data center level is very expensive
  ○ No pets !!!
  ○ Distribute applications or services over multiple DCs

Get involved!

● Ceph
  ○ https://ceph.com/community/contribute/
  ○ [email protected]
  ○ IRC (OFTC): #ceph, #ceph-devel
  ○ Ceph Developer Summit
● OpenStack
  ○ Cinder, Glance, Manila, ...

Danny Al-Gaaf
Senior Cloud Technologist

Email: [email protected]
IRC: dalgaaf
LinkedIn: linkedin.com/in/dalgaaf

Q&A - THANK YOU!