RightScale Webinar: Outage-Proof Your Cloud Applications

Outage-Proof Your Cloud Applications Brian Adler, Sr. Services Architect Roberto Monge, Cloud Solutions Engineer RightScale December 18, 2012 Watch the video of this webinar


Outage-Proof Your Cloud Applications

Brian Adler, Sr. Services Architect

Roberto Monge, Cloud Solutions Engineer

RightScale

December 18, 2012

Watch the video of this webinar

# 2

Cloud Management

# 2

#rightscale

Your Panel Today

Presenting
• Brian Adler, Sr. Services Architect, RightScale
• Roberto Monge, Cloud Solutions Engineer, RightScale

Q&A
• Spencer Adams, Account Manager, RightScale
• Noel Cohen, Account Manager, RightScale

Please use the “Questions” window to ask questions any time!

# 3


Agenda

• High Availability and Disaster Recovery
  • Terminology/Level-Setting
  • Designing for Failure
  • Cloud and component definitions
  • HA and DR configurations
• Conclusions / Q&A

# 4


Terminology

• High Availability (HA): Fault Tolerant systems are measured by their Availability in terms of planned and unplanned service outages for end users
• Disaster Recovery (DR): The process, policies and procedures related to restoring critical systems after a catastrophic event
• Fault Tolerance: Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail

# 5


Designing for Failure

Large scale failures in the cloud are rare but do happen

Need to balance cost and complexity of HA efforts against risks you are willing to bear

Application owners are ultimately responsible for availability and recoverability

Cloud infrastructure has made DR and HA remarkably affordable

• Multi-server
• Multi-Zone
• Multi-Region
• Multi-Cloud

# 6


Cloud Isolation Definitions

| | Region | Zone |
|---|---|---|
| Resources | One or more geographically proximate Zones | Datacenter with separate power source |
| API endpoint, control plane | Shared | Shared |
| Local Area Network | Shared | Shared |

| Clouds | Region | Zone |
|---|---|---|
| Amazon Web Services | Region | Availability Zone |
| Rackspace | Region | - |
| Windows Azure | Region | - |
| Google Cloud Platform | Region | Availability Group |
| CloudStack | Region | Zone |
| OpenStack | Zone | Availability Zone |

# 7


Multi-Zone HA

[Diagram: DNS, load balancers, autoscaling app servers, and a master DB replicating to a slave DB, spread across zones US-EAST 1a and US-EAST 1b (addresses 172.168.7.31 and 172.168.8.62), with EBS volumes and snapshots in S3]

• Snapshot the data volume for backups so the database can be readily recovered within the region.
• Place slave databases in one or more zones for failover.
• Consider local storage for an additional slave database to remove the dependency on an attached volume.
• Consider distributed NoSQL databases with the same distribution considerations.
• Spread primary and replica nodes across multiple zones. Place as many as you need for required resiliency.
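The placement advice above can be sketched as a round-robin assignment of nodes to zones, followed by a check of what survives a zone loss. This is only an illustration: the node names and the helper functions `spread_across_zones` and `survivors` are hypothetical, and the zone names simply echo the slide's US-EAST 1a/1b labels.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(nodes, zones):
    """Assign nodes to zones round-robin so the nodes are spread evenly."""
    if not zones:
        raise ValueError("at least one zone is required")
    return {node: zone for node, zone in zip(nodes, cycle(zones))}


def survivors(placement, failed_zone):
    """Return the nodes still running after one zone is lost."""
    return [node for node, zone in placement.items() if zone != failed_zone]


# Hypothetical node names; zones follow the slide's labels.
placement = spread_across_zones(
    ["db-master", "db-slave-1", "db-slave-2", "db-slave-3"],
    ["us-east-1a", "us-east-1b"],
)
print(Counter(placement.values()))          # nodes per zone
print(survivors(placement, "us-east-1a"))   # replicas left if 1a fails
```

With two zones, losing the master's zone still leaves replicas in the other zone that could be promoted.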

# 8


Multi-Region/Cloud DR Options

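| Option | Note | Downtime | Availability | Relative cost |
|---|---|---|---|---|
| Cold DR | Most Common | > 1 Hour | 99% | $ |
| Warm DR | Recommended | < 1 Hour | 99.5% | $$ |
| Hot DR | Least Common | < 5 Mins | 99.9% | $$$ |
| Multi-Cloud HA | Live/Live Config | 0 | 99.999% | $$$$ |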
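To put the availability figures on this slide in perspective, an availability percentage can be converted into a yearly downtime budget. The sketch below assumes a 365-day year; the function name `downtime_hours_per_year` is made up for illustration and is not from the webinar.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of outage per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# The four availability levels shown on the slide.
for target in (99.0, 99.5, 99.9, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min)")
```

For example, 99% allows roughly 87.6 hours of downtime a year, while 99.999% allows only a little over five minutes.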

# 9


Multi-Region Cold DR

[Diagram: DNS (172.168.7.31), load balancers, app servers, a master DB replicating to slave DBs, across the DALLAS and CHICAGO regions, with snapshots in CBS and Cloud Files]

Staged Server Configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online

# 10


Multi-Region Warm DR

[Diagram: DNS (172.168.7.31), load balancers, app servers, a master DB replicating to slave DBs, across the DALLAS and CHICAGO regions, with snapshots in CBS and Cloud Files in both regions]

Staged Server Configuration, pre-staged data and running Slave Database Server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery

# 11


Multi-Region Hot DR

[Diagram: DNS (172.168.7.31), load balancers and app servers running in both the DALLAS and CHICAGO regions, a master DB replicating to slave DBs, with snapshots in CBS and Cloud Files in both regions]

Parallel Deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery

# 12


Multi-Cloud HA

[Diagram: DNS (172.168.7.31 and 172.168.8.62), load balancers and app servers running in both CHICAGO and US-EAST, a master DB replicating to slave DBs, with snapshots in SWIFT and in EBS/S3]

Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage

# 13


Multi-Cloud HA

[Diagram: DNS (172.168.7.31 and 172.168.8.62), load balancers and app servers running in both CHICAGO and US-EAST, a master DB replicating to slave DBs, with snapshots in SWIFT and in S3 from an EBS volume]

Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security is an issue as security groups are Region-specific.
• Machine Images are specific to the cloud/region.
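The Live/Live configuration depends on something deciding which site each client is sent to. The following is a minimal sketch of that decision, not a description of any particular DNS or global load balancing product: the function `pick_site`, the geography labels, and the `PREFERRED` mapping are all assumptions made for illustration, with site names borrowed from the slide.

```python
def pick_site(client_geo, preferred_site, healthy_sites):
    """Choose a site for a client: its regional site if healthy, else any healthy one."""
    site = preferred_site.get(client_geo)
    if site in healthy_sites:
        return site
    if healthy_sites:
        return sorted(healthy_sites)[0]  # deterministic fallback choice
    raise RuntimeError("no healthy site available")


# Hypothetical geography-to-site mapping; site names follow the slide labels.
PREFERRED = {"east-coast": "US-EAST", "midwest": "CHICAGO"}

normal = pick_site("midwest", PREFERRED, {"US-EAST", "CHICAGO"})
failover = pick_site("midwest", PREFERRED, {"US-EAST"})
print(normal, failover)
```

In practice the health information would come from monitoring, and the answer would be served through DNS or a global load balancer, which is part of the extra complexity the slide warns about.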

# 14


In the Dashboard

[Screenshot callouts]
• Multi-region or cloud
• Multi-region Warm DR
• Staged servers
• Cost forecasting for DR environment

# 15


Automating HA and DR

• Use dynamic DNS for your database servers
  • Allow app servers to use a single FQDN.
  • Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
  • App servers can connect to all load balancers automatically at launch
  • No manual intervention
  • No DNS modifications
• Automated promotion of slave to master
  • Process is automated
  • Decision to run process is manual
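Why a low TTL matters can be shown with a small simulation of a caching resolver. This is a sketch under stated assumptions, not RightScale's implementation: the `CachingResolver` class, the FQDN `db-master.example.com`, the addresses, and the fake clock are all hypothetical.

```python
class CachingResolver:
    """Caches each FQDN's address for ttl_seconds, as a client-side resolver would."""

    def __init__(self, records, ttl_seconds, clock):
        self.records = records          # authoritative data: FQDN -> address
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # callable returning the current time in seconds
        self._cache = {}                # FQDN -> (address, expiry time)

    def resolve(self, fqdn):
        now = self.clock()
        cached = self._cache.get(fqdn)
        if cached and now < cached[1]:
            return cached[0]            # still within the TTL: reuse the old answer
        address = self.records[fqdn]    # TTL expired or first lookup: ask again
        self._cache[fqdn] = (address, now + self.ttl_seconds)
        return address


now = [0]  # fake clock, in seconds
records = {"db-master.example.com": "10.0.0.10"}
resolver = CachingResolver(records, ttl_seconds=60, clock=lambda: now[0])

before = resolver.resolve("db-master.example.com")   # answer cached until t=60
records["db-master.example.com"] = "10.0.1.20"       # slave promoted, record updated
now[0] = 30
stale = resolver.resolve("db-master.example.com")    # TTL not expired: old master
now[0] = 60
fresh = resolver.resolve("db-master.example.com")    # TTL expired: new master
print(before, stale, fresh)
```

App servers keep using the old master's address until the cached answer expires, so the TTL puts an upper bound on how long they point at a failed master after the record changes.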

# 16


How RightScale makes it possible

MultiCloud Images
• MultiCloud Images can be launched across regions and clouds without modification

1. ServerTemplate contains a list of MultiCloud Images (MCIs)
2. When the Server is created, a specific MCI is chosen.
3. The appropriate RightImage is used at launch.

[Diagram: a ServerTemplate listing MCIs, each mapping clouds (A, B, C) to images; the selected MCI resolves to one image on the target cloud. RightImage: stability across clouds]
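The three steps on this slide amount to a two-level lookup: a template holds several MultiCloud Images, and each MultiCloud Image maps a cloud to a concrete image. The data model below is an illustrative sketch only and does not reflect the RightScale API; the class fields, the `image_for` method, and all cloud and image identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MultiCloudImage:
    name: str
    images: dict  # cloud name -> image identifier on that cloud


@dataclass
class ServerTemplate:
    name: str
    mcis: list = field(default_factory=list)

    def image_for(self, mci_name, cloud):
        """Resolve the concrete image for the chosen MCI on the target cloud."""
        for mci in self.mcis:
            if mci.name == mci_name:
                if cloud not in mci.images:
                    raise LookupError(f"{mci_name} has no image for {cloud}")
                return mci.images[cloud]
        raise LookupError(f"unknown MCI {mci_name}")


# Hypothetical template with two MCIs covering different sets of clouds.
template = ServerTemplate("app-server", [
    MultiCloudImage("image-1", {"cloud-a": "img-a1", "cloud-b": "img-b1"}),
    MultiCloudImage("image-2", {"cloud-a": "img-a2", "cloud-c": "img-c2"}),
])
print(template.image_for("image-1", "cloud-b"))
```

The same template can then be launched on another cloud by changing only the `cloud` argument, which is the property the slide describes as launching "without modification".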

# 17


How RightScale makes it possible

ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration

# 18


DR Cost Comparison Example

| | Multi-Region Cold DR | Multi-Region Warm DR | Multi-Region Hot DR |
|---|---|---|---|
| Total | $4480 / month | $5630 / month | $8800 / month |
| Running | $4470 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge) | $5540 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge) | $8440 / month: 6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge) |
| Staged | $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge) | $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge) | - |
| Replication | $10 / month: 25GB / day cross-zone | $90 / month: 25GB / day cross-region | $360 / month: 100GB / day cross-region |
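The totals on this slide are the sum of the running, staged, and replication lines. The short sketch below redoes that arithmetic with the slide's own figures and also prints each option's premium over Cold DR; the dictionary layout and the `monthly_total` helper are assumptions for illustration, and Hot DR is entered with a staged cost of 0 because the slide lists no staged servers for it.

```python
# Monthly costs in USD, copied from the slide.
SCENARIOS = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},  # no staged servers listed
}


def monthly_total(costs):
    """Sum the running, staged, and replication cost lines."""
    return costs["running"] + costs["staged"] + costs["replication"]


totals = {name: monthly_total(costs) for name, costs in SCENARIOS.items()}
baseline = totals["Cold DR"]
for name, total in totals.items():
    premium = (total - baseline) / baseline * 100
    print(f"{name}: ${total} / month ({premium:.0f}% over Cold DR)")
```

Staged servers cost nothing while they are not running, which is why Warm DR adds only the extra slave database and the cross-region replication traffic.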

# 19


Most Common Observed Cloud Outages

• Outage of specific services in a zone
  • Degraded performance
  • E.g. EBS, ELB, RDS
• Outage of specific services in a region
  • Control plane error or cascading problems
  • E.g. EBS
• Outage of power or network in a zone
  • No connectivity
  • E.g. EC2, Azure
• Capacity availability in a region during an outage
  • Not possible to provision instances, volumes, or other services

# 20


Outage-Proofing Best Practices

Resource Placement
• Place in >1 zone:
  • Load balancers
  • App servers
  • Databases
• Maintain capacity to absorb zone or region failures

Application Design
• Design stateless apps for resilience to reboot / relaunch

Replication and Failover
• Replicate data across zones
• Backup across regions & clouds
• Monitor, alert, and automate operations to speed up failover
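"Maintain capacity to absorb zone or region failures" has a simple sizing consequence: the zones that survive must be able to carry the whole load. A minimal sketch of that calculation follows; the function name `servers_per_zone` and the example numbers are illustrative assumptions, not figures from the webinar.

```python
import math


def servers_per_zone(required_servers, zones, zones_lost=1):
    """Servers to run in each zone so the surviving zones still carry the full load."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("need at least one surviving zone")
    return math.ceil(required_servers / surviving)


# Example: the load needs 6 app servers at peak.
for zones in (2, 3, 4):
    per_zone = servers_per_zone(6, zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} total")
```

Spreading across more zones reduces the overhead: with two zones every server must be duplicated, while with three zones each zone only needs to hold half of the required capacity.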

# 21


Next Steps

• Learn: Building Scalable Applications in the Cloud Whitepaper
  • http://www.rightscale.com/info_center/white-papers/building-scalable-applications-in-the-cloud.php
• Analyze: Deployment review of your environment
  • http://www.rightscale.com/about_us/contact_us.php
• Try: Free Edition
  • www.rightscale.com/free

Contact RightScale: (866) 720-0208

[email protected]