# 1
Safeguard Your Cloud Applications:
Ensuring High Availability and Disaster Recovery Plans
March 14, 2013
# 2#
Your Speakers Today
Presenting
• Miles Ward, Advanced Solutions Architecture, AWS
• Brian Adler, Sr. Services Architect, RightScale
Q&A
• Ryan Geyer, Cloud Solutions Engineer, RightScale
• Greg Goodwin, Account Manager, RightScale
Please use the “Questions” window
to ask questions any time!
# 3#
Learn at our Events
RightScale Annual Conference: Special offer for webinar attendees
10% off conference registration
10% off any training
Expires March 22
COMPUTE_Webinar_10
AWS Summits in a city near you
NYC, SF, London, Sydney and more
https://aws.amazon.com/aws-summit-2013/
# 4#
Agenda
• Terminology/Level-Setting
• Takeaways
• Cloud and Component Definitions
• Designing for Failure
• Architectural Options and Considerations
-High Availability
-Disaster Recovery
• Conclusions / Q&A
# 5#
Faults?
• Facilities
• Hardware
• Networking
• Code
• People
# 6#
What is “Fault-Tolerant”?
• Degrees of risk mitigation - not binary
• Automated
• Tested!
# 7#
Old School Fault-Tolerance: Build Two
# 8#
Cloud Computing Benefits
• No Up-Front Capital Expense
• Pay Only for What You Use
• Self-Service Infrastructure
• Easily Scale Up and Down
• Improve Agility & Time-to-Market
• Low Cost
Deploy
# 9#
Cloud Computing Fault-Tolerance Benefits
The benefits translate!
• No Up-Front HA Capital Expense
• Pay for DR Only When You Use It
• Self-Service DR Infrastructure
• Easily Deliver Fault-Tolerant Applications
• Improve Agility & Time-to-Recovery
• Low Cost Backups
Deploy
# 10#
AWS Cloud allows Overcast Redundancy
Have the shadow duplicate of your infrastructure ready to go when you need it…
…but only pay for what you actually use
# 11#
Old Barriers to HA are now Surmountable
• Cost
• Complexity
• Expertise
# 12#
AWS Building Blocks: Two Strategies
Inherently fault-tolerant services
Services that are fault-tolerant with the right architecture
• Amazon EC2
• Amazon Virtual Private Cloud (Amazon VPC)
• Amazon Elastic Block Store (EBS)
• Amazon Relational Database Service (Amazon RDS)
• Amazon S3
• Amazon SimpleDB
• Amazon DynamoDB
• Amazon CloudFront
• Amazon SWF
• Amazon SQS
• Amazon SNS
• Amazon SES
• Amazon Route 53
• Elastic Load Balancing
• AWS Elastic Beanstalk
• Amazon ElastiCache
• Amazon Elastic MapReduce
• AWS Identity and Access Management (IAM)
# 13#
The Stack:
Resources
Deployment
Management
Configuration
Networking
Facilities
Geographies
# 14#
Terminology
Fault Tolerance
• Ability of a system to continue operating properly (perhaps at a degraded level) if one or more components fail.
High Availability (HA)
• Fault-tolerant systems are measured by their availability in terms of planned and unplanned service outages for end users.
Disaster Recovery (DR)
• The process, policies and procedures related to restoring critical systems after a catastrophic event.
• Goal is to get the application back up and running within a defined time period (RTO) and within a certain data loss window (RPO).
# 15#
Terminology - continued
Recovery Point Objective (RPO)
• Acceptable data loss as a result of recovering from a disaster/catastrophic event
Recovery Time Objective (RTO)
• Time period in which service must be restored to meet BCP (Business Continuity Planning) objectives
RTO and RPO are often at odds, and tradeoffs need to be made in order to find an acceptable middle ground
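As a rough illustration of how the two objectives can be checked against a plan, here is a minimal Python sketch. It assumes backups are periodic snapshots (so the worst-case data loss is one full snapshot interval) and that recovery time is the sum of detection, provisioning, and restore steps. The class name, fields, and numbers are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical backup-and-restore plan; all durations are in minutes."""
    snapshot_interval_min: float  # time between backups
    detect_min: float             # time to notice the failure
    provision_min: float          # time to launch replacement servers
    restore_min: float            # time to restore data and redirect traffic

    def worst_case_data_loss_min(self) -> float:
        # A failure just before the next snapshot loses one full interval.
        return self.snapshot_interval_min

    def estimated_recovery_min(self) -> float:
        return self.detect_min + self.provision_min + self.restore_min

    def meets(self, rpo_min: float, rto_min: float) -> bool:
        """True when the plan satisfies both the RPO and the RTO."""
        return (self.worst_case_data_loss_min() <= rpo_min
                and self.estimated_recovery_min() <= rto_min)


plan = RecoveryPlan(snapshot_interval_min=60, detect_min=5,
                    provision_min=20, restore_min=30)
print(plan.meets(rpo_min=60, rto_min=60))  # hourly snapshots, 55-minute recovery
print(plan.meets(rpo_min=15, rto_min=60))  # snapshots too infrequent for this RPO
```

Tightening the RPO in this model means taking snapshots more often (or replicating continuously), which is one place where the cost tradeoff shows up.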
# 16#
Takeaways
• Understand core concepts behind HA and DR
• Introduction to architectural options for designing HA, fault-tolerant applications and DR environments and procedures
• Best Practices for implementation of these architectural options within AWS (independent of RightScale)
-Multi-Availability Zone (AZ) and Multi-Region
• Architectural options and Considerations / pros and cons of these options
• Understanding of the tools RightScale brings to AWS to simplify the creation of these HA and DR environments
# 17#
Regions & Availability Zones
• Zones within a region share a LAN (high bandwidth, low latency, private IP access)
• Zones utilize separate power sources, are physically segregated
• Regions are “islands”, and share no resources.
[Diagram: regions and their Availability Zones. Japan: A, B. EU West Region: A, B. US East Region: A, B, C. US West Region: A, B. Singapore: A, B.]
Source: AWS
# 18#
Designing for Failure
• Large scale failures in the cloud are rare but do happen
• Application owners are ultimately responsible for availability and recoverability
• Balance cost and complexity of HA efforts against risk(s) you are willing to bear
• Cloud infrastructure has made DR and HA remarkably affordable versus past options
-Multi-Server
-Multi-AZ (Availability Zone)
-Multi-Region
“Everything fails, all the time.” Werner Vogels, CTO Amazon.com
# 19#
Designing for Failure – Basic Concepts
• Fault tolerance is the goal. Degradation of service may occur, but application continues to function.
• Avoid single points of failure (SPOF)
• Assume everything fails (remember Werner’s mantra) and design accordingly
• Plan and practice your recovery process (both for HA and DR)
• Remember that better HA and DR equals more $$$. So find that acceptable balance.
# 20#
High Availability
Don’t sweat the small stuff. And it’s all small stuff*
*(until it’s not)
Follow a few general best practices to absorb application component outages…
# 21#
General HA Best Practices
• Avoid single points of failure.
• Always place at least one of each component (load balancers, app servers, databases) in each of at least two AZs.
• Replicate data across AZs (HA) and back up or replicate across regions for failover (DR).
• Set up monitoring, alerts and operations to identify problems and automate resolution or failover.
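The placement rule above can be checked mechanically. The following is a minimal Python sketch that flags any tier running in fewer than two zones. The function name, tier names, and the example inventory are hypothetical; a real check would read the inventory from your cloud provider or management tool.

```python
from collections import defaultdict


def single_zone_tiers(servers, min_zones=2):
    """Return the tiers that run in fewer than min_zones distinct zones.

    servers is an iterable of (tier, zone) pairs.
    """
    zones_by_tier = defaultdict(set)
    for tier, zone in servers:
        zones_by_tier[tier].add(zone)
    return sorted(tier for tier, zones in zones_by_tier.items()
                  if len(zones) < min_zones)


# Hypothetical inventory: the database tier lives in only one zone.
deployment = [
    ("load_balancer", "us-east-1a"),
    ("load_balancer", "us-east-1b"),
    ("app_server", "us-east-1a"),
    ("app_server", "us-east-1b"),
    ("database", "us-east-1a"),
]
print(single_zone_tiers(deployment))  # ['database']
```

Any tier in the returned list is a single point of failure at the zone level.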
# 22#
• High availability for top web properties with 270M visitors/month
• Migration from datacenter to AWS
• RightScale provides
-Self-service access to developers
-Consistency and low maintenance
-Usage and cost accounting
-Multi-region architectures to avoid downtime
# 23#
Multi-Zone HA
[Diagram: Multi-Zone HA across US-EAST 1a and US-EAST 1b. Labels: DNS, load balancers, app servers (autoscale), master DB, slave DB, replicate, EBS, snapshots, S3, 172.168.7.31, 172.168.8.62]
• Snapshot data volume for backups so the database can be readily recovered within the region.
• Place Slave databases in one or more zones for failover.
• Consider local storage for additional slave database to remove dependency on attached volume
• Consider distributed NoSQL databases with the same distribution considerations.
# 24#
Disaster Recovery
DR presents a few new wrinkles compared to HA, but there are multiple options depending on your needs and budget…
Don’t sweat the small stuff. And it’s all small stuff*
*(until it’s not)
# 25#
HA/DR Checklist for Risk Mitigation
• Determine who owns the architecture, DR process and testing.
• Develop expertise in-house and / or get outside help.
• Conduct a risk assessment for each application.
• Specify your target RTO and RPO.
• Design for failure starting with application architecture. This will help drive the infrastructure architecture.
# 26#
HA/DR Checklist for Risk Mitigation
• Implement HA best practices balancing cost, complexity and risk.
-Automate infrastructure for consistency and reliability.
• Document operational processes and automations.
• Test the failover... then test it again.
• Release the Chaos Monkey.
# 27#
Multi-Region/Cloud DR Options
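• Cold DR: cost $, downtime > 1 Hour, availability 99% (Most Common)
• Warm DR: cost $$, downtime < 1 Hour, availability 99.5% (Recommended)
• Hot DR: cost $$$, downtime < 5 Mins, availability 99.9% (Least Common)
• Multi-Cloud HA: cost $$$$, downtime 0, availability 99.999% (Live/Live Config)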
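To put the availability percentages on this slide in perspective, the short Python sketch below converts each one into the total downtime it permits over a year. This is standard arithmetic rather than anything specific to the presentation, and it assumes a 365-day year. Note that it yields an annual budget, which is a different quantity from the per-incident downtime figures on the slide.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct):
    """Hours of downtime per year permitted by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.5, 99.9, 99.999):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% availability allows {hours:.2f} hours/year "
          f"({hours * 60:.1f} minutes)")
```

For example, 99% availability allows about 87.6 hours of downtime per year, while 99.999% allows only a little over 5 minutes.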
# 28#
Multi-Region Cold DR
[Diagram: US EAST and US WEST regions. Labels: DNS, load balancers, app servers, master DB, slave DB, replicate, EBS, snapshots, S3, 172.168.7.31]
Staged Server Configuration and generally no staged data
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
# 29#
Multi-Region Warm DR
[Diagram: US EAST and US WEST regions. Labels: DNS, load balancers, app servers, master DB, slave DB, replicate, EBS, snapshots, S3, 172.168.7.31]
Staged Server Configuration, pre-staged data and running Slave Database Server
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
# 30#
Multi-Region Hot DR
[Diagram: US EAST and US WEST regions. Labels: DNS, load balancers, app servers, master DB, slave DB, replicate, EBS, snapshots, S3, 172.168.7.31]
Parallel Deployment with all servers running but all traffic going to primary
• Not recommended
• Very high additional cost to allow rapid recovery
# 31#
Hybrid HA
[Diagram: US-EAST and CHICAGO. Labels: DNS, load balancers, app servers, master DB, slave DB, replicate, EBS, snapshots, S3, SWIFT, 172.168.7.31, 172.168.8.62]
Live/Live configuration. Geo-target IP services to direct traffic to regional LBs.
• Possible, but not recommended (more to follow…)
• Max additional cost and max availability, but complex to implement and manage
# 32#
Hybrid HA
[Diagram: US-EAST and CHICAGO. Labels: DNS, load balancers, app servers, master DB, slave DB, replicate, EBS volume, snapshots, S3, SWIFT, 172.168.7.31, 172.168.8.62]
Looks similar to Multi-Zone… but additional problems to solve as some resources are not shared
• You need DNS management or a global load balancer.
• Security requires additional effort as security groups are region-specific.
• Machine images are specific to the cloud/region.
# 33#
• Procurement software
• SLAs to their customers require HA
• Subway chain is a customer that procures perishable goods through Coupa
# 34#
In the Dashboard
Multi-region or cloud
Multi-region Warm DR
Staged servers
Cost forecasting for DR environment
# 35#
Automating HA and DR
• Use dynamic DNS for your database servers
-Allow app servers to use a single FQDN.
-Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
-App servers can connect to all load balancers automatically at launch
-No manual intervention
-No DNS modifications
• Automated promotion of slave to master
-Process is automated
-Decision to run process is manual
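The pattern on this slide (a single database FQDN with a low TTL, an automated promotion process, and a manual decision to run it) can be sketched in a few lines of Python. Everything here is hypothetical: `DnsZone` is an in-memory stand-in for a dynamic DNS provider, the hostname and IP addresses are made up, and a real promotion would also need to reconfigure replication on the database itself.

```python
from dataclasses import dataclass, field


@dataclass
class DnsZone:
    """In-memory stand-in for a dynamic DNS provider (hypothetical)."""
    records: dict = field(default_factory=dict)

    def upsert(self, fqdn, ip, ttl):
        self.records[fqdn] = (ip, ttl)

    def resolve(self, fqdn):
        return self.records[fqdn][0]


def promote_slave(dns, fqdn, slave_ip, approved, ttl=60):
    """Repoint the master FQDN at the slave, but only once an operator approves."""
    if not approved:
        return False  # the decision to fail over stays manual
    # The steps themselves are automated. The low TTL means app servers
    # pick up the new address quickly without any configuration change.
    dns.upsert(fqdn, slave_ip, ttl)
    return True


dns = DnsZone()
dns.upsert("master-db.example.com", "10.0.1.10", ttl=60)

promote_slave(dns, "master-db.example.com", "10.0.2.20", approved=True)
print(dns.resolve("master-db.example.com"))  # 10.0.2.20
```

Because app servers only ever know the FQDN, no app server has to be touched during failover.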
# 36#
How RightScale makes it possible
MultiCloud Images
• MultiCloud Images can be launched across regions and hybrid clouds without modification
• ServerTemplate contains a list of MultiCloud Images (MCIs)
• When the Server is created, a specific MCI is chosen.
• The appropriate RightImage is used at launch.
[Diagram: a MultiCloud Image maps Cloud A to RightImage 1, Cloud B to RightImage 2, and Cloud C to RightImage 3. RightImage: stability across clouds]
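Conceptually, a MultiCloud Image behaves like a lookup table from cloud to machine image. The Python sketch below models that idea only. It is not the RightScale API: the class, method, and image names are hypothetical and chosen to mirror the labels on the slide.

```python
from dataclasses import dataclass


@dataclass
class MultiCloudImage:
    """Hypothetical model of an MCI: one logical image, one entry per cloud."""
    name: str
    images_by_cloud: dict  # cloud name -> image identifier

    def image_for(self, cloud):
        """Pick the image to launch in the given cloud."""
        try:
            return self.images_by_cloud[cloud]
        except KeyError:
            raise ValueError(f"{self.name} has no image for {cloud}") from None


mci = MultiCloudImage(
    name="base-server",
    images_by_cloud={
        "Cloud A": "RightImage 1",
        "Cloud B": "RightImage 2",
        "Cloud C": "RightImage 3",
    },
)
print(mci.image_for("Cloud B"))  # RightImage 2
```

The server definition refers only to the logical image, so the same definition can be launched in a different region or cloud without modification.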
# 37#
How RightScale makes it possible
ServerTemplates, Tags, and Inputs
• Automated load balancer registration and database connections
• Autoscaling across zones
• Dynamic configuration
# 38#
DR Cost Comparison Example
Multi-Region Cold DR: Total $4480 / month
• Running, $4470 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 1 Slave DB (2XLarge)
• Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Slave DB (2XLarge)
• Replication, $10 / month: 25GB / day cross-zone
Multi-Region Warm DR: Total $5630 / month
• Running, $5540 / month: 3 Load Balancers (Large), 6 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
• Staged, $0 / month: 3 Load Balancers (Large), 6 App Servers (XLarge)
• Replication, $90 / month: 25GB / day cross-region
Multi-Region Hot DR: Total $8800 / month
• Running, $8440 / month: 6 Load Balancers (Large), 12 App Servers (XLarge), 1 Master DB (2XLarge), 2 Slave DB (2XLarge)
• Replication, $360 / month: 100GB / day cross-region
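The totals in this comparison are simply running cost plus staged cost plus replication cost. The Python sketch below reproduces that arithmetic from the figures on the slide and shows the monthly premium of each option over Cold DR. The only assumption is that Hot DR has a staged cost of $0, since the slide lists no staged servers for it.

```python
# Monthly costs in USD, taken from the slide.
costs = {
    "Cold DR": {"running": 4470, "staged": 0, "replication": 10},
    "Warm DR": {"running": 5540, "staged": 0, "replication": 90},
    # Assumption: no staged servers are listed for Hot DR, so staged = 0.
    "Hot DR": {"running": 8440, "staged": 0, "replication": 360},
}

totals = {option: sum(parts.values()) for option, parts in costs.items()}
baseline = totals["Cold DR"]

for option, total in totals.items():
    print(f"{option}: ${total} / month (+${total - baseline} vs Cold DR)")
```

Staged servers cost nothing until they are launched, which is why Warm DR adds only the second slave database and cross-region replication on top of Cold DR.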
# 39#
Outage-Proofing Best Practices
Resource Placement
• Place in >1 zone: load balancers, app servers, databases
• Maintain capacity to absorb zone or region failures
Application Design
• Design stateless apps for resilience to reboot / relaunch
Replication and Failover
• Replicate data across zones
• Back up across regions
• Monitor, alert, and automate operations to speed up failover
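"Maintain capacity to absorb zone or region failures" has a simple sizing consequence: if servers are spread evenly and one zone is lost, the remaining zones must still carry peak load. The Python sketch below works through that arithmetic. The function name and the example of six app servers at peak are illustrative assumptions, and the model ignores autoscaling, which could replace lost capacity after a delay.

```python
import math


def servers_per_zone(peak_servers, zones):
    """Servers each zone needs so peak load survives the loss of one zone."""
    if zones < 2:
        raise ValueError("need at least two zones to absorb a zone failure")
    return math.ceil(peak_servers / (zones - 1))


# Hypothetical workload that needs 6 app servers at peak.
for zones in (2, 3):
    per_zone = servers_per_zone(6, zones)
    print(f"{zones} zones: {per_zone} per zone, {per_zone * zones} in total")
```

With two zones each zone must be able to carry the full load alone (12 servers in total), while spreading over three zones lowers the total to 9.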
# 40#
Resources and Q&A
AWS
Contact: aws.amazon.com/contact-us
Attend: https://aws.amazon.com/aws-summit-2013/
RightScale
Try: RightScale Free Edition, www.rightscale.com/free
Contact: Toll Free: 1.866.720.0208, Int’l: 1.805.855.0265
Attend: http://www.rightscalecompute.com/