© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
How Netflix Leverages Multiple Regions to Increase Availability: Isthmus and Active-Active Case Study
Ruslan Meshenberg
November 13, 2013
Failure
Assumptions
[Diagram: a two-by-two quadrant of speed of change versus scale. Enterprise IT: slowly changing, small scale ("everything works"). Telcos: slowly changing, large scale. Startups: rapid change, small scale. Web scale: rapid change, large scale ("everything is broken": hardware will fail, software will fail).]
Incidents – Impact and Mitigation
[Diagram: incident pyramid, from most to least severe tier.]
• PR – X incidents: public relations / media impact
• CS – XX incidents: high customer service call volume
• Metrics impact (feature disable) – XXX incidents: affects A/B test results
• No impact (fast retry or automated failover) – XXXX incidents
Mitigations:
• Y incidents mitigated by active-active and game day practicing
• YY incidents mitigated by better tools and practices
• YYY incidents mitigated by better data tagging
Does an Instance Fail?
• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey (see the sketch below)
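For context, a minimal sketch of this kind of instance termination in Python with boto3 is below; the Auto Scaling group name and region are hypothetical, and the real Chaos Monkey (part of the NetflixOSS Simian Army) is considerably more configurable and safety-aware.

```python
# Minimal Chaos Monkey-style sketch: terminate one random instance in an
# Auto Scaling group. Group name and region are illustrative assumptions.
import random
import boto3

REGION = "us-east-1"                      # assumed region
ASG_NAME = "api-server-v042"              # hypothetical Auto Scaling group

def terminate_random_instance(asg_name: str, region: str) -> str:
    autoscaling = boto3.client("autoscaling", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)

    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])["AutoScalingGroups"]
    instance_ids = [i["InstanceId"] for g in groups for i in g["Instances"]]
    if not instance_ids:
        raise RuntimeError(f"No instances found in {asg_name}")

    victim = random.choice(instance_ids)
    # The ASG should replace the instance; the service must tolerate the loss.
    ec2.terminate_instances(InstanceIds=[victim])
    return victim

if __name__ == "__main__":
    print("Terminated", terminate_random_instance(ASG_NAME, REGION))
```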
Does a Zone Fail?
• Rarely, but it has happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla
Does a Region Fail?
• Full region outage – unlikely, very rare
• Individual services can fail region-wide
• Most likely: a region-wide configuration issue
• Test with Chaos Kong
Everything Fails… Eventually
• The good news is you can do something about it
• Keep your services running by embracing isolation and redundancy
Cloud Native – A New Engineering Challenge
Construct a highly agile and highly available service from ephemeral and assumed-broken components.
Isolation
• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not affect functionality / operations
Redundancy
• Make more than one (of pretty much everything)
• Specifically, distribute services across Availability Zones and regions
History: X-mas Eve 2012
• Netflix multi-hour outage
• US-East-1 regional Elastic Load Balancing issue
• "...data was deleted by a maintenance process that was inadvertently run against the production ELB state data"
• "The ELB issue affected less than 7% of running ELBs and prevented other ELBs from scaling."
Isthmus – Normal Operation
[Diagram: Cassandra replicas in Zones A, B, and C; client traffic arriving through the US-East ELB to the US-East infrastructure, with a US-West-2 ELB and Zuul connected to the same infrastructure over a cross-region tunnel.]
Isthmus – Failover
[Diagram: the same topology with traffic shifted to the US-West-2 ELB and Zuul, which forwards requests over the tunnel to the US-East infrastructure and Cassandra replicas.]
Isthmus
Zuul – Overview
[Diagram: Zuul routing traffic across multiple Elastic Load Balancing endpoints.]
Zuul – Details
Denominator – Abstracting the DNS Layer
[Diagram: Denominator driving DNS changes through UltraDNS, DynECT, and Amazon Route 53, switching traffic between regional load balancers and ELBs that front Cassandra replicas in Zones A, B, and C in each region.]
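Denominator itself is a Java library that abstracts those DNS vendors behind one interface. Purely as an illustration of the idea, here is a sketch of a provider-agnostic record update with a Route 53 implementation via boto3; the hosted zone ID, record name, and load balancer targets are hypothetical.

```python
# Sketch of a provider-agnostic DNS switch in the spirit of Denominator.
# Zone ID, record name, and targets are illustrative assumptions.
from abc import ABC, abstractmethod
import boto3

class DnsProvider(ABC):
    @abstractmethod
    def upsert_cname(self, name: str, target: str, ttl: int = 60) -> None: ...

class Route53Provider(DnsProvider):
    def __init__(self, hosted_zone_id: str):
        self.zone_id = hosted_zone_id
        self.client = boto3.client("route53")

    def upsert_cname(self, name: str, target: str, ttl: int = 60) -> None:
        self.client.change_resource_record_sets(
            HostedZoneId=self.zone_id,
            ChangeBatch={"Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }]},
        )

# Point the service alias at the US-West-2 regional load balancer.
if __name__ == "__main__":
    dns: DnsProvider = Route53Provider("Z3EXAMPLE")            # hypothetical zone
    dns.upsert_cname("api.example.com.",
                     "regional-lb.us-west-2.example.com.")     # hypothetical target
```

Other providers (UltraDNS, DynECT) would plug in behind the same interface, so a failover script does not care which vendor serves the zone.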
Isthmus – Only for Elastic Load Balancing Failures
• Other services may fail region-wide
• Not worthwhile to develop one-offs for each one
Active-Active – Full Regional Resiliency
[Diagram: two regions, each with Cassandra replicas in Zones A, B, and C behind regional load balancers; both regions serve traffic.]
Active-Active – Failover
[Diagram: the same two-region topology with all traffic directed to the surviving region's load balancers.]
Active-Active Architecture
Separating the Data – Eventual Consistency
• 2–4 region Cassandra clusters
• Eventual consistency != hopeful consistency
Highly Available NoSQL Storage
A highly scalable, available, and durable deployment pattern based on Apache Cassandra.
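As a sketch of this deployment pattern (not Netflix's actual schema), the following uses the DataStax Python driver to create a keyspace replicated across two data centers and to write at CL.ONE; the keyspace, table, data center names, and contact point are assumptions and must match the cluster's snitch configuration.

```python
# Sketch: multi-region keyspace and a local, low-consistency write.
# Keyspace, data-center names, and contact points are illustrative assumptions.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.10"])          # assumed local contact point
session = cluster.connect()

# Replicate 3 copies per region; the snitch maps nodes to these data centers.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS subscriber
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 3,
        'us-west-2': 3
    }
""")
session.set_keyspace("subscriber")
session.execute("""
    CREATE TABLE IF NOT EXISTS viewing_history (
        member_id text, title_id text, watched_at timestamp,
        PRIMARY KEY (member_id, title_id)
    )
""")

# CL.ONE (as in the benchmark below): acknowledge after one replica accepts
# the write; remaining replicas, including the remote region, catch up later.
insert = SimpleStatement(
    "INSERT INTO viewing_history (member_id, title_id, watched_at) "
    "VALUES (%s, %s, toTimestamp(now()))",
    consistency_level=ConsistencyLevel.ONE)
session.execute(insert, ("member-123", "title-456"))
```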
Benchmarking Global Cassandra
• Write-intensive test of cross-region replication capacity
• 16 hi1.4xlarge SSD nodes per zone = 96 nodes total
• 192 TB of SSD across six locations, up and running Cassandra in 20 minutes
[Diagram: Cassandra replicas in Zones A, B, and C in both US-West-2 (Oregon) and US-East-1 (Virginia), with test load, validation load, and interzone traffic in each region.]
• 1 million writes at CL.ONE (wait for one replica to ack)
• 1 million reads after 500 ms at CL.ONE, with no data loss
• Interregion traffic: up to 9 Gbit/s at 83 ms
• 18 TB of backups from S3
Propagating EVCache Invalidations
[Diagram: an app server with the EVCache client writes to EVCache and to SQS; an EVCache Replication Service (drainer and writer, with replication metadata) propagates the change to the EVCache replicas in the other region (US-EAST-1 and US-WEST-2 shown).]
1. Set data in EVCACHE.
2. Write events to SQS and EVCACHE_REGION_REPLICATION.
3. Read from SQS in batches.
4. Call the Writer with key, write time, TTL, and value, after checking that this is the latest event for the key in the current batch; the call goes cross-region through ELB over HTTPS.
5. Check the write time to ensure this is the latest operation for the key.
6. Delete the value for the key.
7. Return the keys that were successful.
8. Delete the keys that were successful from SQS.
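The sketch below walks through a heavily simplified version of that flow in Python with boto3 for SQS; the queue name, the remote writer URL, and the payload fields are assumptions, and the real replication service adds batching, retries, and metadata handling that are omitted here.

```python
# Sketch of the cross-region invalidation flow: the app side enqueues an
# event, a drainer reads batches and calls the remote writer over HTTPS.
# Queue name, writer URL, and payload fields are illustrative assumptions.
import json
import time

import boto3
import requests

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = sqs.create_queue(QueueName="evcache-replication")["QueueUrl"]
REMOTE_WRITER_URL = "https://replication-writer.us-west-2.example.com/write"  # hypothetical

def enqueue_invalidation(key: str, value: str, ttl: int) -> None:
    """Step 2: record the cache mutation as a replication event."""
    event = {"key": key, "value": value, "ttl": ttl,
             "write_time": int(time.time() * 1000)}
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))

def drain_once() -> None:
    """Steps 3-8: read a batch, forward latest-per-key events, ack successes."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if not messages:
        return

    # Step 4 (first half): keep only the latest event per key in this batch.
    latest, receipts = {}, {}
    for m in messages:
        event = json.loads(m["Body"])
        key = event["key"]
        if key not in latest or event["write_time"] > latest[key]["write_time"]:
            latest[key] = event
        receipts.setdefault(key, []).append(m["ReceiptHandle"])

    # Step 4 (second half): call the remote writer cross-region over HTTPS.
    # The writer performs steps 5-7: check write time, delete, report success.
    ok_keys = requests.post(REMOTE_WRITER_URL, json=list(latest.values()),
                            timeout=5).json()["successful_keys"]

    # Step 8: remove the successfully replicated events from SQS.
    for key in ok_keys:
        for handle in receipts.get(key, []):
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)
```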
Archaius – Region-isolated Configuration
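Archaius is a Java configuration library; purely to illustrate region-isolated configuration, here is a small sketch in which a per-region properties file overrides base values, so a change scoped to one region cannot leak into another. The file names, keys, and region lookup are assumptions.

```python
# Sketch of region-isolated configuration: base properties overridden by a
# per-region file. File names, property keys, and the region source are
# assumptions; each file is expected to use a [DEFAULT] section.
import configparser
import os

def load_config(region: str, config_dir: str = "conf") -> configparser.SectionProxy:
    parser = configparser.ConfigParser()
    # Later files win: application.ini, then application-<region>.ini.
    parser.read([
        os.path.join(config_dir, "application.ini"),
        os.path.join(config_dir, f"application-{region}.ini"),
    ])
    return parser["DEFAULT"]

if __name__ == "__main__":
    region = os.environ.get("EC2_REGION", "us-east-1")   # assumed region source
    cfg = load_config(region)
    # e.g. enable replication only in the regions that need it
    print(cfg.get("evcache.replication.enabled", fallback="false"))
```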
Running Isthmus and Active-Active
Multiregional Monkeys
• Detect failure to deploy
• Differences in configuration
• Resource differences
Multiregional Monitoring and Alerting
• Per-region metrics
• Global aggregation
• Anomaly detection (see the sketch below)
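A toy sketch of those three bullets: aggregate a per-region metric globally and flag a region whose current value falls well below its recent baseline. The metric, numbers, and threshold are illustrative assumptions, not Netflix's production approach.

```python
# Toy sketch: aggregate a per-region metric globally and flag a region whose
# current value drops far below its recent baseline. Numbers are made up.
from statistics import mean

def detect_anomalies(history: dict[str, list[float]], current: dict[str, float],
                     drop_threshold: float = 0.5) -> list[str]:
    """Return regions whose current metric is < drop_threshold * recent mean."""
    return [region for region, value in current.items()
            if history.get(region) and value < drop_threshold * mean(history[region])]

if __name__ == "__main__":
    # e.g. stream starts per second, per region (illustrative values)
    history = {"us-east-1": [980, 1010, 995], "us-west-2": [450, 470, 460]}
    current = {"us-east-1": 1002, "us-west-2": 120}

    print("global:", sum(current.values()))          # global aggregation
    print("anomalous regions:", detect_anomalies(history, current))
```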
Failover and Fallback
• DNS (Denominator) changes
• For fallback, ensure data consistency
• Some challenges: cold cache, autoscaling
• Automate, automate, automate (a minimal failover sketch follows)
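One possible shape of that automation, sketched with boto3: pre-scale the surviving region's Auto Scaling group (to blunt the cold-cache and reactive-autoscaling problems) before repointing DNS at its load balancer, much as the Denominator sketch above does. All group names, zone IDs, and capacities are hypothetical.

```python
# Sketch of an automated region failover: grow capacity in the surviving
# region first, then repoint DNS at its load balancer. All names are
# illustrative assumptions.
import boto3

def prescale(asg_name: str, desired: int, region: str) -> None:
    # Add capacity before the traffic arrives rather than reacting to it.
    autoscaling = boto3.client("autoscaling", region_name=region)
    autoscaling.set_desired_capacity(AutoScalingGroupName=asg_name,
                                     DesiredCapacity=desired,
                                     HonorCooldown=False)

def repoint_dns(zone_id: str, record: str, target: str, ttl: int = 60) -> None:
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {"Name": record, "Type": "CNAME", "TTL": ttl,
                                  "ResourceRecords": [{"Value": target}]},
        }]},
    )

def fail_over_to_us_west_2() -> None:
    prescale("api-server-v042", desired=400, region="us-west-2")   # hypothetical ASG
    repoint_dns("Z3EXAMPLE", "api.example.com.",                   # hypothetical zone
                "regional-lb.us-west-2.example.com.")              # hypothetical target

if __name__ == "__main__":
    fail_over_to_us_west_2()
```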
Validating the Whole Thing Works
Dev-Ops in N Regions
• Best practices: avoiding peak times for deployment
• Early problem detection / rollbacks
• Automated canaries / continuous delivery (a toy canary check follows)
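A toy version of an automated canary gate: compare the canary's error rate to the baseline fleet's and decide whether to promote or roll back. The thresholds and numbers are assumptions; real canary analysis is considerably more sophisticated.

```python
# Toy canary check: promote the new build only if the canary's error rate is
# not meaningfully worse than the baseline's. Thresholds are made-up values.
def canary_passes(baseline_errors: int, baseline_requests: int,
                  canary_errors: int, canary_requests: int,
                  max_ratio: float = 1.2, min_requests: int = 1000) -> bool:
    if canary_requests < min_requests:
        return False                       # not enough traffic to judge
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / max(canary_requests, 1)
    return canary_rate <= max_ratio * max(baseline_rate, 1e-6)

if __name__ == "__main__":
    # Illustrative numbers only.
    if canary_passes(baseline_errors=120, baseline_requests=1_000_000,
                     canary_errors=1, canary_requests=10_000):
        print("promote canary to full deployment")
    else:
        print("roll back canary")
```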
Hyperscale Architecture
[Diagram: UltraDNS, DynECT DNS, and Amazon Route 53 at the DNS layer, directing traffic to regional load balancers in two regions, each fronting Cassandra replicas in Zones A, B, and C.]
Automation
Does It Work?
Building Blocks Available on the Netflix GitHub Site
Topic – Session # – When
What an Enterprise Can Learn from Netflix, a Cloud-native Company – ENT203 – Thursday, Nov 14, 4:15 PM - 5:15 PM
Maximizing Audience Engagement in Media Delivery – MED303 – Thursday, Nov 14, 4:15 PM - 5:15 PM
Scaling your Analytics with Amazon Elastic MapReduce – BDT301 – Thursday, Nov 14, 4:15 PM - 5:15 PM
Automated Media Workflows in the Cloud – MED304 – Thursday, Nov 14, 5:30 PM - 6:30 PM
Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce for Monitoring at Gigascale – BDT302 – Thursday, Nov 14, 5:30 PM - 6:30 PM
Encryption and Key Management in AWS – SEC304 – Friday, Nov 15, 9:00 AM - 10:00 AM
Your Linux AMI: Optimization and Performance – CPN302 – Friday, Nov 15, 11:30 AM - 12:30 PM
Takeaways
Embrace isolation and redundancy for availability
NetflixOSS helps everyone to become cloud native
http://netflix.github.com http://techblog.netflix.com http://slideshare.net/Netflix
@rusmeshenberg @NetflixOSS
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.