How Netflix Leverages Multiple Regions to Increase Availability: Isthmus and Active-Active Case Study

Ruslan Meshenberg

November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.


Page 1: How Netflix Leverages Multiple Regions to Increase ...awsmedia.s3.amazonaws.com/ARC305.pdf · CS . XX Incidents . Metrics Impact – Feature Disable . XXX Incidents . No Impact –


Failure

Assumptions

[Figure: quadrant chart with axes Speed and Scale. Quadrants: Slowly Changing, Large Scale; Rapid Change, Large Scale; Slowly Changing, Small Scale; Rapid Change, Small Scale. Sector labels: Enterprise IT, Telcos, Startups, Web Scale. Assumption labels: Everything Works, Hardware Will Fail, Software Will Fail, Everything Is Broken.]

Incidents – Impact and Mitigation

Impact:
• PR: X incidents (public relations media impact)
• CS: XX incidents (high customer service calls)
• Metrics Impact – Feature Disable: XXX incidents (affects AB test results)
• No Impact – Fast Retry or Automated Failover: XXXX incidents

Mitigation:
• Y incidents mitigated by Active-Active, game day practicing
• YY incidents mitigated by better tools and practices
• YYY incidents mitigated by better data tagging

Does an Instance Fail?

• It can, plan for it
• Bad code / configuration pushes
• Latent issues
• Hardware failure
• Test with Chaos Monkey
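The idea behind testing for instance failure can be sketched as picking a random instance per service group and terminating it. The following is a minimal simulation, not Chaos Monkey's actual implementation: the `Instance` type, the group names, and the `opted_out` parameter are all hypothetical, and no cloud API is called.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str   # hypothetical identifier
    group: str         # hypothetical service group name


def pick_victims(instances, rng, opted_out=frozenset()):
    """Pick one random instance per group, skipping opted-out groups."""
    by_group = {}
    for inst in instances:
        if inst.group not in opted_out:
            by_group.setdefault(inst.group, []).append(inst)
    # One victim per group: a service should survive losing any single instance.
    return [rng.choice(members) for _, members in sorted(by_group.items())]


fleet = [
    Instance("i-001", "api"),
    Instance("i-002", "api"),
    Instance("i-003", "playback"),
    Instance("i-004", "billing"),
]
victims = pick_victims(fleet, random.Random(42), opted_out={"billing"})
for v in victims:
    # A real tool would call the cloud provider's terminate API here.
    print(f"would terminate {v.instance_id} in group {v.group}")
```

Passing in the random generator keeps a run reproducible, which helps when a termination exposes a problem that needs to be replayed.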

Does a Zone Fail?

• Rarely, but happened before
• Routing issues
• DC-specific issues
• App-specific issues within a zone
• Test with Chaos Gorilla

Does a Region Fail?

• Full region – unlikely, very rare
• Individual services can fail region-wide
• Most likely, a region-wide configuration issue
• Test with Chaos Kong

Everything Fails… Eventually

• The good news is you can do something about it

• Keep your services running by embracing isolation and redundancy

Cloud Native: A New Engineering Challenge

Construct a highly agile and highly available service from ephemeral and assumed broken components

Isolation

• Changes in one region should not affect others
• Regional outage should not affect others
• Network partitioning between regions should not affect functionality / operations

Redundancy

• Make more than one (of pretty much everything)
• Specifically, distribute services across Availability Zones and regions

History: X-mas Eve 2012

• Netflix multi-hour outage
• US-East-1 regional Elastic Load Balancing issue
• “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
• “The ELB issue affected less than 7% of running ELBs and prevented other ELBs from scaling.”

Isthmus – Normal Operation

[Diagram. Labels: US-East ELB, US-West-2 ELB, Tunnel, ELB, Zuul, Infrastructure*, and Cassandra replicas in Zones A, B, and C.]

Isthmus – Failover

[Diagram. Labels: US-East ELB, US-West-2 ELB, Tunnel, ELB, Zuul, Infrastructure*, and Cassandra replicas in Zones A, B, and C.]

Isthmus

Zuul – Overview

[Diagram. Labels: Elastic Load Balancing (three instances).]

Zuul – Details

Denominator – Abstracting the DNS Layer

[Diagram. Labels: Denominator; DNS providers UltraDNS, DynECT DNS, and Amazon Route 53; Regional Load Balancers; ELBs; two sets of Cassandra replicas in Zones A, B, and C.]

Isthmus – Only for Elastic Load Balancing Failures

• Other services may fail region-wide
• Not worthwhile to develop one-offs for each one

Active-Active – Full Regional Resiliency

[Diagram. Labels: two regions, each with Regional Load Balancers and Cassandra replicas in Zones A, B, and C.]

Active-Active – Failover

[Diagram. Labels: two regions, each with Regional Load Balancers and Cassandra replicas in Zones A, B, and C.]

Active-Active Architecture

Separating the Data – Eventual Consistency

• 2–4 region Cassandra clusters
• Eventual consistency != hopeful consistency

Highly Available NoSQL Storage

A highly scalable, available, and durable deployment pattern based on Apache Cassandra

Benchmarking Global Cassandra

Write-intensive test of cross-region replication capacity

• 16 x hi1.4xlarge SSD nodes per zone = 96 total
• 192 TB of SSD in six locations, up and running Cassandra in 20 minutes
• 1 million writes at CL.ONE (wait for one replica to ack)
• 1 million reads after 500 ms at CL.ONE with no data loss
• Interregion traffic: up to 9 Gbit/s, 83 ms
• 18 TB backups from S3

[Diagram. Labels: Cassandra replicas in Zones A, B, and C of US-West-2 Region (Oregon) and of US-East-1 Region (Virginia); Test Load; Validation Load; Interzone Traffic; Interregion Traffic.]
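The node and storage figures on this slide are consistent with each other, which a few lines of arithmetic can confirm. The zone and region counts below are read off the diagram labels (three zones in each of two regions); the per-node SSD figure is derived, not stated on the slide.

```python
NODES_PER_ZONE = 16      # from the slide
ZONES_PER_REGION = 3     # Zones A, B, C in the diagram
REGIONS = 2              # US-West-2 and US-East-1
TOTAL_SSD_TB = 192       # from the slide

locations = ZONES_PER_REGION * REGIONS          # the "six locations"
total_nodes = NODES_PER_ZONE * locations        # the "96 total"
ssd_per_node_tb = TOTAL_SSD_TB / total_nodes    # derived: SSD capacity per node

print(locations, total_nodes, ssd_per_node_tb)  # prints: 6 96 2.0
```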

Propagating EVCache Invalidations

1. Set data in EVCACHE
2. Write events to SQS and EVCACHE_REGION_REPLICATION
3. Read from SQS in batches
4. Calls Writer with Key, Write Time, TTL & Value after checking if this is the latest event for the key in the current batch. Goes cross-region through ELB over HTTPS
5. Checks write time to ensure this is the latest operation for the key
6. Deletes the value for the key
7. Return keys that were successful
8. Delete keys from SQS that were successful

[Diagram. Labels: US-EAST-1, US-WEST-2, App Server, EVCache Client, SQS, EVCACHE nodes, EVCache Replication Metadata, EVCache Replication Service (Drainer, Writer).]
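Steps 3 to 8 of the invalidation flow can be sketched in memory as follows. This is an illustration under stated assumptions, not the EVCache replication service itself: the `Event`, `RemoteWriter`, and `drain` names are hypothetical, a Python list stands in for SQS, a dict stands in for the remote cache and the replication metadata, and reporting stale events as successful (so they leave the queue) is a design choice made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str
    write_time: float  # timestamp recorded by the writing region
    ttl: int
    value: bytes


def latest_per_key(batch):
    """Step 4: keep only the latest event for each key in the batch."""
    latest = {}
    for ev in batch:
        if ev.key not in latest or ev.write_time > latest[ev.key].write_time:
            latest[ev.key] = ev
    return list(latest.values())


class RemoteWriter:
    """Stand-in for the Writer in the destination region (steps 5 to 7)."""

    def __init__(self, cache):
        self.cache = cache          # key -> value in the remote cache
        self.last_write_time = {}   # stand-in for the replication metadata

    def apply(self, events):
        succeeded = []
        for ev in events:
            # Step 5: only act if this is the latest operation seen for the key.
            if ev.write_time >= self.last_write_time.get(ev.key, float("-inf")):
                # Step 6: invalidate by deleting the remote value.
                self.cache.pop(ev.key, None)
                self.last_write_time[ev.key] = ev.write_time
            # Step 7: assumption, stale events also count as handled.
            succeeded.append(ev.key)
        return succeeded


def drain(queue, writer, batch_size=10):
    """Steps 3, 4, and 8: read a batch, forward it, drop what succeeded."""
    batch, rest = queue[:batch_size], queue[batch_size:]
    done = set(writer.apply(latest_per_key(batch)))
    return [ev for ev in batch if ev.key not in done] + rest


remote_cache = {"user:1": b"old", "user:2": b"old"}
writer = RemoteWriter(remote_cache)
queue = [
    Event("user:1", 100.0, 3600, b"a"),
    Event("user:1", 105.0, 3600, b"b"),
    Event("user:2", 101.0, 3600, b"c"),
]
queue = drain(queue, writer)
```

After the call, both keys are gone from `remote_cache` and the queue is empty. The write-time check is what keeps a delayed, older event from invalidating a value that a newer operation has already replaced.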

Archaius – Region-isolated Configuration

Running Isthmus and Active-Active

Multiregional Monkeys

• Detect failure to deploy
• Differences in configuration
• Resource differences
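A check for differences in configuration between regions could look roughly like the sketch below. The `config_drift` function, the region names, and the settings are hypothetical examples; the sketch assumes configuration values are simple hashable values such as strings and numbers.

```python
def config_drift(configs):
    """Return {key: {region: value}} for keys whose values differ across regions."""
    all_keys = set()
    for settings in configs.values():
        all_keys.update(settings)
    drift = {}
    for key in sorted(all_keys):
        # A key that is missing in a region is reported as None for that region.
        values = {region: settings.get(key) for region, settings in configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


configs = {
    "us-east-1": {"timeout_ms": 500, "feature_x": "on"},
    "us-west-2": {"timeout_ms": 500, "feature_x": "off", "pool_size": 20},
}
report = config_drift(configs)
for key, values in report.items():
    print(key, values)
```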

Multiregional Monitoring and Alerting

• Per-region metrics
• Global aggregation
• Anomaly detection
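The three items above fit together as in the following sketch, which uses made-up request counts and a simple standard-deviation test as the anomaly detector. The function names, the threshold, and the data are assumptions for illustration; the slide does not say which detection method is used.

```python
from statistics import mean, pstdev


def global_total(per_region):
    """Aggregate per-region series into one global series (element-wise sum)."""
    return [sum(values) for values in zip(*per_region.values())]


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical requests per minute; the last sample is the newest one.
per_region = {
    "us-east-1": [100, 102, 98, 101, 99, 40],
    "us-west-2": [50, 51, 49, 50, 50, 110],
}
total = global_total(per_region)

region_flags = {
    region: is_anomalous(series[:-1], series[-1])
    for region, series in per_region.items()
}
global_flag = is_anomalous(total[:-1], total[-1])
print(region_flags, global_flag)
```

In this example traffic has moved from one region to the other, so both per-region series are flagged while the global total looks normal. That is one reason to keep per-region metrics alongside the global aggregate.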

Failover and Fallback

• DNS (Denominator) changes
• For fallback, ensure data consistency
• Some challenges
  – Cold cache
  – Autoscaling
• Automate, automate, automate
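One way to soften the cold cache and autoscaling challenges during a DNS-based failover is to move traffic in steps instead of all at once. The sketch below only computes such a schedule; `shift_schedule` and the percentage weights are hypothetical, it does not use the Denominator API, and the slide does not state that traffic is shifted gradually.

```python
def shift_schedule(failing, healthy, steps=4):
    """Yield weight maps (percent) that move traffic from `failing` to `healthy`.

    The weights describe the traffic normally served by the failing region.
    """
    for i in range(1, steps + 1):
        moved = round(100 * i / steps)
        yield {failing: 100 - moved, healthy: moved}


for weights in shift_schedule("us-east-1", "us-west-2"):
    # A real automation would push these weights to the DNS provider and
    # wait for caches to warm and autoscaling to catch up before continuing.
    print(weights)
```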

Validating the Whole Thing Works

Dev-Ops in N Regions

• Best practices: avoiding peak times for deployment
• Early problem detection / rollbacks
• Automated canaries / continuous delivery
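An automated canary decision can be reduced to comparing the canary's health with the current baseline. The sketch below compares error rates only; the `canary_verdict` function, the `tolerance` factor, and the sample numbers are assumptions for illustration, and a production canary analysis would typically look at many more metrics.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests, tolerance=1.5):
    """Return "promote" or "rollback" by comparing error rates."""
    if baseline_requests == 0 or canary_requests == 0:
        raise ValueError("need traffic on both baseline and canary")
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # Allow the canary some headroom over the baseline before rolling back.
    if canary_rate > baseline_rate * tolerance:
        return "rollback"
    return "promote"


print(canary_verdict(10, 10_000, 1, 1_000))   # similar error rates
print(canary_verdict(10, 10_000, 5, 1_000))   # canary is clearly worse
```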

Hyperscale Architecture

[Diagram. Labels: Automation; DNS providers UltraDNS, DynECT DNS, and Amazon Route 53; two regions, each with Regional Load Balancers and Cassandra replicas in Zones A, B, and C.]

Does It Work?

Building Blocks Available on Netflix GitHub Site

Topic | Session # | When
What an Enterprise Can Learn from Netflix, a Cloud-native Company | ENT203 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Maximizing Audience Engagement in Media Delivery | MED303 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Scaling your Analytics with Amazon Elastic MapReduce | BDT301 | Thursday, Nov 14, 4:15 PM - 5:15 PM
Automated Media Workflows in the Cloud | MED304 | Thursday, Nov 14, 5:30 PM - 6:30 PM
Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce for Monitoring at Gigascale | BDT302 | Thursday, Nov 14, 5:30 PM - 6:30 PM
Encryption and Key Management in AWS | SEC304 | Friday, Nov 15, 9:00 AM - 10:00 AM
Your Linux AMI: Optimization and Performance | CPN302 | Friday, Nov 15, 11:30 AM - 12:30 PM

Takeaways

Embrace isolation and redundancy for availability

NetflixOSS helps everyone to become cloud native

http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix

@rusmeshenberg
@NetflixOSS

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.