Netflix Story of Embracing the Cloud

Netflix: Embracing the Cloud

Neil Hunt, CPO / Yury Izrailevsky, VP Engineering

Upload: kate-karniouchina

Post on 12-Nov-2014


DESCRIPTION

Neil Hunt and Yury Izrailevsky talk at the 2012 AWS re:Invent conference about embracing the cloud.

TRANSCRIPT

Page 1: Netflix Story of Embracing the Cloud

Netflix: Embracing the Cloud

Neil Hunt, CPO / Yury Izrailevsky, VP Engineering

Page 2: Netflix Story of Embracing the Cloud

Embracing the Cloud: Confronting the Challenge

Neil Hunt

Page 3: Netflix Story of Embracing the Cloud

Motivation

Netflix – Service Unavailable – Database Crashed

Rest assured that the right people are losing sleep to fix this problem!

We expect to resume service in approximately 72h

12 Aug 2008 03:12am

Page 4: Netflix Story of Embracing the Cloud

A Business in Transition

OLD – DVD delivery

• Value from DVDs at home
• Website load small and predictable
• Traditional DC technology: Linux, Apache, Oracle, Java

NEW – Streaming

• Value via Internet delivery
• Website and APIs high load and rapidly growing
• Need more robustness
• Cloud as opportunity for fresh start

Page 5: Netflix Story of Embracing the Cloud

Mission: Cloud – High Level Goals

• Availability: 4 x nines
• Scale: Unconstrained horizontal scaling
• Performance: Unlimited compute

Page 6: Netflix Story of Embracing the Cloud

Forklift, or Rewrite?

OLD              NEW
Monolithic App   Service Assembly
Oracle           NoSQL

Page 7: Netflix Story of Embracing the Cloud

Old Style – A large 18 wheeler

• Big
• Reliable
• Efficient (when full)
• Expensive
• Inflexible capacity
• Many single points of failure

Page 8: Netflix Story of Embracing the Cloud

New Style – A fleet of leased pickups with drivers

• Scalable to small or large loads
• Reliability through redundancy
• Requires rethinking the whole problem

Page 9: Netflix Story of Embracing the Cloud

SQL or NoSQL?

MySQL/RDB:

• Developer familiarity
• Developers imagine transactional consistency requirements in every scenario

NoSQL:

• Availability & Scale
• Avoid overhead and risk of managing SQL

• Experimented with both
• Ended up with NoSQL for almost everything important

Page 10: Netflix Story of Embracing the Cloud

Service Oriented Architecture

• Optimizes for small independent teams with well-defined interfaces

• Better independence from subsystem failures

• Scaling applied to each tier separately

Page 11: Netflix Story of Embracing the Cloud

How to Manage the Migration?

Rebuilding a complex system while in operation

[Diagram: Monolithic App and Oracle alongside NoSQL]

Page 12: Netflix Story of Embracing the Cloud

Transitional Infrastructure: “Roman Riding”

Page 13: Netflix Story of Embracing the Cloud

Transitional Infrastructure: Create a read-only copy

Example: Membership records

[Diagram: Monolithic App with Oracle as the source of truth; NoSQL as the display-only copy]

Page 14: Netflix Story of Embracing the Cloud

Transitional Infrastructure: Move the master copy

Example: AB Test Data (account tags controlling test experience)

[Diagram: NoSQL as the source of truth; Monolithic App with Oracle as the display-only copy]

Page 15: Netflix Story of Embracing the Cloud

Transitional Infrastructure: Full Multi-Master duplicate

Example: Queue

[Diagram: NoSQL and Monolithic App with Oracle, both multi-master]
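The three transitional patterns above (read-only copy, moved master, multi-master) can be sketched as a thin shim that decides which side may write. This is an illustration only, not the implementation described in the talk: the `Mode` enum, the `TransitionalStore` class, and the in-memory dicts standing in for Oracle and NoSQL are all hypothetical, and replication is done synchronously for simplicity.

```python
from enum import Enum


class Mode(Enum):
    READ_ONLY_COPY = 1  # legacy store is the source of truth; cloud copy is display-only
    MASTER_MOVED = 2    # cloud store is the source of truth; legacy copy is display-only
    MULTI_MASTER = 3    # both sides accept writes and are kept in sync


class TransitionalStore:
    """Hypothetical shim that routes writes while data migrates between stores."""

    def __init__(self, mode):
        self.mode = mode
        self.legacy = {}  # stands in for the Oracle database
        self.cloud = {}   # stands in for the NoSQL store

    def write_from_legacy(self, key, value):
        """A write issued by the monolithic datacenter application."""
        if self.mode is Mode.MASTER_MOVED:
            raise PermissionError("legacy copy is display-only in this mode")
        self.legacy[key] = value
        self.cloud[key] = value  # replicate to the other side

    def write_from_cloud(self, key, value):
        """A write issued by a cloud service."""
        if self.mode is Mode.READ_ONLY_COPY:
            raise PermissionError("cloud copy is display-only in this mode")
        self.cloud[key] = value
        self.legacy[key] = value  # replicate to the other side

    def read_from_legacy(self, key):
        return self.legacy.get(key)

    def read_from_cloud(self, key):
        return self.cloud.get(key)
```

A real migration would replicate asynchronously and, in the multi-master case, would need a conflict-resolution rule; neither is modeled here.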

Page 16: Netflix Story of Embracing the Cloud

Organizational Challenges

IT Ops: A gradually diminishing role

• Initial extensive role managing legacy DC
• Raised visibility during transition
• New DC vulnerabilities and dependencies to manage

DevOps: A rapidly expanding role

• Components at a higher level of abstraction
• More opportunities for automation
• Automated build-push tools
• Autoscaling
• Monitoring and automatic cutouts and failover

Page 17: Netflix Story of Embracing the Cloud

The Journey

Phase                     Components                      Data & Prerequisites
Trial (2009)              Streaming Player                Content keys (RO); Membership status (RO)
Development (2010-11)     Member product pages and APIs   Content catalog (RW); Personalization data (RW) & recs algorithms; AB Test data (RW)
Follow-through (2011-12)  Account and membership          Membership data (RW)
Final (2013)              Payments                        PCI and SOX data

Page 18: Netflix Story of Embracing the Cloud

Lessons Learned…

• Embrace the whole concept: Take the opportunity to build a modern architecture rather than forklifting SQL and monolithic apps

• Plan to discard your first experiments: You’ll learn so much that you’ll be glad to redo it right

• Invest in transitional infrastructure: Migration will take a while, and it’s worth the effort to make it easy

• Expect your team to learn new ways … but some won’t make the transition

Page 19: Netflix Story of Embracing the Cloud

Embracing the Cloud: Delivering the Cloud Solution

Yury Izrailevsky

Page 20: Netflix Story of Embracing the Cloud

Mission: Cloud – High Level Goals

• Availability: 4 x nines
• Scale: Unconstrained horizontal scaling
• Performance: Unlimited compute

Page 21: Netflix Story of Embracing the Cloud

Performance Scalability Availability

Page 22: Netflix Story of Embracing the Cloud

Performance Scalability Availability

Page 23: Netflix Story of Embracing the Cloud

Scaling Netflix Streaming Service: Weekly Streaming Starts

[Chart: weekly streaming starts, January 2009 through August 2012]

Page 24: Netflix Story of Embracing the Cloud

Netflix Cross-Regional Cloud Architecture

Page 25: Netflix Story of Embracing the Cloud

Goal: Regional Failover

Page 26: Netflix Story of Embracing the Cloud

Building Global Netflix Streaming Product

Page 27: Netflix Story of Embracing the Cloud

Performance Scalability Availability

Page 28: Netflix Story of Embracing the Cloud

Weekly Cloud Cost Per Streaming Start (last 12 months)


Page 29: Netflix Story of Embracing the Cloud

Simian Army: Cloud Efficiency Automation

Janitor Monkey

• Regularly scrape unused capacity
• Clean up instances, ASGs, ELBs, SGs, etc.

Efficiency Monkey

• AI-based resource under-usage detection (CPU, memory, etc.)

Automated Deletion of Old Data

• TTL for S3 (using ObjectExpiration)
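As a rough illustration of the S3 TTL idea, the sketch below builds a lifecycle configuration that expires objects under a key prefix after a number of days. The helper names, the rule ID, the `logs/` prefix, the 30-day window, and the bucket name in the comment are all assumptions made for the example; the slides do not say which prefixes or retention periods were used.

```python
def expiration_rule(rule_id, prefix, days):
    """Build one S3 lifecycle rule expiring objects under `prefix` after `days` days."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def lifecycle_configuration(rules):
    """Wrap a sequence of rules in the shape expected for a lifecycle configuration."""
    return {"Rules": list(rules)}


# Hypothetical values: expire anything under logs/ after 30 days.
config = lifecycle_configuration([expiration_rule("expire-old-logs", "logs/", 30)])

# With boto3 installed and credentials configured, the configuration could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Letting the storage service enforce the TTL means no cleanup job has to scan and delete old objects itself.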


Page 30: Netflix Story of Embracing the Cloud

Cyclical Streaming Usage Pattern


Page 31: Netflix Story of Embracing the Cloud

Load-Based Auto Scaling

• Move to Load-Based Scaling
• Scale up/down by 70%+
• 50%+ Cost Saving
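One simple way to think about load-based scaling is a proportional rule: resize the group so that per-instance load returns to a target level. The function below is a minimal sketch of that rule under assumed inputs (an average load metric per instance and a target for it); it is not the scaling policy used in the talk, and the name `desired_instances` and its parameters are hypothetical.

```python
import math


def desired_instances(current, avg_load, target_load, min_size, max_size):
    """Return the group size that brings per-instance load back to `target_load`.

    `avg_load` and `target_load` use the same unit (for example percent CPU).
    The result is clamped to the range [min_size, max_size].
    """
    if current <= 0 or target_load <= 0:
        raise ValueError("current and target_load must be positive")
    wanted = math.ceil(current * avg_load / target_load)
    return max(min_size, min(max_size, wanted))


# Evening peak: 10 instances at 90% load against a 60% target, so the group grows.
peak_size = desired_instances(10, 90.0, 60.0, min_size=2, max_size=100)

# Overnight trough: 10 instances at 18% load, so the group shrinks.
trough_size = desired_instances(10, 18.0, 60.0, min_size=2, max_size=100)
```

With a cyclical daily usage pattern, a rule like this lets capacity follow demand instead of being provisioned for the peak all day.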

Page 32: Netflix Story of Embracing the Cloud

Performance Scalability Availability

Page 33: Netflix Story of Embracing the Cloud

A Truly Great Service… Has To Just Work!

Availability Goal: 99.99% (30 secs/week at peak traffic)
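To put an availability goal in concrete terms, the sketch below converts a target into a downtime budget. It assumes downtime is counted evenly over a full 7-day week, which gives roughly 60 seconds at 99.99%; the slide's figure of 30 secs/week is stated "at peak traffic", and the weighting behind it is not given in the source. The function name is hypothetical.

```python
def downtime_budget_seconds(availability, period_seconds):
    """Seconds of allowed downtime in a period for a given availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


WEEK_SECONDS = 7 * 24 * 60 * 60

# Compare two, three, and four nines over a full week.
for target in (0.99, 0.999, 0.9999):
    budget = downtime_budget_seconds(target, WEEK_SECONDS)
    print(f"{target:.2%} available -> {budget:.1f} s of downtime per week")
```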

Page 34: Netflix Story of Embracing the Cloud

Using Redundancy in AWS Infrastructure to Survive Failures

[Chart: Historical Streaming Availability (13-week moving average), July 2011 through November 2012; annotated with the June 29th, 2012 AWS / Netflix Outage and Other AWS Outages]

Page 35: Netflix Story of Embracing the Cloud

Cascading Failures


API

InstantQueue

SimpleDB

Page 36: Netflix Story of Embracing the Cloud

Netflix Cloud Architecture


Page 37: Netflix Story of Embracing the Cloud

Cascading Failures


99% Availability × 99% Availability × … × 99% Availability

$0.99^{300} \approx 4.90\%$
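The arithmetic behind this slide can be checked directly. The sketch below assumes every component fails independently and that the request fails if any one of them fails, which is the worst case the slide illustrates; the function names are hypothetical.

```python
def compound_availability(per_service, count):
    """Availability of `count` independent services that must all be up at once."""
    return per_service ** count


def required_per_service(target, count):
    """Per-service availability needed for `count` services in series to meet `target`."""
    return target ** (1.0 / count)


# 300 services at 99% each, all required: overall availability collapses.
overall = compound_availability(0.99, 300)
print(f"overall availability: {overall:.2%}")

# What each of the 300 services would need to reach a four-nines overall goal.
needed = required_per_service(0.9999, 300)
print(f"required per service: {needed:.7%}")
```

The second function shows why hardening each service individually is not a practical route to four nines, which motivates the degradation and redundancy strategies that follow.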

Page 38: Netflix Story of Embracing the Cloud

Strategies to Improve Availability


• Graceful Degradation
• Redundancy

Page 39: Netflix Story of Embracing the Cloud

Graceful Degradation
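A common way to implement graceful degradation is to wrap each dependency call with a fallback, so that a failing dependency produces a reduced response instead of an error that cascades upward. The sketch below illustrates the idea with a simulated queue lookup; `with_fallback`, `fetch_instant_queue`, and `empty_queue` are hypothetical names, and the always-failing fetch stands in for an unavailable datastore.

```python
def with_fallback(primary, fallback):
    """Return a function that calls `primary` and uses `fallback` if it raises."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Any failure in the dependency is absorbed here instead of propagating.
            return fallback(*args, **kwargs)
    return call


def fetch_instant_queue(member_id):
    """Simulated dependency call; always fails to model an outage."""
    raise ConnectionError("queue datastore unavailable")


def empty_queue(member_id):
    """Degraded response: the page still renders, just without queue contents."""
    return {"member_id": member_id, "titles": [], "degraded": True}


get_queue = with_fallback(fetch_instant_queue, empty_queue)
response = get_queue(42)
```

A production version would usually add timeouts and stop calling a dependency that keeps failing; those parts are left out to keep the sketch short.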


Page 40: Netflix Story of Embracing the Cloud

Redundancy

Redundancy Across Availability Zones

• Zone A, Zone B, Zone C
• Cassandra: A, B, C

Storage Redundancy Across Regions, Vendors

• S3 Backup
• Secure Cloud Backup
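The value of spreading replicas across zones can be shown with a small placement sketch. It assigns replicas to zones round-robin and then checks what survives the loss of one zone. The function names and the six-replica example are assumptions for illustration; they do not describe how Cassandra or any AWS service places data.

```python
from collections import Counter
from itertools import cycle


def place_replicas(replica_ids, zones):
    """Assign each replica to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return {replica: next(zone_cycle) for replica in replica_ids}


def surviving_replicas(placement, failed_zone):
    """Replicas still available after every replica in `failed_zone` is lost."""
    return [replica for replica, zone in placement.items() if zone != failed_zone]


placement = place_replicas(range(6), ["A", "B", "C"])
per_zone = Counter(placement.values())           # how many replicas landed in each zone
after_outage = surviving_replicas(placement, "B")  # what is left if Zone B goes down
```

With three zones and an even spread, losing any single zone leaves two thirds of the replicas in service.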

Page 41: Netflix Story of Embracing the Cloud

Testing Fault Tolerance: Simian Army


• Chaos Monkey
• Latency Monkey
• Chaos Gorilla
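The idea behind Chaos Monkey, deliberately killing instances to verify that the service tolerates the loss, can be illustrated with a small simulation. This is not the open-source Simian Army code: `InstanceGroup`, `unleash_chaos`, and the instance IDs are hypothetical, and no real infrastructure is touched.

```python
import random


class InstanceGroup:
    """Simulated group of instances that needs `min_healthy` members to serve traffic."""

    def __init__(self, instance_ids, min_healthy):
        self.healthy = set(instance_ids)
        self.min_healthy = min_healthy

    def terminate(self, instance_id):
        self.healthy.discard(instance_id)

    def is_serving(self):
        return len(self.healthy) >= self.min_healthy


def unleash_chaos(group, rng):
    """Terminate one randomly chosen healthy instance and return its ID (or None)."""
    if not group.healthy:
        return None
    victim = rng.choice(sorted(group.healthy))  # sorted for a reproducible choice order
    group.terminate(victim)
    return victim


group = InstanceGroup(["i-1", "i-2", "i-3"], min_healthy=2)
victim = unleash_chaos(group, random.Random(0))
still_serving = group.is_serving()  # one spare instance absorbs the loss
```

Running such a test routinely, rather than waiting for a real failure, is what turns redundancy from an assumption into something verified.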

Page 42: Netflix Story of Embracing the Cloud

Open Source Portal at http://netflix.github.com

Page 43: Netflix Story of Embracing the Cloud

Superstorm Sandy

AWS Infrastructure Held Up

>2x Netflix Streaming Usage in East Coast Markets

Boston

New York

Philadelphia

Baltimore

D.C.

Page 44: Netflix Story of Embracing the Cloud

Focus on Building a Great Streaming Product


Page 45: Netflix Story of Embracing the Cloud

Netflix at 2012 re:Invent

Date/Time Presenter Topic

Wed 8:30-10:00 Reed Hastings Keynote with Andy Jassy

Wed 1:00-1:45 Coburn Watson Optimizing Costs with AWS

Wed 2:05-2:55 Kevin McEntee Netflix’s Transcoding Transformation

Wed 3:25-4:15 Neil Hunt / Yury I. Netflix: Embracing the Cloud

Wed 4:30-5:20 Adrian Cockcroft High Availability Architecture at Netflix

Thu 10:30-11:20 Jeremy Edberg Rainmakers – Operating Clouds

Thu 11:35-12:25 Kurt Brown Data Science with Elastic Map Reduce (EMR)

Thu 11:35-12:25 Jason Chan Security Panel: Learn from CISOs working with AWS

Thu 3:00-3:50 Adrian Cockcroft Compute & Networking Masters Customer Panel

Thu 3:00-3:50 Ruslan M./Gregg U. Optimizing Your Cassandra Database on AWS

Thu 4:05-4:55 Ariel Tseitlin Intro to Chaos Monkey and the Simian Army

Page 46: Netflix Story of Embracing the Cloud

We are sincerely eager to hear your feedback on this presentation and on re:Invent.

Please fill out an evaluation form when you have a chance.
