
Page 1: Cloud Native Architecture at Netflix

Cloud Native Architecture at Netflix

Yow December 2013 (Brisbane)

Adrian Cockcroft @adrianco @NetflixOSS

http://www.linkedin.com/in/adriancockcroft

Page 2: Cloud Native Architecture at Netflix

Netflix History (current size)

• 1998 DVD Shipping Service in USA (~7M users)

• 2007 Streaming video in USA (~31M users)

• International streaming video (~9M users)

– 2010 Canada

– 2011 Latin America

– 2012 UK and Ireland

– 2012 Nordics

– 2013 Netherlands

Page 3: Cloud Native Architecture at Netflix

Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?

Page 4: Cloud Native Architecture at Netflix

How Netflix Used to Work

[Diagram: Customer Device (PC, PS3, TV…) → Monolithic Web App (Oracle, MySQL) and Monolithic Streaming App (Oracle, MySQL) running in the Datacenter; Limelight/Level 3/Akamai CDNs at the CDN Edge Locations; Content Management and Content Encoding; swim lanes: Consumer Electronics, AWS Cloud Services, CDN Edge Locations, Datacenter.]

Page 5: Cloud Native Architecture at Netflix

How Netflix Streaming Works Today

[Diagram: Customer Device (PC, PS3, TV…) → Web Site or Discovery API and Streaming API on AWS Cloud Services, with User Data, Personalization, DRM and QoS Logging services; OpenConnect CDN Boxes at the CDN Edge Locations; CDN Management and Steering plus Content Encoding; swim lanes: Consumer Electronics, AWS Cloud Services, CDN Edge Locations, Datacenter.]

Page 6: Cloud Native Architecture at Netflix
Page 7: Cloud Native Architecture at Netflix

Netflix Scale

• Tens of thousands of instances on AWS

– Typically 4 cores, 30 GByte RAM, running Java business logic

– Thousands created/removed every day

• Thousands of Cassandra NoSQL nodes on AWS

– Many hi1.4xlarge – 8 cores, 60 GByte RAM, 2 TByte of SSD

– 65 different clusters, over 300 TB of data, triple zone

– Over 40 are multi-region clusters (6, 9 or 12 zones)

– Biggest is 288 × m2.4xlarge – over 300K rps, 1.3M wps

Page 8: Cloud Native Architecture at Netflix

Reactions over time

2009 – “You guys are crazy! Can’t believe it”
2010 – “What Netflix is doing won’t work”
2011 – “It only works for ‘Unicorns’ like Netflix”
2012 – “We’d like to do that but can’t”
2013 – “We’re on our way using Netflix OSS code”

Page 9: Cloud Native Architecture at Netflix

YOW! Workshop

175 slides of Netflix Architecture

See bit.ly/netflix-workshop

A whole day…

Page 10: Cloud Native Architecture at Netflix

This Talk

Abstract the principles from the architecture

Page 11: Cloud Native Architecture at Netflix

Objectives:

Scalability

Availability

Agility

Efficiency

Page 12: Cloud Native Architecture at Netflix

Principles:

Immutability

Separation of Concerns

Anti-fragility

High trust organization

Sharing

Page 13: Cloud Native Architecture at Netflix

Outcomes:

• Public cloud – scalability, agility, sharing

• Micro-services – separation of concerns

• De-normalized data – separation of concerns

• Chaos Engines – anti-fragile operations

• Open source by default – agility, sharing

• Continuous deployment – agility, immutability

• DevOps – high trust organization, sharing

• Run-what-you-wrote – anti-fragile development

Page 14: Cloud Native Architecture at Netflix

When to use public cloud?

Page 15: Cloud Native Architecture at Netflix
Page 16: Cloud Native Architecture at Netflix

"This is the IT swamp draining manual for anyone who is neck deep in alligators."- Adrian Cockcroft, Cloud Architect at Netflix

Page 17: Cloud Native Architecture at Netflix

Goal of Traditional IT: Reliable hardware

running stable software

Page 18: Cloud Native Architecture at Netflix

SCALE Breaks hardware

Page 19: Cloud Native Architecture at Netflix

…SPEED Breaks software

Page 20: Cloud Native Architecture at Netflix

SPEED at SCALE

Breaks everything

Page 21: Cloud Native Architecture at Netflix
Page 22: Cloud Native Architecture at Netflix

Incidents – Impact and Mitigation

X incidents – PR: Public Relations Media Impact; Y incidents mitigated by Active-Active and game day practicing

XX incidents – CS: High Customer Service Calls; YY incidents mitigated by better tools and practices

XXX incidents – Metrics impact, feature disable: Affects A/B Test Results; YYY incidents mitigated by better data tagging

XXXX incidents – No Impact: fast retry or automated failover

Page 23: Cloud Native Architecture at Netflix

Web Scale Architecture

[Diagram: two AWS regions, each with Cassandra Replicas in Zones A, B and C behind Regional Load Balancers; UltraDNS, DynECT and AWS Route53 sit in front under a DNS Automation layer.]
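The DNS automation layer is what fails traffic away from a bad region. A minimal sketch of that decision, assuming weighted DNS records and a simple health probe; the class, method names and health data are illustrative, not Netflix's actual tooling:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only (names are assumptions, not Netflix tooling):
// recompute per-region DNS weights from region health so an unhealthy region
// is drained and the remaining healthy regions absorb its traffic.
public class DnsSteeringSketch {

    // Stand-in for real health probes against each region's load balancers.
    static final Set<String> UNHEALTHY = Set.of("us-east-1");

    static boolean healthy(String region) {
        return !UNHEALTHY.contains(region);
    }

    // Healthy regions split the weight evenly; unhealthy regions get zero.
    static Map<String, Integer> computeWeights(String... regions) {
        long healthyCount = Arrays.stream(regions).filter(DnsSteeringSketch::healthy).count();
        Map<String, Integer> weights = new LinkedHashMap<>();
        for (String region : regions) {
            weights.put(region, healthy(region) ? (int) (100 / Math.max(healthyCount, 1)) : 0);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Two regions, as in the diagram; prints {us-east-1=0, us-west-2=100}.
        System.out.println(computeWeights("us-east-1", "us-west-2"));
    }
}
```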

Page 24: Cloud Native Architecture at Netflix

Colonel Boyd, USAF

“Get inside your adversaries' OODA loop to disorient them”

Page 25: Cloud Native Architecture at Netflix

“Agile” vs. “Continuous Delivery”

Speed Wins

Page 26: Cloud Native Architecture at Netflix

[Diagram: OODA loop – Observe → Orient → Decide → Act – applied to the 2 week agile train/sprint, with steps including Territory Expansion, Competitive Moves, Customer Pain Point, Measure Sales, Data Warehouse, Business Buy-in, 2 Week Plan, Feature Priority, Code Feature, Install Capacity, Web Display Ads and Capacity Estimate.]

Page 27: Cloud Native Architecture at Netflix

2 Week Train Model Hand-Off Steps

Product Manager – 2 days

Developer – 2 days coding, 2 days meetings

QA Integration Team – 3 days

Operations Deploy Team – 4 days

BI Analytics Team – 1 day

Page 28: Cloud Native Architecture at Netflix

What’s Next?

Increase rate of change

Reduce the cost, size and risk of change

Page 29: Cloud Native Architecture at Netflix

Cloud Native

Construct a highly agile and highly available service from ephemeral and assumed broken components

Page 30: Cloud Native Architecture at Netflix

Cloud Native Microservices

[Diagram: a call graph of microservice icons – web services, memcached, Cassandra clusters and S3 buckets – fanning out from a “Start Here” entry point.]

Each icon is three to a few hundred instances across three AWS zones

Each microservice is updated independently and continuously at whatever rate seems appropriate

Page 31: Cloud Native Architecture at Netflix

Continuous Deployment

No time for handoff to IT

Page 32: Cloud Native Architecture at Netflix

Developer Self Service

Freedom and Responsibility

Page 33: Cloud Native Architecture at Netflix

Developers run what they wrote

Root access and pagerduty

Page 34: Cloud Native Architecture at Netflix

IT is a Cloud API

DEVops automation

Page 35: Cloud Native Architecture at Netflix

Github all the things!

Leverage social coding

Page 36: Cloud Native Architecture at Netflix

Netflix.github.com

35 repos… 36 repos… 37 today

Page 37: Cloud Native Architecture at Netflix

Karyon Microservice Template

• Bootstrapping

o Dependency & lifecycle management via Governator

o Service registry via Eureka

o Property management via Archaius

o Hooks for Latency Monkey injection testing

o Preconfigured status page and healthcheck servlets (a minimal sketch follows below)
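Karyon wires those status and healthcheck servlets up for you; purely to show the concept, here is a stand-alone healthcheck endpoint using the JDK's built-in HttpServer. This is not Karyon's API, and the port and the dependency check are assumptions.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal stand-in for the kind of healthcheck endpoint Karyon preconfigures.
// Uses the JDK's built-in HttpServer; this is NOT Karyon's API, just the idea.
public class HealthcheckSketch {

    // In a real service this would check downstream dependencies, Eureka
    // registration state, etc.; here it is a hard-coded placeholder.
    static boolean dependenciesHealthy() {
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Port 8077 is an arbitrary choice for this sketch.
        HttpServer server = HttpServer.create(new InetSocketAddress(8077), 0);
        server.createContext("/healthcheck", exchange -> {
            boolean up = dependenciesHealthy();
            byte[] body = (up ? "UP" : "DOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(up ? 200 : 500, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        System.out.println("Healthcheck listening on :8077/healthcheck");
    }
}
```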

Page 38: Cloud Native Architecture at Netflix

Karyon Microservice Status Page

o Eureka discovery service metadata

o Environment variables

o JMX

o Versions

o Conformity Monkey support

Page 39: Cloud Native Architecture at Netflix

Sample Application – RSS Reader

Page 40: Cloud Native Architecture at Netflix

Putting it all together…

Page 41: Cloud Native Architecture at Netflix

[Diagram: OODA loop – Observe → Orient → Decide → Act – for continuous delivery on cloud, with steps including Land grab opportunity, Competitive Move, Customer Pain Point, Measure Customers, Analysis, Model Hypotheses, Plan Response, Share Plans, JFDI, Increment Implement, Automatic Deploy and Launch A/B Test.]

Page 42: Cloud Native Architecture at Netflix

Continuous Deploy Hand-Off

Product Manager – 2 days: A/B test setup and enable; self service hypothesis test results

Developer – 2 days: automated test; automated deploy, on call; self service analytics

Page 43: Cloud Native Architecture at Netflix

Continuous Deploy Automation

Check in code, Jenkins build

Bake AMI, launch in test env

Functional and performance test

Production canary test

Production red/black push
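A sketch of the last two stages of the pipeline above, assuming a small canary cluster of the new AMI runs beside the current production cluster: promote (the red/black traffic switch) only if the canary's error rate stays close to the baseline, otherwise roll back. The cluster names, metrics and tolerance are illustrative, not Netflix's actual deployment tooling.

```java
// Illustrative sketch only: the canary gate and red/black decision.
// A canary cluster of the new build runs next to production; if its error
// rate stays within tolerance of the baseline, the new cluster takes all
// traffic and the old one is kept idle briefly for instant rollback.
public class CanaryRedBlackSketch {

    // Hypothetical metrics; a real pipeline would pull these from monitoring.
    static double errorRate(String cluster) {
        return cluster.startsWith("canary") ? 0.011 : 0.010;
    }

    // Promote only if the canary's error rate is within tolerance of production.
    static boolean canaryHealthy(String canary, String baseline, double tolerance) {
        return errorRate(canary) <= errorRate(baseline) * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        String current = "api-v123";          // current production cluster
        String next    = "api-v124";          // freshly baked AMI cluster
        String canary  = "canary-api-v124";

        if (canaryHealthy(canary, current, 0.20)) {
            System.out.println("Canary OK: shift traffic " + current + " -> " + next
                    + ", keep " + current + " idle for fast rollback");
        } else {
            System.out.println("Canary failed: terminate " + next + ", keep " + current);
        }
    }
}
```

Keeping the old cluster idle but warm for a short window is what makes the red/black push cheap to roll back.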

Page 44: Cloud Native Architecture at Netflix

Bad Canary Signature

Page 45: Cloud Native Architecture at Netflix

Happy Canary Signature

Page 46: Cloud Native Architecture at Netflix

Global Deploy Automation

[Diagram: three regions – West Coast, East Coast and Europe Load Balancers – each with Cassandra Replicas in Zones A, B and C. The rollout follows the sun: the push starts in the afternoon in California, which is night-time in Europe; if it passes the test suite it is canaried and deployed after peak in Europe, then canaried and deployed the next day on the East Coast, then the next day on the West Coast.]
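A sketch of the follow-the-sun scheduling idea above, assuming each region is only touched during a local off-peak window and each stage runs canary-then-deploy. The regions, time zones and window are assumptions for illustration, not Netflix's actual tool.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "follow the sun" rollout: touch a region only when it is past
// its local traffic peak, canary first, then deploy. Assumed logic only.
public class FollowTheSunRolloutSketch {

    // Region -> time zone used to judge "off-peak"; ordering is the rollout order.
    static final Map<String, ZoneId> REGIONS = new LinkedHashMap<>();
    static {
        REGIONS.put("eu-west-1", ZoneId.of("Europe/London"));
        REGIONS.put("us-east-1", ZoneId.of("America/New_York"));
        REGIONS.put("us-west-2", ZoneId.of("America/Los_Angeles"));
    }

    // Assumed off-peak window: roughly 01:00 to 06:00 local time.
    static boolean offPeak(ZoneId zone) {
        int hour = ZonedDateTime.now(zone).getHour();
        return hour >= 1 && hour < 6;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, ZoneId> region : REGIONS.entrySet()) {
            if (offPeak(region.getValue())) {
                System.out.println(region.getKey() + ": run canary, then deploy");
            } else {
                System.out.println(region.getKey() + ": wait for local off-peak window");
            }
        }
    }
}
```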

Page 47: Cloud Native Architecture at Netflix

Ephemeral Instances

• Largest services are autoscaled

• Average lifetime of an instance is 36 hours

[Chart: instance count over time, annotated with Push, Autoscale Up and Autoscale Down events]

Page 48: Cloud Native Architecture at Netflix

Scryer – Predictive Auto-scaling (see techblog.netflix.com)

[Chart: 24 hours of predicted vs. actual traffic – more morning load, high traffic on Sat/Sun, lower load on Wednesdays]

FFT-based prediction driving the AWS Autoscaler to plan minimum capacity
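Scryer's actual predictor is FFT based, per the slide; the sketch below swaps in a much simpler same-hour-of-previous-days average, purely to show the shape of "forecast tomorrow's traffic, then hand the autoscaler a capacity floor". All numbers and names are illustrative.

```java
// Simplified stand-in for predictive auto-scaling: average the same hour
// across previous days to forecast the next day, then derive a minimum
// instance count with headroom that would be handed to the autoscaler.
public class PredictiveScalingSketch {

    // Predict requests/sec for each hour of the next day from hourly history
    // (history length must be a multiple of 24; most recent day last).
    static double[] predictNextDay(double[] hourlyHistory) {
        int days = hourlyHistory.length / 24;
        double[] forecast = new double[24];
        for (int hour = 0; hour < 24; hour++) {
            double sum = 0;
            for (int day = 0; day < days; day++) {
                sum += hourlyHistory[day * 24 + hour];
            }
            forecast[hour] = sum / days;
        }
        return forecast;
    }

    // Convert predicted load into a minimum instance count with headroom.
    static int minInstances(double predictedRps, double rpsPerInstance, double headroom) {
        return (int) Math.ceil(predictedRps * (1.0 + headroom) / rpsPerInstance);
    }

    public static void main(String[] args) {
        // Two days of made-up hourly traffic with a daily cycle.
        double[] history = new double[48];
        for (int i = 0; i < 48; i++) {
            history[i] = 10_000 + 8_000 * Math.sin(Math.PI * (i % 24) / 24.0);
        }
        double[] forecast = predictNextDay(history);
        for (int hour = 0; hour < 24; hour += 6) {
            System.out.printf("hour %02d: ~%.0f rps -> min %d instances%n",
                    hour, forecast[hour], minInstances(forecast[hour], 1_000, 0.25));
        }
    }
}
```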

Page 49: Cloud Native Architecture at Netflix

Suro Event Pipeline

1.5 million events/s, 80 billion events/day

Cloud native, dynamic, configurable offline and realtime data sinks

Error rate alerting
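This is not Suro's API, just a conceptual sketch of "configurable offline and realtime data sinks": events are routed by configuration to a batching sink (stand-in for an offline store) and an immediate sink (stand-in for realtime processing such as the error rate alerting mentioned above). All class names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Conceptual sketch only (not Suro's actual API): events fan out to pluggable
// sinks chosen by configuration -- a batching "offline" sink and an immediate
// "realtime" sink.
public class EventPipelineSketch {

    interface EventSink { void accept(String event); }

    // Buffers events and flushes them in batches, like an offline/batch sink.
    static class BatchingSink implements EventSink {
        private final List<String> buffer = new ArrayList<>();
        private final int batchSize;
        BatchingSink(int batchSize) { this.batchSize = batchSize; }
        public void accept(String event) {
            buffer.add(event);
            if (buffer.size() >= batchSize) {
                System.out.println("flush batch of " + buffer.size() + " events to offline store");
                buffer.clear();
            }
        }
    }

    // Processes each event immediately, like a realtime sink feeding alerting.
    static class RealtimeSink implements EventSink {
        public void accept(String event) {
            if (event.contains("ERROR")) {
                System.out.println("realtime alert: " + event);
            }
        }
    }

    public static void main(String[] args) {
        // "Configuration": which sinks receive which event types.
        Map<String, List<EventSink>> routes = Map.of(
                "playback", List.of(new BatchingSink(3), new RealtimeSink()));

        for (String e : List.of("playback start", "playback ERROR codec",
                "playback stop", "playback start")) {
            routes.get("playback").forEach(sink -> sink.accept(e));
        }
    }
}
```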

Page 50: Cloud Native Architecture at Netflix

Inspiration

Page 51: Cloud Native Architecture at Netflix

Principles Revisited:

Immutability

Separation of Concerns

Anti-fragility

High trust organization

Sharing

Page 52: Cloud Native Architecture at Netflix

Takeaway

Speed Wins

Assume Broken

Cloud Native Automation

Github is the “app store” and résumé

@adrianco @NetflixOSS

http://netflix.github.com