cmg2013 workshop: netflix cloud native, capacity, performance and cost optimization techniques

128
Cloud Native, Capacity, Performance and Cost Optimization Tools and Techniques CMG Workshop November 2013 Adrian Cockcroft @adrianco @NetflixOSS http://www.linkedin.com/in/adriancockcroft

Upload: adrian-cockcroft

Post on 06-May-2015

7.946 views

Category:

Technology


1 download

DESCRIPTION

For the Computer Measurement Group workshop in San Diego November 2013. Also presented to a student class at UC Santa Barbara. What is Cloud Native. Capacity and Performance benchmarks. Cost Optimization Techniques - content co-developed with Jinesh Varia of AWS.

TRANSCRIPT

Page 1: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud Native, Capacity, Performance and Cost Optimization Tools and

Techniques

CMG WorkshopNovember 2013Adrian Cockcroft@adrianco @NetflixOSS

http://www.linkedin.com/in/adriancockcroft

Page 2: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Presentation vs. Workshop

• Presentation– Short duration, focused subject– One presenter to many anonymous audience– A few questions at the end

• Workshop– Time to explore in and around the subject– Tutor gets to know the audience– Discussion, rat-holes, “bring out your dead”

Page 3: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Attendee Introductions

• Who are you, where do you work• Why are you here today, what do you need• “Bring out your dead”

– Do you have a specific problem or question?– One sentence elevator pitch

• What instrument do you play?

Page 4: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Content

Cloud Native

Migration Path

Service and API Architectures

Storage Architecture

Operations and Tools

Cost OptimizationMore?

Page 5: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud Native

What is it?Why?

Page 6: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Strive for perfection

Perfect codePerfect hardware

Perfectly operated

Page 7: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

But perfection takes too long…

Compromises…Time to market vs. Quality

Utopia remains out of reach

Page 8: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Where time to market wins big

Making a land-grabDisrupting competitors (OODA)

Anything delivered as web services

Page 9: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Observe

Orient

Decide

Act

Land grab opportunity Competitive

move

Customer Pain Point

Analysis

Get buy-in

Plan response

Commit resources

Implement

Deliver

Engage customers

Model alternatives

BIG DATA

INNOVATION

CULTURE

CLOUD

Measure customers

Colonel Boyd, USAF

“Get inside your adversaries'

OODA loop to disorient them”

Page 10: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

How Soon?

Product features in days instead of monthsDeployment in minutes instead of weeks

Incident response in seconds instead of hours

Page 11: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud NativeA new engineering challenge

Construct a highly agile and highly available service from ephemeral and

assumed broken components

Page 12: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Inspiration

Page 13: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

How to get to Cloud Native

Freedom and Responsibility for DevelopersDecentralize and Automate Ops Activities

Integrate DevOps into the Business Organization

Re-Org!

Page 14: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Four Transitions

• Management: Integrated Roles in a Single Organization– Business, Development, Operations -> BusDevOps

• Developers: Denormalized Data – NoSQL– Decentralized, scalable, available, polyglot

• Responsibility from Ops to Dev: Continuous Delivery– Decentralized small daily production updates

• Responsibility from Ops to Dev: Agile Infrastructure - Cloud– Hardware in minutes, provisioned directly by developers

Page 15: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Netflix BusDevOps OrganizationChief Product

Officer

VP Product Management

Directors Product

VP UI Engineering

Directors Development

Developers + DevOps

UI Data Sources

AWS

VP Discovery Engineering

Directors Development

Developers + DevOps

Discovery Data Sources

AWS

VP Platform

Directors Platform

Developers + DevOps

Platform Data Sources

AWS

Denormalized, independently updated and scaled data

Cloud, self service updated & scaled infrastructure

Code, independently updated continuous delivery

Page 16: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Decentralized Deployment

Page 17: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Asgard Developer Portalhttp://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

Page 18: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Ephemeral Instances

• Largest services are autoscaled• Average lifetime of an instance is 36 hours

Push

Autoscale UpAutoscale Down

Page 19: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Netflix Member Web Site Home PagePersonalization Driven – How Does It Work?

Page 20: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

How Netflix Used to Work

Customer Device (PC, PS3, TV…)

Monolithic Web App

Oracle

MySQL

Monolithic Streaming App

Oracle

MySQL

Limelight/Level 3 Akamai CDNs

Content Management

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Datacenter

Page 21: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

How Netflix Streaming Works Today

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

DRM

QoS Logging

OpenConnect CDN Boxes

CDN Management and Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Datacenter

Page 22: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

The DIY Question

Why doesn’t Netflix build and run its own cloud?

Page 23: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Fitting Into Public Scale

Public Grey Area Private

1,000 Instances 100,000 Instances

Netflix FacebookStartups

Page 24: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

How big is Public?

AWS upper bound estimate based on the number of public IP AddressesEvery provisioned instance gets a public IP by default (some VPC don’t)

AWS Maximum Possible Instance Count 4.2 Million – May 2013Growth >10x in Three Years, >2x Per Annum - http://bit.ly/awsiprange

Page 25: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

The Alternative Supplier Question

What if there is no clear leader for a feature, or AWS doesn’t have what

we need?

Page 26: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Things We Don’t Use AWS For

SaaS Applications – Pagerduty, AppdynamicsContent Delivery Service

DNS Service

Page 27: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Nov2012StreamingBandwidth

March2013

MeanBandwidth+39% 6mo

Page 28: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

CDN Scale

AWS CloudFrontAkamai

LimelightLevel 3

Netflix Openconnect

YouTube

Gigabits Terabits

NetflixFacebookStartups

Page 29: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Content Delivery ServiceOpen Source Hardware Design + FreeBSD, bird, nginx

see openconnect.netflix.com

Page 30: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

DNS Service

AWS Route53 is missing too many features (for now)Multiple vendor strategy Dyn, Ultra, Route53

Abstracted (broken) DNS APIs with Denominator

Page 31: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

What Changed?

Get out of the way of innovationBest of breed, by the hour

Choices based on scale

Cost reduction

Slow down developers

Less competitiveLess revenue

Lower margins

Process reduction

Speed up developers

More competitive

More revenue

Higher margins

Page 32: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Availability Questions

Is it running yet?How many places is it running in?How far apart are those places?

Page 33: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques
Page 34: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Netflix Outages

• Running very fast with scissors– Mostly self inflicted – bugs, mistakes from pace of change– Some caused by AWS bugs and mistakes

• Incident Life-cycle Management by Platform Team– No runbooks, no operational changes by the SREs– Tools to identify what broke and call the right developer

• Next step is multi-region active/active– Investigating and building in stages during 2013– Could have prevented some of our 2012 outages

Page 35: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Incidents – Impact and Mitigation

PRX Incidents

CSXX Incidents

Metrics impact – Feature disableXXX Incidents

No Impact – fast retry or automated failoverXXXX Incidents

Public Relations Media Impact

High Customer Service Calls

Affects AB Test Results

Y incidents mitigated by Active Active, game day practicing

YY incidents mitigated by

better tools and practices

YYY incidents mitigated by better

data tagging

Page 36: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Real Web Server Dependencies Flow(Netflix Home page business transaction as seen by AppDynamics)

Start Here

memcached

Cassandra

Web service

S3 bucket

Personalization movie group choosers (for US, Canada and Latam)

Each icon is three to a few hundred instances across three AWS zones

Page 37: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Three Balanced Availability ZonesTest with Chaos Gorilla

Cassandra and Evcache Replicas

Zone A

Cassandra and Evcache Replicas

Zone B

Cassandra and Evcache Replicas

Zone C

Load Balancers

Chaos Gorilla

Page 38: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Isolated Regions

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-East Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

EU-West Load Balancers

More?

Page 39: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Highly Available NoSQL Storage

A highly scalable, available and durable deployment pattern based

on Apache Cassandra

Page 40: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Single Function Micro-Service PatternOne keyspace, replaces a single table or materialized view

Single function Cassandra Cluster Managed by PriamBetween 6 and 144 nodes

Stateless Data Access REST ServiceAstyanax Cassandra Client

OptionalDatacenterUpdate Flow

Many Different Single-Function REST Clients

Appdynamics Service Flow Visualization

Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones

Over 50 Cassandra clustersOver 1000 nodesOver 30TB backupOver 1M writes/s/cluster

Page 41: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Stateless Micro-Service Architecture

Linux Base AMI (CentOS or Ubuntu)

Optional Apache

frontend, memcached, non-java apps

MonitoringLog rotation

to S3 AppDynamics machineagent

Epic/Atlas

Java (JDK 6 or 7)AppDynamics

appagentmonitoring

GC and thread dump logging

TomcatApplication war file, base servlet, platform, client interface jars, Astyanax

Healthcheck, status servlets, JMX interface,

Servo autoscale

Page 42: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cassandra Instance Architecture

Linux Base AMI (CentOS or Ubuntu)

Tomcat and Priam on JDKHealthcheck,

Status

MonitoringAppDynamics machineagent

Epic/Atlas

Java (JDK 7)AppDynamics

appagentmonitoring

GC and thread dump logging

Cassandra ServerLocal Ephemeral Disk Space – 2TB of SSD or 1.6TB disk

holding Commit log and SSTables

Page 43: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Apache Cassandra

• Scalable and Stable in large deployments– No additional license cost for large scale!– Optimized for “OLTP” vs. Hbase optimized for “DSS”

• Available during Partition (AP from CAP)– Hinted handoff repairs most transient issues– Read-repair and periodic repair keep it clean

• Quorum and Client Generated Timestamp– Read after write consistency with 2 of 3 copies– Latest version includes Paxos for stronger transactions

Page 44: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Astyanax Cassandra Client for JavaAvailable at http://github.com/netflix

• Features– Abstraction of connection pool from RPC protocol– Fluent Style API– Operation retry with backoff– Token aware– Batch manager– Many useful recipes– New: Entity Mapper based on JPA annotations

Page 45: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

C* Astyanax Recipes

• Distributed row lock (without needing zookeeper)• Multi-region row lock• Uniqueness constraint• Multi-row uniqueness constraint• Chunked and multi-threaded large file storage• Reverse index search• All rows query• Durable message queue• Contributed: High cardinality reverse index

Page 46: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Astyanax - Cassandra Write Data FlowsSingle Region, Multiple Availability Zone, Token Aware

Token Aware Clients

Cassandra•Disks•Zone A

Cassandra•Disks•Zone B

Cassandra•Disks•Zone C

Cassandra•Disks•Zone A

Cassandra•Disks•Zone B

Cassandra•Disks•Zone C

1. Client Writes to local coordinator

2. Coodinator writes to other zones

3. Nodes return ack4. Data written to

internal commit log disks (no more than 10 seconds later)

If a node goes offline, hinted handoff completes the write when the node comes back up.

Requests can choose to wait for one node, a quorum, or all nodes to ack the write

SSTable disk writes and compactions occur asynchronously

14

4

42

3

33

2

Page 47: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Data Flows for Multi-Region WritesToken Aware, Consistency Level = Local Quorum

US Clients

Cassandra• Disks• Zone A

Cassandra• Disks• Zone B

Cassandra• Disks• Zone C

Cassandra• Disks• Zone A

Cassandra• Disks• Zone B

Cassandra• Disks• Zone C

1. Client writes to local replicas2. Local write acks returned to

Client which continues when 2 of 3 local nodes are committed

3. Local coordinator writes to remote coordinator.

4. When data arrives, remote coordinator node acks and copies to other remote zones

5. Remote nodes ack to local coordinator

6. Data flushed to internal commit log disks (no more than 10 seconds later)

If a node or region goes offline, hinted handoff completes the write when the node comes back up.Nightly global compare and repair jobs ensure everything stays consistent.

EU Clients

Cassandra• Disks• Zone A

Cassandra• Disks• Zone B

Cassandra• Disks• Zone C

Cassandra• Disks• Zone A

Cassandra• Disks• Zone B

Cassandra• Disks• Zone C

6

5

5

6 64

44

16

6

62

2

23

100+ms latency

Page 48: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cassandra at Scale

Benchmarking to Retire Risk

More?

Page 49: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Scalability from 48 to 288 nodes on AWShttp://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

0 50 100 150 200 250 300 3500

200000

400000

600000

800000

1000000

1200000

174373

366828

537172

1099837

Client Writes/s by node count – Replication Factor = 3

Used 288 of m1.xlarge4 CPU, 15 GB RAM, 8 ECUCassandra 0.86Benchmark config only existed for about 1hr

2011

Page 50: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cassandra Disk vs. SSD BenchmarkSame Throughput, Lower Latency, Half Cost

http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html

Cassandra

Memcached

Application

Load GenerationLoad Test

Driver

REST service

36x m2.xlarge EVcache

48x m2.4xlarge Cassandra

REST service

15x hi1.4xlarge Cassandra

2012

Page 51: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

2013 - Cross Region Use Cases

• Geographic Isolation– US to Europe replication of subscriber data– Read intensive, low update rate– Production use since late 2011

• Redundancy for regional failover– US East to US West replication of everything– Includes write intensive data, high update rate– Testing now

Page 52: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Benchmarking Global CassandraWrite intensive test of cross region replication capacity

16 x hi1.4xlarge SSD nodes per zone = 96 total192 TB of SSD in six locations up and running Cassandra in 20 minutes

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-West-2 Region - Oregon

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

US-East-1 Region - Virginia

Test Load

Test Load

Validation Load

Inter-Zone Traffic

1 Million writesCL.ONE (wait for one replica to ack)

1 Million readsAfter 500msCL.ONE with noData loss

Inter-Region TrafficUp to 9Gbits/s, 83ms 18TB

backups from S3

Page 53: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Copying 18TB from East to WestCassandra bootstrap 9.3 Gbit/s single threaded 48 nodes to 48 nodes

Thanks to boundary.com for these network analysis plots

Page 54: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Inter Region Traffic TestVerified at desired capacity, no problems, 339 MB/s, 83ms latency

Page 55: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Ramp Up Load Until It Breaks!Unmodified tuning, dropping client data at 1.93GB/s inter region trafficSpare CPU, IOPS, Network, just need some Cassandra tuning for more

Page 56: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Managing Multi-Region Availability

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

Cassandra Replicas

Zone A

Cassandra Replicas

Zone B

Cassandra Replicas

Zone C

Regional Load Balancers

UltraDNSDynECT

DNS

AWS Route53

Denominator – manage traffic via multiple DNS providers with Java code2013 Timeline - Concept Jan, Code Feb, OSS March, Production use May

Denominator

Page 57: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Failure Modes and EffectsFailure Mode Probability Current Mitigation Plan

Application Failure High Automatic degraded response

AWS Region Failure Low Active-Active multi-region deployment

AWS Zone Failure Medium Continue to run on 2 out of 3 zones

Datacenter Failure Medium Migrate more functions to cloud

Data store failure Low Restore from S3 backups

S3 failure Low Restore from remote archive

Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…

Page 58: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Application Resilience

Run what you wroteRapid detectionRapid Response

Page 59: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Chaos Monkeyhttp://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

• Computers (Datacenter or AWS) randomly die– Fact of life, but too infrequent to test resiliency

• Test to make sure systems are resilient– Kill individual instances without customer impact

• Latency Monkey (coming soon)– Inject extra latency and error return codes

Page 60: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Monkeys

Edda – Configuration Historyhttp://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html

Edda

AWS Instances, ASGs, etc.

Eureka Services

metadata

AppDynamics Request flow

Page 61: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Edda Query ExamplesFind any instances that have ever had a specific public IP address$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"["i-0123456789","i-012345678a","i-012345678b”]

Show the most recent change to a security group$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504@@ -1,33 +1,33 @@ {… "ipRanges" : [ "10.10.1.1/32", "10.10.1.2/32",+ "10.10.1.3/32",- "10.10.1.4/32"… }

Page 62: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud Native Big Data

Size the cluster to the dataSize the cluster to the questionsNever wait for space or answers

Page 63: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Netflix Dataoven

Data WarehouseOver 2 Petabytes

Ursula

Aegisthus

Data Pipelines

From cloud Services~100 BillionEvents/day

From C*Terabytes ofDimensiondata

Hadoop Clusters – AWS EMR

1300 nodes 800 nodes Multiple 150 nodes Nightly

RDS

Metadata

Gateways

Tools

More?

Page 64: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud Native Development Patterns

Master copies of data are cloud residentDynamically provisioned micro-servicesServices are distributed and ephemeral

Page 65: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Datacenter to Cloud Transition Goals

• Faster– Lower latency than the equivalent datacenter web pages and API calls– Measured as mean and 99th percentile– For both first hit (e.g. home page) and in-session hits for the same user

• Scalable– Avoid needing any more datacenter capacity as subscriber count increases– No central vertically scaled databases– Leverage AWS elastic capacity effectively

• Available– Substantially higher robustness and availability than datacenter services– Leverage multiple AWS availability zones– No scheduled down time, no central database schema to change

• Productive– Optimize agility of a large development team with automation and tools– Leave behind complex tangled datacenter code base (~8 year old architecture)– Enforce clean layered interfaces and re-usable components

Page 66: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Datacenter Anti-Patterns

What do we currently do in the datacenter that prevents us from

meeting our goals?

Page 67: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Rewrite from Scratch

Not everything is cloud specificPay down technical debt

Robust patterns

Page 68: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Netflix Datacenter vs. Cloud Arch

Central SQL Database Distributed Key/Value NoSQL

Sticky In-Memory Session Shared Memcached Session

Chatty Protocols Latency Tolerant Protocols

Tangled Service Interfaces Layered Service Interfaces

Instrumented Code Instrumented Service Patterns

Fat Complex Objects Lightweight Serializable Objects

Components as Jar Files Components as Services

Anti-Architecture

More?

Page 69: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud Security

Fine grain security rather than perimeterLeveraging AWS Scale to resist DDOS attacks

Automated attack surface monitoring and testinghttp://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned

Page 70: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Security Architecture

• Instance Level Security baked into base AMI– Login: ssh only allowed via portal (not between instances)– Each app type runs as its own userid app{test|prod}

• AWS Security, Identity and Access Management– Each app has its own security group (firewall ports)– Fine grain user roles and resource ACLs

• Key Management– AWS Keys dynamically provisioned, easy updates– High grade app specific key management using HSM

More?

Page 71: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

AWS Accounts

Page 72: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Accounts Isolate Concerns• paastest – for development and testing

– Fully functional deployment of all services– Developer tagged “stacks” for separation

• paasprod – for production– Autoscale groups only, isolated instances are terminated– Alert routing, backups enabled by default

• paasaudit – for sensitive services– To support SOX, PCI, etc.– Extra access controls, auditing

• paasarchive – for disaster recovery– Long term archive of backups– Different region, perhaps different vendor

Page 73: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud Access Control

www-prod

• Userid wwwprod

Dal-prod

• Userid dalprod

Cass-prod

• Userid cassprod

Cloud Access audit log ssh/sudo Gateway

Security groups don’t allowssh between instances

developers

Page 74: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Our perspiration…A Cloud Native Open Source Platform

See netflix.github.com

Page 75: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Example Application – RSS Reader

ZUUL

Zuul TrafficProcessing and Routing

Page 76: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Zuul Architecturehttp://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html

Page 77: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Ice – AWS Usage Trackinghttp://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html

Page 78: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

GithubNetflixOSS

Source

AWSBase AMI

MavenCentral

Cloudbees Jenkins

AminatorBakery

DynaslaveAWS Build

Slaves

Asgard(+ Frigga)Console

AWSBaked AMIs

GlistenWorkflow DSL

AWS Account

NetflixOSS Continuous Build and Deployment

More?

Page 79: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

AWS AccountAsgard Console

Archaius Config Service

Cross region Priam C*

PytheasDashboards

AtlasMonitoring

Genie, LipstickHadoop Services

Ice – AWS UsageCost Monitoring

Multiple AWS RegionsEureka Registry

ExhibitorZookeeper

Edda History

Simian Army

Zuul Traffic Mgr

3 AWS ZonesApplication Clusters

Autoscale GroupsInstances

PriamCassandra

Persistent Storage

EvcacheMemcached

Ephemeral Storage

NetflixOSS Services Scope

More?

Page 80: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

• Baked AMI – Tomcat, Apache, your code• Governator – Guice based dependency injection• Archaius – dynamic configuration properties client• Eureka - service registration client

Initialization

• Karyon - Base Server for inbound requests• RxJava – Reactive pattern• Hystrix/Turbine – dependencies and real-time status• Ribbon and Feign - REST Clients for outbound calls

Service Requests

• Astyanax – Cassandra client and pattern library• Evcache – Zone aware Memcached client• Curator – Zookeeper patterns• Denominator – DNS routing abstraction

Data Access

• Blitz4j – non-blocking logging• Servo – metrics export for autoscaling• Atlas – high volume instrumentation

Logging

NetflixOSS Instance Libraries

More?

Page 81: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

• CassJmeter – Load testing for Cassandra• Circus Monkey – Test account reservation rebalancingTest Tools

• Janitor Monkey – Cleans up unused resources• Efficiency Monkey• Doctor Monkey• Howler Monkey – Complains about AWS limits

Maintenance

• Chaos Monkey – Kills Instances• Chaos Gorilla – Kills Availability Zones• Chaos Kong – Kills Regions• Latency Monkey – Latency and error injection

Availability

• Conformity Monkey – architectural pattern warnings• Security Monkey – security group and S3 bucket permissionsSecurity

NetflixOSS Testing and Automation

More?

Page 82: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Vendor Driven PortabilityInterest in using NetflixOSS for Enterprise Private Clouds

“It’s done when it runs Asgard”Functionally completeDemonstrated MarchReleased June in V3.3

Vendor and end user interestOpenstack “Heat” getting therePaypal C3 Console based on Asgard

IBM Example application “Acme Air”Based on NetflixOSS running on AWSPorted to IBM Softlayer with Rightscale

Page 83: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cost-AwareCloud Architectures

Jinesh Varia@jinmanTechnology Evangelist

Adrian Cockcroft@adrianco

Director, Architecture

Page 84: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Cloud Economics – Agile ROIGet a faster Return by Speeding up Investment

Observe

Orient

Decide

Act

Rapid innovation by speeding up theOODA loopTry, fail, try again, succeed

Page 85: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

$3K

$500

$75

$2K$100$33

$12

$692

Experiment Often & Adapt Quickly

$1K

$500$234$96

• Cost of failure falls dramatically• Return on (small incremental)

Investments is high• More risk taking, more innovation• More iteration, faster innovation

Page 86: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

« Want to increase innovation? Lower the cost of failure »

Joi Ito

Page 87: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Accelerate building a new line of business

Market Replay

Page 88: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Go Global in Minutes

Page 89: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Netflix Examples

• European Launch using AWS Ireland– No employees in Ireland, no provisioning delay, everything worked– No need to do detailed capacity planning– Over-provisioned on day 1, shrunk to fit after a few days– Capacity grows as needed for additional country launches

• Brazilian Proxy Experiment– No employees in Brazil, no “meetings with IT”– Deployed instances into two zones in AWS Brazil– Experimented with network proxy optimization– Decided that gain wasn’t enough, shut everything down

Page 90: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Product Launch Agility - Rightsized

Pre-Launch Build-out Testing Launch Growth Growth

DemandCloudDatacenter

$

Page 91: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Product Launch - Under-estimated

Pre-Launch Build-out

TestingLaunch

GrowthGrowth

Page 92: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Product Launch Agility – Over-estimated

Pre-Launch Build-out

TestingLaunch

GrowthGrowth

$

Page 93: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Return on Agility (Agile ROI) = More Revenue

Page 94: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

#1 Business Agility by Rapid Experimentation = Increased Revenue

Key Takeaways on Cost-Aware Architectures….

Page 95: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

When you turn off your cloud resources, you actually stop paying for them

Page 96: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

1 5 9 13 17 21 25 29 33 37 41 45 49

Week

Web

Ser

vers

Optimize during a year

50% SavingsWeekly CPU Load

Page 97: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Busi

ness

Thr

ough

put

Inst

ance

s

Page 98: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

50%+ Cost SavingScale up/down

by 70%+

Move to Load-Based Scaling

Page 99: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Pay as you go

Page 100: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

AWS Support – Trusted Advisor – Your personal cloud assistant

Page 101: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Other simple optimization tips

• Don’t forget to…– Disassociate unused EIPs– Delete unassociated Amazon

EBS volumes– Delete older Amazon EBS

snapshots– Leverage Amazon S3 Object

Expiration

Janitor Monkey cleans up unused resources

Page 102: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

#1 Business Agility by Rapid Experimentation = Increased Revenue

#2 Business-driven Auto Scaling Architectures = Savings

Building Cost-Aware Cloud Architectures

Page 103: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

When Comparing TCO…

Page 104: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

When Comparing TCO…

Make sure that you are including all the cost factors into consideration

PlacePowerPipesPeoplePatterns

Page 105: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Save more when you reserve

On-demandInstances

• Pay as you go

• Starts from $0.02/Hour

ReservedInstances

• One time low upfront fee + Pay as you go

• $23 for 1 year term and $0.01/Hour

1-year and 3-year terms

Light Utilization RI

Medium Utilization RI

Heavy Utilization RI

Page 106: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Utilization (Uptime)

Ideal For Savings over On-Demand

10% - 40%(>3.5 < 5.5 months/year)

Disaster Recovery(Lowest Upfront) 56%

40% - 75%(>5.5 < 7 months/year)

Standard Reserved Capacity 66%

>75%(>7 months/year)

Baseline Servers(Lowest Total Cost) 71%

Break-even point

ReservedInstances

• One time low upfront fee + Pay as you go

• $23 for 1 year term and $0.01/Hour

1-year and 3-year terms

Light Utilization RI

Medium Utilization RI

Heavy Utilization RI

Page 107: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Mix and Match Reserved Types and On-DemandIn

stan

ces

Days of Month

0

2

4

6

8

10

12

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

Heavy Utilization Reserved Instances

Light RI Light RILight RILight RI

On-Demand

Page 108: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Netflix Concept for Regional Failover Capacity

West Coast

Light Reservations

Heavy Reservations

East Coast

Light Reservations

Heavy Reservations

NormalUse

Failover Use

Page 109: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

#1 Business Agility by Rapid Experimentation = Increased Revenue

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

Building Cost-Aware Cloud Architectures

Page 110: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Variety of Applications and Environments

Production Fleet

Dev FleetTest FleetStaging/QAPerf FleetDR Site

Every Application has…. Every Company has….

Business App Fleet

Marketing SiteIntranet SiteBI AppMultiple Products Analytics

Page 111: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Consolidated Billing: Single payer for a group of accounts

• One Bill for multiple accounts

• Easy Tracking of account charges (e.g., download CSV of cost data)

• Volume Discounts can be reached faster with combined usage

• Reserved Instances are shared across accounts (including RDS Reserved DBs)

Page 112: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Over-Reserve the Production Environment

Production Env.Account 100 Reserved

QA/Staging Env. Account 0 Reserved

Perf Testing Env.Account 0 Reserved

Development Env.Account 0 Reserved

Storage Account 0 Reserved

Total Capacity

Page 113: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Consolidated Billing Borrows Unused Reservations

Production Env.Account 68 Used

QA/Staging Env. Account 10 Borrowed

Perf Testing Env.Account 6 Borrowed

Development Env.Account 12 Borrowed

Storage Account 4 Borrowed

Total Capacity

Page 114: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Consolidated Billing Advantages

• Production account is guaranteed to get burst capacity– Reservation is higher than normal usage level– Requests for more capacity always work up to reserved

limit– Higher availability for handling unexpected peak demands

• No additional cost– Other lower priority accounts soak up unused reservations– Totals roll up in the monthly billing cycle

Page 115: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

#1 Business Agility by Rapid Experimentation = Increased Revenue

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

#4 Consolidated Billing and Shared Reservations = Savings

Building Cost-Aware Cloud Architectures

Page 116: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Continuous optimization in your architecture results in

recurring savings as early as your next month’s bill

Page 117: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Right-size your cloud: Use only what you need

• An instance type for every purpose

• Assess your memory & CPU requirements– Fit your

application to the resource

– Fit the resource to your application

• Only use a larger instance when needed

Page 118: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Reserved Instance Marketplace

Buy a smaller term instanceBuy instance with different OS or type

Buy a Reserved instance in different region

Sell your unused Reserved InstanceSell unwanted or over-bought capacityFurther reduce costs by optimizing

Page 119: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Instance Type Optimization

Older m1 and m2 families• Slower CPUs• Higher response times• Smaller caches (6MB)• Oldest m1.xl 15GB/8ECU/48c• Old m2.xl 17GB/6.5ECU/41c• ~16 ECU/$/hr

Latest m3 family• Faster CPUs• Lower response times• Bigger caches (20MB)• Even faster for Java vs. ECU• New m3.xl 15GB/13 ECU/50c• 26 ECU/$/hr – 62% better!• Java measured even higher• Deploy fewer instances

Page 120: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

#1 Business Agility by Rapid Experimentation = Increased Revenue

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

#4 Consolidated Billing and Shared Reservations = Savings

#5 Always-on Instance Type Optimization = Recurring Savings

Building Cost-Aware Cloud Architectures

Page 121: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Follow the Customer (Run web servers) during the day

Follow the Money (Run Hadoop clusters) at night

0

2

4

6

8

10

12

14

16

Mon Tue Wed Thur Fri Sat Sun

No

of In

stan

ces

Runn

ing

Week

Auto Scaling Servers

Hadoop Servers

No. of ReservedInstances

Page 122: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques
Page 123: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Unused Reservations

Calculator

Web Application

Fleet

Total Instances Reserved

Table

Hadoop Fleet

4 AZ-mappings14 Types

Total InstancesRunning now = 100

Total unused Reservations available = 40 in 2 AZs(5 min interval)

Launch 40

Page 124: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Soaking up unused reservations

Unused reserved instances is published as a metric

Netflix Data Science ETL Workload• Daily business metrics roll-up• Starts after midnight• EMR clusters started using hundreds of instances

Netflix Movie Encoding Workload• Long queue of high and low priority encoding jobs• Can soak up 1000’s of additional unused instances

Page 125: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

#1 Business Agility by Rapid Experimentation = Increased Revenue

#2 Business-driven Auto Scaling Architectures = Savings

#3 Mix and Match Reserved Instances with On-Demand = Savings

#4 Consolidated Billing and Shared Reservations = Savings

#5 Always-on Instance Type Optimization = Recurring Savings

Building Cost-Aware Cloud Architectures

#6 Follow the Customer (Run web servers) during the day Follow the Money (Run Hadoop clusters) at night

Page 126: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Thank you!

Jinesh Varia and Adrian [email protected] @jinman

acockcroft @netf lix.com @adrianco

Page 127: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Slideshare.net/Netflix Details• Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, Paypal, Riot Games

– http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html

• Lightning Talks March S1E2– http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-

roadmap

• Lightning Talks Feb S1E1– http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks

• Asgard In Depth Feb S1E1– http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house

• Security Architecture– http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/

Page 128: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques

Takeaways

Cloud Native Manages Scale and Complexity at Speed

NetflixOSS makes it easier for everyone to become Cloud Native

Rethink deployments and turn things off to save money!

http://netflix.github.comhttp://techblog.netflix.comhttp://slideshare.net/Netflix

http://www.linkedin.com/in/adriancockcroft

@adrianco #netflixcloud @NetflixOSS