cmg2013 workshop: netflix cloud native, capacity, performance and cost optimization techniques
DESCRIPTION
For the Computer Measurement Group workshop in San Diego November 2013. Also presented to a student class at UC Santa Barbara. What is Cloud Native. Capacity and Performance benchmarks. Cost Optimization Techniques - content co-developed with Jinesh Varia of AWS.TRANSCRIPT
![Page 1: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/1.jpg)
Cloud Native, Capacity, Performance and Cost Optimization Tools and
Techniques
CMG WorkshopNovember 2013Adrian Cockcroft@adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
![Page 2: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/2.jpg)
Presentation vs. Workshop
• Presentation– Short duration, focused subject– One presenter to many anonymous audience– A few questions at the end
• Workshop– Time to explore in and around the subject– Tutor gets to know the audience– Discussion, rat-holes, “bring out your dead”
![Page 3: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/3.jpg)
Attendee Introductions
• Who are you, where do you work• Why are you here today, what do you need• “Bring out your dead”
– Do you have a specific problem or question?– One sentence elevator pitch
• What instrument do you play?
![Page 4: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/4.jpg)
Content
Cloud Native
Migration Path
Service and API Architectures
Storage Architecture
Operations and Tools
Cost OptimizationMore?
![Page 5: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/5.jpg)
Cloud Native
What is it?Why?
![Page 6: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/6.jpg)
Strive for perfection
Perfect codePerfect hardware
Perfectly operated
![Page 7: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/7.jpg)
But perfection takes too long…
Compromises…Time to market vs. Quality
Utopia remains out of reach
![Page 8: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/8.jpg)
Where time to market wins big
Making a land-grabDisrupting competitors (OODA)
Anything delivered as web services
![Page 9: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/9.jpg)
Observe
Orient
Decide
Act
Land grab opportunity Competitive
move
Customer Pain Point
Analysis
Get buy-in
Plan response
Commit resources
Implement
Deliver
Engage customers
Model alternatives
BIG DATA
INNOVATION
CULTURE
CLOUD
Measure customers
Colonel Boyd, USAF
“Get inside your adversaries'
OODA loop to disorient them”
![Page 10: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/10.jpg)
How Soon?
Product features in days instead of monthsDeployment in minutes instead of weeks
Incident response in seconds instead of hours
![Page 11: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/11.jpg)
Cloud NativeA new engineering challenge
Construct a highly agile and highly available service from ephemeral and
assumed broken components
![Page 12: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/12.jpg)
Inspiration
![Page 13: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/13.jpg)
How to get to Cloud Native
Freedom and Responsibility for DevelopersDecentralize and Automate Ops Activities
Integrate DevOps into the Business Organization
Re-Org!
![Page 14: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/14.jpg)
Four Transitions
• Management: Integrated Roles in a Single Organization– Business, Development, Operations -> BusDevOps
• Developers: Denormalized Data – NoSQL– Decentralized, scalable, available, polyglot
• Responsibility from Ops to Dev: Continuous Delivery– Decentralized small daily production updates
• Responsibility from Ops to Dev: Agile Infrastructure - Cloud– Hardware in minutes, provisioned directly by developers
![Page 15: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/15.jpg)
Netflix BusDevOps OrganizationChief Product
Officer
VP Product Management
Directors Product
VP UI Engineering
Directors Development
Developers + DevOps
UI Data Sources
AWS
VP Discovery Engineering
Directors Development
Developers + DevOps
Discovery Data Sources
AWS
VP Platform
Directors Platform
Developers + DevOps
Platform Data Sources
AWS
Denormalized, independently updated and scaled data
Cloud, self service updated & scaled infrastructure
Code, independently updated continuous delivery
![Page 16: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/16.jpg)
Decentralized Deployment
![Page 17: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/17.jpg)
Asgard Developer Portalhttp://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
![Page 18: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/18.jpg)
Ephemeral Instances
• Largest services are autoscaled• Average lifetime of an instance is 36 hours
Push
Autoscale UpAutoscale Down
![Page 19: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/19.jpg)
Netflix Member Web Site Home PagePersonalization Driven – How Does It Work?
![Page 20: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/20.jpg)
How Netflix Used to Work
Customer Device (PC, PS3, TV…)
Monolithic Web App
Oracle
MySQL
Monolithic Streaming App
Oracle
MySQL
Limelight/Level 3 Akamai CDNs
Content Management
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Datacenter
![Page 21: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/21.jpg)
How Netflix Streaming Works Today
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Datacenter
![Page 22: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/22.jpg)
The DIY Question
Why doesn’t Netflix build and run its own cloud?
![Page 23: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/23.jpg)
Fitting Into Public Scale
Public Grey Area Private
1,000 Instances 100,000 Instances
Netflix FacebookStartups
![Page 24: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/24.jpg)
How big is Public?
AWS upper bound estimate based on the number of public IP AddressesEvery provisioned instance gets a public IP by default (some VPC don’t)
AWS Maximum Possible Instance Count 4.2 Million – May 2013Growth >10x in Three Years, >2x Per Annum - http://bit.ly/awsiprange
![Page 25: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/25.jpg)
The Alternative Supplier Question
What if there is no clear leader for a feature, or AWS doesn’t have what
we need?
![Page 26: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/26.jpg)
Things We Don’t Use AWS For
SaaS Applications – Pagerduty, AppdynamicsContent Delivery Service
DNS Service
![Page 27: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/27.jpg)
Nov2012StreamingBandwidth
March2013
MeanBandwidth+39% 6mo
![Page 28: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/28.jpg)
CDN Scale
AWS CloudFrontAkamai
LimelightLevel 3
Netflix Openconnect
YouTube
Gigabits Terabits
NetflixFacebookStartups
![Page 29: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/29.jpg)
Content Delivery ServiceOpen Source Hardware Design + FreeBSD, bird, nginx
see openconnect.netflix.com
![Page 30: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/30.jpg)
DNS Service
AWS Route53 is missing too many features (for now)Multiple vendor strategy Dyn, Ultra, Route53
Abstracted (broken) DNS APIs with Denominator
![Page 31: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/31.jpg)
What Changed?
Get out of the way of innovationBest of breed, by the hour
Choices based on scale
Cost reduction
Slow down developers
Less competitiveLess revenue
Lower margins
Process reduction
Speed up developers
More competitive
More revenue
Higher margins
![Page 32: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/32.jpg)
Availability Questions
Is it running yet?How many places is it running in?How far apart are those places?
![Page 33: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/33.jpg)
![Page 34: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/34.jpg)
Netflix Outages
• Running very fast with scissors– Mostly self inflicted – bugs, mistakes from pace of change– Some caused by AWS bugs and mistakes
• Incident Life-cycle Management by Platform Team– No runbooks, no operational changes by the SREs– Tools to identify what broke and call the right developer
• Next step is multi-region active/active– Investigating and building in stages during 2013– Could have prevented some of our 2012 outages
![Page 35: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/35.jpg)
Incidents – Impact and Mitigation
PRX Incidents
CSXX Incidents
Metrics impact – Feature disableXXX Incidents
No Impact – fast retry or automated failoverXXXX Incidents
Public Relations Media Impact
High Customer Service Calls
Affects AB Test Results
Y incidents mitigated by Active Active, game day practicing
YY incidents mitigated by
better tools and practices
YYY incidents mitigated by better
data tagging
![Page 36: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/36.jpg)
Real Web Server Dependencies Flow(Netflix Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group choosers (for US, Canada and Latam)
Each icon is three to a few hundred instances across three AWS zones
![Page 37: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/37.jpg)
Three Balanced Availability ZonesTest with Chaos Gorilla
Cassandra and Evcache Replicas
Zone A
Cassandra and Evcache Replicas
Zone B
Cassandra and Evcache Replicas
Zone C
Load Balancers
Chaos Gorilla
![Page 38: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/38.jpg)
Isolated Regions
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
More?
![Page 39: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/39.jpg)
Highly Available NoSQL Storage
A highly scalable, available and durable deployment pattern based
on Apache Cassandra
![Page 40: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/40.jpg)
Single Function Micro-Service PatternOne keyspace, replaces a single table or materialized view
Single function Cassandra Cluster Managed by PriamBetween 6 and 144 nodes
Stateless Data Access REST ServiceAstyanax Cassandra Client
OptionalDatacenterUpdate Flow
Many Different Single-Function REST Clients
Appdynamics Service Flow Visualization
Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones
Over 50 Cassandra clustersOver 1000 nodesOver 30TB backupOver 1M writes/s/cluster
![Page 41: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/41.jpg)
Stateless Micro-Service Architecture
Linux Base AMI (CentOS or Ubuntu)
Optional Apache
frontend, memcached, non-java apps
MonitoringLog rotation
to S3 AppDynamics machineagent
Epic/Atlas
Java (JDK 6 or 7)AppDynamics
appagentmonitoring
GC and thread dump logging
TomcatApplication war file, base servlet, platform, client interface jars, Astyanax
Healthcheck, status servlets, JMX interface,
Servo autoscale
![Page 42: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/42.jpg)
Cassandra Instance Architecture
Linux Base AMI (CentOS or Ubuntu)
Tomcat and Priam on JDKHealthcheck,
Status
MonitoringAppDynamics machineagent
Epic/Atlas
Java (JDK 7)AppDynamics
appagentmonitoring
GC and thread dump logging
Cassandra ServerLocal Ephemeral Disk Space – 2TB of SSD or 1.6TB disk
holding Commit log and SSTables
![Page 43: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/43.jpg)
Apache Cassandra
• Scalable and Stable in large deployments– No additional license cost for large scale!– Optimized for “OLTP” vs. Hbase optimized for “DSS”
• Available during Partition (AP from CAP)– Hinted handoff repairs most transient issues– Read-repair and periodic repair keep it clean
• Quorum and Client Generated Timestamp– Read after write consistency with 2 of 3 copies– Latest version includes Paxos for stronger transactions
![Page 44: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/44.jpg)
Astyanax Cassandra Client for JavaAvailable at http://github.com/netflix
• Features– Abstraction of connection pool from RPC protocol– Fluent Style API– Operation retry with backoff– Token aware– Batch manager– Many useful recipes– New: Entity Mapper based on JPA annotations
![Page 45: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/45.jpg)
C* Astyanax Recipes
• Distributed row lock (without needing zookeeper)• Multi-region row lock• Uniqueness constraint• Multi-row uniqueness constraint• Chunked and multi-threaded large file storage• Reverse index search• All rows query• Durable message queue• Contributed: High cardinality reverse index
![Page 46: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/46.jpg)
Astyanax - Cassandra Write Data FlowsSingle Region, Multiple Availability Zone, Token Aware
Token Aware Clients
Cassandra•Disks•Zone A
Cassandra•Disks•Zone B
Cassandra•Disks•Zone C
Cassandra•Disks•Zone A
Cassandra•Disks•Zone B
Cassandra•Disks•Zone C
1. Client Writes to local coordinator
2. Coodinator writes to other zones
3. Nodes return ack4. Data written to
internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write
SSTable disk writes and compactions occur asynchronously
14
4
42
3
33
2
![Page 47: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/47.jpg)
Data Flows for Multi-Region WritesToken Aware, Consistency Level = Local Quorum
US Clients
Cassandra• Disks• Zone A
Cassandra• Disks• Zone B
Cassandra• Disks• Zone C
Cassandra• Disks• Zone A
Cassandra• Disks• Zone B
Cassandra• Disks• Zone C
1. Client writes to local replicas2. Local write acks returned to
Client which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator.
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up.Nightly global compare and repair jobs ensure everything stays consistent.
EU Clients
Cassandra• Disks• Zone A
Cassandra• Disks• Zone B
Cassandra• Disks• Zone C
Cassandra• Disks• Zone A
Cassandra• Disks• Zone B
Cassandra• Disks• Zone C
6
5
5
6 64
44
16
6
62
2
23
100+ms latency
![Page 48: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/48.jpg)
Cassandra at Scale
Benchmarking to Retire Risk
More?
![Page 49: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/49.jpg)
Scalability from 48 to 288 nodes on AWShttp://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
0 50 100 150 200 250 300 3500
200000
400000
600000
800000
1000000
1200000
174373
366828
537172
1099837
Client Writes/s by node count – Replication Factor = 3
Used 288 of m1.xlarge4 CPU, 15 GB RAM, 8 ECUCassandra 0.86Benchmark config only existed for about 1hr
2011
![Page 50: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/50.jpg)
Cassandra Disk vs. SSD BenchmarkSame Throughput, Lower Latency, Half Cost
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
Cassandra
Memcached
Application
Load GenerationLoad Test
Driver
REST service
36x m2.xlarge EVcache
48x m2.4xlarge Cassandra
REST service
15x hi1.4xlarge Cassandra
2012
![Page 51: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/51.jpg)
2013 - Cross Region Use Cases
• Geographic Isolation– US to Europe replication of subscriber data– Read intensive, low update rate– Production use since late 2011
• Redundancy for regional failover– US East to US West replication of everything– Includes write intensive data, high update rate– Testing now
![Page 52: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/52.jpg)
Benchmarking Global CassandraWrite intensive test of cross region replication capacity
16 x hi1.4xlarge SSD nodes per zone = 96 total192 TB of SSD in six locations up and running Cassandra in 20 minutes
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-West-2 Region - Oregon
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East-1 Region - Virginia
Test Load
Test Load
Validation Load
Inter-Zone Traffic
1 Million writesCL.ONE (wait for one replica to ack)
1 Million readsAfter 500msCL.ONE with noData loss
Inter-Region TrafficUp to 9Gbits/s, 83ms 18TB
backups from S3
![Page 53: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/53.jpg)
Copying 18TB from East to WestCassandra bootstrap 9.3 Gbit/s single threaded 48 nodes to 48 nodes
Thanks to boundary.com for these network analysis plots
![Page 54: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/54.jpg)
Inter Region Traffic TestVerified at desired capacity, no problems, 339 MB/s, 83ms latency
![Page 55: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/55.jpg)
Ramp Up Load Until It Breaks!Unmodified tuning, dropping client data at 1.93GB/s inter region trafficSpare CPU, IOPS, Network, just need some Cassandra tuning for more
![Page 56: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/56.jpg)
Managing Multi-Region Availability
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
UltraDNSDynECT
DNS
AWS Route53
Denominator – manage traffic via multiple DNS providers with Java code2013 Timeline - Concept Jan, Code Feb, OSS March, Production use May
Denominator
![Page 57: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/57.jpg)
Failure Modes and EffectsFailure Mode Probability Current Mitigation Plan
Application Failure High Automatic degraded response
AWS Region Failure Low Active-Active multi-region deployment
AWS Zone Failure Medium Continue to run on 2 out of 3 zones
Datacenter Failure Medium Migrate more functions to cloud
Data store failure Low Restore from S3 backups
S3 failure Low Restore from remote archive
Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…
![Page 58: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/58.jpg)
Application Resilience
Run what you wroteRapid detectionRapid Response
![Page 59: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/59.jpg)
Chaos Monkeyhttp://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
• Computers (Datacenter or AWS) randomly die– Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient– Kill individual instances without customer impact
• Latency Monkey (coming soon)– Inject extra latency and error return codes
![Page 60: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/60.jpg)
Monkeys
Edda – Configuration Historyhttp://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
Edda
AWS Instances, ASGs, etc.
Eureka Services
metadata
AppDynamics Request flow
![Page 61: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/61.jpg)
Edda Query ExamplesFind any instances that have ever had a specific public IP address$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"["i-0123456789","i-012345678a","i-012345678b”]
Show the most recent change to a security group$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504@@ -1,33 +1,33 @@ {… "ipRanges" : [ "10.10.1.1/32", "10.10.1.2/32",+ "10.10.1.3/32",- "10.10.1.4/32"… }
![Page 62: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/62.jpg)
Cloud Native Big Data
Size the cluster to the dataSize the cluster to the questionsNever wait for space or answers
![Page 63: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/63.jpg)
Netflix Dataoven
Data WarehouseOver 2 Petabytes
Ursula
Aegisthus
Data Pipelines
From cloud Services~100 BillionEvents/day
From C*Terabytes ofDimensiondata
Hadoop Clusters – AWS EMR
1300 nodes 800 nodes Multiple 150 nodes Nightly
RDS
Metadata
Gateways
Tools
More?
![Page 64: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/64.jpg)
Cloud Native Development Patterns
Master copies of data are cloud residentDynamically provisioned micro-servicesServices are distributed and ephemeral
![Page 65: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/65.jpg)
Datacenter to Cloud Transition Goals
• Faster– Lower latency than the equivalent datacenter web pages and API calls– Measured as mean and 99th percentile– For both first hit (e.g. home page) and in-session hits for the same user
• Scalable– Avoid needing any more datacenter capacity as subscriber count increases– No central vertically scaled databases– Leverage AWS elastic capacity effectively
• Available– Substantially higher robustness and availability than datacenter services– Leverage multiple AWS availability zones– No scheduled down time, no central database schema to change
• Productive– Optimize agility of a large development team with automation and tools– Leave behind complex tangled datacenter code base (~8 year old architecture)– Enforce clean layered interfaces and re-usable components
![Page 66: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/66.jpg)
Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from
meeting our goals?
![Page 67: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/67.jpg)
Rewrite from Scratch
Not everything is cloud specificPay down technical debt
Robust patterns
![Page 68: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/68.jpg)
Netflix Datacenter vs. Cloud Arch
Central SQL Database Distributed Key/Value NoSQL
Sticky In-Memory Session Shared Memcached Session
Chatty Protocols Latency Tolerant Protocols
Tangled Service Interfaces Layered Service Interfaces
Instrumented Code Instrumented Service Patterns
Fat Complex Objects Lightweight Serializable Objects
Components as Jar Files Components as Services
Anti-Architecture
More?
![Page 69: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/69.jpg)
Cloud Security
Fine grain security rather than perimeterLeveraging AWS Scale to resist DDOS attacks
Automated attack surface monitoring and testinghttp://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
![Page 70: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/70.jpg)
Security Architecture
• Instance Level Security baked into base AMI– Login: ssh only allowed via portal (not between instances)– Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management– Each app has its own security group (firewall ports)– Fine grain user roles and resource ACLs
• Key Management– AWS Keys dynamically provisioned, easy updates– High grade app specific key management using HSM
More?
![Page 71: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/71.jpg)
AWS Accounts
![Page 72: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/72.jpg)
Accounts Isolate Concerns• paastest – for development and testing
– Fully functional deployment of all services– Developer tagged “stacks” for separation
• paasprod – for production– Autoscale groups only, isolated instances are terminated– Alert routing, backups enabled by default
• paasaudit – for sensitive services– To support SOX, PCI, etc.– Extra access controls, auditing
• paasarchive – for disaster recovery– Long term archive of backups– Different region, perhaps different vendor
![Page 73: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/73.jpg)
Cloud Access Control
www-prod
• Userid wwwprod
Dal-prod
• Userid dalprod
Cass-prod
• Userid cassprod
Cloud Access audit log ssh/sudo Gateway
Security groups don’t allowssh between instances
developers
![Page 74: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/74.jpg)
Our perspiration…A Cloud Native Open Source Platform
See netflix.github.com
![Page 75: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/75.jpg)
Example Application – RSS Reader
ZUUL
Zuul TrafficProcessing and Routing
![Page 76: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/76.jpg)
Zuul Architecturehttp://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html
![Page 77: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/77.jpg)
Ice – AWS Usage Trackinghttp://techblog.netflix.com/2013/06/announcing-ice-cloud-spend-and-usage.html
![Page 78: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/78.jpg)
GithubNetflixOSS
Source
AWSBase AMI
MavenCentral
Cloudbees Jenkins
AminatorBakery
DynaslaveAWS Build
Slaves
Asgard(+ Frigga)Console
AWSBaked AMIs
GlistenWorkflow DSL
AWS Account
NetflixOSS Continuous Build and Deployment
More?
![Page 79: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/79.jpg)
AWS AccountAsgard Console
Archaius Config Service
Cross region Priam C*
PytheasDashboards
AtlasMonitoring
Genie, LipstickHadoop Services
Ice – AWS UsageCost Monitoring
Multiple AWS RegionsEureka Registry
ExhibitorZookeeper
Edda History
Simian Army
Zuul Traffic Mgr
3 AWS ZonesApplication Clusters
Autoscale GroupsInstances
PriamCassandra
Persistent Storage
EvcacheMemcached
Ephemeral Storage
NetflixOSS Services Scope
More?
![Page 80: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/80.jpg)
• Baked AMI – Tomcat, Apache, your code• Governator – Guice based dependency injection• Archaius – dynamic configuration properties client• Eureka - service registration client
Initialization
• Karyon - Base Server for inbound requests• RxJava – Reactive pattern• Hystrix/Turbine – dependencies and real-time status• Ribbon and Feign - REST Clients for outbound calls
Service Requests
• Astyanax – Cassandra client and pattern library• Evcache – Zone aware Memcached client• Curator – Zookeeper patterns• Denominator – DNS routing abstraction
Data Access
• Blitz4j – non-blocking logging• Servo – metrics export for autoscaling• Atlas – high volume instrumentation
Logging
NetflixOSS Instance Libraries
More?
![Page 81: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/81.jpg)
• CassJmeter – Load testing for Cassandra• Circus Monkey – Test account reservation rebalancingTest Tools
• Janitor Monkey – Cleans up unused resources• Efficiency Monkey• Doctor Monkey• Howler Monkey – Complains about AWS limits
Maintenance
• Chaos Monkey – Kills Instances• Chaos Gorilla – Kills Availability Zones• Chaos Kong – Kills Regions• Latency Monkey – Latency and error injection
Availability
• Conformity Monkey – architectural pattern warnings• Security Monkey – security group and S3 bucket permissionsSecurity
NetflixOSS Testing and Automation
More?
![Page 82: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/82.jpg)
Vendor Driven PortabilityInterest in using NetflixOSS for Enterprise Private Clouds
“It’s done when it runs Asgard”Functionally completeDemonstrated MarchReleased June in V3.3
Vendor and end user interestOpenstack “Heat” getting therePaypal C3 Console based on Asgard
IBM Example application “Acme Air”Based on NetflixOSS running on AWSPorted to IBM Softlayer with Rightscale
![Page 83: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/83.jpg)
Cost-AwareCloud Architectures
Jinesh Varia@jinmanTechnology Evangelist
Adrian Cockcroft@adrianco
Director, Architecture
![Page 84: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/84.jpg)
Cloud Economics – Agile ROIGet a faster Return by Speeding up Investment
Observe
Orient
Decide
Act
Rapid innovation by speeding up theOODA loopTry, fail, try again, succeed
![Page 85: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/85.jpg)
$3K
$500
$75
$2K$100$33
$12
$692
Experiment Often & Adapt Quickly
$1K
$500$234$96
• Cost of failure falls dramatically• Return on (small incremental)
Investments is high• More risk taking, more innovation• More iteration, faster innovation
![Page 86: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/86.jpg)
« Want to increase innovation? Lower the cost of failure »
Joi Ito
![Page 87: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/87.jpg)
Accelerate building a new line of business
Market Replay
![Page 88: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/88.jpg)
Go Global in Minutes
![Page 89: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/89.jpg)
Netflix Examples
• European Launch using AWS Ireland– No employees in Ireland, no provisioning delay, everything worked– No need to do detailed capacity planning– Over-provisioned on day 1, shrunk to fit after a few days– Capacity grows as needed for additional country launches
• Brazilian Proxy Experiment– No employees in Brazil, no “meetings with IT”– Deployed instances into two zones in AWS Brazil– Experimented with network proxy optimization– Decided that gain wasn’t enough, shut everything down
![Page 90: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/90.jpg)
Product Launch Agility - Rightsized
Pre-Launch Build-out Testing Launch Growth Growth
DemandCloudDatacenter
$
![Page 91: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/91.jpg)
Product Launch - Under-estimated
Pre-Launch Build-out
TestingLaunch
GrowthGrowth
![Page 92: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/92.jpg)
Product Launch Agility – Over-estimated
Pre-Launch Build-out
TestingLaunch
GrowthGrowth
$
![Page 93: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/93.jpg)
Return on Agility (Agile ROI) = More Revenue
![Page 94: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/94.jpg)
#1 Business Agility by Rapid Experimentation = Increased Revenue
Key Takeaways on Cost-Aware Architectures….
![Page 95: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/95.jpg)
When you turn off your cloud resources, you actually stop paying for them
![Page 96: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/96.jpg)
1 5 9 13 17 21 25 29 33 37 41 45 49
Week
Web
Ser
vers
Optimize during a year
50% SavingsWeekly CPU Load
![Page 97: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/97.jpg)
Busi
ness
Thr
ough
put
Inst
ance
s
![Page 98: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/98.jpg)
50%+ Cost SavingScale up/down
by 70%+
Move to Load-Based Scaling
![Page 99: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/99.jpg)
Pay as you go
![Page 100: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/100.jpg)
AWS Support – Trusted Advisor – Your personal cloud assistant
![Page 101: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/101.jpg)
Other simple optimization tips
• Don’t forget to…– Disassociate unused EIPs– Delete unassociated Amazon
EBS volumes– Delete older Amazon EBS
snapshots– Leverage Amazon S3 Object
Expiration
Janitor Monkey cleans up unused resources
![Page 102: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/102.jpg)
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings
Building Cost-Aware Cloud Architectures
![Page 103: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/103.jpg)
When Comparing TCO…
![Page 104: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/104.jpg)
When Comparing TCO…
Make sure that you are including all the cost factors into consideration
PlacePowerPipesPeoplePatterns
![Page 105: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/105.jpg)
Save more when you reserve
On-demandInstances
• Pay as you go
• Starts from $0.02/Hour
ReservedInstances
• One time low upfront fee + Pay as you go
• $23 for 1 year term and $0.01/Hour
1-year and 3-year terms
Light Utilization RI
Medium Utilization RI
Heavy Utilization RI
![Page 106: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/106.jpg)
Utilization (Uptime)
Ideal For Savings over On-Demand
10% - 40%(>3.5 < 5.5 months/year)
Disaster Recovery(Lowest Upfront) 56%
40% - 75%(>5.5 < 7 months/year)
Standard Reserved Capacity 66%
>75%(>7 months/year)
Baseline Servers(Lowest Total Cost) 71%
Break-even point
ReservedInstances
• One time low upfront fee + Pay as you go
• $23 for 1 year term and $0.01/Hour
1-year and 3-year terms
Light Utilization RI
Medium Utilization RI
Heavy Utilization RI
![Page 107: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/107.jpg)
Mix and Match Reserved Types and On-DemandIn
stan
ces
Days of Month
0
2
4
6
8
10
12
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Heavy Utilization Reserved Instances
Light RI Light RILight RILight RI
On-Demand
![Page 108: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/108.jpg)
Netflix Concept for Regional Failover Capacity
West Coast
Light Reservations
Heavy Reservations
East Coast
Light Reservations
Heavy Reservations
NormalUse
Failover Use
![Page 109: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/109.jpg)
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings
#3 Mix and Match Reserved Instances with On-Demand = Savings
Building Cost-Aware Cloud Architectures
![Page 110: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/110.jpg)
Variety of Applications and Environments
Production Fleet
Dev FleetTest FleetStaging/QAPerf FleetDR Site
Every Application has…. Every Company has….
Business App Fleet
Marketing SiteIntranet SiteBI AppMultiple Products Analytics
![Page 111: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/111.jpg)
Consolidated Billing: Single payer for a group of accounts
• One Bill for multiple accounts
• Easy Tracking of account charges (e.g., download CSV of cost data)
• Volume Discounts can be reached faster with combined usage
• Reserved Instances are shared across accounts (including RDS Reserved DBs)
![Page 112: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/112.jpg)
Over-Reserve the Production Environment
Production Env.Account 100 Reserved
QA/Staging Env. Account 0 Reserved
Perf Testing Env.Account 0 Reserved
Development Env.Account 0 Reserved
Storage Account 0 Reserved
Total Capacity
![Page 113: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/113.jpg)
Consolidated Billing Borrows Unused Reservations
Production Env.Account 68 Used
QA/Staging Env. Account 10 Borrowed
Perf Testing Env.Account 6 Borrowed
Development Env.Account 12 Borrowed
Storage Account 4 Borrowed
Total Capacity
![Page 114: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/114.jpg)
Consolidated Billing Advantages
• Production account is guaranteed to get burst capacity– Reservation is higher than normal usage level– Requests for more capacity always work up to reserved
limit– Higher availability for handling unexpected peak demands
• No additional cost– Other lower priority accounts soak up unused reservations– Totals roll up in the monthly billing cycle
![Page 115: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/115.jpg)
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings
#3 Mix and Match Reserved Instances with On-Demand = Savings
#4 Consolidated Billing and Shared Reservations = Savings
Building Cost-Aware Cloud Architectures
![Page 116: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/116.jpg)
Continuous optimization in your architecture results in
recurring savings as early as your next month’s bill
![Page 117: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/117.jpg)
Right-size your cloud: Use only what you need
• An instance type for every purpose
• Assess your memory & CPU requirements– Fit your
application to the resource
– Fit the resource to your application
• Only use a larger instance when needed
![Page 118: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/118.jpg)
Reserved Instance Marketplace
Buy a smaller term instanceBuy instance with different OS or type
Buy a Reserved instance in different region
Sell your unused Reserved InstanceSell unwanted or over-bought capacityFurther reduce costs by optimizing
![Page 119: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/119.jpg)
Instance Type Optimization
Older m1 and m2 families• Slower CPUs• Higher response times• Smaller caches (6MB)• Oldest m1.xl 15GB/8ECU/48c• Old m2.xl 17GB/6.5ECU/41c• ~16 ECU/$/hr
Latest m3 family• Faster CPUs• Lower response times• Bigger caches (20MB)• Even faster for Java vs. ECU• New m3.xl 15GB/13 ECU/50c• 26 ECU/$/hr – 62% better!• Java measured even higher• Deploy fewer instances
![Page 120: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/120.jpg)
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings
#3 Mix and Match Reserved Instances with On-Demand = Savings
#4 Consolidated Billing and Shared Reservations = Savings
#5 Always-on Instance Type Optimization = Recurring Savings
Building Cost-Aware Cloud Architectures
![Page 121: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/121.jpg)
Follow the Customer (Run web servers) during the day
Follow the Money (Run Hadoop clusters) at night
0
2
4
6
8
10
12
14
16
Mon Tue Wed Thur Fri Sat Sun
No
of In
stan
ces
Runn
ing
Week
Auto Scaling Servers
Hadoop Servers
No. of ReservedInstances
![Page 122: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/122.jpg)
![Page 123: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/123.jpg)
Unused Reservations
Calculator
Web Application
Fleet
Total Instances Reserved
Table
Hadoop Fleet
4 AZ-mappings14 Types
Total InstancesRunning now = 100
Total unused Reservations available = 40 in 2 AZs(5 min interval)
Launch 40
![Page 124: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/124.jpg)
Soaking up unused reservations
Unused reserved instances is published as a metric
Netflix Data Science ETL Workload• Daily business metrics roll-up• Starts after midnight• EMR clusters started using hundreds of instances
Netflix Movie Encoding Workload• Long queue of high and low priority encoding jobs• Can soak up 1000’s of additional unused instances
![Page 125: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/125.jpg)
#1 Business Agility by Rapid Experimentation = Increased Revenue
#2 Business-driven Auto Scaling Architectures = Savings
#3 Mix and Match Reserved Instances with On-Demand = Savings
#4 Consolidated Billing and Shared Reservations = Savings
#5 Always-on Instance Type Optimization = Recurring Savings
Building Cost-Aware Cloud Architectures
#6 Follow the Customer (Run web servers) during the day Follow the Money (Run Hadoop clusters) at night
![Page 127: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/127.jpg)
Slideshare.net/Netflix Details• Meetup S1E3 July – Featuring Contributors Eucalyptus, IBM, Paypal, Riot Games
– http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html
• Lightning Talks March S1E2– http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-
roadmap
• Lightning Talks Feb S1E1– http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
• Asgard In Depth Feb S1E1– http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
• Security Architecture– http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
![Page 128: CMG2013 Workshop: Netflix Cloud Native, Capacity, Performance and Cost Optimization Techniques](https://reader035.vdocument.in/reader035/viewer/2022062418/554a0893b4c905e56c8b5956/html5/thumbnails/128.jpg)
Takeaways
Cloud Native Manages Scale and Complexity at Speed
NetflixOSS makes it easier for everyone to become Cloud Native
Rethink deployments and turn things off to save money!
http://netflix.github.comhttp://techblog.netflix.comhttp://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS