#lspe q1 2013 dynamically scaling netflix in the cloud

Dynamically Scaling Netflix in the Cloud Coburn Watson Manager - Cloud Performance Engineering

Upload: coburn-watson

Post on 08-May-2015


DESCRIPTION

Meetup presentation on how Netflix dynamically scales in the cloud. It covers topics primarily related to AWS autoscaling and provides some "day-in-the-life" data.

TRANSCRIPT

Page 1: #lspe Q1 2013   dynamically scaling netflix in the cloud

Dynamically Scaling Netflix in the Cloud

Coburn Watson
Manager - Cloud Performance Engineering

Page 2:

Netflix, Inc.

- World's leading internet television network
- 33 million subscribers in 40 countries
- Over a billion hours streamed per month
- Approximately 33% of all US Internet traffic at night
- Increasing quantity of original content
- Recent Technical Notables
  - Open Source Software
  - OpenConnect (homegrown CDN)

Page 3:

About Me

- Manage Cloud Performance Engineering team
- Focus on performance since 2000-ish
  - Large-scale billing applications, eCommerce, datacenter mgmt, etc.
    - Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
- Passion for tackling performance at cloud-scale
- Looking for great performance engineers
- [email protected]

Page 4:

First things first

- ASG = Autoscaling group
- AWS description:

  "An Auto Scaling group is a representation of multiple Amazon EC2 instances that share similar characteristics, and that are treated as a logical grouping for the purposes of instance scaling and management."

  "An Auto Scaling group starts by launching the minimum number (or the desired number, if specified) of EC2 instances and then increases or decreases the number of running EC2 instances automatically according to the conditions that you define."

- Within Netflix (almost) all services are created as ASGs
  - Asgard (OSS) simplifies this process
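The AWS description quoted on this slide can be illustrated with a small model. This is only a sketch in Python, not Asgard or AWS code: the class name, its fields, and the assumption that the group size is always kept between a minimum and a maximum are illustrative. The 200-600 range is borrowed from the largest autoscaling ASG mentioned later in the deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutoScalingGroup:
    """Toy model of an ASG's size limits (hypothetical, not the AWS API)."""
    min_size: int
    max_size: int
    desired: Optional[int] = None  # optional, as in the AWS description

    def initial_instances(self) -> int:
        # Launch the desired number if specified, otherwise the minimum.
        start = self.desired if self.desired is not None else self.min_size
        return self.clamp(start)

    def clamp(self, count: int) -> int:
        # Assumption: the group size always stays within [min_size, max_size].
        return max(self.min_size, min(self.max_size, count))


group = AutoScalingGroup(min_size=200, max_size=600)
print(group.initial_instances())  # starts at the minimum when no desired size is set
```

Scaling policies then move the instance count up and down inside those bounds according to the conditions the service owner defines.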

Page 5:

Dynamic Scaling @ Netflix

- EC2 footprint autoscales 2500-3500 instances per day
  - order of tens of thousands of EC2 instances
- Largest ASG* spans 200-600 m2.4xlarge (64GB RAM)

Why:

- Improved scalability during unexpected workloads
- Avoid sizing capacity aggressively high
  - each service team determines their capacity
- Creates "reserved instance troughs" for batch activity
  - on the order of hundreds of thousands of instance hours weekly

* largest "autoscaling" ASG

Page 6:

How?

- Discovery
  - AWS elastic load balancers "speak" autoscaling
  - mid-tier services utilize Eureka (OSS)
- Leverage native AWS autoscaling capabilities
  - Publish our own metrics up to CloudWatch (Servo OSS)
- Stateless
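To make "publish our own metrics up to CloudWatch" concrete, here is a minimal Python sketch of what one custom metric data point might look like. It is not Servo (which is a Java library) and not Netflix's code; the metric name, namespace, ASG name, and helper function are all hypothetical. The dictionary follows the shape of CloudWatch's PutMetricData request.

```python
from datetime import datetime, timezone


def build_metric_datum(name: str, value: float, asg_name: str) -> dict:
    """Build one datum in the shape CloudWatch's PutMetricData call expects."""
    return {
        "MetricName": name,
        # Tagging the datum with the ASG name lets an alarm aggregate per group.
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": "Count/Second",
    }


# Hypothetical metric and group names, for illustration only.
datum = build_metric_datum("requestsPerSecond", 72.5, "service-b-v001")

# With boto3 installed and AWS credentials configured, the datum could be sent with:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Example/ServiceB", MetricData=[datum])
```

Once a metric like this is in CloudWatch, an alarm on it can drive an autoscaling policy in the same way as a built-in metric such as CPU utilization.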

Page 7:

How?

Two types of scaling behavior exposed in Asgard

1. rate-based autoscaling
2. scheduled action autoscaling

Page 8:

AWS Autoscaling

- Define policies on ASG
  - alarm, scaling unit (percent/amount), cooldown, evaluation interval and period
- Cooldowns:
  - ASG-level versus policy-level (both exist)
  - cooldown start tied to last instance ready
  - should be tied closely to application/service startup time
- Execute load or squeeze tests to measure capacity
  - Frequent pushes with SOA correspond to possibly frequent changes in per-instance capacity
  - (insert here) 10 second primer on squeeze tests
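The policy ingredients listed on this slide (alarm, scaling unit, cooldown) can be sketched as a small decision function. This is a simplified Python model under stated assumptions, not the AWS implementation: the names are hypothetical, the rounding rule for percentage adjustments is an assumption, and the cooldown clock is assumed to start when the last launched instance became ready.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    adjustment: float      # size of the step; negative values scale down
    unit: str              # "amount" (instances) or "percent" (of current size)
    cooldown_seconds: int  # minimum gap after the last scaling activity


def next_capacity(policy: ScalingPolicy, current: int, min_size: int,
                  max_size: int, alarm_firing: bool,
                  seconds_since_last_activity: float) -> int:
    """Return the new desired capacity, or the current one if nothing changes."""
    if not alarm_firing:
        return current
    if seconds_since_last_activity < policy.cooldown_seconds:
        return current  # still cooling down, so the alarm is ignored
    if policy.unit == "percent":
        delta = current * policy.adjustment / 100.0
        # Assumption: round away from zero so a small percentage still acts.
        step = math.ceil(delta) if delta > 0 else math.floor(delta)
    else:
        step = int(policy.adjustment)
    # Never leave the bounds configured on the group.
    return max(min_size, min(max_size, current + step))


scale_up = ScalingPolicy(adjustment=10, unit="percent", cooldown_seconds=300)
print(next_capacity(scale_up, 200, 200, 600, True, 600))  # cooldown elapsed
print(next_capacity(scale_up, 200, 200, 600, True, 60))   # inside cooldown
```

The sketch also shows why the cooldown should track service startup time: if it is shorter than the time new instances need to take traffic, the alarm fires again before the previous step has had any effect.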

Page 9:

In Action

- Example covers 3 services
  - 2 edge (A, B), 1 mid-tier (C)
  - C has more upstream services than simply A and B
- Multiple autoscaling policy types
  - (A) System Load Average
  - (B) Request-rate based (tomcat requestCount)
  - (C) Request-rate based (internal library numCompleted)

Page 10:

Day in the life, instance counts

- At peak 1,948 instances
- without autoscaling: ~46.8k instance hours
- with autoscaling: ~31.2k instance hours (~33% reduction in usage)
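The instance-hour figures on this slide can be reproduced with a few lines of arithmetic. The Python sketch below assumes the "without autoscaling" case means running the peak instance count for all 24 hours, which matches the ~46.8k figure; the autoscaled total is taken from the slide rather than derived.

```python
peak_instances = 1948
hours_per_day = 24

# Without autoscaling, the fleet is assumed to sit at peak size all day.
static_hours = peak_instances * hours_per_day   # 46,752, i.e. ~46.8k
autoscaled_hours = 31_200                       # ~31.2k, as stated on the slide

saved_hours = static_hours - autoscaled_hours   # instance hours not consumed
reduction = saved_hours / static_hours          # roughly one third

print(f"saved {saved_hours} instance hours ({reduction:.0%} reduction)")
```

The saved hours are the daily "trough" that these three services alone leave available for other work.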

Page 11:

Day in the life, request rates

- Total requests: 4.5x peak versus min
- Per instance stays between 45-90 RPS
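The idea behind request-rate-based scaling can be sketched as sizing the fleet so that each instance stays at or below a target rate. The Python below is illustrative only: the function name and the traffic totals are hypothetical (chosen to be 4.5x apart, like the slide), and the 90 RPS target is simply the upper end of the per-instance range reported here.

```python
import math


def instances_needed(total_rps: float, target_rps_per_instance: float) -> int:
    """Smallest instance count that keeps per-instance load at or below target."""
    return max(1, math.ceil(total_rps / target_rps_per_instance))


# Hypothetical traffic figures: a trough and a peak 4.5x higher.
trough_rps, peak_rps = 20_000, 90_000
for rps in (trough_rps, peak_rps):
    n = instances_needed(rps, target_rps_per_instance=90)
    print(f"{rps} RPS -> {n} instances, {rps / n:.1f} RPS each")
```

Real policies change capacity in discrete steps separated by cooldowns rather than tracking the target exactly, which may be one reason the observed per-instance rate moves within a band instead of holding at a single value.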

Page 12:

Day in the life, latency

- Response variability greatest during initial scale-up events
- Average response time primarily between 75-150 msec

Page 13:

Day in the life, CPU Utilization

- Instance counts 3x, Request rate 4.5x (not shown)
- Avg CPU utilization per instance: ~25-55% *

* service A currently resolving concurrency issue; limits ideal CPU utilization

Page 14:

Unused capacity

- Reserved Instance "troughs" = spare capacity
- Align services along fewer instance types for fewer, larger pools
- Current usage
  - Stand up "bonus" EMR cluster in off-peak hours
- Planned usage
  - Framework being developed to share unused capacity "fairly" across multiple batch applications

Page 15:

Caveats

- AWS Autoscaling
  - Simplified scaling policy capabilities
  - Cooldown is static, not dynamically configurable
- Application resource profiles can change quickly (SOA)
- When something goes wrong...
  1. traffic rates can drop quickly
  2. scale-down can kick in
  3. thundering herd can knock you back down
  - lock out scale-down quickly
  - proactively protect yourself with Hystrix (OSS) against downstream service degradation or failure
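The failure sequence above (traffic drops, scale-down kicks in, returning traffic overwhelms the shrunken fleet) suggests a guard that refuses to scale down when the drop looks abnormal. The Python sketch below is one hypothetical way to express such a lockout; the function, its parameters, and the 50% threshold are assumptions for illustration, not how Netflix or AWS implement it.

```python
def allow_scale_down(current_rps: float, baseline_rps: float,
                     max_drop_fraction: float = 0.5) -> bool:
    """Return False (lock out scale-down) when traffic has dropped abnormally fast."""
    if baseline_rps <= 0:
        return True  # no baseline to compare against
    drop = (baseline_rps - current_rps) / baseline_rps
    # A gradual decline is normal daily behavior; a sudden collapse more likely
    # means an upstream failure, so capacity is kept for when traffic returns.
    return drop < max_drop_fraction


print(allow_scale_down(current_rps=900, baseline_rps=1000))  # modest decline
print(allow_scale_down(current_rps=300, baseline_rps=1000))  # sudden collapse
```

Here `baseline_rps` would be a recent reference rate, for example the rate a few minutes earlier; choosing that window and the threshold is service-specific.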

Page 16:

Wrap-up

- Autoscaling is a big win for Netflix
- Dynamically scaling affords improved scalability
- Our Open Source Software simplifies mgmt at scale

Next Netflix OSS meetup: Wednesday March 13th @ Netflix

- Great projects, stunning colleagues: jobs.netflix.com