surge 2013: maximizing scalability, resiliency, and engineering velocity in the cloud

Coburn Watson, Manager, Cloud Performance, Netflix
Surge ‘13

Upload: coburn-watson

Post on 10-May-2015

DESCRIPTION

Surge 2013 presentation which covers how Netflix maximizes engineering velocity while keeping risks to scalability, reliability, and performance in check.

TRANSCRIPT

Page 1: Surge 2013: Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Maximizing Scalability, Resiliency, and Engineering Velocity in the Cloud

Coburn Watson
Manager, Cloud Performance, Netflix
Surge ‘13

Page 2: Netflix, Inc.

• World's leading internet television network
• ~38 million subscribers in 40+ countries
• Over a billion hours streamed per month
• Approximately 33% of all US Internet traffic at night
• Recent Notables
  • Increased Originals catalog
  • Large open source contribution
  • OpenConnect (homegrown CDN)

Page 3: About Me

• Manage Cloud Performance Engineering Team
  • Sub-team of Cloud Solutions Organization
• Focus on performance since 2000
  • Large-scale billing applications, eCommerce, datacenter mgmt., etc.
  • Genentech, McKesson, Amdocs, Mercury Int., HP, etc.
• Passion for tackling performance at cloud-scale
• Looking for great performance engineers
• [email protected]

Page 4: Freedom and Responsibility

• Culture deck: a great read
• Good performers: 2x, Top performers: 10x
• What engineers dislike
  • cumbersome processes
  • deployment inefficiency
  • restricted access
  • restricted technical freedom
  • lack of trust
• If removed…maximize:
  • Engineering velocity
  • Engineer satisfaction

Page 5: Maximizing: Engineering Velocity

Page 6: How

• Implementation freedom
  • SCM, libraries, language
  • That said, platform benefits exist
• Deployment freedom
  • Service team owns
    • push schedule, functionality, performance
    • operational activities (being paged)
• On-demand cloud capacity
  • Thousands of instances at the push of a button

Page 7: Rapid Deployment?

Impossible..

3-6 Months?

Page 8: Rapid (Cloud) Deployment

3-5 Minutes

Page 9: BaseAMI

• Supply the foundation
  • Monitoring, Java, Apache, Tomcat, etc.
• Open source project: Aminator

Page 10: Pushing Code: Red-Black

• Gracefully roll code in, or out, of production
• Asgard is our AWS configuration mgmt. tool
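
The red-black idea of rolling code in or out gracefully can be sketched as two server groups, where traffic moves to the new group only after it looks healthy and the old group is kept around for rollback. This is a minimal illustration, not Asgard's API: `ServerGroup`, `red_black_push`, `roll_back`, and the `healthy` callback are all hypothetical names invented here.

```python
from dataclasses import dataclass


@dataclass
class ServerGroup:
    """Hypothetical stand-in for an auto scaling group (ASG)."""
    name: str
    version: str
    instances: int
    taking_traffic: bool = False


def red_black_push(old: ServerGroup, new_version: str, healthy) -> ServerGroup:
    """Roll new_version in next to `old`; return the group left serving traffic."""
    # 1. Launch a new group at the same size while the old one keeps serving.
    new = ServerGroup(f"{old.name}-next", new_version, old.instances)
    # 2. Only shift traffic once the new group passes a health check.
    if not healthy(new):
        new.instances = 0          # tear down the failed push
        return old
    new.taking_traffic = True
    # 3. Stop routing to the old group, but keep its instances for rollback.
    old.taking_traffic = False
    return new


def roll_back(old: ServerGroup, new: ServerGroup) -> ServerGroup:
    """Roll the new code back out by re-enabling the old group."""
    old.taking_traffic = True
    new.taking_traffic = False
    return old
```

Keeping the old group's instances running is what makes rolling code out as quick as rolling it in; the cost is temporarily paying for both groups.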

Page 11: Not all Roses

Compounded risks with increased velocity

Risks: Decreased Reliability, Performance, and Scalability

Page 12: Goal: CI (Continuous Improvement)

Page 13: Maximizing: Reliability

Page 14: Fear (Revere) the Monkeys

• Simulate
  • Latency
  • Errors
• Initiate
  • Instance Termination
  • Availability Zone Failure
• Identify
  • Configuration Drift

… in Test and Production
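
The "simulate latency and errors" part can be illustrated with a small wrapper around a dependency call. This is only a sketch of the idea and not how the Netflix monkeys are implemented; `with_faults` and its parameters are hypothetical names chosen for this example.

```python
import random
import time


def with_faults(call, latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap `call` so it sees added latency and a fraction of injected errors."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)                        # simulate a slow dependency
        if rng.random() < error_rate:
            raise RuntimeError("injected failure")  # simulate a failing dependency
        return call(*args, **kwargs)

    return wrapped
```

Passing `sleep` and `rng` in as arguments keeps the wrapper testable without real delays or real randomness.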

Page 15: Tracking Change: Chronos

• Aggregate Significant Events *
  • Current Sources:
    • Pushes (Asgard)
    • Production Change Requests (JIRA)
    • AWS Notifications
    • Dynamic Property Changes
    • ASG Scaling Events
• Implementation
  • Simple REST service; customized adapters

* “can disrupt production service”
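
The "aggregate events through adapters" shape can be sketched as a common event record, one adapter per source, and a time-window query. Everything named here (`ChangeEvent`, `EventLog`, `from_asgard_push`, and the keys of the raw payload) is an assumption made for illustration; the slides do not describe Chronos's actual schema or REST endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    timestamp: float   # seconds since epoch
    source: str        # e.g. "asgard", "jira", "aws", "properties", "asg"
    summary: str


class EventLog:
    """In-memory stand-in for the store behind an event-tracking service."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def between(self, start, end):
        """Events with start <= timestamp <= end, oldest first."""
        hits = [e for e in self._events if start <= e.timestamp <= end]
        return sorted(hits, key=lambda e: e.timestamp)


def from_asgard_push(raw):
    """Hypothetical adapter: the keys of `raw` are invented for this sketch."""
    return ChangeEvent(raw["time"], "asgard", f"push of {raw['app']}")
```

With every source normalized to one record type, "what changed just before this incident?" becomes a single window query.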

Page 16: Chronos, cont.

Page 17: Automated Canary Analysis

• Identify regression between new and existing code
• Point ACA to baseline (prod) and canary ASG
  • Typically analyze an hour's worth of time series data
  • Compare ratio of averages between canary and baseline
  • Evaluate range and noise; determine quality of signal
• Bucket: Hot, Cold, Noisy, or OK
  • Multiple classifiers available
  • Multiple metric collections (e.g. hand-picked by service, general)
• Rollup
  • Constrained: along metric dimensions
  • Final: Score the canary
• Implementation: R-based analysis
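
The ratio-of-averages comparison and the Hot/Cold/Noisy/OK buckets can be sketched as below. The slides say the real analysis is R-based with multiple classifiers; this Python version is one simple stand-in. The noise test (coefficient of variation), the `tolerance` and `max_cv` thresholds, and scoring as "percent of metrics OK" are all assumptions made for this example.

```python
from statistics import mean, pstdev


def classify(canary, baseline, tolerance=0.2, max_cv=0.5):
    """Bucket one metric as HOT, COLD, NOISY, or OK."""
    base_avg, canary_avg = mean(baseline), mean(canary)
    # Treat a series as noisy when its coefficient of variation is large.
    for series, avg in ((baseline, base_avg), (canary, canary_avg)):
        if avg == 0 or pstdev(series) / abs(avg) > max_cv:
            return "NOISY"
    ratio = canary_avg / base_avg
    if ratio > 1 + tolerance:
        return "HOT"
    if ratio < 1 - tolerance:
        return "COLD"
    return "OK"


def score(metrics):
    """Final rollup: percentage of metrics classified OK, plus the buckets."""
    buckets = {name: classify(c, b) for name, (c, b) in metrics.items()}
    ok = sum(1 for b in buckets.values() if b == "OK")
    return 100.0 * ok / len(buckets), buckets
```

Here `metrics` maps a metric name to a `(canary_series, baseline_series)` pair. A constrained rollup would apply the same scoring to subsets of metrics grouped along a dimension.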

Page 18: ACA: in Action

[Figure: per-metric classifications (HOT, COLD, NOISY, OK) with the constrained rollup (dashed) and the final rollup]

Page 19: Hystrix: Defend Your App

• Protection from downstream service failures
  • Functional (unavailable) or performance in nature
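
One common way to get this kind of protection is a circuit breaker with a fallback: after repeated failures, calls to the dependency are skipped for a while and a fallback answer is returned instead. The sketch below shows the pattern in Python under that assumption; it is not Hystrix's API (Hystrix is a Java library), and `CircuitBreaker` and its parameters are hypothetical names.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the reset window has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            self.opened_at = None        # allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

This covers only functional failures (exceptions). Protecting against performance failures would also need a timeout around `func`, which is left out to keep the sketch short.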

Page 20: Maximizing: Scalability and Performance

Page 21: Dynamic Scaling

EC2 footprint autoscales 2500-3500 instances per day
• Order of tens of thousands of EC2 instances
• Larger ASG spans 200-900 m2.4xlarge daily

Why:
• Improved scalability during unexpected workloads
• Absorb variance in service performance profile
• Reactive chain of dependencies
• Creates "reserved instance troughs" for batch activity

Page 22: Dynamic Scaling, cont.

Example covers 3 services
• 2 edge (A, B), 1 mid-tier (C)
• C has more upstream services than simply A and B

Multiple Autoscaling Policies
• (A) System Load Average
• (B, C) Request-rate based
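
A request-rate based policy, like the one named for services B and C, can be thought of as sizing the group so each instance stays near a target request rate. The function below is a simplified sketch of that idea and not the AWS Auto Scaling API; `desired_instances` and the per-instance target are assumptions.

```python
import math


def desired_instances(requests_per_sec, target_rps_per_instance, min_size, max_size):
    """Instances needed to keep each one near the target rate, within bounds."""
    needed = math.ceil(requests_per_sec / target_rps_per_instance)
    return max(min_size, min(max_size, needed))
```

For example, with an invented target of 100 requests/sec per instance and the 200-900 bounds borrowed from the larger ASG mentioned earlier, 45,000 requests/sec would call for 450 instances. A load-average policy such as the one for service A has the same shape, with CPU load in place of request rate.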

Page 23: Dynamic Scaling, cont.

Page 24: Dynamic Scaling, cont.

• Response time variability greatest during scaling events
• Average response time primarily between 75-150 msec

Page 25: Dynamic Scaling, cont.

• Instance counts 3x, aggregate requests 4.5x (not shown)
• Average CPU utilization per instance: ~25-55%
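
Taking the two growth factors at face value, the implied change in load per instance is a one-line calculation. This assumes requests are spread evenly across instances, which the slides do not state.

```python
instance_growth = 3.0    # instance count growth (from the slide)
request_growth = 4.5     # aggregate request growth (from the slide)

# Requests handled per instance grow by the ratio of the two factors.
per_instance_growth = request_growth / instance_growth
print(per_instance_growth)   # 1.5
```

So each instance handles roughly 1.5x the requests at peak, which is in line with per-instance CPU utilization rising rather than staying flat.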

Page 26: Cassandra Performance

Study performed:
• 24-node C* SSD-based cluster (hi1.4xlarge)
• Mid-tier service load application
• Targeting 2x production rates
  • Increase read ops from 30k to 70k in ~3 minutes
  • Increase write ops from 750 to 1500 in ~3 minutes

Results:
• 95th pctl response time increase: ~17 msec to 45 msec
• 99th pctl response time increase: ~35 msec to 80 msec

Page 27: EVCache (memcached) Scalability

Response times consistent during 4x increase in load *

* Due to upstream code change

Page 28: Cloud-scale Load Testing

• Ad-hoc or CI-based load test model
  • (CI) Run-over-run comparison; email on rule violation

1. Jenkins initiates job
2. JMeter instances apply load
3. Results written to S3
4. Instance metrics published to Atlas
5. Raw data fetched and processed
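
The run-over-run comparison in the CI model can be sketched as a rule applied to the processed results of two runs. The rule used here (flag any metric that grew by more than 10%) and the metric names are assumptions; the slides do not say which rules Netflix applies.

```python
def rule_violations(previous, current, max_increase=0.10):
    """Return metrics whose value grew by more than max_increase run over run."""
    violations = {}
    for name, before in previous.items():
        after = current.get(name)
        if after is None or before <= 0:
            continue                      # nothing to compare against
        change = (after - before) / before
        if change > max_increase:
            violations[name] = change
    return violations
```

A CI job could send the email notification whenever the returned dictionary is non-empty.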

Page 29: Conclusions

• Continually accelerate engineering velocity
• Evolve architecture and processes to mitigate risks
• Stateless micro-service architectures win!
• Remove barriers for engineers
• Last option should be to reduce rate of change
• Exercise failure and “thundering herd” scenarios
• Cloud native scaling and resiliency are key factors
• Leverage pre-existing OSS PaaS when possible

Page 30: Netflix Open Source

Our Open Source Software simplifies mgmt at scale

Great projects, stunning colleagues: jobs.netflix.com

Page 31: Q&A

[email protected]

• Netflix Tech Blog: http://techblog.netflix.com