Dynamic Scaling and Redundancy in the Cloud Coburn Watson Manager, Cloud Performance, Netflix IEEE SCV 10/09/2013


Dynamic Scaling and Redundancy

in the Cloud

Coburn Watson

Manager, Cloud Performance, Netflix

IEEE SCV 10/09/2013

Netflix, Inc.

• World's leading internet television network

• ~ 38 Million subscribers in 40+ countries

• Over a billion hours streamed per month

• Approximately 33% of all US Internet traffic at night

• Recent Notables

• Increased originals catalog

• Large open source contribution

• OpenConnect (homegrown CDN)

About Me

• Manage Cloud Performance Engineering Team

• Sub-team of Cloud Solutions Organization

• Focus on performance since 2000

• Large-scale billing applications, eCommerce, datacenter mgmt., etc.

• Genentech, McKesson, Amdocs, Mercury Int., HP, etc.

• Passion for tackling performance at cloud-scale

• Looking for great performance engineers

[email protected]

Performance on the Cloud

http://onebigphoto.com/castle-in-germany-floating-above-the-clouds/

The balancing act

• Performance

• Instance type selection

• Tuning of “Base AMI”

• Aggressive caching model (memcached)

• Reliability

• Redundancy

• Over-provision to absorb failures

• Focus on OSS vs. COTS

• Select architectures (C*) with strong redundancy

• Scalability

• Support “thundering herd” scenarios

• Stateless service model enables rapid increases in capacity

Instance type selection

• Focus on horizontal vs. vertical scaling model

• Many moderate systems vs. few powerhouses

• One third of instances in each of 3 Availability Zones*

• Large memory apps reduce flexibility on choice

• m2.2xl = 4 cores, 36 GB RAM

• m3.2xl = 8 cores, 30 GB RAM

• 2x the CPU cores for only 25% more cost

* datacenter
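The combination of "over-provision to absorb failures" and an even split across 3 Availability Zones implies a sizing rule. The sketch below is one way to express it, assuming instances are spread evenly and that the surviving zones must carry the full peak after one zone is lost. The function name and the example figure of 100 instances are hypothetical, not from the talk.

```python
import math


def instances_for_az_loss(peak_instances: int, zones: int = 3) -> int:
    """Total fleet size so that losing one zone still leaves peak capacity.

    Assumes an even spread across zones and that traffic from the failed
    zone is absorbed entirely by the survivors (both are assumptions).
    """
    if zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    # Each surviving zone must hold peak / (zones - 1) instances.
    per_zone = math.ceil(peak_instances / (zones - 1))
    return per_zone * zones


if __name__ == "__main__":
    # Hypothetical service that needs 100 instances at peak.
    print(instances_for_az_loss(100))  # 150
```

With 3 zones this works out to 1.5x the peak requirement, which is the over-provisioning cost of tolerating a single zone failure under these assumptions.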

Instance type selection, cont…

• IO capabilities are qualified vs. quantified… measure it!

[Figure: measured throughput, ~ 1000 Mbps vs. ~ 700 Mbps]
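As a rough illustration of "measure it", the sketch below times a bulk transfer and converts it to Mbps. It pushes data through a local socket pair only so that it runs anywhere; a real measurement would run between two instances, typically with a dedicated tool such as iperf. The function names and default sizes are assumptions made for this example.

```python
import socket
import threading
import time


def throughput_mbps(num_bytes: int, seconds: float) -> float:
    """Convert a byte count and elapsed time to megabits per second."""
    return num_bytes * 8 / seconds / 1_000_000


def measure_socketpair(total_bytes: int = 50_000_000, chunk: int = 65536) -> float:
    """Push total_bytes through a local socket pair and return the Mbps seen."""
    sender, receiver = socket.socketpair()
    payload = b"\0" * chunk

    def send_all() -> None:
        sent = 0
        while sent < total_bytes:
            n = min(chunk, total_bytes - sent)
            sender.sendall(payload[:n])
            sent += n
        sender.close()  # signals end-of-stream to the receiver

    worker = threading.Thread(target=send_all)
    start = time.perf_counter()
    worker.start()
    received = 0
    while True:
        data = receiver.recv(chunk)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    worker.join()
    receiver.close()
    return throughput_mbps(received, elapsed)


if __name__ == "__main__":
    print(f"{measure_socketpair():.0f} Mbps (local socket pair, not a network path)")
```

Repeating the run several times and recording the spread, not just the mean, is what turns a qualitative rating into a number that can be planned against.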

Base AMI Tuning

• “Base AMI” – common Linux image

• OS, middleware, system utilities

• Tens of thousands of instances on same Base AMI

• Provides global tuning opportunities

• Example: CFS tuning to improve batch throughput

• kernel.sched_latency_ns=48000000 # default: 24000000 (24 ms)

• kernel.sched_min_granularity_ns=6000000 # default: 3000000 (3 ms)

• Above tunables provided 2-5% improvement in throughput

• Can be applied during the “baking” process, which produces a new AMI with the application layered on top
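To make the CFS change concrete, the sketch below renders the two tunables in `key = value` form (the syntax used by sysctl configuration files) and computes how many runnable tasks fit into one scheduling period at the minimum slice. The values come from the slide; the helper names are illustrative, and whether these knobs are exposed under the same names depends on the kernel version.

```python
# Values copied from the slide; helper names are illustrative only.
DEFAULTS = {
    "kernel.sched_latency_ns": 24_000_000,
    "kernel.sched_min_granularity_ns": 3_000_000,
}
TUNED = {
    "kernel.sched_latency_ns": 48_000_000,
    "kernel.sched_min_granularity_ns": 6_000_000,
}


def render_sysctl(settings: dict) -> str:
    """Render settings as 'key = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


def tasks_per_period(settings: dict) -> float:
    """Runnable tasks that fit in one latency period at the minimum slice."""
    return (
        settings["kernel.sched_latency_ns"]
        / settings["kernel.sched_min_granularity_ns"]
    )


if __name__ == "__main__":
    print(render_sysctl(TUNED))
    print(tasks_per_period(DEFAULTS), tasks_per_period(TUNED))  # 8.0 8.0
```

Both settings keep the same ratio of 8, so the tuning doubles the length of each time slice without changing how many tasks share a period. Longer slices mean fewer context switches, which is consistent with the batch-throughput gain reported on the slide.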

Maximizing Redundancy

vs.

Fear (Revere) the Monkeys

• Simulate

• Latency

• Errors

• Initiate

• Instance Termination

• Availability Zone Failure

• Identify

• Configuration Drift

… in Test and Production
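The "simulate latency and errors" idea can be sketched as a wrapper around any call to a dependency. This is not the Netflix tooling itself, only a minimal illustration; `with_chaos`, `InjectedError`, and the parameters are names invented for this example.

```python
import random
import time


class InjectedError(RuntimeError):
    """Raised in place of a real downstream failure."""


def with_chaos(func, latency_s=0.0, error_rate=0.0, rng=None):
    """Wrap func so each call is delayed by latency_s and fails at error_rate."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            time.sleep(latency_s)           # simulated slow dependency
        if rng.random() < error_rate:       # simulated failing dependency
            raise InjectedError("injected failure")
        return func(*args, **kwargs)

    return wrapper


if __name__ == "__main__":
    flaky_lookup = with_chaos(lambda key: key.upper(), latency_s=0.05, error_rate=0.2)
    try:
        print(flaky_lookup("title"))
    except InjectedError as exc:
        print("caller must handle:", exc)
```

Running callers against such a wrapper in test, and eventually in production, shows whether they degrade gracefully before a real outage forces the question.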

Hystrix: Defend Your App

● Protection from downstream service failures

● Functional (unavailable) or performance in nature
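The protection pattern behind this slide is commonly called a circuit breaker with a fallback. Hystrix is a Java library, and the sketch below is not its API. It is a minimal Python illustration under stated assumptions, with `CircuitBreaker`, `max_failures`, and `reset_after` invented for the example. It covers only functional failures (exceptions); handling slow dependencies would additionally need timeouts.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive errors, then serves the
    fallback until reset_after seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # open: skip the dependency entirely
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # success closes the circuit
        return result


if __name__ == "__main__":
    def recommendations():
        raise IOError("downstream service unavailable")

    breaker = CircuitBreaker(max_failures=2)
    for _ in range(3):
        print(breaker.call(recommendations, lambda: ["popular titles"]))
```

Once the breaker is open, the failing service stops receiving calls, so the caller stays responsive and the dependency gets room to recover.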

Maximizing: Scalability and Performance

Dynamic Scaling

EC2 footprint autoscales 2500-3500 instances per day

• On the order of tens of thousands of EC2 instances

• Larger ASG spans 200-1000 m2.4xlarge daily

• m2.4xlarge: 8 Cores, 64GB RAM, 1.7TB Disk

Why:

• Improved scalability during unexpected workloads

• Absorb variance in service performance profile

• Reactive chain of dependencies

• Creates "reserved instance troughs" for batch activity

Dynamic Scaling, cont.

Example covers 3 services

• 2 edge (A, B), 1 mid-tier (C)

• C has more upstream services than simply A and B

Multiple Autoscaling Policies

• (A) System Load Average

• (B, C) Request-Rate based
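A request-rate based policy can be reduced to a target-tracking calculation: pick the instance count that keeps per-instance load at or below a target, clamped to the group's bounds. The sketch below assumes this simple form; the function name, the target of 100 requests per second per instance, and the traffic figures are hypothetical, while the 200 to 1000 bounds echo the ASG span mentioned earlier in the talk.

```python
import math


def desired_instances(request_rate, target_rps_per_instance, min_size, max_size):
    """Instance count that keeps per-instance load at or below the target."""
    needed = math.ceil(request_rate / target_rps_per_instance)
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    # Hypothetical traffic levels over a day, in requests per second.
    for rps in (5_000, 45_000, 500_000):
        print(rps, desired_instances(rps, 100, min_size=200, max_size=1000))
```

A load-average policy, as used for service A, follows the same shape with a system metric in place of the request rate.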

Dynamic Scaling, cont.

Dynamic Scaling, cont.

• Response time variability greatest during scaling events

• Average response time primarily between 75 and 150 msec

Dynamic Scaling, cont.

• Instance counts 3x, Aggregate requests 4.5x (not shown)

• Average CPU utilization per instance: ~25-55%
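The two growth figures above imply how much harder each instance works at peak. The short calculation below uses only the numbers on the slide; treating the low and high ends of the CPU range as trough and peak is an assumption.

```python
def per_instance_growth(request_growth: float, instance_growth: float) -> float:
    """Growth in requests handled by each instance between trough and peak."""
    return request_growth / instance_growth


if __name__ == "__main__":
    # 4.5x aggregate requests served by 3x the instances.
    print(per_instance_growth(4.5, 3.0))  # 1.5
    # CPU range from the slide, assumed to map to trough and peak.
    print(55.0 / 25.0)
```

So each instance handles about 1.5x the requests at peak, which is why per-instance CPU still rises even though the fleet triples.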

Cassandra Performance

Study performed:

• 24 node C* SSD-based cluster (hi1.4xlarge)

• mid-tier service load application

• Targeting 2x production rates

• Increase read ops from 30k to 70k in ~ 3 minutes

• Increase write ops from 750 to 1500 in ~ 3 minutes

Results:

• 95th pctl response time increase: ~ 17 msec to 45 msec

• 99th pctl response time increase: ~ 35 msec to 80 msec
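The results are reported as 95th and 99th percentile response times. For readers reproducing such a study, the sketch below shows one common definition, the nearest-rank percentile. The talk does not say which definition or tool was used, so this choice and the sample data are assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical response times in msec: 1, 2, ..., 100.
    latencies = list(range(1, 101))
    print(percentile(latencies, 95), percentile(latencies, 99))  # 95 99
```

Tail percentiles are more sensitive to load ramps than the mean, which makes them a better signal for studies like this one.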

EVCache (memcached) Scalability

Response times consistent during 4x increase in load *

* Due to upstream code change

Takeaways

• Evolve architecture and processes to mitigate risks

• Factor redundancy requirements into scaling strategy

• Stateless micro-service architectures win!

• Benchmark everything…quantify variability

• On-demand Cloud makes benchmarking convenient
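One simple way to "quantify variability" across repeated benchmark runs is the coefficient of variation, the standard deviation divided by the mean. The sketch below assumes that metric; the talk does not name one, and the sample figures are hypothetical.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (unitless)."""
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return statistics.stdev(samples) / mean


if __name__ == "__main__":
    # Hypothetical throughput results from three identical benchmark runs.
    runs = [90.0, 100.0, 110.0]
    print(f"{coefficient_of_variation(runs):.2f}")
```

Because the result is unitless, it allows comparing the stability of different instance types or services even when their absolute numbers differ widely.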

Netflix Open Source

Our Open Source Software simplifies mgmt at scale

Great projects, stunning colleagues: jobs.netflix.com

Q&A

[email protected]

• Twitter: @coburnw

• LinkedIn: http://www.linkedin.com/in/coburnw/

• Netflix Tech Blog: http://techblog.netflix.com