Cloud Capacity Planning..an Oxymoron? - South Bay SRE Meetup, Aug-09-2016


Upload: coburn-watson

Post on 10-Apr-2017


Cloud Capacity Planning

South Bay SRE meetup - August 9th, 2016

● Cloud Capacity Planning..an Oxymoron?

● Santa Cloud: How Netflix Does Holiday Capacity Planning

● The Data Behind the Planning

Presenting...

Cloud Capacity Planning..an Oxymoron?


● > 83M households

● 190 Countries

● 35% of Internet traffic in US at peak

● Entirely on Cloud*, three regions

● Evacuate a region monthly...for 24 hours

● Capacity planning ~ 5 people! (in the room :-)

* Content served from homegrown OpenConnect CDN
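
To make the capacity cost of a regional evacuation concrete, here is a minimal sketch of the arithmetic. It assumes traffic is split evenly across regions and redistributed evenly to the survivors, which the slides do not state; the function name is hypothetical.

```python
def evacuation_load_factor(regions: int) -> float:
    """Load multiplier on each surviving region when one region is evacuated.

    Assumption (not from the talk): traffic is split evenly across regions
    before the evacuation and spread evenly over the survivors afterwards.
    """
    if regions < 2:
        raise ValueError("need at least two regions to evacuate one")
    return regions / (regions - 1)


factor = evacuation_load_factor(3)
print(f"each surviving region carries {factor:.2f}x its normal load")
```

Under that assumption, running in three regions means each one needs roughly 50% headroom, or the ability to scale up that much, to absorb an evacuated neighbor.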

Capacity Planning Concerns

● Facility considerations (Space, Power, Network, Cooling)

● Supply Chain Management Constraints and Relationships

● Hardware lifetime contour & failure rates (MTBF)

● Systems management staff

● Seasonal and unexpected burst considerations

● Workload colocation and performance demands

● Over-provisioning for reliability and rate of innovation

● Effective tooling

● Business continuity planning

(Cloud) Capacity Planning Concerns

● Facility considerations (Power, Network, Cooling)

● Supply Chain Management Constraints and Relationships

● Hardware lifetime contour & failure rates (MTBF)

● Systems management staff

● Seasonal and unexpected burst considerations

● Workload colocation and performance demands

● Over-provisioning for reliability and rate of innovation

● Effective tooling

● Business continuity planning

Cloud-specific CP Factors

● Capacity bounds..unknown (-)

● Vendor Decisions (-/+)

○ Hardware/Offering Evolution Timeline

○ Resource Demand (CPU/Mem/Disk/Net) Matrix

● On-Demand Capability (+)
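
One way to read the resource demand matrix is as a sizing problem: for a given instance shape, the scarcest resource decides how many instances a workload needs. The sketch below illustrates that idea only; the workload numbers, the instance shape, and the function name are all hypothetical.

```python
import math


def instances_needed(demand: dict, per_instance: dict) -> tuple:
    """Return (instance count, bottleneck resource) for a workload.

    Each resource is sized independently; the largest count wins.
    """
    counts = {r: math.ceil(demand[r] / per_instance[r]) for r in demand}
    bottleneck = max(counts, key=counts.get)
    return counts[bottleneck], bottleneck


# Hypothetical workload demand and hypothetical instance shape.
demand = {"cpu_cores": 400, "mem_gib": 900, "disk_gib": 5000, "net_gbps": 30}
shape = {"cpu_cores": 8, "mem_gib": 32, "disk_gib": 160, "net_gbps": 1}

count, resource = instances_needed(demand, shape)
print(f"{count} instances, bound by {resource}")
```

When the vendor changes the shapes on offer, the bottleneck resource can change with them, which is one reason the offering timeline and the demand matrix appear together.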

Netflix Model

● Depend on the AWS on-demand pool for elasticity

● Monitor insufficient capacity exceptions (ICEs) for boundaries

● Invest heavily in 3 year reservations

● Maintain relatively few, large reserved pools

● Cloud Capacity Analytics team develops tools for insight

● Leverage cross-account resource borrowing
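
As an illustration of monitoring ICEs for boundaries, the sketch below tallies insufficient-capacity errors per instance type and zone. It is not Netflix tooling: the record layout and field names are assumptions, and the sample events are invented. `InsufficientInstanceCapacity` is the EC2 error code returned when a launch cannot be fulfilled.

```python
from collections import Counter

ICE_CODE = "InsufficientInstanceCapacity"  # EC2 error code for an ICE


def ice_counts(events):
    """Count ICEs per (instance_type, zone) from launch-failure records."""
    return Counter(
        (e["instance_type"], e["zone"])
        for e in events
        if e.get("error_code") == ICE_CODE
    )


# Hypothetical launch-failure records; field names are assumptions.
events = [
    {"instance_type": "m4.4xlarge", "zone": "us-east-1a", "error_code": ICE_CODE},
    {"instance_type": "m4.4xlarge", "zone": "us-east-1a", "error_code": ICE_CODE},
    {"instance_type": "r3.8xlarge", "zone": "us-east-1c", "error_code": "RequestLimitExceeded"},
    {"instance_type": "r3.8xlarge", "zone": "us-east-1c", "error_code": ICE_CODE},
]

for pool, n in ice_counts(events).most_common():
    print(pool, n)
```

Pools that show up repeatedly in such a tally are the ones where the otherwise unknown capacity bound is being approached.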

The Triad: Cloud Impact

● Innovation

● Reliability

● Efficiency

[Diagram labels: Default, Preferred]

Considerations of Scale

● Capacity required for critical footprint might require “guarantees”

● API-based observability has limits

● All resources have capacity limits/throttles

● Resource limits by default set for lowest common denominator

● Get creative with unused, but paid for capacity

● Billing file size!
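
To find unused, but paid for capacity, it helps to measure it first. The sketch below compares hourly usage against a reserved pool and reports idle reserved instance-hours alongside the hours that spilled to on-demand. The usage series and pool size are hypothetical, as are the function names.

```python
def idle_reserved_hours(hourly_usage, reserved):
    """Instance-hours of reserved capacity left unused over the period."""
    return sum(max(reserved - used, 0) for used in hourly_usage)


def on_demand_hours(hourly_usage, reserved):
    """Instance-hours that exceeded the reserved pool and ran on-demand."""
    return sum(max(used - reserved, 0) for used in hourly_usage)


# Hypothetical instances in use per hour, against a pool of 100 reserved.
usage = [60, 55, 40, 80, 120, 110]
reserved = 100

print("idle reserved instance-hours:", idle_reserved_hours(usage, reserved))
print("on-demand instance-hours:", on_demand_hours(usage, reserved))
```

The idle figure is the budget available to lend to other workloads or accounts during troughs.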

Summary

Capacity Planning

Coburn Watson

● Director of Performance and Reliability at Netflix

○ Site Reliability Engineering, Performance and OS Engineering, Traffic Management, Chaos Engineering, Capacity Planning, Cloud Network Engineering

● @coburnw, [email protected]

● Looking for some great capacity planning-minded folks

● Performance and Reliability YouTube Channel