epidemic failures

Post on 15-Jan-2015

DESCRIPTION

Slides originally written in April 2013 for a private conference and internal use at Netflix. Publishing now since Heartbleed is another example of an epidemic failure mode.

TRANSCRIPT

Cloud Native and Epidemic Failures

April 2014
Adrian Cockcroft

@adrianco @BatteryVentures
http://www.linkedin.com/in/adriancockcroft

Cloud Native?

Epidemic Failures

Automated Diversity

Cloud Native

Construct a highly agile and highly available service from ephemeral and often broken components

Inspiration

Numquam ponenda est pluralitas sine necessitate

Plurality must never be posited without necessity

Occam’s Razor

Monoculture

Replicate “the best” as patterns
Reduce interaction complexity
Epidemic single point of failure

Pattern Failures

Infrastructure Pattern Failures

Software Stack Pattern Failures

Application Pattern Failures

Infrastructure Pattern Failures

• Device failures – bad batch of disks, PSUs, etc.
• CPU failures – cache corruption, math errors
• Datacenter failures – power, network, disaster
• Routing failures – DNS, Internet/ISP path

Software Stack Pattern Failures

• Time bombs – Counter wrap, memory leak
• Date bombs – Leap year, leap second, epoch
• Expiration – Certs timing out
• Trust revocation – Certificate Authority fails
• Security exploit – e.g. Heartbleed
• Language bugs – compile time
• Runtime bugs – JVM, Linux, Hypervisor
• Network bugs – routers, firewalls, protocols
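As an illustration of the "counter wrap" time bomb above, here is a minimal Python sketch. It assumes a hypothetical uptime counter stored as an unsigned 32-bit count of milliseconds; the function names are made up for this example and do not come from the slides. In a monoculture, every instance started at the same time would hit the wrap together.

```python
# Hypothetical example: an uptime counter kept as an unsigned 32-bit
# number of milliseconds (an assumption, not something from the slides).
WRAP = 2 ** 32  # number of distinct values a 32-bit counter can hold
MS_PER_DAY = 24 * 60 * 60 * 1000


def uptime_counter_ms(elapsed_ms: int) -> int:
    """Return the value a 32-bit millisecond counter would hold."""
    return elapsed_ms % WRAP


def days_until_wrap() -> float:
    """Days of uptime before the counter wraps back to zero."""
    return WRAP / MS_PER_DAY


def naive_elapsed(start: int, now: int) -> int:
    """Buggy elapsed time: goes negative once the counter has wrapped."""
    return now - start


def safe_elapsed(start: int, now: int) -> int:
    """Wrap-aware elapsed time, valid for intervals shorter than one wrap."""
    return (now - start) % WRAP
```

The wrap happens a little under 50 days after start, which is long enough to pass most test cycles and short enough to strike a whole fleet in production.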

Application Pattern Failures

• Time bombs – Counter wrap, memory leak
• Date bombs – Leap year, leap second, epoch
• Content bombs – Data dependent failure
• Configuration – wrong/bad syntax
• Versioning – incompatible mixes
• Cascading failures – error handling bugs etc.
• Cascading overload – excessive logging etc.

What to do?

Automated diversity management
Diversified automation
Efficient vs. Antifragile

Specific Ideas

• Automate running a mixture
  – Diversity as default for any service stack
  – No developer overhead, stay agile, low cost
• Support oldest and newest versions together
  – Automate running 50/50 mix CentOS/Ubuntu
  – Mix versions of JDK, Tomcat, etc.
• Vendor diversity
  – Multiple DNS vendors, cloud regions, costs more
  – Multiple cloud vendors? Much higher cost.

Generate Permutations

> epi <- data.frame(java=gl(2,1,8,c("java6","java7")), linux=gl(2,2,8,c("centos","ubuntu")), codeversion=gl(2,4,8,c("v34","v35")))
> epi
   java  linux codeversion
1 java6 centos         v34
2 java7 centos         v34
3 java6 ubuntu         v34
4 java7 ubuntu         v34
5 java6 centos         v35
6 java7 centos         v35
7 java6 ubuntu         v35
8 java7 ubuntu         v35
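For readers who do not use R, the same eight permutations can be sketched in Python. This is an illustrative equivalent, not code from the slides: the factor levels are copied from the R snippet, while `generate_permutations` and `assign_round_robin` are hypothetical helper names, and the round-robin spreading is only one assumed way to run an even mix.

```python
from itertools import product

# Factor levels copied from the slide. The first factor varies fastest,
# matching the row order of the R data frame.
FACTORS = {
    "java": ["java6", "java7"],
    "linux": ["centos", "ubuntu"],
    "codeversion": ["v34", "v35"],
}


def generate_permutations(factors):
    """Return one dict per combination of factor levels."""
    names = list(factors)
    rows = []
    # product() varies its last argument fastest, so feed the factors in
    # reverse order and restore the original key order for each row.
    for combo in product(*(factors[name] for name in reversed(names))):
        levels = dict(zip(reversed(names), combo))
        rows.append({name: levels[name] for name in names})
    return rows


def assign_round_robin(instance_ids, permutations):
    """Spread instances evenly across permutations (hypothetical helper)."""
    return {
        instance_id: permutations[i % len(permutations)]
        for i, instance_id in enumerate(instance_ids)
    }
```

With 16 instance IDs and the 8 permutations, each permutation ends up on exactly two instances.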

Deployment

• Builds
  – Manual to test, automate if it works
  – Modify build to generate permutation AMIs
  – Modify Asgard to auto-deploy permutations
• Data collection
  – Tag each instance with its permutation
  – Gather metrics by permutation per instance
  – Do R-based Design of Experiments analysis
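The data collection steps above (tag each instance, then gather metrics by permutation) could look roughly like the following Python sketch. The tag format and the `(tag, metric_name, value)` sample shape are assumptions made for this example; the slides do not say how instances were tagged or how metrics were stored.

```python
from collections import defaultdict
from statistics import mean


def permutation_tag(permutation):
    """Build a stable tag string such as 'java6/centos/v34' (assumed format)."""
    return "/".join(permutation[key] for key in ("java", "linux", "codeversion"))


def mean_by_permutation(samples):
    """Average each metric per permutation tag.

    `samples` is an iterable of (tag, metric_name, value) tuples; this
    record shape is hypothetical.
    """
    grouped = defaultdict(list)
    for tag, metric, value in samples:
        grouped[(tag, metric)].append(value)
    return {key: mean(values) for key, values in grouped.items()}
```

Grouping on the tag rather than on the instance ID keeps the comparison valid as individual instances come and go.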

Analysis

• As a function of permutations
  – Error rate
  – Response time
  – CPU Utilization
• Interactions
  – E.g. interaction between linux and java
  – Contrasts identify components with issues
  – Small changes with high statistical significance
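To make the idea of contrasts concrete, here is a small Python sketch of main-effect and two-factor interaction contrasts for a balanced two-level design like the eight permutations in this deck. It is a simplified stand-in for the R-based analysis the slides mention, not that analysis itself: it computes effect sizes only and does no significance testing. The `LOW` mapping and the function names are assumptions for this example.

```python
# Which level of each factor is treated as "low" (-1); an arbitrary choice.
LOW = {"java": "java6", "linux": "centos", "codeversion": "v34"}


def code(level, low):
    """Code a two-level factor as -1 (low level) or +1 (high level)."""
    return -1 if level == low else 1


def main_effect(rows, factor, response):
    """Mean response at the high level minus mean at the low level.

    Assumes a balanced design: half of the rows at each level.
    """
    total = sum(code(r[factor], LOW[factor]) * r[response] for r in rows)
    return total / (len(rows) / 2)


def interaction(rows, a, b, response):
    """Two-factor interaction contrast for factors a and b (balanced design)."""
    total = sum(
        code(r[a], LOW[a]) * code(r[b], LOW[b]) * r[response] for r in rows
    )
    return total / (len(rows) / 2)
```

Each row is a dict holding the three factor levels plus a measured response such as an error rate. A non-zero java/linux interaction alongside a zero codeversion effect would point at the runtime and OS combination rather than at the new code.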

GCS Total API Outage for ~1hr

Takeaway

Watch out for monocultures

A|B Testing – it’s not just for personalization

http://perfcap.blogspot.com

http://slideshare.net/adrianco – Netflix

http://slideshare.net/adriancockcroft - Battery

http://www.linkedin.com/in/adriancockcroft

@adrianco @BatteryVentures
