Engineering Netflix Global Operations in the Cloud
Josh Evans - Director of Operations Engineering
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Web-scale distributed systems are like a living organism. Services are like the organs of the human body: a diverse set of functions that create a whole. We can understand many aspects, but there are always mysteries or the unobserved. They are complex and can become fragile or sick. The system is constantly adapting and renewing itself.
If you're responsible for operating such a system, you'll quickly learn that:
It is impossible to test every permutation of failure.
It is impossible to know exactly how it will behave under adverse conditions.
It is constantly changing to meet the needs of the business; even if you learn the system once, your knowledge will quickly become stale.
Today we'll talk about strategies for successfully operating complex distributed systems in the face of these challenges.
Notes: Vivid details; complex, overwhelming; no one understands the whole system.
Our Journey: Two Operational Challenges, Operational Excellence, Operations Engineering
I'm going to take you on a journey exploring... We'll delve deep into operational excellence; this is the core of our talk today. By the end of this talk you'll have a strategic framework and tools to pursue operational excellence for your business.
Product Innovation
winning moments of truth
Acquisition, retention, engagement
Continuous Innovation
Every facet of the product
1400 A/B tests in the last year & accelerating
Talk about blurring the lines between user interface and watching
Challenge #1: Accelerate Innovation and Rate of Change
Scale & Complexity
100,000s of requests per second
1000s of Global Starts per Second
This is the heartbeat of the Netflix streaming service
Approaching Global Reach
October - Spain, Portugal, Italy
Early 2016 - Korea, Taiwan, Singapore, Hong Kong
65m members, heading to 100m
~60 countries, heading to 200
Multi-Zone, Multi-Region: US-East, US-West, EU-West
The Bigger Picture: Netflix CDN (Open Connect), Cloud Control Plane, Service Partners, Internet
In addition to our cloud service & control plane for devices, we have the CDN: thousands of caches at ISPs & IXs, terabits/sec, petabytes of content. Service partners: Xbox Live, PSN, Samsung, etc.
Challenge #2: Sustain & Improve Quality in the face of ever-growing scale & complexity
Our Journey: Two Operational Challenges, Operational Excellence, Operations Engineering
Greg Peters: we're leaving money on the table regarding operations & quality. Did some reading and realized that for Netflix and most internet-based services, the operational challenge is about the tension between quality & velocity.
Operational Excellence
Quality, Velocity
Availability vs. Rate of Change
Quality vs. Velocity
The Zero Sum Game
(Chart: availability in nines, 0 to 6, plotted against rate of change, 1 to 1000)
Nines | Availability | Downtime per year
6 | 99.9999% | 31.5 seconds
5 | 99.999% | 5.26 minutes
4 | 99.99% | 52.56 minutes
3 | 99.9% | 8.76 hours
2 | 99% | 3.65 days
1 | 90% | 36.5 days
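The downtime figures on the availability chart follow from simple arithmetic: each extra nine divides the yearly downtime budget by ten. A minimal sketch, assuming a 365-day year (the function names are illustrative, not from the talk); note that two nines works out to 3.65 days per year:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def downtime_seconds_per_year(nines: int) -> float:
    """Downtime budget per year for an availability with this many nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001 (99.9% available)
    return SECONDS_PER_YEAR * unavailability


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    for unit, size in (("days", 86_400), ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    # Print the budget for one through six nines.
    for nines in range(1, 7):
        print(nines, "nines:", humanize(downtime_seconds_per_year(nines)))
```

Seen this way, moving from three nines to four shrinks the yearly budget from hours to under an hour, which is why each nine gets harder to sustain as the rate of change rises.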
Shifting the Curve
Operational Excellence is the continuous improvement of the management, design, and function of operational environments to achieve greater quality, velocity, and competitive advantage.
Remember: member & engineer environments. This is why velocity is an operational challenge.
Our Journey: Two Operational Challenges, Operational Excellence, Operations Engineering
Build It: design, code, build, bake, test, deploy
Run It: operate, configure, monitor, respond
You build it, you run it globally
Aligns incentives: if you write bad code, you get called. Daunting task for each engineering team.
Undifferentiated Heavy Lifting
Operations Engineering is the application of software engineering practices and principles to achieve and sustain operational excellence.
automation, modular components, tools & services, best practices
The leverage comes from: automation, modularity, tools, services, best practices
Our Journey (Operations Engineering): Engineering Tools, Insight & Real-time Analytics, Performance & Reliability, Leverage
Our Artisanal Past
Data Center: delayed provisioning; hand-crafted servers; variations and complexity
Delivery: late night, manual deployments; repeated mistakes; painful delays to production fixes
Engineering Tools: productivity, velocity, quality
cloud management, delivery engine, automation platform
Mention - Asgard replacement
Global Cloud Management
powerful, easy to use, global
Delivery Pipelines
feature rich, modular, parallel or serial, fully automated
Automated Global Delivery
feature rich, modular, parallel or serial, fully automated
The Paved Road: Stash, Gradle, Ubuntu, Jenkins, Spinnaker
Our Journey: Engineering Tools, Insight & Real-time Analytics, Performance & Reliability, Leverage
Insight & Real-Time Analytics
OODA loop
An outage may not be life or death but
We know this is the experience that our customers have; every minute counts. So we strive to continuously improve time to detect and time to recover.
DES on time series data
Predict the future based on history
Favor recent history
Threshold-based alerts
Anomaly Detection
6-8 minute delay
Alert!
SPS (starts per second) is the heartbeat of the Netflix service. Like EKG heart monitoring: arrhythmia or cardiac arrest require intervention.
Double Exponential Smoothing
Mini-batches of time series data
Predict future values based on history
Favor recent values
Look for the gap
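The slide names double exponential smoothing but does not show the formulas. A minimal sketch using the standard Holt level-plus-trend form follows; the smoothing factors `alpha` and `beta`, the relative `tolerance`, and the function names are assumptions for illustration, not the values or code used at Netflix. "Look for the gap" is read here as comparing each actual value with its one-step-ahead forecast.

```python
from typing import List, Sequence


def des_forecasts(series: Sequence[float], alpha: float = 0.5,
                  beta: float = 0.3) -> List[float]:
    """One-step-ahead forecasts; forecasts[i] predicts series[i + 2]."""
    # Seed the level and trend from the first two observations.
    level, trend = series[1], series[1] - series[0]
    forecasts = []
    for x in series[2:]:
        forecasts.append(level + trend)  # prediction made before seeing x
        prev_level = level
        # Higher alpha/beta weight recent values more heavily.
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts


def find_gaps(series: Sequence[float], tolerance: float = 0.2,
              alpha: float = 0.5, beta: float = 0.3) -> List[int]:
    """Indices where the actual value strays from its forecast by more
    than `tolerance` (as a fraction of the forecast)."""
    forecasts = des_forecasts(series, alpha, beta)
    return [i + 2 for i, f in enumerate(forecasts)
            if abs(series[i + 2] - f) > tolerance * abs(f)]


if __name__ == "__main__":
    # Made-up starts-per-second samples with a sudden drop at the end.
    sps = [100, 102, 104, 106, 108, 110, 112, 60]
    print(find_gaps(sps))
```

Because the forecast is updated with every new sample, a gap keeps influencing the next few predictions; a production detector would likely need to handle that recovery period explicitly.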
Latency of Detection
Stream processing vs. time series
8 minutes to < 1 minute
Finer Granularity, Shorter Time Windows
Ensemble Learning
Use multiple algorithms to obtain better predictive performance.
Simple Ensemble Methods:
Voting: used when each classifier produces a single class label.
Averaging: used when each classifier produces a confidence estimate.
Tend to yield better results when there is diversity among the algorithms.
Median Absolute Deviation, IQR, Least Squares, HDI, Voting
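Voting over several detectors can be sketched as below. Three of the detectors named on the slide (median absolute deviation, IQR, least squares) each return a single yes/no label and the majority wins; HDI is left out for brevity. The thresholds (`k` values) and function names are assumptions for illustration, not the talk's actual configuration.

```python
import statistics
from typing import Callable, Sequence

Detector = Callable[[Sequence[float], float], bool]


def mad_vote(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Flag values far from the median, scaled by median absolute deviation."""
    med = statistics.median(history)
    mad = statistics.median([abs(x - med) for x in history])
    return abs(value - med) > k * 1.4826 * mad


def iqr_vote(history: Sequence[float], value: float, k: float = 1.5) -> bool:
    """Flag values outside the usual quartile fences."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


def least_squares_vote(history: Sequence[float], value: float,
                       k: float = 3.0) -> bool:
    """Fit a line to the history and flag a next value that misses the
    extrapolation by more than k residual standard deviations."""
    n = len(history)
    mean_x, mean_y = (n - 1) / 2, sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in enumerate(history)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in enumerate(history)]
    predicted = intercept + slope * n
    return abs(value - predicted) > k * statistics.pstdev(residuals)


def is_anomaly(history: Sequence[float], value: float,
               detectors: Sequence[Detector] = (mad_vote, iqr_vote,
                                                least_squares_vote)) -> bool:
    """Simple majority vote across the detectors."""
    votes = sum(1 for detect in detectors if detect(history, value))
    return votes * 2 > len(detectors)


if __name__ == "__main__":
    history = [100, 101, 99, 102, 98, 100, 101, 99, 100, 102]  # made-up data
    print(is_anomaly(history, 100), is_anomaly(history, 150))
```

The detectors disagree at the margins because they model the data differently, which is the diversity the slide says tends to help.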
observe, orient, decide, act
Alert!
From 6-8 minutes to < 1 minute
observe, orient
decide, act
How do we take humans out of the equation?
Outlier Detection & Remediation
These outliers are like cancer cells: we systematically detect, study, and remove them from our ecosystem to maintain health.
Unsupervised machine learning
Density-based clustering algorithm
Kepler
Actions: email, page, OOS, detach, terminate
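The slides say only "density-based clustering algorithm". As an illustration, here is a minimal sketch of DBSCAN, a widely used algorithm of that family; whether Kepler uses this exact algorithm is an assumption, and the metric pairs, `eps`, and `min_pts` values are made up. Servers that end up in no cluster are the outliers that would be handed to the remediation actions.

```python
import math
from typing import List, Sequence, Tuple

UNVISITED, NOISE = -2, -1
Point = Tuple[float, ...]


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Return a cluster id per point; NOISE (-1) marks outliers."""
    labels = [UNVISITED] * len(points)

    def neighbors(i: int) -> List[int]:
        # Brute-force neighborhood query (includes the point itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
                continue
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


def find_outliers(points: Sequence[Point], eps: float, min_pts: int) -> List[int]:
    """Indices of points that belong to no dense cluster."""
    return [i for i, label in enumerate(dbscan(points, eps, min_pts))
            if label == NOISE]


if __name__ == "__main__":
    # Hypothetical (latency_ms, error_percent) per server.
    servers = [(100, 1.0), (102, 1.1), (98, 0.9), (101, 1.2), (99, 1.0),
               (103, 0.8), (97, 1.1), (100, 0.9), (250, 9.0)]
    print(find_outliers(servers, eps=5.0, min_pts=3))
```

In practice the metrics would need to be normalized first so that no single dimension dominates the distance.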
An ounce of prevention
Canary Release Process: Customers, Load Balancer, Metrics; Old Version (v1.0) on 100 servers with 95% of traffic; New Version (v1.1) on 5 servers with 5% of traffic
Canary Release Process: Customers, Load Balancer, Metrics; Old Version (v1.0) on 0 servers; New Version (v1.1) on 100 servers with 100% of traffic
Define: metrics, a threshold
Automatic Canary Analysis
Every n minutes: classify metrics, compute score, make a decision
Every n minutes:
Classify each metric
Compute the mean value for the canary & control
Calculate the ratio of the mean values
Classify the ratio as high, low, etc.
Compute the final canary score
% of metrics that match in performance
Make go/no-go decision
Continue with release if score is > 95%
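The scoring steps above can be sketched as follows. The 10% `tolerance` band, the "pass" label for a matching metric, and the metric names are assumptions for illustration; only the mean-ratio classification, the percent-of-matching-metrics score, and the 95% threshold come from the slide.

```python
from statistics import mean
from typing import Dict, Sequence

Metrics = Dict[str, Sequence[float]]


def classify_ratio(ratio: float, tolerance: float = 0.1) -> str:
    """Label a canary/control mean ratio as high, low, or pass."""
    if ratio > 1 + tolerance:
        return "high"
    if ratio < 1 - tolerance:
        return "low"
    return "pass"


def canary_score(canary: Metrics, control: Metrics,
                 tolerance: float = 0.1) -> float:
    """Percentage of metrics whose canary mean matches the control mean."""
    matches = 0
    for name, control_values in control.items():
        control_mean = mean(control_values)
        canary_mean = mean(canary[name])
        if control_mean == 0:
            # Avoid dividing by zero: equal zeros match, anything else is high.
            ratio = 1.0 if canary_mean == 0 else float("inf")
        else:
            ratio = canary_mean / control_mean
        if classify_ratio(ratio, tolerance) == "pass":
            matches += 1
    return 100.0 * matches / len(control)


def should_continue(score: float, threshold: float = 95.0) -> bool:
    """Go/no-go: continue the release only above the threshold."""
    return score > threshold


if __name__ == "__main__":
    # Made-up samples collected during one analysis interval.
    control = {"latency_ms": [100, 102, 98], "error_rate": [0.010, 0.012, 0.011],
               "cpu_percent": [40, 42, 41]}
    canary = {"latency_ms": [101, 99, 103], "error_rate": [0.030, 0.028, 0.031],
              "cpu_percent": [41, 40, 43]}
    score = canary_score(canary, control)
    print(f"score={score:.1f} continue={should_continue(score)}")
```

In this made-up run the error rate roughly triples on the canary, so only two of three metrics match and the release would stop.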
Thinking Globally
Systematic observation of facets & permutations
Unsupervised monitoring & decision-making
Automated tuning & recovery
Alerts with analysis
Our Journey: Engineering Tools, Insight & Real-time Analytics, Performance & Reliability, Leverage
Performance & Reliability
(Diagram: Internet, Zuul, API, NCCP, Playback History, Playback Sessions, MAP)
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.
Living system analogy - inoculation
Like giving your service a flu shot: it introduces a safer version of the disease in production under controlled circumstances. Needs periodic boosters.
Imagine a monkey loose in your data center
(Diagram: Edge Cluster, Cluster A, Cluster B, Cluster C, Cluster D)
Xen Hypervisor vulnerability 9/25/14
218 out of 2700+ Cassandra nodes rebooted; 22 did not reboot successfully. Automation handled the rest.
A State of Xen - Chaos Monkey & Cassandra
Out of our 2700+ Cassandra nodes:
218 rebooted
22 did not reboot successfully
Automation replaced failed nodes
0 downtime due to reboot
Fault-Injection Testing (FIT)
Simulate service failures
Override by device or account
% of member traffic
(Diagram: Device, Internet, ELB, Zuul, Edge, FIT, Service A, Service B, Service C)
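The slide lists three ways a simulated failure is scoped: by device, by account, or by a percentage of member traffic. A minimal sketch of such a decision follows. This is not FIT's actual interface; `FaultScenario`, `should_inject`, and the hash-based bucketing are all hypothetical names and choices made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FaultScenario:
    """Hypothetical description of one simulated service failure."""
    service: str                      # the service whose calls should fail
    traffic_percent: float = 0.0      # share of member traffic to affect
    device_overrides: Set[str] = field(default_factory=set)
    account_overrides: Set[str] = field(default_factory=set)


def bucket(account_id: str) -> float:
    """Map an account to a stable value in [0, 100) so the same members
    stay in or out of the experiment across requests (an assumption)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def should_inject(scenario: FaultScenario, service: str,
                  device_id: str, account_id: str) -> bool:
    """Decide whether a call to `service` should fail for this request."""
    if service != scenario.service:
        return False
    # Explicit overrides win regardless of the traffic percentage.
    if (device_id in scenario.device_overrides
            or account_id in scenario.account_overrides):
        return True
    return bucket(account_id) < scenario.traffic_percent


if __name__ == "__main__":
    scenario = FaultScenario(service="service-b", traffic_percent=5.0,
                             device_overrides={"test-device-1"})
    print(should_inject(scenario, "service-b", "test-device-1", "acct-42"))
```

Starting with device or account overrides keeps the blast radius to test devices; the percentage can then be raised gradually.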
Global Traffic Management
(Diagram: US-East, US-West, EU-West, AZ1, The Internet, DNS-based Routing, Zuul Proxy Back Channel)
Production Ready?
Alerting and Monitoring
Apache & Tomcat Hardening
Automated Canary Analysis
Autoscaling
Chaos Participation
Consistent Naming
ELB Configuration
Healthcheck Configured
Red-Black Pipeline
Squeeze Testing
Timeout & Fallback Tuning
Workload Reliability
Our Journey: Engineering Tools, Insight & Real-time Analytics, Performance & Reliability, Leverage
Operational Tools as a Product
A federation of tools
Common UI elements
Deep linking
Deep Integration, Modular Components
Canary Analysis, Conformity, Integration Tests, Citrus, Chaos, Static, Unit Tests
Functional Testing
Production Ready Automation & Integration
RTA auto-tuning: Alerts, Apache/Tomcat, Auto-scaling, Hystrix fallbacks
RTA decision support: ACA, Citrus, Flow, Conformity checks, Consistent names, ELBs, Health check, Red/black deployment
Delivery integration: ACA, Citrus, FIT
Containing failures
Recovering quickly
Successfully shifting the curve
Our Journey Ends
Shift the Curve
Continuously engineer your operations to increase quality of customer experience & engineering velocity
https://netflix.github.io/
Talk | Speaker | When? | Where?
Engineering Netflix Global Operations in the Cloud | Josh Evans | Wed @ 11am | Palazzo N
Efficient Innovation: High-Velocity Cost Management at Netflix | Andrew Park | Wed @ 2:45pm | Palazzo C
Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second | Peter Bakas | Wed @ 2:45pm | San Polo 3501B
A Day in the Life of a Netflix Engineer Using 37% of the Internet | Dave Hahn | Wed @ 4:15pm | Venetian H
Availability: The New Kind of Innovator's Dilemma | Coburn Watson | Wed @ 4:15pm | Marcello 4501B
Real-Time Analytics In Service of Self-Healing Ecosystems | Roy Rapoport, Chris Sanden | Wed @ 4:15pm | Lido 3001B
Running Spark and Presto on the Netflix Big Data Platform | Daniel Weeks | Thu @ 11am | Palazzo F
Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud | Jason Chan | Thu @ 11am | Marcello 4501B
Josh Evans
@josh_evans_nflx
Thank you!