
Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions

OSDI ‘20

Sebastien Levy†, Randolph Yao†, Youjiang Wu†, Yingnong Dang†, Peng Huang^, Zheng Mu†, Pu Zhao*, Tarun Ramani†, Naga Govindaraju†, Xukun Li†, Qingwei Lin*, Gil Lapid Shafriri†, Murali Chintalapati†

† Microsoft Azure, ^ Johns Hopkins University, * Microsoft Research


Azure global infrastructure

Azure regions

Heterogeneity: all nodes are different

Different characteristics: # cores, total memory …

Different hardware type / vendors

Different workload patterns

Different health history


Measuring Customer Experience

• We need to monitor VM availability: down time / up time

• But each VM interruption causes significant impact to the customer:
  • Disrupts the user experience (e.g., gaming)
  • Applications take time to recover
  • Repeated reboots cause customer frustration
  • Two short interruptions are more impactful than one longer one


Impact of Node Failure

[Diagram: a node hosting several VMs on top of its disks, CPU, and memory; a FW/HW failure hits the node as a whole.]

• A node-level failure ➔ bad impact for every VM the node contains

• We need to predict failures and take the appropriate mitigation actions


Traditional Operation Workflow

[Diagram: 1) Detection of a node failure → 2) Diagnosis → 3) Migration of its VMs to a healthy node; the failed node is taken offline & repaired. All VMs reboot, with long VM downtime.]

Attempt 1: Mitigation Before Diagnostics

[Diagram: 1) Detection of a node failure → 2) Migration of its VMs to a healthy node → 3) Diagnosis; the failed node is taken offline & repaired. All VMs reboot, but with short VM downtime: mitigating before diagnosing can be better in some cases.]

Attempt 2: Prediction of Node Failure

[Diagram: 1) Node predicted to fail (wait for the actual failure; the prediction is used to speed up OS-crash mitigation) → 2) Diagnosis → 3) Migration of its VMs to a healthy node; the node is taken offline & repaired. All VMs reboot, with shorter VM downtime, but the mitigation is suboptimal when the prediction is wrong.]

Attempt 3: Prediction + Static Mitigation

[Diagram: 1) Node predicted to fail → 2) Block allocation (capacity impact) and live-migrate eligible VMs to a healthy node (VM pause impact) → 3) Diagnosis; the node is taken offline & repaired. Some VMs do not reboot, the other VMs reboot, with shorter VM downtime, but increased impact for false positives.]

Can we do better?


Mitigation Tradeoff

❖ Can a single mitigation be effective for all kinds of scenarios?

❖ How do we confirm that a mitigation is effective?

❖ Even if a mitigation is effective, can we do better? Can its effectiveness change over time?


Mitigation Tradeoff: No One-Size-Fits-All

How to mitigate predicted node failures: a reasonable static proposal:

Block allocation (capacity impact) → Live-migrate eligible VMs (VM pause impact) → Wait for 7 days → Force migration of the remaining VMs (VM interruption impact) → Diagnosis + repair

• If failure is not imminent, do we need to completely block allocation?

• How long should we wait for customers to intentionally move their VMs?

• If the node has still not failed after 7 days, could it now be healthy?


How To Estimate Mitigation Efficacy?

• Heterogeneity: multiple factors affect mitigation efficacy
  • Live-migration success rate depends on available capacity, hardware health, and the workload on the node
  • Allocation relies on complex optimization logic and on stochastic customer demand
  • Prediction false positives are unavoidable and can depend on unobservable signals


Key Insights

“In a heterogeneous and ever-changing cloud system, the effectiveness of a mitigation action is often probabilistic.”

“To select the (near-)optimal mitigation, each possible action needs to be compared at-scale with production workload.”


Solution

Continuous probabilistic online minimization of customer impact

Different mitigation actions depending on the type of predicted failure

Online testing of the different options

Adapt to ever-changing cloud behavior

Account for node heterogeneity of Azure’s fleet

Focus on impacting failures


Narya’s Overview

[Diagram: Narya's pipeline — 1) Prediction (prediction rules and ML failure prediction), 2) Decision (choose action), 3) Mitigation, 4) Impact assessment (customer impact, cost labels), with a feedback loop back into prediction and decision; diagnostics and offline monitoring complete the loop.]

Main Challenges

• Define customer impact as a generic metric

• Safely test actions in production

• Balance exploring actions and exploiting the best so far

• Adapt our decisions to system changes

• Fast and scalable mitigation decisions


Measuring Interruptions

• Interruption types:
  • VM reboot, I/O pause, temporary VM freeze / blackout

• Service availability
  • Time-duration based
  • Short VM downtime DOES NOT imply short customer-service downtime: after each VM downtime, the customer's service has to rehydrate its state

• Interruption count
  • Event-count based
  • Each VM interruption impacts the customer's service once
  • More realistically reflects the customer pain


Defining Customer Impact

• VM interruptions are the main negative impact on customers

• We define the Annual Interruption Rate (AIR):

AIR = (# interruptions in T) / (Total up time in T) × 365 days × 100

• Scoping it to a per-node contribution and dropping the constants, we get:

Cost = # VM interruptions per node
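A minimal sketch of these two metrics in code (our own illustration with assumed variable names, not Narya's implementation):

```python
# Annual Interruption Rate (AIR) and per-node cost, as defined above.

def annual_interruption_rate(num_interruptions: int, total_up_time_days: float) -> float:
    """AIR = (# interruptions in T) / (total up time in T) x 365 days x 100."""
    return num_interruptions / total_up_time_days * 365 * 100

def node_cost(vm_interruptions_on_node: list[int]) -> int:
    """Per-node contribution with the constants dropped: # VM interruptions."""
    return sum(vm_interruptions_on_node)

# Example: 3 interruptions over 200 VM-days of up time.
print(annual_interruption_rate(3, 200))  # 547.5
print(node_cost([1, 0, 2]))              # 3
```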


Narya’s Prediction

Expert rules and ML prediction


Motivation To Use ML Prediction

• Expert rule: setting a threshold on a single predictive signal, e.g., low disk spare space (see the sketch below)

• Limitations of expert rules:
  • High variance of the prediction horizon: the actual failure time also depends on other factors
  • Difficult precision-recall tradeoff: strict rules result in low recall, loose rules yield low precision

• What if we look at all signals before the actual failure and use those signals to predict it?
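As a minimal illustration of such an expert rule (the signal name and threshold are assumptions, not Azure's actual rule):

```python
# Expert rule: flag a node when a single predictive signal crosses a threshold.
DISK_SPARE_SPACE_THRESHOLD_GB = 5.0  # illustrative threshold

def expert_rule_predicts_failure(disk_spare_space_gb: float) -> bool:
    return disk_spare_space_gb < DISK_SPARE_SPACE_THRESHOLD_GB

print(expert_rule_predicts_failure(3.2))  # True: node predicted to fail
```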


ML Model Highlights

• Comprehensive list of signals covering OS-layer, driver-layer, and device-layer observations

• Predicts server-level failures that have customer impact

• Non-handcrafted features:
  • Spatial attention
  • Temporal convolution


Goals: longer lead time, higher precision, higher recall
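A minimal PyTorch sketch of a model in this spirit, combining spatial attention over the signals with temporal convolution over the observation window; this is our own illustration, not the production model, and the signal count, window length, and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class NodeFailurePredictor(nn.Module):
    def __init__(self, num_signals=64, hidden=32):
        super().__init__()
        # Spatial attention: learn per-signal weights at each time step.
        self.attention = nn.Sequential(nn.Linear(num_signals, num_signals),
                                       nn.Softmax(dim=-1))
        # Temporal convolution over the re-weighted signal window.
        self.tcn = nn.Sequential(
            nn.Conv1d(num_signals, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(hidden, 1)  # failure score for the node

    def forward(self, x):
        # x: (batch, window, num_signals) raw OS/driver/device-layer signals
        x = x * self.attention(x)                    # spatial attention re-weighting
        h = self.tcn(x.transpose(1, 2)).squeeze(-1)  # temporal features per node
        return torch.sigmoid(self.head(h))

# Example: score 8 nodes, each with 48 hourly observations of 64 signals.
scores = NodeFailurePredictor()(torch.randn(8, 48, 64))
```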

Diversity of Mitigation Actions

Allocation change:
  • Allow allocation
  • Avoid allocation*
  • Block allocation

VM migration:
  • Live migrate VMs* (unallocatable node only)
  • Service heal VMs (unallocatable node only)

Node reboot:
  • Soft kernel reboot* (unallocatable node only)
  • Hard reboot
  • Offline + repair (empty node only)

Impact ranges from no impact, pause impact, and soft / hard capacity impact up to VM reboot impact.

* Non-deterministic action

Narya’s Mitigation Actions

• Use composite actions (sequences of actions); see the sketch below
  • Avoid illogical scenarios: migration without blocking allocation, repeated reboots, …
  • Add safety constraints to minimize the cost of exploration
  • Easy override priority if several rules overlap
  • Simpler learning

• Example: [diagram of a composite action with conditional steps such as "if OS crashes" and "if empty"]

• Composite actions' duration is typically on the order of days
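A minimal sketch of how a composite action might be represented as an ordered sequence of conditional steps with an override priority (our own illustration; the fields, step names, and conditions are assumptions, not Narya's schema):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    condition: Callable[[dict], bool] = lambda node: True  # run only if true

@dataclass
class CompositeAction:
    name: str
    priority: int                    # higher priority wins when rules overlap
    steps: list[Step] = field(default_factory=list)

    def run(self, node: dict):
        for step in self.steps:
            if step.condition(node):
                print(f"{self.name}: executing '{step.name}' on {node['id']}")

# Example composite action in the spirit of this slide.
action = CompositeAction(
    name="Block+LiveMigrate",
    priority=10,
    steps=[
        Step("block allocation"),
        Step("live migrate eligible VMs"),
        Step("soft kernel reboot", condition=lambda n: n.get("os_crashed", False)),
        Step("offline + repair", condition=lambda n: n.get("is_empty", False)),
    ],
)
action.run({"id": "node-42", "os_crashed": False, "is_empty": True})
```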

Narya's Decision Engine

• Maps each predicted bad node to the best composite mitigation action in order to minimize customer pain

• Two algorithms:
  • AB testing: explore all the different possibilities, observe the customer-pain metric(s), then use the best action if there is one
  • Multi-armed bandit (Thompson sampling): learn the right tradeoff between exploring new actions and exploiting the estimated best one, based on a single customer-pain metric


AB Testing

1. Sample an action with preconfigured probabilities

2. Observe the customer impact within an observation window

3. Run a hypothesis test between the costs of all actions

4. If the difference is statistically significant, use the optimal action

[Diagram: compare the statistical impact of action A vs. action B]
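A minimal sketch of steps 3 and 4 using a standard Welch t-test (our own illustration of the idea, not Narya's exact statistical procedure; the costs and significance level are assumptions):

```python
from scipy import stats

def pick_action(costs_a, costs_b, alpha=0.05):
    # Compare per-node customer-impact costs observed under actions A and B.
    t, p = stats.ttest_ind(costs_a, costs_b, equal_var=False)  # Welch's t-test
    if p < alpha:
        return "A" if sum(costs_a) / len(costs_a) < sum(costs_b) / len(costs_b) else "B"
    return None  # no significant winner yet: keep the preconfigured probabilities

# Example: # VM interruptions per node within the observation window.
print(pick_action([0, 1, 0, 2, 0, 1], [2, 3, 1, 2, 4, 2]))
```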

Adapting AB Testing To Our Scenario

• Cost attribution: count VM reboots during the observation window
  • Knowing which interruption was caused by the action is not possible

• Stickiness: the same node always uses the same action
  • Otherwise we would violate the i.i.d. assumption

• Decision vs. action: we observe the consequence of every decision
  • Even if it is skipped, delayed, or overridden


Bandit Motivation

• AB testing limitations

• Does not leverage early observation before statistical significance

• Cannot adapt to change after statistical significance

• Multi-armed bandit: dynamically learn the probability based on observations

✓Leverage early observation in exploration

✓ Adapt to change in exploitation


Bandit Framework

Reward = - Customer impact (Cost)

Machines = Available composite actions

Pull / Game = New node mitigation request

Each prediction rule is a different bandit experiment
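A minimal Thompson-sampling sketch of this mapping (our own illustration, not Narya's implementation; the Gaussian sampling model, action names, and constants are assumptions):

```python
import random

class ThompsonSampler:
    """One bandit experiment, e.g. for a single prediction rule."""

    def __init__(self, actions):
        # Per-arm running mean cost and effective observation count.
        self.stats = {a: {"n": 1.0, "mean_cost": 0.0} for a in actions}

    def choose(self) -> str:
        # "Pull": sample a plausible mean cost per arm, pick the lowest sample.
        sampled = {a: random.gauss(s["mean_cost"], 1.0 / s["n"] ** 0.5)
                   for a, s in self.stats.items()}
        return min(sampled, key=sampled.get)

    def observe(self, action: str, cost: float):
        # Reward = -cost, observed after the action's observation window.
        s = self.stats[action]
        s["n"] += 1.0
        s["mean_cost"] += (cost - s["mean_cost"]) / s["n"]

bandit = ThompsonSampler(["NoOp", "UA-LM-RH"])
action = bandit.choose()          # mitigation decision for a new bad node
bandit.observe(action, cost=1.0)  # delayed feedback: # VM interruptions
```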


Bandit Adaptation To Our Scenario

• Accommodate temporal change: exponential decay (see the sketch below)

• Delayed rewards

• Bandit stickiness

• Safeguards
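Continuing the Thompson-sampling sketch above, the exponential-decay adaptation can be grafted onto the update by discounting old observations before each new one (the decay factor is an illustrative assumption):

```python
def observe_with_decay(bandit, action, cost, decay=0.99):
    # Shrink every arm's effective sample size so old rewards fade out,
    # letting the bandit track an ever-changing system.
    for s in bandit.stats.values():
        s["n"] *= decay
    bandit.observe(action, cost)
```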


Ensuring Safe Exploration

• Only allow relevant composite actions

• Override logic if other more severe issues are detected

• Fall back to AB testing if there is not enough observation data

• Enforce minimum and maximum probability for each action
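A minimal sketch of the last safeguard, clamping each action's selection probability to a configured range and renormalizing (the bounds and action names are illustrative assumptions):

```python
def clamp_probabilities(probs, p_min=0.05, p_max=0.9):
    # Keep exploring every action a little, and never let one action take over.
    clamped = {a: min(max(p, p_min), p_max) for a, p in probs.items()}
    total = sum(clamped.values())
    return {a: p / total for a, p in clamped.items()}  # renormalize to sum to 1

print(clamp_probabilities({"NoOp": 0.01, "UA-LM-RH": 0.99}))
```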


Narya's System Architecture

[Diagram: a Mitigation Engine (request handler, policy generator, action orchestrator), ML prediction on a model-serving platform with a learner, real-time node monitoring via pub/sub, a node agent, and VM-impact feedback driving the mitigation decisions. Properties: scalable, fast, adaptive.]

Overall Savings: 26%

• Compares the use of AB testing / bandit against the previous static strategy

• Estimated AIR savings: difference between the observed AIR and the control group's AIR mapped to the whole fleet

[Chart: % improvement per experiment, showing the bandit's improvement over AB testing.]


Prediction Performance

• Precision – 79.5%

• Recall – 50.7%

• ML prediction: time to failure is 48 hours on average


Compare AB Testing and Bandits

• Counterfactual estimation

[Chart: counterfactual estimate of improvement (%) per prediction rule — values 18.4, 7.7, 22.3, 16.6, 3.8, 7.1, 15.3 — across rules including Ierr, E500, Tardigrade, Threshold, Orange, and the ML prediction.]

Case Study: I/O Timeouts

• A/B testing, then a bandit experiment, on the I/O-timeouts prediction rule

• Unexpected system changes automatically switched the probability from favoring NoOp to favoring UA-LM-RH

• Although the change is not understood, the bandit can automatically adjust

[Chart: bandit probability of each action over time (2/18/2020 – 4/7/2020), showing the probability shifting from NoOpProb to UaProb.]

UA-LM-RH: Unallocatable + Live Migration + Reset Node Health

Operation Learnings

• Many factors influence efficiency: need for probabilistic approaches

• Decisions may go wrong: closely monitor all components' behavior and use interpretable models

• Data quality is critical: watch for telemetry schema changes

• Customer impact is complex: a human in the loop helps prevent new types of impact


Summary / Takeaway Message

• Both failure prediction and proactive mitigation are critical

• There is no one-size-fits-all mitigation strategy: different mitigation strategies must be adapted in an online fashion

• Narya uses AB testing and multi-armed bandits to proactively and adaptively mitigate predicted bad nodes

• Narya achieved a 26% improvement over the previous static strategy


Thank you!

• Acknowledgements:
  • Haryadi Gunawi, our shepherd, and the anonymous reviewers
  • All our collaborators within Azure

• Contact us:
  • selevy@microsoft.com

• ranyao@microsoft.com

• yow@microsoft.com

• yidang@microsoft.com
