Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure

Qiao Zhang (1), Guo Yu (2), Chuanxiong Guo (3), Yingnong Dang (4), Nick Swanson (4), Xinsheng Yang (4), Randolph Yao (4), Murali Chintalapati (4), Arvind Krishnamurthy (1), Tom Anderson (1)

(1) University of Washington, (2) Cornell University, (3) Toutiao (Bytedance), (4) Microsoft

Post on 29-Oct-2019


Page 1: Deepview: Virtual Disk Failure Diagnosis and Pattern ... · Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure Qiao Zhang1, Guo Yu2, Chuanxiong Guo3, Yingnong


Page 2: VM Availability

• IaaS is one of the largest cloud services today
• High VM availability is a key performance metric
• Yet, achieving 99.999% VM uptime remains a challenge

1. What is the availability bottleneck?
2. How to eliminate it?

Page 3: Azure IaaS Architecture

• Compute and storage clusters with a Clos-like network
• Compute-storage separation
  • VMs and Virtual Hard Disks (VHDs) from different clusters
  • Hypervisor transparently redirects disk access
  • Data survive compute rack failure

[Figure: Subsystems inside a Datacenter. Labels: Compute Cluster (Host, Hypervisor, VM), Clos Network, Storage Cluster]

Page 4: A New Type of Failure: VHD Failures

• Infra failures can disrupt VHD access
• Hypervisor can retry, but not indefinitely
• Hypervisor will eventually crash the VM
• Customers then take actions to keep their app-level SLAs

[Figure: Subsystems inside a Datacenter. Labels: Compute Cluster (Host, Hypervisor, VM), Clos Network, Storage Cluster]

Page 5: How much do VHD failures impact VM availability?

VHD failures:
• 52% of unplanned VM downtime
• Tens of minutes to hours to localize

Breakdown of Unplanned VM Downtime in a Year:
• VHD Failure: 52%
• SW Failure: 41%
• HW Failure: 6%
• Unknown: 1%

VHD failure localization is the bottleneck

Page 6: Failure Triage was Slow and Inaccurate

• Each team checks their subsystem for anomalies to match the incident
  • e.g., host heart-beats, storage perf-counters, link discards
• Incidents get ping-ponged due to false positives
  • Inaccurate and slow diagnosis
• Gray failures in network and storage are hard to catch
  • Troubled but not totally down
  • Only fail a subset of VHD requests
  • Can take hours to localize

Page 7: Deepview Approach: Global View

[Figure: Bipartite Model (compute clusters C1 to C4, storage clusters S1 to S3) next to the equivalent Grid View (C1 to C4 against S1 to S3)]

• Isolate failures by examining interactions between subsystems
  • Instead of alerting every team
• Bipartite model
  • Compute Clusters (left) : Storage Clusters (right)
  • Edge if VMs from compute cluster mount VHDs from a storage cluster
  • Edge weight = VHD failure rate
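The bipartite model can be sketched as a dictionary keyed by (compute cluster, storage cluster) edges. This is a minimal illustration, not Deepview's implementation: the record layout, the `build_grid` name, and the sample counts are all hypothetical, and "VHD failure rate" is assumed here to mean the fraction of VMs on an edge that saw a VHD failure.

```python
from collections import defaultdict


def build_grid(records):
    """Aggregate per-edge VM counts into a {(compute, storage): failure_rate} grid.

    records: iterable of (compute_cluster, storage_cluster, vm_count, failed_vm_count).
    An edge exists only when at least one VM mounts a VHD across that pair.
    """
    totals = defaultdict(lambda: [0, 0])  # edge -> [vm_count, failed_vm_count]
    for compute, storage, vms, failed in records:
        totals[(compute, storage)][0] += vms
        totals[(compute, storage)][1] += failed
    return {edge: failed / vms for edge, (vms, failed) in totals.items() if vms > 0}


# Hypothetical counts: every edge leaving compute cluster C2 has a high rate.
records = [
    ("C1", "S1", 100, 1),
    ("C1", "S2", 50, 0),
    ("C2", "S1", 80, 40),
    ("C2", "S3", 20, 10),
]
grid = build_grid(records)
for (compute, storage), rate in sorted(grid.items()):
    print(f"{compute} -> {storage}: {rate:.2f}")
```

Reading the result as a grid, high rates concentrated on the edges of one compute cluster point at that cluster, while pairs with no mounted VHDs simply have no entry.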

Page 8: Deepview Approach: Global View

Azure measurements revealed many common failure patterns

[Figure: Example Compute Cluster Failure. Bipartite model of C1 to C4 and S1 to S3 in which compute cluster C2 failed, with the matching Grid View labeled "C2 Failure"]

[Figure: Example Storage Cluster Failure. Bipartite model of C1 to C4 and S1 to S3 in which storage cluster S1 failed, with the matching Grid View labeled "S1 Gray Failure"]

Page 9: Challenges

Remaining challenges, each with the technique used to address it:
1. Need to locate network failures: generalized model
2. Need to handle gray failures: Lasso + hypothesis testing
3. Need to be near-real-time: streaming data pipeline

Summary of our goal:

A system to localize VHD failures to underlying failures in compute, storage or network subsystems within a time budget of 15 minutes

Time budget set by production team to meet availability goals

Page 10: Outline

• Global View Approach
• Model & Algorithm
• System
• Evaluation
• Architectural Lessons
• Related Work

Page 11: Deepview Model: Include the Network

[Figure: Compute Cluster and Storage Cluster connected through the Clos Network]

• Need to handle multipath & ECMP
• Simplify Clos network to a tree by aggregating network devices
• Can model at the granularity of clusters or racks

Page 12: Deepview Model: Estimate Component Health

Assume independent failures:

$$\Pr(\text{path } i \text{ is healthy}) = \prod_{j \in \text{path}(i)} \Pr(\text{component } j \text{ is healthy})$$

$$1 - \frac{e_i}{n_i} = \prod_{j \in \text{path}(i)} p_j$$

$$\log\left(1 - \frac{e_i}{n_i}\right) = \sum_{j \in \text{path}(i)} \log p_j$$

System of Linear Equations:

$$y_i = \sum_{j=1}^{N} \beta_j x_{ij} + \varepsilon_i$$

where
• $e_i$ = num of VMs crashed on path $i$, $n_i$ = num of VMs on path $i$
• $y_i = \log\left(1 - \frac{e_i}{n_i}\right)$, $\beta_j = \log p_j$, $\varepsilon_i$ = measurement noise
• $x_{ij} = 1$ if component $j$ is on path $i$, otherwise $0$

Color legend on the slide: blue = observable, red = unknown, purple = topology

Component $j$ is healthy with probability $p_j = \exp(\beta_j)$
• $\beta_j = 0$: clear component $j$
• $\beta_j \ll 0$: may blame it
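As a rough sketch of how the observations turn into the linear system, the snippet below builds the indicator matrix X and the vector y from per-path counts. The function name, the input layout, and the `max_ratio` clamp (used so that the logarithm stays finite when every VM on a path crashed) are assumptions made for illustration; the slides do not say how Deepview handles that case.

```python
import math


def build_linear_system(paths, components, max_ratio=0.999):
    """Return (X, y) for y_i = sum_j beta_j * x_ij + eps_i.

    paths: list of (components_on_path, n_vms, crashed_vms).
    components: ordered list of component names; column j of X is components[j].
    max_ratio: assumed cap on e_i / n_i so that log(1 - e_i / n_i) is finite.
    """
    index = {name: j for j, name in enumerate(components)}
    X, y = [], []
    for on_path, n_vms, crashed in paths:
        if n_vms == 0:
            continue  # no VMs on this path, so no observation
        row = [0.0] * len(components)
        for name in on_path:
            row[index[name]] = 1.0  # x_ij = 1 when component j is on path i
        ratio = min(crashed / n_vms, max_ratio)
        X.append(row)
        y.append(math.log(1.0 - ratio))
    return X, y


# Hypothetical observations for two paths that share the network and S1.
components = ["C1", "C2", "Net", "S1"]
paths = [
    (["C1", "Net", "S1"], 100, 0),
    (["C2", "Net", "S1"], 100, 50),
]
X, y = build_linear_system(paths, components)
print(X)
print(y)
```

A path with no crashed VMs gives y = 0, and the more VMs crash on a path, the more negative its y becomes.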

Page 13: Deepview Algorithm: Prefer Simpler Explanation via Lasso

• Potentially, #unknowns > #equations
  • Traditional least-squares regression would fail

Example (one network component Net, compute clusters C1 and C2, storage clusters S1 and S2):

$$y_i = \sum_{j=1}^{N} \beta_j x_{ij} + \varepsilon_i$$

$$\begin{aligned}
y_1 &= \beta_{c1} + \beta_{net} + \beta_{s1} + \varepsilon_1 \\
y_2 &= \beta_{c1} + \beta_{net} + \beta_{s2} + \varepsilon_2 \\
y_3 &= \beta_{c2} + \beta_{net} + \beta_{s1} + \varepsilon_3 \\
y_4 &= \beta_{c2} + \beta_{net} + \beta_{s2} + \varepsilon_4
\end{aligned}$$

• But multiple simultaneous failures are rare
  • Encode this domain knowledge mathematically?
• Equivalent to preferring most $\beta_j$ to be zero (sparsity)
  • Lasso regression can get sparse solutions efficiently

Lasso Objective Function:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^N,\ \beta \le 0} \; \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1$$
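To make the sparsity preference concrete, here is a small pure-Python sketch that minimizes the Lasso objective with the constraint β ≤ 0 by cyclic coordinate descent. Coordinate descent is a standard Lasso solver chosen here for brevity; the slides do not say which solver Deepview uses. The function name, the value λ = 0.1, and the observations are hypothetical, and the matrix encodes the four-path example with columns ordered c1, c2, net, s1, s2.

```python
import math


def nonpositive_lasso(X, y, lam, sweeps=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 subject to b <= 0."""
    n_rows, n_cols = len(X), len(X[0])
    beta = [0.0] * n_cols
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(sweeps):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue  # component appears on no path
            # Correlation of column j with the residual that excludes beta_j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n_rows))
            # 1-D minimizer of the objective in beta_j, projected onto beta_j <= 0.
            new = min(0.0, (rho + lam / 2.0) / col_sq[j])
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Columns: c1, c2, net, s1, s2. Rows: the four paths of the example.
X = [
    [1.0, 0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 1.0],
]
# Hypothetical observations: half of the VMs crash on both paths through c2.
y = [0.0, 0.0, math.log(0.5), math.log(0.5)]
beta = nonpositive_lasso(X, y, lam=0.1)
print([round(b, 3) for b in beta])
```

With these inputs only the c2 coefficient ends up clearly negative, which is the single-failure explanation; an unpenalized least-squares fit has no unique solution here because five unknowns are constrained by four equations.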

Page 14: Deepview Algorithm: Principled Blame Decision via Hypothesis Testing

• Need a binary decision (flag/clear) for each component
  • Ad-hoc thresholds do not work reliably
  • Can we make a principled decision?
• If estimated failure probability is worse than average, then it is likely a real failure
• Hypothesis test:

$$H_0(j): \beta_j = \bar{\beta} \quad \text{vs.} \quad H_A(j): \beta_j < \bar{\beta}$$

  • If reject $H_0(j)$, blame component $j$; otherwise, clear it
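The decision rule can be illustrated with a generic one-sided z-test. This is only a sketch of the idea of a lower-tailed test: the slide does not give Deepview's test statistic, and the function name, the availability of a standard error for each estimate, and the significance level 0.05 are all assumptions.

```python
from statistics import NormalDist


def blame(beta_j, beta_bar, std_err, alpha=0.05):
    """One-sided test of H0: beta_j = beta_bar against HA: beta_j < beta_bar.

    std_err is an assumed standard error of the estimate beta_j.
    Returns True (blame) when H0 is rejected at level alpha.
    """
    z = (beta_j - beta_bar) / std_err
    p_value = NormalDist().cdf(z)  # lower-tail probability under H0
    return p_value < alpha


# Hypothetical estimates: the first is far below average, the second is not.
print(blame(beta_j=-2.0, beta_bar=-0.1, std_err=0.5))
print(blame(beta_j=-0.2, beta_bar=-0.1, std_err=0.5))
```

The test is one-sided because only components that look worse than average are candidates for blame; a component that looks better than average is always cleared.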

Page 15: Deepview System Architecture: NRT Data Pipeline

[Figure: near-real-time (NRT) data pipeline. Stage labels: RAW DATA (VHD Failure, VM Info, StorageAcct, Net Topo; sources marked Real-time or Non-RT), Ingestion Pipeline, Kusto Engine, SLIDING WINDOW OF INPUT (VMsPerPath, Input), RUN ALGO (Near-realtime Scheduler, Algo), Output, ACTIONS (Alerts, Vis)]

Page 16: Outline

• Global View Approach
• Model & Algorithm
• System
• Evaluation
• Architectural Lessons
• Related Work

Page 17: Evaluation

Deepview has been deployed in production at Azure

1. How well can it localize VHD failures in production?
2. How accurate is the algorithm compared to alternatives?
3. How fast is the system?

Page 18: Some Statistics

• Analyzed Deepview results for one month
  • Daily VHD failures: hundreds to tens of thousands
• Detected 100 failure instances
  • 70 matched with existing tickets, 30 were previously undetected
• Reduced unclassified VHD failures to fewer than 500 per day at most
  • Host failures or customer mistakes (e.g., expired storage accounts)

Page 19: Case Study 1: Unplanned ToR Reboot

• Unplanned ToR reboot can cause VM crashes
  • Know this can happen, but not where and when
• Deepview can flag those ToRs
• Associate VM downtime with ToR failures
  • Quantify the impact of ToR as a single-point-of-failure on VM availability

[Figure: Grid view of ToR_11 to ToR_15 against STR_01 to STR_07. Caption: Blamed the right ToR among 288 components]

Page 20: Deepview: Virtual Disk Failure Diagnosis and Pattern ... · Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure Qiao Zhang1, Guo Yu2, Chuanxiong Guo3, Yingnong

Page 20: Case Study 2: Storage Cluster Gray Failure

• A storage cluster was brought online with a bug that puts some VHDs in negative cache
• Deepview flagged the faulty storage cluster almost immediately while manual triage took 20+ hours

[Figure: Number of VMs with VHD Failures per Hour during a Storage Cluster Gray Failure. x-axis: Hour (0 to 60); y-axis: Number of VMs with VHD Failures per Hour (ticks at 10 and 20)]

Page 21: Case Study 3: Network Failure

• Network outages are rare, but do happen
• In an incident, many top tier links were mistakenly turned off, causing large capacity loss
• When storage replication traffic hit, it caused huge packet losses and crashed many VMs
• Deepview pinpointed the misbehaving aggregate switches

[Figure: A Network Failure due to Top Tier Link Capacity Loss. Grid view of Compute Clusters against Storage Clusters]

Page 22: Algorithm Accuracy Comparison

[Figure: Algorithm Accuracy Comparison. Bar chart of Precision and Recall for BooleanTomo, SCORE and Deepview; y-axis from 0 to 1 in steps of 0.25. Bar value labels in source order: 0.6, 0.3, 0.9, 0.67, 0.88, 1]

• Two other tomography algorithms: Boolean-Tomo and SCORE
  • Greedy heuristics to find minimum set of failures
• Use production trace from 42 incidents
  • 16 Compute, 14 Storage, 10 ToR, 2 Net

Page 23: Deepview Time to Detection

• Time to detection (TTD)
  • Time from incident start to failure localized
  • Estimate start time from VHD failure event timestamp
• Deepview's TTD is under 10 min
  • Data ingestion takes ~3.5 min
  • ~5 min sliding window delay
  • Worst-case 18 sec algorithm running time
• Meets the target TTD of 15 min
  • Can be made faster but mitigation time is on human time scale
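The budget can be checked with a few lines of arithmetic. The sketch assumes the three stages run back to back so that their delays simply add; the variable names are illustrative, and the ingestion and window figures are the approximate values quoted on the slide.

```python
# Delays from the slide, converted to minutes.
ingestion_min = 3.5            # data ingestion (approximate)
sliding_window_min = 5.0       # sliding window delay (approximate)
algorithm_min = 18 / 60        # worst-case algorithm running time of 18 seconds

# Assumption: the stages are sequential, so the delays add.
ttd_min = ingestion_min + sliding_window_min + algorithm_min
target_min = 15.0

print(f"estimated TTD: {ttd_min:.1f} min")
print(f"headroom against the target: {target_min - ttd_min:.1f} min")
```

Under that assumption the end-to-end delay comes to roughly 8.8 minutes, consistent with the "under 10 min" figure and inside the 15-minute target.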

Page 24: Outline

• Global View Approach
• Model & Algorithm
• System
• Evaluation
• Architectural Lessons
• Related Work

Page 25: ToR as a Single Point of Failure

• Reduced network cost vs. availability cost for using a single ToR per rack
• Soft failures (recoverable by reboot) vs. hard failures

$$\text{ToR Availability} = 1 - \frac{(\%\text{soft} \times \text{soft dur.} + \%\text{hard} \times \text{hard dur.}) \times \text{frac. rebooted ToRs per month}}{\text{total time in a month}}$$

$$= 1 - \frac{(90\% \times 20\,\text{min} + 10\% \times 120\,\text{min}) \times 0.1\%}{30 \times 24 \times 60\,\text{min}} = 99.99993\%$$

• Dependent services (ToRs) need to provide one extra nine to target service (VMs)

ToRs are not on the critical path for VMs to achieve five-nines availability
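The availability figure can be reproduced directly from the numbers on the slide. The variable names below are illustrative; the values are the ones shown in the formula.

```python
soft_fraction = 0.90        # share of ToR failures that are soft
soft_duration_min = 20      # downtime of a soft failure, in minutes
hard_fraction = 0.10        # share of ToR failures that are hard
hard_duration_min = 120     # downtime of a hard failure, in minutes
rebooted_fraction = 0.001   # fraction of ToRs rebooted per month (0.1%)
month_min = 30 * 24 * 60    # total minutes in a 30-day month

# Expected downtime of a failing ToR, weighted by failure type.
mean_outage_min = soft_fraction * soft_duration_min + hard_fraction * hard_duration_min
# Expected downtime per ToR per month, then availability.
expected_downtime_min = mean_outage_min * rebooted_fraction
availability = 1 - expected_downtime_min / month_min

print(f"mean outage: {mean_outage_min:.0f} min")
print(f"ToR availability: {availability * 100:.5f}%")
```

The weighted outage is 30 minutes, which at a 0.1% monthly reboot fraction costs each ToR about 0.03 minutes per month on average; that is what puts ToR availability at six nines, one more than the five-nines VM target.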

Page 26: VMs and their Storage Co-location

• For load balancing, VMs can mount VHDs from any storage cluster in the same region
• Some VMs have storage that is farther away
  • Can longer network paths impact VM availability? And by how much?

Longer network paths do lead to a higher (11.4%) VHD failure rate

• At Azure, 52% two-hop, 41% four-hop
  • Compute daily VHD failure rates: $r_{\text{two}}$ (two-hop), $r_{\text{four}}$ (four-hop)
  • Average $r_{\text{two}}$ and $r_{\text{four}}$ over 3 months
  • $(r_{\text{four}} - r_{\text{two}}) / r_{\text{two}} = 11.4\%$ increase
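The comparison amounts to averaging the daily rates for each path length and taking the relative difference. The sketch below uses made-up daily rates purely to show the calculation; they are not Azure measurements and do not reproduce the 11.4% figure.

```python
from statistics import mean


def relative_increase(two_hop_daily, four_hop_daily):
    """Relative increase of the mean four-hop failure rate over the two-hop one."""
    r_two = mean(two_hop_daily)
    r_four = mean(four_hop_daily)
    return (r_four - r_two) / r_two


# Hypothetical daily VHD failure rates (fraction of VMs affected per day).
two_hop_daily = [0.010, 0.020]
four_hop_daily = [0.012, 0.024]
increase = relative_increase(two_hop_daily, four_hop_daily)
print(f"relative increase: {increase:.1%}")
```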

Page 27: Related Work

• NetPoirot [SIGCOMM'16]
  • A single-node solution to failure localization using TCP statistics
  • Complementary if TCP statistics from customer VMs are available
• Binary Tomography
  • Deepview achieves higher precision/recall than those greedy heuristics
• (Approximate) Bayesian Network
  • Too slow for our problem
  • Future work to compare accuracy experimentally

Page 28: Conclusion

• Identified VHD failures as the availability bottleneck at Azure
• Deepview reduced unclassified daily VHD failures from 10,000s to 100s
• Revealed new failures, e.g., unplanned ToR reboots, storage gray failures
• Quantified the impact of several architectural decisions on VM availability

Thank you! Questions?