
Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure

Qiao Zhang1, Guo Yu2, Chuanxiong Guo3, Yingnong Dang4, Nick Swanson4, Xinsheng Yang4, Randolph Yao4, Murali Chintalapati4, Arvind Krishnamurthy1, Tom Anderson1

1 University of Washington, 2 Cornell University, 3 Toutiao (Bytedance), 4 Microsoft

VM Availability

• IaaS is one of the largest cloud services today

• High VM availability is a key performance metric

• Yet, achieving 99.999% VM uptime remains a challenge

1. What is the availability bottleneck?
2. How to eliminate it?
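To make the 99.999% target concrete, the short calculation below converts an availability figure into a yearly downtime budget. It is only a back-of-the-envelope sketch: the function name and the 365-day year are assumptions made for illustration, not something stated on the slide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)

def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed per period at the given availability."""
    return (1.0 - availability) * period_minutes

# Five nines leaves only a few minutes of downtime per VM per year.
budget = downtime_budget_minutes(0.99999)
print(f"{budget:.2f} minutes per year")
```

A single incident that takes tens of minutes to localize would already exceed this budget for every affected VM.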

Azure IaaS Architecture

• Compute and storage clusters with a Clos-like network

• Compute-storage separation
  • VMs and Virtual Hard Disks (VHDs) from different clusters

• Hypervisor transparently redirects disk access

• Data survive compute rack failure

[Figure: Subsystems inside a Datacenter. Labels: Clos Network, Storage Cluster, Compute Cluster, Host, Hypervisor, VM.]

A New Type of Failure: VHD Failures

• Infra failures can disrupt VHD access

• Hypervisor can retry, but not indefinitely

• Hypervisor will eventually crash the VM

• Customers then take actions to keep their app-level SLAs

[Figure: Subsystems inside a Datacenter. Labels: Clos Network, Storage Cluster, Compute Cluster, Host, Hypervisor, VM.]

How much do VHD failures impact VM availability?

VHD failures:
• 52% of unplanned VM downtime
• Tens of minutes to hours to localize

Breakdown of Unplanned VM Downtime in a Year

Cause          Share
VHD Failure    52%
SW Failure     41%
HW Failure     6%
Unknown        1%

VHD failure localization is the bottleneck

Failure Triage was Slow and Inaccurate

• Each team checks their subsystem for anomalies to match the incident
  • e.g., host heartbeats, storage perf counters, link discards

• Incidents get ping-ponged due to false positives
  • Inaccurate and slow diagnosis

• Gray failures in network and storage are hard to catch
  • Troubled but not totally down
  • Only fail a subset of VHD requests
  • Can take hours to localize

Deepview Approach: Global View

• Isolate failures by examining interactions between subsystems
  • Instead of alerting every team

• Bipartite model
  • Compute clusters (left) : storage clusters (right)
  • Edge if VMs from a compute cluster mount VHDs from a storage cluster
  • Edge weight = VHD failure rate

[Figure: Bipartite Model with compute clusters C1-C4 and storage clusters S1-S3, next to the Grid View with columns S1-S3.]
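The bipartite model can be sketched in a few lines. The example below is a minimal illustration, assuming each VM record carries its compute cluster, the storage cluster of its VHD, and a flag for whether it saw a VHD failure. The record layout and the function name `edge_failure_rates` are hypothetical, and "failure rate" is taken here to mean the fraction of VMs on an edge that failed.

```python
from collections import defaultdict

def edge_failure_rates(vm_records):
    """Map each (compute, storage) edge to the fraction of its VMs with VHD failures.

    vm_records: iterable of (compute_cluster, storage_cluster, had_vhd_failure).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for compute, storage, failed in vm_records:
        edge = (compute, storage)
        totals[edge] += 1          # an edge exists once any VM uses this pair
        if failed:
            failures[edge] += 1
    return {edge: failures[edge] / totals[edge] for edge in totals}

# Hypothetical records: every VM on C2 fails, regardless of storage cluster.
records = [
    ("C1", "S1", False), ("C1", "S1", True),
    ("C2", "S1", True), ("C2", "S2", True),
    ("C3", "S2", False), ("C3", "S3", False),
]
rates = edge_failure_rates(records)
```

Laying these edge weights out as a compute-by-storage grid gives the grid view used on the following slides.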

Deepview Approach: Global View

Azure measurements revealed many common failure patterns

[Figure: Example Compute Cluster Failure. Compute cluster C2 failed. Bipartite model (C1-C4, S1-S3) beside the C2 Failure Grid View.]

[Figure: Example Storage Cluster Failure. Storage cluster S1 failed. Bipartite model (C1-C4, S1-S3) beside the S1 Gray Failure Grid View.]

Challenges

Remaining challenges, each paired with the technique that addresses it:
1. Need to locate network failures: generalized model
2. Need to handle gray failures: Lasso + hypothesis testing
3. Need to be near-real-time: streaming data pipeline

Summary of our goal:

A system to localize VHD failures to underlying failures in compute, storage or network subsystems within a time budget of 15 minutes

Time budget set by production team to meet availability goals

Outline

• Global View Approach
• Model & Algorithm
• System
• Evaluation
• Architectural Lessons
• Related Work

Deepview Model: Include the Network

• Need to handle multipath & ECMP

• Simplify the Clos network to a tree by aggregating network devices

• Can model at the granularity of clusters or racks

[Figure: Compute Cluster and Storage Cluster connected through the Clos Network.]

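Deepview Model: Estimate Component Health

Assume independent failures:

$$\mathrm{Prob}(\text{path } i \text{ is healthy}) = \prod_{j \in \mathrm{path}(i)} \mathrm{Prob}(\text{component } j \text{ is healthy})$$

$$1 - \frac{e_i}{n_i} = \prod_{j \in \mathrm{path}(i)} p_j$$

$$\log\left(1 - \frac{e_i}{n_i}\right) = \sum_{j \in \mathrm{path}(i)} \log p_j$$

System of Linear Equations:

$$y_i = \sum_{j=1}^{N} \beta_j x_{ij} + \varepsilon_i$$

where
• $e_i$ = number of VMs crashed, $n_i$ = number of VMs
• $y_i = \log\left(1 - \frac{e_i}{n_i}\right)$
• $\beta_j = \log p_j$
• $\varepsilon_i$ = measurement noise

Color legend on the slide: blue = observable, red = unknown, purple = topology

Component $j$ is healthy with probability $p_j = \exp(\beta_j)$
• $\beta_j = 0$: clear component $j$
• $\beta_j \ll 0$: may blame it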
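As a small worked example of the linear model, the sketch below turns one path's crash counts into the observation $y_i$ and builds the matching row of topology indicators. The 0/1 encoding of $x_{ij}$ (1 when component $j$ lies on path $i$) is an assumption consistent with the sum over $j \in \mathrm{path}(i)$. The helper names and the component list are hypothetical.

```python
import math

def path_observation(crashed, total):
    """y_i = log(1 - e_i / n_i) for one path."""
    if crashed >= total:
        # log(0) is undefined; a real pipeline would need a policy for this case.
        raise ValueError("all VMs on the path crashed; y_i is undefined")
    return math.log(1.0 - crashed / total)

def design_row(path_components, all_components):
    """x_ij = 1 if component j lies on path i, else 0 (assumed encoding)."""
    on_path = set(path_components)
    return [1 if name in on_path else 0 for name in all_components]

components = ["C1", "C2", "net", "S1", "S2"]
row = design_row(["C1", "net", "S2"], components)
y = path_observation(crashed=5, total=100)

# Exponentiating y recovers the observed fraction of healthy VMs on the path.
healthy_fraction = math.exp(y)
```

Stacking one such row and one $y_i$ per path gives the matrix $X$ and vector $y$ used in the regression.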

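Deepview Algorithm: Prefer Simpler Explanation via Lasso

• Potentially, #unknowns > #equations
  • Traditional least-squares regression would fail

Example with compute clusters C1, C2 and storage clusters S1, S2 (four equations, five unknowns), an instance of $y_i = \sum_{j=1}^{N} \beta_j x_{ij} + \varepsilon_i$:

$$
\begin{aligned}
y_1 &= \beta_{c1} + \beta_{net} + \beta_{s1} + \varepsilon_1 \\
y_2 &= \beta_{c1} + \beta_{net} + \beta_{s2} + \varepsilon_2 \\
y_3 &= \beta_{c2} + \beta_{net} + \beta_{s1} + \varepsilon_3 \\
y_4 &= \beta_{c2} + \beta_{net} + \beta_{s2} + \varepsilon_4
\end{aligned}
$$

• But multiple simultaneous failures are rare
  • Encode this domain knowledge mathematically?

• Equivalent to preferring most $\beta_j$ to be zero (sparsity)
  • Lasso regression can get sparse solutions efficiently

Lasso Objective Function:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^N,\ \beta \le 0} \; \|y - X\beta\|^2 + \lambda \|\beta\|_1$$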
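The sparse estimate can be illustrated on the four-path example. The sketch below minimizes the Lasso objective with the constraint $\beta \le 0$ using plain cyclic coordinate descent in pure Python. The solver choice, the function name `nonpositive_lasso`, the value of $\lambda$, and the synthetic observations are all assumptions made for illustration. The slides do not say which solver the production system uses.

```python
import math

def nonpositive_lasso(X, y, lam, iterations=500):
    """Minimize ||y - X b||^2 + lam * ||b||_1 subject to b <= 0 (coordinate descent)."""
    n_rows, n_cols = len(X), len(X[0])
    beta = [0.0] * n_cols
    for _ in range(iterations):
        for j in range(n_cols):
            z = sum(X[i][j] ** 2 for i in range(n_rows))
            if z == 0:
                continue  # component appears on no path
            # Correlate column j with the residual that leaves component j out.
            rho = 0.0
            for i in range(n_rows):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(n_cols) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            # For b_j <= 0 we have |b_j| = -b_j, so the 1-D minimizer is
            # (rho + lam / 2) / z, clipped at zero to respect the constraint.
            beta[j] = min(0.0, (rho + lam / 2.0) / z)
    return beta

components = ["C1", "C2", "net", "S1", "S2"]
X = [
    [1, 0, 1, 1, 0],  # path 1: C1 - net - S1
    [1, 0, 1, 0, 1],  # path 2: C1 - net - S2
    [0, 1, 1, 1, 0],  # path 3: C2 - net - S1
    [0, 1, 1, 0, 1],  # path 4: C2 - net - S2
]
# Synthetic observations: 20% of VMs crash on every path through S1, none elsewhere.
y = [math.log(0.8), 0.0, math.log(0.8), 0.0]

beta_hat = dict(zip(components, nonpositive_lasso(X, y, lam=0.01)))
```

With these inputs only the S1 coefficient ends up clearly negative while the others shrink to zero, which is the "simplest explanation" behavior the penalty is meant to encourage.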

Deepview Algorithm: Principled Blame Decision via Hypothesis Testing

• Need a binary decision (flag/clear) for each component
  • Ad-hoc thresholds do not work reliably
  • Can we make a principled decision?

• If estimated failure probability is worse than average, then likely a real failure

• Hypothesis test:

$$H_0(j): \beta_j = \bar{\beta} \quad \text{vs.} \quad H_A(j): \beta_j < \bar{\beta}$$

  • If we reject $H_0(j)$, blame component $j$; otherwise, clear it
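One way to picture the blame decision is a one-sided test of each estimate against the mean of all estimates. The sketch below uses a z-score with a normal approximation. This is a deliberately simplified stand-in: the slides do not specify the test statistic, so the statistic, the significance level `alpha`, the function name, and the sample estimates are all assumptions.

```python
from statistics import NormalDist, mean, stdev

def blame_components(beta_hat, alpha=0.05):
    """Return {component: True if blamed} via a one-sided z-score test.

    H0: beta_j equals the mean estimate; HA: beta_j is below the mean.
    """
    values = list(beta_hat.values())
    beta_bar = mean(values)
    spread = stdev(values)
    decisions = {}
    for name, beta_j in beta_hat.items():
        if spread == 0:
            decisions[name] = False  # all estimates identical: nothing stands out
            continue
        z = (beta_j - beta_bar) / spread
        p_value = NormalDist().cdf(z)  # small only when beta_j is far below the mean
        decisions[name] = p_value < alpha
    return decisions

# Hypothetical estimates in which only S1 looks unhealthy.
beta_hat = {"C1": 0.0, "C2": 0.0, "net": 0.0, "S1": -0.22, "S2": 0.0}
decisions = blame_components(beta_hat)
```

The point of the sketch is the shape of the decision rule: the threshold comes from a significance level rather than from a hand-tuned cutoff on $\beta_j$.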

Deepview System Architecture: NRT Data Pipeline

[Figure: near-real-time (NRT) data pipeline. Stage labels: RAW DATA (VHD Failure, VM Info, Storage Acct, Net Topo; real-time and non-RT sources; Ingestion Pipeline), SLIDING WINDOW OF INPUT (Kusto Engine, VMsPerPath input), RUN ALGO (Near-realtime Scheduler), ACTIONS (Output, Alerts).]
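The sliding-window stage can be pictured with a small in-memory sketch: keep recent VHD failure events, drop those older than the window, and count failures per path each time the scheduler fires. The class below is purely illustrative. The event layout, the class name, and the assumption that events arrive in timestamp order are hypothetical, and the production pipeline runs on the Kusto engine rather than on code like this.

```python
from collections import Counter, deque

class SlidingWindow:
    """Keep VHD failure events from the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, path), assumed to arrive in time order

    def add(self, timestamp, path):
        self.events.append((timestamp, path))

    def failures_per_path(self, now):
        # Evict events that have fallen out of the window before counting.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()
        return Counter(path for _, path in self.events)

window = SlidingWindow(window_seconds=300)
window.add(0, ("C1", "S1"))
window.add(100, ("C1", "S1"))
window.add(250, ("C2", "S1"))
counts = window.failures_per_path(now=320)  # the event at t=0 has expired
```

The per-path failure counts, joined with the number of VMs per path, would supply the $e_i$ and $n_i$ inputs to the algorithm.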


Evaluation

Deepview has been deployed in production at Azure

1. How well can it localize VHD failures in production?

2. How accurate is the algorithm compared to alternatives?

3. How fast is the system?

Some Statistics

• Analyzed Deepview results for one month
  • Daily VHD failures: hundreds to tens of thousands

• Detected 100 failure instances
  • 70 matched with existing tickets, 30 were previously undetected

• Reduced unclassified VHD failures to at most 500 per day
  • Remaining ones are host failures or customer mistakes (e.g., expired storage accounts)

Case Study 1: Unplanned ToR Reboot

• Unplanned ToR reboots can cause VM crashes
  • We know this can happen, but not where and when

• Deepview can flag those ToRs

• Associate VM downtime with ToR failures
  • Quantify the impact of the ToR as a single point of failure on VM availability

[Figure: ToR_11 through ToR_15. Blamed the right ToR among 288 components.]

Case Study 2: Storage Cluster Gray Failure

• A storage cluster was brought online with a bug that puts some VHDs in a negative cache

• Deepview flagged the faulty storage cluster almost immediately, while manual triage took 20+ hours

[Figure: Number of VMs with VHD Failures per Hour during a Storage Cluster Gray Failure.]

Case Study 3: Network Failure

• Network outages are rare, but do happen

• In one incident, many top-tier links were mistakenly turned off, causing a large capacity loss

• When storage replication traffic hit, it caused huge packet losses and many VMs to crash

• Deepview pinpointed the misbehaving aggregate switches

[Figure: A Network Failure due to Top Tier Link Capacity Loss. Label: Storage Clusters.]

Algorithm Accuracy Comparison

• Two other tomography algorithms: Boolean-Tomo and SCORE
  • Greedy heuristics to find the minimum set of failures

• Use production trace from 42 incidents
  • 16 Compute, 14 Storage, 10 ToR, 2 Net

[Figure: Precision and recall of Boolean-Tomo, SCORE, and Deepview. Data labels visible in the plot: 0.9 and 0.67.]
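For reference, precision and recall for a single incident can be computed from the set of components an algorithm blames and the set that actually failed. The sketch below uses the standard definitions. The function name and the component sets are made-up illustrations, not data from the production trace.

```python
def precision_recall(blamed, truth):
    """Standard precision and recall over sets of component names."""
    true_positives = len(blamed & truth)
    precision = true_positives / len(blamed) if blamed else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical incident: two of three faulty components are found, with no false blame.
blamed = {"S1", "ToR_13"}
truth = {"S1", "ToR_13", "C2"}
precision, recall = precision_recall(blamed, truth)
```

Precision penalizes blaming healthy components (the cause of ping-ponged incidents), while recall penalizes missed failures.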

Deepview Time to Detection

• Time to detection (TTD)
  • Time from incident start to failure localized
  • Estimate start time from VHD failure event timestamp

• Deepview's TTD is under 10 min
  • Data ingestion takes ~3.5 min
  • ~5 min sliding window delay
  • Worst-case 18 sec algorithm running time

• Meets the target TTD of 15 min
  • Can be made faster, but mitigation time is on a human time scale
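Adding up the three components quoted above shows where the "under 10 min" figure comes from. The snippet treats the approximate values as exact, which is an assumption made only for the arithmetic.

```python
ingestion_min = 3.5        # ~3.5 min data ingestion
window_delay_min = 5.0     # ~5 min sliding window delay
algorithm_min = 18 / 60    # worst-case 18 sec algorithm running time

ttd_min = ingestion_min + window_delay_min + algorithm_min
print(f"Estimated TTD: {ttd_min:.1f} min")

target_min = 15
headroom_min = target_min - ttd_min  # slack left within the 15-minute budget
```

Ingestion and the sliding window dominate the total, while the algorithm itself contributes well under a minute.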


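ToR as a Single Point of Failure

• Reduced network cost vs. availability cost for using a single ToR per rack
• Soft failures (recoverable by reboot) vs. hard failures

$$
\begin{aligned}
\text{ToR Availability}
&= 1 - \frac{(\%\text{soft} \times \text{soft dur.} + \%\text{hard} \times \text{hard dur.}) \times \text{frac. rebooted ToRs per month}}{\text{total time in a month}} \\
&= 1 - \frac{(90\% \times 20\ \text{min} + 10\% \times 120\ \text{min}) \times 0.1\%}{30 \times 24 \times 60\ \text{min}} \\
&= 99.99993\%
\end{aligned}
$$

• Dependent services (ToRs) need to provide one extra nine to the target service (VMs)

ToRs are not on the critical path for VMs to achieve five-nines availability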
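The availability figure can be reproduced directly from the numbers on the slide. The function below is a plain transcription of the formula. The function and parameter names are chosen here for readability and are not from the source.

```python
def tor_availability(soft_share, soft_minutes, hard_share, hard_minutes,
                     rebooted_fraction_per_month, minutes_per_month=30 * 24 * 60):
    """1 - (expected outage minutes per reboot * fraction of ToRs rebooted) / month length."""
    expected_outage = soft_share * soft_minutes + hard_share * hard_minutes
    return 1.0 - expected_outage * rebooted_fraction_per_month / minutes_per_month

availability = tor_availability(
    soft_share=0.90, soft_minutes=20,
    hard_share=0.10, hard_minutes=120,
    rebooted_fraction_per_month=0.001,
)
print(f"{availability * 100:.5f}%")
```

The expected outage per reboot is 30 minutes, and spreading 0.1% of that over a 43,200-minute month gives roughly six nines, one more than the five-nines VM target.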

VMs and their Storage Co-location

• For load balancing, VMs can mount VHDs from any storage cluster in the same region

• Some VMs have storage that is further away
  • Can longer network paths impact VM availability? And by how much?

• At Azure, 52% two-hop, 41% four-hop
• Compute daily VHD failure rates: $r_{\text{two}}$ (two-hop), $r_{\text{four}}$ (four-hop)
• Average $r_{\text{two}}$ and $r_{\text{four}}$ over 3 months
• $(r_{\text{four}} - r_{\text{two}}) / r_{\text{two}} = 11.4\%$ increase

Longer network paths do lead to a higher (11.4%) VHD failure rate
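The comparison boils down to averaging the daily failure rates for each path length and taking the relative increase. The sketch below shows that calculation on made-up daily rates. The numbers are hypothetical and chosen only to exercise the formula, and they do not reproduce the 11.4% result.

```python
def average(rates):
    return sum(rates) / len(rates)

def relative_increase(two_hop_rate, four_hop_rate):
    """(four-hop rate - two-hop rate) / two-hop rate."""
    return (four_hop_rate - two_hop_rate) / two_hop_rate

# Hypothetical daily VHD failure rates (fraction of VMs affected per day).
two_hop_daily = [0.010, 0.012, 0.008]
four_hop_daily = [0.011, 0.013, 0.009]

increase = relative_increase(average(two_hop_daily), average(four_hop_daily))
print(f"{increase:.1%} higher failure rate on four-hop paths")
```

Averaging before dividing, as the slide describes, keeps a single noisy day from dominating the ratio.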

Related Work

• NetPoirot [SIGCOMM '16]
  • A single-node solution to failure localization using TCP statistics
  • Complementary if TCP statistics from customer VMs are available

• Binary Tomography
  • Deepview achieves higher precision/recall than those greedy heuristics

• (Approximate) Bayesian Network
  • Too slow for our problem
  • Future work to compare accuracy experimentally

Conclusion

• Identified VHD failures as the availability bottleneck at Azure

• Deepview reduced unclassified daily VHD failures from 10,000s to 100s

• Revealed new failures, e.g., unplanned ToR reboots, storage gray failures

• Quantified the impact of several architectural decisions on VM availability

Thank you! Questions?
