Distributed Systems
Failure Detection & Leader Election
Rik Sarkar
University of Edinburgh
Fall 2016
Failures
• How do we know that something has failed?
• Let's see what we mean by failed:
• Models of failure:
  1. Assume no failures
  2. Crash failures: a process may fail/crash
  3. Message failures: messages may get dropped
  4. Link failures: a communication link stops working
  5. Some combinations of 2, 3, 4
  6. More complex models can have recovery from failures
  7. Arbitrary failures: computation/communication may be erroneous
Distributed Systems, Edinburgh, 2016
Failure detectors
• Detection of a crashed process
  – (not one working erroneously)
• A major challenge in distributed systems
• A failure detector is a process that responds to questions asking whether a given process has failed
  – A failure detector is not necessarily accurate
Failure detectors
• Reliable failure detectors
  – Reply with "working" or "failed"
• Difficulty:
  – Detecting that something is working is easier: if it responds to a message, it is working
  – Detecting failure is harder: if it does not respond to the message, the message may have been lost/delayed, the process may be busy, etc.
• Unreliable failure detectors
  – Reply with "suspected (failed)" or "unsuspected"
  – That is, they do not try to give a confirmed answer
• We would ideally like reliable detectors, but unreliable ones (that give "maybe" answers) can be more realistic
Simple example
• Suppose we assume all messages are delivered within D seconds
• Then we can require each process to send a message every T seconds to the failure detectors
• If a failure detector does not get a message from process p in T + D seconds, it marks p as "suspected" or "failed" (depending on the type of detector)
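The timeout rule above can be sketched as a small heartbeat-based detector (a minimal sketch; the class name `HeartbeatDetector` and its interface are invented for illustration):

```python
import time

class HeartbeatDetector:
    """Marks a process as suspected if no heartbeat arrives within T + D seconds."""

    def __init__(self, T, D):
        self.timeout = T + D     # heartbeat period + max assumed delivery delay
        self.last_seen = {}      # process id -> time of last heartbeat

    def heartbeat(self, pid, now=None):
        # Record a heartbeat from process pid
        self.last_seen[pid] = time.time() if now is None else now

    def status(self, pid, now=None):
        now = time.time() if now is None else now
        if pid not in self.last_seen:
            return "unknown"
        return "working" if now - self.last_seen[pid] <= self.timeout else "suspected"

# Example: with T=1, D=2, a heartbeat at t=0 keeps p "working" until t=3
det = HeartbeatDetector(T=1, D=2)
det.heartbeat("p", now=0.0)
print(det.status("p", now=2.0), det.status("p", now=3.5))  # -> working suspected
```

Passing `now` explicitly makes the timeout logic easy to exercise without waiting in real time.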
Synchronous vs asynchronous
• In a synchronous system there is a bound on message delivery time (and on clock drift)
• So this simple method gives a reliable failure detector
• In fact, it is possible to implement this simply as a function:
  – Send a message to process p, wait for 2D + ε time
  – A dedicated detector process is not necessary
• In asynchronous systems, things are much harder
Simple failure detector
• If we choose T or D too large, then it will take a long time for a failure to be detected
• If we select T too small, it increases communication costs and puts too much burden on the processes
• If we select D too small, then working processes may get labeled as failed/suspected
Assumptions and the real world
• In reality, both the synchronous and asynchronous models are too rigid
• Real systems are fast, but sometimes messages can take longer than usual
  – But not indefinitely long
• Messages usually get delivered, but sometimes not
Some more realistic failure detectors
• Have 2 values of D: D1, D2
  – Mark processes as working, suspected, or failed
• Use probabilities
  – Instead of synchronous/asynchronous, model delivery time as a probability distribution
  – We can learn the probability distribution of message delivery time, and accordingly estimate the probability of failure
Using Bayes' rule
• a = the event that a process fails within time T
• b = the event that a message is not received in T + D
• So, when we do not receive a message from a process, we want to estimate P(a|b)
  – The probability of a, given that b has occurred
P(a|b) = P(b|a) P(a) / P(b)

If the process has failed, i.e. a is true, then of course the message will not be received, i.e. P(b|a) = 1. Therefore:

P(a|b) = P(a) / P(b)
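Plugging in numbers (the failure and message-loss probabilities below are invented for illustration; P(b) is expanded by the law of total probability):

```python
# Estimate P(fail | no message) via Bayes' rule.
# Assumed numbers, for illustration only:
p_a = 0.01        # P(a): process fails within time T
p_loss = 0.02     # probability the message is lost even if the process is up

# P(b) by total probability: the message is missing if the process failed
# (P(b|a) = 1) or if it survived but the message was lost.
p_b = 1.0 * p_a + p_loss * (1 - p_a)

p_a_given_b = p_a / p_b   # since P(b|a) = 1
print(round(p_a_given_b, 3))  # -> 0.336
```

So even though the failure is missing-message evidence, a lossy network keeps the posterior well below 1.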
Leader of a computation
• Many distributed computations need a coordinating or server process
  – E.g. a central server for mutual exclusion
  – Initiating a distributed computation
  – Computing the sum/max using an aggregation tree
• We may need to elect a leader at the start of a computation
• We may need to elect a new leader if the current leader of the computation fails
The distinguished leader
• The leader must have a special property that other nodes do not have
• If all nodes are exactly identical in every way, then there is no algorithm to identify one as leader
• Our policy:
  – The node with the highest identifier is the leader
Ref: NL
Node with highest identifier
• If all nodes know the highest identifier (say n), we do not need an election
  – Everyone assumes n is the leader
  – n starts operating as leader
• But what if n fails? We cannot assume n-1 is the leader, since n-1 may have failed too! Or maybe there never was a process n-1
• Our policy:
  – The surviving node with the highest identifier is the leader
• We need an algorithm that finds the working node with the highest identifier
Strategy 1: Use an aggregation tree

[Figure: a tree over nodes with ids 2, 3, 5, 7, 8, rooted at r = 4; the maximum id, 8, is aggregated up to the root]

• Suppose node r detects that the leader has failed, and initiates leader election
• Node r creates a BFS tree
• Asks for the max node id to be computed via aggregation
  – Each node receives id values from its children
  – Each node computes the max of its own id and the received values, and forwards it to its parent
• Needs a tree construction
• If n nodes start the election, it will need n trees
  – O(n^2) communication
  – O(n) storage per node
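The aggregation step can be sketched as follows (a minimal sketch on a hypothetical tree; in the real algorithm each node sends its max to its parent as a message, while here recursion plays that role):

```python
# Compute the max id over an aggregation tree (illustrative sketch).

def aggregate_max(parent):
    """parent: dict mapping each node id to its parent id (root maps to None).
    Returns the maximum id, as the root would learn it from child messages."""
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)

    def subtree_max(node):
        # max of own id and the values "forwarded" by children
        return max([node] + [subtree_max(c) for c in children.get(node, [])])

    root = next(n for n, p in parent.items() if p is None)
    return subtree_max(root)

# A small hypothetical tree rooted at r = 4
tree = {4: None, 5: 4, 2: 4, 8: 5, 7: 5, 3: 2}
print(aggregate_max(tree))  # -> 8
```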
Strategy 2: Use a ring
• Suppose the network is a ring
  – We assume that each node has 2 pointers to nodes it knows about:
    • Next
    • Previous
    • (like a circular doubly linked list)
  – The actual network may not be a ring
  – This can be an overlay

[Figure: a ring of nodes with ids 6, 2, 4, 5, 3, 8]
Strategy 2: Use a ring
• Basic idea:
  – Suppose 6 starts the election
  – It sends "6" to 6.next, i.e. 2
  – 2 takes max(2, 6), sends it to 2.next
  – 8 takes max(8, 6), sends "8" to 8.next
  – etc.

[Figure: the election values 6, 6, 8, 8, 8 travelling around the ring along the next pointers]
Strategy 2: Use a ring
• The value "8" goes around the ring and comes back to 8
• Then 8 knows that "8" is the highest id
  – Since if there were a higher id, it would have stopped "8"
• 8 declares itself the leader: sends a message around the ring
Strategy 2: Use a ring
• The problem: what if multiple nodes start leader election at the same time?
• We need to adapt the algorithm slightly so that it works whenever a leader is needed, and works with multiple initiators
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• Every node has a default state: non-participant
• The starting node sets its state to participant and sends an election message with its id to next
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• If node p receives election message m:
  – If p is a non-participant:
    • Send max(m.id, p.id) to p.next
    • Set state to participant
  – If p is a participant:
    • If m.id > p.id: send m.id to p.next
    • If m.id < p.id: do nothing
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• If node p receives election message m with m.id = p.id:
  – p declares itself leader
  – Sets p.leader = p.id
  – Sends a leader message with p.id to p.next
• Any other node q receiving the leader message:
  – Sets q.leader = p.id
  – Forwards the leader message to q.next
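The whole election can be sketched as a message-driven simulation (a minimal sketch; the ring and starter ids are invented, and a simple FIFO queue stands in for the network):

```python
# Minimal sketch of the Chang-Roberts election on a simulated ring.

def chang_roberts(ring, starters):
    """ring: node ids in next-pointer order; starters: ids that initiate.
    Returns the elected leader id."""
    n = len(ring)
    nxt = {ring[i]: ring[(i + 1) % n] for i in range(n)}
    participant = {p: False for p in ring}
    queue = []                              # pending (destination, election id)
    for s in starters:
        participant[s] = True
        queue.append((nxt[s], s))
    while queue:
        p, mid = queue.pop(0)
        if mid == p:                        # own id came back: p is leader
            return p
        if not participant[p]:
            participant[p] = True
            queue.append((nxt[p], max(mid, p)))
        elif mid > p:
            queue.append((nxt[p], mid))     # forward larger ids; smaller ids die
    return None

print(chang_roberts([6, 2, 4, 5, 3, 8], starters=[6]))  # -> 8
```

With several simultaneous starters, lower election ids are absorbed along the way and only the highest id makes it back to its originator.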
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• Works in an asynchronous system
• Assumes nothing fails while the algorithm is executing
• Message complexity: O(n^2)
  – When does this occur?
  – (Hint: all nodes start the election, and many messages traverse a long distance)
• What is the time complexity?
• What is the storage complexity?
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• Assume all nodes want to know the leader
• k-neighborhood of node p:
  – The set of all nodes within distance k of p
• How does p send a message to distance k?
  – The message has a "time to live" variable, m.pl
  – Each node decrements m.pl on receiving
  – If m.pl = 0, don't forward any more
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• Basic idea:
  – Check growing regions around yourself for someone with a larger id
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• The algorithm operates in phases
• In phase 0, node p sends an election message m to both p.next and p.previous with:
  – m.id = p.id and m.pl = 1
• Suppose q receives this message:
  – Sets m.pl = 0
  – If q.id > m.id: do nothing
  – If q.id < m.id: return the message to p
• If p gets back both messages, it decides it is the leader of its 1-neighborhood, and proceeds to the next phase
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• If p is in phase i, node p sends an election message m to p.next and p.previous with:
  – m.id = p.id and m.pl = 2^i
• A node q, on receiving the message (from next/previous):
  – Sets m.pl = m.pl - 1
  – If q.id > m.id: do nothing
  – Else:
    • If m.pl = 0: return the message to the sending process
    • Else: forward it suitably to previous/next
• If p gets both messages back, it is the leader of its 2^i-neighborhood, and proceeds to phase i+1
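The phase structure can be sketched as a simplified simulation (the ring is invented; instead of passing probe messages with a pl counter, each phase directly checks whether a candidate's id is the maximum of its 2^i-neighborhood, which is exactly the condition for both probes to return):

```python
# Sketch of the Hirschberg-Sinclair phase structure on a simulated ring.

def hirschberg_sinclair(ring):
    n = len(ring)
    candidates = set(ring)
    i = 0
    while len(candidates) > 1:
        k = 2 ** i
        survivors = set()
        for pos, p in enumerate(ring):
            if p not in candidates:
                continue
            # ids within distance k of p, in both directions around the ring
            neighborhood = [ring[(pos + d) % n] for d in range(-k, k + 1)]
            if p == max(neighborhood):   # both probes would come back to p
                survivors.add(p)
        candidates = survivors           # losers drop out of the election
        i += 1
    return candidates.pop()

print(hirschberg_sinclair([6, 2, 4, 5, 3, 8]))  # -> 8
```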
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• When 2^i >= n/2:
  – Only 1 process survives: the leader
• Number of phases: O(log n)
• What is the message complexity?
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
In phase i:
• At most one node initiates messages in any sequence of 2^(i-1) nodes
• So, there are n/2^(i-1) candidates
  – Each sends 2 messages, going at most 2^i distance and returning: 2 · 2 · 2^i messages
• O(n) messages in phase i
• There are O(log n) phases
• Total of O(n log n) messages
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• Assumes synchronous operation
• Assumes nodes do not fail during the algorithm run
• What is the time complexity?
• What is the storage complexity?
Strategy 4: Bully algorithm
• Assume:
  – Each node knows the ids of all nodes in the system (some may have failed)
  – Synchronous operation
• Node p decides to initiate an election
• p sends an election message to all nodes with id > p.id
• If p does not hear an "I am alive" message from any node, p broadcasts a message declaring itself the leader
• Any working node q that receives an election message from p replies with its own id and an "I am alive" message
  – And starts an election (unless it is already in the process of an election)
• Any node that hears a lower-id node being declared leader starts a new election
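The takeover logic can be sketched with a global view of which nodes are up (a minimal sketch; the ids and the `alive` set are invented, and recursion stands in for the cascade of elections started by higher-id nodes):

```python
# Sketch of one bully-election round (illustrative; real nodes would
# exchange election and "I am alive" messages instead of sharing state).

def bully_election(all_ids, alive, initiator):
    """all_ids: ids known to every node; alive: set of working nodes.
    Returns the id that ends up declared leader."""
    assert initiator in alive
    # p contacts all higher-id nodes; any alive one replies "I am alive"
    # and takes over by starting its own election.
    higher_alive = [q for q in all_ids if q > initiator and q in alive]
    if not higher_alive:
        return initiator          # nobody higher answered: p declares itself
    # the takeover cascade always ends at the highest alive id
    return bully_election(all_ids, alive, max(higher_alive))

print(bully_election([1, 2, 3, 4, 5], alive={1, 2, 4}, initiator=1))  # -> 4
```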
Ref: CDK
Strategy 4: Bully algorithm
• Assume:
  – Each node knows the ids of all nodes in the system (some may have failed)
  – Synchronous operation
• Works even when processes fail
• Works when (some) message deliveries fail
• What are the storage and message complexities?