Distributed Systems
Failure Detection & Leader Election
Rik Sarkar
University of Edinburgh
Fall 2016
Failures
• How do we know that something has failed?
• Let's see what we mean by failed:
• Models of failure:
  1. Assume no failures
  2. Crash failures: a process may fail/crash
  3. Message failures: messages may get dropped
  4. Link failures: a communication link stops working
  5. Some combinations of 2, 3, 4
  6. More complex models can have recovery from failures
  7. Arbitrary failures: computation/communication may be erroneous
Distributed Systems, Edinburgh, 2016
Failure detectors
• Detection of a crashed process
  – (not one working erroneously)
• A major challenge in distributed systems
• A failure detector is a process that responds to questions asking whether a given process has failed
  – A failure detector is not necessarily accurate
Failure detectors
• Reliable failure detectors
  – Reply with "working" or "failed"
• Difficulty:
  – Detecting that something is working is easier: if it responds to a message, it is working
  – Detecting failure is harder: if it does not respond to the message, the message may have been lost/delayed, the process may be busy, etc.
• Unreliable failure detectors
  – Reply with "suspected (failed)" or "unsuspected"
  – That is, they do not try to give a confirmed answer
• We would ideally like reliable detectors, but unreliable ones (that give "maybe" answers) can be more realistic
Simple example
• Suppose we assume all messages are delivered within D seconds
• Then we can require each process to send a message every T seconds to the failure detectors
• If a failure detector does not get a message from process p in T + D seconds, it marks p as "suspected" or "failed" (depending on the type of detector)
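The timeout rule above can be sketched as a small heartbeat-based detector (a minimal sketch; the class name `HeartbeatDetector` and its interface are invented for illustration):

```python
import time

class HeartbeatDetector:
    """Marks a process as suspected if no heartbeat arrives within T + D seconds."""

    def __init__(self, T, D):
        self.timeout = T + D     # heartbeat period + max assumed delivery delay
        self.last_seen = {}      # process id -> time of last heartbeat

    def heartbeat(self, pid, now=None):
        # Record a heartbeat from process pid
        self.last_seen[pid] = time.time() if now is None else now

    def status(self, pid, now=None):
        now = time.time() if now is None else now
        if pid not in self.last_seen:
            return "unknown"
        return "working" if now - self.last_seen[pid] <= self.timeout else "suspected"

# Example: with T=1, D=2, a heartbeat at t=0 keeps p "working" until t=3
det = HeartbeatDetector(T=1, D=2)
det.heartbeat("p", now=0.0)
print(det.status("p", now=2.0), det.status("p", now=3.5))  # -> working suspected
```

Passing `now` explicitly makes the timeout logic easy to exercise without waiting in real time.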
Synchronous vs asynchronous
• In a synchronous system there is a bound on message delivery time (and on clock drift)
• So this simple method gives a reliable failure detector
• In fact, it is possible to implement this simply as a function:
  – Send a message to process p, wait for 2D + ε time
  – A dedicated detector process is not necessary
• In asynchronous systems, things are much harder
Simple failure detector
• If we choose T or D too large, then it will take a long time for a failure to be detected
• If we select T too small, it increases communication costs and puts too much burden on the processes
• If we select D too small, then working processes may get labeled as failed/suspected
Assumptions and the real world
• In reality, both the synchronous and asynchronous models are too rigid
• Real systems are fast, but sometimes messages can take longer than usual
  – But not indefinitely long
• Messages usually get delivered, but sometimes not
Some more realistic failure detectors
• Have 2 values of D: D1, D2
  – Mark processes as working, suspected, or failed
• Use probabilities
  – Instead of synchronous/asynchronous, model delivery time as a probability distribution
  – We can learn the probability distribution of message delivery time, and accordingly estimate the probability of failure
Using Bayes' rule
• a = the event that a process fails within time T
• b = the event that a message is not received in T + D
• So, when we do not receive a message from a process, we want to estimate P(a|b)
  – The probability of a, given that b has occurred
P(a|b) = P(b|a) P(a) / P(b)

If the process has failed, i.e. a is true, then of course the message will not be received, i.e. P(b|a) = 1. Therefore:

P(a|b) = P(a) / P(b)
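Plugging in numbers (the failure and message-loss probabilities below are invented for illustration; P(b) is expanded by the law of total probability):

```python
# Estimate P(fail | no message) via Bayes' rule.
# Assumed numbers, for illustration only:
p_a = 0.01        # P(a): process fails within time T
p_loss = 0.02     # probability the message is lost even if the process is up

# P(b) by total probability: the message is missing if the process failed
# (P(b|a) = 1) or if it survived but the message was lost.
p_b = 1.0 * p_a + p_loss * (1 - p_a)

p_a_given_b = p_a / p_b   # since P(b|a) = 1
print(round(p_a_given_b, 3))  # -> 0.336
```

So even though the failure is missing-message evidence, a lossy network keeps the posterior well below 1.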
Leader of a computation
• Many distributed computations need a coordinating or server process
  – E.g. a central server for mutual exclusion
  – Initiating a distributed computation
  – Computing the sum/max using an aggregation tree
• We may need to elect a leader at the start of a computation
• We may need to elect a new leader if the current leader of the computation fails
The distinguished leader
• The leader must have a special property that other nodes do not have
• If all nodes are exactly identical in every way, then there is no algorithm to identify one as leader
• Our policy:
  – The node with the highest identifier is the leader
Ref: NL
Node with highest identifier
• If all nodes know the highest identifier (say n), we do not need an election
  – Everyone assumes n is the leader
  – n starts operating as leader
• But what if n fails? We cannot assume n-1 is the leader, since n-1 may have failed too! Or maybe there never was a process n-1
• Our policy:
  – The surviving node with the highest identifier is the leader
• We need an algorithm that finds the working node with the highest identifier
Strategy 1: Use an aggregation tree

[Figure: a tree over nodes with ids 2, 3, 5, 7, 8, rooted at r = 4; the maximum id, 8, is aggregated up to the root]

• Suppose node r detects that the leader has failed, and initiates leader election
• Node r creates a BFS tree
• Asks for the max node id to be computed via aggregation
  – Each node receives id values from its children
  – Each node computes the max of its own id and the received values, and forwards it to its parent
• Needs a tree construction
• If n nodes start the election, it will need n trees
  – O(n^2) communication
  – O(n) storage per node
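The aggregation step can be sketched as follows (a minimal sketch on a hypothetical tree; in the real algorithm each node sends its max to its parent as a message, while here recursion plays that role):

```python
# Compute the max id over an aggregation tree (illustrative sketch).

def aggregate_max(parent):
    """parent: dict mapping each node id to its parent id (root maps to None).
    Returns the maximum id, as the root would learn it from child messages."""
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)

    def subtree_max(node):
        # max of own id and the values "forwarded" by children
        return max([node] + [subtree_max(c) for c in children.get(node, [])])

    root = next(n for n, p in parent.items() if p is None)
    return subtree_max(root)

# A small hypothetical tree rooted at r = 4
tree = {4: None, 5: 4, 2: 4, 8: 5, 7: 5, 3: 2}
print(aggregate_max(tree))  # -> 8
```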
Strategy 2: Use a ring
• Suppose the network is a ring
  – We assume that each node has 2 pointers to nodes it knows about:
    • Next
    • Previous
    • (like a circular doubly linked list)
  – The actual network may not be a ring
  – This can be an overlay

[Figure: a ring of nodes with ids 6, 2, 4, 5, 3, 8]
Strategy 2: Use a ring
• Basic idea:
  – Suppose 6 starts the election
  – It sends "6" to 6.next, i.e. 2
  – 2 takes max(2, 6), sends it to 2.next
  – 8 takes max(8, 6), sends "8" to 8.next
  – etc.

[Figure: the election values 6, 6, 8, 8, 8 travelling around the ring along the next pointers]
Strategy 2: Use a ring
• The value "8" goes around the ring and comes back to 8
• Then 8 knows that "8" is the highest id
  – Since if there were a higher id, it would have stopped "8"
• 8 declares itself the leader: sends a message around the ring
Strategy 2: Use a ring
• The problem: what if multiple nodes start leader election at the same time?
• We need to adapt the algorithm slightly so that it works whenever a leader is needed, and works with multiple initiators
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• Every node has a default state: non-participant
• The starting node sets its state to participant and sends an election message with its id to next
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• If node p receives election message m:
  – If p is a non-participant:
    • Send max(m.id, p.id) to p.next
    • Set state to participant
  – If p is a participant:
    • If m.id > p.id: send m.id to p.next
    • If m.id < p.id: do nothing
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• If node p receives election message m with m.id = p.id:
  – p declares itself leader
  – Sets p.leader = p.id
  – Sends a leader message with p.id to p.next
• Any other node q receiving the leader message:
  – Sets q.leader = p.id
  – Forwards the leader message to q.next
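The whole election can be sketched as a message-driven simulation (a minimal sketch; the ring and starter ids are invented, and a simple FIFO queue stands in for the network):

```python
# Minimal sketch of the Chang-Roberts election on a simulated ring.

def chang_roberts(ring, starters):
    """ring: node ids in next-pointer order; starters: ids that initiate.
    Returns the elected leader id."""
    n = len(ring)
    nxt = {ring[i]: ring[(i + 1) % n] for i in range(n)}
    participant = {p: False for p in ring}
    queue = []                              # pending (destination, election id)
    for s in starters:
        participant[s] = True
        queue.append((nxt[s], s))
    while queue:
        p, mid = queue.pop(0)
        if mid == p:                        # own id came back: p is leader
            return p
        if not participant[p]:
            participant[p] = True
            queue.append((nxt[p], max(mid, p)))
        elif mid > p:
            queue.append((nxt[p], mid))     # forward larger ids; smaller ids die
    return None

print(chang_roberts([6, 2, 4, 5, 3, 8], starters=[6]))  # -> 8
```

With several simultaneous starters, lower election ids are absorbed along the way and only the highest id makes it back to its originator.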
Strategy 2: Use a ring (algorithm by Chang and Roberts)
• Works in an asynchronous system
• Assumes nothing fails while the algorithm is executing
• Message complexity: O(n^2)
  – When does this occur?
  – (Hint: all nodes start the election, and many messages traverse a long distance)
• What is the time complexity?
• What is the storage complexity?
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• Assume all nodes want to know the leader
• k-neighborhood of node p:
  – The set of all nodes within distance k of p
• How does p send a message to distance k?
  – The message has a "time to live" variable, m.pl
  – Each node decrements m.pl on receiving
  – If m.pl = 0, don't forward any more
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• Basic idea:
  – Check growing regions around yourself for someone with a larger id
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• The algorithm operates in phases
• In phase 0, node p sends an election message m to both p.next and p.previous with:
  – m.id = p.id and m.pl = 1
• Suppose q receives this message:
  – Sets m.pl = 0
  – If q.id > m.id: do nothing
  – If q.id < m.id: return the message to p
• If p gets back both messages, it decides it is the leader of its 1-neighborhood, and proceeds to the next phase
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• If p is in phase i, node p sends an election message m to p.next and p.previous with:
  – m.id = p.id and m.pl = 2^i
• A node q, on receiving the message (from next/previous):
  – Sets m.pl = m.pl - 1
  – If q.id > m.id: do nothing
  – Else:
    • If m.pl = 0: return the message to the sending process
    • Else: forward it suitably to previous/next
• If p gets both messages back, it is the leader of its 2^i-neighborhood, and proceeds to phase i+1
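The phase structure can be sketched as a simplified simulation (the ring is invented; instead of passing probe messages with a pl counter, each phase directly checks whether a candidate's id is the maximum of its 2^i-neighborhood, which is exactly the condition for both probes to return):

```python
# Sketch of the Hirschberg-Sinclair phase structure on a simulated ring.

def hirschberg_sinclair(ring):
    n = len(ring)
    candidates = set(ring)
    i = 0
    while len(candidates) > 1:
        k = 2 ** i
        survivors = set()
        for pos, p in enumerate(ring):
            if p not in candidates:
                continue
            # ids within distance k of p, in both directions around the ring
            neighborhood = [ring[(pos + d) % n] for d in range(-k, k + 1)]
            if p == max(neighborhood):   # both probes would come back to p
                survivors.add(p)
        candidates = survivors           # losers drop out of the election
        i += 1
    return candidates.pop()

print(hirschberg_sinclair([6, 2, 4, 5, 3, 8]))  # -> 8
```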
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• When 2^i >= n/2:
  – Only 1 process survives: the leader
• Number of phases: O(log n)
• What is the message complexity?
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
In phase i:
• At most one node initiates messages in any sequence of 2^(i-1) nodes
• So, there are n/2^(i-1) candidates
  – Each sends 2 messages, going at most 2^i distance and returning: 2 · 2 · 2^i messages
• O(n) messages in phase i
• There are O(log n) phases
• Total of O(n log n) messages
Strategy 3: Use a ring – smartly (Hirschberg-Sinclair)
• Assumes synchronous operation
• Assumes nodes do not fail during the algorithm run
• What is the time complexity?
• What is the storage complexity?
Strategy 4: Bully algorithm
• Assume:
  – Each node knows the ids of all nodes in the system (some may have failed)
  – Synchronous operation
• Node p decides to initiate an election
• p sends an election message to all nodes with id > p.id
• If p does not hear an "I am alive" message from any node, p broadcasts a message declaring itself the leader
• Any working node q that receives an election message from p replies with its own id and an "I am alive" message
  – And starts an election (unless it is already in the process of an election)
• Any node that hears a lower-id node being declared leader starts a new election
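The takeover logic can be sketched with a global view of which nodes are up (a minimal sketch; the ids and the `alive` set are invented, and recursion stands in for the cascade of elections started by higher-id nodes):

```python
# Sketch of one bully-election round (illustrative; real nodes would
# exchange election and "I am alive" messages instead of sharing state).

def bully_election(all_ids, alive, initiator):
    """all_ids: ids known to every node; alive: set of working nodes.
    Returns the id that ends up declared leader."""
    assert initiator in alive
    # p contacts all higher-id nodes; any alive one replies "I am alive"
    # and takes over by starting its own election.
    higher_alive = [q for q in all_ids if q > initiator and q in alive]
    if not higher_alive:
        return initiator          # nobody higher answered: p declares itself
    # the takeover cascade always ends at the highest alive id
    return bully_election(all_ids, alive, max(higher_alive))

print(bully_election([1, 2, 3, 4, 5], alive={1, 2, 4}, initiator=1))  # -> 4
```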
Ref: CDK
Strategy 4: Bully algorithm
• Assume:
  – Each node knows the ids of all nodes in the system (some may have failed)
  – Synchronous operation
• Works even when processes fail
• Works when (some) message deliveries fail
• What are the storage and message complexities?