ZooKeeper Continued and MapReduce - brown...
ZooKeeper Continued and MapReduce
ZooKeeper and Chubby

Today
• ZooKeeper wrap-up
• MapReduce (big data analytics at Google)
• Grades
• Closing remarks
• Critical reviews

API                     ZooKeeper                               Chubby
                        -                                       Close/Open()
                        delete(path, expectedVersion)           Delete()
                        create(path, data, acl, flags)          -
                        setData(path, data, expectedVersion)    setContent()
                        getData(path, watch)                    getContentAndStat()
                        getChildren(path, watch)                use getContent() on directory
                        exists(path, watch)                     -
                        sync(path)                              -
Lock-related calls      -                                       Acquire()/TryAcquire()
                                                                Release
Sequence-number calls   Implicitly managed:                     Explicitly managed:
                        • Flag passed to create() requests      • GetSequencer()
                          version                               • SetSequencer()
                        • ZK increments ID after creating       • CheckSequencer()
                          files
                        • ID is used as expectedVersion
• No need for Open/Close in ZooKeeper because all calls have the path in them
• No locks in ZooKeeper
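
The expectedVersion argument on setData and delete in the ZooKeeper column supports optimistic updates without locks: a write succeeds only if nobody else has written since the caller last read. The following is a minimal in-memory sketch of that idea, not the real ZooKeeper client. `VersionedStore`, `BadVersion`, and `increment` are hypothetical names chosen for illustration.

```python
class BadVersion(Exception):
    """Raised when expected_version does not match the stored version."""


class VersionedStore:
    """Toy in-memory stand-in for a ZooKeeper-like namespace (hypothetical)."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (data, 0)

    def get_data(self, path):
        return self._nodes[path]  # (data, version)

    def set_data(self, path, data, expected_version):
        _, version = self._nodes[path]
        if expected_version != version:
            raise BadVersion(path)
        self._nodes[path] = (data, version + 1)
        return version + 1


def increment(store, path):
    """Read-modify-write that retries if another client wrote first."""
    while True:
        data, version = store.get_data(path)
        try:
            return store.set_data(path, data + 1, version)
        except BadVersion:
            continue  # someone else updated the node; re-read and retry
```

A writer holding a stale version gets `BadVersion` instead of silently overwriting a newer value, which is how version checks can stand in for a lock around a read-modify-write.
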

Read/Write Interaction
[Figure: two diagrams, one for ZooKeeper and one for Chubby, each showing a leader L and nodes labeled C1. Read: blue. Write: red.]
• Writes: linearizable (go through leaders)
• Reads: in ZooKeeper, reads can be served locally by any node. In Chubby, they must go through the leader
  – Weaker read semantics
• Requests: clients can send multiple requests at a time
  – Requests served in FIFO order
    • TCP provides FIFO

[Figure: file tree rooted at /theoapp/ containing .../config, ../members/ (with …/IP1, …/IP2, …/IP3) and ../leader/ (with …/cand1, …/cand2, …/cand3). Clients C1 create ephemeral files and watch files.]

Leader Election (getting locks)
• Try to create the file “/theoapp/leader”: the first to create it is the leader
• The file is ephemeral: if the leader dies, the file dies
• Monitor the file and, when it disappears, try to create it and become leader
• Everyone detects the deletion and tries to create the file at the same time
• Creates lots of inefficiency
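
The election steps above can be sketched with a small in-memory simulation. This is not the ZooKeeper client API: `ToyCoordinator`, `create_ephemeral`, `watch_delete`, and `client_died` are hypothetical stand-ins for ephemeral file creation, a deletion watch, and session expiry. The candidate names are illustrative.

```python
class ToyCoordinator:
    """In-memory stand-in for a ZooKeeper-like service (hypothetical)."""

    def __init__(self):
        self.files = {}        # path -> owning client
        self.watchers = {}     # path -> callbacks fired when the path is deleted
        self.create_attempts = 0

    def create_ephemeral(self, path, client):
        """Return True if this client created the file, False if it exists."""
        self.create_attempts += 1
        if path in self.files:
            return False
        self.files[path] = client
        return True

    def watch_delete(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def client_died(self, client):
        """Session end: remove the client's ephemeral files and fire watches."""
        for path in [p for p, owner in self.files.items() if owner == client]:
            del self.files[path]
            # Watches are one-shot: take the list, then notify everyone.
            for callback in self.watchers.pop(path, []):
                callback()


LEADER = "/theoapp/leader"


def run_for_leader(zk, client):
    if not zk.create_ephemeral(LEADER, client):
        # Lost the race: wait for the file to disappear, then try again.
        zk.watch_delete(LEADER, lambda: run_for_leader(zk, client))
```

With three candidates, the death of the leader wakes both remaining candidates, and both issue a create even though only one can win. That burst of requests is the inefficiency the slide refers to, and it grows with the number of waiting candidates.
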

Group Membership
• Ephemeral files
• Nodes watch for updates
• Nodes read the children
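
A minimal sketch of the group membership pattern, assuming a toy in-memory directory rather than a real ZooKeeper ensemble. `ToyMembership`, `join`, and `session_closed` are hypothetical names. The member names follow the IP1/IP2 labels from the figure.

```python
class ToyMembership:
    """In-memory stand-in for ephemeral files under a members directory."""

    def __init__(self):
        self.children = {}   # child name -> owning session
        self.watchers = []   # callbacks fired on any change to the children

    def join(self, session, name):
        self.children[name] = session
        self._notify()

    def session_closed(self, session):
        """Ephemeral files vanish when the owning session ends."""
        gone = [n for n, s in self.children.items() if s == session]
        for name in gone:
            del self.children[name]
        if gone:
            self._notify()

    def get_children(self, watch=None):
        if watch is not None:
            self.watchers.append(watch)
        return sorted(self.children)

    def _notify(self):
        # One-shot watches: clear before calling so callbacks can re-register.
        pending, self.watchers = self.watchers, []
        for callback in pending:
            callback()


group = ToyMembership()
view = []


def refresh():
    # Re-read the children and re-arm the watch on every notification.
    view[:] = group.get_children(watch=refresh)


refresh()
group.join("s1", "IP1")
group.join("s2", "IP2")
group.session_closed("s1")   # IP1 disappears without an explicit leave call
```

Because the files are ephemeral, a crashed member is removed from every observer's view without any explicit "leave" message.
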

Leader changing node configuration
• First delete “/theoapp/ready”
• Clients get notification of the deletion. If a client is trying to read, once the notification arrives the client stops
• Update “/theoapp/config”
• Create “/theoapp/ready”
• Clients get notification of the creation. Now they know the configs have changed and they can read the config file
[Figure: file tree with .../ready]
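
The ready-file protocol above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the ZooKeeper API: `ToyStore`, `ConfigClient`, and `change_config` are hypothetical, and the "replicas" setting used in the usage note is made up.

```python
READY = "/theoapp/ready"
CONFIG = "/theoapp/config"


class ToyStore:
    """In-memory stand-in for a ZooKeeper-like namespace (hypothetical)."""

    def __init__(self):
        self.files = {}
        self.watchers = {}  # path -> one-shot callbacks for create/delete

    def exists(self, path, watch=None):
        if watch is not None:
            self.watchers.setdefault(path, []).append(watch)
        return path in self.files

    def create(self, path, data=None):
        self.files[path] = data
        self._fire(path)

    def delete(self, path):
        del self.files[path]
        self._fire(path)

    def _fire(self, path):
        for callback in self.watchers.pop(path, []):
            callback()


class ConfigClient:
    def __init__(self, store):
        self.store = store
        self.config = None
        self.refresh()

    def refresh(self):
        # Re-arm the watch, then read the config only if it is marked ready.
        if self.store.exists(READY, watch=self.refresh):
            self.config = dict(self.store.files[CONFIG])
        else:
            self.config = None  # update in progress: stop using the config


def change_config(store, updates):
    store.delete(READY)                  # clients are notified and stop reading
    store.files[CONFIG].update(updates)  # may take several writes in practice
    store.create(READY)                  # clients are notified and re-read
```

For example, after `store.create(CONFIG, {"replicas": 3})` and `store.create(READY)`, a `ConfigClient(store)` sees the config. Calling `change_config(store, {"replicas": 5})` makes its view go to `None` during the update and then to the new settings, so it never observes a half-written configuration.
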

ZooKeeper v. Chubby
• Lessons from Chubby
  – Most requests are read/keep-alive
  – Few developers use locks
  – File system API is easy to use
• ZooKeeper = Chubby with weaker read semantics and no locks
  – Weaker read semantics: client can read from any node
  – Writes must still go through the leader
  – While there are no locks, you can implement locks
  – Enables async requests: FIFO execution

MapReduce: Big Data Analytics at Google

Google Environment
• Lots (tens of thousands) of computers
  – all more-or-less equal
    • processor, disk, memory, network interface
  – no specialized servers
  – even if only .01% are down at any one moment, many will be down
• Simple jobs become complicated
  – Lots of servers -> scale to many nodes
    • Partition data
    • Partition processing/computing
  – Commodity servers -> fault tolerance

MapReduce
• MapReduce: language API/library to hide complexity
  – Performance
  – Availability
  – Scalability
• All queries are modeled as Map and Reduce
  – Map: take a set of data entries and apply an operator on them
  – Reduce: take the intermediate data and combine them

MapReduce
• map
  – for each pair in a set of key/value pairs, produce a set of new key/value pairs
• reduce
  – for each key
    • look at all the values associated with that key and compute a smaller set of values

Implementation Sketch (1)
[Figure: input on GFS (split 0, split 1, …, split M-1) -> map phase: M workers -> intermediate files (on local disks) partitioned into R pieces -> reduce phase: R workers -> output files (on GFS). A master and the workers are shown.]

Example
Goal: count the number of times a word appears in all documents
• Map: find all the words; for each occurrence, create <word, 1>
• Reduce: sum up all <key, value> pairs with the same word (key)

    map(String key, String value) {
      // key: document name
      // value: document contents
      for each word w in value
        EmitIntermediate(w, 1);
    }

    reduce(String key, Iterator values) {
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values
        result += v;
      Emit(result);
    }
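
For readers who want to run the word count, here is a single-process Python sketch of the same map and reduce functions. The driver `run_mapreduce` is a hypothetical stand-in for the real framework: it does the grouping by key in memory and ignores distribution, GFS, and failures.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document contents."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word, values: the counts emitted for that word."""
    return sum(values)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Single-process driver: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for out_key, out_value in map_fn(name, contents):
            groups[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

Calling `run_mapreduce({"d1": "the quick fox", "d2": "the lazy dog the end"}, map_fn, reduce_fn)` returns a dictionary of counts in which "the" maps to 3.
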

Implementation Sketch (2)
• Map’s input pairs divided into M splits
  – stored in GFS
  – Splits allow for parallelism
• Output of Map / input of Reduce divided into R pieces
• One master process is in charge: farms out work to W (<< M+R) worker machines
[Figure: input on GFS (split 0, split 1, …, split M-1) feeding the map phase: M workers]
Three types of processes
• Master
• Reducer (worker)
• Mapper (worker)

Implementation Sketch (3)
• Master partitions splits among some of the workers
  – each worker passes pairs to the user-supplied map function
  – results stored in local files
    • partitioned into R pieces
      – e.g., hash(key) mod R
  – remaining workers perform reduce tasks
    • the R pieces are partitioned among them
    • place remote procedure calls to map workers to get data
    • put output in GFS
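
The hash(key) mod R partitioning above can be sketched in a few lines. One assumption to flag: Python's built-in `hash()` for strings varies between processes, so this sketch substitutes `zlib.crc32` as a stable hash so that every map worker sends a given key to the same piece. The function names are hypothetical.

```python
import zlib


def partition(key, num_reduce_tasks):
    """Pick the reduce piece for a key: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_map_output(pairs, num_reduce_tasks):
    """Split one map task's output into R pieces, one per reduce task."""
    pieces = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        pieces[partition(key, num_reduce_tasks)].append((key, value))
    return pieces
```

Since the piece depends only on the key, all values for a given word end up at the same reduce task regardless of which map worker produced them.
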
[Figure: input on GFS (split 0, split 1, …, split M-1) -> map phase: M workers -> intermediate files (on local disks) partitioned into R pieces -> reduce phase: R workers -> output files (on GFS). A master and the workers are shown.]
Figure note: the intermediate files are stored locally and will be lost on worker failure

Committing Data
• Map task
  – output kept in R local files
  – locations sent to master only on task completion
• Reduce task
  – output stored on GFS using a temporary name
  – file atomically renamed on task completion (to final name)
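
The write-then-rename commit used by reduce tasks can be illustrated on a local filesystem. This is only a sketch: it uses the local disk as a stand-in for GFS, and `commit_output` is a hypothetical helper name.

```python
import os
import tempfile


def commit_output(final_path, lines):
    """Write to a temporary name, then atomically rename to the final name."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # Atomic on POSIX when both names are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of the final name see either no file (or the previous one) or the complete output, never a partial write. If a task is executed twice, the second rename simply replaces the first result with an equivalent file.
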

Coping with Failure (1)
• Master maintains state of each task
  – idle (not started)
  – in progress
  – completed
• Master pings workers periodically to determine if they’re up

Coping with Failure (2)
• Worker crashes
  – in-progress tasks have state set back to idle
    • all output is lost
    • restarted from beginning on another worker
  – completed map tasks
    • all output lost
    • restarted from beginning on another worker
    • reduce tasks using output are notified of new worker

Coping with Failure (3)
• Worker crashes (continued)
  – completed reduce tasks
    • output already on GFS
    • no restart necessary
• Master crashes
  – could be recovered from checkpoint
  – in practice
    • master crashes are rare
    • entire application is restarted
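
The worker-failure rules from the three slides above can be condensed into a small bookkeeping sketch. All names here (`Master`, `assign`, `complete`, `worker_failed`) are hypothetical, and pinging, scheduling, and notifying reduce tasks are left out.

```python
IDLE, IN_PROGRESS, COMPLETED = "idle", "in progress", "completed"


class Master:
    """Toy bookkeeping for task states (hypothetical names throughout)."""

    def __init__(self, map_tasks, reduce_tasks):
        self.tasks = {}  # task id -> [kind, state, worker]
        for task in map_tasks:
            self.tasks[task] = ["map", IDLE, None]
        for task in reduce_tasks:
            self.tasks[task] = ["reduce", IDLE, None]

    def assign(self, task, worker):
        self.tasks[task][1:] = [IN_PROGRESS, worker]

    def complete(self, task):
        self.tasks[task][1] = COMPLETED

    def worker_failed(self, worker):
        """Called when a worker stops answering pings; returns tasks to rerun."""
        rerun = []
        for task, (kind, state, owner) in self.tasks.items():
            if owner != worker:
                continue
            lost_in_progress = state == IN_PROGRESS
            # Completed map output lives on the dead worker's local disk.
            lost_map_output = state == COMPLETED and kind == "map"
            if lost_in_progress or lost_map_output:
                self.tasks[task][1:] = [IDLE, None]
                rerun.append(task)
        return rerun
```

The asymmetry is the point of the design: a completed reduce task is never rerun because its output is already on GFS, while a completed map task must be rerun because its output was stored only on the failed machine.
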

Retrospective
• MapReduce -> Yahoo Hadoop (now Apache)
  – Years of research created Spark
• See: http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html, which argues that MapReduce is:
  1) a giant step backward in the programming paradigm for large-scale data intensive applications
  2) a sub-optimal implementation, in that it uses brute force instead of indexing
  3) not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
  4) missing most of the features that are routinely included in current DBMS
  5) incompatible with all of the tools DBMS users have come to depend on

Current Grading
• HW2 and HW3 done. Will be released today
  – HW2: median: 90, std-dev: 10
  – HW3: median: 91, std-dev: 9
• Projects
  – Tapestry being finished
  – Will start Raft grading this weekend
  – Grading takes a while due to partial credit

Final Grades
• Course Rubric
  – Projects: 50%
  – HWs: 20%
  – Midterm: 10%
  – Final: 20%
• Individual projects/midterm have been curved
• Final grade will also be curved

Closing Remarks
• Distributed systems: the art of providing consensus while tackling failures while providing high performance
• Failure v. Performance v. Correctness/consensus
  – Different types of failures -> different implications (different detectors)
    • Mostly heartbeats
  – Depends on the application: sometimes you don’t need linearizable (total ordering) on all events
    • ZooKeeper: reads are not linearizable
    • Dynamo/Cassandra: reads/writes are causally consistent
  – Performance: shard and distribute
    • Consistent hashing to find shards
• Final on Monday 5/14 at 2pm