MapReduce & Pig & Spark
Ioanna Miliou, Giuseppe Attardi
Advanced Programming, Università di Pisa

Hadoop
• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
• Framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• It is designed to detect and handle failures at the application layer.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.

Hadoop
• The project includes these modules:
– Hadoop Common: The common utilities that support the other Hadoop modules.
– Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
– Hadoop YARN: A framework for job scheduling and cluster resource management.
– Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Hadoop
• Other Hadoop-related projects at Apache include:
– Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop.
– Avro: A data serialization system.
– Cassandra: A scalable multi-master database with no single points of failure.
– Chukwa: A data collection system for managing large distributed systems.
– HBase: A scalable, distributed database that supports structured data storage for large tables.
– Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
– Mahout: A scalable machine learning and data mining library.
– Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
– ZooKeeper: A high-performance coordination service for distributed applications.
– Pig: A high-level data-flow language and execution framework for parallel computation.
– Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Hadoop Stack

What is MapReduce?
• MapReduce is the heart of Hadoop®
• Programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
Proposed by Dean and Ghemawat at Google

What is it?
• Processing engine of Hadoop
• Used for big data batch processing
• Parallel processing of huge data volumes
• Fault tolerant
• Scalable

Why use it?
• Your data is in the terabyte/petabyte range
• You have huge I/O
• The Hadoop framework takes care of
– Job and task management
– Failures
– Storage
– Replication
You just write Map and Reduce jobs

Big Users
• Users
– Facebook
– Yahoo
– Amazon
– eBay
• Providers
– Amazon
– Cloudera
– Hortonworks
– MapR

Map & Reduce
The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.
1. The map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
map(k1, v1) → list(k2, v2)
2. The reduce job, which takes the output from a map as input and combines those data tuples into a smaller set of tuples.
reduce(k2, list(v2)) → list(v2)
As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
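The two signatures above can be exercised on a single machine with a small driver. The following is only a sketch of the programming model, not of Hadoop itself: the names `run_mapreduce`, `first_letter_mapper`, and `sum_reducer` are hypothetical, and everything runs sequentially in one process.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle/sort, and reduce sequentially in one process."""
    # Map: every (k1, v1) record yields zero or more (k2, v2) pairs.
    # Shuffle: pairs are grouped by their intermediate key k2.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)

    # Reduce: every (k2, list(v2)) group yields output values.
    # Keys are visited in sorted order, as after a shuffle/sort phase.
    output = []
    for k2 in sorted(groups):
        for result in reducer(k2, groups[k2]):
            output.append((k2, result))
    return output


# Hypothetical job: total length of the words starting with each letter.
def first_letter_mapper(_key, word):
    yield word[0], len(word)


def sum_reducer(_key, values):
    yield sum(values)
```

Calling `run_mapreduce([(0, "apple"), (1, "avocado"), (2, "banana")], first_letter_mapper, sum_reducer)` groups the lengths by first letter before summing them, which shows why reduce can only start once the map output for a key is complete.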

Typical Problem solved by MapReduce
• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results
Data flow (diagram): Input Data → Map (several in parallel) → Shuffle → Reduce (several in parallel) → Results

Example: Word Count in Web Pages
A typical exercise for a new engineer in his or her first week
• Input is files with one document per record
• Specify a map function that takes a key/value pair
key = document URL
value = document contents
• Output of map function is (potentially many) key/value pairs. In our case, output (word, “1”) once per word in the document
Input: “document1”, “Apple Orange Mango Orange Grapes Plum”
Map output: “Apple”, “1”; “Orange”, “1”; “Mango”, “1”; …

Example continued: Word Count in Web Pages
• MapReduce library gathers together all pairs with the same key (shuffle/sort)
• The reduce function combines the values for a key. In our case, compute the sum
• Output of reduce is paired with key and saved
After shuffle/sort:
key = “Apple”, values = “1”
key = “Mango”, values = “1”
key = “Orange”, values = “1”, “1”
key = “Plum”, values = “1”
key = “Grapes”, values = “1”
Reduce output: “Apple”, “1”; “Orange”, “2”; “Mango”, “1”; “Grapes”, “1”; “Plum”, “1”
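The word count walk-through can be reproduced in a few lines of Python. This is a minimal single-process sketch; the function names are hypothetical, and counts are kept as strings only to mirror the (word, “1”) pairs used on the slides.

```python
from collections import defaultdict


def map_word_count(url, contents):
    """Emit (word, "1") once per word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_word_count(word, counts):
    """Sum the "1" strings gathered for one word."""
    return word, str(sum(int(c) for c in counts))


def word_count(documents):
    # Shuffle/sort step: gather all values that share the same key.
    grouped = defaultdict(list)
    for url, contents in documents:
        for word, one in map_word_count(url, contents):
            grouped[word].append(one)
    # Reduce step: one call per distinct word.
    return dict(reduce_word_count(w, c) for w, c in grouped.items())


if __name__ == "__main__":
    docs = [("document1", "Apple Orange Mango Orange Grapes Plum")]
    print(word_count(docs))
```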

Example Pseudo-code
map()  reduce()

MapReduce wrappers
Wrappers have been developed in order to:
• provide better control over the MapReduce code
• aid in source code development
Some well-known examples:
• Sawzall (Google)
• Pig (originally Yahoo, now Apache)
• Hive (Facebook)
• DryadLINQ (Microsoft)

Widely applicable at Google
• Implemented as a C++ library linked to user programs
• Can read and write many different data types
Example uses:

Example: Generating Language Model Statistics
• Used in the statistical machine translation system
o need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4)
• Easy with MapReduce:
o map: extract 5-word sequences => count from document
o reduce: combine counts, and keep if count large enough
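A small Python sketch of this job, under the assumption that documents are plain strings tokenized on whitespace. The function names are hypothetical; the sequence length of 5 and the threshold of 4 come from the slide.

```python
def map_ngrams(document, n=5):
    """Emit (n-word sequence, 1) for every window of n consecutive words."""
    words = document.split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1


def reduce_ngrams(ngram, counts, min_count=4):
    """Combine counts and keep the sequence only if it is frequent enough."""
    total = sum(counts)
    if total >= min_count:
        yield ngram, total


def ngram_statistics(corpus, n=5, min_count=4):
    # Shuffle: group the partial counts by n-gram.
    grouped = {}
    for document in corpus:
        for ngram, count in map_ngrams(document, n):
            grouped.setdefault(ngram, []).append(count)
    # Reduce: sum and filter.
    result = {}
    for ngram, counts in grouped.items():
        for key, total in reduce_ngrams(ngram, counts, min_count):
            result[key] = total
    return result
```

Note that the filter lives in the reducer: only there is the global count for a sequence known.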

Example: Joining with Other Data
• Example: generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
o per-host information might be in per-process data structure, or might involve RPC to a set of machines containing data for all sites
• Easy with MapReduce:
o map: extract host name from URL, look up per-host info, combine with per-doc data and emit
o reduce: identity function (just emit key/value directly)
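A map-side join of this kind might look as follows in Python. This is a sketch under assumptions: the per-host information is modelled as an in-process dictionary (the slide also allows an RPC lookup), and the record fields `title` and `pages_on_host` are hypothetical.

```python
from urllib.parse import urlparse


def map_join(url, doc_summary, host_info):
    """Extract the host from the URL and merge per-host info into the summary."""
    host = urlparse(url).netloc
    info = host_info.get(host, {})  # look up per-host info (in-process table)
    yield url, {**doc_summary, "host": host, **info}


def reduce_identity(key, values):
    """Identity reduce: emit every key/value pair unchanged."""
    for value in values:
        yield key, value


def join_docs(docs, host_info):
    output = []
    for url, summary in docs:
        for key, value in map_join(url, summary, host_info):
            output.extend(reduce_identity(key, [value]))
    return output
```

All the work happens in the map phase, which is why the reduce step can be the identity function.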

MapReduce: Scheduling
• One master, many workers
o Input data split into M map tasks (typically 64 MB in size)
o Reduce phase partitioned into R reduce tasks
o Tasks are assigned to workers dynamically
o Often: M = 200000, R = 4000, workers = 2000
• Master assigns each map task to a free worker
o Considers locality of data to worker when assigning task
o Worker reads task input (often from local disk)
o Worker produces R local files containing intermediate k/v pairs
• Master assigns each reduce task to a free worker
o Worker reads intermediate k/v pairs from map workers
o Worker sorts & applies user’s Reduce op to produce the output
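How a map worker could split its output into R partitions can be sketched with a hash of the key modulo R. The sketch below is an illustration, not Hadoop code: in-memory lists stand in for the R local files, and `zlib.crc32` is an assumed choice of hash, picked because it is stable across Python processes.

```python
import zlib


def partition(key, num_reduce_tasks):
    """Choose the reduce task for a key: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def split_map_output(pairs, num_reduce_tasks):
    """Distribute intermediate (key, value) pairs over R buckets.

    Each bucket stands in for one of the R local files a map worker writes.
    """
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return buckets
```

Because the partition depends only on the key, every pair with the same key ends up at the same reduce task, no matter which map worker produced it.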

Task Granularity and Pipelining
Fine granularity tasks: many more map tasks than machines
• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
Often use 200,000 map/5000 reduce tasks w/ 2000 machines

Fault tolerance: Handled via re-execution
• On worker failure:
o Detect failure via periodic heartbeats
o Re-execute completed and in-progress map tasks
o Re-execute in-progress reduce tasks
o Task completion committed through master
• Master failure:
o State is checkpointed: new master recovers & continues
Robust: Once Google lost 1600 of 1800 machines, but finished fine

Refinement: Backup tasks
• Slow workers significantly lengthen completion time
o Other jobs consuming resources on machine
o Bad disks with soft errors transfer data very slowly
o Weird things: processor caches disabled (!!)
• Solution: Near end of phase, spawn backup copies of tasks
o Whichever one finishes first "wins"
• Effect: Dramatically shortens job completion time

Refinement: Locality Optimization
Master scheduling policy:
• Asks for locations of replicas of input file blocks
• Map task inputs are typically split into 64 MB blocks
• Map tasks are scheduled so that an input block replica is on the same machine or the same rack
Effect: Thousands of machines read input at local disk speed
• Without this, rack switches limit read rate

Refinement: Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs
• Best solution is to debug & fix, but not always possible
On seg fault:
• Send UDP packet to master from signal handler
• Include sequence number of record being processed
If master sees K failures for same record (typically K set to 2 or 3):
• Next worker is told to skip the record
Effect: Can work around bugs in third-party libraries
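The bookkeeping behind this refinement can be simulated in one process. In the sketch below a Python exception stands in for the seg fault and its UDP report, a `Counter` stands in for the master's per-record failure count, and `run_with_skipping` is a hypothetical name.

```python
from collections import Counter


def run_with_skipping(records, process, max_failures=2):
    """Process records, skipping any record that fails max_failures times."""
    failures = Counter()  # master's view: failures per record sequence number
    results = {}
    skipped = []
    for seq, record in enumerate(records):
        while failures[seq] < max_failures:
            try:
                results[seq] = process(record)
                break  # success: move on to the next record
            except Exception:
                # Stand-in for the crash report carrying the sequence number.
                failures[seq] += 1
        else:
            # K failures seen for this record: the next attempt skips it.
            skipped.append(seq)
    return results, skipped
```

With `process = lambda r: 10 // r` and records `[1, 0, 5]`, the second record fails on every attempt and is skipped while the other two complete normally.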

Other Refinements
• Optional secondary keys for ordering
• Compression of intermediate data
• Combiner: useful for saving network bandwidth
• Local execution for debugging/testing
• User-defined counters
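To illustrate why a combiner saves network bandwidth, the following sketch pre-aggregates word counts on the map side before anything would be sent to a reducer. It assumes a word count job; the function names are hypothetical.

```python
from collections import Counter


def map_words(text):
    """Emit (word, 1) for every word of the input split."""
    for word in text.split():
        yield word, 1


def combine(pairs):
    """Sum counts locally so that each word is sent over the network once."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


if __name__ == "__main__":
    raw = list(map_words("to be or not to be"))
    combined = combine(raw)
    # Fewer pairs leave the map worker after combining.
    print(len(raw), "pairs before,", len(combined), "pairs after")
```

This works because addition is associative and commutative, so partial sums computed on each map worker can be summed again by the reducer.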

“Play around”
• Amazon Elastic MapReduce (Amazon EMR)
• Hortonworks Sandbox
• MapR Sandbox for Hadoop
• Qubole
• Microsoft Azure HDInsight
• Cloudera

MapReduce examples in Java

Serializable vs Writable
• Serializable stores the class name and the object representation to the stream; other instances of the class are referred to by a handle to the class name: this approach is not usable with random access
• For the same reason, the sorting needed for the shuffle and sort phase cannot be done with Serializable
• The deserialization process creates a new instance of the object, while Hadoop needs to reuse objects to minimize computation
• Hadoop introduces the two interfaces Writable and WritableComparable that solve these problems

Writable wrappers

Implementing Writable: the SumCount class

Glossary

WordCount
• http://www.gutenberg.org/cache/epub/201/pg201.txt
• Input Data: The text of the book “Flatland” by Edwin Abbott

WordCount mapper

WordCount reducer

WordCount results

TopN: We want to find the top-n used words of a text file
• http://www.gutenberg.org/cache/epub/201/pg201.txt
• Input Data: The text of the book “Flatland” by Edwin Abbott
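One way to approach the top-n problem is to count the words first and then keep only the n most frequent. The sketch below is written in Python rather than Java and is not the code from the slides: the lowercase tokenization with a regular expression and the alphabetical tie-breaking are assumptions.

```python
import heapq
import re
from collections import Counter


def count_words(text):
    """Map + reduce in one step: count lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def top_n(counts, n):
    """Keep the n most frequent words; ties are broken alphabetically."""
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))


def top_n_words(text, n):
    return top_n(count_words(text), n)
```

In a distributed setting each mapper could keep only its local top n and a single reducer could merge those candidates, but only if the counts were already aggregated per word in an earlier job.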

TopN mapper

TopN reducer

TopN results

MEAN: We want to find the mean max temperature for every month
• http://archivio-meteo.distile.it/tabelle-dati-archivio-meteo/
• Input Data: Temperature in Milan (DDMMYYYY, MIN, MAX)
02012015,-2,7
03012015,-1,8
04012015,1,16
…
29012015,0,5
30012015,0,9
31012015,-3,6
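A possible shape for this job, sketched in Python rather than Java: the mapper emits a (sum, count) pair per reading, keyed by month, and the reducer adds the pairs before dividing. The function names and the `MM/YYYY` key format are assumptions; the input layout follows the sample lines above.

```python
from collections import defaultdict


def map_max_temperature(line):
    """Parse 'DDMMYYYY,MIN,MAX' and emit (month, (max, 1))."""
    date, _minimum, maximum = [field.strip() for field in line.split(",")]
    month = date[2:4] + "/" + date[4:8]
    yield month, (float(maximum), 1)


def reduce_mean(month, sum_counts):
    """Add up the partial sums and counts, then divide once at the end."""
    total = sum(s for s, _ in sum_counts)
    count = sum(c for _, c in sum_counts)
    return month, total / count


def mean_max_by_month(lines):
    grouped = defaultdict(list)
    for line in lines:
        for month, pair in map_max_temperature(line):
            grouped[month].append(pair)
    return dict(reduce_mean(m, pairs) for m, pairs in grouped.items())
```

Carrying (sum, count) pairs instead of partial means keeps the computation correct if a combiner merges pairs early, because a mean of means is not the overall mean in general.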

Mean mapper

Mean reducer

Mean results

TODO: k-means clustering algorithm
• We want to aggregate 2D points in clusters using k-means algorithm
• Input data: A random set of points
2.2705 0.9178
1.8600 2.1002
2.0915 1.3679
-0.1612 0.8481
…

k-means algorithm
Input: data points D, number of clusters k
1. initialize k centroids randomly
2. associate each data point in D with the nearest centroid. This will divide the data points into k clusters.
3. recalculate the position of centroids.
Repeat steps 2 and 3 until there are no more changes in the membership of the data points.
Output: data points with cluster memberships
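The steps above map naturally onto MapReduce: step 2 is a map (point to nearest centroid) and step 3 is a reduce (mean of the points in each cluster). Below is a single-process Python sketch with hypothetical function names. As an assumption made for reproducibility, the initial centroids are passed in by the caller instead of being drawn at random, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    membership = None
    for _ in range(max_iterations):
        # Step 2 (map): associate each point with its nearest centroid.
        new_membership = [nearest(p, centroids) for p in points]
        if new_membership == membership:
            break  # no change in membership: converged
        membership = new_membership
        # Step 3 (reduce): recompute each centroid as the mean of its cluster.
        clusters = defaultdict(list)
        for point, index in zip(points, membership):
            clusters[index].append(point)
        for index, members in clusters.items():
            centroids[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, membership
```

Each pass of the loop corresponds to one MapReduce job, so the centroids have to be handed from one job to the next between iterations.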

MapReduce examples in Python

WordCount using mrjob
“a”, 936
“ab”, 6
“abbot”, 3
“abbott”, 2
“abbreviated”, 1
…

Product Recommendations
• Goal: For each product a client buys, generate a ‘people who bought this also bought this’ recommendation
• Input Data: product_id_1, product_id_2
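A plain-Python sketch of the two stages this goal suggests: first count how often two products are bought together, then pick the most frequent partner for each product. The function names are hypothetical, and the sketch assumes that each input pair means "these two products were bought together" and that the relation is symmetric.

```python
from collections import Counter, defaultdict


def coincident_purchase_frequency(pairs):
    """Stage 1: count how often each ordered product pair occurs."""
    frequency = Counter()
    for first, second in pairs:
        frequency[(first, second)] += 1
        frequency[(second, first)] += 1  # assumption: co-purchase is symmetric
    return frequency


def top_recommendations(frequency, n=1):
    """Stage 2: for each product, keep the n most frequent partners."""
    by_product = defaultdict(list)
    for (product, other), count in frequency.items():
        by_product[product].append((other, count))
    return {
        product: sorted(others, key=lambda oc: (-oc[1], oc[0]))[:n]
        for product, others in by_product.items()
    }
```

In MapReduce terms, stage 1 is a word-count-style job keyed by product pair, and stage 2 is a second job keyed by the first product of the pair.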

Coincident Purchase Frequency

Top Recommendations

But…
Suppose you have:
• user data in one file,
• website data in another,
and you need to find
• the top 5 most visited pages by users aged 18-25.
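To make the logical steps of this query explicit (filter by age, join on user, group by page, count, order, limit), here is a single-machine Python sketch. The record layouts `(name, age)` and `(name, url)` and the function name `top_pages` are assumptions, since the slide does not specify the file formats.

```python
from collections import Counter


def top_pages(users, visits, n=5, min_age=18, max_age=25):
    """Return the n most visited pages among users in the given age range."""
    # Filter: keep only users aged min_age..max_age (inclusive).
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # Join + group + count: count visits whose user is eligible.
    counts = Counter(url for name, url in visits if name in eligible)
    # Order + limit: most visited first, ties broken by URL.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Expressed as raw MapReduce, each of these steps tends to become its own job with hand-written glue code, which is the motivation for a higher-level data-flow language.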

In MapReduce

In Pig Latin

What is Apache Pig?
Idea: a MapReduce program essentially performs a group-by-aggregation in parallel over a cluster of machines.
• Pig is a high-level platform for creating MapReduce programs used with Hadoop.
• The language for this platform is called Pig Latin. It combines high-level declarative querying in the spirit of SQL, and low-level, procedural programming à la MapReduce.
Developed at Yahoo

Pig
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
• At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).

Pig Latin
Pig Latin has the following key properties:
• Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
• Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
• Extensibility. Users can create their own functions to do special-purpose processing.

Performance

Pig Highlights
• User-defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
• UDFs can be written to take advantage of the combiner
• Four join implementations built in: hash, fragment-replicate, merge, skewed
• Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
• Order by provides total ordering across reducers in a balanced way
• Writing load and store functions is easy once an InputFormat and OutputFormat exist
• Piggybank, a collection of user-contributed UDFs

Who uses Pig for what?
• 70% of production jobs at Yahoo (tens of thousands per day)
• Also used by Twitter, LinkedIn, eBay, AOL, …
• Used to
– Process web logs
– Build user behavior models
– Process images
– Build maps of the web
– Do research on raw data sets

Components
• Pig resides on the user machine
• Job executes on the Hadoop cluster
No need to install anything extra on your Hadoop cluster.

So, why Pig?
• Faster development
– Fewer lines of code
– Don’t re-invent the wheel
• Flexible
– Metadata is optional
– Extensible
– Procedural programming

But…
• Do you need your program to run faster?
• Does your analytic job run for hours?
![Page 67: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/67.jpg)
Limitations of MapReduce
One of the major drawbacks of MapReduce is its inefficiency in running iterative algorithms.
MapReduce is not designed for iterative processes: after each iteration, the results have to be written to disk to pass them on to the next iteration.
This repeated disk I/O degrades performance.
![Page 68: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/68.jpg)
Limitations of Pig
Pig uses batch-oriented frameworks, which means your analytics jobs will run for many minutes or hours.
Spark is faster!
![Page 69: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/69.jpg)
What is Apache Spark?
• A fast and general compute engine for large-scale data processing.
• The major feature: the ability to perform in-memory computation (the data can be cached in memory).
• Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Developed at the University of California at Berkeley
![Page 70: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/70.jpg)
Spark
• It provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
• For certain tasks, it has been measured at up to 100x faster (data in memory) or 10x faster (data on disk) than Hadoop MapReduce.
• It can run on the Hadoop YARN resource manager and can read data from HDFS.
![Page 71: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/71.jpg)
Spark
• Designed to be used with a range of programming languages and on a variety of architectures.
• Increasingly popular with a wide range of developers, thanks to speed, simplicity, and broad support for existing development environments and storage systems.
• Relatively accessible to those learning to work with it for the first time.
• One of Apache's largest and most vibrant projects, with over 500 contributors from more than 200 organizations responsible for code in the software release.
![Page 72: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/72.jpg)
Why?
• Spark was developed largely to overcome a shortcoming of MapReduce: it is not optimized for iterative algorithms and interactive data analysis, which perform cyclic operations on the same set of data.
• Spark overcomes this problem by providing a new storage primitive called Resilient Distributed Datasets (RDDs).
![Page 73: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/73.jpg)
Resilient Distributed Datasets (RDDs)
The Resilient Distributed Dataset is a concept at the heart of Spark. It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient.
• Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data.
• Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes.
Once data is loaded into an RDD, two basic types of operation can be carried out:
• Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more;
• Actions, such as counts, which measure but do not change the original data.
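The distinction between transformations and actions can be illustrated with a small single-machine sketch in plain Python. `ToyRDD` is a hypothetical class invented for this illustration; it is not Spark's RDD and omits distribution, partitioning, and caching. It only shows the idea that transformations record lineage without computing anything, while actions replay that lineage over the unchanged source data.

```python
class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source            # original data, never modified
        self._lineage = tuple(lineage)   # recorded transformations, in order

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        # Replay the lineage over the source. The same replay could rebuild
        # a lost result, which is the intuition behind lineage-based recovery.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a plain value.
    def count(self):
        return sum(1 for _ in self._compute())

    def collect(self):
        return list(self._compute())


numbers = ToyRDD(range(10))
# Two transformations chained; nothing has been evaluated at this point.
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())   # action: evaluates the recorded lineage
print(even_squares.count())     # action: evaluates it again from the source
print(numbers.count())          # the original data set is unchanged
```

Because each transformation returns a new object that shares the source, the original data set is never mutated, which matches the description of actions as operations that measure but do not change the data.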
![Page 74: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/74.jpg)
Word Count in Spark
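A minimal sketch of the word count computation in plain Python, structured as the three steps usually used for this example on Spark RDDs (`flatMap`, `map`, `reduceByKey`). It runs on a single machine and does not call the Spark API; the function name `word_count` and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def word_count(lines):
    # Step corresponding to flatMap: split every line into words.
    words = (word for line in lines for word in line.split())
    # Step corresponding to map: pair each word with the count 1.
    pairs = ((word, 1) for word in words)
    # Step corresponding to reduceByKey: sum the counts per word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


lines = ["to be or not to be", "to do is to be"]
counts = word_count(lines)
for word in sorted(counts):
    print(word, counts[word])
```

On a cluster, the first two steps run independently on each partition of the input, and only the per-key summation requires moving data between nodes.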
![Page 75: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/75.jpg)
Another example: logistic regression
A common machine learning algorithm for classifying objects such as, say, spam vs. non-spam emails.
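A minimal sketch of logistic regression trained by batch gradient descent, in plain Python rather than Spark. It is included to show why the algorithm suits in-memory processing: every iteration scans the same data set again. The toy data, the feature choice, the learning rate, and the iteration count are all assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train(points, iterations=200, learning_rate=0.5):
    """points: list of (features, label) pairs with label 0 or 1."""
    dim = len(points[0][0])
    w = [0.0] * dim
    for _ in range(iterations):
        # Each iteration re-reads the whole data set. The per-point
        # gradient terms are independent (a "map"), and summing them
        # is an aggregation (a "reduce").
        gradient = [0.0] * dim
        for x, y in points:
            error = sigmoid(dot(w, x)) - y
            for j in range(dim):
                gradient[j] += error * x[j]
        w = [wi - learning_rate * g / len(points) for wi, g in zip(w, gradient)]
    return w


def predict(w, x):
    return 1 if sigmoid(dot(w, x)) >= 0.5 else 0


# Made-up toy data: (bias term, number of suspicious words) -> 1 means spam.
data = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 4.0), 1), ((1.0, 5.0), 1)]
weights = train(data)
predictions = [predict(weights, x) for x, _ in data]
print(predictions)
```

With a disk-based framework the data set would be reloaded on every one of the 200 passes, whereas caching it in memory pays the loading cost once.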
![Page 76: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/76.jpg)
Pig vs Spark
• Pig
– Often regarded as the best data loading tool available inside Hadoop.
– It uses a scripting language called Pig Latin, which is more workflow driven.
– You don't need to be an expert Java programmer, but you do need some coding skills.
– It is also an abstraction layer on top of map-reduce.
– Simple to write and control.
• Spark
– Pretty much the successor to map-reduce in Hadoop, with an emphasis on in-memory computing.
– You'll need to be a pretty good programmer (in Java, Scala, or Python) to use this.
– Much lower level.
![Page 77: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/77.jpg)
How to choose a platform?
• The decision to choose a particular platform for a certain application usually depends on the following important factors:
– data size
– speed or throughput optimization
– model development (training/applying a model)
![Page 78: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/78.jpg)
Example: k-means clustering algorithm
The k-means algorithm is used here to give more insight into how analytics algorithms behave on different platforms.
Characteristics:
• popular and widely used
• iterative nature
• compute-intensive task (calculating the centroids)
• aggregation of the local results to obtain a global solution
![Page 79: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/79.jpg)
k-means algorithm
Input: data points D, number of clusters k
1. initialize k centroids randomly
2. associate each data point in D with the nearest centroid. This will divide the data points into k clusters.
3. recalculate the position of centroids.
Repeat steps 2 and 3 until there are no more changes in the membership of the data points.
Output: data points with cluster memberships
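The steps above can be sketched sequentially in plain Python. This is a minimal illustration for one-dimensional points, not a production implementation: the function names, the fixed random seed, and the choice to keep an old centroid when its cluster becomes empty are assumptions made here.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to a 1-D point."""
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))


def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    membership = None
    while True:
        # Step 2: associate each data point with its nearest centroid.
        new_membership = [nearest(p, centroids) for p in points]
        # Stop when no data point changed cluster.
        if new_membership == membership:
            return centroids, membership
        membership = new_membership
        # Step 3: recalculate each centroid as the mean of its members.
        for i in range(k):
            members = [p for p, m in zip(points, membership) if m == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = sum(members) / len(members)


points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, membership = kmeans(points, k=2)
print(sorted(centroids), membership)
```

The loop in `kmeans` is the iterative part that matters for the platform comparison: every pass over steps 2 and 3 touches the full data set again.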
![Page 80: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/80.jpg)
k-means on MapReduce
Map
Input: data points D, number of clusters k and centroids
1. for each data point d ∈ D do
2. assign d to the closest centroid
Output: centroids with associated data points
Reduce
Input: centroids with associated data points
1. compute the new centroids by calculating the average of data points in cluster
2. write the global centroids to the disk
Output: new centroids
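One k-means iteration expressed as a map phase and a reduce phase can be sketched in plain Python as follows. This simulates the data flow on a single machine and does not use the Hadoop API; `kmeans_map`, `kmeans_reduce`, and `run_iteration` are hypothetical names, and the grouping loop stands in for the shuffle that the framework performs between the two phases. Writing the centroids to disk is left out of the sketch.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map: emit (index of the closest centroid, point) for one 1-D point."""
    index = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return index, point


def kmeans_reduce(index, assigned_points):
    """Reduce: the new centroid is the average of the points in the cluster."""
    return index, sum(assigned_points) / len(assigned_points)


def run_iteration(points, centroids):
    # Stand-in for the shuffle: group the mapper output by centroid index.
    groups = defaultdict(list)
    for point in points:
        index, value = kmeans_map(point, centroids)
        groups[index].append(value)
    # Centroids that received no points keep their previous position.
    new_centroids = list(centroids)
    for index, members in groups.items():
        _, new_centroids[index] = kmeans_reduce(index, members)
    return new_centroids


points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
new_centroids = run_iteration(points, centroids=[0.0, 8.0])
print(new_centroids)
```

A driver program would call `run_iteration` repeatedly, and in a MapReduce setting each call is a separate job whose output has to be stored before the next job can read it.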
![Page 81: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/81.jpg)
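k-means on Pig Latin
-- Register the jar with the user-defined function and bind it to the current centroids.
REGISTER udf.jar;
DEFINE find_centroid FindCentroid('$centroids');
-- Load the points and assign each one to its closest centroid.
points = LOAD 'points.txt' AS (id:int, pos:double);
centroided = FOREACH points GENERATE pos, find_centroid(pos) AS centroid;
-- Group by centroid and average the positions to get the new centroids.
grouped = GROUP centroided BY centroid;
result = FOREACH grouped GENERATE group, AVG(centroided.pos);
STORE result INTO 'output';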
![Page 82: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/82.jpg)
k-means on Spark
Similar to the MapReduce-based implementation
• Instead of writing the global centroids to the disk, they are written to memory, which speeds up the processing and reduces the disk I/O overhead.
• The data will be loaded into the system memory in order to provide faster access.
![Page 83: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based](https://reader034.vdocument.in/reader034/viewer/2022042223/5ec994cfa6dc8c7c24682ca7/html5/thumbnails/83.jpg)
References
• Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of Operating Systems Design and Implementation (OSDI). San Francisco, CA. 137-150. 2004
• Hadoop: Open source implementation of MapReduce. http://lucene.apache.org/hadoop/
• C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings SIGMOD '08. 2008
• D. Singh and C. K. Reddy. A survey on platforms for big data analytics, Journal of Big Data. 2014
• M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. USENIX NSDI. 2012