big data with apache hadoop
TRANSCRIPT
![Page 1: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/1.jpg)
Ing.AlessandroBruno– Ph.D.Research Fellow@CVIPGroup
PalermoUniversity
BIGDataAnalyticswithHadoop
ADI- FreeSoftware&ScientificResearch Session
Palermo24-26th June 2016
![Page 2: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/2.jpg)
BigDataAnalytics
![Page 3: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/3.jpg)
BigDataAnalytics
l“Inpioneerdaystheyusedoxenforheavypulling,andwhenoneoxcouldn’tbudgealog,theydidn’ttrytogrowalargerox.Weshouldbetryingforbiggercomputers,butformoresystemsofcomputers”.GraceHopper
![Page 4: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/4.jpg)
BigDataAnalyticsl BigDataisdrivingradicalchangesintraditionaldataanalysisplatforms
l Performinganykindofanalysisinvoluminousandcomplexdata
l Scalingupthehardwareplatformsbecomesimminentandchoosingtherighthardware/softwareplatformsbecomesacrucialdecision
l Satisfyrequestinreasonableamountoftime
[link] [link] [link]
![Page 5: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/5.jpg)
BigDataAnalytics
l Theabilityofbigdataplatformstoadpattoincreaseddataprocessingdemandsplaysacriticalrole:Isitappropriatetobuildtheanalyticsbasedsolutionsonaparticularplatform?
l Whatarethefundamentalissuesbeforemakingtherightdecisions?
![Page 6: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/6.jpg)
FundamentalIssuesl Howquicklydoweneedtogettheresults?
l HowBigistheDatatobeprocessed?
l Doesthemodelbuildingrequireseveraliterationsor asingleiteration?
l Willtherebeaneedformoredataprocessingcapabilityinthefuture?
l Isdatatransfercriticalforthisapplication?
l Isthereanyneedforhandlinghardwarefailureswithintheapplications?
![Page 7: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/7.jpg)
SomeexamplesofBigDatal 1.06billionmonthlyactiveusersonfacebookand680millionmobileusers
l Onanaverage,3.2billionlikesandcommentsarepostedeverydayonFacebook.
l 72%ofwebaudienceisonFacebook.
l FacebookstartedusingHadoopinmid-2009andwasoneoftheinitialusersofHadoop
l Facebookisgenerating500+terabytesofdataperday
![Page 8: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/8.jpg)
SomeexamplesofBigDatal NYSE(NewYorkStockExchange)generatesabout1terabyteofnewtradedataperday,ajetairlinecollects10terabytes ofcensordataforevery 30minutes offlyingtime
l Googleprocessesover40,000searchquerieseverysecondonaverage,whichtranslatestoover3.5billionsearchesperdayand1.2trillionsearchesperyearworldwide
![Page 9: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/9.jpg)
VforBIGDATA
Volume Variety Velocity Veracity Value
![Page 10: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/10.jpg)
VforBigDatal Volume,Velocity,Variety,Veracity,Value
l Volumereferstothevastamountsofdatageneratedeverysecond.Justthinkofalltheemails,twittermessages,photos,videoclips,sensordataetc.weproduceandshareeverysecond
l Velocityreferstothespeedatwhichnewdataisgeneratedandthespeedatwhichdatamovesaround.Justthinkofsocialmediamessagesgoingviralinseconds,thespeedatwhichcreditcardtransactionsarecheckedforfraudulentactivities
![Page 11: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/11.jpg)
VforBigDatal Varietyreferstothedifferenttypesofdatawecannowuse.Inthepastwefocusedonstructureddatathatneatlyfitsintotablesorrelationaldatabases,suchasfinancialdata(e.g.salesbyproductorregion).Infact,80%oftheworld’sdataisnowunstructured,andthereforecan’teasilybeputintotables(thinkofphotos,videosequencesorsocialmediaupdates)
l Veracityreferstothemessinessortrustworthinessofthedata.Withmanyformsofbigdata,qualityandaccuracyarelesscontrollable(justthinkofTwitterpostswithhashtags,abbreviations,typosandcolloquialspeechaswellasthereliabilityandaccuracyofcontent)
![Page 12: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/12.jpg)
VforBigDatal Value: ThereisstillanotherVtotakeintoaccountwhenlookingatBigData:Value!Youarewellandgoodhavingaccesstobigdatabutunlesswecanturnitintovalueitisuseless.Soyoucansafelyarguethat'value'isthemostimportantVofBigData
![Page 13: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/13.jpg)
Scalingl Scalingistheabilityofthesystemtoadapttoincreaseddemandsintermsofdataprocessing
l HORIZONTALSCALING(SCALEOUT) :Itinvolvesdistributing theworkloadacrossmanyserverswhichmaybecommoditymachines.Furthermore, multipleindependentmachinesareaddedtogetherinorder toimproveprocessingcapability
l VERTICALSCALING(SCALEUP):Itinvolvesinstallingmoreprocessors,morememoryandfasterhardware,generallywithinasingleserver
![Page 14: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/14.jpg)
Scalingl DrawbacksandAdvantagesofHorizontalandVerticalScaling
![Page 15: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/15.jpg)
HorizontalScalingl Scalingoutplatformsincludespeer-to-peernetworksandApacheHadoop
l Peertopeernetworksinvolvemillionsofmachinesconnectedinanetwork:decentralizedanddistributednetworkarchitecturewherethenodesinthenetworksserveasconsumeresources
l Bottleneckinpeertopeernetworksarisesinthecommunicationbetweendifferentnodes
l FeatureofMPI(MessagePassingInterface):statepreservingprocess;processcanliveaslongasthesystemrunsandthereisnoneedtoreadagainthesamedata(suchasinthecaseofMAPREDUCEscheme)
![Page 16: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/16.jpg)
HorizontalScalingl InMPIarchitecturealltheparameterscanbepreservedlocally.MPIiswellsuitedforiterativeprocessing.ItisemployedbytheMaster/Slaveparadigm.SlavenodecanbecometheMasterforotherprocesses.
l DrawbacksofMPI:Thefaultintolerance,MPIhasnomechanismtohandlefaults…Asinglenodefailurecancausetheentiresystemshutdown
l WiththeadventofotherframeworkssuchasHadoop,MPIisnotbeingwidelyusedanymore…
![Page 17: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/17.jpg)
HorizontalScaling- ApacheHadoopl ApacheHadoopisanopensourceframeworkforstoringandprocessinglargedatasetsusingclustersofcommodityhardware
l Scaleuptohundredsandeventhousandsofnodes
l Highlyfaulttolerant
![Page 18: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/18.jpg)
HADOOPStorageandAnalysisat
InternetScale
![Page 19: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/19.jpg)
Hadoop- DATAl Weliveinthedataage:
l NewYorkStockExchangegeneratesaboutoneterabyteofnewtradedataperday
l Facebookhostsapproximately10billionsphotos
l Ancestry.com(genealogysite)storedabout2.5petabytesofdata
l TheinternetArchivestoresaround2petabytesofdata…andisgrowingatarateof20terabytespermonth
THEGOODNEWSISTHATBIGDATAISHERE.THEBADNEWSISTHATWEARESTRUGGLINGTOSTOREANDANALYZEIT.
![Page 20: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/20.jpg)
Hadoop– BIGDATAl ThefirstproblemrelativetoDataStorageandAnalysisistosolvehardwarefailureassoonasyoustartusingmanypiecesofhardware,thechancethatonewillfailisfairlyhigh…
l Thesecondproblemisthatmostanalysistasksneedtobeabletocombinethedatainsomewaydatareadfromonediskmayneedtobecombinedwithdatafrommultiplesources
- Allowingdatatobecombined frommultiple sourcesisnotoriously challenging
![Page 21: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/21.jpg)
Hadoop…Let’sgo!l Hadoopgetsitsstartin“Nutch”,anopensourcewebsearchengineproject
l OnceGooglepublishedGFSandMAPREDUCEpapers,theroutebecameclear
MapReduceprovidesaprogrammingmodelthatabstractstheproblemfromdiskreadsandwrites,transformingitintoacomputationoversetsofkeysandvalues
MapReduceapproachmayseemlikeabrute-forceapproach,theentiredatasetisprocessedforeachquery.Itisabatchqueryprocessor
![Page 22: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/22.jpg)
GLOBALHADOOP…
![Page 23: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/23.jpg)
GLOBALHADOOP…
![Page 24: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/24.jpg)
GLOBALHADOOP…
![Page 25: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/25.jpg)
TraditionalRDBMSvsMapReduce
![Page 26: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/26.jpg)
TraditionalRDBMSvsMapReduce
l RDBMSworkswellwithstructureddata
l MapReduceworkswellonunstructuredandonsemi-structureddatasinceitisdesignedtointepretthedataatprocessingtime
l MapReducecanbeseensuchasacomplementtoanRDBMS
![Page 27: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/27.jpg)
MAPREDUCEMODELl TheprogrammingmodelusedinHadoopisMapReduce,proposedbyDeanandGhemawatatGoogle
l MapReduceisthebasicdataprocessingschemeusedinHadoopwhichincludesbreakingtheentiretaskintotwoparts,knownasMappersandReducers
l MapReducemodelconsistsoftwofunctions:amapfunctionandareducefunction.EachofthefunctionsdefinesamappingfromonesetofKEY-VALUEpairstoanother
![Page 28: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/28.jpg)
MAPREDUCE– MODELl MappersreadthedatafromHDFS,processitandgeneratesomeintermediateresultstothereducers
l ReducersareusedtoaggregatetheintermediateresultstogeneratethefinaloutputwhichisagainwrittentoHDFS
l AtypicalHadoopjobinvolvesrunningseveralmappersandreducersacrossdifferentnodesinthecluster
![Page 29: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/29.jpg)
MAPREDUCE- MODELl OneofthemajordrawbacksofMapReduceisitsinefficiencyinrunningiterativealgorithms.MapReduceisnotdesignedforiterativeprocesses
l Mappersreadthesamedataagainandagainfromthedisk.Hence,aftereachiteration,theresultshavetobewrittentothedisktopassthemontothenextiteration.Thismakesdiskaccessamajorbottleneckwhichsignificantlydegradestheperformance
![Page 30: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/30.jpg)
AlternativetoApacheHadoop
l ApacheprojectcalledHaLoop extendsMapReducewithprogrammingsupportforiterativealgorithmsandimprovesefficiencybyaddingcachingmechanisms
l CGLMapReduce isanotherworkthatfocusesonimprovingtheperformanceofMapReduceiterativetasks.OtherexamplesofiterativeMapReduceincludeTwister andimapreduce
![Page 31: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/31.jpg)
Spark– NextGenerationDataAnalysisParadigm
l Spark isanextgenerationparadigmforbigdataprocessingdevelopedbyresearchersatBerkleyUniversity
l ItisalternativetoHadoopwhichisdesignedtoovercomethediskI/Olimitations andimprovetheperformanceofearliersystems.ThemajorfeatureofSparkthatmakes ituniqueisitsabilitytoperformin-memorycomputations.Itallowsthedatatobecachedinmemory,thuseliminating theHadoop diskoverheadlimitationforiterativetasks
l Sparkisageneralenginelarge-scaledataprocessingthatsupportsJava,ScalaandPythonandforcertaintasksitistestedtobeupto100xfasterthanHadoopMapReduce
![Page 32: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/32.jpg)
HADOOPSYSTEM
![Page 33: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/33.jpg)
HADOOPComponents:l Common – AsetofcomponentandinterfacesfordistributedfilesystemsandgeneralI/O(serialization,JavaRPC,persistentdatastructures)
l Avro – Aserializationsystemforefficient,cross-languageRPC,persistentdata-storage
l MapReduce – Adistributeddataprocessingmodelandexecutionenvironmentthatrunsonlargeclustersofcommoditymachines
l HDFS – AdistributedFilesystemthatrunsonlargeclustersofcommoditymachines
![Page 34: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/34.jpg)
HADOOPComponents:l Pig – ADataflowlanguageandexecutionenvironmentforexploringverylargedatasets.PigrunsonHDFSandMAPREDUCEclusters
l Hive – Distributeddatawarehouse.ItmanagesdatastoredinHDFSandprovidesaquerylanguagebasedonSQL,translatedtoMapReducejobsbyruntimeengine
l Hbase – DistributeColumn-orienteddatabase.Hbasesupportsbothbatch-stylecomputationsusingMAPREDUCEandpointqueries
l ZooKeeper – Distributedcoordinationservice. Itprovidesprimitivessuchasdistributed locksthatcanbeusedforbuildingdistributedapplications
l Sqoop – ToolformovingdatabetweenRelationalDBandHDFS
![Page 35: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/35.jpg)
HadoopDistributedFileSystem
![Page 36: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/36.jpg)
HDFS– keyfeatures
lHDFSishighlyfault-tolerant,withhighthroughput,suitableforapplicationswithlargedatasets,streamingaccesstofilesystemdataandcanbebuiltoutofcommodityhardware
![Page 37: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/37.jpg)
HDFS– LargeDatasetl LargeDatasets – Processseveralsmalldatasetsranginginsomemegabytesorevenafewgigabytes
l - ThearchitectureofHDFSisdesignedinsuchawaythatitisbestfittostoreandretrievehugeamountofdata.Whatisrequiredishighcumulativedatabandwidthandthescalabilityfeaturetospreadoutfromasinglenodeclustertoahundredorathousand-nodecluster
![Page 38: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/38.jpg)
HDFS– WriteOnceReadMany
l AfileinHDFSoncewrittenwillnotbemodified,thoughitcanbeaccessnnumberoftimes(thoughfutureversionsofHadoopmaysupportthisfeaturetoo)!Atpresent,inHDFSstrictlyhasonewriteratanytime
![Page 39: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/39.jpg)
HDFS- StreamingDataAccess
l AsHDFSworksontheprincipleofWriteOnce,ReadMany,thefeatureofstreamingdataaccessisextremelyimportantinHDFS.AsHDFSisdesignedmoreforbatchprocessingratherthaninteractiveusebyusers
![Page 40: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/40.jpg)
HDFS– CommodityHardware
lHDFS assumesthatthecluster(s)willrunoncommonhardware,thatis,non-expensive,ordinarymachinesratherthanhigh-availabilitysystems
![Page 41: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/41.jpg)
HDFS– DatareplicationandFaulttolerance
![Page 42: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/42.jpg)
HDFS– DatareplicationandFaulttolerance
l InHDFS,thefilesaredividedintolargeblocksofdataandeachblockisstoredonthreenodes:twoonthesamerackandoneonadifferentrackforfaulttolerance.Ablockistheamountofdatastoredoneverydatanode.Thoughthedefaultblocksizeis64MBandthereplicationfactoristhree,theseareconfigurableperfile
![Page 43: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/43.jpg)
HDFS- HighThroughput
l InHadoopHDFS,whenwewanttoperformataskoranaction,thentheworkisdividedandsharedamongdifferentsystems.So,allthesystemswillbeexecutingthetasksassignedtothemindependentlyandinparallel.Sotheworkwillbecompletedinaveryshortperiodoftime
![Page 44: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/44.jpg)
HDFSMovingComputationsisbetterthanMovingData
lHDFSworksontheprinciplethatifacomputationisdonebyanapplicationnearthedataitoperateson,itismuchmoreefficientthandonefarof,particularlywhentherearelargedatasets.Themajoradvantageisreductioninthenetworkcongestionandincreasedoverallthroughputofthesystem
![Page 45: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/45.jpg)
HDFS–Filesystemnamespace
lHDFSnamespacehierarchyissimilartomostoftheotherexistingfilesystems,whereonecancreateanddeletefilesorrelocateafilefromonedirectorytoanother,orevenrenameafile(Hierarchicalfileorganization)
![Page 46: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/46.jpg)
Hadoop– Alternativefilesysteml SomeHadoopusershavestrictdemandsaroundperformance,availabilityandenterprise-
gradefeatures,whileothersaren’tkeenofitsdirect-attachedstorage(DAS)architecture
l ThereisagrowingnumberofoptionsforreplacingHDFSl Cassandra (DataStax);l Ceph;l DispersedStorageNetwork(Cleversafe)l GPFS(IBM)l Isilon(EMC)l Lustrel MapRFileSysteml NetAppOpenSolutionforHadoop
![Page 47: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/47.jpg)
HDFS– HadoopDistributedFileSystemMasterSlaveArchitecture
l HDFS hasamaster/slave architecture
l AnHDFSclusterconsistsofasingleNameNode,amasterserverthatmanagesthefilesystemnamespaceandregulatesaccesstofilesbyclients
l ThereareanumberofDataNodes,usuallyonepernode inthecluster,whichmanagestorageattachedtothenodesthattheyrunon
l HDFS exposesafilesystemnamespaceandallowsuserdatatobestoredinfiles.Internally,afileissplitintooneormoreblocksandtheseblocksarestoredinasetofDataNodes
![Page 48: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/48.jpg)
HDFS– MasterSlaveArchitecturel TheNameNodeexecutesfilesystemnamespaceoperationslikeopening,closing,andrenamingfilesanddirectories
ItalsodeterminesthemappingofblockstoDataNodes.TheDataNodesareresponsibleforservingreadandwriterequestsfromthefilesystem’sclients
TheDataNodesalsoperformblockcreation,deletion,andreplicationuponinstructionfromtheNameNode
![Page 49: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/49.jpg)
Hadoop– JobTracker&TaskTrackerl JobTracker istheservicewithinHadoopthatfarmsoutMapReducetasks tospecificnodesinthecluster,ideallythenodesthathavethedataor,atleast,areinthesamerack
l Clientapplications submitjobstotheJobTracker
l JobTrackertalkstotheNameNode todeterminethelocationofthedata
l JobTrackerlocatesTaskTracker nodeswithavailableslotsatornearthedataTheJobTrackersubmitstheworktothechosenTaskTrackernodes
l TaskTrackernodes arepermanentlymonitored.Iftheydonotsubmitheartbeatsignalsoftenenough,theyaredeemedtohavefailedandtheworkisscheduledonadifferentTaskTracker
![Page 50: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/50.jpg)
Hadoop– JobTracker&Tasktrackerl ATaskTrackerwillnotifytheJobTrackerwhenataskfails.JobTrackerthendecideswhattodo:itmayresubmitthejobelsewhere,itmaymarkthatspecificrecordassomethingtoavoid,anditmayevenblacklisttheTaskTrackerasunreliable
l Whentheworkiscompleted,JobTrackerupdatesitsstatus.ClientapplicationscanpolltheJobTrackerforinformation.JobTrackerisapointoffailurefortheHadoopMapReduceservice.Ifitgoesdown,allrunningjobsarehalted
![Page 51: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/51.jpg)
Hadoop– Node&Tracker
![Page 52: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/52.jpg)
HadoopMapReduce- Scheme
![Page 53: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/53.jpg)
ApacheHadoopl http://hadoop.apache.org/releases.html DownloadthereleasesofHadoop
l http://hadoop.apache.org/docs/current/ Docs
l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html SingleNodeSetup
l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html ClusterSetup
![Page 54: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/54.jpg)
Hadoopl Standaloneà OneMachineà OneJvmProcess
l PseudoDistributedàOneMachineà severalJvmProcesses
l FullyDistributedà SeveralMachinesà JvmMachines
l SSHà Public&privatekeyMasterà Toallnodes
![Page 55: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/55.jpg)
MapReduceARchitecturel JobClient
l Submit Jobs
l JobTrackerlCoordinatesThejobs
lScheduling theJobs
l TaskTrackerl Breaksdownthejobintwotasks:MAP,REDUCE
1. JobClientsubmitsajobtoaJobTracker2. JobTracker sendsaquerytoNameNode,NameNodeneedstoknowDataNodelocations
![Page 56: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/56.jpg)
MAPREDUCE- INTERNALS
INPUTFORMAT
SPLIT SPLIT SPLIT
MAP MAP MAP
SHUFFLE
SORT
REDUCE
OUTPUT
DATANODE(1)INPUTFORMAT
SPLIT SPLIT SPLIT
MAP MAP MAP
SHUFFLE
SORT
REDUCE
OUTPUT
DATANODE(2)
![Page 57: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/57.jpg)
MAPREDUCE- INTERNALS
SPLIT
Shuffle&Sort
REDUCE
MAP
Inputdataisdivided intoSplitsbasedontheinputformat.InputSplitsequatetoamaptaskrunninginparallel
Mapperstransforminput splitsintokey/valuepairsbasedonuserdefinedcode
MoveMappersoutput tothereducers,orderedbykeyvalues
Reducersaggregatekeyvaluesbasedonuserdefinedcode
OutputFormats
DefinehowresultswillbewrittenonHDFS
![Page 58: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/58.jpg)
EXAMPLEofMAPREDUCEl Counttheoccurrencesofwordsinatext:
NUGGETSCISCOMSCISCONUGGETSWEBLINUXWINDOWSITNUGGETLINUXCERTSHADOOPAWSIT
INPUTSPLIT
NUGGETSCISCOMS
CISCONUGGETSWEB
LINUXWINDOWSIT
NUGGETSLINUXCERTS
HADOOPAWSIT
… MAP
![Page 59: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/59.jpg)
EXAMPLEOFMAPREDUCE
SPLITNUGGETSCISCOMS
CISCONUGGETSWEB
LINUXWINDOWSIT
NUGGETSLINUXCERTS
HADOOPAWSIT
MAPNUGGETS(1)
CISCO(1)MS(1)
CISCO(1)NUGGETS(1)
WEB(1)
LINUX(1)WINDOWS(1)
IT(1)
NUGGETS(1)LINUX(1)CERTS(1)
HADOOP (1)AWS(1)
IT(1)
SHUFFLE/SORTNUGGETS(1)NUGGETS(1)NUGGETS(1)
CISCO(1)CISCO(1)
MS(1)
LINUX(1)LINUX(1)
WINDOWS(1)
WEB(1)
IT(1)IT(1)
CERTS(1)
AWS(1)
HADOOP(1)
... ...
![Page 60: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/60.jpg)
EXAMPLEOFMAPREDUCESHUFFLE/SORT
NUGGETS(1)NUGGETS(1)NUGGETS(1)
CISCO(1)CISCO(1)
MS(1)
LINUX(1)LINUX(1)
WINDOWS(1)
WEB(1)
IT(1)IT(1)
CERTS(1)
AWS(1)
HADOOP(1)
...
REDUCECISCO(2)
MS(1)
WEB(1)
LINUX(2)
WINDOWS(1)
IT(2)
HADOOP (1)
AWS(1)
NUGGETS(3)
CERTS(1)
Theoutput isthepair(Key,Value)
![Page 61: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/61.jpg)
Hadoop– SingleNode
![Page 62: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/62.jpg)
Hadoop– SingleNode
![Page 63: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/63.jpg)
Hadoop– PseudoDistributedOperation
![Page 64: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/64.jpg)
Hadoop– PseudoDistributedOperations
![Page 65: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/65.jpg)
Hadoop– PseudoDistributedOperations
![Page 66: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/66.jpg)
BigDataAnalytics– CaseUse
l Bigdataanalyticsforsmartmobility
l Estimateofvisitorsmobilityflows
l Correlationbetweenmobilephonedataandspatialpatterns
l FinancialServices
l Education
![Page 67: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/67.jpg)
BigDataAnalytics– CaseUsel Health
l Agriculture
l MarketCompetitiveness
l BehaviorPrediction
l TextAnalytics
l TelecommunicationIndustry
![Page 68: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/68.jpg)
![Page 69: Big data with apache hadoop](https://reader031.vdocument.in/reader031/viewer/2022021507/5878ae3e1a28ab724c8b4e67/html5/thumbnails/69.jpg)
Clouderaprovidesascalable,flexible,integratedplatformthatmakesiteasytomanagerapidlyincreasingvolumesandvarietiesofdataClouderaproductsandsolutions enableyoutodeployandmanageApacheHadoopandrelatedprojects,manipulateandanalyzeyourdata,andkeepthatdatasecureandprotected.