
YARN Essentials

Table of Contents

YARN Essentials
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions

1. Need for YARN
The redesign idea
Limitations of the classical MapReduce or Hadoop 1.x
YARN as the modern operating system of Hadoop
What are the design goals for YARN
Summary

2. YARN Architecture
Core components of YARN architecture
ResourceManager
ApplicationMaster (AM)
NodeManager (NM)
YARN scheduler policies
The FIFO (First In First Out) scheduler
The fair scheduler
The capacity scheduler
Recent developments in YARN architecture
Summary

3. YARN Installation
Single-node installation
Prerequisites
Platform
Software
Starting with the installation
The standalone mode (local mode)
The pseudo-distributed mode
The fully-distributed mode
HistoryServer
Slave files
Operating Hadoop and YARN clusters
Starting Hadoop and YARN clusters
Stopping Hadoop and YARN clusters
Web interfaces of the Ecosystem
Summary

4. YARN and Hadoop Ecosystems
The Hadoop 2 release
A short introduction to Hadoop 1.x and MRv1
MRv1 versus MRv2
Understanding where YARN fits into Hadoop
Old and new MapReduce APIs
Backward compatibility of MRv2 APIs
Binary compatibility of org.apache.hadoop.mapred APIs
Source compatibility of org.apache.hadoop.mapred APIs
Practical examples of MRv1 and MRv2
Preparing the input file(s)
Running the job
Result
Summary

5. YARN Administration
Container allocation
Container allocation to the application
Container configurations
YARN scheduling policies
The FIFO (First In First Out) scheduler
The capacity scheduler
Capacity scheduler configurations
The fair scheduler
Fair scheduler configurations
YARN multitenancy application support
Administration of YARN
Administrative tools
Adding and removing nodes from a YARN cluster
Administrating YARN jobs
MapReduce job configurations
YARN log management
YARN web user interface
Summary

6. Developing and Running a Simple YARN Application
Running sample examples on YARN
Running a sample Pi example
Monitoring YARN applications with web GUI
YARN's MapReduce support
The MapReduce ApplicationMaster
Example YARN MapReduce settings
YARN's compatibility with MapReduce applications
Developing YARN applications
The YARN application workflow
Writing the YARN client
Writing the YARN ApplicationMaster
Responsibilities of the ApplicationMaster
Summary

7. YARN Frameworks
Apache Samza
Writing a Kafka producer
Writing the hello-samza project
Starting a grid
Storm-YARN
Prerequisites
Hadoop YARN should be installed
Apache ZooKeeper should be installed
Setting up Storm-YARN
Getting the storm.yaml configuration of the launched Storm cluster
Building and running Storm-Starter examples
Apache Spark
Why run on YARN?
Apache Tez
Apache Giraph
HOYA (HBase on YARN)
KOYA (Kafka on YARN)
Summary

8. Failures in YARN
ResourceManager failures
ApplicationMaster failures
NodeManager failures
Container failures
Hardware Failures
Summary

9. YARN – Alternative Solutions
Mesos
Omega
Corona
Summary

10. YARN – Future and Support
What YARN means to the big data industry
Journey – present and future
Present on-going features
Future features
YARN-supported frameworks
Summary

Index

YARN Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-173-7

www.packtpub.com

Credits

Authors
Amol Fasale
Nirmal Kumar

Reviewers
Lakshmi Narasimhan
Swapnil Salunkhe
Jenny (Xiao) Zhang

Commissioning Editor
Taron Pereira

Acquisition Editor
James Jones

Content Development Editor
Arwa Manasawala

Technical Editor
Indrajit A. Das

Copy Editors
Karuna Narayanan
Laxmi Subramanian

Project Coordinator
Purav Motiwalla

Proofreaders
Safis Editing
Maria Gould

Indexer
Priya Sane

Graphics
Sheetal Aute
Valentina D'silva
Abhinash Sahu

Production Coordinator
Shantanu N. Zagade

Cover Work
Shantanu N. Zagade

About the Authors

Amol Fasale has more than 4 years of industry experience actively working in the fields of big data and distributed computing; he is also an active blogger and contributor to the open source community. Amol works as a senior data system engineer at MakeMyTrip.com, a very well-known travel and hospitality portal in India, responsible for real-time personalization of the online user experience with Apache Kafka, Apache Storm, Apache Hadoop, and many more. Amol also has active hands-on experience in Java/J2EE, Spring Frameworks, Python, machine learning, Hadoop framework components, SQL, NoSQL, and graph databases.

You can follow Amol on Twitter at @amolfasale or on LinkedIn. Amol is very active on social media. You can catch him online for any technical assistance; he would be happy to help.

Amol completed his bachelor's in engineering (electronics and telecommunication) from Pune University and a postgraduate diploma in computers from CDAC.

The gift of love is one of the greatest blessings from parents, and I am heartily thankful to my mom, dad, friends, and colleagues who have shown and continue to show their support in different ways. Finally, I owe much to James and Arwa, without whose direction and understanding I would not have completed this work.

Nirmal Kumar is a lead software engineer at iLabs, the R&D team at Impetus Infotech Pvt. Ltd. He has more than 8 years of experience in open source technologies such as Java, JEE, Spring, Hibernate, web services, Hadoop, Hive, Flume, Sqoop, Kafka, Storm, NoSQL databases such as HBase and Cassandra, and MPP databases such as Teradata.

You can follow him on Twitter at @nirmal___kumar. He spends most of his time reading about and playing with different technologies. He has also undertaken many tech talks and training sessions on big data technologies.

He attained his master's degree in computer applications from Harcourt Butler Technological Institute (HBTI), Kanpur, India and is currently part of the big data R&D team in iLabs at Impetus Infotech Pvt. Ltd.

I would like to thank my organization, especially iLabs, for supporting me in writing this book. Also, a special thanks to the Packt Publishing team; without you guys, this work would not have been possible.

About the Reviewers

Lakshmi Narasimhan is a full stack developer who has been working on big data and search since the early days of Lucene and was a part of the search team at Ask.com. He is a big advocate of open source and regularly contributes and consults on various technologies, most notably Drupal and technologies related to big data. Lakshmi is currently working as the curriculum designer for his own training company, http://www.readybrains.com. He blogs occasionally about his technical endeavors at http://www.lakshminp.com and can be contacted via his Twitter handle, @lakshminp.

It's hard to find a ready reference or documentation for a subject like YARN. I'd like to thank the author for writing a book on YARN and hope the target audience finds it useful.

Swapnil Salunkhe is a passionate software developer who is keenly interested in learning and implementing new technologies. He has a passion for functional programming, machine learning, and working with data. He has experience working in the finance and telecom domains.

I'd like to thank Packt Publishing and its staff for an opportunity to contribute to this book.

Jenny (Xiao) Zhang is a technology professional in business analytics, KPIs, and big data. She helps businesses better manage, measure, report, and analyze data to answer critical business questions and drive business growth. She is an expert in SaaS business and has experience in a variety of industry domains such as telecom, oil and gas, and finance. She has written a number of blog posts at http://jennyxiaozhang.com on big data, Hadoop, and YARN. She also actively uses Twitter at @smallnaruto to share insights on big data and analytics.

I want to thank all my blog readers. It is the encouragement from them that motivates me to deep dive into the ocean of big data. I also want to thank my dad, Michael (Tiegang) Zhang, for providing technical insights in the process of reviewing the book. A special thanks to the Packt Publishing team for this great opportunity.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

In a short span of time, YARN has attained a great deal of momentum and acceptance in the big data world.

YARN Essentials is about YARN, the modern operating system for Hadoop. This book contains all that you need to know about YARN, right from its inception to the present and future.

In the first part of the book, you will be introduced to the motivation behind the development of YARN and learn about its core architecture, installation, and administration. This part also talks about the architectural differences that YARN brings to Hadoop 2 with respect to Hadoop 1 and why this redesign was needed.

In the second part, you will learn how to write a YARN application, how to submit an application to YARN, and how to monitor the application. Next, you will learn about the various emerging open source frameworks that are developed to run on top of YARN. You will learn to develop and deploy some use case examples using Apache Samza and Storm-YARN.

Finally, we will talk about the failures in YARN, some alternative solutions available on the market, and the future and support for YARN in the big data world.

What this book covers

Chapter 1, Need for YARN, discusses the motivation behind the development of YARN. This chapter discusses what YARN is and why it is needed.

Chapter 2, YARN Architecture, is a deep dive into YARN's architecture. All the major components and their inner workings are explained in this chapter.

Chapter 3, YARN Installation, describes the steps required to set up a single-node and fully-distributed YARN cluster. It also talks about the important configurations/properties that you should be aware of while installing the YARN cluster.

Chapter 4, YARN and Hadoop Ecosystems, talks about Hadoop with respect to YARN. It gives a short introduction to the Hadoop 1.x version, the architectural differences between Hadoop 1.x and Hadoop 2.x, and where exactly YARN fits into Hadoop 2.x.

Chapter 5, YARN Administration, covers information on the administration of YARN clusters. It explains the administrative tools that are available in YARN, what they mean, and how to use them. This chapter covers various topics from YARN container allocation and configuration to various scheduling policies/configurations and in-built support for multitenancy.

Chapter 6, Developing and Running a Simple YARN Application, focuses on some real applications with YARN, with some hands-on examples. It explains how to write a YARN application, how to submit an application to YARN, and finally, how to monitor the application.

Chapter 7, YARN Frameworks, discusses the various emerging open source frameworks that are developed to run on top of YARN. The chapter then talks in detail about Apache Samza and Storm on YARN, where we will develop and run some sample applications using these frameworks.

Chapter 8, Failures in YARN, discusses the fault-tolerance aspect of YARN. This chapter focuses on various failures that can occur in the YARN framework, their causes, and how YARN gracefully handles those failures.

Chapter 9, YARN – Alternative Solutions, discusses other alternative solutions that are available on the market today. These systems, like YARN, share common inspiration/requirements and the high-level goal of improving scalability, latency, fault-tolerance, and programming model flexibility. This chapter highlights the key differences in the way these alternative solutions address the same features provided by YARN.

Chapter 10, YARN – Future and Support, talks about YARN's journey and its present and future in the world of distributed computing.

What you need for this book

You will need a single Linux-based machine with JDK 1.6 or later installed. Any recent version of the Apache Hadoop 2 distribution will be sufficient to set up a YARN cluster and run some examples on top of YARN.

The code in this book has been tested on CentOS 6.4 but will run on other variants of Linux.

Who this book is for

This book is for big data enthusiasts who want to gain in-depth knowledge of YARN and know what really makes YARN the modern operating system for Hadoop. You will develop a good understanding of the architectural differences that YARN brings to Hadoop 2 with respect to Hadoop 1.

You will develop in-depth knowledge about the architecture and inner workings of the YARN framework.

After finishing this book, you will be able to install, administer, and develop YARN applications. This book tells you everything you need to know about YARN, right from its inception to its present and future in the big data industry.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The URL for the NameNode is http://<namenode_host>:<port>/ and the default HTTP port is 50070."

A block of code is set as follows:

<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>read and write buffer size of files</description>
</property>

Any command-line input or output is written as follows:

${path_to_your_input_dir}
${path_to_your_output_dir_old}

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Under the Tools section, you can find the YARN configuration file details, scheduling information, container configurations, local logs of the jobs, and a lot of other information on the cluster."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <questions@packtpub.com>, and we will do our best to address the problem.

Chapter 1. Need for YARN

YARN stands for Yet Another Resource Negotiator. YARN is a generic resource platform to manage resources in a typical cluster. YARN was introduced with Hadoop 2.0, which is an open source distributed processing framework from the Apache Software Foundation.

In 2012, YARN became one of the subprojects of the larger Apache Hadoop project. YARN is also referred to as MapReduce 2.0, since Apache Hadoop MapReduce has been re-architected from the ground up as Apache Hadoop YARN.

Think of YARN as a generic computing fabric that supports MapReduce and other application paradigms within the same Hadoop cluster; earlier, the cluster was limited to batch processing using MapReduce. This really changed the game and recast Apache Hadoop as a much more powerful data processing system. With the advent of YARN, Hadoop now looks very different compared to the way it was only a year ago.

YARN enables multiple applications to run simultaneously on the same shared cluster and allows applications to negotiate resources based on need. Therefore, resource allocation/management is central to YARN.

YARN has been thoroughly tested at Yahoo! since September 2012. It has been in production across 30,000 nodes and 325 PB of data since January 2013.

Recently, Apache Hadoop YARN won the Best Paper Award at the ACM Symposium on Cloud Computing (SoCC) in 2013!

The redesign idea

Initially, Hadoop was written solely as a MapReduce engine. Since it runs on a cluster, its cluster management components were also tightly coupled with the MapReduce programming paradigm.

The concepts of MapReduce and its programming paradigm were so deeply ingrained in Hadoop that one could not use it for anything else except MapReduce. MapReduce therefore became the base for Hadoop, and as a result, the only thing that could be run on Hadoop was a MapReduce job, that is, batch processing. In Hadoop 1.x, there was a single JobTracker service that was overloaded with many things such as cluster resource management, scheduling jobs, managing computational resources, restarting failed tasks, monitoring TaskTrackers, and so on.

There was definitely a need to separate the MapReduce (specific programming model) part and the resource management infrastructure in Hadoop. YARN was the first attempt to perform this separation.

Limitations of the classical MapReduce or Hadoop 1.x

The main limitations of Hadoop 1.x can be categorized into the following areas:

Limited scalability: Large Hadoop clusters reported some serious limitations on scalability. This is caused mainly by a single JobTracker service, which ultimately results in a serious deterioration of the overall cluster performance because of attempts to re-replicate data and overload live nodes, thus causing a network flood. According to Yahoo!, the practical limits of such a design are reached with a cluster of ~5,000 nodes and 40,000 tasks running concurrently. Therefore, it is recommended that you create smaller and less powerful clusters for such a design.

Low cluster resource utilization: The resources in Hadoop 1.x on each slave node (DataNode) are divided in terms of a fixed number of map and reduce slots. Consider the scenario where a MapReduce job has already taken up all the available map slots and now wants more new map tasks to run. In this case, it cannot run new map tasks, even though all the reduce slots are still empty. This notion of a fixed number of slots has a serious drawback and results in poor cluster utilization.

Lack of support for alternative frameworks/paradigms: The main focus of Hadoop right from the beginning was to perform computation on large datasets using parallel processing. Therefore, the only programming model it supported was MapReduce. With the current industry needs in terms of new use cases in the world of big data, many new and alternative programming models (such as Apache Giraph, Apache Spark, Storm, Tez, and so on) are coming into the picture each day. There is definitely an increasing demand to support multiple programming paradigms besides MapReduce, to support the varied use cases that the big data world is facing.

YARN as the modern operating system of Hadoop

The MapReduce programming model is, no doubt, great for many applications, but not for everything in the world of computation. There are use cases that are best suited for MapReduce, but not all.

MapReduce is essentially batch-oriented, but support for real-time and near real-time processing are the emerging requirements in the field of big data.

YARN took cluster resource management capabilities from the MapReduce system so that new engines could use these generic cluster resource management capabilities. This lightened up the MapReduce system to focus on the data processing part, which it is good at and will ideally continue to be so.

YARN therefore turns into a data operating system for Hadoop 2.0, as it enables multiple applications to coexist in the same shared cluster. Refer to the following figure:

YARN as a modern OS for Hadoop

What are the design goals for YARN

This section talks about the core design goals of YARN:

Scalability: Scalability is a key requirement for big data. Hadoop was primarily meant to work on a cluster of thousands of nodes with commodity hardware. Also, the cost of hardware is reducing year-on-year. YARN is therefore designed to perform efficiently on this network of a myriad of nodes.

High cluster utilization: In Hadoop 1.x, the cluster resources were divided in terms of fixed-size slots for both map and reduce tasks. This means that there could be a scenario where map slots might be full while reduce slots are empty, or vice versa. This was definitely not an optimal utilization of resources, and it needed further optimization. YARN fine-grained resources in terms of RAM, CPU, and disk (containers), leading to an optimal utilization of the available resources.

Locality awareness: This is a key requirement for YARN when dealing with big data; moving computation is cheaper than moving data. This helps to minimize network congestion and increase the overall throughput of the system.

Multitenancy: With the core development of Hadoop at Yahoo, primarily to support large-scale computation, HDFS also acquired a permission model, quotas, and other features to improve its multitenant operation. YARN was therefore designed to support multitenancy in its core architecture. Since cluster resource allocation/management is at the heart of YARN, sharing processing and storage capacity across clusters was central to the design. YARN has the notion of pluggable schedulers, and the Capacity Scheduler with YARN has been enhanced to provide a flexible resource model, elastic computing, application limits, and other necessary features that enable multiple tenants to securely share the cluster in an optimized way.

Support for programming models: The MapReduce programming model is no doubt great for many applications, but not for everything in the world of computation. As the world of big data is still in its inception phase, organizations are heavily investing in R&D to develop new and evolving frameworks to solve the variety of problems that big data brings.

A flexible resource model: Besides the mismatch with the emerging frameworks' requirements, the fixed number of slots for resources had serious problems. It was straightforward for YARN to come up with a flexible and generic resource management model.

A secure and auditable operation: As Hadoop continued to grow to manage more tenants with a myriad of use cases across different industries, the requirements for isolation became more demanding. Also, the authorization model lacked strong and scalable authentication. This is because Hadoop was designed with parallel processing in mind, with no comprehensive security. Security was an afterthought. YARN understands this and adds security-related requirements into its design.

Reliability/availability: Although fault tolerance is in the core design, in reality maintaining a large Hadoop cluster is a tedious task. All issues related to high availability, failures, failures on restart, and reliability were therefore a core requirement for YARN.

Backward compatibility: Hadoop 1.x has been in the picture for a while, with many successful production deployments across many industries. This massive installation base of MapReduce applications and the ecosystem of related projects, such as Hive, Pig, and so on, would not tolerate a radical redesign. Therefore, the new architecture reused as much code from the existing framework as possible, and no major surgery was conducted on it. This made MRv2 able to ensure satisfactory compatibility with MRv1 applications.

Summary

In this chapter, you learned what YARN is and how it has turned out to be the modern operating system for Hadoop, making it a multiapplication platform.

In Chapter 2, YARN Architecture, we will be talking about the architecture details of YARN.

Chapter 2. YARN Architecture

This chapter dives deep into YARN architecture, its core components, and how they interact to deliver optimal resource utilization, better performance, and manageability. It also focuses on some important terminology concerning YARN.

In this chapter, we will cover the following topics:

Core components of YARN architecture
Interaction and flow of YARN components
ResourceManager scheduling policies
Recent developments in YARN

The motivation behind the YARN architecture is to support more data processing models, such as Apache Spark, Apache Storm, Apache Giraph, Apache HAMA, and so on, than just MapReduce. YARN provides a platform to develop and execute distributed processing applications. It also improves efficiency and resource-sharing capabilities.

The design decision behind the YARN architecture is to separate the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons, that is, a cluster-level ResourceManager (RM) and an application-specific ApplicationMaster (AM). The YARN architecture follows a master-slave model in which the ResourceManager is the master and the node-specific slave is the NodeManager (NM). The global ResourceManager and per-node NodeManagers build a most generic, scalable, and simple platform for distributed application management. The ResourceManager is the supervisor component that manages the resources among the applications in the whole system. The per-application ApplicationMaster is the application-specific daemon that negotiates resources from the ResourceManager and works hand in hand with the NodeManagers to execute and monitor the application's tasks.

The following diagram explains how the JobTracker is replaced by a global-level ResourceManager and Applications Manager, and the per-node TaskTracker is replaced by an application-level ApplicationMaster to manage its functions and responsibilities. The JobTracker and TaskTracker only support MapReduce applications, with limited scalability and poor cluster utilization. Now, YARN supports multiple distributed data processing models with improved scalability and cluster utilization.

The ResourceManager has a cluster-level scheduler that has responsibility for resource allocation to all the running tasks as per the ApplicationMasters' requests. The primary responsibility of the ResourceManager is to allocate resources to the application(s). The ResourceManager is not responsible for tracking the status of an application or monitoring tasks. Also, it doesn't guarantee restarting/balancing tasks in the case of application or hardware failure.

The application-level ApplicationMaster is responsible for negotiating resources from the ResourceManager on application submission, such as memory, CPU, disk, and so on. It is also responsible for tracking the application's status and monitoring application processes in coordination with the NodeManager.

Let's have a look at the high-level architecture of Hadoop 2.0. As you can see, more applications can be supported by YARN than just the MapReduce application. The key component of Hadoop 2 is YARN, for better cluster resource management, and the underlying filesystem remains the same, the Hadoop Distributed File System (HDFS), as shown in the following image:

Here are some key concepts that we should know before exploring the YARN architecture in detail:

Application: This is the job submitted to the framework, for example a MapReduce job. It could also be a shell script.
Container: This is the basic unit of hardware allocation, for example a container that has 4 GB of RAM and one CPU. The container does optimized resource allocation; this replaces the fixed map and reduce slots in the previous versions of Hadoop.

Core components of YARN architecture

Here are some core components of the YARN architecture that we need to know:

ResourceManager
ApplicationMaster
NodeManager

ResourceManager

The ResourceManager acts as a global resource scheduler that is responsible for resource management and scheduling as per the ApplicationMasters' requests for the resource requirements of the application(s). It is also responsible for the management of hierarchical job queues. The ResourceManager can be seen in the following figure:

The preceding diagram gives more details about the components of the ResourceManager. The Admin and Client service is responsible for client interactions, such as a job request submission, start, restart, and so on. The Applications Manager is responsible for the management of every application. The ApplicationMaster Service interacts with every application's ApplicationMaster regarding resource or container negotiation, and the Resource Tracker Service coordinates between the NodeManagers and the ResourceManager. The ApplicationMaster Launcher service is responsible for launching a container for the ApplicationMaster on job submission from the client. The Scheduler and Security are the core parts of the ResourceManager. As already explained, the Scheduler is responsible for resource negotiation and allocation to the applications as per the requests of the ApplicationMasters. There are three different scheduler policies, FIFO, Fair, and Capacity, which will be explained in detail later in this chapter. The security component is responsible for generating and delegating an ApplicationToken and a ContainerToken to access the application and container, respectively.

ApplicationMaster (AM)

The ApplicationMaster is at a per-application level. It is responsible for the application's lifecycle management and for negotiating the appropriate resources from the Scheduler, tracking their status and monitoring progress; an example is the MapReduce ApplicationMaster.

NodeManager (NM)

The NodeManager acts as a per-machine agent and is responsible for managing the lifecycle of the containers and for monitoring their resource usage. The core components of the NodeManager are shown in the following diagram:

The component responsible for communication between the NodeManager and the ResourceManager is the NodeStatusUpdater. The ContainerManager is the core component of the NodeManager; it manages all the containers that run on the node. The NodeHealthCheckerService is the service that monitors the node's health and communicates the node's heartbeat to the ResourceManager via the NodeStatusUpdater service. The ContainerExecutor is the process responsible for interacting with native hardware or software to start or stop the container process. Management of the Access Control List (ACL) and access token verification is performed by the Security component.

Let’stakealookatonescenariotounderstandYARNarchitectureindetail.Refertothefollowingdiagram:

Saywehavetwoclientrequests:onewantstoexecuteasimpleshellscript,whileanotheronewantstoexecuteacomplexMapReducejob.TheShellScriptisrepresentedinmarooncolor,whiletheMapReducejobisrepresentedinlightgreencolorintheprecedingdiagram.

TheResourceManagerhastwomaincomponents,theApplicationManagerandtheScheduler.TheApplicationManagerisresponsibleforacceptingtheclient’sjobsubmissionrequests,negotiatingthecontainerstoexecutetheapplicationsspecifictotheApplicationMaster,andprovidingtheservicestorestarttheApplicationMasteronfailure.TheresponsibilityoftheScheduleristoallocateresourcestothevariousrunningapplicationswithrespecttotheapplicationresourcerequirementsandavailableresources.TheSchedulerisapureschedulerinthesensethatitprovidesnomonitoringortrackingfunctionsfortheapplication.Also,itdoesn’tofferanyguaranteesforrestartingafailedtaskeitherduetofailureintheapplicationorinthehardware.TheSchedulerperformsitsschedulingtasksbasedontheresourcerequirementsoftheapplication(s);itdoessobasedontheabstractnotionoftheresourcecontainer,whichincorporateselementssuchasCPU,memory,disk,andsoon.

TheNodeManageristheper-machineframeworkdaemonthatisresponsibleforthecontainers’lifecycles.Itisalsoresponsibleformonitoringtheirresourceusage,forexample,memory,CPU,disk,network,andsoon,andforreportingthistotheResourceManageraccordingly.Theapplication-levelApplicationMasterisresponsiblefornegotiatingtherequiredresourcecontainersfromthescheduler,trackingtheirstatus,andmonitoringprogress.Intheprecedingdiagram,youcanseethatbothjobs,ShellScriptand

MapReduce,haveanindividualApplicationMasterthatallocatesresourcesforjobexecutionandtotrack/monitorthejobexecutionstatus.

Now,takealookattheexecutionsequenceoftheapplication.Refertotheprecedingapplicationflowdiagram.

AclientsubmitstheapplicationtotheResourceManager.Intheprecedingdiagram,client1submitsaShellScriptRequest(marooncolor),andclient2submitsaMapReducerequest(greencolor):

1. Then,theResourceManagerallocatesacontainertostartuptheApplicationMasteraspertheapplicationsubmittedbytheclient:oneApplicationMasterfortheshellscriptandonefortheMapReduceapplication.

2. WhilestartingtheApplicationMaster,theResourceManagerregisterstheapplicationwiththeResourceManager.

3. AfterthestartupoftheApplicationMaster,itnegotiateswiththeResourceManagerforappropriateresourcesaspertheapplicationrequirement.

4. Then,afterresourceallocationfromtheResourceManager,theApplicationMasterrequeststhattheNodeManagerlaunchesthecontainersallocatedbytheResourceManager.

5. Onsuccessfullaunchingofthecontainers,theapplicationcodeexecuteswithinthecontainer,andtheApplicationManagerreportsbacktotheResourceManagerwiththeexecutionstatusoftheapplication.

6. Duringtheexecutionoftheapplication,theclientcanrequesttheApplicationMasterortheResourceManagerdirectlyfortheapplicationstatus,progressupdates,andsoon.

7. Onexecutionoftheapplication,theApplicationMasterrequeststhattheResourceManagerunregistersandshutdownsitsowncontainerprocess.
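To make the client's side of this sequence concrete, here is a minimal, hedged sketch of application submission using the YarnClient API from org.apache.hadoop.yarn.client.api in Hadoop 2.x. It is not the book's own example: the application name and queue are hypothetical placeholders, and the ApplicationMaster container specification is deliberately omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class MinimalYarnSubmit {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager using the cluster configuration
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the ResourceManager for a new application and fill in its submission context
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("hello-yarn");   // hypothetical name
        appContext.setQueue("default");

        // The AM launch command, jars, environment, and resource request would be set
        // here via setAMContainerSpec() and setResource() before submitting.

        // Submit; the ResourceManager then allocates a container for the ApplicationMaster
        ApplicationId appId = appContext.getApplicationId();
        yarnClient.submitApplication(appContext);

        // The client can poll the ResourceManager for status and progress (step 6)
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println("Application state: " + report.getYarnApplicationState());
    }
}

The sketch only exercises the submission and status-query steps of the sequence; writing a full client and ApplicationMaster is covered later in the book.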

YARN scheduler policies

As explained in the previous section, the ResourceManager acts as a pluggable global scheduler that manages and controls all the containers (resources). There are three different policies that can be applied to the scheduler, as per requirements and resource availability. They are as follows:

The FIFO scheduler
The Fair scheduler
The Capacity scheduler

The FIFO (First In First Out) scheduler

FIFO means First In First Out. As the name indicates, the job submitted first will get priority to execute; in other words, jobs run in the order of submission. FIFO is a queue-based scheduler. It is a very simple approach to scheduling and it does not guarantee performance efficiency, as each job would use the whole cluster for execution. Other jobs may therefore keep waiting to finish their execution, even though a shared cluster has a great capability to offer more than enough resources to many users.

The fair scheduler

Fair scheduling is the policy that assigns resources for the execution of applications so that all applications get an equal share of cluster resources over a period of time. For example, if a single job is running, it would get all the resources available in the cluster, and as the number of jobs increases, free resources will be given to the jobs so that each user gets a fair share of the cluster. If two users have submitted two different jobs, a short job that belongs to one user would complete in a small time span while a longer job submitted by the other user keeps running, so long jobs still make some progress.

In a fair scheduling policy, all jobs are placed into job pools, specific to users; accordingly, each user gets their own job pool. The user who submits more jobs than another user will not, on average, get more resources than that user. You may even define your own customized job pools with specified configurations. Fair scheduling is preemptive: if a pool has not received its fair share of resources to run a particular task for a certain period of time, the scheduler will kill tasks in pools that run over capacity, to release resources to the pools that run under capacity.

In addition to fair scheduling, the Fair scheduler allocates a guaranteed minimum share of resources to the pools. This is always helpful for the users, groups, or applications, as they always get sufficient resources for execution.

The capacity scheduler

The Capacity scheduler is designed to allow applications to share cluster resources in a predictable and simple fashion. These shares are commonly known as "job queues". The main idea behind capacity scheduling is to allocate available resources to the running applications, based on individual needs and requirements. There are additional benefits when running applications using capacity scheduling, as they can access the excess capacity resources that are not being used by any other applications.

The abstraction provided by the Capacity scheduler is the queue. It provides capacity guarantees by supporting multiple queues: a job is submitted to a queue, and queues are allocated a capacity in the sense that a certain capacity of resources will be at their disposal. All the jobs submitted to a queue will access the resources allocated to that job queue. Admins can control the capacity of each queue.

Here are some basic features of the Capacity scheduler (a sample scheduler selection snippet follows this list):

Security: Each queue has strict ACLs that control the authorization and authentication of users who can submit jobs to individual queues.
Elasticity: Free resources are allocated to any queue beyond its capacity. If there is demand for these resources from queues that run below capacity, then as soon as the tasks scheduled on these resources have completed, they will be assigned to jobs on queues that run under capacity.
Operability: The admin can, at any point in time, change queue definitions and properties.
Multitenancy: A set of limits is provided to prevent a single job, user, or queue from monopolizing the resources of the queue or the cluster. This is to ensure that the system is not overwhelmed by too many tasks, as could happen in a previous version of Hadoop.
Resource-based scheduling: Resource-intensive jobs are supported, as jobs can specifically demand higher resource requirements than the default.
Job priorities: Job queues can support job priorities. Within a queue, jobs with high priority have access to resources before jobs with lower priority.
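The scheduler implementation that the ResourceManager uses is chosen in yarn-site.xml through the yarn.resourcemanager.scheduler.class property (this is revisited in the administration chapter). A minimal sketch, assuming the Capacity scheduler is wanted, might look like the following; the Fair and FIFO schedulers can be selected in the same way with their respective classes:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- alternatives: ...scheduler.fair.FairScheduler or ...scheduler.fifo.FifoScheduler -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>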

Recent developments in YARN architecture

The ResourceManager can fail or need to be restarted for various reasons: bugs, hardware failure, deliberate downtime for upgrading, and so on.

We already saw how crucial the role of the ResourceManager in the YARN architecture is. The ResourceManager is a single point of failure; if the ResourceManager in a cluster goes down, everything on that cluster will be lost.

So in recent development of YARN, ResourceManager high availability (HA) became a high priority. This recent development of YARN not only covers ResourceManager HA, but also provides transparency to users and does not require them to monitor such events explicitly and resubmit the jobs.

The situation was overly complex in MRv1 because the JobTracker had to save too much metadata: both cluster state and per-application running state. This means that if the JobTracker dies, all the applications in a running state will be lost.

The development of ResourceManager recovery will be done in two phases:

1. RM Restart Phase I: In this phase, all the applications will be killed while restarting the ResourceManager on failure. No state of the application can be stored. Development of this phase is almost completed.

2. RM Restart Phase II: In Phase II, the applications will store their state across an RM failure; this means that applications are not killed, and they report their running state back to the RM after the RM comes back up.

The ResourceManager will be used only to save an application's submission metadata and cluster-level information. Application state persistence and the recovery of specific information will be managed by the application itself.

As shown in the preceding diagram, in the next version, we will get a pluggable state store, such as ZooKeeper or HDFS, that can store the state of the running applications. ResourceManager HA would contain a synchronized active-passive ResourceManager architectural model managed by ZooKeeper; as one goes down, the other can take over cluster responsibility without halting and losing information.
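As a rough illustration of what such a setup involves in Hadoop 2.4 and later, the following yarn-site.xml fragment enables ResourceManager recovery with a ZooKeeper-backed state store and an active-passive ResourceManager pair. This is a hedged sketch, not the book's configuration; the rm-ids, hostnames, and ZooKeeper quorum are hypothetical placeholders:

<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>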

Summary

In this chapter, we covered the architectural components of YARN, their responsibilities, and their interoperation. We also focused on some major development work going on in the community to overcome the drawbacks of the current release. In the next chapter, we will cover the installation steps of YARN.

Chapter 3. YARN Installation

In this chapter, we'll cover the installation of Hadoop and YARN and their configuration for a single-node and single-cluster setup. We will consider Hadoop as two different components: one is the Hadoop Distributed File System (HDFS); the other is YARN. The YARN components take care of resource allocation and the scheduling of the jobs that run over the data stored in HDFS. We'll cover most of the configurations to make YARN distributed computing more optimized and efficient.

In this chapter, we will cover the following topics:

Hadoop and YARN single-node installation
Hadoop and YARN fully-distributed mode installation
Operating Hadoop and YARN clusters

Single-node installation

Let's start with the steps for Hadoop's single-node installation, as it's easy to understand and set up. This way, we can quickly perform simple operations using Hadoop MapReduce and HDFS.

Prerequisites

Here are some prerequisites for Hadoop installations; make sure that the prerequisites are fulfilled before you start working with Hadoop and YARN.

Platform

GNU/Unix is supported for Hadoop installation as a development as well as a production platform. The Windows platform is also supported for Hadoop installation, with some extra configurations. Here, we'll focus more on Linux-based platforms, as Hadoop is more widely used with these platforms and works more efficiently on Linux compared to Windows systems. The following are the steps for a single-node Hadoop installation on Linux systems. If you want to install it on Windows, refer to the Hadoop wiki page for the installation steps.

SoftwareHere’ssomesoftware;makesurethattheyareinstalledbeforeinstallingHadoop.

Javamustbeinstalled.ConfirmwhethertheJavaversioniscompatiblewiththeHadoopversionthatistobeinstalledbycheckingtheHadoopwikipage(http://wiki.apache.org/hadoop/HadoopJavaVersions).

SSHandSSHDmustbeinstalledandrunning,astheyareusedbyHadoopscriptstomanageremoteHadoopdaemons.
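For a single-node setup, the Hadoop scripts usually need passwordless SSH to localhost. A minimal sketch, assuming an RSA key with an empty passphrase is acceptable on this test machine:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost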

Now, download the most recent stable release of the Hadoop distribution from the Apache mirrors and archives using the following command:

$ wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Note that at the time of writing this book, Hadoop 2.6.0 is the most recent stable release. Now use the following commands:

$ mkdir -p /opt/yarn
$ cd /opt/yarn
$ tar xvzf /root/hadoop-2.6.0.tar.gz

Starting with the installation

Now, unzip the downloaded distribution under the /etc/ directory. Change the Hadoop environmental parameters as per the following configurations.

Set the JAVA_HOME environmental parameter to the Java root installed before:

$ export JAVA_HOME=/etc/java/latest

Set the Hadoop home to the Hadoop installation directory:

$ export HADOOP_HOME=/etc/hadoop

Try running the hadoop command. It should display the Hadoop documentation; this indicates a successful Hadoop configuration.

Now, our Hadoop single-node setup is ready to run in the following modes.

The standalone mode (local mode)

By default, Hadoop runs in standalone mode as a single Java process. This mode is useful for development and debugging.
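As a quick check of standalone mode, you can run one of the bundled MapReduce examples against the local filesystem. This is a sketch based on the standard Hadoop 2.6.0 single-node instructions; the jar path assumes the distribution was unpacked under /opt/yarn as shown earlier:

$ cd /opt/yarn/hadoop-2.6.0
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*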

The pseudo-distributed mode

Hadoop can run on a single node in pseudo-distributed mode, where each daemon runs as a separate Java process. To run Hadoop in pseudo-distributed mode, follow these configuration instructions. First, navigate to /etc/hadoop/core-site.xml.

This configuration sets up the NameNode to run on localhost, port 9000. You can set the following property for the NameNode:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Now navigate to /etc/hadoop/hdfs-site.xml.

By setting the following property, we are ensuring that the replication factor of each data block is 3 (by default, the replication factor is 3):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

Then, format the Hadoop filesystem using this command:

$ $HADOOP_HOME/bin/hdfs namenode -format

After formatting the filesystem, start the NameNode and DataNode daemons using the next command. You can see the logs under the $HADOOP_HOME/logs directory by default:

$ $HADOOP_HOME/sbin/start-dfs.sh

Now, we can see the NameNode UI on the web interface. Hit http://localhost:50070/ in the browser.

Create the HDFS directories that are required to run MapReduce jobs:

$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user
$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user/{username}

To run a MapReduce job on YARN in pseudo-distributed mode, you need to start the ResourceManager and NodeManager daemons. Navigate to /etc/hadoop/mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Navigate to /etc/hadoop/yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Now, start the ResourceManager and NodeManager daemons by issuing this command:

$ sbin/start-yarn.sh

By simply navigating to http://localhost:8088/ in your browser, you can see the web interface for the ResourceManager. From here, you can start, restart, or stop the jobs.
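At this point you can verify the YARN setup by submitting one of the bundled examples; a sketch, assuming the Hadoop 2.6.0 examples jar shipped with the distribution:

$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 16 1000

The job should appear in the ResourceManager web interface while it runs.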

To stop the YARN daemons, you need to run the following command:

$ $HADOOP_HOME/sbin/stop-yarn.sh

This is how we can configure Hadoop and YARN on a single node in standalone and pseudo-distributed modes. Moving forward, we will focus on fully-distributed mode. As the basic configuration remains the same, we only need to do some extra configuration for fully-distributed mode. A single-node setup is mainly used for the development and debugging of distributed applications, while fully-distributed mode is used for the production setup.

The fully-distributed mode

In the previous section, we highlighted the standalone Hadoop and YARN configurations, and in this section we'll focus on the fully-distributed mode setup. This section describes how to install, configure, and manage Hadoop and YARN in fully-distributed, very large clusters with thousands of nodes in them.

In order to start with fully-distributed mode, we first need to download the stable version of Hadoop from the Apache mirrors. Installing Hadoop in distributed mode generally means unpacking the software distribution on each machine in the cluster or installing Red Hat Package Managers (RPMs). As Hadoop follows a master-slave architecture, one machine in the cluster is designated as the NameNode (NN), one as the ResourceManager (RM), and the rest of the machines, the DataNodes (DN) and NodeManagers (NM), will typically act as slaves.

After the successful unpacking of the software distribution on each cluster machine or RPM installation, you need to take care of a very important part of the Hadoop installation phase: Hadoop configuration.

Hadoop typically has two types of configuration: one is the read-only default configuration (core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml), while the other is the site-specific configuration (core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml). All these files are found under the $HADOOP_HOME/conf directory.

In addition to the preceding configuration files, the Hadoop-environment and YARN-environment specific files are found in conf/hadoop-env.sh and conf/yarn-env.sh. For the Hadoop and YARN cluster configuration, you need to set up an environment in which the Hadoop daemons can execute. The Hadoop/YARN daemons are the NameNode/ResourceManager (masters) and the DataNode/NodeManager (slaves).

First, make sure that JAVA_HOME is correctly specified on each node.

Here are some important configuration parameters with respect to each daemon:

NameNode: HADOOP_NAMENODE_OPTS
DataNode: HADOOP_DATANODE_OPTS
Secondary NameNode: HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager: YARN_RESOURCEMANAGER_OPTS
NodeManager: YARN_NODEMANAGER_OPTS
WebAppProxy: YARN_PROXYSERVER_OPTS
MapReduce JobHistory Server: HADOOP_JOB_HISTORYSERVER_OPTS

For example, to run the NameNode in parallel GC mode, the following line should be added to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

Here are some important configuration parameters with respect to each daemon and its configuration files.

Navigate to conf/core-site.xml and configure it as follows:

fs.defaultFS: The NameNode URI, hdfs://<hdfshost>:<hdfsport>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<hdfshostname>:<hdfsport></value>
  <description>It is a NameNode hostname</description>
</property>

io.file.buffer.size: 4096, the read and write buffer size of files.

This is the buffer size for I/O (read/write) operations on sequence files stored in disk files, that is, it determines how much data is buffered in I/O pipes before transferring it to other operations during read/write operations. It should be a multiple of the OS filesystem block size.

<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>read and write buffer size of files</description>
</property>

Now navigate to conf/hdfs-site.xml. Here is the configuration for the NameNode:

dfs.namenode.name.dir: The path on the local filesystem where the NameNode stores the namespace and transaction logs.
dfs.namenode.hosts: The list of permitted DataNodes.
dfs.namenode.hosts.exclude: The list of excluded DataNodes.
dfs.blocksize: The default value is 268435456. The HDFS block size is 256 MB for large filesystems.
dfs.namenode.handler.count: The default value is 100. More NameNode server threads to handle RPCs from a large number of DataNodes.

The configuration for the DataNode is as follows:

dfs.datanode.data.dir: A comma-delimited list of paths on the local filesystems where the DataNode stores its blocks.
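Put together, a minimal hdfs-site.xml for a fully-distributed cluster might look like the following sketch; the local paths are hypothetical and should point at real disks on your NameNode and DataNodes:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/data/hadoop/hdfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/var/data/hadoop/hdfs/dn</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
</configuration>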

Now navigate to conf/yarn-site.xml. We'll take a look at the configurations related to the ResourceManager and NodeManager:

yarn.acl.enable: Values are true or false to enable or disable ACLs. The default value is false.
yarn.admin.acl: This refers to the admin ACL. The default is *, which means anyone can do admin tasks. This ACL sets admins on the cluster and can be a comma-delimited list of users or groups to set more than one admin.
yarn.log-aggregation-enable: This is true or false to enable or disable log aggregation.

Now, we will take a look at the configurations for the ResourceManager in the conf/yarn-site.xml file:

yarn.resourcemanager.address: This is the ResourceManager host:port for clients to submit jobs.
yarn.resourcemanager.scheduler.address: This is the ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources.
yarn.resourcemanager.resource-tracker.address: This is the ResourceManager host:port for NodeManagers.
yarn.resourcemanager.admin.address: This is the ResourceManager host:port for administrative commands.
yarn.resourcemanager.webapp.address: This is the ResourceManager web UI host:port.
yarn.resourcemanager.scheduler.class: This is the ResourceManager Scheduler class. The values are CapacityScheduler, FairScheduler, and FifoScheduler.
yarn.scheduler.minimum-allocation-mb: This is the minimum limit of memory to allocate to each container request in the ResourceManager.
yarn.scheduler.maximum-allocation-mb: This is the maximum limit of memory to allocate to each container request in the ResourceManager.
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path: This is the list of permitted/excluded NodeManagers. If necessary, use these files to control the list of permitted NodeManagers.
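As an illustration (not the book's own listing), the ResourceManager section of yarn-site.xml could be sketched as follows; the hostname and memory limits are placeholder values chosen for this example:

<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager.example.com:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>resourcemanager.example.com:8088</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>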

Now take a look at the configurations for the NodeManager in conf/yarn-site.xml:

yarn.nodemanager.resource.memory-mb: This refers to the available physical memory (in MB) for the NodeManager. It defines the total memory resources on the NodeManager to be made available to the running containers.
yarn.nodemanager.vmem-pmem-ratio: This refers to the maximum ratio by which virtual memory usage of tasks may exceed physical memory.
yarn.nodemanager.local-dirs: This refers to the list of directory paths on the local filesystem where intermediate data is written. This should be a comma-separated list.
yarn.nodemanager.log-dirs: This refers to the path on the local filesystem where logs are written.
yarn.nodemanager.log.retain-seconds: This refers to the time (in seconds) to retain log files on the NodeManager. The default value is 10800 seconds. This configuration is applicable only if log aggregation is disabled.
yarn.nodemanager.remote-app-log-dir: This is the HDFS directory path to which logs are moved after application completion. The default path is /logs. This configuration is applicable only if log aggregation is enabled.
yarn.nodemanager.remote-app-log-dir-suffix: This refers to the suffix appended to the remote log directory. This configuration is applicable only if log aggregation is enabled.
yarn.nodemanager.aux-services: This refers to the shuffle service that specifically needs to be set for MapReduce applications.
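Again as a rough sketch with placeholder values (not the book's listing), the NodeManager side of yarn-site.xml might contain:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/var/data/hadoop/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/var/data/hadoop/yarn/logs</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>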

HistoryServer

The HistoryServer provides all YARN applications with a central location to aggregate their completed jobs for historical reference and debugging. The settings for the MapReduce JobHistory Server can be found in the mapred-default.xml file:

mapreduce.jobhistory.address: The MapReduce JobHistory Server host:port. The default port is 10020.
mapreduce.jobhistory.webapp.address: This is the MapReduce JobHistory Server web UI host:port. The default port is 19888.
mapreduce.jobhistory.intermediate-done-dir: This is the directory where history files are written by MapReduce jobs (in HDFS). The default is /mr-history/tmp.
mapreduce.jobhistory.done-dir: This is the directory where history files are managed by the MR JobHistory Server (in HDFS). The default is /mr-history/done.
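To override these defaults on a cluster, the corresponding properties go into mapred-site.xml; here is a sketch with a hypothetical host:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>historyserver.example.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>historyserver.example.com:19888</value>
</property>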

Slave files

With respect to the Hadoop slave and YARN slave nodes, generally one chooses one node in the cluster as the NameNode (Hadoop master), another node as the ResourceManager (YARN master), and the rest of the machines act as both Hadoop slave DataNodes and YARN slave NodeManagers. List all the slaves, one hostname or IP address per line, in your Hadoop conf/slaves file.
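For example, a conf/slaves file for a small cluster might simply contain one hypothetical hostname per line:

slave1.example.com
slave2.example.com
slave3.example.com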

Operating Hadoop and YARN clusters

This is the final stage of the Hadoop and YARN cluster setup and configuration. Here are the commands that need to be used to start and stop the Hadoop and YARN clusters.

Starting Hadoop and YARN clusters

To start Hadoop and the YARN cluster, use the following procedure:

1. Format the Hadoop distributed filesystem:

$HADOOP_HOME/bin/hdfs namenode -format <cluster_name>

2. The following command is used to start HDFS. Run it on the NameNode:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

3. Run this command to start DataNodes on all slave nodes:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

4. Start YARN with the following command on the ResourceManager:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

5. Execute this command to start NodeManagers on all slaves:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

6. Start a standalone WebAppProxy server. This is used for load-balancing purposes on a multiserver cluster:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR

7. Execute this command on the designated HistoryServer:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR

Stopping Hadoop and YARN clusters

To stop Hadoop and the YARN cluster, use the following procedure:

1. Use the following command on the NameNode to stop it:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode

2. Issue this command on all the slave nodes to stop the DataNodes:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode

3. To stop the ResourceManager, issue the following command on the specified ResourceManager:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager

4. The following command is used to stop the NodeManager on all slave nodes:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager

5. Stop the WebAppProxy server:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh stop proxyserver --config $HADOOP_CONF_DIR

6. Stop the MapReduce JobHistory Server by running the following command on the HistoryServer:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOOP_CONF_DIR

Web interfaces of the Ecosystem

That covers the Hadoop and YARN setup and configuration and the commands used to operate Hadoop and YARN. Here are some web interfaces used by Hadoop and YARN administrators for admin tasks:

The URL for the NameNode is http://<namenode_host>:<port>/ and the default HTTP port is 50070.

The URL for the ResourceManager is http://<resourcemanager_host>:<port>/ and the default HTTP port is 8088. The web UI for the NameNode can be seen as follows:

The URL for the MapReduce JobHistory Server is http://<jobhistoryserver_host>:<port>/ and the default HTTP port is 19888.

Summary

In this chapter, we covered Hadoop and YARN single-node and fully-distributed cluster setup and important configurations. We also covered the basic but important commands to administer Hadoop and YARN clusters. In the next chapter, we'll look at the Hadoop and YARN components in more detail.

Chapter 4. YARN and Hadoop Ecosystems

This chapter discusses YARN with respect to Hadoop, since it is very important to know where exactly YARN fits into Hadoop 2.

Hadoop 2 has undergone a complete change in terms of architecture and components compared to Hadoop 1.

In this chapter, we will cover the following topics:

A short introduction to Hadoop 1
The difference between MRv1 and MRv2
Where YARN fits in Hadoop 2
Old and new MapReduce APIs
Backward compatibility of MRv2 APIs
Practical examples of MRv1 and MRv2

The Hadoop 2 release

YARN came into the picture with the release of Hadoop 0.23 on November 11, 2011. This was the alpha version of the Hadoop 0.23 major release.

The major difference between 0.23 and pre-0.23 releases is that the 0.23 release had undergone a complete revamp in terms of the MapReduce engine and resource management. The 0.23 release separated out resource management and application lifecycle management.

A short introduction to Hadoop 1.x and MRv1

We will briefly look at basic Apache Hadoop 1.x and its processing framework, MRv1 (Classic), so that we can get a clear picture of the differences in Apache Hadoop 2.x MRv2 (YARN) in terms of architecture, components, and processing framework.

Apache Hadoop is a scalable, fault-tolerant distributed system for data storage and processing. The core programming model in Hadoop is MapReduce.

Since 2004, Hadoop has emerged as the de facto standard to store, process, and analyze hundreds of terabytes and even petabytes of data.

The major components in Hadoop 1.x are as follows:

NameNode: This keeps the metadata in the main memory.
DataNode: This is where the data resides in the form of blocks.
JobTracker: This assigns/reassigns MapReduce tasks to TaskTrackers in the cluster and tracks the status of each TaskTracker.
TaskTracker: This executes the task assigned by the JobTracker and sends the status of the task to the JobTracker.

The major components of Hadoop 1.x can be seen as follows:

A typical Hadoop 1.x cluster (shown in the preceding figure) can consist of thousands of nodes. It follows the master/slave pattern, where the NameNodes/JobTrackers are the masters and the DataNodes/TaskTrackers are the slaves.

The main data processing is distributed across the cluster in the DataNodes to increase parallel processing.

The master NameNode process (master for slave DataNodes) manages the filesystem, and the master JobTracker process (master for slave TaskTrackers) manages the tasks. The topology is seen as follows:

AHadoopclustercanbeconsideredtobemainlymadeupoftwodistinguishableparts:

HDFS:Thisistheunderlyingstoragelayerthatactsasafilesystemfordistributeddatastorage.Youcanputdataofanyformat,schema,andtypeonit,suchasstructured,semi-structured,orunstructureddata.ThisflexibilitymakesHadoopfitforthedatalake,whichissometimescalledthebitbucketorthelandingzone.MapReduce:Thisistheexecutionlayerwhichistheonlydistributeddata-processingframework.


MRv1 versus MRv2

MRv1 (MapReduce version 1) is part of Apache Hadoop 1.x and is an implementation of the MapReduce programming paradigm.

The MapReduce project itself can be broken into the following parts:

End-user MapReduce API: This is the API needed to develop the MapReduce application.
MapReduce framework: This is the runtime implementation of the various phases, such as the map phase, the sort/shuffle/merge aggregation phase, and the reduce phase.
MapReduce system: This is the backend infrastructure required to run MapReduce applications and includes things such as cluster resource management, scheduling of jobs, and so on.

Hadoop 1.x was written solely as an MR engine. Since it runs on a cluster, its cluster management component was also tightly coupled with the MR programming paradigm. The only thing that could be run on Hadoop 1.x was an MR job.

In MRv1, the cluster was managed by a single JobTracker and multiple TaskTrackers running on the DataNodes.

In Hadoop 2.x, the old MRv1 framework was rewritten to run on top of YARN. This application was named MRv2, or MapReduce version 2. It is the familiar MapReduce execution underneath, except that each job now runs on YARN.

The core difference between MRv1 and MRv2 is the way the MapReduce jobs are executed.

With Hadoop 1.x, it was the JobTracker and TaskTrackers, but now with YARN on Hadoop 2.x, it's the ResourceManager, ApplicationMaster, and NodeManagers.

However, the underlying concept, the MapReduce framework, remains the same.

Hadoop 2 has been redefined from HDFS-plus-MapReduce to HDFS-plus-YARN.

Referring to the following figure, YARN took control of the resource management and application lifecycle part of Hadoop 1.x.

YARN, therefore, definitely results in an increased ROI on the Hadoop investment, in the sense that the same Hadoop 2.x cluster resources can now be used to do multiple things, such as batch processing, real-time processing, SQL applications, and so on.

Earlier, running this variety of applications was not possible, and people had to use a separate Hadoop cluster for MapReduce and a separate one to do something else.

Understanding where YARN fits into Hadoop

If we refer to Hadoop 1.x in the first figure of this chapter, then it is clear that the responsibilities of the JobTracker mainly included the following:

Managing the computational resources in terms of map and reduce slots
Scheduling submitted jobs
Monitoring the executions of the TaskTrackers
Restarting failed tasks
Performing a speculative execution of tasks
Calculating the Job Counters

Clearly, the JobTracker alone does a lot of tasks together and is overloaded with lots of work.

This overloading of the JobTracker led to the redesign of the JobTracker, and YARN tried to reduce the responsibilities of the JobTracker in the following ways:

Cluster resource management and scheduling responsibilities were moved to the global ResourceManager (RM)
The application lifecycle management, that is, job execution and monitoring, was moved into a per-application ApplicationMaster (AM)

The global ResourceManager is seen in the following image:

If you look at the preceding figure, you will clearly see the disappearance of the single centralized JobTracker; its place is taken by a global ResourceManager.

Also, for each job, a tiny, dedicated JobTracker is created, which monitors the tasks specific to its job. This tiny JobTracker is run on a slave node.

This tiny, dedicated JobTracker is termed an ApplicationMaster in the new framework (refer to the following figure).

Also, the TaskTrackers are referred to as NodeManagers in the new framework.

Finally, looking at the JobTracker redesign (in the following figure), we can clearly see that the JobTracker's responsibilities are broken into a per-cluster ResourceManager and a per-application ApplicationMaster:

The ResourceManager topology can be seen as follows:

Old and new MapReduce APIs

The new API (which is also known as Context Objects) was primarily designed to make the API easier to evolve in the future and is type-incompatible with the old one.

The new API came into the picture from the 1.x release series. However, it was only partially supported in this series. So, the old API is recommended for the 1.x series:

Feature \ Release        1.x       0.23
Old MapReduce API        Yes       Deprecated
New MapReduce API        Partial   Yes
MRv1 runtime (Classic)   Yes       No
MRv2 runtime (YARN)      No        Yes

The old and new APIs can be compared as follows:

Old API: The old API is in the org.apache.hadoop.mapred package and is still present.
New API: The new API is in the org.apache.hadoop.mapreduce package.

Old API: The old API used interfaces for Mapper and Reducer.
New API: The new API uses abstract classes for Mapper and Reducer.

Old API: The old API used the JobConf, OutputCollector, and Reporter objects to communicate with the MapReduce system.
New API: The new API uses the context object to communicate with the MapReduce system.

Old API: In the old API, job control was done through the JobClient.
New API: In the new API, job control is performed through the Job class.

Old API: In the old API, job configuration was done with a JobConf object.
New API: In the new API, job configuration is done through the Configuration class via some of the helper methods on Job.

Old API: In the old API, both the map and reduce outputs are named part-nnnnn.
New API: In the new API, the map outputs are named part-m-nnnnn and the reduce outputs are named part-r-nnnnn.

Old API: In the old API, the reduce() method passes values as a java.lang.Iterator.
New API: In the new API, the reduce() method passes values as a java.lang.Iterable.

Old API: The old API controls mappers by writing a MapRunnable, but no equivalent exists for reducers.
New API: The new API allows both mappers and reducers to control the execution flow by overriding the run() method.

Backward compatibility of MRv2 APIs

This section discusses the scope and level of backward compatibility supported in Apache Hadoop MapReduce 2.x (MRv2).

Binary compatibility of org.apache.hadoop.mapred APIs

Binary compatibility here means that the compiled binaries should be able to run without any modification on the new framework.

Hadoop 1.x users who use the org.apache.hadoop.mapred APIs can simply run their MapReduce jobs on YARN just by pointing them to their Apache Hadoop 2.x cluster via the configuration settings.

They will not need any recompilation. All they will need to do is point their application to the YARN installation and point HADOOP_CONF_DIR to the corresponding configuration directory. The yarn-site.xml (configuration for YARN) and mapred-site.xml (configuration for MapReduce apps) files are present in the conf directory.

Also, mapred.job.tracker in mapred-site.xml is no longer necessary in Apache Hadoop 2.x. Instead, the following property needs to be added to the mapred-site.xml file to make MRv1 applications run on top of YARN:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Source compatibility of org.apache.hadoop.mapred APIs

Source incompatibility means that some code changes are required for compilation. Source incompatibility is orthogonal to binary compatibility.

Binaries for an application that is binary compatible but not source compatible will continue to run fine on the new framework. However, code changes are required to regenerate these binaries.

Apache Hadoop 2.x does not ensure complete binary compatibility with the applications that use the org.apache.hadoop.mapreduce APIs, as these APIs have evolved a lot since MRv1. However, it ensures source compatibility for the org.apache.hadoop.mapreduce APIs that break binary compatibility. In other words, you should recompile the applications that use these MapReduce APIs against the MRv2 JARs.

Existing applications that use MapReduce APIs are source compatible and can run on YARN either with no changes, with a simple recompilation, or with minor updates.

If an MRv1 MapReduce-based application fails to run on YARN, you are requested to investigate its source code and check whether MapReduce APIs are referred to or not. If they are referred to, you have to recompile the application against the MRv2 JARs that are shipped with Hadoop 2.

Practical examples of MRv1 and MRv2

We will now present a MapReduce example using both the old and new MapReduce APIs.

We will write a MapReduce program in Java that finds all the anagrams (a word, phrase, or name formed by rearranging the letters of another, such as cinema, formed from iceman) present in an input file and finally prints all the anagrams to the output file.

Here is the AnagramMapperOldAPI.java class that uses the old MapReduce API:

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * The Anagram mapper class gets a word as a line from the HDFS input,
 * sorts the letters in the word, and writes it back to the output collector as
 * Key: sorted word (letters in the word sorted)
 * Value: the word itself as the value.
 * When the reducer runs, we can group anagrams together based on the sorted key.
 */
public class AnagramMapperOldAPI extends MapReduceBase implements
    Mapper<Object, Text, Text, Text> {

  private Text sortedText = new Text();
  private Text originalText = new Text();

  @Override
  public void map(Object keyNotUsed, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {

    String line = value.toString().trim().toLowerCase().replace(",", "");
    System.out.println("LINE: " + line);

    StringTokenizer st = new StringTokenizer(line);
    System.out.println("---- Split by space ------");

    while (st.hasMoreElements()) {
      String word = (String) st.nextElement();
      char[] wordChars = word.toCharArray();
      Arrays.sort(wordChars);
      String sortedWord = new String(wordChars);
      sortedText.set(sortedWord);
      originalText.set(word);
      System.out.println("\torig: " + word + "\tsorted: " + sortedWord);
      output.collect(sortedText, originalText);
    }
  }
}

Here is the AnagramReducerOldAPI.java class that uses the old MapReduce API:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AnagramReducerOldAPI extends MapReduceBase implements
    Reducer<Text, Text, Text, Text> {

  private Text outputKey = new Text();
  private Text outputValue = new Text();

  public void reduce(Text anagramKey, Iterator<Text> anagramValues,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {

    String out = "";

    // Considering words with length > 2
    if (anagramKey.toString().length() > 2) {
      System.out.println("Reducer Key: " + anagramKey);
      while (anagramValues.hasNext()) {
        out = out + anagramValues.next() + "~";
      }
      StringTokenizer outputTokenizer = new StringTokenizer(out, "~");
      if (outputTokenizer.countTokens() >= 2) {
        out = out.replace("~", ",");
        outputKey.set(anagramKey.toString() + "-->");
        outputValue.set(out);
        System.out.println("************ Writing reducer output: "
            + anagramKey.toString() + "-->" + out);
        output.collect(outputKey, outputValue);
      }
    }
  }
}

Finally, to run the MapReduce program, we have the AnagramJobOldAPI.java class written using the old MapReduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class AnagramJobOldAPI {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: Anagram <input path> <output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(AnagramJobOldAPI.class);
    conf.setJobName("AnagramJobOldAPI");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(AnagramMapperOldAPI.class);
    conf.setReducerClass(AnagramReducerOldAPI.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}

Next, we will write the same Mapper, Reducer, and Job classes using the new MapReduce API.

Here is the AnagramMapper.java class that uses the new MapReduce API:

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AnagramMapper extends Mapper<Object, Text, Text, Text> {

  private Text sortedText = new Text();
  private Text originalText = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString().trim().toLowerCase().replace(",", "");
    System.out.println("LINE: " + line);

    StringTokenizer st = new StringTokenizer(line);
    System.out.println("---- Split by space ------");

    while (st.hasMoreElements()) {
      String word = (String) st.nextElement();
      char[] wordChars = word.toCharArray();
      Arrays.sort(wordChars);
      String sortedWord = new String(wordChars);
      sortedText.set(sortedWord);
      originalText.set(word);
      System.out.println("\torig: " + word + "\tsorted: " + sortedWord);
      context.write(sortedText, originalText);
    }
  }
}

Here is the AnagramReducer.java class that uses the new MapReduce API:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramReducer extends Reducer<Text, Text, Text, Text> {

  private Text outputKey = new Text();
  private Text outputValue = new Text();

  public void reduce(Text anagramKey, Iterable<Text> anagramValues,
      Context context) throws IOException, InterruptedException {

    String out = "";

    // Considering words with length > 2
    if (anagramKey.toString().length() > 2) {
      System.out.println("Reducer Key: " + anagramKey);
      for (Text anagram : anagramValues) {
        out = out + anagram.toString() + "~";
      }
      StringTokenizer outputTokenizer = new StringTokenizer(out, "~");
      if (outputTokenizer.countTokens() >= 2) {
        out = out.replace("~", ",");
        outputKey.set(anagramKey.toString() + "-->");
        outputValue.set(out);
        System.out.println("****** Writing reducer output: "
            + anagramKey.toString() + "-->" + out);
        context.write(outputKey, outputValue);
      }
    }
  }
}

Finally, here is the AnagramJob.java class that uses the new MapReduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnagramJob {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: Anagram <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(AnagramJob.class);
    job.setJobName("AnagramJob");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(AnagramMapper.class);
    job.setReducerClass(AnagramReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Preparing the input file(s)

1. Create a ${Inputfile_1} file with the following contents:

The Project Gutenberg Etext of Moby Word II by Grady Ward
hello there draw ehllo lemons melons solemn
Also, bluest bluets bustle sublet subtle

2. Create another file, ${Inputfile_2}, with the following contents:

Cinema is anagram to iceman
Second is stop, tops, opts, pots, and spot
Stool and tools
Secure and rescue

3. Copy these files into ${path_to_your_input_dir}.
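If ${path_to_your_input_dir} is a directory on HDFS (the usual case when running against a cluster rather than in local mode), the copy in step 3 can be done with the standard HDFS shell. The paths below are just the placeholders used in the steps, not fixed locations:

$ hdfs dfs -mkdir -p ${path_to_your_input_dir}
$ hdfs dfs -put ${Inputfile_1} ${Inputfile_2} ${path_to_your_input_dir}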

Running the job

Run the AnagramJobOldAPI.java class and pass the following as command-line args:

${path_to_your_input_dir} ${path_to_your_output_dir_old}

Now, run the AnagramJob.java class and pass the following as command-line args:

${path_to_your_input_dir} ${path_to_your_output_dir_new}
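If you have packaged the compiled classes into a JAR rather than running them from an IDE, the same two runs can be submitted from the command line with the yarn command; the JAR name below is only an example:

$ yarn jar anagram-examples.jar AnagramJobOldAPI ${path_to_your_input_dir} ${path_to_your_output_dir_old}
$ yarn jar anagram-examples.jar AnagramJob ${path_to_your_input_dir} ${path_to_your_output_dir_new}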

Result

The final output is written to ${path_to_your_output_dir_old} and ${path_to_your_output_dir_new}.

These are the contents that we will see in the output file:

aceimn-->  cinema,iceman,
adn-->  and,and,and,
adrw-->  ward,draw,
belstu-->  subtle,bustle,bluets,bluest,sublet,
ceersu-->  rescue,secure,
ehllo-->  hello,ehllo,
elmnos-->  lemons,melons,solemn,
loost-->  stool,tools,
opst-->  pots,tops,stop,spot,opts,

Summary

In this chapter, we started with a brief history of Hadoop releases. Next, we covered the basics of Hadoop 1.x and MRv1. We then looked at the core differences between MRv1 and MRv2 and how YARN fits into a Hadoop environment. We also saw how the JobTracker's responsibilities were broken down in Hadoop 2.x.

We also talked about the old and new MapReduce APIs, their origin, differences, and support in YARN. Finally, we concluded the chapter with some practical examples using the old and new MapReduce APIs.

In the next chapter, you will learn about the administration part of YARN.

Chapter 5. YARN Administration

In this chapter, we will focus on YARN's administrative part and on the administrator roles and responsibilities in YARN. We will also gain a more detailed insight into the administration configuration settings and parameters, application container monitoring, and optimized resource allocation, as well as scheduling and multitenancy application support in YARN. We'll also cover the basic administration tools and configuration options of YARN.

The following topics will be covered in this chapter:

YARN container allocation and configurations
Scheduling policies
YARN multitenancy application support
YARN administration and tools

Container allocation

At a very fundamental level, a container is a group of physical resources such as memory, disk, network, CPU, and so on. There can be one or more containers on a single machine; for example, if a machine has 16 GB of RAM and an 8-core processor, then a single container could be 1 CPU core and 2 GB of RAM. This means that there are a total of 8 such containers on a single machine, or there could be a single large container occupying all the resources. So, a container is a physical notion of memory, CPU, network, disk, and so on in the cluster. The container's lifecycle is managed by the NodeManager, and the scheduling is done by the ResourceManager. The container allocation can be seen as follows:

YARN is designed to allocate resource containers to the individual applications in a shared, secure, and multitenant manner. When any job or task is submitted to the YARN framework, the ResourceManager takes care of the resource allocation to the application, depending on the scheduling configurations and the application's needs and requirements, via the ApplicationMaster. To achieve this goal, the central scheduler maintains metadata about all the applications' resource requirements; this leads to efficient scheduling decisions for all the applications that run in the cluster.

Let's take a look at how container allocation happens in a traditional Hadoop setup. In the traditional Hadoop approach, on each node there is a predefined and fixed number of map slots and a predefined and fixed number of reduce slots. The map and reduce functions are unable to share slots, as they are predefined for specific operations only. This static allocation is not efficient; for example, suppose one cluster has a fixed total of 32 map slots and 32 reduce slots. While running a MapReduce application, it took only 16 map slots and required more than 32 slots for reduce operations. The reduce operation is unable to use the 16 free mapper slots, as they are predefined for mapper functionality only, so the reduce function has to wait until some reduce slots become free.

To overcome this problem, YARN has container slots. Irrespective of the application, all containers are able to run all applications; for example, if YARN has 64 available containers in the cluster and is running the same MapReduce application, if the mapper function takes only 16 slots and the reducer requires more resource slots, then all other free resources in the cluster are allocated to the reducer operation. This makes the operation more efficient and productive.

Essentially, an application demands the required resources from the ResourceManager to satisfy its needs via the ApplicationMaster. Then, by allocating the requested resources to the application, the ResourceManager responds to the application's ResourceRequest. The ResourceRequest contains the name of the resource that has been requested; the priority of the request within the various other ResourceRequests of the same application; the resource requirement capabilities, such as RAM, disk, CPU, network, and so on; and the number of containers required. Container allocation from the ResourceManager to the application means the successful fulfillment of that specific ResourceRequest.
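To make the ResourceRequest idea concrete, here is a minimal sketch in Java of how an ApplicationMaster might ask the ResourceManager for containers using the AMRMClient helper from the YARN client API. The memory, core, and priority values are arbitrary examples, not recommendations, and the code only works when it runs inside an ApplicationMaster container that is already authenticated with the ResourceManager:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class ResourceRequestSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // The AMRMClient wraps the ApplicationMaster <-> ResourceManager protocol
    AMRMClient<ContainerRequest> amRMClient = AMRMClient.createAMRMClient();
    amRMClient.init(conf);
    amRMClient.start();

    // Register this ApplicationMaster with the ResourceManager
    amRMClient.registerApplicationMaster("", 0, "");

    // Describe what one container should look like: 2 GB of RAM and 1 vcore
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(2048);
    capability.setVirtualCores(1);

    // Priority of this request relative to the application's other requests
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);

    // nodes/racks are null, so the scheduler may place the container anywhere
    ContainerRequest request = new ContainerRequest(capability, null, null, priority);
    amRMClient.addContainerRequest(request);

    // The actual allocations arrive asynchronously via allocate() heartbeats
    amRMClient.allocate(0.1f);
  }
}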

Container allocation to the application

Now, take a look at the following sequence diagram:

The diagram shows how container allocation is done for applications via the ApplicationMaster. It can be explained as follows:

1. The client submits the application request to the ResourceManager.
2. The ResourceManager registers the application with the ApplicationManager, generates the Application ID, and responds to the client with the successfully registered Application ID.
3. Then, the ResourceManager starts the client's ApplicationMaster in a separate available container. If no container is available, this request has to wait until a suitable container is found; the ApplicationMaster then sends the application registration request to register the application.
4. The ResourceManager shares all the minimum and maximum resource capabilities of the cluster with the ApplicationMaster. Then, the ApplicationMaster decides how to efficiently use the available resources to fulfill the application's needs.
5. Depending on the resource capabilities shared by the ResourceManager, the ApplicationMaster requests the ResourceManager to allocate a number of containers on behalf of the application.
6. The ResourceManager responds to the ResourceRequest by the ApplicationMaster as per the scheduling policies and resource availability. Container allocation by the ResourceManager means the successful fulfillment of the ResourceRequest by the ApplicationMaster.

While running the job, the ApplicationMaster sends heartbeats and the job progress information of the application to the ResourceManager. During the runtime of the application, the ApplicationMaster requests the release or allocation of more containers from the ResourceManager. When the job finishes, the ApplicationMaster sends a container de-allocation request to the ResourceManager and exits from its running container.

Container configurations

Here are some important configurations related to resource containers that are used to control containers.

To control the memory allocation to a container, the administrator needs to set the following three parameters in the yarn-site.xml configuration file:

yarn.nodemanager.resource.memory-mb: This is the amount of memory in MB that the NodeManager can use for containers.
yarn.scheduler.minimum-allocation-mb: This is the smallest amount of memory in MB allocated to a container by the ResourceManager. The default value is 1024 MB.
yarn.scheduler.maximum-allocation-mb: This is the largest amount of memory in MB allocated to a container by the ResourceManager. The default value is 8192 MB.

The CPU core allocations to the container are controlled by setting the following properties in the yarn-site.xml configuration file:

yarn.scheduler.minimum-allocation-vcores: This is the minimum number of CPU cores that are allocated to a container.
yarn.scheduler.maximum-allocation-vcores: This is the maximum number of CPU cores that are allocated to a container.
yarn.nodemanager.resource.cpu-vcores: This is the number of CPU cores on the node that the NodeManager can allocate to containers.
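For reference, a minimal yarn-site.xml fragment that sets these limits might look like the following; the values are purely illustrative and should be tuned to the node's actual RAM and core count:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>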

YARN scheduling policies

The YARN architecture has pluggable scheduling policies that depend on the application's requirements and the use case defined for the running application. You can find the YARN scheduling configurations in the yarn-site.xml file. Here, you can specify the scheduling system as either FIFO, capacity, or fair scheduling as per the application's needs. You can also find the running application's scheduling information in the ResourceManager UI, where many components of the scheduling system are described briefly.

As already mentioned, there are three types of scheduling policies that the YARN scheduler follows:

FIFO scheduler
Capacity scheduler
Fair scheduler

The FIFO (First In First Out) scheduler

This is the scheduling policy that has been in the system since Hadoop 1.0; the JobTracker used the FIFO scheduling policy. As the name indicates, FIFO means First In First Out, that is, the job submitted first will execute first. The FIFO scheduler policy does not follow any application priorities; this policy might work efficiently for smaller jobs, but while executing larger jobs, FIFO works very inefficiently. So, for heavily loaded clusters, this policy is not recommended. The FIFO scheduler can be seen as follows:

FIFO scheduler configurations

Here is the configuration property for the FIFO scheduler. By specifying this in yarn-site.xml, you can enable the FIFO scheduling policy in your YARN cluster:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>

The capacity scheduler

The capacity scheduling policy is one of the most popular pluggable scheduler policies; it allows multiple applications or user groups to share the Hadoop cluster resources in a secure way. Nowadays, this scheduling policy runs successfully and efficiently on many of the largest Hadoop production clusters.

The capacity scheduling policy allows a user or user group to share cluster resources in such a way that each user or group of users is guaranteed a certain capacity of the cluster. To enable this policy, the cluster administrator configures one or more queues with some precalculated shares of the total cluster resource capacity; this assignment guarantees the minimum resource capacity allocation to each queue. The administrator can also configure maximum and minimum constraints on the use of cluster resources (capacity) for each queue. Each queue has its own Access Control List (ACL) policies that manage which user has permission to submit applications to which queues. ACLs also manage the read and modify permissions at the queue level so that users cannot view or modify the applications submitted by other users.

Capacity scheduler configurations

The capacity scheduler comes with Hadoop YARN by default. Sometimes, it is still necessary to configure the policy in the YARN configuration files. Here are the configuration properties that need to be specified in yarn-site.xml to enable the capacity scheduler policy:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

The capacity scheduler, by default, comes with its own configuration file named $HADOOP_CONF_DIR/capacity-scheduler.xml; this should be present in the classpath so that the ResourceManager is able to locate it and load the properties accordingly.

The fair scheduler

The fair scheduler is one of the most popular pluggable schedulers for large clusters. It enables memory-intensive applications to share cluster resources in a very efficient way. Fair scheduling is a policy that allocates resources to applications in such a way that, on average, all applications get an equal share of the cluster resources over a given period.

In a fair scheduling policy, if one application is running on the cluster, it might request all the cluster resources for its execution, if needed. If other applications are submitted, the policy distributes the free resources among the applications in such a way that each application gets a fairly equal share of the cluster resources. The fair scheduler also supports preemption, where the ResourceManager might request resource containers back from the ApplicationMaster, depending on the job configurations. This might be a healthy or an unhealthy preemption.

In this scheduling model, every application is part of a queue, so resources are assigned to queues. By default, applications share a single queue, called the default queue. The fair scheduler supports many features at the queue level, such as assigning a weight to a queue (a heavyweight queue gets a larger share of resources than lightweight queues), the minimum and maximum shares that a queue would get, and a FIFO policy within the queue.

While submitting an application, users might specify the name of the queue the application wants to use resources from. For example, if the application requires a higher number of resources, it can specify the heavyweight queue so that it can get all the required resources that are available there.

The advantage of using the fair scheduling policy is that every queue gets a minimum share of the cluster resources. It is very important to note that when a queue contains applications that are waiting for resources, they get at least that minimum resource share. On the other hand, if a queue's resources are more than enough for its applications, then the excess is distributed equally among the running applications.

Fair scheduler configurations

To enable the fair scheduling policy in your YARN cluster, you need to specify the following property in the yarn-site.xml file:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

The fair scheduler also has a specific configuration file for a more detailed configuration setup; you will find it at $HADOOP_CONF_DIR/fair-scheduler.xml.
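As an illustration, a simple allocation file for the fair scheduler could define two queues with different weights and minimum shares. The queue names and values below are made up for the example, and the exact set of supported elements depends on your Hadoop version:

<allocations>
  <queue name="analytics">
    <weight>2.0</weight>
    <minResources>4096 mb, 4 vcores</minResources>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
  </queue>
</allocations>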

YARN multitenancy application support

YARN comes with built-in multitenancy support. Now, let's have a look at what multitenancy means. Consider a society that has multiple apartments in it: different families live in different apartments with security and privacy, but they all share the society's common areas, such as the society gate, garden, play area, and other amenities. Their apartments also share common walls. The same concept is followed in YARN: the applications running in the cluster share the cluster resources in a multitenant way. They share the cluster processing capacity, cluster storage capacity, data access securities, and so on. Multitenancy is achieved in the cluster by differentiating applications into multiple business units, for example, different queues and users for different types of applications.

Security and privacy can be achieved by configuring Linux and HDFS permissions to separate files and directories so as to create tenant boundaries. This can be achieved by integrating with LDAP or Active Directory. Security is used to enforce the tenant application boundaries, and this can be integrated with the Kerberos security model.

The following diagram explains how applications run in the YARN cluster in a multitenant way:

In the preceding YARN cluster, you can see that two jobs are running: one is Storm, and the other is a MapReduce job. They are sharing the cluster scheduler, cluster processing capacity, HDFS storage, and cluster security. We can also see that the two applications are running on a single YARN cluster. The MapReduce and Storm jobs are running over YARN and sharing the common cluster infrastructure, CPU, RAM, and so on. The Storm ApplicationMaster, Storm Supervisors, MapReduce ApplicationMaster, Mappers, and Reducers are running over the YARN cluster in a multitenant way by sharing cluster resources.

Administration of YARN

Now, we will take a look at some basic YARN administration configurations, starting from Hadoop 2.0, when YARN was introduced and changes were made to the Hadoop configuration files. Hadoop and YARN have the following basic configuration files:

core-default.xml: This file contains properties related to the system.
hdfs-default.xml: This file contains HDFS-related configurations.
mapred-default.xml: This configuration file contains properties related to the YARN MapReduce framework.
yarn-default.xml: This file contains YARN-related properties.

You will find all these properties listed on the Apache website (http://hadoop.apache.org/docs/current/) in the configuration section, with detailed information on each property and its default and possible values.

Administrative tools

YARN has several administrative tools by default; you can find them using the rmadmin command. Here is a more detailed explanation of the ResourceManager admin command:

$ yarn rmadmin -help

The rmadmin command is used to execute ResourceManager administrative commands. The full syntax is:

yarn rmadmin [-refreshQueues] [-refreshNodes]
  [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings]
  [-refreshAdminAcls] [-refreshServiceAcl] [-getGroups [username]] [-help [cmd]]

The preceding command contains the following fields:

-refreshQueues: Reloads the queues' ACLs, states, and scheduler-specific properties. The ResourceManager will reload the queue configuration file.
-refreshNodes: Refreshes the hosts information at the ResourceManager.
-refreshUserToGroupsMappings: Refreshes user-to-groups mappings.
-refreshSuperUserGroupsConfiguration: Refreshes superuser proxy groups mappings.
-refreshAdminAcls: Refreshes the ACLs for the administration of the ResourceManager.
-refreshServiceAcl: Reloads the service-level authorization policy file. The ResourceManager will reload the authorization policy file.
-getGroups [username]: Gets the groups that the given user belongs to.
-help [cmd]: Displays help for the given command, or all commands if none is specified.

The generic options supported are as follows:

-conf <configuration file>: This will specify an application configuration file.
-D <property=value>: This will use the value for the given property.
-fs <local|namenode:port>: This will specify a NameNode.
-jt <local|jobtracker:port>: This will specify a JobTracker.
-files <comma separated list of files>: This will specify comma-separated files to be copied to the MapReduce cluster.
-libjars <comma separated list of jars>: This will specify comma-separated JAR files to include in the classpath.
-archives <comma separated list of archives>: This will specify comma-separated archives to be unarchived on the compute machines.

The general command-line syntax is:

bin/hadoop command [genericOptions] [commandOptions]

Adding and removing nodes from a YARN cluster

A YARN cluster is horizontally scalable; you can add or remove worker nodes in or from the cluster without stopping it. To add a new node, all the software and configurations must first be set up on the new node.

The following property points to the file used to add new nodes to the cluster:

yarn.resourcemanager.nodes.include-path

For removing nodes from the cluster, the following property is used:

yarn.resourcemanager.nodes.exclude-path

The preceding two properties take as their value the path to a local file that contains the list of nodes that need to be added to or removed from the cluster. This file contains either the hostnames or the IPs of the worker nodes, separated by a newline, tab, or space.

After adding or removing a node, the YARN cluster does not require a restart. It just needs to refresh the list of worker nodes so that the ResourceManager gets informed about the newly added or removed nodes:

$ yarn rmadmin -refreshNodes
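Putting these pieces together, decommissioning a node might look like the following; the file location and hostname are examples only. First, point the exclude property at a file in yarn-site.xml on the ResourceManager:

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/yarn.exclude</value>
</property>

Then add the node's hostname to that file and refresh the node list:

$ echo "worker-node-07.example.com" >> /etc/hadoop/conf/yarn.exclude
$ yarn rmadmin -refreshNodes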

Administrating YARN jobs

The most important YARN admin task is administrating running YARN jobs. You can manage YARN jobs using the yarn application CLI command.

Using the yarn application command, the administrator can kill a job, list all jobs, and find out the status of a job. MapReduce jobs can additionally be controlled by the mapred job command.

Here is the usage of the yarn application command:

usage: application
-appTypes <Comma-separated list of application types>: Works with --list to filter applications based on their type.
-help: Displays help for all commands.
-kill <Application ID>: Kills the application.
-list: Lists applications from the RM. Supports optional use of -appTypes to filter applications based on application type.
-status <Application ID>: Prints the status of the application.
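For example, a typical administrative session might look like the following; the application ID shown is just a placeholder taken from a sample run:

$ yarn application -list
$ yarn application -status application_1381790835497_0003
$ yarn application -kill application_1381790835497_0003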

MapReduce job configurations

As MapReduce jobs now run in YARN containers instead of traditional MapReduce slots, it's necessary to configure MapReduce properties in mapred-site.xml. Here are some properties of MapReduce jobs that can be configured to run MapReduce jobs on YARN containers:

mapred.child.java.opts: This property is used to set the Java heap size for the child JVMs of maps, for example -Xmx4096m.
mapreduce.map.memory.mb: This property is used to configure the resource limit for map functions, for example 1536 MB.
mapreduce.reduce.memory.mb: This property is used to configure the resource limit for reduce functions, for example 3072 MB.
mapreduce.reduce.java.opts: This property is used to set the Java heap size for the child JVMs of reducers, for example -Xmx4096m.
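A corresponding mapred-site.xml fragment could look like the following; the sizes are examples only and should leave headroom between the JVM heap (-Xmx) and the container memory limit:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2048m</value>
</property>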

YARN log management

The log management CLI tool is very useful for YARN application log management. The administrator can use the logs CLI command described here:

$ yarn logs
Retrieve logs for completed YARN applications.
usage: yarn logs -applicationId <application ID> [OPTIONS]

general options are:
-appOwner <Application Owner>: AppOwner (assumed to be the current user if not specified)
-containerId <Container ID>: ContainerId (must be specified if the node address is specified)
-nodeAddress <Node Address>: NodeAddress in the format nodename:port (must be specified if the container ID is specified)

Let's take an example. If you want to print all the logs of a specific application, use the following command:

$ yarn logs -applicationId <application ID>

This command will print all the logs related to the specified application ID to the console.

YARN web user interface

In the YARN web user interface (http://localhost:8088/cluster), you can find information on the cluster nodes, the containers configured on each node, and the applications and their status. The YARN web interface is as follows:

Under the Scheduler section, you can see the scheduling information of all the submitted, accepted, and running applications, along with the total cluster capacity, the used and maximum capacity, and the resources allocated to each application queue. In the following screenshot, you can see the resources allocated to the default queue:

Under the Tools section, you can find the YARN configuration file details, scheduling information, container configurations, local logs of the jobs, and a lot of other information on the cluster.

Summary

In this chapter, we covered YARN container allocations and configurations, scheduling policies, and their configurations. We also covered multitenancy application support in YARN and some basic YARN administrative tools and settings. In the next chapter, we will cover some useful practical examples related to YARN and its ecosystem.

Chapter 6. Developing and Running a Simple YARN Application

In the previous chapters, we discussed the concepts of the YARN architecture, cluster setup, and administration. Now in this chapter, we will focus more on MapReduce applications with YARN and its ecosystems, with some hands-on examples. You previously learned about what happens when a client submits an application request to the YARN cluster and how YARN registers the application, allocates the required containers for its execution, and monitors the application while it's running. Now, we will see some practical use cases of YARN.

In this chapter, we will discuss:

Running sample applications on YARN
Developing YARN examples
Application monitoring and tracking

Now, let's start by running some of the sample applications that come as a part of the YARN distribution bundle.

Running sample examples on YARN

Running the available sample MapReduce programs is a simple task with YARN. The Hadoop version ships with some basic MapReduce examples. You can find them inside $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-<HADOOP_VERSION>.jar. The location of the file may differ depending on your Hadoop installation folder structure.

Let's include this in the YARN_EXAMPLES path:

$ export YARN_EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce

Now, we have all the sample examples in the YARN_EXAMPLES environment variable. You can access all the examples using this variable; to list all the available examples, try typing the following command on the console:

$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar

An example program must be given as the first argument.

The valid program names are as follows:

aggregatewordcount: This is an aggregate-based map/reduce program that counts the words in the input files
aggregatewordhist: This is an aggregate-based map/reduce program that computes the histogram of the words in the input files
bbp: This is a map/reduce program that uses Bailey-Borwein-Plouffe to compute the exact digits of Pi
dbcount: This is an example job that counts the pageview counts from a database
distbbp: This is a map/reduce program that uses a BBP-type formula to compute the exact bits of Pi
grep: This is a map/reduce program that counts the matches of a regex in the input
join: This is a job that effects a join over sorted, equally-partitioned datasets
multifilewc: This is a job that counts words from several files
pentomino: This is a map/reduce tile-laying program that finds solutions to pentomino problems
pi: This is a map/reduce program that estimates Pi using a quasi-Monte Carlo method
randomtextwriter: This is a map/reduce program that writes 10 GB of random textual data per node
randomwriter: This is a map/reduce program that writes 10 GB of random data per node
secondarysort: This is an example that defines a secondary sort to the reduce
sort: This is a map/reduce program that sorts the data written by the random writer
sudoku: This is a sudoku solver
teragen: This generates data for the terasort
terasort: This runs the terasort
teravalidate: This checks the results of the terasort
wordcount: This is a map/reduce program that counts the words in the input files
wordmean: This is a map/reduce program that counts the average length of the words in the input files
wordmedian: This is a map/reduce program that counts the median length of the words in the input files
wordstandarddeviation: This is a map/reduce program that counts the standard deviation of the length of the words in the input files

These are the sample examples that come as part of the YARN distribution by default. Now, let's try running some of the examples to showcase YARN's capabilities.

Running a sample Pi example

To run any application on top of YARN, you need to follow this Java command syntax:

$ yarn jar <application_jar.jar> <arg0> <arg1>

To run a sample example to calculate the value of Pi with 16 maps and 10,000 samples per map, use the following command:

$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar pi 16 10000

Note that we are using hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar here. The JAR version may change depending on your installed Hadoop distribution.

Once you hit the preceding command on the console, you will see the logs generated by the application on the console, as shown in the following output. The default logger configuration is displayed on the console. The default mode is INFO, and you may change it by overwriting the default logger settings by updating hadoop.root.logger=WARN,console in conf/log4j.properties:

Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
11/09/14 21:12:02 INFO mapreduce.Job: map 0% reduce 0%
11/09/14 21:12:09 INFO mapreduce.Job: map 25% reduce 0%
11/09/14 21:12:11 INFO mapreduce.Job: map 56% reduce 0%
11/09/14 21:12:12 INFO mapreduce.Job: map 100% reduce 0%
11/09/14 21:12:12 INFO mapreduce.Job: map 100% reduce 100%
11/09/14 21:12:12 INFO mapreduce.Job: Job job_1381790835497_0003 completed successfully
11/09/14 21:12:19 INFO mapreduce.Job: Counters: 44
  File System Counters
    FILE: Number of bytes read=358
    FILE: Number of bytes written=1365080
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=4214
    HDFS: Number of bytes written=215
    HDFS: Number of read operations=67
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=3
  Job Counters
    Launched map tasks=16
    Launched reduce tasks=1
    Data-local map tasks=14
    Rack-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=184421
    Total time spent by all reduces in occupied slots (ms)=8542
  Map-Reduce Framework
    Map input records=16
    Map output records=32
    Map output bytes=288
    Map output materialized bytes=448
    Input split bytes=2326
    Combine input records=0
    Combine output records=0
    Reduce input groups=2
    Reduce shuffle bytes=448
    Reduce input records=32
    Reduce output records=0
    Spilled Records=64
    Shuffled Maps=16
    Failed Shuffles=0
    Merged Map outputs=16
    GC time elapsed (ms)=195
    CPU time spent (ms)=7740
    Physical memory (bytes) snapshot=6143396896
    Virtual memory (bytes) snapshot=23142254400
    Total committed heap usage (bytes)=43340769024
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters
    Bytes Read=1848
  File Output Format Counters
    Bytes Written=98
Job Finished in 23.144 seconds
Estimated value of Pi is 3.14127500000000000000

You can compare the example that runs over Hadoop 1.x and the one that runs over YARN. You can hardly differentiate by looking at the logs, but you can clearly identify the difference in performance. YARN has backward-compatibility support with MapReduce 1.x without any code change.

Monitoring YARN applications with web GUI

Now, we will look at the YARN web GUI to monitor the examples. Using the ResourceManager UI, you can monitor the application submission ID, the user who submitted the application, the name of the application, the queue in which the application was submitted, the start time and finish time in the case of finished applications, and the final status of the application. The ResourceManager web UI differs from the UI of the Hadoop 1.x versions. The following screenshot shows the information we can get from the YARN web UI (http://localhost:8088).

Currently, the following web UI is showing information related to the Pi example we ran in the previous section, exploring the YARN web UI:

The following screenshot shows the Pi example running over the YARN framework; the Pi example was submitted by the root user into the default queue. An ApplicationMaster is assigned to it, which is currently in the running state. Similarly, you can also monitor the statuses of all the submitted, accepted, running, finished, and failed jobs from the ResourceManager web UI.

If you drill down further, you can see the ApplicationMaster-level information of the submitted application, such as the total containers allocated to the map and reduce functions and their running status. For example, the following screenshot shows that we already submitted a Pi example with 16 mappers. So in the following screenshot, you can see that the total number of containers allocated to the map function is 16, out of which 8 are completed and 8 are in the running state. You can also track the containers allocated to the reduce function and its progress from the UI:

You can see all the information displayed on the console while running the job. The same information will also be displayed on the web UI in a tabular form and in a more sophisticated way:

All the mapper and reducer jobs and filesystem counters will be displayed under the counter section of the YARN application web GUI. You can also explore the configurations of the application in the configurations section:

The following screenshot shows the statistics of the finished job, such as the total number of mappers and reducers, the start time, the finish time, and so on:

The following screenshot of the YARN web UI gives scheduling information about the YARN cluster, such as the cluster resource capacity and the containers allocated to the application or queue:

At the end, you will see the job summary page. You may also examine the logs by clicking on the logs link provided on the job summary page.

Once a user returns to the main cluster UI, chooses any finished applications, and then selects a job we recently ran, the user will be able to see the summary page, as shown in the following screenshot:

There are a few things to note as we move through the windows described earlier. First, as YARN manages applications, all input from YARN refers to an application. YARN has no data on the actual application. Data from the MapReduce job is provided by the MapReduce framework. Therefore, there are two clearly different data streams that are combined in the web GUI: YARN applications and MapReduce framework jobs. If the framework does not provide job information, then certain parts of the web GUI will have nothing to display.

A very important fact about YARN jobs is the dynamic nature of the container allocations to the mapper and reducer tasks. These are executed as YARN containers, and their respective number also changes dynamically as per the application's needs and requirements. This feature provides much better cluster utilization due to the dynamic container ("slots" in traditional language) allocations.

YARN’sMapReducesupportMapReducewastheonlyusecaseonwhichthepreviousversionsofHadoopweredeveloped.WeknowthatMapReduceismainlyusedfortheefficientandeffectiveprocessingofbigdata.Itisusedtoprocessagraphandmillionsofitsnodesandedges.Goingforwardwithtechnology,tocaterfortherequirementsofdatalocationavailability,faulttolerantsystems,andapplicationpriorities,YARNbuiltsupportforeverythingfromasimpleshellscriptapplicationtoacomplexMapReduceapplication.

Forthedatalocationavailability,MapReducer’sApplicationMasterhastofindoutthedatablocklocationsandallocationsofcontainerstoprocesstheseblocksaccordingly.Faulttolerantsystemmeanstheabilitytohandlefailedtasksandactonthemaccordingly,suchastohandlefailedmapandreducetasksandrerunthemwithothercontainersifneeded.Prioritiesareassignedtoeachapplicationinthequeue;thelogictohandlecomplexintra-applicationprioritiesformapandreducetaskshastobebuiltintotheApplicationMaster.Thereisnoneedtostartidlereducersbeforemappersfinishenoughdataprocessing.ReducersarenowunderthecontroloftheYARNApplicationMasterandarenotfixedastheyhadbeeninHadoopversion1.

The MapReduce ApplicationMaster

The MapReduce ApplicationMaster service is made up of multiple loosely-coupled services; these services interact with each other via events. Every service gets triggered on an event and produces an output that, in turn, triggers another service; this happens highly concurrently and without synchronization. All service components are registered with the central dispatcher service, and service information is shared between the multiple components via the ApplicationContext (AppContext).

In Hadoop version 1, all the running and submitted jobs were purely dependent on the JobTracker, so a failure of the JobTracker resulted in the loss of all running and submitted jobs. With YARN, the ApplicationMaster is the equivalent of the JobTracker. The ApplicationMaster runs the application and allocates containers on the nodes to it. It may fail, but YARN has the capability to restart the ApplicationMaster a specified number of times and the capability to recover completed tasks. Much like the JobTracker, the ApplicationMaster keeps the metrics of the jobs currently running. The following settings in the configuration files enable MapReduce recovery in YARN.

To enable the restart of the ApplicationMaster, execute the following steps:

1. Inside yarn-site.xml, you can tune the yarn.resourcemanager.am.max-retries property. The default is 2.
2. Inside mapred-site.xml, you can directly tune how many times a MapReduce ApplicationMaster should restart with the mapreduce.am.max-attempts property. The default is 2.
3. To enable recovery of completed tasks, look inside the mapred-site.xml file. The yarn.app.mapreduce.am.job.recovery.enable property enables the recovery of tasks. By default, it is true.
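As a sketch, the corresponding configuration entries built from the property names listed in the steps above would look like this; treat the values as defaults to verify against your Hadoop version. In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.am.max-retries</name>
  <value>2</value>
</property>

And in mapred-site.xml:

<property>
  <name>mapreduce.am.max-attempts</name>
  <value>2</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.job.recovery.enable</name>
  <value>true</value>
</property>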

Example YARN MapReduce settings

YARN has replaced the fixed-slot architecture for mappers and reducers with flexible dynamic container allocation. There are some important parameters for running MapReduce efficiently, and they can be found in mapred-site.xml and yarn-site.xml. As an example, the following are some settings that have been used to run a MapReduce application on YARN:

Property                               Property file      Value
mapreduce.map.memory.mb                mapred-site.xml    1536
mapreduce.reduce.memory.mb             mapred-site.xml    2560
mapreduce.map.java.opts                mapred-site.xml    -Xmx1024m
mapreduce.reduce.java.opts             mapred-site.xml    -Xmx2048m
yarn.scheduler.minimum-allocation-mb   yarn-site.xml      512
yarn.scheduler.maximum-allocation-mb   yarn-site.xml      4096
yarn.nodemanager.resource.memory-mb    yarn-site.xml      36864
yarn.nodemanager.vmem-pmem-ratio       yarn-site.xml      2.1

This YARN configuration allows a container size between 512 MB and 4 GB. If nodes have 36 GB of RAM for containers with a virtual memory ratio of 2.1, each map can have up to 3225.6 MB and each reducer can have up to 5376 MB of virtual memory. So, a compute node configured for 36 GB of container space can support up to 24 maps and 14 reducers, or any combination of mappers and reducers allowed by the available resources on the node.

YARN’scompatibilitywithMapReduceapplicationsForasmoothtransitionfromHadoopv1toYARN,applicationbackwardcompatibilityhasbeenthemajorgoaloftheYARNimplementationteamtoensurethatexistingMapReduceapplicationsthatwereprogrammedusingHadoopv1(MRv1)APIsandcompliedagainstthemcancontinuetorunoverYARN,withlittleenhancement.

YARNensuresfullbinarycompatibilitywithHadoopv1(MRv1)APIs;userswhousedtheorg.apache.hadoop.mapredAPIsprovidefullcompatibilitywiththeYARNframework,withoutrecompilation.YoucanuseyourMapReduceJARfileandbin/hadooptosubmitthemdirectlytoYARN.

YARNintroducednewAPIchangesforMapReduceapplicationsontopoftheYARNframeworkintoorg.apache.hadoop.mapreduce.

Ifanapplicationisdevelopedbyorg.apache.hadoop.mapreduceandcompliedbytheHadoopv1(MRv1)APIs,thenunfortunatelyYARNdoesn’tprovidecompatibilitywithit,asorg.apache.hadoop.mapreduceAPIshavegonethroughaYARNtransitionandshouldberecompiledagainstHadoopv2(MRv2)torunoverYARN.

Developing YARN applications

To develop a YARN application, you need to keep the YARN architecture in mind. YARN is a platform that allows distributed applications to take full advantage of the resources that YARN has deployed. Currently, resources can be things such as CPU, memory, and data. Many developers who come from a server-side application-development background or from a MapReduce developer background may be accustomed to a certain flow in the development and deployment cycle.

In this section, we'll describe the development lifecycle of YARN applications. We'll also focus on the key areas of YARN application development, such as how YARN applications can launch containers and how resource allocation is done for the applications, in detail.

The general workflow of YARN application submission is that the YARN client communicates with the ResourceManager through the ApplicationClientProtocol to generate a new ApplicationID. It then submits the application to the ResourceManager to run via the ApplicationClientProtocol. As a part of the protocol, the YARN client has to provide all the required information to the ResourceManager to launch the application's first container, that is, the ApplicationMaster. The YARN client also needs to provide the details of the dependency JARs/files for the application via command-line arguments. You can also specify the dependency JARs/files in the environment variables.

The following are some interface protocols that the YARN framework uses for inter-component communication:

ApplicationClientProtocol: This protocol is used by YARN for communication between the YARN client and the ResourceManager to launch a new application, check its status, or kill the application.
ApplicationMasterProtocol: This protocol is used by the YARN framework to communicate between the ApplicationMaster and the ResourceManager. It is used by the ApplicationMaster to register/unregister itself to/from the ResourceManager and also for the resource allocation/deallocation requests to the ResourceManager.
ContainerManagerProtocol: This protocol is used for communication between the ApplicationMaster and the NodeManager to start and stop containers and to obtain their status updates.

The YARN application workflow

Now, take a look at the following sequence diagram that describes the YARN application workflow and also explains how container allocation is done for an application via the ApplicationMaster:

Refer to the preceding diagram for the following details:

The client submits the application request to the ResourceManager.
The ResourceManager registers the application with the ApplicationManager, generates the Application ID, and responds to the client with the successfully registered Application ID.
Then, the ResourceManager starts the client's ApplicationMaster in a separate available container. If no container is available, this request has to wait till a suitable container is found; the ApplicationMaster then sends the application registration request to register the application.
The ResourceManager shares all the minimum and maximum resource capabilities of the cluster with the ApplicationMaster. Then, the ApplicationMaster decides how to efficiently use the available resources to fulfill the application's needs.
Depending on the resource capabilities shared by the ResourceManager, the ApplicationMaster requests the ResourceManager to allocate a number of containers on behalf of the application.
The ResourceManager responds to the ResourceRequest by the ApplicationMaster as per the scheduling policies and resource availability. Container allocation by the ResourceManager means the successful fulfillment of the ResourceRequest by the ApplicationMaster.

While running the job, the ApplicationMaster sends heartbeats and the job progress information of the application to the ResourceManager. During the running time of the application, the ApplicationMaster requests the release or allocation of more containers from the ResourceManager. When the job finishes, the ApplicationMaster sends a container deallocation request to the ResourceManager, thus exiting from its running container.

Writing the YARN client

The YARN client is required to submit the job to the YARN framework. It is a plain Java class with a main() function as its entry point. The main function of the YARN client is to submit the application to the YARN environment by instantiating an org.apache.hadoop.yarn.conf.YarnConfiguration object. The YarnConfiguration object depends on finding the yarn-default.xml and yarn-site.xml files in its classpath. All these requirements need to be satisfied to run the YARN client application. The YARN client process is shown in the following image:

Once a YarnConfiguration object is instantiated in your YARN client, we have to create an object of org.apache.hadoop.yarn.client.api.YarnClient using the YarnConfiguration object that has already been instantiated. The newly-instantiated YarnClient object will be used to submit applications to the YARN framework using the following steps:

1. Create an instance of a YarnClient object using YarnConfiguration.
2. Initialize the YarnClient with the YarnConfiguration object.
3. Start the YarnClient.
4. Get the YARN cluster, node, and queue information.
5. Get Access Control List information for the user running the client.
6. Create the client application.
7. Submit the application to the YARN ResourceManager.
8. Get application reports after submitting the application.

Also, the YarnClient will create a context for application submission and for the ApplicationMaster's container launch. The runnable YarnClient will take the command-line arguments from the user who is required to run the job. We'll see a simple code snippet for the YARN application client to get a better idea about it.
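Before walking through the lower-level protocol calls used in the rest of this section, here is a minimal, hedged sketch of the same flow using the higher-level YarnClient API from the steps above. The application name, queue, and resource sizes are placeholders, and the ContainerLaunchContext is left empty for brevity, so this only illustrates the call sequence rather than a complete, launchable application:

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SimpleYarnClient {
  public static void main(String[] args) throws Exception {
    // Steps 1-3: create, initialize, and start the YarnClient
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Step 4: cluster information (printed here just to show the call)
    System.out.println("NodeManagers: "
        + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

    // Step 6: create the client application and fill in its submission context
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("simple-yarn-app");   // placeholder name
    appContext.setQueue("default");                      // placeholder queue

    // The ApplicationMaster launch context would normally carry the AM command,
    // local resources, and environment; it is left empty in this sketch
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    appContext.setAMContainerSpec(amContainer);

    // Resources for the ApplicationMaster container (illustrative values)
    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(1024);
    amResource.setVirtualCores(1);
    appContext.setResource(amResource);

    // Steps 7-8: submit the application and fetch a report
    ApplicationId appId = yarnClient.submitApplication(appContext);
    ApplicationReport report = yarnClient.getApplicationReport(appId);
    System.out.println("Application " + appId + " is in state "
        + report.getYarnApplicationState());
  }
}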

The first step of the YARN client is to connect with the ResourceManager. The following is the code snippet for it:

// Declare ApplicationClientProtocol
ApplicationClientProtocol applicationsManager;

// Instantiate YarnConfiguration
YarnConfiguration yarnConf = new YarnConfiguration(conf);

// Get the ResourceManager IP address; if not provided, use the default
InetSocketAddress rmAddress =
    NetUtils.createSocketAddr(yarnConf.get(
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS));
LOGGER.info("Connecting to ResourceManager at " + rmAddress);

Configuration appsManagerServerConf = new Configuration(conf);
appsManagerServerConf.setClass(
    YarnConfiguration.YARN_SECURITY_INFO,
    ClientRMSecurityInfo.class, SecurityInfo.class);

// Initialize the ApplicationManager handle
applicationsManager = ((ApplicationClientProtocol) rpc.getProxy(
    ApplicationClientProtocol.class, rmAddress,
    appsManagerServerConf));

Once the connection between the YARN client and the ResourceManager is established, the YARN client needs to request a new Application ID from the ResourceManager:

GetNewApplicationRequest newRequest =
    Records.newRecord(GetNewApplicationRequest.class);
GetNewApplicationResponse newResponse =
    applicationsManager.getNewApplication(newRequest);

The response from the ApplicationManager contains the newly-generated ApplicationID for the application to be submitted by the YARN client. You can also get information related to the minimum and maximum resource capabilities of the cluster (using the GetNewApplicationResponse API). Using this information, developers can set the required resources for the ApplicationMaster container to launch.

The YARN client needs to set up the following information for the ApplicationSubmissionContext initialization; this includes all the information needed by the ResourceManager to launch the ApplicationMaster, as mentioned here:

Application information, such as the Application ID generated in the previous step
Name of the application
Queue and priority information, such as the queue to which the application needs to be submitted and the priorities assigned to the application
User information, that is, by whom the application is to be submitted
ContainerLaunchContext, that is, the information needed by the ApplicationMaster to launch, including local resources (such as JARs, binaries, and files)

It also contains the security-related information (security tokens) and environment variables (classpath settings), along with the command to be executed to launch the ApplicationMaster:

// Create a new application submission context for the ApplicationMaster
ApplicationSubmissionContext appContext =
    Records.newRecord(ApplicationSubmissionContext.class);
// Set the ApplicationId
appContext.setApplicationId(appId);
// Set the application name
appContext.setApplicationName(appName);

// Create a new container launch context for the ApplicationMaster
ContainerLaunchContext amContainer =
    Records.newRecord(ContainerLaunchContext.class);

// Set the local resources required for the ApplicationMaster:
// local files or archives as needed (for example, JAR files)
Map<String, LocalResource> localResources =
    new HashMap<String, LocalResource>();

// Copy the ApplicationMaster JAR to the filesystem and create a
// local resource pointing to the destination JAR path
FileSystem fs = FileSystem.get(conf);
Path src = new Path(appMasterJar); // local path of the AppMaster.jar file
String pathSuffix = appName + "/" + appId.getId() + "/AppMaster.jar";
Path dst = new Path(fs.getHomeDirectory(), pathSuffix);
// Copy the file from the source to the destination on HDFS
fs.copyFromLocalFile(false, true, src, dst);
// Get the HDFS file status from the path it was copied to
FileStatus jarStatus = fs.getFileStatus(dst);
LocalResource amJarResource = Records.newRecord(LocalResource.class);
// Set the type of resource - file or archive
// (archives are untarred at the destination by the framework)
amJarResource.setType(LocalResourceType.FILE);
// Set the visibility of the resource to the most private option
amJarResource.setVisibility(LocalResourceVisibility.APPLICATION);
// Set the location the resource is to be copied from
amJarResource.setResource(ConverterUtils.getYarnUrlFromPath(dst));
// Set the timestamp and length of the file so that the framework
// can do basic sanity checks for the local resource after it has
// been copied over, to ensure it is the same resource the client
// intended to use with the application
amJarResource.setTimestamp(jarStatus.getModificationTime());
amJarResource.setSize(jarStatus.getLen());
localResources.put("AppMaster.jar", amJarResource);
// Set the local resources into the launch context
amContainer.setLocalResources(localResources);

// Set the security tokens as needed
// amContainer.setContainerTokens(containerToken);

// Set up the environment needed for the launch context in which
// the ApplicationMaster will run
Map<String, String> env = new HashMap<String, String>();
// For example, we could set up the classpath needed.
// In the case of the shell script example, put the required resources
env.put(DSConstants.SCLOCATION, HdfsSCLocation);
env.put(DSConstants.SCTIMESTAMP, Long.toString(HdfsSCTimeStamp));
env.put(DSConstants.SCLENGTH, Long.toString(HdfsSCLength));
// Add the AppMaster.jar location to the classpath.
// By default, all the Hadoop-specific classpaths will already be
// available in $CLASSPATH, so we should be careful not to overwrite it.
StringBuilder classPathEnv = new StringBuilder("$CLASSPATH:./*:");
for (String str :
    conf.get(YarnConfiguration.YARN_APPLICATION_CLASSPATH).split(",")) {
    classPathEnv.append(':');
    classPathEnv.append(str.trim());
}
// Add log4j properties into the env variable if required
classPathEnv.append(":./log4j.properties");
env.put("CLASSPATH", classPathEnv.toString());
// Set the environment variables into the container
amContainer.setEnvironment(env);

// Set the necessary command to execute the ApplicationMaster
Vector<CharSequence> vargs = new Vector<CharSequence>(30);
// Set the java executable command
vargs.add("${JAVA_HOME}" + "/bin/java");
// Set the memory (-Xmx) based on the AM memory requirements
vargs.add("-Xmx" + amMemory + "m");
// Set the class name
vargs.add(amMasterMainClass);
// Set the parameters for the ApplicationMaster
vargs.add("--container_memory " + String.valueOf(containerMemory));
vargs.add("--num_containers " + String.valueOf(numContainers));
vargs.add("--priority " + String.valueOf(shellCmdPriority));
if (!shellCommand.isEmpty()) {
    vargs.add("--shell_command " + shellCommand + " ");
}
if (!shellArgs.isEmpty()) {
    vargs.add("--shell_args " + shellArgs + " ");
}
for (Map.Entry<String, String> entry : shellEnv.entrySet()) {
    vargs.add("--shell_env " + entry.getKey() + "=" + entry.getValue());
}
if (debugFlag) {
    vargs.add("--debug");
}
vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/AppMaster.stdout");
vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/AppMaster.stderr");

// Build the final command
StringBuilder command = new StringBuilder();
for (CharSequence str : vargs) {
    command.append(str).append(" ");
}
List<String> commands = new ArrayList<String>();
commands.add(command.toString());
// Set the command array into the container spec
amContainer.setCommands(commands);

// For launching an AM container, setting the user here is not needed
// amContainer.setUser(amUser);

Resource capability = Records.newRecord(Resource.class);
// For now, only memory is supported, so we set the memory requirements
capability.setMemory(amMemory);
amContainer.setResource(capability);
// Set the container launch context into the ApplicationSubmissionContext
appContext.setAMContainerSpec(amContainer);

Now the setup process is complete, and our YARN client is ready to submit the application to the ApplicationsManager:

// Create the application request to send to the ApplicationsManager
SubmitApplicationRequest appRequest =
    Records.newRecord(SubmitApplicationRequest.class);
appRequest.setApplicationSubmissionContext(appContext);
// Submit the application to the ApplicationsManager.
// Ignore the response, as either a valid response object is
// returned on success or an exception is thrown to denote the failure.
applicationsManager.submitApplication(appRequest);

During this process, the ResourceManager will accept the application submission request and allocate a container for the ApplicationMaster to run in. The progress of the task submitted by the client can be tracked by communicating with the ResourceManager and requesting an application status report via the ApplicationClientProtocol:

GetApplicationReportRequest reportRequest =
    Records.newRecord(GetApplicationReportRequest.class);
reportRequest.setApplicationId(appId);
GetApplicationReportResponse reportResponse =
    applicationsManager.getApplicationReport(reportRequest);
ApplicationReport report = reportResponse.getApplicationReport();

The response to the report request received from the ResourceManager contains general application information, such as the ApplicationID, the queue in which the application is running, and information on the user who submitted the application. It also contains the ApplicationMaster details: the host on which the ApplicationMaster is running and application-tracking information to monitor the progress of the application.

The application report also contains the application status information, such as SUBMITTED, RUNNING, FINISHED, and so on.

Also, the client can directly query the ApplicationMaster to get report information via the host:rpc_port obtained from the ApplicationReport.

Sometimes, the application may be wrongly submitted to another queue or may take longer than usual. In such cases, the client may want to kill the application. The ApplicationClientProtocol supports a forceful kill operation that can send a kill signal to the ApplicationMaster via the ResourceManager:

KillApplicationRequest killRequest =
    Records.newRecord(KillApplicationRequest.class);
killRequest.setApplicationId(appId);
applicationsManager.forceKillApplication(killRequest);

Writing the YARN ApplicationMaster

This task is the heart of the whole process. The ApplicationMaster is launched by the ResourceManager, and all the necessary information is provided by the client. As the ApplicationMaster is launched in the first container allocated by the ResourceManager, several parameters are made available to it by the ResourceManager via the environment. These parameters include the container ID of the ApplicationMaster's container, the application submission time, and details about the NodeManager host on which the ApplicationMaster is running. Interactions between the ApplicationMaster and the ResourceManager require the ApplicationAttemptID, which is obtained from the ApplicationMaster's ContainerID:

Map<String, String> envs = System.getenv();
String containerIdString =
    envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV);
if (containerIdString == null) {
    throw new IllegalArgumentException(
        "ContainerId not set in the environment");
}
ContainerId containerId =
    ConverterUtils.toContainerId(containerIdString);
ApplicationAttemptId appAttemptID =
    containerId.getApplicationAttemptId();

After the successful initialization of the ApplicationMaster, it needs to be registered with the ResourceManager via the ApplicationMasterProtocol. The ApplicationMaster and the ResourceManager communicate via the Scheduler interface:

// Connect to the ResourceManager and obtain a handle to it
YarnConfiguration yarnConf = new YarnConfiguration(conf);
InetSocketAddress rmAddress =
    NetUtils.createSocketAddr(yarnConf.get(
        YarnConfiguration.RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS));
LOG.info("Connecting to ResourceManager at " + rmAddress);
ApplicationMasterProtocol resourceManager =
    (ApplicationMasterProtocol)
        rpc.getProxy(ApplicationMasterProtocol.class, rmAddress, conf);

// Register the ApplicationMaster with the ResourceManager.
// Set the required info into the registration request:
//   - the ApplicationAttemptId
//   - the host on which the AppMaster is running
//   - the RPC port on which the AppMaster accepts requests from the client
//   - the tracking URL for the client to track AppMaster progress
RegisterApplicationMasterRequest appMasterRequest =
    Records.newRecord(RegisterApplicationMasterRequest.class);
appMasterRequest.setApplicationAttemptId(appAttemptID);
appMasterRequest.setHost(appMasterHostname);
appMasterRequest.setRpcPort(appMasterRpcPort);
appMasterRequest.setTrackingUrl(appMasterTrackingUrl);
RegisterApplicationMasterResponse response =
    resourceManager.registerApplicationMaster(appMasterRequest);

The ApplicationMaster sends status to the ResourceManager via heartbeat signals, and the timeout expiry intervals at the ResourceManager are defined by configuration settings in the YarnConfiguration. The ApplicationMasterProtocol is used to send these heartbeats and application progress information to the ResourceManager.

Depending on the application's requirements, the ApplicationMaster can ask the ResourceManager for a number of container resources to be allocated. For this request, the ApplicationMaster uses the ResourceRequest API to define container specifications. The ResourceRequest contains a hostname if the containers need to be hosted on specific hosts, or the * wildcard character, which implies that any host can fulfill the resource capabilities, such as the memory to be allocated to the container. It also contains priorities, so that containers allocated to specific tasks can be given higher priority. For example, in MapReduce tasks, containers for map tasks are requested at a higher priority than containers for reduce tasks:

// ResourceRequest
ResourceRequest request = Records.newRecord(ResourceRequest.class);
// Set up host requirements: whether a particular rack/host is expected.
// Refer to the APIs under org.apache.hadoop.net for more details;
// using * means that any host will do.
request.setHostName("*");
// Set the number of containers
request.setNumContainers(numContainers);
// Set the priority for the request
Priority pri = Records.newRecord(Priority.class);
pri.setPriority(requestPriority);
request.setPriority(pri);
// Set up the resource type requirements.
// For now, only memory is supported, so we set memory requirements.
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(containerMemory);
request.setCapability(capability);

After defining the container requests, the ApplicationMaster has to build an allocation request for the ResourceManager. The AllocateRequest consists of the requested containers, the containers to be released, the response ID (the ID of the response that will be sent back from the allocate call), and progress update information:

List<ResourceRequest> requestedContainers;
List<ContainerId> releasedContainers;
AllocateRequest req = Records.newRecord(AllocateRequest.class);
// The response ID set in the request will be sent back in
// the response so that the ApplicationMaster can
// match it to its original ask and act appropriately.
req.setResponseId(rmRequestID);
// Set the ApplicationAttemptId
req.setApplicationAttemptId(appAttemptID);
// Add the list of containers being asked for by the AM
req.addAllAsks(requestedContainers);
// The ApplicationMaster can ask the ResourceManager to deallocate
// containers that are no longer required.
req.addAllReleases(releasedContainers);
// The ApplicationMaster reports its progress by setting it here
req.setProgress(currentProgress);
AllocateResponse allocateResponse = resourceManager.allocate(req);

The response to the container allocation request from the ApplicationMaster to the ResourceManager contains information on the containers allocated to the ApplicationMaster, the number of hosts available in the cluster, and many more such details.

Containers are not immediately assigned to the ApplicationMaster by the ResourceManager. However, once the container request has been sent to the ResourceManager, the ApplicationMaster will eventually get the containers based on cluster capacity, priorities, and the cluster's scheduling policy:

// Retrieve the list of allocated containers from the response
List<Container> allocatedContainers =
    allocateResponse.getAllocatedContainers();
for (Container allocatedContainer : allocatedContainers) {
    LOG.info("Launching shell command on a new container."
        + ", containerId=" + allocatedContainer.getId()
        + ", containerNode=" + allocatedContainer.getNodeId().getHost()
        + ":" + allocatedContainer.getNodeId().getPort()
        + ", containerNodeURI=" + allocatedContainer.getNodeHttpAddress()
        + ", containerState=" + allocatedContainer.getState()
        + ", containerResourceMemory="
        + allocatedContainer.getResource().getMemory());
    LaunchContainerRunnable runnableLaunchContainer =
        new LaunchContainerRunnable(allocatedContainer);
    Thread launchThread = new Thread(runnableLaunchContainer);
    launchThreads.add(launchThread);
    launchThread.start();
}

// Check the resources currently available in the cluster
Resource availableResources = allocateResponse.getAvailableResources();
LOG.info("Current available resources in the cluster: " + availableResources);
// Based on this information, an ApplicationMaster can make
// appropriate decisions.

// Check the completed containers
List<ContainerStatus> completedContainers =
    allocateResponse.getCompletedContainersStatuses();
for (ContainerStatus containerStatus : completedContainers) {
    LOG.info("Got container status for containerID="
        + containerStatus.getContainerId()
        + ", state=" + containerStatus.getState()
        + ", exitStatus=" + containerStatus.getExitStatus()
        + ", diagnostics=" + containerStatus.getDiagnostics());
    int exitStatus = containerStatus.getExitStatus();
    if (0 != exitStatus) {
        // The container failed
        if (-100 != exitStatus) {
            // The application job on the container returned a non-zero
            // exit code; it counts as completed
            numCompletedContainers.incrementAndGet();
            numFailedContainers.incrementAndGet();
        } else {
            // Something else bad happened: the app job did not complete
            // for some reason. We should retry, as the container was lost.
            numRequestedContainers.decrementAndGet();
            // We do not need to release the container, as that has
            // already been done by the ResourceManager/NodeManager.
        }
    } else {
        // Nothing to do; the container completed successfully
        numCompletedContainers.incrementAndGet();
        LOG.info("Container completed successfully."
            + ", containerId=" + containerStatus.getContainerId());
    }
}

After container allocation is successfully performed for the ApplicationMaster, it has to set up the ContainerLaunchContext for the tasks that will run in the containers. Once the ContainerLaunchContext is set, the ApplicationMaster can request the ContainerManager to start the allocated container:

// Assume an allocated Container obtained from an AllocateResponse,
// for which initialization has already been done
Container container;
LOG.debug("Connecting to ContainerManager for containerid=" + container.getId());
// Connect to the ContainerManager on the allocated container
String cmIpPortStr = container.getNodeId().getHost() + ":"
    + container.getNodeId().getPort();
InetSocketAddress cmAddress = NetUtils.createSocketAddr(cmIpPortStr);
LOG.info("Connecting to ContainerManager at " + cmIpPortStr);
ContainerManager cm = ((ContainerManager)
    rpc.getProxy(ContainerManager.class, cmAddress, conf));

// Now we set up a ContainerLaunchContext
LOG.info("Setting up container launch context for containerid=" + container.getId());
ContainerLaunchContext ctx =
    Records.newRecord(ContainerLaunchContext.class);
ctx.setContainerId(container.getId());
ctx.setResource(container.getResource());
try {
    ctx.setUser(UserGroupInformation.getCurrentUser().getShortUserName());
} catch (IOException e) {
    LOG.info("Getting the current user failed when trying to launch the container: "
        + e.getMessage());
}

// Set the environment.
// Please note that the launched container does not inherit
// the environment of the ApplicationMaster, so all the
// necessary environment settings will need to be set up again
// for this allocated container.
Map<String, String> unixEnv;
ctx.setEnvironment(unixEnv);

// Set the local resources.
// Again, the local resources of the ApplicationMaster are not copied
// over by default to the allocated container. Thus, it is the
// responsibility of the ApplicationMaster to set up all the necessary
// local resources needed by the job that will be executed on the
// allocated container.
Map<String, LocalResource> localResources =
    new HashMap<String, LocalResource>();
// Assume that we are executing a shell script on the allocated container
// and that the shell script's location in the filesystem is known to us.
Path shellScriptPath;
LocalResource shellRsrc = Records.newRecord(LocalResource.class);
shellRsrc.setType(LocalResourceType.FILE);
shellRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
shellRsrc.setResource(
    ConverterUtils.getYarnUrlFromURI(new URI(shellScriptPath.toString())));
shellRsrc.setTimestamp(shellScriptPathTimestamp);
shellRsrc.setSize(shellScriptPathLen);
localResources.put("MyExecShell.sh", shellRsrc);
ctx.setLocalResources(localResources);

// Set the necessary command to execute on the allocated container
String command = "/bin/sh ./MyExecShell.sh"
    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";
List<String> commands = new ArrayList<String>();
commands.add(command);
ctx.setCommands(commands);

// Send the start request to the ContainerManager
StartContainerRequest startReq =
    Records.newRecord(StartContainerRequest.class);
startReq.setContainerLaunchContext(ctx);
try {
    cm.startContainer(startReq);
} catch (YarnRemoteException e) {
    LOG.info("Start container failed for containerId=" + container.getId());
    e.printStackTrace();
}

The ApplicationMaster will get the application status information via the ApplicationMasterProtocol. It may also monitor the application by querying the ContainerManager for the container status:

GetContainerStatusRequest statusReq =
    Records.newRecord(GetContainerStatusRequest.class);
statusReq.setContainerId(container.getId());
GetContainerStatusResponse statusResp;
try {
    statusResp = cm.getContainerStatus(statusReq);
    LOG.info("Container Status"
        + ", id=" + container.getId()
        + ", status=" + statusResp.getStatus());
} catch (YarnRemoteException e) {
    e.printStackTrace();
}

This code snippet explains how to write the YARN client and ApplicationMaster in general. The ApplicationMaster is actually an application-specific entity; each application or framework that wants to run over YARN has a different ApplicationMaster, but the flow is the same. For more details on the YARN client and ApplicationMaster for different frameworks, visit the Apache Foundation website.

Responsibilities of the ApplicationMaster

The ApplicationMaster is the application-specific library and is responsible for negotiating resources from the ResourceManager as per the client application's requirements and needs. The ApplicationMaster works with the NodeManagers to execute and monitor the containers and to track the application's progress. The ApplicationMaster itself runs in one of the containers allocated by the ResourceManager, and the ResourceManager tracks the progress of the ApplicationMaster.

The ApplicationMaster provides scalability to the YARN framework: each ApplicationMaster takes on per-application functionality that previously lived in the centralized resource manager, so the YARN cluster is able to scale across many more machines. Also, by moving all the application-specific code into the ApplicationMaster, YARN generalizes the system so that it can support multiple frameworks, just by writing an ApplicationMaster.

Summary

In this chapter, you learned how to use the bundled applications that come with the YARN framework, how to develop the YARN client and ApplicationMaster (the core parts of a YARN application), how to submit an application to YARN, how to monitor an application, and the responsibilities of the ApplicationMaster.

In the next chapter, you will learn to write some real-time practical examples.

Chapter 7. YARN Frameworks

It's the dawn of 2015, and big data is still in its booming stage. Many new start-ups and giants are investing huge amounts into developing POCs and new frameworks to cater to a new and emerging variety of problems. These frameworks are the new cutting-edge technologies or programming models that aim to solve problems across industries in the world of big data. As corporations try to use big data, they face new and unique sets of problems that they never faced before. Hence, to solve these new problems, many frameworks and programming models are coming onto the market.

YARN's support for multiple programming models and frameworks makes it ideal to be integrated with these new and emerging frameworks or programming models. With YARN taking responsibility for resource management and other necessary things (scheduling jobs, fault tolerance, and so on), it allows these new application frameworks to focus on solving the problems that they were specifically meant for.

At the time of writing this book, many new and emerging open source frameworks are already integrated with YARN.

In this chapter, we will cover the following frameworks that run on YARN:

- Apache Samza
- Storm on YARN
- Apache Spark
- Apache Tez
- Apache Giraph
- HOYA (HBase on YARN)
- KOYA (Kafka on YARN)

We will talk in detail about Apache Samza and Storm on YARN, where we will develop and run some sample applications. For the other frameworks, we will have a brief discussion.

Apache Samza

Samza is an open source project from LinkedIn and is currently an incubation project at the Apache Software Foundation. Samza is a lightweight distributed stream-processing framework for doing real-time processing of data. The version that is available for download from the Apache website is not the production version that LinkedIn uses.

Samza is made up of the following three layers:

- A streaming layer
- An execution layer
- A processing layer

Samza provides out-of-the-box support for all the preceding three layers:

- Streaming: This layer is supported by Kafka (another open source project from LinkedIn)
- Execution: This layer is supported by YARN
- Processing: This layer is supported by the Samza API

The following three pieces fit together to form Samza:

The following architecture should be familiar to anyone who has used Hadoop:

Before going into each of these three layers in depth, it should be noted that Samza's support is not limited to these systems. Both Samza's execution and streaming layers are pluggable and allow developers to implement alternatives as required.

Samza is a stream-processing system for running continuous computation on infinite streams of data.

Samza provides a system to process stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream-processing task and executes it as a Samza job. Samza then routes messages between the stream-processing tasks and the publish-subscribe systems that the messages are addressed to.

Samza works a lot like Storm, the Twitter-developed stream-processing technology, except that Samza runs on Kafka, LinkedIn's own messaging system. Samza was developed with a pluggable architecture, enabling developers to use the software with other messaging systems.

Apache Samza is basically a combination of the following technologies:

- Kafka: Samza uses Apache Kafka as its underlying message-passing system
- Apache YARN: Samza uses Apache YARN for task scheduling
- ZooKeeper: Both YARN and Kafka, in turn, rely on Apache ZooKeeper for coordination

More information is available on the official site at http://samza.incubator.apache.org/.

We will use the hello-samza project to develop a sample example of real-time stream processing.

We will write a Kafka producer using the Java Kafka APIs to publish a continuous stream of messages to a Kafka topic. Finally, we will write a Samza consumer using the Samza API to process these streams from the Kafka topic in real time. For simplicity, we will just print a message and a timestamp each time a message is received on the Kafka topic.

WritingaKafkaproducerLet’sfirstwriteaKafkaproducertopublishmessagestoaKafkatopic(namedstorm-sentence):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

/**
 * A simple Java class to publish messages into Kafka.
 *
 * @author nirmal.kumar
 */
public class KafkaStringProducerService {

    public Producer<String, String> producer;

    public Producer<String, String> getProducer() {
        return this.producer;
    }

    public void setProducer(Producer<String, String> producer) {
        this.producer = producer;
    }

    public KafkaStringProducerService(Properties prop) {
        setProducer(new Producer(new ProducerConfig(prop)));
    }

    /**
     * Loads producer.properties, which contains the following properties:
     *   kafka.zk.connect=192.xxx.xxx.xxx
     *   serializer.class=kafka.serializer.StringEncoder
     *   producer.type=async
     *   queue.buffering.max.ms=5000000
     *   queue.buffering.max.messages=1000000
     *   metadata.broker.list=192.xxx.xxx.xxx:9092
     *
     * Change the location of producer.properties in main() accordingly.
     *
     * @param filepath
     * @return
     */
    private static Properties getConfigurationProperties(String filepath) {
        File path = new File(filepath);
        Properties properties = new Properties();
        try {
            properties.load(new FileInputStream(path));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return properties;
    }

    /**
     * Publishes each message to Kafka.
     *
     * @param input
     * @param ii
     */
    public void execute(String input, int ii) {
        KeyedMessage data = new KeyedMessage("storm-sentence", input);
        this.producer.send(data);
        // Log to the console the number of messages published (every 100000)
        if ((ii != 0) && (ii % 100000 == 0))
            System.out.println("$$$$$$$ PUBLISHED " + ii + " messages @ "
                    + System.currentTimeMillis());
    }

    /**
     * Reads each line from the input message file.
     *
     * @param file
     * @return
     * @throws IOException
     */
    private static String readFile(String file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        String line = null;
        StringBuilder stringBuilder = new StringBuilder();
        String ls = System.getProperty("line.separator");
        while ((line = reader.readLine()) != null) {
            stringBuilder.append(line);
            stringBuilder.append(ls);
        }
        return stringBuilder.toString();
    }

    /**
     * Main method for invoking the Java application.
     * Pass two command-line arguments: the number of messages to publish
     * and the absolute path of the file containing the string message.
     *
     * @param args
     */
    public static void main(String[] args) {
        int ii = 0;
        int noOfMessages = Integer.parseInt(args[0]);
        String s = null;
        try {
            s = readFile(args[1]);
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Instantiate the service.
        // Change the location of producer.properties accordingly.
        KafkaStringProducerService service = new KafkaStringProducerService(
                getConfigurationProperties("/home/cloud/producer.properties"));
        System.out.println("******** START: Publishing " + noOfMessages
                + " messages @ " + System.currentTimeMillis());
        while (ii <= noOfMessages) {
            // Invoke the execute method to publish messages into Kafka
            service.execute(s, ii);
            ii++;
        }
        System.out.println("####### END: Published " + noOfMessages
                + " messages @ " + System.currentTimeMillis());
        try {
            service.producer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Create the producer.properties file somewhere, such as /home/cloud/producer.properties, and specify its location in the preceding Kafka producer Java class.

The producer.properties file will have the following information:
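The screenshot of the file is not reproduced here; based on the properties listed in the Javadoc comment of the producer class above, a minimal producer.properties would look roughly like this (the broker and ZooKeeper addresses are placeholders for your own hosts):

kafka.zk.connect=192.xxx.xxx.xxx
serializer.class=kafka.serializer.StringEncoder
producer.type=async
queue.buffering.max.ms=5000000
queue.buffering.max.messages=1000000
metadata.broker.list=192.xxx.xxx.xxx:9092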

Writing the hello-samza project

Let's now write a Samza consumer and package it with the hello-samza project:

1. Download and build the hello-samza project. Check out the hello-samza project:

git clone git://git.apache.org/incubator-samza-hello-samza.git hello-samza
cd hello-samza

The output of the preceding code can be seen here:

2. Next, we will write a Samza consumer using the Samza API to process these messages from the Kafka topic. Go to hello-samza/samza-wikipedia/src/main/java/samza/examples/wikipedia/task and write the YarnEssentialsSamzaConsumer.java file as follows:
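The consumer code itself appears as a screenshot in the book; as a rough guide, a minimal consumer built on Samza's StreamTask interface might look like the following sketch (the class name matches the file mentioned above, while the printed text is illustrative):

package samza.examples.wikipedia.task;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class YarnEssentialsSamzaConsumer implements StreamTask {
    // Called by Samza for every message received from the configured input stream
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        System.out.println("Received message @ " + System.currentTimeMillis()
                + " : " + envelope.getMessage());
    }
}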

3. After writing the Samza consumer class in the hello-samza project, you will need to build the project:

mvn clean package

4. Create a samza directory inside the deploy directory:

mkdir -p deploy/samza

5. Finally, extract the Samza job package into the deploy directory:

tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza

6. The Samza consumer properties live in /home/cloud/hello-samza/deploy/samza/config.

7. Write a samza-test-consumer.properties file as follows (a sketch of such a file is shown after this list):

This properties file will mainly contain the following information:

- job.name: This is the name of the Samza job
- yarn.package.path: This is the path of the Samza job package
- task.class: This is the class of the actual Samza consumer
- task.inputs: This is the Kafka topic that the published messages will be read from
- systems.kafka.consumer.zookeeper.connect: This is the ZooKeeper connection information
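The screenshot of the properties file is not reproduced here; assembling the keys listed above, a sketch of samza-test-consumer.properties might look like the following (the job name, package path, and ZooKeeper address are placeholders, and a real Samza-on-YARN job additionally needs the job factory and Kafka system factory entries shown):

job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=yarn-essentials-consumer
yarn.package.path=file:/home/cloud/hello-samza/deploy/samza/samza-job-package-0.7.0-dist.tar.gz
task.class=samza.examples.wikipedia.task.YarnEssentialsSamzaConsumer
task.inputs=kafka.storm-sentence
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181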

Starting a grid

A Samza grid usually comprises three different systems: YARN, Kafka, and ZooKeeper. The hello-samza project comes with a script called grid to help you set up these systems. Start by running the following command:

bin/grid bootstrap

This command will download, install, and start ZooKeeper, Kafka, and YARN. It will also check out the latest version of Samza and build it. All the package files will be put in a subdirectory called deploy inside the hello-samza project's root folder. The result of the preceding command is shown here:

The following screenshot shows that ZooKeeper, YARN, and Kafka are being started:

Once all the processes are up and running, you can check the processes, as shown in this screenshot:

The YARN ResourceManager web UI will look like this:

The YARN NodeManager web UI will look like this:

Since we started the grid, let's now deploy the Samza job to it:

deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:/home/cloud/hello-samza/deploy/samza/config/samza-test-consumer.properties

Check the application processes and the ResourceManager UI. As you can see in the following screenshot, running the Samza job first creates a Samza AppMaster and then a Samza container to run the consumer that we wrote:

The ResourceManager web UI now shows the Samza application up and running:

The ApplicationMaster UI looks as follows:

The following screenshot shows the ApplicationMaster UI interface:

Since our Samza consumer is now up and running and listening for messages on the Kafka topic (named storm-sentence), let's publish some messages to the Kafka topic using the Kafka producer we wrote initially. The following Java command is used to invoke the Kafka producer, which takes two command-line arguments:

- N: This is the number of times the message is published into Kafka
- {pathOfFileNameHavingMessage}: This is the path of the file containing the actual string message

Create any file containing a string message (for example, strmsg10K.txt) and pass its path as the second command-line argument to the Java command, as shown in the following screenshot:
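The screenshot is not reproduced here; the invocation looks roughly like the following, where the classpath placeholder stands for your compiled producer class plus the Kafka client JARs, the first argument is the number of messages, and the second is the message file:

java -cp <producer-classes-and-kafka-jars> KafkaStringProducerService 100000 /home/cloud/strmsg10K.txt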

As soon as these messages are published to the Kafka topic, the Samza consumer consumes them and prints the timestamp, as written in the Samza consumer code.

The result, after checking the Samza consumer logs, is as follows:

Storm-YARN

Apache Storm is an open source distributed real-time computation system from Twitter.

Storm helps in processing unbounded streams of data in a reliable manner. Storm can be used with any programming language. Some of the most common use cases of Storm are real-time analytics, real-time machine learning, continuous computation, ETL, and many more.

Storm-YARN is a project from Yahoo! that enables a Storm cluster to be deployed and managed by YARN. Earlier, separate clusters were needed for Hadoop and Storm.

One major benefit that comes with this integration is elasticity. Batch processing (Hadoop MapReduce) is usually done on an as-needed basis, whereas real-time processing (Storm) is an ongoing process. When the Hadoop cluster is idle, you can leverage it for real-time processing work.

In a typical real-time processing use case, constant and predictable loads are very rare. Storm, therefore, will need more resources during peak time when the load is greater. At peak time, Storm can steal resources from the batch jobs and give them back when the load is lower.

This way, the overall resource utilization can scale up and down depending on the load and demand. This elasticity is, therefore, useful for utilizing the available resources on the basis of demand between real-time and batch processing.

Another benefit is that this integration reduces the physical distance of data transfers between Storm and Hadoop. Many applications use both Storm and Hadoop on separate clusters while sharing data between them (MapReduce). For such a scenario, Storm-YARN reduces network transfers, and in turn the total cost of acquiring the data, as both share the same cluster, as shown in the following image:

Referring to the preceding diagram, Storm-YARN asks YARN's ResourceManager to launch a Storm ApplicationMaster. The Storm ApplicationMaster then launches a Storm Nimbus server and a Storm UI server locally. It also uses YARN to allocate resources for the supervisors and finally launches them.

We will now install Storm-YARN on a Hadoop YARN cluster and deploy some Storm topologies to the cluster.

Prerequisites

The following are the prerequisites for Storm-YARN.

Hadoop YARN should be installed

Refer to the Hadoop YARN installation at http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/SingleCluster.html.

The master Thrift service of Storm-on-YARN uses port 9000, and if Storm-YARN is launched from the NameNode machine, there will be a port conflict.

In this case, you will need to change the port of the NameNode in your Hadoop installation. Typically, the following processes should be up and running in Hadoop:

Apache ZooKeeper should be installed

At the time of writing this book, the Storm-on-YARN ApplicationMaster implementation does not include running ZooKeeper on YARN. Therefore, it is presumed that there is a ZooKeeper cluster already running to enable communication between Nimbus and the workers.

There is an open issue tracking this at https://github.com/yahoo/storm-yarn/issues/22.

Installing ZooKeeper is very straightforward and easy.

Refer to http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html.

Setting up Storm-YARN

Storm-YARN is basically an implementation of the YARN client and ApplicationMaster for Storm.

The client gets a new application ID for Storm and submits the application, and the ApplicationMaster sets up the Storm components (Nimbus, Supervisors, and so on) on YARN using the containers that the ApplicationMaster requests from the ResourceManager.

Note that Storm-on-YARN is not a new implementation of Storm that works on YARN. Frameworks (that is, Samza, Storm, Spark, Tez, and so on) themselves do not need to be modified to be able to run on YARN. Only the ApplicationMaster and the YARN client code need to be written for each of the frameworks so that they run on YARN as an application just like any other. Now, proceed with the following steps:

1. Clone the Storm-YARN repository from Git:

cd storm-on-yarn-poc/
git clone https://github.com/yahoo/storm-yarn.git
cd storm-yarn

The Storm client machine refers to the machine that will submit the YARN client and ApplicationMaster to the ResourceManager.

As of now, there is a single release of Storm-on-YARN from Yahoo! that contains both Storm-YARN and a Storm release (0.9.0-wip21). The Storm release is present in the lib directory of the extracted Storm-on-YARN release.

2. Build Storm-YARN using Maven:

mvn package
or
mvn package -DskipTests

3. We will get the following output:

[INFO] Scanning for projects...
[INFO]
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building storm-yarn 1.0-alpha
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] Compiling 5 source files to /home/nirmal/storm-on-yarn-poc/storm-yarn-master/target/test-classes
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default) @ storm-yarn ---
[INFO]
[INFO] --- maven-surefire-plugin:2.10:test (default-test) @ storm-yarn ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ storm-yarn ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.153 s
[INFO] Finished at: 2014-11-12T15:57:49+05:30
[INFO] Final Memory: 10M/118M
[INFO] ------------------------------------------------------------------------
[INFO] Final Memory: 14M/152M
[INFO] ----------------------------------------------------

4. Next, you will need to copy the storm.zip file from storm-yarn/lib to HDFS. This is because Storm-on-YARN will deploy a copy of the Storm code throughout all the nodes of the YARN cluster using HDFS. However, the location from which to fetch this copy of the Storm code is hardcoded into the Storm-on-YARN client. Copy the storm.zip file to HDFS using the following commands:

hdfs dfs -mkdir -p /lib/storm/0.9.0-wip21

Alternatively, you can also use the following command:

hadoop fs -mkdir -p /lib/storm/0.9.0-wip21

hdfs dfs -put /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /lib/storm/0.9.0-wip21/storm.zip

You can also use the following command:

hadoop fs -put /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /lib/storm/0.9.0-wip21/storm.zip

The exact version of Storm might differ, in your case, from 0.9.0-wip21.

5. Create a directory to hold our Storm configuration:

mkdir -p /home/nirmal/storm-on-yarn-poc/storm-data/
cp /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /home/nirmal/storm-on-yarn-poc/storm-data/
cd /home/nirmal/storm-on-yarn-poc/storm-data
unzip storm.zip

6. Add the following configuration to the storm.yaml file located at /home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/conf. You can change the following values as per your setup:

storm.zookeeper.servers:
    - localhost
nimbus.host: localhost
master.initial-num-supervisors: 2
master.container.size-mb: 1024

7. Add the storm-yarn/bin folder to your path variable:

export PATH=$PATH:/home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/bin:/home/nirmal/storm-on-yarn-poc/storm-yarn-master/bin

8. Finally, launch Storm-YARN using the following command:

storm-yarn launch /home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/conf/storm.yaml

Launching Storm-YARN executes the Storm-YARN client, which gets an appID from YARN's ResourceManager and starts running the Storm-YARN ApplicationMaster. The ApplicationMaster then starts the Nimbus, worker, and supervisor services. You will get an output similar to the one shown in the following screenshot:

9. We can retrieve the status of our application using the following YARN command:

yarn application -list

We will get the status of our application as follows:

10. You can also see Storm-YARN running on the ResourceManager web UI at http://localhost:8088/cluster/:

11. Nimbus should also be running now, and you should be able to see it through the Nimbus web UI at http://localhost:7070/. This looks as follows:

12. The following processes should be up and running:

Getting the storm.yaml configuration of the launched Storm cluster

The machine that will use the Storm client command to submit a new topology to Storm needs the storm.yaml configuration file of the launched Storm cluster on YARN to be stored in /home/nirmal/.storm/storm.yaml.

Normally, when Storm is not run on YARN, this configuration file is edited manually, so you would know the IP addresses of the Storm components. However, since the location where the Storm components run on YARN depends on the location of the allocated containers, Storm-on-YARN is responsible for generating storm.yaml for us. You can fetch this storm.yaml file from the running Storm-on-YARN application (check the appId on the YARN application UI at port 8088):

$ cd
$ mkdir .storm/
$ storm-yarn getStormConfig -appId <appId> -output /home/nirmal/.storm/storm.yaml

Building and running Storm-Starter examples

In this section, we will see how to get the example code from GitHub, build it using Maven, and finally, run the examples. To perform these tasks, you'll have to execute the following steps:

1. Get the code from GitHub. We will use the storm-starter project from GitHub:

git clone https://github.com/nathanmarz/storm-starter
Cloning into 'storm-starter'...
remote: Counting objects: 756, done.
remote: Total 756 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (756/756), 171.81 KiB | 56.00 KiB/s, done.
Resolving deltas: 100% (274/274), done.
Checking connectivity... done

2. Next, go to the downloaded storm-starter directory:

cd storm-starter/

3. Check the contents using the following command:

ls -ltr
-rw-r--r-- 1 nirmal nirmal  171 Nov 12 12:58 README.markdown
-rw-r--r-- 1 nirmal nirmal 5047 Nov 12 12:58 m2-pom.xml
drwxr-xr-x 3 nirmal nirmal 4096 Nov 12 12:58 multilang
-rw-r--r-- 1 nirmal nirmal  580 Nov 12 12:58 LICENSE
drwxr-xr-x 4 nirmal nirmal 4096 Nov 12 12:58 src
-rw-r--r-- 1 nirmal nirmal  929 Nov 12 12:58 project.clj
drwxr-xr-x 3 nirmal nirmal 4096 Nov 12 12:58 test
-rw-r--r-- 1 nirmal nirmal 8042 Nov 12 12:58 storm-starter.iml

4. Build the storm-starter project using Maven:

mvn -f m2-pom.xml package
or
mvn -f m2-pom.xml package -DskipTests

5. You will see an output similar to the following:

[INFO] Scanning for projects...
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building storm-starter 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] META-INF/MANIFEST.MF already added, skipping
[INFO] META-INF/ already added, skipping
[INFO] META-INF/maven/ already added, skipping
[INFO] Building jar: /home/nirmal/storm-on-yarn-poc/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar
[INFO] META-INF/MANIFEST.MF already added, skipping
[INFO] META-INF/ already added, skipping
[INFO] META-INF/maven/ already added, skipping
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:21 min
[INFO] Finished at: 2014-11-12T13:05:40+05:30
[INFO] Final Memory: 30M/191M
[INFO] ------------------------------------------------------------------------

6. After the build is successful, you will see the following JAR file being created under the target directory:

storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar

7. Run the Storm topology example on the Storm-YARN cluster:

storm jar storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology word-count-topology

The output can be seen in the following screenshot:

8. Click on the topology, as shown in the following screenshot:

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing. It was originally developed in 2009 in UC Berkeley's AMPLab and open sourced in 2010.

The main features of Spark are as follows:

- Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory and up to 10x faster even when running on disk.
- Ease of use: Spark lets you quickly write applications in Java, Scala, or Python. You can use it interactively to query big datasets from the Scala and Python shells.
- Runs everywhere: Spark runs on Hadoop, on Mesos, in standalone mode, or in the cloud. You can run Spark readily using its standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos, and it can access diverse data sources, including HDFS, HBase, Cassandra, S3, and any other Hadoop data source.
- Generality: Spark powers a stack of high-level tools, including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.

Why run on YARN?

YARN enables Spark to run in a single cluster alongside other frameworks, such as Tez, Storm, HBase, and others. This avoids the need to create and manage separate, dedicated Spark clusters.

Typically, customers want to run multiple workloads on a single dataset in a single cluster. YARN, as a generic resource manager and single data platform for all the different frameworks/engines, makes this happen.

YARN's built-in multitenancy support allows dynamic and optimal sharing of the same cluster resources between the different frameworks that run on YARN.

YARN has pluggable schedulers to categorize, isolate, and prioritize workloads.
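As an illustration of how little changes for the user, a Spark application is submitted to a YARN cluster with the spark-submit script that ships with Spark; in the Spark 1.x releases current at the time of writing, the YARN cluster mode is selected as follows (the JAR path and resource sizes are placeholders):

spark-submit \
  --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 4 \
  --executor-memory 1g \
  /path/to/spark-examples.jar 100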

Apache Tez

Apache Tez is part of the Stinger initiative led by Hortonworks to make Hive enterprise-ready and suitable for interactive SQL queries. The Tez design is based on research done by Microsoft on parallel and distributed computing.

Tez entered the Apache Incubator in February 2013 and graduated to a top-level project in July 2014.

Tez is basically an embeddable and extensible framework for building high-performance batch and interactive data-processing applications that need to integrate easily with YARN.

Confusion often arises when Tez is thought of as an engine. Tez is not a general-purpose engine, but more of a framework that lets tools express their purpose-built needs. Tez, for example, enables Hive, Pig, and others to build their own purpose-built engines and embed them in those technologies. Projects such as Hive, Pig, and Cascading now show significant improvements in response times when they use Tez instead of MapReduce.

Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez exists to address some of the limitations of MapReduce. For example, in a typical MapReduce job, a lot of temporary data is stored (such as each mapper's output, which is written to disk), which is an overhead. In the case of Tez, this disk I/O for temporary data is avoided, resulting in higher performance compared to the MapReduce model.

Also, Tez can adjust the parallelism of reduce tasks at runtime, depending on the actual data size coming out of the previous task. In MapReduce, on the other hand, the number of reducers is static and has to be decided by the user before the job is submitted to the cluster.

The processing done by multiple MapReduce jobs can now be done by a single Tez job, as follows:

Referring to the preceding diagram: earlier (with Pig/Hive), we used to need multiple MapReduce jobs to do some processing. Now, in Tez, a single job does the same work; that is, the reducers (the green boxes) of the previous step feed the mappers (the blue boxes) of the next step directly.

The preceding image is taken from http://www.infoq.com/articles/apache-tez-saha-murthy.

Tez is not meant directly for end users; in fact, it enables developers to build end-user applications with much better performance and flexibility. Traditionally, Hadoop has been a batch-processing platform for processing large amounts of data. However, there are a lot of use cases for near-real-time query processing. There are also several workloads, such as machine learning, that do not fit into the MapReduce paradigm. Tez helps Hadoop address these use cases.

Tez provides an expressive dataflow-definition API that lets developers create their own unique data-processing graphs (DAGs) to represent their applications' data-processing flows. Once the developer defines a flow, Tez then provides additional APIs to inject the custom business logic that will run in that flow. These APIs combine inputs (that read data), outputs (that write data), and processors (that process data) to process the flow.

Tez can also run any existing MR job without any modification. For more information on Tez, refer to http://tez.apache.org/.

Apache Giraph

Apache Giraph is a graph-processing system that uses the MapReduce model to process graphs. Currently, it is in incubation at the Apache Software Foundation.

It is based on Google's Pregel, which is used to calculate PageRank.

Currently, Giraph is being used by Facebook, Twitter, and LinkedIn to create social graphs of their users. Both Giraph and Pregel are based on the Bulk Synchronous Parallel (BSP) model of distributed computation, which was introduced by Leslie Valiant.

Support for YARN is available from release 1.1.0. For more information, refer to the official site at http://giraph.apache.org/.

HOYA (HBase on YARN)

Hoya is basically HBase running on YARN. It is currently hosted on GitHub, but there are plans to move it to the Apache Foundation.

Hoya creates HBase clusters on top of YARN. It does this with a client application called the Hoya client; this application creates the persistent configuration files, sets up the HBase cluster XML files, and then asks YARN to create an ApplicationMaster, which in this case is the Hoya AM.

For more information, refer to https://github.com/hortonworks/hoya, http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/, and http://hortonworks.com/blog/hoya-hbase-on-yarn-application-architecture/.

KOYA (Kafka on YARN)

On November 5, 2014, DataTorrent, a company founded by ex-Yahoo! engineers, announced a new project to bring the fault-tolerant, high-performance, scalable Apache Kafka messaging system to YARN.

The Kafka on YARN (KOYA) project plans to leverage YARN for Kafka broker management, automatic broker recovery, and more. Planned features include a fully-HA ApplicationMaster, sticky allocation of containers (so that a restart can access local data), a web interface for Kafka, and more.

The expected release to the open source community is somewhere in Q2 2015.

More information is available at https://www.datatorrent.com/introducing-koya-apache-kafka-on-apache-hadoop-2-0-yarn/.

Summary

This chapter talked about the different frameworks and programming models that can be run on YARN. We discussed Apache Samza and Storm on YARN in detail.

With the wide acceptance of YARN in the industry, more and more frameworks will support YARN, taking complete advantage of YARN's generic features.

We looked at the existing frameworks that are integrated with YARN at the moment.

There is a lot more work going on in the industry to make existing and new applications run on YARN.

In Chapter 8, Failures in YARN, we will discuss how faults and failures at various levels are handled in YARN.

Chapter 8. Failures in YARN

Dealing with failures in distributed systems is comparatively challenging and time consuming. Also, the Hadoop and YARN frameworks run on commodity hardware, and cluster sizes nowadays can vary from several nodes to several thousand nodes, so handling failure scenarios and dealing with ever-growing scaling issues is very important. In this chapter, we will focus on failures in the YARN framework: the causes of failures and how to overcome them.

In this chapter, we will cover the following topics:

- ResourceManager failures
- ApplicationMaster failures
- NodeManager failures
- Container failures
- Hardware failures

We will be dealing with the root causes of these failures and the solutions to them.

ResourceManager failures

In the initial versions of the YARN framework, a ResourceManager failure meant a total cluster failure, as it was a single point of failure. The ResourceManager stores the state of the cluster, such as the metadata of submitted applications, information on cluster resource containers, information on the cluster's general configuration, and so on. Therefore, if the ResourceManager went down because of a hardware failure, there was no way to avoid manually debugging the cluster and restarting the ResourceManager. While the ResourceManager was down, the cluster was unavailable, and once it was restarted, all jobs needed a restart, so half-completed jobs lost their work and had to be run again. In short, a restart of the ResourceManager used to restart all the running ApplicationMasters.

The latest versions of YARN address this problem in two ways. One way is by creating an active-passive ResourceManager architecture, so that when one goes down, another becomes active and takes responsibility for the cluster. The ResourceManager state can be seen in the following image:

Another way is by using a ZooKeeper-based ResourceManager quorum, so that the ResourceManager state is stored externally in ZooKeeper; one ResourceManager is in the active state, and one or more ResourceManagers are in passive mode, waiting for the event that brings them to the active state. The ResourceManager's state can be seen in the following image:

In the preceding diagram, you can see that the ResourceManager's state is managed by ZooKeeper. Whenever there is a failure, the ResourceManager's state is available to the passive ResourceManager(s), one of which changes to the active state and takes over responsibility for the cluster without any downtime.
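A sketch of the relevant yarn-site.xml settings for such a ZooKeeper-backed active/passive setup is shown below; the ResourceManager hostnames and the ZooKeeper quorum are placeholders for your own machines:

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>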

ApplicationMaster failures

Recovering the application's state after a restart caused by an ApplicationMaster failure is the responsibility of the ApplicationMaster itself. When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it for another application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the older ApplicationMaster, and this is possible only if the ApplicationMaster persists its state in an external location so that it can be used for future reference. An ApplicationMaster that does not do this simply runs the application again from scratch instead of recovering its state.

For example, an ApplicationMaster can recover its completed jobs. However, if jobs that were running during the ApplicationMaster's recovery time frame get halted for some reason, their state will be discarded and the ApplicationMaster will simply rerun them from scratch.

The YARN framework is capable of rerunning the ApplicationMaster a specified number of times and recovering the completed tasks.
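That retry limit is configurable. For example, the cluster-wide cap is set in yarn-site.xml, and MapReduce jobs additionally carry their own per-job limit; the values shown here are only illustrative:

<!-- yarn-site.xml: maximum ApplicationMaster attempts the ResourceManager allows -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value>
</property>

<!-- mapred-site.xml: per-job limit for the MapReduce ApplicationMaster -->
<property>
  <name>mapreduce.am.max-attempts</name>
  <value>2</value>
</property>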

NodeManager failures

Almost every node in the cluster runs a NodeManager service daemon. The NodeManager takes care of executing a certain part of a YARN job on each individual machine, while other parts are executed on other nodes. In a 1,000-node YARN cluster, there are probably around 999 NodeManagers running. NodeManagers are thus per-node agents that take care of the individual nodes distributed across the cluster.

If a NodeManager fails, the ResourceManager detects this failure using a timeout (that is, it stops receiving heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers. It also kills all the containers running on that node and reports the failure to all running ApplicationMasters. The ApplicationMasters are then responsible for reacting to node failures, by redoing the work done by any containers that were running on that node during the fault.

If the fault causing the timeout is transient, the NodeManager will resynchronize with the ResourceManager. Along similar lines, if a new NodeManager joins the cluster, the ResourceManager notifies all ApplicationMasters about the availability of the new resources.

Container failures

Whenever a container finishes, the ApplicationMaster is informed of this event by the ResourceManager. The ApplicationMaster interprets success or failure from the container exit status it receives through the ResourceManager, and it is the ApplicationMaster that handles the failures of the job's containers.

It is the responsibility of the application framework to manage container failures, and the responsibility of the YARN framework is to provide the necessary information to the application framework. As part of the allocate API's response, the ResourceManager passes the status of finished containers to the ApplicationMaster. It is the responsibility of the ApplicationMaster to inspect the container's status, exit code, and diagnostic information and take appropriate action on it; for example, the MapReduce ApplicationMaster retries map and reduce tasks by requesting new containers, until the configured number of task failures for a single job is reached.

To handle container allocation, the ApplicationMaster keeps making the allocate call periodically; an individual AllocateResponse often returns no containers, but repeating the call ensures that all requested containers are eventually assigned. When a container does arrive, the framework is guaranteed to have sufficient resources for it, and the ApplicationMaster will never receive more containers than it asked for. Also, the ApplicationMaster can make separate container requests (ResourceRequests), typically one per second.

Hardware Failures

As the Hadoop and YARN frameworks use commodity hardware for cluster setups that scale from several nodes to several thousand nodes, all the components of Hadoop and YARN are designed on the assumption that hardware failures are very common. Therefore, these failures are handled automatically by the framework so that important data is not lost because of them. For this, Hadoop provides data replication across nodes/racks, so that even if a whole rack fails, data can be recovered from another node on another rack, and jobs can be restarted over another replica of the dataset to compute the results.

Summary

In this chapter, we discussed YARN failure scenarios and how these are addressed in the YARN framework. In the next chapter, we will be focusing on alternative solutions to the YARN framework. We will also see a brief overview of the most common frameworks that are closely related to YARN.

Chapter 9. YARN – Alternative Solutions

During the development of YARN, many other organizations simultaneously identified the limitations of Hadoop 1.x and were actively involved in developing alternative solutions.

This chapter will briefly talk about such alternative solutions and compare them to YARN. Among the most common frameworks that are closely related to YARN are:

- Mesos
- Omega
- Corona

Mesos

Mesos was originally developed at the University of California at Berkeley and later became open source under the Apache Software Foundation.

Mesos can be thought of as a highly-available and fault-tolerant operating system kernel for your clusters. It is a cluster resource manager that provides efficient resource isolation and sharing across multiple diverse cluster-computing frameworks.

Mesos can be compared to YARN in some aspects, but a complete quantitative comparison is not really possible.

We will talk about the architecture of Mesos and compare some of the architectural differences with respect to YARN. This way, we will have a high-level understanding of the main differences between the two frameworks.

The preceding figure shows the main components of Mesos. It basically consists of a master process that manages slave processes running on each cluster node, and Mesos applications (also called frameworks) that run tasks on these slaves.

For more information, please refer to the official site at http://mesos.apache.org/.

Here are the high-level differences between Mesos and YARN:

Mesos: Mesos uses Linux container groups (http://lxc.sourceforge.net); Linux container groups provide stronger isolation but may have some additional overhead.
YARN: YARN uses simple Unix processes.

Mesos: Mesos is primarily written in C++.
YARN: YARN is primarily written in Java with bits of native code.

Mesos: Mesos supports both memory and CPU scheduling.
YARN: Currently, YARN only supports memory scheduling (for example, you request x containers of y MB each), but there are plans to extend it to other resources, such as network and disk I/O resources.

Mesos: Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them.
YARN: YARN has a request-based approach. It allows the ApplicationMaster to ask for resources based on various criteria, including locations, and also allows the requester to modify future requests based on what was given and on the current usage.

Mesos: Mesos leverages a pool of central schedulers (for example, classic Hadoop or MPI).
YARN: YARN, on the other hand, has a per-job scheduler. Although YARN enables late binding of containers to tasks, where each individual job can perform local optimizations, the per-job ApplicationMaster might result in greater overhead than the Mesos approach.

OmegaOmegaisGoogle’snextgenerationclustermanagementsystem.

Omegaisspecificallyfocusedonaclusterschedulingarchitecturethatusesparallelism,sharedstate,andoptimisticconcurrencycontrol.

Fromthepastexperience,Googlenoticedthatastheclustersandtheirworkloadsincrease,theschedulerisatriskofbecomingascalabilitybottleneck.

Google’sproductionjobschedulerhasexperiencedallofthis.Overtheyears,ithasevolvedintoacomplicated,sophisticatedsystemthatishardtochange.

Aschematicoverviewoftheschedulingarchitecturescanbeseeninthefollowingfigure:


Google identified the following two prevalent scheduler architectures, shown in the preceding figure:

- Monolithic schedulers: This uses a single, centralized scheduling algorithm for all jobs (Google's existing scheduler is one of these). Monolithic schedulers do not make it easy to add new policies and specialized implementations, and may not scale up to the cluster sizes being planned for the future.
- Two-level schedulers: This has a single active resource manager that offers compute resources to multiple parallel, independent scheduler frameworks, as in Mesos and Hadoop On Demand (HOD). These architectures do appear to provide flexibility and parallelism, but in practice their conservative resource visibility and locking algorithms limit both, and make it hard to place difficult-to-schedule "picky" jobs or to make decisions that require access to the state of the entire cluster.

The solution is Omega: a new parallel scheduler architecture built around shared state, using lock-free optimistic concurrency control to achieve both implementation extensibility and performance scalability.
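
As a toy illustration only (this is not Omega's code), the shared-state, optimistic-concurrency idea can be sketched in a few lines of Java: each scheduler reads a snapshot of the shared cluster state, computes a placement, and commits it with a lock-free compare-and-set, retrying if another scheduler committed first:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class OptimisticCommitSketch {
  // Shared, immutable snapshot of "which task is placed on which node".
  static final AtomicReference<Map<String, String>> clusterState =
      new AtomicReference<Map<String, String>>(new HashMap<String, String>());

  // Called concurrently by many independent schedulers.
  static boolean tryPlace(String task, String node) {
    while (true) {
      Map<String, String> snapshot = clusterState.get();     // lock-free read
      if (snapshot.containsValue(node)) {
        return false;                                         // node already taken
      }
      Map<String, String> updated = new HashMap<String, String>(snapshot);
      updated.put(task, node);
      // Commit only if nobody changed the state since our read; otherwise retry.
      if (clusterState.compareAndSet(snapshot, updated)) {
        return true;
      }
    }
  }
}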

Omega's approach reflects a greater focus on scalability, but it makes it harder to enforce global properties such as capacity, fairness, and deadlines.

For more information, refer to http://research.google.com/pubs/pub41684.html.

Corona

Corona is another work from Facebook, which is now open-sourced and hosted in the GitHub repository at https://github.com/facebookarchive/hadoop-20/tree/master/src/contrib/corona.

Facebook, with its huge petabyte-scale quantity of data, suffered serious performance-related issues with the classic MapReduce framework because the single JobTracker took care of thousands of jobs and did a lot of work alone.

In order to solve these issues, Facebook created Corona, which separated cluster resource management from job coordination.

In Hadoop Corona, the cluster resources are tracked by a central ClusterManager. Each job gets its own CoronaJobTracker, which tracks just that particular job.

Corona entirely redesigned the MapReduce architecture to bring better cluster utilization and job scheduling, just as YARN did.

Facebook's goals in rewriting the Hadoop scheduling framework were not the same as YARN's. Facebook wanted quick improvements in MapReduce, but only in the parts it was using. It had no interest in running multiple heterogeneous frameworks, as YARN does, or in YARN's other key design considerations.

For Facebook, doing a quick rewrite of the scheduler seemed feasible and low risk, compared to going with YARN, getting features that were not needed, understanding it, fixing its problems, and then ending up with something that didn't address the primary goal of lowering latency.

The following are some of the key differences:

- Corona does push-based scheduling and has an event-driven, callback-oriented message flow. This was critical to achieving fast, low-latency scheduling. Polling is a big part of why the Hadoop scheduler is slow and has scalability issues. YARN does not do callback-based message flow.
- In Corona, the JobTracker can run in the same JVM as the job client (that is, Hive). Facebook had fat client machines with tons of RAM and CPU. To reduce latency, doing as much processing as possible on the client machine is preferred. In YARN, the JobTracker's equivalent (the per-job ApplicationMaster) has to be scheduled within the cluster. This means that there is one extra step between starting a query and getting it running.
- Corona is structured as a contrib project on the Hadoop 0.20 branch and is not a very large codebase. Corona is integrated with the fair scheduler; YARN is more interested in the capacity scheduler.

For more information on Corona, refer to https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920.

Summary

We talked about various works related to YARN that are available in the market today. These systems share common inspirations and requirements, and the high-level goal of improving scalability, latency, fault tolerance, and programming-model flexibility. The architectural differences between them stem from their diverse design priorities. In the next chapter, we will talk about YARN's future and support in the industry.

Chapter 10. YARN – Future and Support

YARN is the new, modern data operating system for Hadoop 2. It acts as a central orchestrator that supports mixed workloads and programming models, running multiple engines and multiple access patterns, such as batch processing, interactive, streaming, and real-time, in Hadoop 2.

In this chapter, we will talk about YARN's journey and its present and future in the big data industry.

What YARN means to the big data industry

It can be said that YARN is a boon to the big data industry. Without YARN, the entire big data industry would have been at serious risk. As the industry started working with big data, new and emerging varieties of problems came into the picture, and hence new frameworks.

YARN's support for running these new and emerging frameworks allows them to focus on solving the problems for which they were specifically meant, while YARN takes care of resource management and other necessary concerns (resource allocation, job scheduling, fault tolerance, and so on).

Had there been no YARN, these frameworks would have had to do all the resource management on their own. Many big data projects have failed in the past due to unrealistic expectations of immature technologies.

YARN is the enabler for porting mature, enterprise-class technologies directly onto Hadoop. Without YARN, the only way to use Hadoop was through MapReduce.

Journey – present and future

Around two years back, YARN was introduced with the Hadoop 0.23 release on November 11, 2011.

Since then, there has been no looking back, and there have been a number of releases.

Finally, on October 15, 2013, Apache Hadoop 2.2.0 became the GA (General Availability) release of Apache Hadoop 2.x.

In October 2013, Apache Hadoop YARN won the Best Paper award at ACM SoCC (Symposium on Cloud Computing) 2013.

Apache Hadoop 2.x, powered by YARN, is no doubt the best platform for all of the Hadoop ecosystem components, such as MapReduce, Apache Hive, Apache Pig, and so on, that use HDFS as the underlying data storage.

YARN has also been adopted by other open source frameworks such as Apache Giraph, Apache Tez, Apache Spark, Apache Flink, and many others.

Vendors such as HP, Microsoft, SAS, Teradata, SAP, Red Hat, and many more are moving towards YARN to run their existing products and services on Hadoop.

People willing to modify their applications can already use YARN directly, but there are many customers and vendors who don't want to modify their existing applications. For them, there is Apache Slider, another open source project from Hortonworks, which can deploy existing distributed applications without requiring them to be ported to YARN.

Apache Slider allows you to bridge existing always-on services and makes sure they work really well on top of YARN, without having to modify the application itself.

Slider facilitates many long-running services and applications, such as Apache Storm, Apache HBase, Apache Accumulo, and so on, running on YARN.

This initiative will definitely expand the spectrum of applications and use cases that can actually be run on Hadoop and YARN in the future.

Present on-going features

Now, let's discuss the present on-going work in YARN.

Long Running Applications on Secure Clusters (YARN-896)

Support long-lived applications and long-lived containers. Refer to https://issues.apache.org/jira/browse/YARN-896.

Application Timeline Server (YARN-321, YARN-1530)

Currently, we have a JobHistoryServer for MapReduce history. The MapReduce job history server needs to be deployed as a trusted server in sync with the MapReduce runtime. Every new application would need a similar application history server. Having to deploy O(T*V) trusted servers (where T is the number of application types and V is the number of application versions) is clearly not scalable.

This JIRA is to create only one trusted application history server, which can have a generic UI. Refer to the following links for more information:

https://issues.apache.org/jira/browse/YARN-321
https://issues.apache.org/jira/browse/YARN-1530
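
As a rough sketch of how generic such a server is, the following uses the TimelineClient API that the Application Timeline Server work eventually introduced; the entity type MY_APP_ATTEMPT, the entity ID attempt_0001, and the event type ATTEMPT_STARTED are hypothetical, application-defined names rather than anything fixed by YARN:

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEvent;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelinePublishSketch {
  public static void main(String[] args) throws Exception {
    // Assumes yarn.timeline-service.enabled is true and the timeline server is reachable.
    YarnConfiguration conf = new YarnConfiguration();

    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(conf);
    client.start();

    // The server stores whatever entity types, IDs, and events the application defines.
    TimelineEntity entity = new TimelineEntity();
    entity.setEntityType("MY_APP_ATTEMPT");          // hypothetical, application-defined
    entity.setEntityId("attempt_0001");              // hypothetical, application-defined
    entity.setStartTime(System.currentTimeMillis());

    TimelineEvent event = new TimelineEvent();
    event.setEventType("ATTEMPT_STARTED");           // hypothetical, application-defined
    event.setTimestamp(System.currentTimeMillis());
    entity.addEvent(event);

    client.putEntities(entity);
    client.stop();
  }
}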

Disk scheduling (YARN-2139)

Support for disk as a resource in YARN. YARN should consider disk as another resource for scheduling tasks on nodes, isolation at runtime, and spindle locality. Refer to https://issues.apache.org/jira/browse/YARN-2139.

Reservation-based scheduling (YARN-1051)

This extends the YARN ResourceManager to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, and workflows, and it helps with gang scheduling. Refer to https://issues.apache.org/jira/browse/YARN-1051.

Future features

Let's discuss the future works in YARN.

Container Resizing (YARN-1197)

The current YARN resource management logic assumes that the resources allocated to a container are fixed during its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container with the expected size. Allowing runtime changes to the resources of an allocated container will give better control of resource usage on the application side. Refer to https://issues.apache.org/jira/browse/YARN-1197.
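
Until that happens, the release-and-reallocate workaround described above is the only option. The following is a minimal sketch of that workaround using the AMRMClient API; growContainer is a hypothetical helper name, and the application itself must recreate any state that lived in the released container:

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResizeWorkaroundSketch {
  // Release an existing container and ask for a bigger replacement.
  static void growContainer(AMRMClient<ContainerRequest> amRmClient,
                            Container current, int newMemoryMb, int newVcores) {
    // 1. Give the old container back to the ResourceManager.
    amRmClient.releaseAssignedContainer(current.getId());

    // 2. Ask for a replacement with the larger capability (any node, any rack).
    Resource bigger = Resource.newInstance(newMemoryMb, newVcores);
    amRmClient.addContainerRequest(
        new ContainerRequest(bigger, null, null, Priority.newInstance(0)));

    // 3. The replacement arrives on a later allocate() heartbeat; work done in the
    //    old container has to be checkpointed or redone by the application.
  }
}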

Admin labels (YARN-796)

Support for admins to specify labels for nodes. Examples of labels are OS, processor architecture, and so on. Refer to https://issues.apache.org/jira/browse/YARN-796.

Container Delegation (YARN-1488)

Allow containers to delegate resources to another container. This would allow external frameworks to share not just YARN's resource-management capabilities, but also its workload-management capabilities. Refer to https://issues.apache.org/jira/browse/YARN-1488.

This also shows that YARN is not only focused on the Apache Hadoop ecosystem components, but also on existing external non-Hadoop products and services that want to use Hadoop.

Also, work is going on to bring together the worlds of data and PaaS by using Docker, Google Kubernetes, and Red Hat OpenShift on YARN, so that common resource management can be done across data and PaaS workloads.

YARN-supported frameworks

The following is the current list of frameworks that run on top of YARN, and this list will keep getting longer in the future:

- Apache Hadoop MapReduce and its ecosystem components
- Apache HAMA
- Open MPI
- Apache S4
- Apache Spark
- Apache Tez
- Impala
- Storm
- HOYA (HBase on YARN)
- Apache Samza
- Apache Giraph
- Apache Accumulo
- Apache Flink
- KOYA (Kafka on YARN)
- Solr

Summary

In this chapter, we briefly talked about YARN's journey since its inception. YARN has completely changed Hadoop from the way it was in the Hadoop 1.x version. YARN is now a first-class resource management framework that supports mixed workloads and processing frameworks.

From what can be seen and predicted, YARN is surely a hit in the big data industry and has many more new and promising features to come. Currently, YARN handles memory and CPU, and it will coordinate additional resources such as disk and network I/O in the future.

Index

A

Access Control List (ACL)
  about / NodeManager (NM), The capacity scheduler

administrative tools
  about / Administrative tools
  commands / Administrative tools
  generic options, supporting / Administrative tools / Administrative tools

anagrams / Practical examples of MRv1 and MRv2

Apache Giraph
  about / Apache Giraph
  URL / Apache Giraph

Apache Hadoop 2.2.0
  about / Journey – present and future

Apache Samza
  about / Apache Samza
  Kafka / Apache Samza
  Apache YARN / Apache Samza
  ZooKeeper / Apache Samza
  Kafka producer, writing / Writing a Kafka producer
  hello-samza project, writing / Writing the hello-samza project

Apache Samza, layers
  processing layer / Apache Samza
  streaming layer / Apache Samza
  execution layer / Apache Samza

Apache Slider
  about / Journey – present and future

Apache Software Foundation
  about / Mesos

Apache Spark
  about / Apache Spark
  features / Apache Spark
  running, on YARN / Why run on YARN?

Apache Tez
  about / Apache Tez
  URL / Apache Tez

Application Context (AppContext) / The MapReduce ApplicationMaster

ApplicationMaster
  about / The MapReduce ApplicationMaster
  ApplicationMaster (AM) / ApplicationMaster (AM)
  restarting / The MapReduce ApplicationMaster
  writing / Writing the YARN ApplicationMaster
  responsibilities / Responsibilities of the ApplicationMaster
  failures / ApplicationMaster failures

ApplicationMasterLauncher service
  about / ResourceManager

ApplicationMasterService
  about / ResourceManager

ApplicationsManager
  about / ResourceManager

B

backward compatibility, MRv2 APIs
  about / Backward compatibility of MRv2 APIs
  binary compatibility, of org.apache.hadoop.mapred APIs / Binary compatibility of org.apache.hadoop.mapred APIs
  source compatibility, of org.apache.hadoop.mapred APIs / Source compatibility of org.apache.hadoop.mapred APIs

Bulk Synchronous Parallel (BSP)
  about / Apache Giraph

C

capacity scheduler
  about / The capacity scheduler, The capacity scheduler
  benefits / The capacity scheduler
  features / The capacity scheduler
  configurations / Capacity scheduler configurations

cluster scheduling architecture
  about / Omega

configuration parameters
  about / The fully-distributed mode

container
  failures / Container failures

container allocation
  about / Container allocation
  to application / Container allocation to the application

container configurations
  about / Container configurations
  parameters / Container configurations

ContainerExecutor
  about / NodeManager (NM)

ContainerManager
  about / NodeManager (NM)

Context Objects / Old and new MapReduce APIs

Corona
  about / Corona
  and Facebook, differences / Corona
  URL / Corona

D

data-processing graphs (DAGs)
  about / Apache Tez

DataNodes (DN) / The fully-distributed mode
  configuring / The fully-distributed mode

Docker
  about / Future features

E

EcoSystem
  web interfaces / Web interfaces of the Ecosystem

F

Facebook
  about / Corona
  and Corona, differences / Corona

Fair scheduler / The fair scheduler
  about / The fair scheduler
  configurations / Fair scheduler configurations

FIFO scheduler / The FIFO (First In First Out) scheduler
  about / The FIFO (First In First Out) scheduler
  configurations / The FIFO (First In First Out) scheduler

fully-distributed mode
  about / The fully-distributed mode
  HistoryServer / HistoryServer
  slave files / Slave files

G

Google Kubernetes
  about / Future features

grid
  starting / Starting a grid

H

Hadoop
  URL / Software
  YARN, using in / Understanding where YARN fits into Hadoop

Hadoop 0.23
  about / Journey – present and future

Hadoop 1.x
  about / A short introduction to Hadoop 1.x and MRv1
  components / A short introduction to Hadoop 1.x and MRv1

Hadoop 2 release
  about / The Hadoop 2 release

Hadoop and YARN cluster
  operating / Operating Hadoop and YARN clusters
  starting / Starting Hadoop and YARN clusters
  stopping / Stopping Hadoop and YARN clusters

Hadoop cluster
  HDFS / A short introduction to Hadoop 1.x and MRv1
  MapReduce / A short introduction to Hadoop 1.x and MRv1

Hadoop On Demand (HOD) / Omega

hello-samza project
  writing / Writing the hello-samza project
  properties / Writing the hello-samza project
  grid, starting / Starting a grid

HistoryServer / HistoryServer

HOYA (HBase on YARN)
  about / HOYA (HBase on YARN)
  URL / HOYA (HBase on YARN)

K

Kafka producer
  writing / Writing a Kafka producer

KOYA (Kafka on YARN)
  about / KOYA (Kafka on YARN)
  URL / KOYA (Kafka on YARN)

M

MapReduce, YARN
  about / YARN's MapReduce support
  ApplicationMaster / The MapReduce ApplicationMaster
  settings, example / Example YARN MapReduce settings
  YARN applications, developing / Developing YARN applications

MapReduce applications
  YARN, compatible with / YARN's compatibility with MapReduce applications

MapReduce job configurations / MapReduce job configurations
  properties / MapReduce job configurations

MapReduce JobHistoryServer
  settings / HistoryServer

MapReduce project
  End-user MapReduce API / MRv1 versus MRv2
  MapReduce framework / MRv1 versus MRv2
  MapReduce system / MRv1 versus MRv2

Mesos
  about / Mesos
  and YARN, difference between / Mesos
  URL / Mesos

modern operating system, of Hadoop
  YARN, used as / YARN as the modern operating system of Hadoop

monolithic schedulers / Omega

MRv1
  about / A short introduction to Hadoop 1.x and MRv1
  versus MRv2 / MRv1 versus MRv2
  examples / Practical examples of MRv1 and MRv2, Running the job

MRv2
  versus MRv1 / MRv1 versus MRv2
  examples / Practical examples of MRv1 and MRv2, Preparing the input file(s)

N

NameNode (NN) / The fully-distributed mode
  configuring / The fully-distributed mode

new MapReduce API
  about / Old and new MapReduce APIs
  versus old MapReduce API / Old and new MapReduce APIs

NodeHealthCheckerService
  about / NodeManager (NM)

NodeManager (NM) / NodeManager (NM)
  configuring / The fully-distributed mode
  parameters / The fully-distributed mode

NodeManagers (NM) / The fully-distributed mode

NodeStatusUpdater
  about / NodeManager (NM)

O

old MapReduce API
  about / Old and new MapReduce APIs
  versus new MapReduce API / Old and new MapReduce APIs

Omega
  about / Omega

P

Pi example
  running / Running a sample Pi example

prerequisites, single-node installation
  platform / Platform
  softwares / Software

prerequisites, Storm-YARN
  Hadoop YARN, installing / Hadoop YARN should be installed
  Apache ZooKeeper, installing / Apache ZooKeeper should be installed

program names
  aggregatewordcount, aggregatewordhist, bbp, dbcount, distbbp, grep, join, multifilewc, pentomino, pi, randomtextwriter, randomwriter, secondarysort, sort, sudoku, teragen, terasort, teravalidate, wordcount, wordmean, wordmedian, wordstandarddeviation / Running sample examples on YARN

pseudo-distributed mode / The pseudo-distributed mode

push-based scheduling / Corona

R

redesign idea
  about / The redesign idea
  MapReduce, limitations / Limitations of the classical MapReduce or Hadoop 1.x
  Hadoop 1.x, limitations / Limitations of the classical MapReduce or Hadoop 1.x

Red Hat OpenShift
  about / Future features

Red Hat Package Managers (RPMs) / The fully-distributed mode

ResourceManager / ResourceManager

ResourceManager (RM)
  scheduler / ResourceManager
  security / ResourceManager
  RM Restart Phase I / Recent developments in YARN architecture
  RM Restart Phase II / Recent developments in YARN architecture
  about / The fully-distributed mode
  configuring / The fully-distributed mode
  parameters / The fully-distributed mode
  failures / ResourceManager failures

ResourceManager (RM), components
  ApplicationManager / NodeManager (NM)
  Scheduler / NodeManager (NM)

S

scheduler architectures
  monolithic schedulers / Omega
  two-level schedulers / Omega

single-node installation
  about / Single-node installation
  prerequisites / Prerequisites
  starting / Starting with the installation
  standalone mode (local mode) / The standalone mode (local mode)
  pseudo-distributed mode / The pseudo-distributed mode

slave files / Slave files

standalone mode (local mode) / The standalone mode (local mode)

Storm-Starter examples
  building / Building and running Storm-Starter examples
  running / Building and running Storm-Starter examples

Storm-YARN
  about / Storm-YARN
  prerequisites / Prerequisites
  setting up / Setting up Storm-YARN
  storm.yaml configuration, obtaining / Getting the storm.yaml configuration of the launched Storm cluster
  Storm-Starter examples, building / Building and running Storm-Starter examples
  Storm-Starter examples, running / Building and running Storm-Starter examples

storm.yaml configuration
  obtaining / Getting the storm.yaml configuration of the launched Storm cluster

T

two-level schedulers / Omega

W

web GUI
  YARN applications, monitoring with / Monitoring YARN applications with web GUI

Y

YARN
  used, as modern operating system of Hadoop / YARN as the modern operating system of Hadoop
  design goals / What are the design goals for YARN
  used, in Hadoop / Understanding where YARN fits into Hadoop
  multitenancy application support / YARN multitenancy application support
  sample examples, running on / Running sample examples on YARN
  sample Pi example, running / Running a sample Pi example
  compatibility, with MapReduce applications / YARN's compatibility with MapReduce applications
  Apache Spark, running on / Why run on YARN?
  and Mesos, difference between / Mesos
  importance, to big data industry / What YARN means to the big data industry
  present / Journey – present and future
  future / Journey – present and future
  present on-going features / Present on-going features
  future features / Future features

YARN, features
  Long Running Applications on Secure Clusters (YARN-896) / Present on-going features
  Application Timeline Server (YARN-321, YARN-1530) / Present on-going features
  Disk scheduling (YARN-2139) / Present on-going features
  Reservation-based scheduling (YARN-1051) / Present on-going features
  Container Resizing (YARN-1197) / Future features
  Admin labels (YARN-796) / Future features
  Container Delegation (YARN-1488) / Future features

YARN-321
  URL / Present on-going features

YARN-796
  URL / Future features

YARN-896
  URL / Present on-going features

YARN-1197
  URL / Future features

YARN-1530
  URL / Present on-going features

YARN-2139
  URL / Present on-going features

YARN-supported frameworks
  about / YARN-supported frameworks

YARN administrations
  about / Administration of YARN
  configuration files / Administration of YARN
  administrative tools / Administrative tools
  nodes, adding from YARN cluster / Adding and removing nodes from a YARN cluster
  nodes, removing from YARN cluster / Adding and removing nodes from a YARN cluster
  YARN jobs, administrating / Administrating YARN jobs
  MapReduce job, configurations / MapReduce job configurations
  YARN log management / YARN log management
  YARN web user interface / YARN web user interface

YARN applications
  monitoring, with web GUI / Monitoring YARN applications with web GUI
  developing / Developing YARN applications
  ApplicationClientProtocol / Developing YARN applications
  ApplicationMasterProtocol / Developing YARN applications
  ContainerManagerProtocol / Developing YARN applications

YARN application workflow
  about / The YARN application workflow
  YARN client, writing / Writing the YARN client
  ApplicationMaster, writing / Writing the YARN ApplicationMaster

YARN architecture
  components / Core components of YARN architecture
  development / Recent developments in YARN architecture

YARN architecture, components
  ResourceManager / ResourceManager
  ApplicationMaster (AM) / ApplicationMaster (AM)
  NodeManager (NM) / NodeManager (NM)

YARN client
  writing / Writing the YARN client

YARN cluster
  nodes, adding from / Adding and removing nodes from a YARN cluster
  nodes, removing from / Adding and removing nodes from a YARN cluster

YARN jobs
  administrating / Administrating YARN jobs

YARN log management / YARN log management

YARN MapReduce settings
  example / Example YARN MapReduce settings
  properties / Example YARN MapReduce settings

YARN scheduler policies
  about / YARN scheduler policies
  FIFO scheduler / The FIFO (First In First Out) scheduler
  Fair scheduler / The fair scheduler
  capacity scheduler / The capacity scheduler

YARN scheduling policies
  about / YARN scheduling policies
  FIFO scheduler / The FIFO (First In First Out) scheduler
  capacity scheduler / The capacity scheduler
  Fair scheduler / The fair scheduler

YARN web user interface / YARN web user interface

Z

Zookeeper
  URL / Apache ZooKeeper should be installed
