
Page 1: YARN Essentials - storage.ey.md Related/PDFs and... · Operating Hadoop and YARN clusters Starting Hadoop and YARN clusters Stopping Hadoop and YARN clusters Web interfaces of the

YARN Essentials

Table of Contents

YARN Essentials

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Need for YARN

The redesign idea

Limitations of the classical MapReduce or Hadoop 1.x

YARN as the modern operating system of Hadoop

What are the design goals for YARN

Summary

2. YARN Architecture

Core components of YARN architecture

ResourceManager

ApplicationMaster (AM)

NodeManager (NM)

YARN scheduler policies

The FIFO (First In First Out) scheduler

The fair scheduler

The capacity scheduler

Recent developments in YARN architecture

Summary

3. YARN Installation

Single-node installation

Prerequisites

Platform

Software

Starting with the installation

The standalone mode (local mode)

The pseudo-distributed mode

The fully-distributed mode

HistoryServer

Slave files

Operating Hadoop and YARN clusters

Starting Hadoop and YARN clusters

Stopping Hadoop and YARN clusters

Web interfaces of the Ecosystem

Summary

4. YARN and Hadoop Ecosystems

The Hadoop 2 release

A short introduction to Hadoop 1.x and MRv1

MRv1 versus MRv2

Understanding where YARN fits into Hadoop

Old and new MapReduce APIs

Backward compatibility of MRv2 APIs

Binary compatibility of org.apache.hadoop.mapred APIs

Source compatibility of org.apache.hadoop.mapred APIs

Practical examples of MRv1 and MRv2

Preparing the input file(s)

Running the job

Result

Summary

5. YARN Administration

Container allocation

Container allocation to the application

Container configurations

YARN scheduling policies

The FIFO (First In First Out) scheduler

The capacity scheduler

Capacity scheduler configurations

The fair scheduler

Fair scheduler configurations

YARN multitenancy application support

Administration of YARN

Administrative tools

Adding and removing nodes from a YARN cluster

Administrating YARN jobs

MapReduce job configurations

YARN log management

YARN web user interface

Summary

6. Developing and Running a Simple YARN Application

Running sample examples on YARN

Running a sample Pi example

Monitoring YARN applications with web GUI

YARN's MapReduce support

The MapReduce ApplicationMaster

Example YARN MapReduce settings

YARN's compatibility with MapReduce applications

Developing YARN applications

The YARN application workflow

Writing the YARN client

Writing the YARN ApplicationMaster

Responsibilities of the ApplicationMaster

Summary

7. YARN Frameworks

Apache Samza

Writing a Kafka producer

Writing the hello-samza project

Starting a grid

Storm-YARN

Prerequisites

Hadoop YARN should be installed

Apache ZooKeeper should be installed

Setting up Storm-YARN

Getting the storm.yaml configuration of the launched Storm cluster

Building and running Storm-Starter examples

Apache Spark

Why run on YARN?

Apache Tez

Apache Giraph

HOYA (HBase on YARN)

KOYA (Kafka on YARN)

Summary

8. Failures in YARN

ResourceManager failures

ApplicationMaster failures

NodeManager failures

Container failures

Hardware Failures

Summary

9. YARN – Alternative Solutions

Mesos

Omega

Corona

Summary

10. YARN – Future and Support

What YARN means to the big data industry

Journey – present and future

Present on-going features

Future features

YARN-supported frameworks

Summary

Index


YARN Essentials


YARN Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78439-173-7

www.packtpub.com


Credits

Authors

Amol Fasale

Nirmal Kumar

Reviewers

Lakshmi Narasimhan

Swapnil Salunkhe

Jenny (Xiao) Zhang

Commissioning Editor

Taron Pereira

Acquisition Editor

James Jones

Content Development Editor

Arwa Manasawala

Technical Editor

Indrajit A. Das

Copy Editors

Karuna Narayanan

Laxmi Subramanian

Project Coordinator

Purav Motiwalla

Proofreaders

Safis Editing

Maria Gould

Indexer

Priya Sane

Graphics

Sheetal Aute

Valentina D'silva

Abhinash Sahu

Production Coordinator

Shantanu N. Zagade

Cover Work

Shantanu N. Zagade


About the Authors

Amol Fasale has more than 4 years of industry experience actively working in the fields of big data and distributed computing; he is also an active blogger in, and contributor to, the open source community. Amol works as a senior data system engineer at MakeMyTrip.com, a very well-known travel and hospitality portal in India, responsible for real-time personalization of the online user experience with Apache Kafka, Apache Storm, Apache Hadoop, and many more. Amol also has active hands-on experience in Java/J2EE, Spring Frameworks, Python, machine learning, Hadoop framework components, SQL, NoSQL, and graph databases.

You can follow Amol on Twitter at @amolfasale or on LinkedIn. Amol is very active on social media. You can catch him online for any technical assistance; he would be happy to help.

Amol completed his bachelor's degree in engineering (electronics and telecommunication) from Pune University and a postgraduate diploma in computers from CDAC.

The gift of love is one of the greatest blessings from parents, and I am heartily thankful to my mom, dad, friends, and colleagues who have shown and continue to show their support in different ways. Finally, I owe much to James and Arwa, without whose direction and understanding I would not have completed this work.

Nirmal Kumar is a lead software engineer at iLabs, the R&D team at Impetus Infotech Pvt. Ltd. He has more than 8 years of experience in open source technologies such as Java, JEE, Spring, Hibernate, web services, Hadoop, Hive, Flume, Sqoop, Kafka, Storm, NoSQL databases such as HBase and Cassandra, and MPP databases such as Teradata.

You can follow him on Twitter at @nirmal___kumar. He spends most of his time reading about and playing with different technologies. He has also undertaken many tech talks and training sessions on big data technologies.

He attained his master's degree in computer applications from Harcourt Butler Technological Institute (HBTI), Kanpur, India, and is currently part of the big data R&D team in iLabs at Impetus Infotech Pvt. Ltd.

I would like to thank my organization, especially iLabs, for supporting me in writing this book. Also, a special thanks to the Packt Publishing team; without you guys, this work would not have been possible.


About the Reviewers

Lakshmi Narasimhan is a full stack developer who has been working on big data and search since the early days of Lucene and was a part of the search team at Ask.com. He is a big advocate of open source and regularly contributes to and consults on various technologies, most notably Drupal and technologies related to big data. Lakshmi is currently working as the curriculum designer for his own training company, http://www.readybrains.com. He blogs occasionally about his technical endeavors at http://www.lakshminp.com and can be contacted via his Twitter handle, @lakshminp.

It's hard to find a ready reference or documentation for a subject like YARN. I'd like to thank the author for writing a book on YARN and hope the target audience finds it useful.

Swapnil Salunkhe is a passionate software developer who is keenly interested in learning and implementing new technologies. He has a passion for functional programming, machine learning, and working with data. He has experience working in the finance and telecom domains.

I'd like to thank Packt Publishing and its staff for the opportunity to contribute to this book.

Jenny (Xiao) Zhang is a technology professional in business analytics, KPIs, and big data. She helps businesses better manage, measure, report, and analyze data to answer critical business questions and drive business growth. She is an expert in SaaS business and has experience in a variety of industry domains such as telecom, oil and gas, and finance. She has written a number of blog posts at http://jennyxiaozhang.com on big data, Hadoop, and YARN. She also actively uses Twitter at @smallnaruto to share insights on big data and analytics.

I want to thank all my blog readers. It is their encouragement that motivates me to deep dive into the ocean of big data. I also want to thank my dad, Michael (Tiegang) Zhang, for providing technical insights in the process of reviewing the book. A special thanks to the Packt Publishing team for this great opportunity.


www.PacktPub.com


Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.


Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser


Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.


Preface

In a short span of time, YARN has attained a great deal of momentum and acceptance in the big data world.

YARN Essentials is about YARN, the modern operating system for Hadoop. This book contains all that you need to know about YARN, right from its inception to the present and future.

In the first part of the book, you will be introduced to the motivation behind the development of YARN and learn about its core architecture, installation, and administration. This part also talks about the architectural differences that YARN brings to Hadoop 2 with respect to Hadoop 1 and why this redesign was needed.

In the second part, you will learn how to write a YARN application, how to submit an application to YARN, and how to monitor the application. Next, you will learn about the various emerging open source frameworks that are developed to run on top of YARN. You will learn to develop and deploy some use case examples using Apache Samza and Storm-YARN.

Finally, we will talk about the failures in YARN, some alternative solutions available on the market, and the future and support for YARN in the big data world.


What this book covers

Chapter 1, Need for YARN, discusses the motivation behind the development of YARN. This chapter discusses what YARN is and why it is needed.

Chapter 2, YARN Architecture, is a deep dive into YARN's architecture. All the major components and their inner workings are explained in this chapter.

Chapter 3, YARN Installation, describes the steps required to set up a single-node and fully-distributed YARN cluster. It also talks about the important configurations/properties that you should be aware of while installing the YARN cluster.

Chapter 4, YARN and Hadoop Ecosystems, talks about Hadoop with respect to YARN. It gives a short introduction to the Hadoop 1.x version, the architectural differences between Hadoop 1.x and Hadoop 2.x, and where exactly YARN fits into Hadoop 2.x.

Chapter 5, YARN Administration, covers information on the administration of YARN clusters. It explains the administrative tools that are available in YARN, what they mean, and how to use them. This chapter covers various topics, from YARN container allocation and configuration to various scheduling policies/configurations and in-built support for multitenancy.

Chapter 6, Developing and Running a Simple YARN Application, focuses on some real applications with YARN, with some hands-on examples. It explains how to write a YARN application, how to submit an application to YARN, and finally, how to monitor the application.

Chapter 7, YARN Frameworks, discusses the various emerging open source frameworks that are developed to run on top of YARN. The chapter then talks in detail about Apache Samza and Storm on YARN, where we will develop and run some sample applications using these frameworks.

Chapter 8, Failures in YARN, discusses the fault-tolerance aspect of YARN. This chapter focuses on various failures that can occur in the YARN framework, their causes, and how YARN gracefully handles those failures.

Chapter 9, YARN – Alternative Solutions, discusses other alternative solutions that are available on the market today. These systems, like YARN, share common inspiration/requirements and the high-level goal of improving scalability, latency, fault-tolerance, and programming model flexibility. This chapter highlights the key differences in the way these alternative solutions address the same features provided by YARN.

Chapter 10, YARN – Future and Support, talks about YARN's journey and its present and future in the world of distributed computing.


What you need for this book

You will need a single Linux-based machine with JDK 1.6 or later installed. Any recent version of the Apache Hadoop 2 distribution will be sufficient to set up a YARN cluster and run some examples on top of YARN.

The code in this book has been tested on CentOS 6.4 but will run on other variants of Linux.


Who this book is for

This book is for big data enthusiasts who want to gain in-depth knowledge of YARN and know what really makes YARN the modern operating system for Hadoop. You will develop a good understanding of the architectural differences that YARN brings to Hadoop 2 with respect to Hadoop 1.

You will develop in-depth knowledge about the architecture and inner workings of the YARN framework.

After finishing this book, you will be able to install, administer, and develop YARN applications. This book tells you everything you need to know about YARN, right from its inception to its present and future in the big data industry.


Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The URL for NameNode is http://<namenode_host>:<port>/ and the default HTTP port is 50070."

A block of code is set as follows:

<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>read and write buffer size of files</description>
</property>

Any command-line input or output is written as follows:

${path_to_your_input_dir}

${path_to_your_output_dir_old}

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Under the Tools section, you can find the YARN configuration file details, scheduling information, container configurations, local logs of the jobs, and a lot of other information on the cluster."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.


Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.


Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.


Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.


Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.


Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.


Chapter 1. Need for YARN

YARN stands for Yet Another Resource Negotiator. YARN is a generic platform to manage resources in a typical cluster. YARN was introduced with Hadoop 2.0, which is an open source distributed processing framework from the Apache Software Foundation.

In 2012, YARN became one of the subprojects of the larger Apache Hadoop project. YARN is also known by the name MapReduce 2.0, since Apache Hadoop MapReduce has been re-architected from the ground up as Apache Hadoop YARN.

Think of YARN as a generic computing fabric that supports MapReduce and other application paradigms within the same Hadoop cluster; earlier, this was limited to batch processing using MapReduce. This really changed the game, recasting Apache Hadoop as a much more powerful data processing system. With the advent of YARN, Hadoop now looks very different compared to the way it was only a year ago.

YARN enables multiple applications to run simultaneously on the same shared cluster and allows applications to negotiate resources based on need. Therefore, resource allocation/management is central to YARN.

YARN has been thoroughly tested at Yahoo! since September 2012. It has been in production across 30,000 nodes and 325 PB of data since January 2013.

Recently, Apache Hadoop YARN won the Best Paper Award at the ACM Symposium on Cloud Computing (SoCC) in 2013!


The redesign idea

Initially, Hadoop was written solely as a MapReduce engine. Since it runs on a cluster, its cluster management components were also tightly coupled with the MapReduce programming paradigm.

The concepts of MapReduce and its programming paradigm were so deeply ingrained in Hadoop that one could not use it for anything else except MapReduce. MapReduce therefore became the base for Hadoop, and as a result, the only thing that could be run on Hadoop was a MapReduce job: batch processing. In Hadoop 1.x, a single JobTracker service was overloaded with many responsibilities, such as cluster resource management, scheduling jobs, managing computational resources, restarting failed tasks, monitoring TaskTrackers, and so on.

There was definitely a need to separate the MapReduce part (the specific programming model) from the resource management infrastructure in Hadoop. YARN was the first attempt to perform this separation.


Limitations of the classical MapReduce or Hadoop 1.x

The main limitations of Hadoop 1.x can be categorized into the following areas:

Limited scalability:

Large Hadoop clusters reported some serious limitations on scalability. This is caused mainly by the single JobTracker service, which ultimately results in a serious deterioration of the overall cluster performance because of attempts to re-replicate data and overload live nodes, thus causing a network flood. According to Yahoo!, the practical limits of such a design are reached with a cluster of ~5,000 nodes and 40,000 tasks running concurrently. Therefore, it is recommended that you create smaller and less powerful clusters for such a design.

Low cluster resource utilization:

The resources in Hadoop 1.x on each slave node (data node) are divided in terms of a fixed number of map and reduce slots. Consider the scenario where a MapReduce job has already taken up all the available map slots and now wants more new map tasks to run. In this case, it cannot run new map tasks, even though all the reduce slots are still empty. This notion of a fixed number of slots has a serious drawback and results in poor cluster utilization.
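The slot-waste scenario above can be sketched with a toy model. This is illustrative only: the slot counts and task numbers are invented for the example, and this is not how the real JobTracker is implemented.

```python
# Illustrative only: a toy model of Hadoop 1.x fixed map/reduce slots.
# Slot counts and task numbers are made up; they are not read from any
# real cluster configuration.

def runnable_tasks(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Return how many tasks can start under the fixed-slot model."""
    running_maps = min(map_slots, pending_maps)
    running_reduces = min(reduce_slots, pending_reduces)
    return running_maps, running_reduces

# A node offers 10 map slots and 10 reduce slots (20 slots in total).
# A job has 15 map tasks ready and no reduce tasks yet.
maps, reduces = runnable_tasks(10, 10, 15, 0)

# Only 10 map tasks run; the 10 empty reduce slots cannot host the
# remaining 5 map tasks, so half the node sits idle.
print(maps, reduces)          # 10 0
print((maps + reduces) / 20)  # 0.5 -> 50% slot utilization
```

The point of the sketch is that the waste is structural: no matter how the job is tuned, idle reduce slots can never absorb pending map work under this model.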

Lack of support for alternative frameworks/paradigms:

The main focus of Hadoop right from the beginning was to perform computation on large datasets using parallel processing. Therefore, the only programming model it supported was MapReduce. With the current industry needs in terms of new use cases in the world of big data, many new and alternative programming models (such as Apache Giraph, Apache Spark, Storm, Tez, and so on) are coming into the picture each day. There is definitely an increasing demand to support multiple programming paradigms besides MapReduce, to support the varied use cases that the big data world is facing.


YARN as the modern operating system of Hadoop

The MapReduce programming model is, no doubt, great for many applications, but not for everything in the world of computation. There are use cases that are best suited for MapReduce, but not all.

MapReduce is essentially batch-oriented, but support for real-time and near real-time processing are the emerging requirements in the field of big data.

YARN took the cluster resource management capabilities out of the MapReduce system so that new engines could use these generic capabilities. This lightened the MapReduce system to focus on the data processing part, which it is good at and will ideally continue to be.

YARN therefore turns into a data operating system for Hadoop 2.0, as it enables multiple applications to coexist in the same shared cluster. Refer to the following figure:

YARN as a modern OS for Hadoop


What are the design goals for YARN

This section talks about the core design goals of YARN:

Scalability:

Scalability is a key requirement for big data. Hadoop was primarily meant to work on a cluster of thousands of nodes with commodity hardware. Also, the cost of hardware is reducing year-on-year. YARN is therefore designed to perform efficiently on this network of a myriad of nodes.

High cluster utilization:

In Hadoop 1.x, the cluster resources were divided in terms of fixed-size slots for both map and reduce tasks. This means that there could be a scenario where map slots might be full while reduce slots are empty, or vice versa. This was definitely not an optimal utilization of resources, and it needed further optimization. YARN manages resources at a fine granularity, in terms of RAM, CPU, and disk (containers), leading to an optimal utilization of the available resources.
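As a rough sketch of the container idea, consider a hypothetical greedy allocator over one node. This is not YARN's actual scheduler; the node size (8 GB, 8 vcores) and the request shapes are invented for the illustration, and real requests go through YARN's scheduler APIs.

```python
# Illustrative only: YARN-style fine-grained containers on one node.
# Node capacity and request sizes are invented for the sketch.

def allocate(node_mem_mb, node_vcores, requests):
    """Greedily grant (mem_mb, vcores) requests until the node is full."""
    granted = []
    for mem, cores in requests:
        if mem <= node_mem_mb and cores <= node_vcores:
            node_mem_mb -= mem
            node_vcores -= cores
            granted.append((mem, cores))
    return granted, node_mem_mb, node_vcores

# Mixed workloads share the same node: four small map containers,
# one larger in-memory task, and a modest service container.
requests = [(1024, 1)] * 4 + [(3072, 2), (1024, 1)]
granted, mem_left, cores_left = allocate(8192, 8, requests)

print(len(granted))          # all 6 requests fit
print(mem_left, cores_left)  # 0 1
```

Because requests are sized individually rather than forced into map or reduce slots, heterogeneous workloads can pack the node's memory completely instead of leaving whole slot classes idle.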

Locality awareness:

This is a key requirement for YARN when dealing with big data; moving computation is cheaper than moving data. This helps to minimize network congestion and increase the overall throughput of the system.
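The node-local, then rack-local, then off-rack preference that locality-aware scheduling encodes can be sketched as follows. The host and rack names are hypothetical; a real scheduler obtains block locations from HDFS rather than from hard-coded sets.

```python
# Illustrative only: prefer nodes that already hold the data block.
# Hosts, racks, and block locations are invented for the sketch.

def place_task(block_hosts, block_racks, free_nodes, rack_of):
    """Pick a node for a task, preferring locality to the data."""
    for node in free_nodes:           # 1. node-local: data is on the node
        if node in block_hosts:
            return node, "node-local"
    for node in free_nodes:           # 2. rack-local: same rack as the data
        if rack_of[node] in block_racks:
            return node, "rack-local"
    return free_nodes[0], "off-rack"  # 3. fall back: the data must move

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
block_hosts = {"n2"}   # the block lives on n2
block_racks = {"r1"}   # ... which is in rack r1

print(place_task(block_hosts, block_racks, ["n3", "n2"], rack_of))  # ('n2', 'node-local')
print(place_task(block_hosts, block_racks, ["n1", "n3"], rack_of))  # ('n1', 'rack-local')
print(place_task(block_hosts, block_racks, ["n3"], rack_of))        # ('n3', 'off-rack')
```

Only the last case pays the cost of shipping the block across racks, which is exactly the network congestion the design goal tries to avoid.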

Multitenancy:

With the core development of Hadoop at Yahoo!, primarily to support large-scale computation, HDFS also acquired a permission model, quotas, and other features to improve its multitenant operation. YARN was therefore designed to support multitenancy in its core architecture. Since cluster resource allocation/management is at the heart of YARN, sharing processing and storage capacity across clusters was central to the design. YARN has the notion of pluggable schedulers, and the Capacity Scheduler with YARN has been enhanced to provide a flexible resource model, elastic computing, application limits, and other necessary features that enable multiple tenants to securely share the cluster in an optimized way.

Support for programming models:

The MapReduce programming model is, no doubt, great for many applications, but not for everything in the world of computation. As the world of big data is still in its inception phase, organizations are heavily investing in R&D to develop new and evolving frameworks to solve the variety of problems that big data brings.

A flexible resource model:


Besidesmismatchwiththeemergingframeworks’requirements,thefixednumberofslotsforresourceshadseriousproblems.ItwasstraightforwardforYARNtocomeupwithaflexibleandgenericresourcemanagementmodel.

A secure and auditable operation:

As Hadoop continued to grow to manage more tenants with a myriad of use cases across different industries, the requirements for isolation became more demanding. Also, the authorization model lacked strong and scalable authentication. This is because Hadoop was designed with parallel processing in mind, with no comprehensive security. Security was an afterthought. YARN understands this and adds security-related requirements into its design.

Reliability/availability:

Although fault tolerance is in the core design, in reality maintaining a large Hadoop cluster is a tedious task. All issues related to high availability, failures, failures on restart, and reliability were therefore a core requirement for YARN.

Backward compatibility:

Hadoop 1.x has been in the picture for a while, with many successful production deployments across many industries. This massive installation base of MapReduce applications and the ecosystem of related projects, such as Hive, Pig, and so on, would not tolerate a radical redesign. Therefore, the new architecture reused as much code from the existing framework as possible, and no major surgery was conducted on it. This made MRv2 able to ensure satisfactory compatibility with MRv1 applications.


Summary

In this chapter, you learned what YARN is and how it has turned out to be the modern operating system for Hadoop, making it a multi-application platform.

In Chapter 2, YARN Architecture, we will be talking about the architecture details of YARN.


Chapter 2. YARN Architecture

This chapter dives deep into the YARN architecture, its core components, and how they interact to deliver optimal resource utilization, better performance, and manageability. It also focuses on some important terminology concerning YARN.

In this chapter, we will cover the following topics:

Core components of YARN architecture
Interaction and flow of YARN components
ResourceManager scheduling policies
Recent developments in YARN

The motivation behind the YARN architecture is to support more data processing models, such as Apache Spark, Apache Storm, Apache Giraph, Apache HAMA, and so on, than just MapReduce. YARN provides a platform to develop and execute distributed processing applications. It also improves efficiency and resource-sharing capabilities.

The design decision behind the YARN architecture is to separate the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons, that is, a cluster-level ResourceManager (RM) and an application-specific ApplicationMaster (AM). The YARN architecture follows a master-slave model in which the ResourceManager is the master and the node-specific NodeManager (NM) is the slave. The global ResourceManager and per-node NodeManagers build a most generic, scalable, and simple platform for distributed application management. The ResourceManager is the supervisor component that manages the resources among the applications in the whole system. The per-application ApplicationMaster is the application-specific daemon that negotiates resources from the ResourceManager and works hand in hand with the NodeManagers to execute and monitor the application's tasks.

The following diagram explains how the JobTracker is replaced by a global-level ResourceManager and ApplicationsManager, and a per-node TaskTracker is replaced by an application-level ApplicationMaster to manage its functions and responsibilities. The JobTracker and TaskTracker only support MapReduce applications, with less scalability and poor cluster utilization. Now, YARN supports multiple distributed data processing models with improved scalability and cluster utilization.


The ResourceManager has a cluster-level scheduler that is responsible for resource allocation to all the running tasks as per the ApplicationMasters' requests. The primary responsibility of the ResourceManager is to allocate resources to the application(s). The ResourceManager is not responsible for tracking the status of an application or monitoring tasks. Also, it doesn't guarantee restarting/balancing tasks in the case of application or hardware failure.

The application-level ApplicationMaster is responsible for negotiating resources, such as memory, CPU, disk, and so on, from the ResourceManager on application submission. It is also responsible for tracking an application's status and monitoring application processes in coordination with the NodeManager.

Let's have a look at the high-level architecture of Hadoop 2.0. As you can see, more applications can be supported by YARN than just the MapReduce application. The key component of Hadoop 2 is YARN, for better cluster resource management, and the underlying filesystem remains the Hadoop Distributed File System (HDFS), as shown in the following image:


Here are some key concepts that we should know before exploring the YARN architecture in detail:

Application: This is the job submitted to the framework, for example a MapReduce job. It could also be a shell script.
Container: This is the basic unit of hardware allocation, for example a container that has 4 GB of RAM and one CPU. The container enables optimized resource allocation; this replaces the fixed map and reduce slots of the previous versions of Hadoop.
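The container sizes a scheduler may grant are bounded by configuration. As a hedged sketch (the values below are illustrative, not recommendations), the bounds can be set in yarn-site.xml:

```xml
<configuration>
  <!-- Smallest container the scheduler will allocate (MB);
       smaller requests are rounded up to this value -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <!-- Largest container a single request may ask for (MB) -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <!-- Virtual cores follow the same pattern -->
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
  </property>
</configuration>
```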


Core components of YARN architecture

Here are the core components of the YARN architecture that we need to know:

ResourceManager
ApplicationMaster
NodeManager


ResourceManager

The ResourceManager acts as a global resource scheduler that is responsible for resource management and scheduling as per the ApplicationMasters' requests for the resource requirements of the application(s). It is also responsible for the management of hierarchical job queues. The ResourceManager can be seen in the following figure:

The preceding diagram gives more details about the components of the ResourceManager. The Admin and Client service is responsible for client interactions, such as a job request submission, start, restart, and so on. The ApplicationsManager is responsible for the management of every application. The ApplicationMasterService interacts with every application's ApplicationMaster regarding resource or container negotiation, and the ResourceTrackerService coordinates between the NodeManagers and the ResourceManager. The ApplicationMasterLauncher service is responsible for launching a container for the ApplicationMaster on job submission from the client. The Scheduler and Security components are the core parts of the ResourceManager. As already explained, the Scheduler is responsible for resource negotiation and allocation to the applications as per the requests of the ApplicationMasters. There are three different scheduler policies, FIFO, Fair, and Capacity, which will be explained in detail later in this chapter. The Security component is responsible for generating and delegating the ApplicationToken and ContainerToken to access the application and container, respectively.


ApplicationMaster (AM)

The ApplicationMaster works at a per-application level. It is responsible for the application's lifecycle management and for negotiating the appropriate resources from the Scheduler, tracking their status, and monitoring progress; an example is the MapReduce ApplicationMaster.


NodeManager (NM)

The NodeManager acts as a per-machine agent and is responsible for managing the lifecycle of the containers and for monitoring their resource usage. The core components of the NodeManager are shown in the following diagram:

The component responsible for communication between the NodeManager and the ResourceManager is the NodeStatusUpdater. The ContainerManager is the core component of the NodeManager; it manages all the containers that run on the node. The NodeHealthCheckerService is the service that monitors the node's health and communicates the node's heartbeat to the ResourceManager via the NodeStatusUpdater service. The ContainerExecutor is the process responsible for interacting with native hardware or software to start or stop the container process. Management of the Access Control List (ACL) and access token verification is performed by the Security component.

Let's take a look at one scenario to understand the YARN architecture in detail. Refer to the following diagram:


Say we have two client requests: one wants to execute a simple shell script, while another one wants to execute a complex MapReduce job. The shell script is represented in maroon, while the MapReduce job is represented in light green in the preceding diagram.

The ResourceManager has two main components, the ApplicationsManager and the Scheduler. The ApplicationsManager is responsible for accepting the client's job submission requests, negotiating the container to execute the application-specific ApplicationMaster, and providing the services to restart the ApplicationMaster on failure. The responsibility of the Scheduler is to allocate resources to the various running applications with respect to the application resource requirements and the available resources. The Scheduler is a pure scheduler in the sense that it provides no monitoring or tracking functions for the application. Also, it doesn't offer any guarantees for restarting a failed task, whether the failure is in the application or in the hardware. The Scheduler performs its scheduling tasks based on the resource requirements of the application(s); it does so based on the abstract notion of the resource container, which incorporates elements such as CPU, memory, disk, and so on.

The NodeManager is the per-machine framework daemon that is responsible for the containers' lifecycles. It is also responsible for monitoring their resource usage, for example, memory, CPU, disk, network, and so on, and for reporting this to the ResourceManager accordingly. The application-level ApplicationMaster is responsible for negotiating the required resource containers from the Scheduler, tracking their status, and monitoring progress. In the preceding diagram, you can see that both jobs, the shell script and


the MapReduce job, have an individual ApplicationMaster that allocates resources for job execution and tracks/monitors the job execution status.

Now, take a look at the execution sequence of the application. Refer to the preceding application flow diagram.

A client submits the application to the ResourceManager. In the preceding diagram, client 1 submits a shell script request (maroon), and client 2 submits a MapReduce request (green):

1. Then, the ResourceManager allocates a container to start up the ApplicationMaster as per the application submitted by the client: one ApplicationMaster for the shell script and one for the MapReduce application.

2. While starting up, the ApplicationMaster registers the application with the ResourceManager.

3. After the startup of the ApplicationMaster, it negotiates with the ResourceManager for the appropriate resources as per the application requirements.

4. Then, after resource allocation from the ResourceManager, the ApplicationMaster requests that the NodeManager launch the containers allocated by the ResourceManager.

5. On successful launching of the containers, the application code executes within the containers, and the ApplicationMaster reports back to the ResourceManager with the execution status of the application.

6. During the execution of the application, the client can request the ApplicationMaster or the ResourceManager directly for the application status, progress updates, and so on.

7. On completion of the application, the ApplicationMaster unregisters with the ResourceManager and shuts down its own container process.


YARN scheduler policies

As explained in the previous section, the ResourceManager acts as a pluggable global scheduler that manages and controls all the containers (resources). There are three different policies that can be applied to the scheduler, as per requirements and resource availability. They are as follows:

The FIFO scheduler
The Fair scheduler
The Capacity scheduler


The FIFO (First In First Out) scheduler

FIFO means First In, First Out. As the name indicates, the job submitted first gets priority to execute; in other words, jobs run in the order of submission. FIFO is a queue-based scheduler. It is a very simple approach to scheduling, and it does not guarantee efficient utilization, as each job uses the whole cluster for execution, so other jobs must keep waiting for it to finish, even though a shared cluster has more than enough capacity to serve many users.
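The active policy is selected through the yarn.resourcemanager.scheduler.class property in yarn-site.xml. As a sketch, switching to the FIFO scheduler looks like this (class name as found in Hadoop 2.x):

```xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>
```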


The Fair scheduler

Fair scheduling is a scheduling policy that assigns resources to applications so that all applications get an equal share of cluster resources over a period of time. For example, if a single job is running, it gets all the resources available in the cluster, and as the number of jobs increases, freed resources are given to the new jobs so that each user gets a fair share of the cluster. If two users submit two different jobs, a short job belonging to one user will complete in a small time span while the longer job submitted by the other user keeps running, so long jobs still make some progress.

In a fair scheduling policy, all jobs are placed into job pools, specific to users; accordingly, each user gets their own job pool. A user who submits more jobs than another user will not, on average, get more resources than the other user. You may even define your own customized job pools with specific configurations. Fair scheduling is preemptive: if a pool has not received its fair share of resources for a certain period of time, the scheduler will kill tasks in pools that run over capacity in order to release resources to the pools that run under capacity.

In addition to fair sharing, the Fair scheduler allocates a guaranteed minimum share of resources to the pools. This is always helpful for the users, groups, or applications, as they always get sufficient resources for execution.
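As a sketch of how such pools and guaranteed minimum shares are declared (the queue names and values here are illustrative), the Fair scheduler reads an allocation file referenced by the yarn.scheduler.fair.allocation.file property:

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="analytics">
    <!-- guaranteed minimum share for this pool -->
    <minResources>4096 mb,2 vcores</minResources>
    <weight>2.0</weight>
  </queue>
  <queue name="adhoc">
    <minResources>1024 mb,1 vcores</minResources>
    <weight>1.0</weight>
  </queue>
</allocations>
```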


The Capacity scheduler

The Capacity scheduler is designed to allow applications to share cluster resources in a predictable and simple fashion, using what are commonly known as "job queues". The main idea behind capacity scheduling is to allocate available resources to the running applications based on individual needs and requirements. There are additional benefits when running applications under capacity scheduling, as they can access the excess capacity resources that are not being used by other applications.

The abstraction provided by the Capacity scheduler is the queue. It provides capacity guarantees with support for multiple queues: a job is submitted to a queue, and each queue is allocated a capacity in the sense that a certain share of resources will be at its disposal. All the jobs submitted to a queue have access to the resources allocated to that queue. Admins can control the capacity of each queue.

Here are some basic features of the Capacity scheduler:

Security: Each queue has strict ACLs that control the authorization and authentication of users who can submit jobs to individual queues.
Elasticity: Free resources can be allocated to any queue beyond its capacity. If there is demand for these resources from queues that run below capacity, then as soon as the tasks scheduled on these resources complete, the resources are assigned to jobs on the queues that run under capacity.
Operability: The admin can, at any point in time, change queue definitions and properties.
Multitenancy: A comprehensive set of limits is provided to prevent a single job, user, or queue from monopolizing the resources of a queue or the cluster. This ensures that the cluster is not overwhelmed by too many tasks.
Resource-based scheduling: Support for resource-intensive jobs, as jobs can demand higher resource requirements than the default.
Job priorities: Job queues can support job priorities. Within a queue, jobs with higher priority have access to resources before jobs with lower priority.

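To make the queue abstraction concrete, here is a hedged sketch of a capacity-scheduler.xml fragment (the queue names prod and dev are illustrative); the capacities of sibling queues must add up to 100 percent:

```xml
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
  <!-- elasticity: let dev grow into idle capacity, up to 50% -->
  <property>
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```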

Recent developments in YARN architecture

The ResourceManager is a single point of failure and may restart for various reasons: bugs, hardware failure, deliberate downtime for upgrading, and so on.

We already saw how crucial the role of the ResourceManager in the YARN architecture is. The ResourceManager has become a single point of failure; if the ResourceManager in a cluster goes down, everything on that cluster will be lost.

So in a recent development of YARN, ResourceManager HA became a high priority. This recent development of YARN not only covers ResourceManager HA, but also provides transparency to users and does not require them to monitor such events explicitly and resubmit the jobs.

This was overly complex in MRv1, owing to the fact that the JobTracker had to save too much metadata: both the cluster state and the per-application running state. This means that if the JobTracker dies, all the applications in a running state are lost.

The development of ResourceManager recovery will be done in two phases:

1. RM Restart Phase I: In this phase, all the applications are killed while restarting the ResourceManager on failure. No application state is stored. Development of this phase is almost complete.

2. RM Restart Phase II: In Phase II, the application state is stored, which means that on RM failure applications are not killed; they report their running state back to the RM after the RM comes back up.

The ResourceManager will be used only to save an application's submission metadata and cluster-level information. Application state persistence and the recovery of specific information will be managed by the application itself.


As shown in the preceding diagram, in the next version, we will get a pluggable state store, such as ZooKeeper or HDFS, that can store the state of the running applications. ResourceManager HA would contain synchronized active-passive ResourceManager architectural models managed by ZooKeeper; as one goes down, the other can take over cluster responsibility without halting and losing information.
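A hedged sketch of what such an active-passive setup looks like in yarn-site.xml (the hostnames are illustrative, and the exact properties may vary between releases):

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181</value>
  </property>
  <!-- persist application state so it survives an RM failover -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
</configuration>
```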


Summary

In this chapter, we covered the architectural components of YARN, their responsibilities, and their interoperation. We also focused on some major development work going on in the community to overcome the drawbacks of the current release. In the next chapter, we will cover the installation steps of YARN.


Chapter 3. YARN Installation

In this chapter, we'll cover the installation of Hadoop and YARN and their configuration for single-node and cluster setups. We will consider Hadoop as two different components: one is the Hadoop Distributed File System (HDFS); the other is YARN. The YARN components take care of resource allocation and the scheduling of the jobs that run over the data stored in HDFS. We'll cover most of the configurations needed to make YARN distributed computing more optimized and efficient.

In this chapter, we will cover the following topics:

Hadoop and YARN single-node installation
Hadoop and YARN fully-distributed mode installation
Operating Hadoop and YARN clusters


Single-node installation

Let's start with the steps for Hadoop's single-node installation, as it's easy to understand and set up. This way, we can quickly perform simple operations using Hadoop MapReduce and HDFS.


Prerequisites

Here are some prerequisites needed for Hadoop installations; make sure that they are fulfilled before starting to work with Hadoop and YARN.

Platform

GNU/Linux is supported as both a development and a production platform for Hadoop installation. The Windows platform is also supported, with some extra configuration. We'll focus on Linux-based platforms, as Hadoop is more widely used on them and works more efficiently on Linux than on Windows systems. Here are the steps for single-node Hadoop installation on Linux systems. If you want to install it on Windows, refer to the Hadoop wiki page for the installation steps.

Software

Make sure that the following software is installed before installing Hadoop.

Java must be installed. Confirm that the Java version is compatible with the Hadoop version to be installed by checking the Hadoop wiki page (http://wiki.apache.org/hadoop/HadoopJavaVersions).

SSH and sshd must be installed and running, as they are used by the Hadoop scripts to manage remote Hadoop daemons.

Now, download the recent stable release of the Hadoop distribution from the Apache mirrors and archives using the following command:

$ wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

Note that at the time of writing this book, Hadoop 2.6.0 was the most recent stable release. Now use the following commands:

$ mkdir -p /opt/yarn
$ cd /opt/yarn
$ tar xvzf /root/hadoop-2.6.0.tar.gz


Starting with the installation

Now, unzip the downloaded distribution under the /etc/ directory. Change the Hadoop environment parameters as per the following configurations.

Set the JAVA_HOME environment parameter to the Java root installed before:

$ export JAVA_HOME=/etc/java/latest

Set the Hadoop home to the Hadoop installation directory:

$ export HADOOP_HOME=/etc/hadoop

Try running the hadoop command. It should display the Hadoop documentation; this indicates a successful Hadoop configuration.

Now, our Hadoop single-node setup is ready to run in the following modes.

The standalone mode (local mode)

By default, Hadoop runs in standalone mode as a single Java process. This mode is useful for development and debugging.

The pseudo-distributed mode

Hadoop can run on a single node in pseudo-distributed mode, where each daemon runs as a separate Java process. To run Hadoop in pseudo-distributed mode, follow these configuration instructions. First, navigate to /etc/hadoop/core-site.xml.

This configuration sets up the NameNode to run on localhost, port 9000. You can set the following property for the NameNode:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Now navigate to /etc/hadoop/hdfs-site.xml.

By setting the following property, we are ensuring that the replication factor of each data block is 3 (by default, the replication factor is 3):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

Then, format the Hadoop filesystem using this command:

$ $HADOOP_HOME/bin/hdfs namenode -format

After formatting the filesystem, start the namenode and datanode daemons using the next command. You can see logs under the $HADOOP_HOME/logs directory by default:

$ $HADOOP_HOME/sbin/start-dfs.sh

Now, we can see the NameNode UI on the web interface. Hit http://localhost:50070/ in the browser.

Create the HDFS directories that are required to run MapReduce jobs:

$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user
$ $HADOOP_HOME/bin/hdfs dfs -mkdir /user/{username}

To run a MapReduce job on YARN in pseudo-distributed mode, you need to start the ResourceManager and NodeManager daemons. Navigate to /etc/hadoop/mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Navigate to /etc/hadoop/yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Now, start the ResourceManager and NodeManager daemons by issuing this command:

$ sbin/start-yarn.sh

By simply navigating to http://localhost:8088/ in your browser, you can see the web interface for the ResourceManager. From here, you can start, restart, or stop jobs.

To stop the YARN daemons, you need to run the following command:

$ $HADOOP_HOME/sbin/stop-yarn.sh

This is how we can configure Hadoop and YARN on a single node in standalone and pseudo-distributed modes. Moving forward, we will focus on fully-distributed mode. As the basic configuration remains the same, we only need to do some extra configuration for fully-distributed mode. A single-node setup is mainly used for the development and debugging of distributed applications, while fully-distributed mode is used for production setups.


The fully-distributed mode

In the previous section, we highlighted the standalone Hadoop and YARN configurations, and in this section we'll focus on the fully-distributed mode setup. This section describes how to install, configure, and manage Hadoop and YARN in fully-distributed, very large clusters with thousands of nodes.

In order to start with fully-distributed mode, we first need to download a stable version of Hadoop from the Apache mirrors. Installing Hadoop in distributed mode generally means unpacking the software distribution on each machine in the cluster or installing Red Hat Package Managers (RPMs). As Hadoop follows a master-slave architecture, one machine in the cluster is designated as the NameNode (NN) and one as the ResourceManager (RM), while the rest of the machines, the DataNodes (DN) and NodeManagers (NM), typically act as slaves.

After the successful unpacking of the software distribution on each cluster machine or the RPM installation, you need to take care of a very important part of the Hadoop installation phase: Hadoop configuration.

Hadoop typically has two types of configuration: one is the read-only default configuration (core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml), while the other is the site-specific configuration (core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml). All these files are found under the $HADOOP_HOME/conf directory.

In addition to the preceding configuration files, the Hadoop-environment and YARN-environment specific files are found in conf/hadoop-env.sh and conf/yarn-env.sh. As for the Hadoop and YARN cluster configuration, you need to set up an environment in which the Hadoop daemons can execute. The Hadoop/YARN daemons are the NameNode/ResourceManager (masters) and the DataNode/NodeManager (slaves).

First, make sure that JAVA_HOME is correctly specified on each node.

Here are some important configuration parameters with respect to each daemon:

NameNode: HADOOP_NAMENODE_OPTS
DataNode: HADOOP_DATANODE_OPTS
Secondary NameNode: HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager: YARN_RESOURCEMANAGER_OPTS
NodeManager: YARN_NODEMANAGER_OPTS
WebAppProxy: YARN_PROXYSERVER_OPTS
MapReduce JobHistory Server: HADOOP_JOB_HISTORYSERVER_OPTS

For example, to run the NameNode in parallel GC mode, the following line should be added to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

Here are some important configuration parameters with respect to each daemon and its configuration files.


Navigate to conf/core-site.xml and configure it as follows:

fs.defaultFS: the NameNode URI, hdfs://<hdfs host>:<hdfs port>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://<hdfs hostname>:<hdfs port></value>
  <description>This is the NameNode URI</description>
</property>

io.file.buffer.size: 4096, the read and write buffer size of files.

This is the buffer size for I/O (read/write) operations on sequence files stored in disk files; that is, it determines how much data is buffered in I/O pipes before transferring it to other operations during read/write operations. It should be a multiple of the OS filesystem block size.

<property>
  <name>io.file.buffer.size</name>
  <value>4096</value>
  <description>Read and write buffer size of files</description>
</property>

Now navigate to conf/hdfs-site.xml. Here is the configuration for the NameNode:

dfs.namenode.name.dir: The path on the local filesystem where the NameNode stores the namespace and transaction logs.
dfs.namenode.hosts: The list of permitted DataNodes.
dfs.namenode.hosts.exclude: The list of excluded DataNodes.
dfs.blocksize: The default value is 268435456, that is, an HDFS block size of 256 MB for large filesystems.
dfs.namenode.handler.count: The default value is 100. Set more NameNode server threads to handle RPCs from a large number of DataNodes.

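The NameNode parameters above translate into hdfs-site.xml entries along these lines (a sketch; the paths are illustrative):

```xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
  </property>
  <!-- 268435456 bytes = 256 MB block size -->
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
</configuration>
```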
The configuration for the DataNode is as follows:

dfs.datanode.data.dir: The comma-delimited list of paths on the local filesystems where the DataNode stores the blocks.

Now navigate to conf/yarn-site.xml. We'll take a look at the configurations related to the ResourceManager and NodeManager:

yarn.acl.enable: The values are true or false to enable or disable ACLs. The default value is false.
yarn.admin.acl: This refers to the admin ACL, which sets the admins on the cluster. The default is *, which means anyone can do admin tasks. This can be a comma-delimited list of users and groups to set more than one admin.
yarn.log-aggregation-enable: This is true or false to enable or disable log aggregation.

Now, we will take a look at the configurations for the ResourceManager in the conf/yarn-site.xml file:

Parameter Description

yarn.resourcemanager.address ThisistheResourceManagerhost:portforclientstosubmitjobs.

yarn.resourcemanager.scheduler.addressThisistheResourceManagerhost:portforApplicationMasterstotalktotheSchedulertoobtainresources.

yarn.resourcemanager.resource-

tracker.addressThisistheResourceManagerhost:portforNodeManagers.

yarn.resourcemanager.admin.addressThisistheResourceManagerhost:portforadministrativecommands.

yarn.resourcemanager.webapp.address ThisistheResourceManagerweb-uihost:port.

yarn.resourcemanager.scheduler.classThisistheResourceManagerSchedulerclass.ThevaluesareCapacityScheduler,FairScheduler,andFifoScheduler.

yarn.scheduler.minimum-allocation-mbThisistheminimumlimitofmemorytoallocatetoeachcontainerrequestintheResourceManager.

yarn.scheduler.maximum-allocation-mbThisisthemaximumlimitofmemorytoallocatetoeachcontainerrequestintheResourceManager.

yarn.resourcemanager.nodes.include-path/

yarn.resourcemanager.nodes.exclude-path

Thisisthelistofpermitted/excludedNodeManagers.Ifnecessary,usethesefilestocontrolthelistofpermittedNodeManagers.
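As an illustration, the ResourceManager endpoints described above could be set in conf/yarn-site.xml as follows. This is a minimal sketch; the host name rm.example.com and the port numbers are illustrative assumptions, not values mandated by the text:

```xml
<!-- Clients submit jobs here. -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>rm.example.com:8032</value>
</property>
<!-- ApplicationMasters request resources here. -->
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>rm.example.com:8030</value>
</property>
<!-- Choose the scheduler implementation. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```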

Now take a look at the configurations for the NodeManager in conf/yarn-site.xml:

yarn.nodemanager.resource.memory-mb: This is the available physical memory (in MB) for the NodeManager. It defines the total memory resources on the NodeManager that are made available to the running containers.

yarn.nodemanager.vmem-pmem-ratio: This is the maximum ratio by which the virtual memory usage of tasks may exceed physical memory.

yarn.nodemanager.local-dirs: This is the list of directory paths on the local filesystem where intermediate data is written. This should be a comma-separated list.

yarn.nodemanager.log-dirs: This is the path on the local filesystem where logs are written.

yarn.nodemanager.log.retain-seconds: This is the time (in seconds) to retain log files on the NodeManager. The default value is 10800 seconds. This configuration is applicable only if log aggregation is disabled.

yarn.nodemanager.remote-app-log-dir: This is the HDFS directory path to which logs are moved after application completion. The default path is /logs. This configuration is applicable only if log aggregation is enabled.

yarn.nodemanager.remote-app-log-dir-suffix: This is the suffix appended to the remote log directory. This configuration is applicable only if log aggregation is enabled.

yarn.nodemanager.aux-services: This selects the auxiliary shuffle service, which specifically needs to be set (to mapreduce_shuffle) for MapReduce applications.


HistoryServer
The HistoryServer gives all YARN applications a central location in which to aggregate their completed jobs for historical reference and debugging. The settings for the MapReduce JobHistory Server can be found in the mapred-default.xml file:

mapreduce.jobhistory.address: This is the MapReduce JobHistory Server host:port. The default port is 10020.

mapreduce.jobhistory.webapp.address: This is the MapReduce JobHistory Server web UI host:port. The default port is 19888.

mapreduce.jobhistory.intermediate-done-dir: This is the directory where history files are written by MapReduce jobs (in HDFS). The default is /mr-history/tmp.

mapreduce.jobhistory.done-dir: This is the directory where history files are managed by the MR JobHistory Server (in HDFS). The default is /mr-history/done.
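These defaults can be overridden in mapred-site.xml. A minimal sketch follows; the host name historyserver.example.com is an illustrative assumption:

```xml
<!-- RPC endpoint of the MapReduce JobHistory Server. -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>historyserver.example.com:10020</value>
</property>
<!-- Web UI endpoint of the MapReduce JobHistory Server. -->
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>historyserver.example.com:19888</value>
</property>
```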


Slave files
With respect to the Hadoop slave and YARN slave nodes, one generally chooses one node in the cluster as the NameNode (the Hadoop master), another node as the ResourceManager (the YARN master), and the rest of the machines act as both Hadoop slave DataNodes and YARN slave NodeManagers. List all the slaves, one hostname or IP address per line, in your Hadoop conf/slaves file.
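For example, a conf/slaves file for a four-worker cluster might look like the following (the host names are illustrative assumptions):

```text
slave01.example.com
slave02.example.com
slave03.example.com
slave04.example.com
```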


Operating Hadoop and YARN clusters
This is the final stage of the Hadoop and YARN cluster setup and configuration. Here are the commands that need to be used to start and stop the Hadoop and YARN clusters.


Starting Hadoop and YARN clusters
To start Hadoop and the YARN cluster, use the following procedure:

1. Format the Hadoop distributed file system:

$HADOOP_HOME/bin/hdfs namenode -format <cluster_name>

2. The following command is used to start HDFS. Run it on the NameNode:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

3. Run this command to start DataNodes on all slave nodes:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

4. Start YARN with the following command on the ResourceManager:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

5. Execute this command to start NodeManagers on all slaves:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

6. Start a standalone WebAppProxy server. This is used for load-balancing purposes on a multiserver cluster:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR

7. Execute this command on the designated HistoryServer:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR


Stopping Hadoop and YARN clusters
To stop Hadoop and the YARN cluster, use the following procedure:

1. Use the following command on the NameNode to stop it:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode

2. Issue this command on all the slave nodes to stop DataNodes:

$HADOOP_HOME/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode

3. To stop the ResourceManager, issue the following command on the designated ResourceManager:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager

4. The following command is used to stop the NodeManager on all slave nodes:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager

5. Stop the WebAppProxy server:

$HADOOP_YARN_HOME/sbin/yarn-daemon.sh stop proxyserver --config $HADOOP_CONF_DIR

6. Stop the MapReduce JobHistory Server by running the following command on the HistoryServer:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOOP_CONF_DIR


Web interfaces of the Ecosystem
That covers the Hadoop and YARN setup, configuration, and the commands for operating Hadoop and YARN. Here are some web interfaces used by Hadoop and YARN administrators for admin tasks:

The URL for the NameNode is http://<namenode_host>:<port>/ and the default HTTP port is 50070.

The URL for the ResourceManager is http://<resourcemanager_host>:<port>/ and the default HTTP port is 8088. The web UI for the NameNode can be seen as follows:


The URL for the MapReduce JobHistory Server is http://<jobhistoryserver_host>:<port>/ and the default HTTP port is 19888.


Summary
In this chapter, we covered the Hadoop and YARN single-node and fully distributed cluster setups and the important configurations. We also covered the basic but important commands to administer Hadoop and YARN clusters. In the next chapter, we'll look at the Hadoop and YARN components in more detail.


Chapter 4. YARN and Hadoop Ecosystems
This chapter discusses YARN with respect to Hadoop, since it is very important to know where exactly YARN now fits in Hadoop 2.

Hadoop 2 has undergone a complete change in terms of architecture and components compared to Hadoop 1.

In this chapter, we will cover the following topics:

- A short introduction to Hadoop 1
- The difference between MRv1 and MRv2
- Where YARN fits in Hadoop 2
- Old and new MapReduce APIs
- Backward compatibility of MRv2 APIs
- Practical examples of MRv1 and MRv2


The Hadoop 2 release
YARN came into the picture with the release of Hadoop 0.23 on November 11, 2011. This was the alpha version of the Hadoop 0.23 major release.

The major difference between the 0.23 and pre-0.23 releases is that the 0.23 release underwent a complete revamp in terms of the MapReduce engine and resource management. The 0.23 release separated out resource management and application lifecycle management.


A short introduction to Hadoop 1.x and MRv1
We will briefly look at basic Apache Hadoop 1.x and its processing framework, MRv1 (Classic), so that we can get a clear picture of the differences in Apache Hadoop 2.x MRv2 (YARN) in terms of architecture, components, and processing framework.

Apache Hadoop is a scalable, fault-tolerant distributed system for data storage and processing. The core programming model in Hadoop is MapReduce.

Since 2004, Hadoop has emerged as the de facto standard to store, process, and analyze hundreds of terabytes, and even petabytes, of data.

The major components in Hadoop 1.x are as follows:

NameNode: This keeps the metadata in the main memory.
DataNode: This is where the data resides, in the form of blocks.
JobTracker: This assigns/reassigns MapReduce tasks to the TaskTrackers in the cluster and tracks the status of each TaskTracker.
TaskTracker: This executes the task assigned by the JobTracker and sends the status of the task back to the JobTracker.

The major components of Hadoop 1.x can be seen as follows:

A typical Hadoop 1.x cluster (shown in the preceding figure) can consist of thousands of nodes. It follows the master/slave pattern, where the NameNode/JobTracker are the masters and the DataNodes/TaskTrackers are the slaves.

The main data processing is distributed across the cluster on the DataNodes to increase parallel processing.

The master NameNode process (the master for the slave DataNodes) manages the filesystem, and the master JobTracker process (the master for the slave TaskTrackers) manages the tasks. The topology is seen as follows:

A Hadoop cluster can be considered to be mainly made up of two distinguishable parts:

HDFS: This is the underlying storage layer that acts as a filesystem for distributed data storage. You can put data of any format, schema, and type on it, such as structured, semi-structured, or unstructured data. This flexibility makes Hadoop a fit for the data lake, which is sometimes called the bit bucket or the landing zone.
MapReduce: This is the execution layer; in Hadoop 1.x, it is the only distributed data-processing framework.

Tip: Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.


MRv1 versus MRv2
MRv1 (MapReduce version 1) is part of Apache Hadoop 1.x and is an implementation of the MapReduce programming paradigm.

The MapReduce project itself can be broken into the following parts:

End-user MapReduce API: This is the API needed to develop MapReduce applications.
MapReduce framework: This is the runtime implementation of the various phases, such as the map phase, the sort/shuffle/merge aggregation phase, and the reduce phase.
MapReduce system: This is the backend infrastructure required to run MapReduce applications and includes things such as cluster resource management, scheduling of jobs, and so on.

Hadoop 1.x was written solely as an MR engine. Since it runs on a cluster, its cluster management component was also tightly coupled with the MR programming paradigm. The only thing that could be run on Hadoop 1.x was an MR job.

In MRv1, the cluster was managed by a single JobTracker and multiple TaskTrackers running on the DataNodes.

In Hadoop 2.x, the old MRv1 framework was rewritten to run on top of YARN. This application was named MRv2, or MapReduce version 2. It is the familiar MapReduce execution underneath, except that each job now runs on YARN.

The core difference between MRv1 and MRv2 is the way the MapReduce jobs are executed.

With Hadoop 1.x, it was the JobTracker and TaskTrackers, but now, with YARN on Hadoop 2.x, it's the ResourceManager, ApplicationMaster, and NodeManagers.

However, the underlying concept, the MapReduce framework, remains the same.

Hadoop 2 has been redefined from HDFS-plus-MapReduce to HDFS-plus-YARN.

Referring to the following figure, YARN took control of the resource management and application lifecycle part of Hadoop 1.x.

YARN, therefore, definitely results in an increased ROI for a Hadoop investment, in the sense that the same Hadoop 2.x cluster resources can now be used to do multiple things, such as batch processing, real-time processing, SQL applications, and so on.

Earlier, running this variety of applications was not possible, and people had to use a separate Hadoop cluster for MapReduce and a separate one to do something else.


Understanding where YARN fits into Hadoop
If we refer to Hadoop 1.x in the first figure of this chapter, it is clear that the responsibilities of the JobTracker mainly included the following:

- Managing the computational resources in terms of map and reduce slots
- Scheduling submitted jobs
- Monitoring the executions of the TaskTrackers
- Restarting failed tasks
- Performing speculative execution of tasks
- Calculating the job counters

Clearly, the JobTracker alone does a lot of tasks together and is overloaded with work.

This overloading of the JobTracker led to its redesign, and YARN tried to reduce the responsibilities of the JobTracker in the following ways:

- Cluster resource management and scheduling responsibilities were moved to the global ResourceManager (RM)
- Application lifecycle management, that is, job execution and monitoring, was moved into a per-application ApplicationMaster (AM)

The global ResourceManager is seen in the following image:

If you look at the preceding figure, you will clearly see the disappearance of the single centralized JobTracker; its place is taken by a global ResourceManager.

Also, for each job, a tiny, dedicated JobTracker is created, which monitors the tasks specific to its job. This tiny JobTracker is run on a slave node.


This tiny, dedicated JobTracker is termed an ApplicationMaster in the new framework (refer to the following figure).

Also, the TaskTrackers are referred to as NodeManagers in the new framework.

Finally, looking at the JobTracker redesign (in the following figure), we can clearly see that the JobTracker's responsibilities are broken into a per-cluster ResourceManager and a per-application ApplicationMaster:


The ResourceManager topology can be seen as follows:


Old and new MapReduce APIs
The new API (also known as Context Objects) was primarily designed to make the API easier to evolve in the future, and it is type-incompatible with the old one.

The new API came into the picture in the 1.x release series. However, it was only partially supported in this series, so the old API is recommended for the 1.x series:

Feature\Release          1.x       0.23
Old MapReduce API        Yes       Deprecated
New MapReduce API        Partial   Yes
MRv1 runtime (Classic)   Yes       No
MRv2 runtime (YARN)      No        Yes

The old and new APIs can be compared as follows:

Old API: The old API is in the org.apache.hadoop.mapred package and is still present.
New API: The new API is in the org.apache.hadoop.mapreduce package.

Old API: The old API used interfaces for Mapper and Reducer.
New API: The new API uses abstract classes for Mapper and Reducer.

Old API: The old API used the JobConf, OutputCollector, and Reporter objects to communicate with the MapReduce system.
New API: The new API uses the Context object to communicate with the MapReduce system.

Old API: In the old API, job control was done through the JobClient.
New API: In the new API, job control is performed through the Job class.

Old API: In the old API, job configuration was done with a JobConf object.
New API: In the new API, job configuration is done through the Configuration class, via some of the helper methods on Job.

Old API: In the old API, both the map and reduce outputs are named part-nnnnn.
New API: In the new API, the map outputs are named part-m-nnnnn and the reduce outputs are named part-r-nnnnn.

Old API: In the old API, the reduce() method passes values as a java.lang.Iterator.
New API: In the new API, the reduce() method passes values as a java.lang.Iterable.

Old API: The old API controls mappers by writing a MapRunnable, but no equivalent exists for reducers.
New API: The new API allows both mappers and reducers to control the execution flow by overriding the run() method.


Backward compatibility of MRv2 APIs
This section discusses the scope and level of backward compatibility supported in Apache Hadoop MapReduce 2.x (MRv2).


Binary compatibility of org.apache.hadoop.mapred APIs
Binary compatibility here means that the compiled binaries should be able to run without any modification on the new framework.

Hadoop 1.x users who use the org.apache.hadoop.mapred APIs can simply run their MapReduce jobs on YARN just by pointing them to their Apache Hadoop 2.x cluster via the configuration settings.

They will not need any recompilation. All they need to do is point their application to the YARN installation and point HADOOP_CONF_DIR to the corresponding configuration directory. The yarn-site.xml (configuration for YARN) and mapred-site.xml (configuration for MapReduce apps) files are present in the conf directory.

Also, mapred.job.tracker in mapred-site.xml is no longer necessary in Apache Hadoop 2.x. Instead, the following property needs to be added to the mapred-site.xml file to make MRv1 applications run on top of YARN:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>


Source compatibility of org.apache.hadoop.mapred APIs
Source incompatibility means that some code changes are required for compilation. Source incompatibility is orthogonal to binary compatibility.

Binaries for an application that is binary compatible but not source compatible will continue to run fine on the new framework. However, code changes are required to regenerate these binaries.

Apache Hadoop 2.x does not ensure complete binary compatibility for applications that use the org.apache.hadoop.mapreduce APIs, as these APIs have evolved a lot since MRv1. However, it ensures source compatibility for the org.apache.hadoop.mapreduce APIs that break binary compatibility. In other words, you should recompile applications that use the MapReduce APIs against the MRv2 JARs.

Existing applications that use the MapReduce APIs are source compatible and can run on YARN either with no changes, with recompilation, and/or with minor updates.

If an MRv1 MapReduce-based application fails to run on YARN, you should investigate its source code and check whether the MapReduce APIs are referred to. If they are, you have to recompile the application against the MRv2 JARs that are shipped with Hadoop 2.


Practical examples of MRv1 and MRv2
We will now present a MapReduce example using both the old and new MapReduce APIs.

We will write a MapReduce program in Java that finds all the anagrams (a word, phrase, or name formed by rearranging the letters of another, such as cinema, formed from iceman) present in an input file, and finally prints all the anagrams to the output file.
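The core trick in the program is the sorted-letters key: two words are anagrams exactly when their letters, once sorted, are equal. Here is a minimal standalone sketch of that idea (no Hadoop required; the class and method names are our own, not part of the book's code):

```java
import java.util.Arrays;

public class AnagramKeyDemo {

    // Returns the canonical key for a word: its letters sorted alphabetically.
    // All words that are anagrams of each other map to the same key, so a
    // MapReduce shuffle on this key delivers the anagrams of one group to a
    // single reduce() call.
    static String anagramKey(String word) {
        char[] chars = word.toLowerCase().toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(anagramKey("cinema")); // aceimn
        System.out.println(anagramKey("iceman")); // aceimn
        // The two keys match, so the reducer sees both words together.
        System.out.println(anagramKey("cinema").equals(anagramKey("iceman"))); // true
    }
}
```

The Mapper classes below emit exactly this sorted string as the output key, with the original word as the value.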

Here is the AnagramMapperOldAPI.java class that uses the old MapReduce API:

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * The Anagram mapper class gets a word as a line from the HDFS input, sorts the
 * letters in the word, and writes it back to the output collector as
 * Key: sorted word (letters in the word sorted)
 * Value: the word itself as the value.
 * When the reducer runs, we can group anagrams together based on the sorted key.
 */
public class AnagramMapperOldAPI extends MapReduceBase implements
    Mapper<Object, Text, Text, Text> {

  private Text sortedText = new Text();
  private Text originalText = new Text();

  @Override
  public void map(Object keyNotUsed, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String line = value.toString().trim().toLowerCase().replace(",", "");
    System.out.println("LINE:" + line);
    StringTokenizer st = new StringTokenizer(line);
    System.out.println("----Split by space------");
    while (st.hasMoreElements()) {
      String word = (String) st.nextElement();
      char[] wordChars = word.toCharArray();
      Arrays.sort(wordChars);
      String sortedWord = new String(wordChars);
      sortedText.set(sortedWord);
      originalText.set(word);
      System.out.println("\torig: " + word + "\tsorted: " + sortedWord);
      output.collect(sortedText, originalText);
    }
  }
}

Here is the AnagramReducerOldAPI.java class that uses the old MapReduce API:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AnagramReducerOldAPI extends MapReduceBase implements
    Reducer<Text, Text, Text, Text> {

  private Text outputKey = new Text();
  private Text outputValue = new Text();

  public void reduce(Text anagramKey, Iterator<Text> anagramValues,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String out = "";
    // Considering words with length > 2
    if (anagramKey.toString().length() > 2) {
      System.out.println("Reducer Key: " + anagramKey);
      while (anagramValues.hasNext()) {
        out = out + anagramValues.next() + "~";
      }
      StringTokenizer outputTokenizer = new StringTokenizer(out, "~");
      if (outputTokenizer.countTokens() >= 2) {
        out = out.replace("~", ",");
        outputKey.set(anagramKey.toString() + "-->");
        outputValue.set(out);
        System.out.println("************ Writing reducer output: "
            + anagramKey.toString() + "-->" + out);
        output.collect(outputKey, outputValue);
      }
    }
  }
}

Finally, to run the MapReduce program, we have the AnagramJobOldAPI.java class, written using the old MapReduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class AnagramJobOldAPI {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: Anagram <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(AnagramJobOldAPI.class);
    conf.setJobName("AnagramJobOldAPI");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(AnagramMapperOldAPI.class);
    conf.setReducerClass(AnagramReducerOldAPI.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    JobClient.runJob(conf);
  }
}

Next, we will write the same Mapper, Reducer, and Job classes using the new MapReduce API.

Here is the AnagramMapper.java class that uses the new MapReduce API:

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AnagramMapper extends Mapper<Object, Text, Text, Text> {

  private Text sortedText = new Text();
  private Text originalText = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().trim().toLowerCase().replace(",", "");
    System.out.println("LINE:" + line);
    StringTokenizer st = new StringTokenizer(line);
    System.out.println("----Split by space------");
    while (st.hasMoreElements()) {
      String word = (String) st.nextElement();
      char[] wordChars = word.toCharArray();
      Arrays.sort(wordChars);
      String sortedWord = new String(wordChars);
      sortedText.set(sortedWord);
      originalText.set(word);
      System.out.println("\torig: " + word + "\tsorted: " + sortedWord);
      context.write(sortedText, originalText);
    }
  }
}

Here is the AnagramReducer.java class that uses the new MapReduce API:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramReducer extends Reducer<Text, Text, Text, Text> {

  private Text outputKey = new Text();
  private Text outputValue = new Text();

  public void reduce(Text anagramKey, Iterable<Text> anagramValues,
      Context context) throws IOException, InterruptedException {
    String out = "";
    // Considering words with length > 2
    if (anagramKey.toString().length() > 2) {
      System.out.println("Reducer Key: " + anagramKey);
      for (Text anagram : anagramValues) {
        out = out + anagram.toString() + "~";
      }
      StringTokenizer outputTokenizer = new StringTokenizer(out, "~");
      if (outputTokenizer.countTokens() >= 2) {
        out = out.replace("~", ",");
        outputKey.set(anagramKey.toString() + "-->");
        outputValue.set(out);
        System.out.println("****** Writing reducer output: "
            + anagramKey.toString() + "-->" + out);
        context.write(outputKey, outputValue);
      }
    }
  }
}

Finally, here is the AnagramJob.java class that uses the new MapReduce API:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnagramJob {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: Anagram <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(AnagramJob.class);
    job.setJobName("AnagramJob");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(AnagramMapper.class);
    job.setReducerClass(AnagramReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


Preparing the input file(s)
1. Create a ${Inputfile_1} file with the following contents:

The Project Gutenberg Etext of Moby Word II by Grady Ward
hello there draw ehllo lemons melons solemn
Also, bluest bluets bustle sublet subtle

2. Create another file, ${Inputfile_2}, with the following contents:

Cinema is anagram to iceman
Second is stop, tops, opts, pots, and spot
Stool and tools
Secure and rescue

3. Copy these files into ${path_to_your_input_dir}.


Running the job
Run the AnagramJobOldAPI.java class and pass the following as command-line args:

${path_to_your_input_dir} ${path_to_your_output_dir_old}

Now, run the AnagramJob.java class and pass the following as command-line args:

${path_to_your_input_dir} ${path_to_your_output_dir_new}


Result
The final output is written to ${path_to_your_output_dir_old} and ${path_to_your_output_dir_new}.

These are the contents that we will see in the output files:

aceimn--> cinema,iceman,
adn--> and,and,and,
adrw--> ward,draw,
belstu--> subtle,bustle,bluets,bluest,sublet,
ceersu--> rescue,secure,
ehllo--> hello,ehllo,
elmnos--> lemons,melons,solemn,
loost--> stool,tools,
opst--> pots,tops,stop,spot,opts,


Summary
In this chapter, we started with a brief history of Hadoop releases. Next, we covered the basics of Hadoop 1.x and MRv1. We then looked at the core differences between MRv1 and MRv2 and how YARN fits into a Hadoop environment. We also saw how the JobTracker's responsibilities were broken down in Hadoop 2.x.

We also talked about the old and new MapReduce APIs, their origin, differences, and support in YARN. Finally, we concluded the chapter with some practical examples using the old and new MapReduce APIs.

In the next chapter, you will learn about the administration part of YARN.


Chapter 5. YARN Administration

In this chapter, we will focus on the administrative side of YARN and on the administrator's roles and responsibilities. We will also gain a more detailed insight into the administrative configuration settings and parameters, application container monitoring, optimized resource allocation, scheduling, and multitenant application support in YARN. We'll also cover the basic administration tools and configuration options of YARN.

The following topics will be covered in this chapter:

YARN container allocation and configurations
Scheduling policies
YARN multitenancy application support
YARN administration and tools



Container allocation

At a very fundamental level, a container is a group of physical resources such as memory, disk, network, CPU, and so on. There can be one or more containers on a single machine; for example, if a machine has 16 GB of RAM and 8 processor cores, then a single container could be 1 CPU core and 2 GB of RAM. This means that there are a total of 8 containers on that machine, or there could be a single large container occupying all the resources. So, a container is a physical notion of memory, CPU, network, disk, and so on in the cluster. The container's life cycle is managed by the NodeManager, and the scheduling is done by the ResourceManager. The container allocation can be seen as follows:
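The sizing arithmetic in the 16 GB / 8-core example above can be sketched in a few lines of shell. The per-container footprint (2 GB of RAM, 1 vcore) is the assumption from the example; a node can host only as many containers as both resources allow:

```shell
# Hypothetical node sizing, matching the example above: 16 GB RAM, 8 cores.
NODE_MEM_MB=16384
NODE_VCORES=8
# Assumed per-container footprint: 2 GB of RAM and 1 vcore.
CONTAINER_MEM_MB=2048
CONTAINER_VCORES=1
# The node can host only as many containers as BOTH resources allow.
MEM_LIMIT=$((NODE_MEM_MB / CONTAINER_MEM_MB))
CORE_LIMIT=$((NODE_VCORES / CONTAINER_VCORES))
if [ "$MEM_LIMIT" -lt "$CORE_LIMIT" ]; then
  echo "containers per node: $MEM_LIMIT"
else
  echo "containers per node: $CORE_LIMIT"
fi
```

Here both limits work out to 8, which matches the "8 containers on a single machine" figure in the text; with a different container size, the smaller of the two limits would win.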

YARN is designed to allocate resource containers to individual applications in a shared, secure, and multitenant manner. When any job or task is submitted to the YARN framework, the ResourceManager takes care of resource allocation to the application via the ApplicationMaster, depending on the scheduling configuration and the application's needs and requirements. To achieve this goal, the central scheduler maintains metadata about all the applications' resource requirements; this leads to efficient scheduling decisions for all the applications that run in the cluster.

Let’stakealookathowcontainerallocationhappensinatraditionalHadoopsetup.InthetraditionalHadoopapproach,oneachnodethereisapredefinedandfixednumberofmapslotsandapredefinedandfixednumberofreduceslots.Themapandreducefunctionsareunabletoshareslots,astheyarepredefinedforspecificoperationsonly.Thisstaticallocationisnotefficient;forexample,oneclusterhasafixedtotalof32mapslotsand32reduceslots.WhilerunningaMapReduceapplication,ittookonly16mapslotsandrequiredmorethan32slotsforreduceoperations.Thereduceroperationisunabletousethe16freemapperslots,astheyarepredefinedformapperfunctionalitiesonly,sothereducefunctionhastowaituntilsomereduceslotsbecomefree.

To overcome this problem, YARN uses generic containers instead of slots. Irrespective of the application, all containers are able to run all application types; for example, if YARN has 64 available containers in the cluster and is running the same MapReduce application, the mapper functions take only 16 containers, and if the reducers require more resources, all the other free resources in the cluster can be allocated to the reduce operations. This makes the operation more efficient and productive.

Essentially, an application demands the required resources from the ResourceManager via the ApplicationMaster. The ResourceManager then responds to the application's ResourceRequest by allocating the requested resources. A ResourceRequest contains the name of the resource that has been requested; the priority of the request relative to the other ResourceRequests of the same application; the resource capability requirements, such as RAM, disk, CPU, network, and so on; and the number of resources. Container allocation from the ResourceManager to the application means the successful fulfillment of a specific ResourceRequest.


Container allocation to the application

Now, take a look at the following sequence diagram:

The diagram shows how container allocation is done for applications via the ApplicationMaster. It can be explained as follows:

1. The client submits the application request to the ResourceManager.
2. The ResourceManager registers the application with the ApplicationManager, generates the ApplicationID, and responds to the client with the successfully registered ApplicationID.

3. Then, the ResourceManager starts the client's ApplicationMaster in an available container. If no container is available, this request has to wait until a suitable container is found. The ApplicationMaster then sends a registration request back to the ResourceManager.

4. The ResourceManager shares the minimum and maximum resource capabilities of the cluster with the ApplicationMaster. The ApplicationMaster then decides how to efficiently use the available resources to fulfill the application's needs.

5. Depending on the resource capabilities shared by the ResourceManager, the ApplicationMaster requests that the ResourceManager allocate a number of containers on behalf of the application.

6. The ResourceManager responds to the ApplicationMaster's ResourceRequest as per the scheduling policies and resource availability. Container allocation by the ResourceManager means the successful fulfillment of the ApplicationMaster's ResourceRequest.

While running the job, the ApplicationMaster sends heartbeat and job progress information for the application to the ResourceManager. During the application's runtime, the ApplicationMaster requests the release or allocation of more containers from the ResourceManager. When the job finishes, the ApplicationMaster sends a container de-allocation request to the ResourceManager and exits its own container.


Container configurations

Here are some important configurations used to control resource containers.

To control the memory allocation to containers, the administrator needs to set the following three parameters in the yarn-site.xml configuration file:

yarn.nodemanager.resource.memory-mb: This is the amount of memory in MB that the NodeManager can use for containers.

yarn.scheduler.minimum-allocation-mb: This is the smallest amount of memory in MB allocated to a container by the ResourceManager. The default value is 1024 MB.

yarn.scheduler.maximum-allocation-mb: This is the largest amount of memory in MB allocated to a container by the ResourceManager. The default value is 8192 MB.

The CPU core allocations to containers are controlled by setting the following properties in the yarn-site.xml configuration file:

yarn.scheduler.minimum-allocation-vcores: This is the minimum number of CPU cores allocated to a container.

yarn.scheduler.maximum-allocation-vcores: This is the maximum number of CPU cores allocated to a container.

yarn.nodemanager.resource.cpu-vcores: This is the number of CPU cores on the node that containers can request.
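As an illustration, the memory and vcore properties above might be combined in yarn-site.xml like this. The values are sized for the hypothetical 16 GB / 8-core node from earlier in the chapter, and the /tmp path is used only so the sketch is self-contained; a real cluster would edit $HADOOP_CONF_DIR/yarn-site.xml:

```shell
# Write an illustrative yarn-site.xml fragment to a scratch location.
# Values are assumptions for a 16 GB / 8-core worker node.
CONF=/tmp/yarn-site-example.xml
cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
</configuration>
EOF
# Sanity check: four properties were written.
grep -c '<property>' "$CONF"
```

Note that yarn.scheduler.maximum-allocation-mb caps any single container request; an application asking for more than 8192 MB here would be rejected by the scheduler.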


YARN scheduling policies

The YARN architecture has pluggable scheduling policies that depend on the application's requirements and the use case defined for the running application. You can find the YARN scheduling configuration in the yarn-site.xml file. Here, you can specify the scheduler as FIFO, capacity, or fair scheduling as per the application's needs. You can also find the scheduling information of running applications in the ResourceManager UI, where many components of the scheduling system are briefly described.

As already mentioned, there are three types of scheduling policies that the YARN scheduler follows:

FIFO scheduler
Capacity scheduler
Fair scheduler


The FIFO (First In First Out) scheduler

This is the scheduling policy carried over from Hadoop 1.0, where the JobTracker used FIFO scheduling by default. As the name indicates, FIFO means First In First Out; that is, the job submitted first executes first. The FIFO scheduler does not follow any application priorities; this policy might work efficiently for smaller jobs, but it works very inefficiently while executing larger jobs. So, for heavily loaded clusters, this policy is not recommended. The FIFO scheduler can be seen as follows:

Here is the configuration property for the FIFO scheduler. By specifying it in yarn-site.xml, you can enable the FIFO scheduling policy in your YARN cluster:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>


The capacity scheduler

The capacity scheduling policy is one of the most popular pluggable scheduler policies; it allows multiple applications or user groups to share the Hadoop cluster's resources in a secure way. Today, this scheduling policy runs efficiently on many of the largest Hadoop production clusters.

The capacity scheduling policy allows a user or user group to share cluster resources in such a way that each user or group of users is guaranteed a certain capacity of the cluster. To enable this policy, the cluster administrator configures one or more queues with precalculated shares of the total cluster resource capacity; this assignment guarantees a minimum resource capacity allocation to each queue. The administrator can also configure maximum and minimum constraints on the use of cluster resources (capacity) for each queue. Each queue has its own Access Control List (ACL) policies that manage which users have permission to submit applications to which queues. ACLs also manage the read and modify permissions at the queue level so that users cannot view or modify applications submitted by other users.

Capacity scheduler configurations

The capacity scheduler comes with Hadoop YARN by default. Sometimes, it is necessary to configure the policy in the YARN configuration files. Here is the configuration property that needs to be specified in yarn-site.xml to enable the capacity scheduler policy:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

The capacity scheduler, by default, comes with its own configuration file named $HADOOP_CONF_DIR/capacity-scheduler.xml; this should be present in the classpath so that the ResourceManager is able to locate it and load its properties accordingly.
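A minimal capacity-scheduler.xml might define two queues under root. The queue names ("prod", "dev") and the 70/30 split are made up for illustration, and the /tmp path keeps the sketch self-contained; sibling queue capacities must sum to 100:

```shell
# Write an illustrative capacity-scheduler.xml to a scratch location.
# Queue names and percentages are hypothetical.
CONF=/tmp/capacity-scheduler-example.xml
cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
EOF
# Sanity check: four properties were written.
grep -c '<property>' "$CONF"
```

The dev queue is guaranteed 30 percent of the cluster but, thanks to maximum-capacity, may elastically grow up to 50 percent when prod is idle.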


The fair scheduler

The fair scheduler is one of the most popular pluggable schedulers for large clusters. It enables memory-intensive applications to share cluster resources in a very efficient way. Fair scheduling is a policy that allocates resources to applications in such a way that, on average, all applications get an equal share of the cluster resources over a given period.

In a fair scheduling policy, if one application is running on the cluster, it might request all the cluster resources for its execution, if needed. If other applications are submitted, the policy distributes the free resources among the applications in such a way that each application gets a roughly equal share of cluster resources. The fair scheduler also supports preemption, where the ResourceManager might request resource containers back from the ApplicationMaster, depending on the job configuration. The preemption might be healthy or unhealthy.

In this scheduling model, every application is part of a queue, so resources are assigned to queues. By default, each user shares a queue called 'DefaultQueue'. The fair scheduler supports many features at the queue level, such as assigning weights to queues (a heavyweight queue gets a larger share of resources than lightweight queues), minimum and maximum shares that a queue can get, and a FIFO policy within the queue.

While submitting an application, users may specify the name of the queue the application wants to use resources from. For example, if the application requires a larger number of resources, it can specify the heavyweight queue so that it can get all the required resources available there.

The advantage of using the fair scheduling policy is that every queue gets a minimum share of the cluster resources. It is very important to note that when a queue contains applications that are waiting for resources, they get at least the minimum resource share. On the other hand, if the queue's resources are more than enough for its applications, the excess is distributed equally among the running applications.

Fair scheduler configurations

To enable the fair scheduling policy in your YARN cluster, you need to specify the following property in the yarn-site.xml file:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

The fair scheduler also has a specific configuration file for a more detailed configuration setup; you will find it at $HADOOP_CONF_DIR/fair-scheduler.xml.
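A fair scheduler allocation file might look like the sketch below. The queue names ("heavy", "light"), weights, and minimum resources are hypothetical, and the /tmp path keeps the sketch self-contained; a queue with weight 3 receives, on average, three times the share of a queue with weight 1:

```shell
# Write an illustrative fair-scheduler allocation file to a scratch location.
# Queue names, weights, and minResources are assumptions.
ALLOC=/tmp/fair-scheduler-example.xml
cat > "$ALLOC" <<'EOF'
<allocations>
  <queue name="heavy">
    <weight>3.0</weight>
    <minResources>4096 mb,4 vcores</minResources>
  </queue>
  <queue name="light">
    <weight>1.0</weight>
    <minResources>1024 mb,1 vcores</minResources>
  </queue>
</allocations>
EOF
# Sanity check: two queues were defined.
grep -c '<queue ' "$ALLOC"
```

The minResources elements implement the "minimum share" guarantee described above: even when the cluster is busy, each queue is entitled to at least that much.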


YARN multitenancy application support

YARN comes with built-in multitenancy support. Now, let's have a look at what multitenancy means. Consider a housing society with multiple apartments: different families live in different apartments with security and privacy, but they all share the society's common areas, such as the gate, the garden, the play area, and other amenities. Their apartments also share common walls. The same concept is followed in YARN: the applications running in the cluster share the cluster resources in a multitenant way. They share the cluster's processing capacity, storage capacity, data access security, and so on. Multitenancy is achieved in the cluster by separating applications into multiple business units, for example, different queues and users for different types of applications.

Security and privacy can be achieved by configuring Linux and HDFS permissions to separate files and directories, creating tenant boundaries. This can be achieved by integrating with LDAP or Active Directory. Security is used to enforce the tenant application boundaries, and this can be integrated with the Kerberos security model.

The following diagram explains how applications run in a YARN cluster in a multitenant way:

In the preceding YARN cluster, you can see that two jobs are running: one is a Storm job, and the other is a MapReduce job. They share the cluster scheduler, the cluster processing capacity, HDFS storage, and cluster security. The two applications are running on a single YARN cluster: the MapReduce and Storm jobs run over YARN and share the common cluster infrastructure, CPU, RAM, and so on. The Storm ApplicationMaster, Storm Supervisors, MapReduce ApplicationMaster, mappers, and reducers all run over the YARN cluster in a multitenant way by sharing cluster resources.


Administration of YARN

Now, we will take a look at some basic YARN administration configurations. From Hadoop 2.0 onwards, YARN introduced changes to the Hadoop configuration files. Hadoop and YARN have the following basic configuration files:

core-default.xml: This file contains properties related to the system.
hdfs-default.xml: This file contains HDFS-related configurations.
mapred-default.xml: This configuration file contains properties related to the YARN MapReduce framework.
yarn-default.xml: This file contains YARN-related properties.

You will find all these properties listed on the Apache website (http://hadoop.apache.org/docs/current/) in the configuration section, with detailed information on each property and its default and possible values.


Administrative tools

YARN has several administrative tools by default; you can find them using the rmadmin command. Here is a more detailed explanation of the ResourceManager admin command:

$ yarn rmadmin -help

The rmadmin command executes ResourceManager administrative commands. The full syntax is:

yarn rmadmin [-refreshQueues] [-refreshNodes]
  [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings]
  [-refreshAdminAcls] [-refreshServiceAcl] [-getGroups [username]] [-help [cmd]]

The preceding command contains the following fields:

-refreshQueues: Reloads the queues' ACLs, states, and scheduler-specific properties. The ResourceManager will reload the mapred-queues configuration file.
-refreshNodes: Refreshes the hosts information at the ResourceManager.
-refreshUserToGroupsMappings: Refreshes user-to-groups mappings.
-refreshSuperUserGroupsConfiguration: Refreshes superuser proxy groups mappings.
-refreshAdminAcls: Refreshes ACLs for the administration of the ResourceManager.
-refreshServiceAcl: Reloads the service-level authorization policy file. The ResourceManager will reload the authorization policy file.
-getGroups [username]: Gets the groups that the given user belongs to.
-help [cmd]: Displays help for the given command, or all commands if none is specified.

The generic options supported are as follows:

-conf <configuration file>: This will specify an application configuration file.
-D <property=value>: This will use the value for the given property.
-fs <local|namenode:port>: This will specify a NameNode.
-jt <local|jobtracker:port>: This will specify a JobTracker.
-files <comma separated list of files>: This will specify comma-separated files to be copied to the MapReduce cluster.
-libjars <comma separated list of jars>: This will specify comma-separated JAR files to include in the classpath.
-archives <comma separated list of archives>: This will specify comma-separated archives to be unarchived on the compute machines.

The general command-line syntax is:

bin/hadoop command [genericOptions] [commandOptions]


Adding and removing nodes from a YARN cluster

A YARN cluster is horizontally scalable; you can add worker nodes to or remove them from the cluster without stopping it. To add a new node, all the software and configurations must be set up on the new node.

The following property is used to add a new node to the cluster:

yarn.resourcemanager.nodes.include-path

For removing a node from the cluster, the following property is used:

yarn.resourcemanager.nodes.exclude-path

The preceding two properties take as their value the path to a local file that contains the list of nodes to be added to or removed from the cluster. This file contains either the hostnames or the IPs of the worker nodes, separated by a newline, tab, or space.

After adding or removing a node, the YARN cluster does not require a restart. It just needs to refresh the list of worker nodes so that the ResourceManager is informed about the newly added or removed nodes:

$ yarn rmadmin -refreshNodes
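The include and exclude files referenced by the two properties above are plain node lists. In this sketch, the hostnames and /tmp paths are made up for illustration; a real cluster would point the properties at files under $HADOOP_CONF_DIR and then run the refresh command shown above:

```shell
# Hypothetical include/exclude node lists for the two properties above.
INCLUDE=/tmp/yarn.include
EXCLUDE=/tmp/yarn.exclude
# Three worker nodes are allowed to join the cluster.
printf 'node1.example.com\nnode2.example.com\nnode3.example.com\n' > "$INCLUDE"
# Decommission node3 by listing it in the exclude file.
printf 'node3.example.com\n' > "$EXCLUDE"
# Show how many nodes are in the include list.
wc -l < "$INCLUDE"
# On a live cluster, the administrator would now run:
#   yarn rmadmin -refreshNodes
```

A node listed in both files is treated as excluded, so removing node3 later only requires deleting it from the exclude file and refreshing again.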


Administrating YARN jobs

The most important YARN admin task is administrating running YARN jobs. You can manage YARN jobs using the yarn application CLI command.

Using the yarn application command, the administrator can kill a job, list all jobs, and find out the status of a job. MapReduce jobs can be controlled by the mapred job command.

Here is the usage of the yarn application command:

usage: application
 -appTypes <Comma-separated list of application types>  Works with --list to filter applications based on their type.
 -help                    Displays help for all commands.
 -kill <Application ID>   Kills the application.
 -list                    Lists applications from the RM. Supports optional use of -appTypes to filter applications based on application type.
 -status <Application ID> Prints the status of the application.


MapReduce job configurations

As MapReduce jobs now run in YARN containers instead of traditional MapReduce slots, it is necessary to configure MapReduce properties in mapred-site.xml. Here are some properties of MapReduce jobs that can be configured to run MapReduce jobs in YARN containers:

mapred.child.java.opts: This property is used to set the Java heap size for the child JVMs of maps, for example, -Xmx4096m.

mapreduce.map.memory.mb: This property is used to configure the resource limit for map functions, for example, 1536 MB.

mapreduce.reduce.memory.mb: This property is used to configure the resource limit for reduce functions, for example, 3072 MB.

mapreduce.reduce.java.opts: This property is used to set the Java heap size for the child JVMs of reducers, for example, -Xmx4096m.


YARN log management

The log management CLI tool is very useful for YARN application log management. The administrator can use the logs CLI command described here:

$ yarn logs

Retrieve logs for completed YARN applications.

usage: yarn logs -applicationId <application ID> [OPTIONS]

general options are:

 -appOwner <Application Owner>  AppOwner (assumed to be current user if not specified)
 -containerId <Container ID>    ContainerId (must be specified if node address is specified)
 -nodeAddress <Node Address>    NodeAddress in the format nodename:port (must be specified if container ID is specified)

Let’stakeanexample.Ifyouwantedtoprintallthelogsofaspecificapplication,usethefollowingcommand:

$ yarn logs -applicationId <application ID>

This command will print all the logs related to the specified application ID on the console interface.


YARN web user interface

In the YARN web user interface (http://localhost:8088/cluster), you can find information on cluster nodes, the containers configured on each node, and applications and their statuses. The YARN web interface is as follows:

Under the Scheduler section, you can see the scheduling information of all the submitted, accepted, and running applications, along with the total cluster capacity, the used and maximum capacity, and the resources allocated to each application queue. In the following screenshot, you can see the resources allocated to the default queue:

Under the Tools section, you can find the YARN configuration file details, scheduling information, container configurations, local logs of the jobs, and a lot of other information on the cluster.


Summary

In this chapter, we covered YARN container allocations and configurations, scheduling policies and their configurations. We also covered multitenant application support in YARN and some basic YARN administrative tools and settings. In the next chapter, we will cover some useful practical examples of YARN and its ecosystem.


Chapter 6. Developing and Running a Simple YARN Application

In the previous chapters, we discussed the concepts of the YARN architecture, cluster setup, and administration. Now, in this chapter, we will focus more on MapReduce applications with YARN and its ecosystems, with some hands-on examples. You previously learned what happens when a client submits an application request to the YARN cluster: how YARN registers the application, allocates the required containers for its execution, and monitors the application while it's running. Now, we will see some practical use cases of YARN.

In this chapter, we will discuss:

Running sample applications on YARN
Developing YARN examples
Application monitoring and tracking

Now,let’sstartbyrunningsomeofthesampleapplicationsthatcomeasapartoftheYARNdistributionbundle.


Running sample examples on YARN

Running the available sample MapReduce programs is a simple task with YARN. The Hadoop version ships with some basic MapReduce examples. You can find them inside $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-<HADOOP_VERSION>.jar. The location of the file may differ depending on your Hadoop installation folder structure.

Let’sincludethisintheYARN_EXAMPLESpath:

$ export YARN_EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce

Now, we have all the sample examples in the YARN_EXAMPLES environment variable. You can access all the examples using this variable; to list all the available examples, try typing the following command on the console:

$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar

An example program must be given as the first argument.

The valid program names are as follows:

aggregatewordcount: This is an aggregate-based map/reduce program that counts the words in the input files
aggregatewordhist: This is an aggregate-based map/reduce program that computes the histogram of the words in the input files
bbp: This is a map/reduce program that uses Bailey-Borwein-Plouffe to compute the exact digits of Pi
dbcount: This is an example job that counts the pageview counts from a database
distbbp: This is a map/reduce program that uses a BBP-type formula to compute the exact bits of Pi
grep: This is a map/reduce program that counts the matches of a regex in the input
join: This is a job that effects a join over sorted, equally-partitioned datasets
multifilewc: This is a job that counts words from several files
pentomino: This is a map/reduce tile-laying program to find solutions to pentomino problems
pi: This is a map/reduce program that estimates Pi using a quasi-Monte Carlo method
randomtextwriter: This is a map/reduce program that writes 10 GB of random textual data per node
randomwriter: This is a map/reduce program that writes 10 GB of random data per node
secondarysort: This is an example that defines a secondary sort to the reduce
sort: This is a map/reduce program that sorts the data written by the random writer
sudoku: This is a sudoku solver
teragen: This generates data for the terasort
terasort: This runs the terasort
teravalidate: This checks the results of terasort


wordcount: This is a map/reduce program that counts the words in the input files
wordmean: This is a map/reduce program that counts the average length of the words in the input files
wordmedian: This is a map/reduce program that counts the median length of the words in the input files
wordstandarddeviation: This is a map/reduce program that counts the standard deviation of the length of the words in the input files

These are the sample examples that come as part of the YARN distribution by default. Now, let's try running some of the examples to showcase YARN's capabilities.


Running a sample Pi example

To run any application on top of YARN, you need to follow this command syntax:

$ yarn jar <application_jar.jar> <arg0> <arg1>

To run a sample example that calculates the value of Pi with 16 maps and 10,000 samples per map, use the following command:

$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar pi 16 10000

Note that we are using hadoop-mapreduce-examples-2.4.0.2.1.1.0-385.jar here. The JAR version may change depending on your installed Hadoop distribution.

Once you run the preceding command on the console, you will see the logs generated by the application, as shown in the following output. The default logger configuration is displayed on the console; the default mode is INFO, and you may change it by overwriting the default logger settings, updating hadoop.root.logger=WARN,console in conf/log4j.properties:

Number of Maps = 16
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
11/09/14 21:12:02 INFO mapreduce.Job:  map 0% reduce 0%
11/09/14 21:12:09 INFO mapreduce.Job:  map 25% reduce 0%
11/09/14 21:12:11 INFO mapreduce.Job:  map 56% reduce 0%
11/09/14 21:12:12 INFO mapreduce.Job:  map 100% reduce 0%
11/09/14 21:12:12 INFO mapreduce.Job:  map 100% reduce 100%
11/09/14 21:12:12 INFO mapreduce.Job: Job job_1381790835497_0003 completed successfully
11/09/14 21:12:19 INFO mapreduce.Job: Counters: 44

File System Counters
  FILE: Number of bytes read=358
  FILE: Number of bytes written=1365080
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=4214
  HDFS: Number of bytes written=215
  HDFS: Number of read operations=67
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=3

Job Counters
  Launched map tasks=16
  Launched reduce tasks=1
  Data-local map tasks=14
  Rack-local map tasks=2
  Total time spent by all maps in occupied slots (ms)=184421
  Total time spent by all reduces in occupied slots (ms)=8542
Map-Reduce Framework
  Map input records=16
  Map output records=32
  Map output bytes=288
  Map output materialized bytes=448
  Input split bytes=2326
  Combine input records=0
  Combine output records=0
  Reduce input groups=2
  Reduce shuffle bytes=448
  Reduce input records=32
  Reduce output records=0
  Spilled Records=64
  Shuffled Maps=16
  Failed Shuffles=0
  Merged Map outputs=16
  GC time elapsed (ms)=195
  CPU time spent (ms)=7740
  Physical memory (bytes) snapshot=6143396896
  Virtual memory (bytes) snapshot=23142254400
  Total committed heap usage (bytes)=43340769024
Shuffle Errors
  BAD_ID=0
  CONNECTION=0
  IO_ERROR=0
  WRONG_LENGTH=0
  WRONG_MAP=0
  WRONG_REDUCE=0
File Input Format Counters
  Bytes Read=1848
File Output Format Counters
  Bytes Written=98
Job Finished in 23.144 seconds
Estimated value of Pi is 3.14127500000000000000

You can compare the example run over Hadoop 1.x with the one run over YARN. You can hardly differentiate them by looking at the logs, but you can clearly identify the difference in performance. YARN has backward-compatibility support for MapReduce 1.x applications, without any code changes.

Monitoring YARN applications with the web GUI

Now, we will look at the YARN web GUI to monitor the examples. Using the ResourceManager UI, you can monitor the application submission ID, the user who submitted the application, the name of the application, the queue in which the application was submitted, the start time and finish time (in the case of finished applications), and the final status of the application. The ResourceManager web UI differs from the UI of the Hadoop 1.x versions. The following screenshot shows the information we can get from the YARN web UI (http://localhost:8088).

Currently, the following web UI shows information related to the Pi example we ran in the previous section while exploring the YARN web UI:

The following screenshot shows the Pi example running over the YARN framework; the Pi example was submitted by the root user into the default queue. An ApplicationMaster is assigned to it, which is currently in the running state. Similarly, you can also monitor the statuses of all the submitted, accepted and running, finished, and failed jobs from the ResourceManager web UI.


If you drill down further, you can see the ApplicationMaster-level information of the submitted application, such as the total containers allocated to the map and reduce functions and their running status. For example, we already submitted a Pi example with 16 mappers. So, in the following screenshot, you can see that the total number of containers allocated to the map function is 16, out of which 8 are completed and 8 are in the running state. You can also track the containers allocated to the reduce function and their progress from the UI:

You can see all the information displayed on the console while running the job. The same information will also be displayed on the web UI in a tabular form and in a more sophisticated way:

All the mapper and reducer jobs and filesystem counters will be displayed under the counter section of the YARN application web GUI. You can also explore the configurations of the application in the configurations section:


The following screenshot shows the statistics of the finished job, such as the total number of mappers and reducers, the start time, the finish time, and so on:

The following screenshot of the YARN web UI gives scheduling information about the YARN cluster, such as the cluster resource capacity and the containers allocated to the application or queue:


At the end, you will see the job summary page. You may also examine the logs by clicking on the logs link provided on the job summary page.

Once a user returns to the main cluster UI, chooses any finished applications, and then selects a job we recently ran, the user will be able to see the summary page, as shown in the following screenshot:

There are a few things to note as we moved through the windows described earlier. First, as YARN manages applications, all input from YARN refers to an application; YARN has no data on the actual application. Data from the MapReduce job is provided by the MapReduce framework. Therefore, there are two clearly different data streams that are combined in the web GUI: YARN applications and MapReduce framework jobs. If the framework does not provide job information, then certain parts of the web GUI will have nothing to display.

A very important fact about YARN jobs is the dynamic nature of the container allocations to the mapper and reducer tasks. These are executed as YARN containers, and their respective number also changes dynamically as per the application's needs and requirements. This feature provides much better cluster utilization due to the dynamic container ("slots" in traditional language) allocations.


YARN’sMapReducesupportMapReducewastheonlyusecaseonwhichthepreviousversionsofHadoopweredeveloped.WeknowthatMapReduceismainlyusedfortheefficientandeffectiveprocessingofbigdata.Itisusedtoprocessagraphandmillionsofitsnodesandedges.Goingforwardwithtechnology,tocaterfortherequirementsofdatalocationavailability,faulttolerantsystems,andapplicationpriorities,YARNbuiltsupportforeverythingfromasimpleshellscriptapplicationtoacomplexMapReduceapplication.

For data location availability, the MapReduce ApplicationMaster has to find out the data block locations and allocate containers to process these blocks accordingly. Fault tolerance means the ability to handle failed tasks and act on them accordingly, such as handling failed map and reduce tasks and rerunning them with other containers if needed. Priorities are assigned to each application in the queue; the logic to handle complex intra-application priorities for map and reduce tasks has to be built into the ApplicationMaster. There is no need to start idle reducers before the mappers finish enough data processing. Reducers are now under the control of the YARN ApplicationMaster and are not fixed, as they had been in Hadoop version 1.


The MapReduce ApplicationMaster

The MapReduce ApplicationMaster service is made up of multiple loosely coupled services; these services interact with each other via events. Every service gets triggered on an event and produces an output as the event triggers another service; this happens highly concurrently and without synchronization. All service components are registered with the central dispatcher service, and service information is shared between the multiple components via the ApplicationContext (AppContext).

In Hadoop version 1, all the running and submitted jobs are purely dependent on the JobTracker, so the failure of the JobTracker results in a loss of all the running and submitted jobs. However, with YARN, the ApplicationMaster is equivalent to the JobTracker. The ApplicationMaster runs and allocates nodes to an application. It may fail, but YARN has the capability to restart the ApplicationMaster a specified number of times, and the capability to recover completed tasks. Much like the JobTracker, the ApplicationMaster keeps the metrics of the jobs currently running. The following settings in the configuration files enable MapReduce recovery in YARN.

To enable the restart of the ApplicationMaster, execute the following steps:

1. Inside yarn-site.xml, you can tune the yarn.resourcemanager.am.max-retries property. The default is 2.

2. Inside mapred-site.xml, you can directly tune how many times a MapReduce ApplicationMaster should restart with the mapreduce.am.max-attempts property. The default is 2.

3. To enable the recovery of completed tasks, look inside the mapred-site.xml file. The yarn.app.mapreduce.am.job.recovery.enable property enables the recovery of tasks. By default, it is true.
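Collected into one place, the three settings above look like this in their respective configuration files (a sketch only; the property names and values are exactly the ones listed in the steps, and the values shown are the stated defaults):

```xml
<!-- yarn-site.xml: how many times the ResourceManager restarts a failed AM -->
<property>
  <name>yarn.resourcemanager.am.max-retries</name>
  <value>2</value>
</property>

<!-- mapred-site.xml: restart attempts for a MapReduce ApplicationMaster -->
<property>
  <name>mapreduce.am.max-attempts</name>
  <value>2</value>
</property>

<!-- mapred-site.xml: recover completed tasks after an AM restart -->
<property>
  <name>yarn.app.mapreduce.am.job.recovery.enable</name>
  <value>true</value>
</property>
```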


Example YARN MapReduce settings

YARN has replaced the fixed slot architecture for mappers and reducers with flexible dynamic container allocation. There are some important parameters to run MapReduce efficiently, and they can be found in mapred-site.xml and yarn-site.xml. As an example, the following are some settings that have been used to run a MapReduce application on YARN:

Property                                Property file      Value
mapreduce.map.memory.mb                 mapred-site.xml    1536
mapreduce.reduce.memory.mb              mapred-site.xml    2560
mapreduce.map.java.opts                 mapred-site.xml    -Xmx1024m
mapreduce.reduce.java.opts              mapred-site.xml    -Xmx2048m
yarn.scheduler.minimum-allocation-mb    yarn-site.xml      512
yarn.scheduler.maximum-allocation-mb    yarn-site.xml      4096
yarn.nodemanager.resource.memory-mb     yarn-site.xml      36864
yarn.nodemanager.vmem-pmem-ratio        yarn-site.xml      2.1

The YARN configuration allows a container size of between 512 MB and 4 GB. If nodes have 36 GB of RAM with a virtual memory ratio of 2.1, each map can have a maximum of 3225.6 MB (1536 MB x 2.1) and each reducer can have 5376 MB (2560 MB x 2.1) of virtual memory. So, a compute node configured for 36 GB of container space can support up to 24 maps and 14 reducers, or any combination of mappers and reducers allowed by the available resources on the node.
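These limits follow directly from the table above, and the arithmetic can be checked with a short standalone snippet (plain Java with no Hadoop dependency; the class and method names here are ours for illustration, not part of any YARN API):

```java
// Standalone arithmetic check for the container-allocation numbers above.
// The inputs come straight from the example settings table.
public class ContainerMath {

    // How many containers of a given size fit into the NodeManager's memory pool
    public static int maxTasks(int nodeMemoryMb, int containerMemoryMb) {
        return nodeMemoryMb / containerMemoryMb;
    }

    // Virtual memory ceiling for a container, per yarn.nodemanager.vmem-pmem-ratio
    public static double virtualMemoryMb(int physicalMb, double vmemPmemRatio) {
        return physicalMb * vmemPmemRatio;
    }

    public static void main(String[] args) {
        int nodeMb = 36864;                              // yarn.nodemanager.resource.memory-mb
        System.out.println(maxTasks(nodeMb, 1536));      // maps only: 24
        System.out.println(maxTasks(nodeMb, 2560));      // reducers only: 14
        System.out.println(virtualMemoryMb(1536, 2.1));  // per-map vmem, about 3225.6 MB
        System.out.println(virtualMemoryMb(2560, 2.1));  // per-reduce vmem, about 5376 MB
    }
}
```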


YARN’scompatibilitywithMapReduceapplicationsForasmoothtransitionfromHadoopv1toYARN,applicationbackwardcompatibilityhasbeenthemajorgoaloftheYARNimplementationteamtoensurethatexistingMapReduceapplicationsthatwereprogrammedusingHadoopv1(MRv1)APIsandcompliedagainstthemcancontinuetorunoverYARN,withlittleenhancement.

YARN ensures full binary compatibility with the Hadoop v1 (MRv1) APIs; applications that use the org.apache.hadoop.mapred APIs are fully compatible with the YARN framework, without recompilation. You can use your MapReduce JAR file and bin/hadoop to submit them directly to YARN.

YARN introduced new API changes for MapReduce applications on top of the YARN framework in org.apache.hadoop.mapreduce.

If an application is developed against org.apache.hadoop.mapreduce and compiled with the Hadoop v1 (MRv1) APIs, then unfortunately YARN doesn't provide compatibility with it, as the org.apache.hadoop.mapreduce APIs have gone through a YARN transition; such applications should be recompiled against Hadoop v2 (MRv2) to run over YARN.


Developing YARN applications

To develop a YARN application, you need to keep the YARN architecture in mind. YARN is a platform that allows distributed applications to take full advantage of the resources that YARN has deployed. Currently, resources can be things such as CPU, memory, and data. Many developers who come from a server-side application-development background or from a MapReduce developer background may be accustomed to a certain flow in the development and deployment cycle.

In this section, we'll describe the development lifecycle of YARN applications. We'll also focus on the key areas of YARN application development in detail, such as how YARN applications can launch containers and how resource allocation is done for applications.

The general workflow of YARN application submission is that the YARN client communicates with the ResourceManager through the ApplicationClientProtocol to generate a new ApplicationID. It then submits the application to the ResourceManager to run, via the ApplicationClientProtocol. As a part of the protocol, the YARN client has to provide all the required information to the ResourceManager to launch the application's first container, that is, the ApplicationMaster. The YARN client also needs to provide the details of the dependency JARs/files for the application via command-line arguments. You can also specify the dependency JARs/files in the environment variables.

The following are some interface protocols that the YARN framework uses for intercomponent communication:

ApplicationClientProtocol: This protocol is used for communication between the YARN client and the ResourceManager to launch a new application, check its status, or kill the application.

ApplicationMasterProtocol: This protocol is used by the YARN framework for communication between the ApplicationMaster and the ResourceManager. It is used by the ApplicationMaster to register/unregister itself to/from the ResourceManager, and also for resource allocation/deallocation requests to the ResourceManager.

ContainerManagerProtocol: This protocol is used for communication between the ApplicationMaster and the NodeManager to start and stop containers and to get their status updates.


The YARN application workflow

Now, take a look at the following sequence diagram, which describes the YARN application workflow and also explains how container allocation is done for an application via the ApplicationMaster:

Refer to the preceding diagram for the following details:

The client submits the application request to the ResourceManager.

The ResourceManager registers the application with the ApplicationManager, generates the ApplicationID, and responds to the client with the successfully registered ApplicationID.

Then, the ResourceManager starts the client's ApplicationMaster in a separate available container. If no container is available, this request has to wait until a suitable container is found. The ApplicationMaster then sends the application registration request for application registration.

The ResourceManager shares all the minimum and maximum resource capabilities of the cluster with the ApplicationMaster. The ApplicationMaster then decides how to efficiently use the available resources to fulfill the application's needs.

Depending on the resource capabilities shared by the ResourceManager, the ApplicationMaster requests the ResourceManager to allocate a number of containers on behalf of the application.

The ResourceManager responds to the ResourceRequest from the ApplicationMaster as per the scheduling policies and resource availability. Container allocation by the ResourceManager means successful fulfillment of the ResourceRequest from the ApplicationMaster.

While running the job, the ApplicationMaster sends heartbeats and job progress information of the application to the ResourceManager. During the running time of the application, the ApplicationMaster requests the ResourceManager to release or allocate more containers. When the job finishes, the ApplicationMaster sends a container deallocation request to the ResourceManager, thus exiting itself from the running container.


Writing the YARN client

The YARN client is required to submit the job to the YARN framework. It is a plain Java class, simply having a main function as its entry point. The main function of the YARN client is to submit the application to the YARN environment by instantiating an org.apache.hadoop.yarn.conf.YarnConfiguration object. The YarnConfiguration object depends on finding the yarn-default.xml and yarn-site.xml files in its classpath. All these requirements need to be satisfied to run the YARN client application. The YARN client process is shown in the following image:

Once a YarnConfiguration object is instantiated in your YARN client, we have to create an object of org.apache.hadoop.yarn.client.api.YarnClient using the YarnConfiguration object that has already been instantiated. The newly instantiated YarnClient object will be used to submit applications to the YARN framework using the following steps:

1. Create an instance of a YarnClient object using YarnConfiguration.
2. Initialize the YarnClient with the YarnConfiguration object.
3. Start the YarnClient.
4. Get the YARN cluster, node, and queue information.
5. Get the AccessControlList information for the user running the client.
6. Create the client application.
7. Submit the application to the YARN ResourceManager.
8. Get application reports after submitting the application.

Also, the YarnClient will create a context for application submission and for the ApplicationMaster's container launch. The runnable YarnClient will take the command-line arguments from the user who wants to run the job. We'll see a simple code snippet for the YARN application client to get a better idea about it.

The first step of the YARN client is to connect to the ResourceManager. The following is the code snippet for it:

// Declare the ApplicationClientProtocol
ApplicationClientProtocol applicationsManager;
// Instantiate YarnConfiguration
YarnConfiguration yarnConf = new YarnConfiguration(conf);
// Get the ResourceManager IP address; if not provided, use the default
InetSocketAddress rmAddress =
    NetUtils.createSocketAddr(yarnConf.get(
        YarnConfiguration.RM_ADDRESS,
        YarnConfiguration.DEFAULT_RM_ADDRESS));
LOGGER.info("Connecting to ResourceManager at " + rmAddress);
Configuration appsManagerServerConf = new Configuration(conf);
appsManagerServerConf.setClass(
    YarnConfiguration.YARN_SECURITY_INFO,
    ClientRMSecurityInfo.class, SecurityInfo.class);
// Initialize the ApplicationManager handle
applicationsManager = ((ApplicationClientProtocol) rpc.getProxy(
    ApplicationClientProtocol.class, rmAddress,
    appsManagerServerConf));

Once the connection between the YARN client and the ResourceManager is established, the YARN client needs to request an ApplicationID from the ResourceManager:

GetNewApplicationRequest newRequest =
    Records.newRecord(GetNewApplicationRequest.class);
GetNewApplicationResponse newResponse =
    applicationsManager.getNewApplication(newRequest);

The response from the ApplicationManager contains the newly generated ApplicationID for the application submitted by the YARN client. You can also get information related to the minimum and maximum resource capabilities of the cluster (using the GetNewApplicationResponse API). Using this information, developers can set the required resources for the ApplicationMaster container to launch with.

The YARN client needs to set up the following information for the ApplicationSubmissionContext initialization; this includes all the information required by the ResourceManager to launch the ApplicationMaster, as mentioned here:

Application information, such as the ApplicationID generated in the previous step

The name of the application

Queue and priority information, such as the queue in which the application needs to be submitted and the priorities assigned to the application

User information, that is, by whom the application is to be submitted

ContainerLaunchContext, that is, the information needed by the ApplicationMaster to launch local resources (such as JARs, binaries, and files)

It also contains the security-related information (security tokens) and environment variables (classpath settings), along with the command to be executed via the ApplicationMaster:

// Create a new submission context for the ApplicationMaster
ApplicationSubmissionContext appContext =
    Records.newRecord(ApplicationSubmissionContext.class);
// Set the ApplicationId
appContext.setApplicationId(appId);
// Set the application name
appContext.setApplicationName(appName);

// Create a new container launch context for the ApplicationMaster
ContainerLaunchContext amContainer =
    Records.newRecord(ContainerLaunchContext.class);

// Set the local resources required for the ApplicationMaster:
// local files or archives as needed (for example, JAR files)
Map<String, LocalResource> localResources =
    new HashMap<String, LocalResource>();

// Copy the ApplicationMaster JAR to the filesystem and create a
// local resource pointing at the destination JAR path
FileSystem fs = FileSystem.get(conf);
Path src = new Path("AppMaster.jar");
String pathSuffix = appName + "/" + appId.getId() + "/AppMaster.jar";
Path dst = new Path(fs.getHomeDirectory(), pathSuffix);
// Copy the file from the source to the destination on HDFS
fs.copyFromLocalFile(false, true, src, dst);
// Get the HDFS file status from the path it was copied to
FileStatus jarStatus = fs.getFileStatus(dst);
LocalResource amJarResource = Records.newRecord(LocalResource.class);
// Set the type of resource - file or archive
// Archives are untarred at the destination by the framework
amJarResource.setType(LocalResourceType.FILE);
// Set the visibility of the resource
// Setting to the most private option
amJarResource.setVisibility(LocalResourceVisibility.APPLICATION);
// Set the location the resource was copied to
amJarResource.setResource(ConverterUtils.getYarnUrlFromPath(dst));
// Set the timestamp and length of the file so that the framework
// can do basic sanity checks for the local resource after it has
// been copied over, to ensure it is the same resource the client
// intended to use with the application
amJarResource.setTimestamp(jarStatus.getModificationTime());
amJarResource.setSize(jarStatus.getLen());
localResources.put("AppMaster.jar", amJarResource);
// Set the local resources into the launch context
amContainer.setLocalResources(localResources);

// Set the security tokens as needed
// amContainer.setContainerTokens(containerToken);

// Set up the environment needed for the launch context in which
// the ApplicationMaster is to be run
Map<String, String> env = new HashMap<String, String>();
// For example, we could set up the classpath needed.
// In the case of the shell script example, put the required resources
env.put(DSConstants.SCLOCATION, HdfsSCLocation);
env.put(DSConstants.SCTIMESTAMP, Long.toString(HdfsSCTimeStamp));
env.put(DSConstants.SCLENGTH, Long.toString(HdfsSCLength));

// Add the AppMaster.jar location to the classpath.
// By default, all the Hadoop-specific classpaths will already be
// available in $CLASSPATH, so we should be careful not to overwrite it.
StringBuilder classPathEnv = new StringBuilder("$CLASSPATH:./*:");
for (String str :
    conf.get(YarnConfiguration.YARN_APPLICATION_CLASSPATH).split(",")) {
  classPathEnv.append(':');
  classPathEnv.append(str.trim());
}
// Add log4j properties into the env variable if required
classPathEnv.append(":./log4j.properties");
env.put("CLASSPATH", classPathEnv.toString());
// Set the environment variables into the container
amContainer.setEnvironment(env);

// Set the necessary command to execute the ApplicationMaster
Vector<CharSequence> vargs = new Vector<CharSequence>(30);
// Set the java executable command
vargs.add("${JAVA_HOME}" + "/bin/java");
// Set the memory (-Xmx) based on the AM memory requirements
vargs.add("-Xmx" + amMemory + "m");
// Set the class name
vargs.add(amMasterMainClass);
// Set the parameters for the ApplicationMaster
vargs.add("--container_memory " + String.valueOf(containerMemory));
vargs.add("--num_containers " + String.valueOf(numContainers));
vargs.add("--priority " + String.valueOf(shellCmdPriority));
if (!shellCommand.isEmpty()) {
  vargs.add("--shell_command " + shellCommand);
}
if (!shellArgs.isEmpty()) {
  vargs.add("--shell_args " + shellArgs);
}
for (Map.Entry<String, String> entry : shellEnv.entrySet()) {
  vargs.add("--shell_env " + entry.getKey() + "=" + entry.getValue());
}
if (debugFlag) {
  vargs.add("--debug");
}
vargs.add("1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR +
    "/AppMaster.stdout");
vargs.add("2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR +
    "/AppMaster.stderr");

// Build the final command
StringBuilder command = new StringBuilder();
for (CharSequence str : vargs) {
  command.append(str).append(" ");
}
List<String> commands = new ArrayList<String>();
commands.add(command.toString());
// Set the command array into the container spec
amContainer.setCommands(commands);

// For launching an AM container, setting the user here is not needed
// amContainer.setUser(amUser);

Resource capability = Records.newRecord(Resource.class);
// For now, only memory is supported, so we set the memory
capability.setMemory(amMemory);
amContainer.setResource(capability);

// Set the container launch context into the ApplicationSubmissionContext
appContext.setAMContainerSpec(amContainer);

Now the setup process is complete, and our YARN client is ready to submit the application to the ApplicationManager:

// Create the application request to send to the ApplicationsManager
SubmitApplicationRequest appRequest =
    Records.newRecord(SubmitApplicationRequest.class);
appRequest.setApplicationSubmissionContext(appContext);
// Submit the application to the ApplicationsManager
// Ignore the response, as either a valid response object is returned
// on success or an exception is thrown to denote the failure
applicationsManager.submitApplication(appRequest);

During this process, the ResourceManager will accept all application submission requests and allocate a container for the ApplicationMaster to run in. The progress of the task submitted by the client can be tracked by communicating with the ResourceManager and requesting an application status report via the ApplicationClientProtocol:

GetApplicationReportRequest reportRequest =
    Records.newRecord(GetApplicationReportRequest.class);
reportRequest.setApplicationId(appId);
GetApplicationReportResponse reportResponse =
    applicationsManager.getApplicationReport(reportRequest);
ApplicationReport report = reportResponse.getApplicationReport();

The response to the report request received from the ResourceManager contains general application information, such as the ApplicationID, information on the queue in which the application is running, and information on the user who submitted the application. It also contains the ApplicationMaster details, the host on which the ApplicationMaster is running, and application-tracking information to monitor the progress of the application.


The application report also contains the application status information, such as SUBMITTED, RUNNING, FINISHED, and so on.

Also, the client can directly query the ApplicationMaster to get report information via the host:rpc_port obtained from the ApplicationReport.

Sometimes, the application may be wrongly submitted to another queue or may take longer than usual. In such cases, the client may want to kill the application. The ApplicationClientProtocol supports a forceful kill operation that can send a kill signal to the ApplicationMaster via the ResourceManager:

KillApplicationRequest killRequest =
    Records.newRecord(KillApplicationRequest.class);
killRequest.setApplicationId(appId);
applicationsManager.forceKillApplication(killRequest);


Writing the YARN ApplicationMaster

This task is the heart of the whole process. The ApplicationMaster is launched by the ResourceManager, and all the necessary information is provided by the client. As the ApplicationMaster is launched in the first container allocated by the ResourceManager, several parameters are made available by the ResourceManager via the environment. These parameters include the container ID for the ApplicationMaster container, the application submission time, and details about the NodeManager and the host on which the ApplicationMaster is running. Interactions between the ApplicationMaster and the ResourceManager require the ApplicationAttemptID, which is obtained from the ApplicationMaster's ContainerID:

Map<String, String> envs = System.getenv();
String containerIdString =
    envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV);
if (containerIdString == null) {
  throw new IllegalArgumentException(
      "ContainerId not set in the environment");
}
ContainerId containerId =
    ConverterUtils.toContainerId(containerIdString);
ApplicationAttemptId appAttemptID =
    containerId.getApplicationAttemptId();

After the successful initialization of the ApplicationMaster, it needs to be registered with the ResourceManager via the ApplicationMasterProtocol. The ApplicationMaster and the ResourceManager communicate via the Scheduler interface:

// Connect to the ResourceManager and return a handle to the RM
YarnConfiguration yarnConf = new YarnConfiguration(conf);
InetSocketAddress rmAddress =
    NetUtils.createSocketAddr(yarnConf.get(
        YarnConfiguration.RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS));
LOG.info("Connecting to ResourceManager at " + rmAddress);
ApplicationMasterProtocol resourceManager =
    (ApplicationMasterProtocol)
        rpc.getProxy(ApplicationMasterProtocol.class, rmAddress, conf);

// Register the ApplicationMaster with the ResourceManager
// Set the required info into the registration request:
//   the ApplicationAttemptId,
//   the host on which the app master is running,
//   the RPC port on which the app master accepts requests from the client,
//   the tracking URL for the client to track app master progress
RegisterApplicationMasterRequest appMasterRequest =
    Records.newRecord(RegisterApplicationMasterRequest.class);
appMasterRequest.setApplicationAttemptId(appAttemptID);
appMasterRequest.setHost(appMasterHostname);
appMasterRequest.setRpcPort(appMasterRpcPort);
appMasterRequest.setTrackingUrl(appMasterTrackingUrl);
RegisterApplicationMasterResponse response =
    resourceManager.registerApplicationMaster(appMasterRequest);

The ApplicationMaster sends its status to the ResourceManager via heartbeat signals, and the timeout expiry intervals at the ResourceManager are defined by configuration settings in the YarnConfiguration. The ApplicationMasterProtocol communicates with the ResourceManager to send heartbeats and application progress information.

Depending on the application requirements, the ApplicationMaster can request from the ResourceManager the number of container resources to be allocated. For this request, the ApplicationMaster will use the ResourceRequest API to define container specifications. The ResourceRequest will contain the hostname, if the containers need to be hosted on specific hosts, or the * wildcard character, which implies that any host can fulfill the resource capabilities, such as the memory to be allocated to the container. It will also contain priorities, so that containers allocated to specific tasks can be set at a higher priority. For example, in MapReduce, higher-priority containers are allocated to the map tasks and lower-priority containers are allocated to the reduce tasks:

// ResourceRequest
ResourceRequest request = Records.newRecord(ResourceRequest.class);
// Set up requirements for hosts:
// whether a particular rack/host is expected.
// Refer to the APIs under org.apache.hadoop.net for more details on
// using * when any host will do
request.setHostName("*");
// Set the number of containers
request.setNumContainers(numContainers);
// Set the priority for the request
Priority pri = Records.newRecord(Priority.class);
pri.setPriority(requestPriority);
request.setPriority(pri);
// Set up the resource type requirements
// For now, only memory is supported, so we set memory requirements
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(containerMemory);
request.setCapability(capability);

After defining the container requests, the ApplicationMaster has to build an allocation request for the ResourceManager. The AllocateRequest consists of the requested containers, the containers to be released, the ResponseID (the ID of the response that would be sent back from the allocate call), and progress update information:

List<ResourceRequest> requestedContainers;
List<ContainerId> releasedContainers;
AllocateRequest req = Records.newRecord(AllocateRequest.class);
// The response ID set in the request will be sent back in
// the response so that the ApplicationMaster can
// match it to its original ask and act appropriately.
req.setResponseId(rmRequestID);
// Set the ApplicationAttemptId
req.setApplicationAttemptId(appAttemptID);
// Add the list of containers being asked for by the AM
req.addAllAsks(requestedContainers);
// The ApplicationMaster can request that the ResourceManager
// deallocate a container if it is no longer required.
req.addAllReleases(releasedContainers);
// The ApplicationMaster can track its progress by setting progress
req.setProgress(currentProgress);
AllocateResponse allocateResponse = resourceManager.allocate(req);

The response to the container allocation request from the ApplicationMaster to the ResourceManager contains information on the containers allocated to the ApplicationMaster, the number of hosts available in the cluster, and many more such details.

Containers are not immediately assigned to the ApplicationMaster by the ResourceManager. However, when the container request is sent to the ResourceManager, the ApplicationMaster will eventually get the containers, based on cluster capacity, priorities, and the cluster-scheduling policy:

// Retrieve the list of allocated containers from the response
List<Container> allocatedContainers =
    allocateResponse.getAllocatedContainers();
for (Container allocatedContainer : allocatedContainers) {
  LOG.info("Launching shell command on a new container."
      + ", containerId=" + allocatedContainer.getId()
      + ", containerNode=" + allocatedContainer.getNodeId().getHost()
      + ":" + allocatedContainer.getNodeId().getPort()
      + ", containerNodeURI=" + allocatedContainer.getNodeHttpAddress()
      + ", containerState=" + allocatedContainer.getState()
      + ", containerResourceMemory="
      + allocatedContainer.getResource().getMemory());
  LaunchContainerRunnable runnableLaunchContainer =
      new LaunchContainerRunnable(allocatedContainer);
  Thread launchThread = new Thread(runnableLaunchContainer);
  launchThreads.add(launchThread);
  launchThread.start();
}

// Check the currently available resources in the cluster
Resource availableResources = allocateResponse.getAvailableResources();
LOG.info("Currently available resources in the cluster: "
    + availableResources);
// Based on this information, an ApplicationMaster can make
// appropriate decisions

// Check the completed containers


List<ContainerStatus> completedContainers =
    allocateResponse.getCompletedContainersStatuses();
for (ContainerStatus containerStatus : completedContainers) {
  LOG.info("Got container status for containerID="
      + containerStatus.getContainerId()
      + ", state=" + containerStatus.getState()
      + ", exitStatus=" + containerStatus.getExitStatus()
      + ", diagnostics=" + containerStatus.getDiagnostics());
  int exitStatus = containerStatus.getExitStatus();
  if (0 != exitStatus) {
    // container failed
    if (-100 != exitStatus) {
      // the application job on the container returned a non-zero
      // exit code; it counts as completed
      numCompletedContainers.incrementAndGet();
      numFailedContainers.incrementAndGet();
    } else {
      // something else bad happened:
      // the app job did not complete for some reason,
      // so we should retry, as the container was lost
      numRequestedContainers.decrementAndGet();
      // we do not need to release the container, as that has
      // already been done by the ResourceManager/NodeManager
    }
  } else {
    // nothing to do; the container completed successfully
    numCompletedContainers.incrementAndGet();
    LOG.info("Container completed successfully."
        + ", containerId=" + containerStatus.getContainerId());
  }
}
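The bookkeeping in the preceding loop boils down to a three-way decision on the exit status. The following self-contained sketch condenses that decision into a plain-Java method so it can be reasoned about in isolation; the class name and the string labels are our own, and -100 is taken from the snippet above as the "container lost" code:

```java
public class ExitStatusBookkeeping {
    // Mirrors the logic in the snippet: 0 means the container completed
    // successfully; -100 (the "container lost" code used in the snippet)
    // means the request should be retried; any other non-zero status counts
    // as a completed-but-failed container.
    public static String classify(int exitStatus) {
        if (exitStatus == 0) {
            return "completed";
        }
        if (exitStatus == -100) {
            return "retry";
        }
        return "failed";
    }

    public static void main(String[] args) {
        for (int status : new int[]{0, 1, -100}) {
            System.out.println(status + " -> " + classify(status));
        }
    }
}
```

Keeping this decision in one place makes it easy to see why the counters in the snippet are updated the way they are: only "completed" and "failed" advance the completion counters, while "retry" rolls back the requested-container count.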

After container allocation is successfully performed for the ApplicationMaster, it has to set up the ContainerLaunchContext for the tasks that it will run. Once the ContainerLaunchContext is set, the ApplicationMaster can request the ContainerManager to start the allocated container:

// Assuming an allocatedContainer was obtained from AllocateResponse
// and container initialization has already been done
Container container;

LOG.debug("Connecting to ContainerManager for containerid="
    + container.getId());

// Connect to the ContainerManager on the allocated container
String cmIpPortStr = container.getNodeId().getHost() + ":"
    + container.getNodeId().getPort();
InetSocketAddress cmAddress = NetUtils.createSocketAddr(cmIpPortStr);
LOG.info("Connecting to ContainerManager at " + cmIpPortStr);
ContainerManager cm = ((ContainerManager)
    rpc.getProxy(ContainerManager.class, cmAddress, conf));


// Now we set up a ContainerLaunchContext
LOG.info("Setting up container launch context for containerid="
    + container.getId());
ContainerLaunchContext ctx =
    Records.newRecord(ContainerLaunchContext.class);
ctx.setContainerId(container.getId());
ctx.setResource(container.getResource());
try {
  ctx.setUser(UserGroupInformation.getCurrentUser().getShortUserName());
} catch (IOException e) {
  LOG.info("Getting current user failed when trying to launch the container: "
      + e.getMessage());
}

// Set the environment
Map<String, String> unixEnv;
// Set up the required env.
// Please note that the launched container does not inherit
// the environment of the ApplicationMaster, so all the
// necessary environment settings will need to be set up again
// for this allocated container.
ctx.setEnvironment(unixEnv);

// Set the local resources
Map<String, LocalResource> localResources =
    new HashMap<String, LocalResource>();
// Again, the local resources of the ApplicationMaster are not copied
// over by default to the allocated container. Thus, it is the
// responsibility of the ApplicationMaster to set up all the necessary
// local resources needed by the job that will be executed on the
// allocated container.
// Assume that we are executing a shell script on the allocated
// container and that the shell script's location in the filesystem
// is known to us.
Path shellScriptPath;
LocalResource shellRsrc = Records.newRecord(LocalResource.class);
shellRsrc.setType(LocalResourceType.FILE);
shellRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
shellRsrc.setResource(
    ConverterUtils.getYarnUrlFromURI(new URI(shellScriptPath)));
shellRsrc.setTimestamp(shellScriptPathTimestamp);
shellRsrc.setSize(shellScriptPathLen);
localResources.put("MyExecShell.sh", shellRsrc);
ctx.setLocalResources(localResources);

// Set the necessary command to execute on the allocated container


String command = "/bin/sh ./MyExecShell.sh"
    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";
List<String> commands = new ArrayList<String>();
commands.add(command);
ctx.setCommands(commands);

// Send the start request to the ContainerManager
StartContainerRequest startReq =
    Records.newRecord(StartContainerRequest.class);
startReq.setContainerLaunchContext(ctx);
try {
  cm.startContainer(startReq);
} catch (YarnRemoteException e) {
  LOG.info("Start container failed for containerId="
      + container.getId());
  e.printStackTrace();
}
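The command string assembled above simply redirects the script's stdout and stderr into the container's log directory. The following sketch shows the string being built without any YARN dependencies; note that inlining "<LOG_DIR>" as the value of ApplicationConstants.LOG_DIR_EXPANSION_VAR is our assumption here, made only so the sketch compiles on its own (the NodeManager substitutes the real log directory at launch time):

```java
public class ContainerCommandBuilder {
    // Assumption: ApplicationConstants.LOG_DIR_EXPANSION_VAR expands to the
    // literal token "<LOG_DIR>", which the NodeManager later replaces with
    // the container's actual log directory.
    static final String LOG_DIR_EXPANSION_VAR = "<LOG_DIR>";

    // Builds the launch command for a shell script, redirecting its output
    // into the container's log directory, as in the snippet above.
    public static String buildCommand(String script) {
        return "/bin/sh ./" + script
            + " 1>" + LOG_DIR_EXPANSION_VAR + "/stdout"
            + " 2>" + LOG_DIR_EXPANSION_VAR + "/stderr";
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("MyExecShell.sh"));
    }
}
```

Redirecting into the log directory is what makes the script's output visible later through the NodeManager's log pages, rather than being lost inside the container.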

The ApplicationMaster will get the application status information via the ApplicationMasterProtocol. It may also monitor the application by querying the ContainerManager for the container status:

GetContainerStatusRequest statusReq =
    Records.newRecord(GetContainerStatusRequest.class);
statusReq.setContainerId(container.getId());
GetContainerStatusResponse statusResp;
try {
  statusResp = cm.getContainerStatus(statusReq);
  LOG.info("Container Status"
      + ", id=" + container.getId()
      + ", status=" + statusResp.getStatus());
} catch (YarnRemoteException e) {
  e.printStackTrace();
}

This code snippet explains how to write the YARN client and ApplicationMaster in general. Actually, the ApplicationMaster is the application-specific entity; each application or framework that wants to run over YARN has a different ApplicationMaster, but the flow is the same. For more details on the YARN client and ApplicationMaster for different frameworks, visit the Apache Foundation website.

Responsibilities of the ApplicationMaster

The ApplicationMaster is the application-specific library and is responsible for negotiating resources from the ResourceManager as per the client application's requirements and needs. The ApplicationMaster works with the NodeManager to execute and monitor the containers and to track the application's progress. The ApplicationMaster itself runs in one of the containers allocated by the ResourceManager, and the ResourceManager tracks the progress of the ApplicationMaster.

The ApplicationMaster provides scalability to the YARN framework, as the ApplicationMaster can provide functionality much like that of the traditional ResourceManager, so the YARN cluster is able to scale with many hardware changes. Also, by moving all the application-specific code into the ApplicationMaster, YARN generalizes the system so that it can support multiple frameworks, just by writing an ApplicationMaster for each.


Summary

In this chapter, you learned how to use the bundled applications that come with the YARN framework, how to develop the YARN client and ApplicationMaster (the core parts of the YARN framework), how to submit an application to YARN, how to monitor an application, and the responsibilities of the ApplicationMaster.

In the next chapter, you will learn to write some real-time practical examples.


Chapter 7. YARN Frameworks

It's the dawn of 2015, and big data is still in its booming stage. Many new start-ups and giants are investing huge amounts in developing POCs and new frameworks to cater to a new and emerging variety of problems. These frameworks are the new cutting-edge technologies or programming models that tend to solve problems across industries in the world of big data. As corporations try to use big data, they are facing a new and unique set of problems that they never faced before. Hence, to solve these new problems, many frameworks and programming models are coming onto the market.

YARN's support for multiple programming models and frameworks makes it ideal to be integrated with these new and emerging frameworks or programming models. With YARN taking responsibility for resource management and other necessary things (scheduling jobs, fault tolerance, and so on), it allows these new application frameworks to focus on solving the problems that they were specifically meant for.

At the time of writing this book, many new and emerging open source frameworks are already integrated with YARN.

In this chapter, we will cover the following frameworks that run on YARN:

Apache Samza
Storm on YARN
Apache Spark
Apache Tez
Apache Giraph
Hoya (HBase on YARN)
KOYA (Kafka on YARN)

We will talk in detail about Apache Samza and Storm on YARN, where we will develop and run some sample applications. For the other frameworks, we will have a brief discussion.


Apache Samza

Samza is an open source project from LinkedIn and is currently an incubation project at the Apache Software Foundation. Samza is a lightweight distributed stream-processing framework to do real-time processing of data. The version that is available for download from the Apache website is not the production version that LinkedIn uses.

Samza is made up of the following three layers:

A streaming layer
An execution layer
A processing layer

Samza provides out-of-the-box support for all the preceding three layers:

Streaming: This layer is supported by Kafka (another open source project from LinkedIn)
Execution: This layer is supported by YARN
Processing: This layer is supported by the Samza API

The following three pieces fit together to form Samza:

The following architecture should be familiar to anyone who has used Hadoop:

Before going into each of these three layers in depth, it should be noted that Samza's support is not limited to these systems. Both Samza's execution and streaming layers are pluggable and allow developers to implement alternatives as required.

Samza is a stream-processing system to run continuous computation on infinite streams of data.

Samza provides a system to process stream data from publish-subscribe systems such as Apache Kafka. The developer writes a stream-processing task and executes it as a Samza job. Samza then routes messages between the stream-processing tasks and the publish-subscribe systems that the messages are addressed to.

Samza works a lot like Storm, the Twitter-developed stream-processing technology, except that Samza runs on Kafka, LinkedIn's own messaging system. Samza was developed with a pluggable architecture, enabling developers to use the software with other messaging systems.

Apache Samza is basically a combination of the following technologies:

Kafka: Samza uses Apache Kafka as its underlying message-passing system
Apache YARN: Samza also uses Apache YARN for task scheduling
ZooKeeper: Both YARN and Kafka, in turn, rely on Apache ZooKeeper for coordination

More information is available on the official site at http://samza.incubator.apache.org/.

We will use the hello-samza project to develop a sample example to do some real-time stream processing.

We will write a Kafka producer using the Java Kafka APIs to publish a continuous stream of messages to a Kafka topic. Finally, we will write a Samza consumer using the Samza API to process these streams from the Kafka topic in real time. For simplicity, we will just print a message and record each time a message is received in the Kafka topic.
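Before wiring up Kafka and Samza, the produce/consume flow described above can be modeled in plain Java. The following sketch is purely illustrative: an in-memory BlockingQueue stands in for the Kafka topic, a background thread stands in for the Samza task, and the sentinel value and all names are our own inventions, not Kafka or Samza APIs:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class StreamFlowSketch {
    // Publishes each message to an in-memory "topic" and has a consumer
    // thread print a timestamp per message, mirroring the flow described
    // in the text. Returns the number of messages the consumer saw.
    public static int pumpAndCount(List<String> messages) {
        BlockingQueue<String> topic = new LinkedBlockingQueue<String>();
        final AtomicInteger received = new AtomicInteger();
        // "Consumer" thread: plays the role of the Samza task, which just
        // records a timestamp each time a message arrives.
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    String msg;
                    while (!(msg = topic.take()).equals("<EOF>")) {
                        received.incrementAndGet();
                        System.out.println("received @ "
                            + System.currentTimeMillis() + ": " + msg);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        consumer.start();
        try {
            // "Producer": publishes each message to the topic.
            for (String m : messages) {
                topic.put(m);
            }
            topic.put("<EOF>"); // sentinel so the consumer can stop
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received.get();
    }

    public static void main(String[] args) {
        pumpAndCount(Arrays.asList("one", "two", "three"));
    }
}
```

The real setup replaces the queue with a durable, partitioned Kafka topic and the thread with a Samza container scheduled by YARN, but the producer/consumer decoupling is the same.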


Writing a Kafka producer

Let's first write a Kafka producer to publish messages to a Kafka topic (named storm-sentence):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

/**
 * A simple Java class to publish messages into KAFKA.
 *
 * @author nirmal.kumar
 */
public class KafkaStringProducerService {

  public Producer<String, String> producer;

  public Producer<String, String> getProducer() {
    return this.producer;
  }

  public void setProducer(Producer<String, String> producer) {
    this.producer = producer;
  }

  public KafkaStringProducerService(Properties prop) {
    setProducer(new Producer<String, String>(new ProducerConfig(prop)));
  }

  /**
   * Change the location of producer.properties accordingly in the main method.
   *
   * Loads the producer.properties file having the following properties:
   * kafka.zk.connect=192.xxx.xxx.xxx
   * serializer.class=kafka.serializer.StringEncoder
   * producer.type=async
   * queue.buffering.max.ms=5000000
   * queue.buffering.max.messages=1000000
   * metadata.broker.list=192.xxx.xxx.xxx:9092
   *
   * @param filepath
   * @return
   */
  private static Properties getConfigurationProperties(String filepath) {
    File path = new File(filepath);
    Properties properties = new Properties();
    try {
      properties.load(new FileInputStream(path));
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    return properties;
  }

  /**
   * Publishes each message to KAFKA.
   *
   * @param input
   * @param ii
   */
  public void execute(String input, int ii) {
    KeyedMessage<String, String> data =
        new KeyedMessage<String, String>("storm-sentence", input);
    this.producer.send(data);
    // Logs to the system console the no. of messages published (every 100000)
    if ((ii != 0) && (ii % 100000 == 0))
      System.out.println("$$$$$$$ PUBLISHED " + ii + " messages @ "
          + System.currentTimeMillis());
  }

  /**
   * Reads each line from the input message file.
   *
   * @param file
   * @return
   * @throws IOException
   */
  private static String readFile(String file) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String line = null;
    StringBuilder stringBuilder = new StringBuilder();
    String ls = System.getProperty("line.separator");
    while ((line = reader.readLine()) != null) {
      stringBuilder.append(line);
      stringBuilder.append(ls);
    }
    reader.close();
    return stringBuilder.toString();
  }

  /**
   * main method for invoking the Java application.
   * Needs two command-line arguments: the number of messages and the
   * absolute file path containing the string messages.
   *
   * @param args
   */
  public static void main(String[] args) {
    int ii = 0;
    int noOfMessages = Integer.parseInt(args[0]);
    String s = null;
    try {
      s = readFile(args[1]);
    } catch (IOException e) {
      e.printStackTrace();
    }
    /*
     * Instantiate the main class.
     * Change the location of producer.properties accordingly.
     */
    KafkaStringProducerService service = new KafkaStringProducerService(
        getConfigurationProperties("/home/cloud/producer.properties"));
    System.out.println("******** START: Publishing " + noOfMessages
        + " messages @ " + System.currentTimeMillis());
    while (ii <= noOfMessages) {
      // invoke the execute method to publish messages into KAFKA
      service.execute(s, ii);
      ii++;
    }
    System.out.println("####### END: Published " + noOfMessages
        + " messages @ " + System.currentTimeMillis());
    try {
      service.producer.close();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Create the Producer.properties file somewhere, such as /home/cloud/producer.properties, and specify its location in the preceding Kafka producer Java class.

The Producer.properties file will have the following information:
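The original shows the file contents only in a screenshot; based on the properties listed in the javadoc of the producer class above, the file looks like this (the 192.xxx.xxx.xxx addresses are placeholders you replace with your ZooKeeper and broker hosts):

```
kafka.zk.connect=192.xxx.xxx.xxx
serializer.class=kafka.serializer.StringEncoder
producer.type=async
queue.buffering.max.ms=5000000
queue.buffering.max.messages=1000000
metadata.broker.list=192.xxx.xxx.xxx:9092
```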


Writing the hello-samza project

Let's now write a Samza consumer and package it with the hello-samza project:

1. Download and build the hello-samza project. Check out the hello-samza project:

git clone git://git.apache.org/incubator-samza-hello-samza.git hello-samza
cd hello-samza

The output of the preceding code can be seen here:

2. Next, we will write a Samza consumer using the Samza API to process these messages from a Kafka topic. Go to hello-samza/samza-wikipedia/src/main/java/samza/examples/wikipedia/task and write the YarnEssentialsSamzaConsumer.java file as follows:

3. After writing the Samza consumer class in the hello-samza project, you will need to build the project:

mvn clean package

4. Create a samza directory inside the deploy directory:

mkdir -p deploy/samza

5. Finally, create the Samza job package:

tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza

6. For the Samza consumer properties, go to /home/cloud/hello-samza/deploy/samza/config.

7. Write a samza-test-consumer.properties file as follows:

This properties file will mainly contain the following information:

job.name: This is the name of the Samza job
yarn.package.path: This is the path of the Samza job package
task.class: This is the class of the actual Samza consumer
task.inputs: This is the Kafka topic name that the published messages will be read from
systems.kafka.consumer.zookeeper.connect: This is the ZooKeeper-related information
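The file itself appears only as a screenshot in the original. A minimal sketch built from just the keys listed above might look like the following; the job name, package filename, class package, and ZooKeeper address are hypothetical placeholders, and task.inputs uses Samza's system.stream form:

```
job.name=yarn-essentials-samza-consumer
yarn.package.path=file:/home/cloud/hello-samza/deploy/samza/samza-job-package-0.7.0-dist.tar.gz
task.class=samza.examples.wikipedia.task.YarnEssentialsSamzaConsumer
task.inputs=kafka.storm-sentence
systems.kafka.consumer.zookeeper.connect=localhost:2181
```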

Starting a grid

A Samza grid usually comprises three different systems: YARN, Kafka, and ZooKeeper. The hello-samza project comes with a script called grid to help you set up these systems. Start by running the following command:

bin/grid bootstrap

This command will download, install, and start ZooKeeper, Kafka, and YARN. It will also check out the latest version of Samza and build it. All the package files will be put in a subdirectory called deploy inside the hello-samza project's root folder. The result of the preceding command is shown here:


The following screenshot shows that ZooKeeper, YARN, and Kafka are being started:

Once all the processes are up and running, you can check the processes, as shown in this screenshot:

The YARN ResourceManager web UI will look like this:


The YARN NodeManager web UI will look like this:

Since we started the grid, let's now deploy the Samza job to it:

deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:/home/cloud/hello-samza/deploy/samza/config/samza-test-consumer.properties

Check the application processes and the RM UI. As you can see in the following screenshot, running the Samza job first creates a Samza AppMaster and then a Samza container to run the consumer that we wrote:

The ResourceManager web UI now shows the Samza application up and running:

The ApplicationMaster UI looks as follows:


The following screenshot shows the ApplicationMaster UI interface:


Since our Samza consumer is now up and running and listening for messages in the Kafka topic (named storm-sentence), let's publish some messages to the Kafka topic using the Kafka producer we wrote initially. The following Java command is used to invoke the Kafka producer, which has two command-line arguments:

N: This is the number of times the message is published into Kafka
{pathOfFileNameHavingMessage}: This is the file containing the actual string message

Create any file containing a string message (strmsg10K.txt) and pass this filename and path as the second command-line argument to the Java command, as shown in the following screenshot:

As soon as these messages are published in the Kafka topic, the Samza consumer consumes them and prints the timestamp, as written in the Samza consumer code.

The result after checking the Samza consumer logs is as follows:


Storm-YARN

Apache Storm is an open source distributed real-time computation system from Twitter.

Storm helps in processing unbounded streams of data in a reliable manner. Storm can be used with any programming language. Some of the most common use cases of Storm are real-time analytics, real-time machine learning, continuous computation, ETL, and many more.

Storm-YARN is a project from Yahoo that enables a Storm cluster to be deployed and managed by YARN. Earlier, separate clusters were needed for Hadoop and Storm.

One major benefit that comes with this integration is elasticity. Batch processing (Hadoop MapReduce) is usually done on the basis of need, while real-time processing (Storm) is an ongoing process. When the Hadoop cluster is idle, you can leverage it for real-time processing work.

In a typical real-time processing use case, constant and predictable loads are very rare. Storm, therefore, will need more resources during peak time, when the load is greater. At peak time, Storm can steal resources from the batch jobs and give them back when the load is less.

This way, the overall resource utilization can scale up and down depending on the load and demand. This elasticity is, therefore, useful for utilizing the available resources on the basis of demand between real-time and batch processing.

Another benefit is that this integration reduces the physical distance of data transfers between Storm and Hadoop. Many applications use both Storm and Hadoop on separate clusters while sharing data between them (MapReduce). For such a scenario, Storm-YARN reduces network transfers, and in turn the total cost of acquiring the data, as they share the same cluster, as shown in the following image:


Referring to the preceding diagram, Storm-YARN asks YARN's ResourceManager to launch a Storm ApplicationMaster. The Storm ApplicationMaster then launches a Storm Nimbus server and a Storm UI server locally. It also uses YARN to allocate resources for the supervisors and finally launch them.

We will now install Storm-YARN on a Hadoop YARN cluster and deploy some Storm topologies to the cluster.


Prerequisites

The following are the prerequisites for Storm-YARN.

Hadoop YARN should be installed

Refer to the Hadoop YARN installation at http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/SingleCluster.html.

The master Thrift service of Storm-on-YARN uses port 9000, and if Storm-YARN is launched from the NameNode, there will be a port clash.

In this case, you will need to change the port of the NameNode in your Hadoop installation. Typically, the following processes should be up and running in Hadoop:

Apache ZooKeeper should be installed

At the time of writing this book, the Storm-on-YARN ApplicationMaster implementation does not include running ZooKeeper on YARN. Therefore, it is presumed that there is a ZooKeeper cluster already running to enable communication between Nimbus and the workers.

There is an open issue that tracks this at https://github.com/yahoo/storm-yarn/issues/22.

Installing ZooKeeper is very straightforward and easy.

Refer to http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html.


Setting up Storm-YARN

Storm-YARN is basically an implementation of the YARN client and ApplicationMaster for Storm.

The client gets a new application ID for Storm and submits the application, and the ApplicationMaster sets up the Storm components (Nimbus, Supervisor, and so on) on YARN using the containers that the ApplicationMaster requests from the ResourceManager.

Note that Storm-on-YARN is not a new implementation of Storm that works on YARN. Frameworks (that is, Samza, Storm, Spark, Tez, and so on) themselves do not need to be modified to be able to run on YARN. Only the ApplicationMaster and the YARN client code need to be written for each of the frameworks so that they run on YARN as an application just like any other. Now, proceed with the following steps:

1. Clone the Storm-YARN repository from Git:

cd storm-on-yarn-poc/
git clone https://github.com/yahoo/storm-yarn.git
cd storm-yarn

The Storm client machine refers to the machine that will submit the YARN client and ApplicationMaster to the ResourceManager.

As of now, there is a single release of Storm-on-YARN from Yahoo that contains both Storm-YARN and a Storm version (0.9.0-wip21). The Storm release is present in the lib directory of the extracted Storm-on-YARN release.

2. Build Storm-YARN using Maven:

mvn package
or
mvn package -DskipTests

3. We will get the following output:

[INFO] Scanning for projects...
[INFO]
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building storm-yarn 1.0-alpha
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] Compiling 5 source files to /home/nirmal/storm-on-yarn-poc/storm-yarn-master/target/test-classes
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default) @ storm-yarn ---
[INFO]
[INFO] --- maven-surefire-plugin:2.10:test (default-test) @ storm-yarn ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ storm-yarn ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.153 s
[INFO] Finished at: 2014-11-12T15:57:49+05:30
[INFO] Final Memory: 10M/118M
[INFO] ------------------------------------------------------------------------

4. Next, you will need to copy the storm.zip file from storm-yarn/lib to HDFS. This is because Storm-on-YARN will deploy a copy of the Storm code to all the nodes of the YARN cluster using HDFS. However, the location from where to fetch this copy of the Storm code is hardcoded into the Storm-on-YARN client. Copy the storm.zip file to HDFS using the following command:

hdfs dfs -mkdir -p /lib/storm/0.9.0-wip21

Alternatively, you can also use the following command:

hadoop fs -mkdir -p /lib/storm/0.9.0-wip21

hdfs dfs -put /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /lib/storm/0.9.0-wip21/storm.zip

You can also use the following command:

hadoop fs -put /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /lib/storm/0.9.0-wip21/storm.zip

The exact version of Storm might differ, in your case, from 0.9.0-wip21.

5. Create a directory to hold our Storm configuration:

mkdir -p /home/nirmal/storm-on-yarn-poc/storm-data/
cp /home/nirmal/storm-on-yarn-poc/storm-yarn-master/lib/storm.zip /home/nirmal/storm-on-yarn-poc/storm-data/
cd /home/nirmal/storm-on-yarn-poc/storm-data
unzip storm.zip

6. Add the following configuration in the storm.yaml file located at /home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/conf. You can change the following values as per your setup:

storm.zookeeper.servers: localhost
nimbus.host: localhost
master.initial-num-supervisors: 2
master.container.size-mb: 1024


7. Add the storm-yarn/bin folder to your path variable:

export PATH=$PATH:/home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/bin:/home/nirmal/storm-on-yarn-poc/storm-yarn-master/bin

8. Finally, launch Storm-YARN using the following command:

storm-yarn launch /home/nirmal/storm-on-yarn-poc/storm-data/storm-0.9.0-wip21/conf/storm.yaml

Launching Storm-YARN executes the Storm-YARN client, which gets an appId from YARN's ResourceManager and starts running the Storm-YARN ApplicationMaster. The ApplicationMaster then starts the Nimbus, worker, and supervisor services. You will get an output similar to the one shown in the following screenshot:

9. We can retrieve the status of our application using the following YARN command:

yarn application -list

We will get the status of our application as follows:

10. You can also see Storm-YARN running on the ResourceManager web UI at http://localhost:8088/cluster/:


11. Nimbus should also be running now, and you should be able to see it through the Nimbus web UI at http://localhost:7070/. This looks as follows:

12. The following processes should be up and running:


Getting the storm.yaml configuration of the launched Storm cluster

The machine that will use the Storm client command to submit a new topology needs the storm.yaml configuration file of the Storm cluster launched on YARN to be stored in /home/nirmal/.storm/storm.yaml.

Normally, when Storm is not run on YARN, this configuration file is manually edited, so you know the IP addresses of the Storm components. However, since the location where the Storm components run on YARN depends on the location of the allocated containers, Storm-on-YARN is responsible for setting up storm.yaml for us. You can fetch this storm.yaml file from the running Storm-on-YARN:

$ cd
$ mkdir .storm/
$ storm-yarn getStormConfig -appId (check the appId on the YARN application UI at port 8088) -output /home/nirmal/.storm/storm.yaml


Building and running Storm-Starter examples

In this section, we will see how to get the example code from GitHub, build it using Maven, and finally run the examples. To perform these tasks, you'll have to execute the following steps:

1. Get the code from GitHub. We will use the storm-starter from GitHub:

git clone https://github.com/nathanmarz/storm-starter
Cloning into 'storm-starter'...
remote: Counting objects: 756, done.
remote: Total 756 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (756/756), 171.81 KiB | 56.00 KiB/s, done.
Resolving deltas: 100% (274/274), done.
Checking connectivity... done

2. Next, go to the downloaded storm-starter directory:

cd storm-starter/

3. Check the content using the following command:

ls -ltr
-rw-r--r-- 1 nirmal nirmal  171 Nov 12 12:58 README.markdown
-rw-r--r-- 1 nirmal nirmal 5047 Nov 12 12:58 m2-pom.xml
drwxr-xr-x 3 nirmal nirmal 4096 Nov 12 12:58 multilang
-rw-r--r-- 1 nirmal nirmal  580 Nov 12 12:58 LICENSE
drwxr-xr-x 4 nirmal nirmal 4096 Nov 12 12:58 src
-rw-r--r-- 1 nirmal nirmal  929 Nov 12 12:58 project.clj
drwxr-xr-x 3 nirmal nirmal 4096 Nov 12 12:58 test
-rw-r--r-- 1 nirmal nirmal 8042 Nov 12 12:58 storm-starter.iml

4. Build the storm-starter project using Maven:

mvn -f m2-pom.xml package

Alternatively, to skip the tests:

mvn -f m2-pom.xml package -DskipTests

5. You will see an output similar to the following:

[INFO] Scanning for projects...
[INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building storm-starter 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] META-INF/MANIFEST.MF already added, skipping
[INFO] META-INF/ already added, skipping
[INFO] META-INF/maven/ already added, skipping
[INFO] Building jar: /home/nirmal/storm-on-yarn-poc/storm-starter/target/storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar
[INFO] META-INF/MANIFEST.MF already added, skipping
[INFO] META-INF/ already added, skipping
[INFO] META-INF/maven/ already added, skipping
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:21 min
[INFO] Finished at: 2014-11-12T13:05:40+05:30
[INFO] Final Memory: 30M/191M
[INFO] ------------------------------------------------------------------------

6. After the build is successful, you will see the following JAR file being created under the target directory:

storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar

7. Run the Storm topology example on the Storm-YARN cluster:

storm jar storm-starter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.starter.WordCountTopology word-count-topology

The output can be seen in the following screenshot:

8. Click on the topology, as shown in the following screenshot:
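The WordCountTopology submitted above wires a sentence spout to a splitting bolt and a counting bolt. A framework-free sketch of that split-then-count pipeline in plain Python (illustrative only; the real topology is written against the Storm Java API, and the sample sentences are assumptions):

```python
from collections import Counter

def split_sentence_bolt(sentences):
    """Emulates the splitter bolt: emit one word per incoming sentence tuple."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def word_count_bolt(words):
    """Emulates the counting bolt: keep a running count per word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

# Hypothetical spout output; the real spout emits sentences endlessly.
sentences = ["the cow jumped over the moon", "the man went to the store"]
counts = word_count_bolt(split_sentence_bolt(sentences))
print(counts["the"])  # "the" occurs twice in each sentence, so 4 in total
```

In the real topology, the same split and count steps run continuously across many parallel bolt instances, with the word stream partitioned by field grouping so each word always reaches the same counter.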


Apache Spark

Apache Spark is a fast and general engine for large-scale data processing. It was originally developed in 2009 in UC Berkeley's AMPLab and open sourced in 2010.

The main features of Spark are as follows:

Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory and 10x faster even when running on disk.

Ease of use: Spark lets you quickly write applications in Java, Scala, or Python. You can use it interactively to query big datasets from the Scala and Python shells.

Runs everywhere: Spark runs on Hadoop, Mesos, in standalone mode, or in the cloud. It can access diverse data sources, including HDFS, Cassandra, HBase, and S3. You can run Spark readily using its standalone cluster mode, on EC2, or run it on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.

Generality: Spark powers a stack of high-level tools, including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.
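Much of the in-memory speedup comes from keeping intermediate datasets in memory between chained transformations instead of writing them to disk between stages. A tiny pure-Python sketch of the chained map/flatMap/reduceByKey style Spark encourages (this is not the Spark API; real code would go through a SparkContext and RDDs or DataFrames):

```python
from functools import reduce

# Hypothetical input partition; real Spark would read this from HDFS, S3, etc.
lines = ["spark runs on yarn", "spark caches data in memory"]

# flatMap: one word per line element
words = [w for line in lines for w in line.split()]
# map: pair each word with a count of one
pairs = [(w, 1) for w in words]
# reduceByKey: merge the per-word counts
counts = reduce(
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
print(counts["spark"])  # "spark" appears once per line, so 2
```

In Spark proper, each of these steps is a lazy transformation over a distributed dataset, and caching the intermediate result is what avoids the per-stage disk round trip MapReduce pays.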


Why run on YARN?

YARN enables Spark to run in a single cluster alongside other frameworks, such as Tez, Storm, HBase, and others. This avoids the need to create and manage separate and dedicated Spark clusters.

Typically, customers want to run multiple workloads on a single dataset in a single cluster. YARN, as a generic resource-management layer and single data platform for all the different frameworks/engines, makes this happen.

YARN's built-in multi-tenancy support allows dynamic and optimal sharing of the same shared cluster resources between the different frameworks that run on YARN.

YARN has pluggable schedulers to categorize, isolate, and prioritize workloads.


Apache Tez

Apache Tez is part of the Stinger initiative led by Hortonworks to make Hive enterprise-ready and suitable for interactive SQL queries. The Tez design is based on research done by Microsoft on parallel and distributed computing.

Tez entered the Apache Incubator in February 2013 and graduated to a top-level project in July 2014.

Tez is basically an embeddable and extensible framework to build high-performance batch and interactive data-processing applications that need to integrate easily with YARN.

Confusion often arises when Tez is thought of as an engine. Tez is not a general-purpose engine, but more of a framework for tools to express their purpose-built needs. Tez, for example, enables Hive, Pig, and others to build their own purpose-built engines and embed them in those technologies. Projects such as Hive, Pig, and Cascading now show significant improvements in response times when they use Tez instead of MapReduce.

Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez exists to address some of the limitations of MapReduce. For example, in typical MapReduce, a lot of temporary data is stored (such as each mapper's output, which is a disk I/O), which is an overhead. In the case of Tez, this disk I/O of temporary data is saved, thereby resulting in higher performance compared to the MapReduce model.

Also, Tez can adjust the parallelism of reduce tasks at runtime, depending on the actual data size coming out of the previous task. On the other hand, in MapReduce, the number of reducers is static and has to be decided by the user before the job is submitted to the cluster.
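The runtime decision amounts to a simple rule: pick the reducer count from the bytes actually produced upstream rather than from a job-submission-time constant. A toy illustration of that idea (the helper, the 256 MB target, and the numbers are hypothetical; this is not Tez's auto-parallelism code):

```python
def pick_reducers(upstream_output_bytes, bytes_per_reducer=256 * 1024 * 1024):
    """Choose reduce-task parallelism from the actual upstream output size."""
    # At least one reducer; otherwise roughly one per 256 MB of real data.
    return max(1, -(-upstream_output_bytes // bytes_per_reducer))  # ceil division

# MapReduce-style: fixed before submission, regardless of the data.
static_reducers = 100

# Tez-style: the upstream tasks actually produced only ~1 GB, so scale down.
dynamic_reducers = pick_reducers(1 * 1024 ** 3)
print(dynamic_reducers)  # 4 reducers instead of a hardcoded 100
```

The benefit is exactly what the paragraph describes: when the upstream stage emits far less data than expected, the job does not pay the overhead of launching mostly idle reduce containers.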

The processing done by multiple MapReduce jobs can now be done by a single Tez job, as follows:


Referring to the preceding diagram, earlier (with Pig/Hive), we used to need multiple MapReduce jobs to do some processing. Now, in Tez, a single job does the same: the reducers (the green boxes) of the previous step feed the mappers (the blue boxes) of the next step.

The preceding image is taken from http://www.infoq.com/articles/apache-tez-saha-murthy.

Tez is not meant directly for end users; in fact, it enables developers to build end-user applications with much better performance and flexibility. Traditionally, Hadoop has been a batch-processing platform for processing large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as machine learning, that do not fit into the MapReduce paradigm. Tez helps Hadoop address these use cases.

Tez provides an expressive dataflow-definition API that lets developers create their own unique data-processing graphs (DAGs) to represent their applications' data-processing flows. Once the developer defines a flow, Tez then provides additional APIs to inject the custom business logic that will run in that flow. These APIs combine inputs (that read data), outputs (that write data), and processors (that process data) to process the flow.
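The inputs/processors/outputs wiring can be pictured as a tiny DAG runner: vertices hold a processing function, and edges route one vertex's output into the next. This is a conceptual stand-in, not the Tez DAG API (which uses DAG, Vertex, and Edge classes in Java); the vertex names and data are made up:

```python
# Minimal DAG: each vertex is a named function; edges say whose output feeds whom.
vertices = {
    "tokenize": lambda lines: [w for line in lines for w in line.split()],
    "lower":    lambda words: [w.lower() for w in words],
    "distinct": lambda words: sorted(set(words)),
}
edges = [("tokenize", "lower"), ("lower", "distinct")]

def run_dag(source, vertices, edges):
    """Execute the vertices in topological order, piping each vertex's
    output into its downstream vertex (no intermediate disk writes)."""
    order = ["tokenize", "lower", "distinct"]   # topological order of this DAG
    upstream = {dst: src for src, dst in edges}
    data = {"__source__": source}
    for name in order:
        parent = upstream.get(name, "__source__")
        data[name] = vertices[name](data[parent])
    return data[order[-1]]

result = run_dag(["Tez builds DAGs", "tez runs on YARN"], vertices, edges)
print(result)  # ['builds', 'dags', 'on', 'runs', 'tez', 'yarn']
```

The point of the DAG shape is the one the chapter makes: stages hand results directly to their successors, so a multi-stage pipeline becomes one job instead of a chain of MapReduce jobs with disk writes in between.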

Tez can also run any existing MR job without any modification. For more information on Tez, refer to http://tez.apache.org/.


Apache Giraph

Apache Giraph is a graph-processing system that uses the MapReduce model to process graphs. Currently, it is in incubation at the Apache Software Foundation.

It is based on Google's Pregel, which is used to calculate PageRank.

Currently, Giraph is being used by Facebook, Twitter, and LinkedIn to create social graphs of their users. Both Giraph and Pregel are based on the Bulk Synchronous Parallel (BSP) model of distributed computation, which was introduced by Leslie Valiant.
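In the BSP model, computation proceeds in supersteps: every vertex computes with the messages it received, sends new messages to its neighbors, and then all vertices synchronize at a barrier before the next superstep. A small single-process sketch (not Giraph's Java Computation API; the graph and values are made up) that propagates the maximum value through a graph:

```python
def bsp_max(graph, values, supersteps=5):
    """Each superstep: every vertex sends its value to its neighbors, then
    adopts the max of its own value and all incoming messages. The two
    phases per superstep model the barrier between communication and compute."""
    for _ in range(supersteps):
        # "Communication" phase: gather the messages sent to each vertex.
        inbox = {v: [] for v in graph}
        for v, neighbors in graph.items():
            for n in neighbors:
                inbox[n].append(values[v])
        # Barrier, then "compute" phase using this superstep's messages only.
        values = {v: max([values[v]] + inbox[v]) for v in graph}
    return values

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}   # ring: a -> b -> c -> a
final = bsp_max(graph, {"a": 3, "b": 7, "c": 1})
print(final)  # every vertex converges to the global maximum, 7
```

PageRank in Pregel/Giraph has the same shape: the per-vertex update is a weighted sum of incoming rank messages instead of a max, and the barrier guarantees every vertex sees a consistent superstep's worth of messages.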

Support for YARN is available from release 1.1.0. For more information, refer to the official site at http://giraph.apache.org/.


HOYA (HBase on YARN)

Hoya is basically HBase running on YARN. It is currently hosted on GitHub, but there are plans to move it to the Apache Software Foundation.

Hoya creates HBase clusters on top of YARN. It does this with a client application called the Hoya client; this application creates the persistent configuration files, sets up the HBase cluster XML files, and then asks YARN to create an ApplicationMaster, which here is the Hoya AM.

For more information, refer to https://github.com/hortonworks/hoya, http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/, and http://hortonworks.com/blog/hoya-hbase-on-yarn-application-architecture/.


KOYA (Kafka on YARN)

On November 5, 2014, DataTorrent, a company founded by ex-Yahoo! engineers, announced a new project to bring the fault-tolerant, high-performance, scalable Apache Kafka messaging system to YARN.

The so-called Kafka on YARN (KOYA) project plans to leverage YARN for Kafka broker management, automatic broker recovery, and more. Planned features include a fully HA ApplicationMaster, sticky allocation of containers (so that a restart can access local data), a web interface for Kafka, and more.

The expected release to the open source community is sometime in Q2 2015.

More information is available at https://www.datatorrent.com/introducing-koya-apache-kafka-on-apache-hadoop-2-0-yarn/.


Summary

This chapter talked about the different frameworks and programming models that can be run on YARN. We discussed Apache Samza and Storm on YARN in detail.

With the wide acceptance of YARN in the industry, more and more frameworks will support YARN, taking complete advantage of YARN's generic features.

We looked at the existing frameworks that are integrated with YARN at the moment.

There is a lot more work going on in the industry to make existing and new applications run on YARN.

In Chapter 8, Failures in YARN, we will discuss how faults and failures at various levels are handled in YARN.


Chapter 8. Failures in YARN

Dealing with failures in distributed systems is comparatively more challenging and time-consuming. Moreover, the Hadoop and YARN frameworks run on commodity hardware, and cluster sizes nowadays can vary from several nodes to several thousand nodes, so handling failure scenarios and dealing with ever-growing scaling issues is very important. In this chapter, we will focus on failures in the YARN framework: the causes of failures and how to overcome them.

In this chapter, we will cover the following topics:

ResourceManager failures
ApplicationMaster failures
NodeManager failures
Container failures
Hardware failures

We will be dealing with the root causes of these failures and the solutions to them.


ResourceManager failures

In the initial versions of the YARN framework, a ResourceManager failure meant a total cluster failure, as it was a single point of failure. The ResourceManager stores the state of the cluster, such as the metadata of submitted applications, information on cluster resource containers, information on the cluster's general configuration, and so on. Therefore, if the ResourceManager went down because of a hardware failure, there was no way to avoid manually debugging the cluster and restarting the ResourceManager. While the ResourceManager was down, the cluster was unavailable, and once it was restarted, all jobs needed a restart: half-completed jobs lost their data and had to be started again. In short, a restart of the ResourceManager used to restart all the running ApplicationMasters.

The latest versions of YARN address this problem in two ways. One way is by creating an active-passive ResourceManager architecture, so that when one goes down, another becomes active and takes responsibility for the cluster. The ResourceManager state can be seen in the following image:

Another way is by using a ZooKeeper-based ResourceManager quorum, so that the ResourceManager state is stored externally in ZooKeeper; one ResourceManager is in the active state, and one or more ResourceManagers are in passive mode, waiting for an event that brings them to the active state. The ResourceManager's state can be seen in the following image:


In the preceding diagram, you can see that the ResourceManager's state is managed by ZooKeeper. Whenever there is a failure condition, the ResourceManager's state is shared with the passive ResourceManager(s), which change to the active state and take over responsibility for the cluster without any downtime.
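The failover mechanism can be pictured as a tiny state machine: a shared store (standing in for ZooKeeper) holds the cluster state, and on a failure a passive ResourceManager is promoted and resumes from exactly that state. A deliberately simplified simulation (class and method names are invented for illustration; this is not the real ZooKeeper-backed HA code):

```python
class SharedStore:
    """Stands in for ZooKeeper: RM state persisted outside any single RM."""
    def __init__(self):
        self.state = {}

class ResourceManager:
    def __init__(self, name, store, active=False):
        self.name, self.store, self.active = name, store, active

    def record_app(self, app_id):
        assert self.active, "only the active RM mutates the shared state"
        self.store.state[app_id] = "RUNNING"

def failover(active_rm, passive_rm):
    """Promote the passive RM; it sees the same externally stored state."""
    active_rm.active = False
    passive_rm.active = True
    return passive_rm

store = SharedStore()
rm1 = ResourceManager("rm1", store, active=True)
rm2 = ResourceManager("rm2", store)

rm1.record_app("application_1")
new_active = failover(rm1, rm2)        # rm1 crashes, rm2 takes over
print(new_active.name, store.state)    # rm2 still sees application_1
```

Because the application metadata lives in the shared store rather than inside the failed process, the promoted ResourceManager can continue tracking application_1 instead of forcing a full restart, which is the whole point of externalizing the state.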


ApplicationMaster failures

Recovering an application's state after a restart caused by an ApplicationMaster failure is the responsibility of the ApplicationMaster itself. When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it, for another application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the older ApplicationMaster, and this is possible only when ApplicationMasters persist their state in an external location so that it can be used for future reference. An ApplicationMaster can also simply run its application from scratch instead of recovering the previous state and rerunning it.

For example, an ApplicationMaster can recover its completed jobs. However, if the jobs that were running or completed during the ApplicationMaster's recovery time frame get halted for some reason, their state will be discarded, and the ApplicationMaster will simply rerun them from scratch.

The YARN framework is capable of rerunning the ApplicationMaster a specified number of times and recovering the completed tasks.
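The recover-or-rerun decision reduces to this: on a new attempt, reload whatever task states were persisted externally, skip the tasks marked complete, and rerun the rest (up to the configured number of AM attempts). An illustrative sketch, not the MapReduce AM's actual recovery code (task names and the store are made up):

```python
MAX_AM_ATTEMPTS = 2  # e.g. the RM's configured maximum AM attempts

def run_attempt(tasks, persisted):
    """Run only the tasks not already recorded as COMPLETED in the store."""
    executed = []
    for task in tasks:
        if persisted.get(task) == "COMPLETED":
            continue                      # recovered: no need to rerun
        executed.append(task)
        persisted[task] = "COMPLETED"     # persist progress externally
    return executed

store = {}                                # stands in for external (HDFS) state
tasks = ["map_0", "map_1", "reduce_0"]

first = run_attempt(tasks[:2], store)     # AM crashes before reduce_0 runs
second = run_attempt(tasks, store)        # new AM attempt recovers the maps
print(first, second)  # ['map_0', 'map_1'] then only ['reduce_0']
```

Without the external store, the second attempt would have no record of the finished maps and would have to rerun everything from scratch, which is exactly the alternative behavior the section describes.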


NodeManager failures

Almost every node in the cluster runs a NodeManager service daemon. The NodeManager takes care of executing a certain part of a YARN job on each individual machine, while other parts are executed on other nodes. For a 1,000-node YARN cluster, there are probably around 999 NodeManagers running. NodeManagers are thus per-node agents that take care of the individual nodes distributed across the cluster.

If a NodeManager fails, the ResourceManager detects this failure using a timeout (that is, it stops receiving heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers. It also kills all the containers running on that node and reports the failure to all running ApplicationMasters. The ApplicationMasters are then responsible for reacting to node failures by redoing the work done by any containers running on that node during the fault.

If the fault causing the timeout is transient, the NodeManager will resynchronize with the ResourceManager. Along similar lines, if a new NodeManager joins the cluster, the ResourceManager notifies all ApplicationMasters about the availability of the new resources.
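Heartbeat-based liveness detection is just bookkeeping over last-seen timestamps: any node silent for longer than the expiry interval is declared lost, and its containers are marked for rerun. A schematic version (the interval, node names, and times are illustrative, not YARN defaults):

```python
EXPIRY_SECONDS = 600  # illustrative liveness expiry interval

def lost_nodes(last_heartbeat, now):
    """Return the nodes whose last heartbeat is older than the expiry window."""
    return sorted(node for node, t in last_heartbeat.items()
                  if now - t > EXPIRY_SECONDS)

# Seconds since cluster start at which each NodeManager last heartbeated.
last_heartbeat = {"node1": 995, "node2": 350, "node3": 990}
now = 1000

dead = lost_nodes(last_heartbeat, now)
print(dead)  # node2 has been silent for 650 seconds -> declared lost
```

Note the asymmetry the section describes: declaring a node lost is a local decision at the ResourceManager, while recovering the lost work is delegated to each ApplicationMaster that had containers on the dead node.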


Container failures

Whenever a container finishes, the ApplicationMaster is informed of this event by the ResourceManager. The ApplicationMaster interprets the container status received through the ResourceManager as success or failure from the container's exit status. The ApplicationMaster handles the failures of the job's containers.

It is the responsibility of the application frameworks to manage container failures, and the responsibility of the YARN framework to provide the information to the application frameworks. As part of the Allocate API's response, the ResourceManager returns information on finished containers to the corresponding ApplicationMaster. It is the responsibility of the ApplicationMaster to validate the container's status, exit code, and diagnostic information and to take appropriate action on it; for example, the MapReduce ApplicationMaster retries map and reduce tasks by requesting new containers, until a configured number of task attempts fail for a single job.

To address container-allocation failure scenarios, the ApplicationMaster collects container information through the Allocate call; the AllocateResponse often does not return any containers immediately. However, the Allocate call should be made periodically to ensure that all requested containers are eventually assigned. When a container arrives, it is certain that the framework has sufficient resources, and the ApplicationMaster will not receive more containers than it asked for. Also, the ApplicationMaster can make separate container requests (ResourceRequests), typically one per second.
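The MapReduce AM's retry policy described above amounts to: inspect each finished container's exit status and re-request a container for the task until a per-task attempt limit is hit. A minimal sketch (the attempt limit, exit codes, and helper are hypothetical illustrations, not the MapReduce AM's code):

```python
MAX_ATTEMPTS = 4  # illustrative per-task attempt limit

def run_task_with_retries(attempt_exit_codes):
    """Request a new container per attempt until exit code 0 or attempts run out.
    attempt_exit_codes: the exit status each successive container would report."""
    for attempt, exit_code in enumerate(attempt_exit_codes[:MAX_ATTEMPTS], 1):
        if exit_code == 0:
            return ("SUCCEEDED", attempt)      # clean exit: task done
    return ("FAILED", min(len(attempt_exit_codes), MAX_ATTEMPTS))

# Two crashed containers (non-zero exit), then a clean one.
print(run_task_with_retries([137, 1, 0]))      # succeeds on the third attempt
# Every attempt fails: the task (and hence the job) is marked failed.
print(run_task_with_retries([1, 1, 1, 1, 1]))  # gives up after 4 attempts
```

The division of labor matches the section: YARN only reports the exit status and diagnostics; the decision to retry, and how many times, belongs entirely to the application framework.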


Hardware failures

As the Hadoop and YARN frameworks use commodity hardware for cluster setups that scale from several nodes to several thousand nodes, all the components of Hadoop and YARN are designed on the assumption that hardware failures are very common. Therefore, these failures are handled automatically by the framework so that important data is not lost because of them. For this, Hadoop provides data replication across nodes/racks, so that even if a whole rack fails, the data can be recovered from a node on another rack, and jobs can be restarted over another replica of the dataset to compute the results.


Summary

In this chapter, we discussed YARN failure scenarios and how these are addressed in the YARN framework. In the next chapter, we will focus on alternative solutions to the YARN framework. We will also see a brief overview of the most common frameworks that are closely related to YARN.


Chapter 9. YARN – Alternative Solutions

During the development of YARN, many other organizations simultaneously identified the limitations of Hadoop 1.x and were actively involved in developing alternative solutions.

This chapter will briefly talk about such alternative solutions and compare them to YARN. Among the most common frameworks that are closely related to YARN are:

Mesos
Omega
Corona


Mesos

Mesos was originally developed at the University of California at Berkeley and later became open source under the Apache Software Foundation.

Mesos can be thought of as a highly available and fault-tolerant operating system kernel for your clusters. It is a cluster resource manager that provides efficient resource isolation and sharing across multiple diverse cluster-computing frameworks.

Mesos can be compared to YARN in some aspects, but a complete quantitative comparison is not really possible.

We will talk about the architecture of Mesos and compare some of the architectural differences with respect to YARN. This way, we will have a high-level understanding of the main differences between the two frameworks.

The preceding figure shows the main components of Mesos. It basically consists of a master process that manages slave processes running on each cluster node, and Mesos applications (also called frameworks) that run tasks on these slaves.

For more information, please refer to the official site at http://mesos.apache.org/.

Here are the high-level differences between Mesos and YARN:

Mesos: Uses Linux container groups (http://lxc.sourceforge.net). Linux container groups provide stronger isolation, but may carry some additional overhead.
YARN: Uses simple Unix processes.

Mesos: Primarily written in C++.
YARN: Primarily written in Java, with bits of native code.

Mesos: Supports both memory and CPU scheduling.
YARN: Currently supports only memory scheduling (for example, you request x containers of y MB each), but there are plans to extend it to other resources, such as network and disk I/O.

Mesos: Introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them.
YARN: Has a request-based approach. It allows the ApplicationMaster to ask for resources based on various criteria, including locations, and allows the requester to modify future requests based on what was given and on current usage.

Mesos: Leverages a pool of central schedulers (for example, classic Hadoop or MPI).
YARN: Has a per-job scheduler. Although YARN enables late binding of containers to tasks, where each individual job can perform local optimizations, the per-job ApplicationMaster might result in greater overhead than the Mesos approach.
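The "resource offers" mechanism in the comparison above can be sketched as a loop: the master offers a framework a slice of the free resources, and the framework accepts what it wants and declines the rest, which the master can then offer to someone else. A toy simulation of that handshake (not the Mesos API; numbers are invented):

```python
def make_offer(free_cpus, framework_needs):
    """Master offers resources; the framework accepts up to its own need
    and declines the remainder, which returns to the free pool."""
    accepted = min(free_cpus, framework_needs)
    declined = free_cpus - accepted
    return accepted, declined

free = 8                      # CPUs currently unused in the cluster
accepted, declined = make_offer(free, framework_needs=3)
free = declined               # declined resources can be offered to others

print(accepted, free)  # the framework took 3 CPUs, 5 remain offerable
```

The contrast with YARN is directional: here the master pushes offers and the framework chooses; in YARN, the ApplicationMaster pulls by describing what it wants (location, size) and the ResourceManager decides what to grant.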


Omega

Omega is Google's next-generation cluster management system.

Omega is specifically focused on a cluster-scheduling architecture that uses parallelism, shared state, and optimistic concurrency control.

From past experience, Google noticed that as clusters and their workloads grow, the scheduler is at risk of becoming a scalability bottleneck.

Google's production job scheduler has experienced all of this. Over the years, it has evolved into a complicated, sophisticated system that is hard to change.

A schematic overview of the scheduling architectures can be seen in the following figure:


Google identified the following two prevalent scheduler architectures, shown in the preceding figure:

Monolithic schedulers: These use a single, centralized scheduling algorithm for all jobs (Google's existing scheduler is one of these). They do not make it easy to add new policies and specialized implementations, and may not scale up to the cluster sizes being planned for the future.

Two-level schedulers: These have a single active resource manager that offers compute resources to multiple parallel, independent scheduler frameworks, as in Mesos and Hadoop on Demand (HOD). Their architectures do appear to provide flexibility and parallelism, but in practice their conservative resource visibility and locking algorithms limit both, and make it hard to place difficult-to-schedule "picky" jobs or to make decisions that require access to the state of the entire cluster.

The solution is Omega: a new parallel scheduler architecture built around shared state, using lock-free optimistic concurrency control to achieve both implementation extensibility and performance scalability.

Omega's approach reflects a greater focus on scalability, but it makes it harder to enforce global properties, such as capacity, fairness, and deadlines.

For more information, refer to http://research.google.com/pubs/pub41684.html.


Corona

Corona is another work from Facebook, which is now open sourced and hosted in the GitHub repository at https://github.com/facebookarchive/hadoop-20/tree/master/src/contrib/corona.

Facebook, with its huge petabyte-scale quantity of data, suffered serious performance-related issues with the classic MapReduce framework because of the single JobTracker taking care of thousands of jobs and doing a lot of work alone.

In order to solve these issues, Facebook created Corona, which separated cluster resource management from job coordination.

In Hadoop Corona, cluster resources are tracked by a central ClusterManager. Each job gets its own CoronaJobTracker, which tracks just that particular job.

Corona entirely redesigned the MapReduce architecture to bring better cluster utilization and job scheduling, just like YARN did.

Facebook’sgoalsinre-writingtheHadoopschedulingframeworkwerenotthesameasYARN’s.FacebookwantedquickimprovementsinMapReduce,butonlythepartthattheywereusing.TheyhadnointerestinrunningmultipleheterogeneousframeworkssuchasYARNdoesorotherkeydesignconsiderationsofYARN.

ForFacebook,doingaquickrewriteoftheschedulerseemedfeasibleandlowrisk,comparedtogoingwithYARN,gettingfeaturesthatwerenotneeded,understandingit,fixingitsproblemsandthenlandingupwithsomethingthatdidn’taddresstheprimarygoalofloweringlatency.

The following are some of the key differences:

Corona does push-based scheduling and has an event-driven, callback-oriented message flow. This was critical to achieving fast, low-latency scheduling. Polling is a big part of why the Hadoop scheduler is slow and has scalability issues. YARN does not do callback-based message flow.

In Corona, the JobTracker can run in the same JVM as the job client (that is, Hive). Facebook had fat client machines with tons of RAM and CPU. To reduce latency, maximum processing on the client machine is preferred. In YARN, the JobTracker equivalent (the ApplicationMaster) has to be scheduled within the cluster. This means that there's one extra step between starting a query and getting it running.

Corona is structured as a contrib project to the Hadoop 0.20 branch and is not a very large codebase. Corona is integrated with the fair scheduler; YARN is more interested in the capacity scheduler.

For more information on Corona, refer to https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920.


Summary

We talked about various works related to YARN that are available on the market today. These systems share common inspirations and requirements, and the high-level goals of improving scalability, latency, fault tolerance, and programming-model flexibility. Their varied architectural differences are due to their diverse and varied design priorities. In the next chapter, we will talk about YARN's future and support in the industry.


Chapter 10. YARN – Future and Support

YARN is the new, modern data operating system for Hadoop 2. YARN acts as a central orchestrator in Hadoop 2 to support mixed workloads/programming models, running multiple engines, and multiple access patterns, such as batch processing, interactive, streaming, and real-time.

In this chapter, we will talk about YARN's journey and its present and future in the big data industry.


What YARN means to the big data industry

It can be said that YARN is a boon to the big data industry. Without YARN, the entire big data industry would have been at serious risk. As the industry started playing with big data, new and emerging varieties of problems came into the picture, and hence new frameworks.

YARN's support for running these new and emerging frameworks allows them to focus on solving the problems they were specifically meant for, while YARN takes care of resource management and other necessities (resource allocation, scheduling jobs, fault tolerance, and so on).

Had there been no YARN, these frameworks would have had to do all the resource management on their own. There are many big data projects that failed in the past due to unrealistic expectations of immature technologies.

YARN is the enabler for porting mature, enterprise-class technologies directly onto Hadoop. Without YARN, the only option in Hadoop was to use MapReduce.


Journey–presentandfutureAroundtwoyearsback,YARNwasintroducedwiththeHadoop0.23releaseon11Nov,2011.

Sincethen,therewasnolookingbackandtherewereanumberofreleases.

Finally,onOctober15,2013ApacheHadoop2.2.0wastheGA(GeneralAvailability)releaseofApacheHadoop2.x.

InOctober2013,ApacheHadoopYARNwontheBestPaperawardatACMSoCC(SymposiumonCloudComputing)2013.

ApacheHadoop2.x,poweredbyYARN,isnodoubtthebestplatformforalloftheHadoopecosystemcomponentssuchasMapReduce,ApacheHive,ApachePig,andsoonthatuseHDFSastheunderlyingdatastorage.

YARNwasalsohonoredbyotheropensourcecommunitiesforframeworkssuchasApacheGiraph,ApacheTez,ApacheSpark,ApacheFlink,andmanyothers.

VendorssuchasHP,Microsoft,SAS,Teradata,SAP,RedHat,andthelistgoeson,aremovingtowardsYARNtoruntheirexistingproductsandservicesonHadoop.

People willing to modify their applications can already use YARN directly, but there are many customers and vendors who don't want to modify their existing applications. For them, there is Apache Slider, another open source project from Hortonworks, which can deploy existing distributed applications without requiring them to be ported to YARN.

Apache Slider allows you to bridge existing always-on services and makes sure they work really well on top of YARN, without having to modify the application itself.

Slider enables many long-running services and applications, such as Apache Storm, Apache HBase, Apache Accumulo, and so on, to run on YARN.

This initiative will definitely expand the spectrum of applications and use cases that one can actually run with Hadoop and YARN in the future.


Present ongoing features

Now, let's discuss the ongoing work in YARN.

Long Running Applications on Secure Clusters (YARN-896)

Support long-lived applications and long-lived containers. Refer to https://issues.apache.org/jira/browse/YARN-896.

Application Timeline Server (YARN-321, YARN-1530)

Currently, we have a JobHistoryServer for MapReduce history. The MapReduce job history server needs to be deployed as a trusted server, in sync with the MapReduce runtime, and every new application would need a similar application history server. Having to deploy O(T*V) trusted servers (where T is the number of application types and V is the number of application versions) is clearly not scalable.

This JIRA is to create only one trusted application history server, which can have a generic UI. Refer to the following links for more information:

https://issues.apache.org/jira/browse/YARN-321
https://issues.apache.org/jira/browse/YARN-1530
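The deployment-scaling concern above can be made concrete with a little arithmetic; the counts below are hypothetical, purely for illustration:

```python
# Illustrative only: compares the number of trusted history servers needed
# with one server per application type/version pair versus one generic
# Application Timeline Server. T and V are made-up example values.

def per_framework_servers(num_app_types: int, num_versions: int) -> int:
    """One trusted history server per (type, version) pair: O(T*V)."""
    return num_app_types * num_versions

def generic_timeline_servers() -> int:
    """A single generic, trusted Application Timeline Server: O(1)."""
    return 1

T, V = 6, 4  # e.g., 6 framework types, each with 4 deployed versions
print(per_framework_servers(T, V))  # 24 trusted servers to operate
print(generic_timeline_servers())   # 1
```

Even for these small example numbers, the per-framework approach needs 24 trusted servers; the generic server always needs one.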

Disk scheduling (YARN-2139)

Support for disk as a resource in YARN. YARN should consider disk as another resource for scheduling tasks on nodes, isolation at runtime, and spindle locality. Refer to https://issues.apache.org/jira/browse/YARN-2139.

Reservation-based scheduling (YARN-1051)

This extends the YARN RM to handle time explicitly, allowing users to reserve capacity over time. It is an important step towards SLAs, long-running services, and workflows, and it helps with gang scheduling.
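To make the idea of "reserving capacity over time" concrete, here is a toy feasibility check, not the actual YARN-1051 design: a new reservation is admitted only if, at every time step, it fits under total cluster capacity alongside the already-accepted reservations.

```python
def fits(capacity, reservations, new_res):
    """Check whether new_res = (start, end, amount) fits under `capacity`
    at every time step, given existing (start, end, amount) reservations.
    Time intervals are half-open [start, end). Illustrative model only."""
    start, end, amount = new_res
    for t in range(start, end):
        used = sum(a for (s, e, a) in reservations if s <= t < e)
        if used + amount > capacity:
            return False  # would overcommit the cluster at time t
    return True

cluster_capacity = 100                  # abstract capacity units
existing = [(0, 10, 60), (5, 15, 30)]   # two already-accepted reservations
print(fits(cluster_capacity, existing, (6, 9, 20)))    # False: 60+30+20 > 100
print(fits(cluster_capacity, existing, (10, 14, 50)))  # True: only 30 in use then
```

A real reservation system must also plan placement and handle preemption, but the core admission question is this capacity-over-time check.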


Future features

Let's discuss the future work planned in YARN.

Container Resizing (YARN-1197)

The current YARN resource management logic assumes that the resources allocated to a container are fixed during its lifetime. When users want to change the resources of an allocated container, the only way is to release it and allocate a new container of the expected size. Allowing runtime changes to the resources of an allocated container will give us better control of resource usage on the application side. Refer to https://issues.apache.org/jira/browse/YARN-1197.
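The release-and-reallocate workaround described above can be sketched with a toy scheduler (hypothetical names and logic; not the real YARN client API):

```python
import itertools

class ToyScheduler:
    """Minimal stand-in for an RM that hands out fixed-size containers."""
    _ids = itertools.count(1)

    def allocate(self, memory_mb, vcores):
        return {"id": next(self._ids), "memory_mb": memory_mb, "vcores": vcores}

    def release(self, container):
        pass  # resources are returned to the cluster

def resize_by_reallocation(scheduler, container, new_memory_mb, new_vcores):
    """Pre-YARN-1197, 'resizing' means releasing the old container and
    allocating a fresh one of the desired size -- losing any running work."""
    scheduler.release(container)
    return scheduler.allocate(new_memory_mb, new_vcores)

rm = ToyScheduler()
c = rm.allocate(1024, 1)
bigger = resize_by_reallocation(rm, c, 4096, 2)
print(bigger["memory_mb"], bigger["vcores"])  # 4096 2
print(bigger["id"] != c["id"])                # True: it's a different container
```

The new container's ID differs from the old one, which is exactly the cost YARN-1197 aims to remove: with in-place resizing, the same container would simply grow or shrink.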

Admin labels (YARN-796)

Support for admins to specify labels for nodes. Examples of labels are OS, processor architecture, and so on. Refer to https://issues.apache.org/jira/browse/YARN-796.

Container Delegation (YARN-1488)

Allow containers to delegate resources to another container. This would allow external frameworks to share not just YARN's resource-management capabilities, but also its workload-management capabilities.

This also shows that YARN is not only focused on the Apache Hadoop ecosystem components, but also on any existing external non-Hadoop products and services that want to use Hadoop.

Also, work is going on to bring together the worlds of data and PaaS by using Docker, Google Kubernetes, and Red Hat OpenShift on YARN, so that common resource management can be done across data and PaaS workloads.


YARN-supported frameworks

The following is the current list of frameworks that run on top of YARN, and this list will keep getting longer in the future:

Apache Hadoop MapReduce and its ecosystem components
Apache HAMA
Open MPI
Apache S4
Apache Spark
Apache Tez
Impala
Storm
HOYA (HBase on YARN)
Apache Samza
Apache Giraph
Apache Accumulo
Apache Flink
KOYA (Kafka on YARN)
Solr


Summary

In this chapter, we briefly talked about YARN's journey since its inception. YARN has completely changed Hadoop from what it was in the Hadoop 1.x version; it is now a first-class resource management framework that supports mixed workloads and processing frameworks.

From what can be seen and predicted, YARN is surely a hit in the big data industry, with many more new and promising features to come. Currently, YARN handles memory and CPU, and it will coordinate additional resources such as disk and network I/O in the future.


Index

A

Access Control List (ACL)
  about / NodeManager (NM), The capacity scheduler

administrative tools
  about / Administrative tools
  commands / Administrative tools
  generic options, supporting / Administrative tools

anagrams / Practical examples of MRv1 and MRv2

Apache Giraph
  about / Apache Giraph
  URL / Apache Giraph

Apache Hadoop 2.2.0
  about / Journey – present and future

Apache Samza
  about / Apache Samza
  Kafka / Apache Samza
  Apache YARN / Apache Samza
  ZooKeeper / Apache Samza
  Kafka producer, writing / Writing a Kafka producer
  hello-samza project, writing / Writing the hello-samza project

Apache Samza, layers
  processing layer / Apache Samza
  streaming layer / Apache Samza
  execution layer / Apache Samza

Apache Slider
  about / Journey – present and future

Apache Software Foundation
  about / Mesos

Apache Spark
  about / Apache Spark
  features / Apache Spark
  running, on YARN / Why run on YARN?

Apache Tez
  about / Apache Tez
  URL / Apache Tez

Application Context (AppContext) / The MapReduce ApplicationMaster

ApplicationMaster
  about / The MapReduce ApplicationMaster
  ApplicationMaster (AM) / ApplicationMaster (AM)
  restarting / The MapReduce ApplicationMaster


  writing / Writing the YARN ApplicationMaster
  responsibilities / Responsibilities of the ApplicationMaster
  failures / ApplicationMaster failures

ApplicationMasterLauncher service
  about / ResourceManager

ApplicationMasterService
  about / ResourceManager

ApplicationsManager
  about / ResourceManager


B

backward compatibility, MRv2 APIs
  about / Backward compatibility of MRv2 APIs
  binary compatibility, of org.apache.hadoop.mapred APIs / Binary compatibility of org.apache.hadoop.mapred APIs
  source compatibility, of org.apache.hadoop.mapred APIs / Source compatibility of org.apache.hadoop.mapred APIs

Bulk Synchronous Parallel (BSP)
  about / Apache Giraph


C

capacity scheduler
  about / The capacity scheduler, The capacity scheduler
  benefits / The capacity scheduler
  features / The capacity scheduler
  configurations / Capacity scheduler configurations

cluster scheduling architecture
  about / Omega

configuration parameters
  about / The fully-distributed mode

container
  failures / Container failures

container allocation
  about / Container allocation
  to application / Container allocation to the application

container configurations
  about / Container configurations
  parameters / Container configurations

ContainerExecutor
  about / NodeManager (NM)

ContainerManager
  about / NodeManager (NM)

Context Objects / Old and new MapReduce APIs

Corona
  about / Corona
  and Facebook, differences / Corona
  URL / Corona


D

data-processing graphs (DAGs)
  about / Apache Tez

DataNodes (DN) / The fully-distributed mode
  configuring / The fully-distributed mode

Docker
  about / Future features


E

EcoSystem
  web interfaces / Web interfaces of the Ecosystem


F

Facebook
  about / Corona
  and Corona, differences / Corona

Fair scheduler / The fair scheduler
  about / The fair scheduler
  configurations / Fair scheduler configurations

FIFO scheduler / The FIFO (First In First Out) scheduler
  about / The FIFO (First In First Out) scheduler
  configurations / The FIFO (First In First Out) scheduler

fully-distributed mode
  about / The fully-distributed mode
  HistoryServer / HistoryServer
  slave files / Slave files


G

Google Kubernetes
  about / Future features

grid
  starting / Starting a grid


H

Hadoop
  URL / Software
  YARN, using in / Understanding where YARN fits into Hadoop

Hadoop 0.23
  about / Journey – present and future

Hadoop 1.x
  about / A short introduction to Hadoop 1.x and MRv1
  components / A short introduction to Hadoop 1.x and MRv1

Hadoop 2 release
  about / The Hadoop 2 release

Hadoop and YARN cluster
  operating / Operating Hadoop and YARN clusters
  starting / Starting Hadoop and YARN clusters
  stopping / Stopping Hadoop and YARN clusters

Hadoop cluster
  HDFS / A short introduction to Hadoop 1.x and MRv1
  MapReduce / A short introduction to Hadoop 1.x and MRv1

Hadoop On Demand (HOD) / Omega

hello-samza project
  writing / Writing the hello-samza project
  properties / Writing the hello-samza project
  grid, starting / Starting a grid

HistoryServer / HistoryServer

HOYA (HBase on YARN)
  about / HOYA (HBase on YARN)
  URL / HOYA (HBase on YARN)


K

Kafka producer
  writing / Writing a Kafka producer

KOYA (Kafka on YARN)
  about / KOYA (Kafka on YARN)
  URL / KOYA (Kafka on YARN)


M

MapReduce, YARN
  about / YARN's MapReduce support
  ApplicationMaster / The MapReduce ApplicationMaster
  settings, example / Example YARN MapReduce settings
  YARN applications, developing / Developing YARN applications

MapReduce applications
  YARN, compatible with / YARN's compatibility with MapReduce applications

MapReduce job
  configurations / MapReduce job configurations
  properties / MapReduce job configurations

MapReduce JobHistoryServer
  settings / HistoryServer

MapReduce project
  End-user MapReduce API / MRv1 versus MRv2
  MapReduce framework / MRv1 versus MRv2
  MapReduce system / MRv1 versus MRv2

Mesos
  about / Mesos
  and YARN, difference between / Mesos
  URL / Mesos

modern operating system, of Hadoop
  YARN, used as / YARN as the modern operating system of Hadoop

monolithic schedulers / Omega

MRv1
  about / A short introduction to Hadoop 1.x and MRv1
  versus MRv2 / MRv1 versus MRv2
  examples / Practical examples of MRv1 and MRv2, Running the job

MRv2
  versus MRv1 / MRv1 versus MRv2
  examples / Practical examples of MRv1 and MRv2, Preparing the input file(s)


N

NameNode (NN) / The fully-distributed mode
  configuring / The fully-distributed mode

new MapReduce API
  about / Old and new MapReduce APIs
  versus old MapReduce API / Old and new MapReduce APIs

NodeHealthCheckerService
  about / NodeManager (NM)

NodeManager (NM) / NodeManager (NM)
  configuring / The fully-distributed mode
  parameters / The fully-distributed mode

NodeManagers (NM) / The fully-distributed mode

NodeStatusUpdater
  about / NodeManager (NM)


O

old MapReduce API
  about / Old and new MapReduce APIs
  versus new MapReduce API / Old and new MapReduce APIs

Omega
  about / Omega


P

Pi example
  running / Running a sample Pi example

prerequisites, single-node installation
  platform / Platform
  software / Software

prerequisites, Storm-YARN
  Hadoop YARN, installing / Hadoop YARN should be installed
  Apache ZooKeeper, installing / Apache ZooKeeper should be installed

program names
  aggregatewordcount / Running sample examples on YARN
  aggregatewordhist / Running sample examples on YARN
  bbp / Running sample examples on YARN
  dbcount / Running sample examples on YARN
  distbbp / Running sample examples on YARN
  grep / Running sample examples on YARN
  join / Running sample examples on YARN
  multifilewc / Running sample examples on YARN
  pentomino / Running sample examples on YARN
  pi / Running sample examples on YARN
  randomtextwriter / Running sample examples on YARN
  randomwriter / Running sample examples on YARN
  secondarysort / Running sample examples on YARN
  sort / Running sample examples on YARN
  sudoku / Running sample examples on YARN
  teragen / Running sample examples on YARN
  terasort / Running sample examples on YARN
  teravalidate / Running sample examples on YARN
  wordcount / Running sample examples on YARN
  wordmean / Running sample examples on YARN
  wordmedian / Running sample examples on YARN
  wordstandarddeviation / Running sample examples on YARN

pseudo-distributed mode / The pseudo-distributed mode

push-based scheduling / Corona


R

redesign idea
  about / The redesign idea
  MapReduce, limitations / Limitations of the classical MapReduce or Hadoop 1.x
  Hadoop 1.x, limitations / Limitations of the classical MapReduce or Hadoop 1.x

Red Hat OpenShift
  about / Future features

Red Hat Package Managers (RPMs) / The fully-distributed mode

ResourceManager / ResourceManager

ResourceManager (RM)
  scheduler / ResourceManager
  security / ResourceManager
  RM Restart Phase I / Recent developments in YARN architecture
  RM Restart Phase II / Recent developments in YARN architecture
  about / The fully-distributed mode
  configuring / The fully-distributed mode
  parameters / The fully-distributed mode
  failures / ResourceManager failures

ResourceManager (RM), components
  ApplicationManager / NodeManager (NM)
  Scheduler / NodeManager (NM)


S

scheduler architectures
  monolithic schedulers / Omega
  two-level schedulers / Omega

single-node installation
  about / Single-node installation
  prerequisites / Prerequisites
  starting / Starting with the installation
  standalone mode (local mode) / The standalone mode (local mode)
  pseudo-distributed mode / The pseudo-distributed mode

slave files / Slave files

standalone mode (local mode) / The standalone mode (local mode)

Storm-Starter examples
  building / Building and running Storm-Starter examples
  running / Building and running Storm-Starter examples

Storm-YARN
  about / Storm-YARN
  prerequisites / Prerequisites
  setting up / Setting up Storm-YARN
  storm.yaml configuration, obtaining / Getting the storm.yaml configuration of the launched Storm cluster
  Storm-Starter examples, building / Building and running Storm-Starter examples
  Storm-Starter examples, running / Building and running Storm-Starter examples

storm.yaml configuration
  obtaining / Getting the storm.yaml configuration of the launched Storm cluster


T

two-level schedulers / Omega


W

web GUI
  YARN applications, monitoring with / Monitoring YARN applications with web GUI


Y

YARN
  used, as modern operating system of Hadoop / YARN as the modern operating system of Hadoop
  design goals / What are the design goals for YARN
  used, in Hadoop / Understanding where YARN fits into Hadoop
  multitenancy application support / YARN multitenancy application support
  sample examples, running on / Running sample examples on YARN
  sample Pi example, running / Running a sample Pi example
  compatibility, with MapReduce applications / YARN's compatibility with MapReduce applications
  Apache Spark, running on / Why run on YARN?
  and Mesos, difference between / Mesos
  importance, to big data industry / What YARN means to the big data industry
  present / Journey – present and future
  future / Journey – present and future
  present ongoing features / Present ongoing features
  future features / Future features

YARN, features
  Long Running Applications on Secure Clusters (YARN-896) / Present ongoing features
  Application Timeline Server (YARN-321, YARN-1530) / Present ongoing features
  Disk scheduling (YARN-2139) / Present ongoing features
  Reservation-based scheduling (YARN-1051) / Present ongoing features
  Container Resizing (YARN-1197) / Future features
  Admin labels (YARN-796) / Future features
  Container Delegation (YARN-1488) / Future features

YARN-321
  URL / Present ongoing features

YARN-796
  URL / Future features

YARN-896
  URL / Present ongoing features

YARN-1197
  URL / Future features

YARN-1530
  URL / Present ongoing features

YARN-2139
  URL / Present ongoing features

YARN-supported frameworks
  about / YARN-supported frameworks

YARN administration


  about / Administration of YARN
  configuration files / Administration of YARN
  administrative tools / Administrative tools
  nodes, adding from YARN cluster / Adding and removing nodes from a YARN cluster
  nodes, removing from YARN cluster / Adding and removing nodes from a YARN cluster
  YARN jobs, administrating / Administrating YARN jobs
  MapReduce job, configurations / MapReduce job configurations
  YARN log management / YARN log management
  YARN web user interface / YARN web user interface

YARN applications
  monitoring, with web GUI / Monitoring YARN applications with web GUI
  developing / Developing YARN applications
  ApplicationClientProtocol / Developing YARN applications
  ApplicationMasterProtocol / Developing YARN applications
  ContainerManagerProtocol / Developing YARN applications

YARN application workflow
  about / The YARN application workflow
  YARN client, writing / Writing the YARN client
  ApplicationMaster, writing / Writing the YARN ApplicationMaster

YARN architecture
  components / Core components of YARN architecture
  development / Recent developments in YARN architecture

YARN architecture, components
  ResourceManager / ResourceManager
  ApplicationMaster (AM) / ApplicationMaster (AM)
  NodeManager (NM) / NodeManager (NM)

YARN client
  writing / Writing the YARN client

YARN cluster
  nodes, adding from / Adding and removing nodes from a YARN cluster
  nodes, removing from / Adding and removing nodes from a YARN cluster

YARN jobs
  administrating / Administrating YARN jobs

YARN log management / YARN log management

YARN MapReduce settings
  example / Example YARN MapReduce settings
  properties / Example YARN MapReduce settings

YARN scheduler policies
  about / YARN scheduler policies
  FIFO scheduler / The FIFO (First In First Out) scheduler
  Fair scheduler / The fair scheduler
  capacity scheduler / The capacity scheduler


YARN scheduling policies
  about / YARN scheduling policies
  FIFO scheduler / The FIFO (First In First Out) scheduler
  capacity scheduler / The capacity scheduler
  Fair scheduler / The fair scheduler

YARN web user interface / YARN web user interface


Z

ZooKeeper
  URL / Apache ZooKeeper should be installed