introduction to configuation and management for sas® grid...

16
Paper SAS1117-2017 Introduction to Configuration and Management for SAS® Grid Manager for Hadoop Mark Lochbihler, Hortonworks Inc., Santa Clara, CA ABSTRACT "How can we run traditional SASÒ jobs, including SAS workspace servers on Hadoop worker nodes?" The answer is SASÒ Grid Manager for Hadoop. It has been integrated with the Hadoop ecosystem to provide resource management, high availability and enterprise scheduling for SAS customers. This paper will provide an introduction architect, configuration and management of SAS Grid Manager for Hadoop. Anyone involved with SAS and Hadoop should find the information in this paper useful. The first area to be covered will be a break down each required SAS and Hadoop component. From the Hadoop ecosystem, we will define the role of YARN Compute, HDFS Storage, and Hadoop Client services. We will review SAS metadata definitions for SAS Grid Manager, Object Spawner and Grid Workspace Servers. We will cover required Kerberos security, as well as SAS Enterprise Guide and the SASGSUB utility. YARN queues and the SAS Grid Policy file for optimizing job scheduling will be reviewed. And finally, we will discuss traditional SAS math running on a Hadoop Worker node, and how it can take advantage of SAS High Performance Math to accelerate job execution. By leveraging SAS Grid Manager for Hadoop, sites are moving SAS jobs inside a Hadoop Cluster. This will ultimately cut down on data movement and provide more consistent job execution. Although this paper is written for the SAS and Hadoop Administrators, SAS Users will also benefit from this session. WHAT IS SAS GRID COMPUTING? SAS Grid Computing has been offering SAS shops a lower cost, shared, multi-tenant, high performing computing environment to meet their advanced analytic and modeling needs. By implementing a SAS Grid, SAS administrators can centralize individual and or departmental SAS computing environments onto a SAS Compute Grid and better utilize IT resources, provide high availability and accelerated processing. A SAS Grid runs on two or more SAS Grid Compute Nodes. Each SAS Grid Compute Node is a candidate to execute SAS jobs submitted into a Grid queue by SAS user groups at a site. ENTER HADOOP AND YARN The benefits of SAS Grid Computing are not new to the SAS User Community. It has been successfully implemented and running in production at thousands of customers’ sites around the world for well over a decade and provides significant benefits. What’s new with SAS Grid Manager for Hadoop is that it offers YARN as an orchestration option in addition to the existing and well-proven use of the Platform Suite for SAS which includes LSF. SAS Grid Manager for Hadoop was designed to enable customers to co-locate their SAS Grid and associated SAS workloads on their new or existing SAS Hadoop clusters. WHY MOVE SAS JOBS INSIDE HADOOP? A decision to develop this type of solution typically starts by identifying business needs and associated usage cases to support these needs. Any customer interested lowering overall SAS storage and server costs, while at the same time consolidating and centralizing departmental SAS datasets is a candidate. An initial step can be to migrate selected SAS Workload data, including raw inbound and outbound files, SAS libraries and other RDBMS SAS data sources to Hadoop Storage. Just by taking the first step to move SAS storage to Hadoop, organizations can save over 50% in annual SAS storage costs. For new model develop or existing model optimization efforts that plan on

Upload: others

Post on 18-Mar-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

PaperSAS1117-2017IntroductiontoConfigurationandManagement

forSAS®GridManagerforHadoopMarkLochbihler,HortonworksInc.,SantaClara,CA

ABSTRACT

"HowcanweruntraditionalSASÒjobs,includingSASworkspaceserversonHadoopworkernodes?"TheanswerisSASÒGridManagerforHadoop.IthasbeenintegratedwiththeHadoopecosystemtoprovideresourcemanagement,highavailabilityandenterpriseschedulingforSAScustomers.Thispaperwillprovideanintroductionarchitect,configurationandmanagementofSASGridManagerforHadoop.AnyoneinvolvedwithSASandHadoopshouldfindtheinformationinthispaperuseful.ThefirstareatobecoveredwillbeabreakdowneachrequiredSASandHadoopcomponent.FromtheHadoopecosystem,wewilldefinetheroleofYARNCompute,HDFSStorage,andHadoopClientservices.WewillreviewSASmetadatadefinitionsforSASGridManager,ObjectSpawnerandGridWorkspaceServers.WewillcoverrequiredKerberossecurity,aswellasSASEnterpriseGuideandtheSASGSUButility.YARNqueuesandtheSASGridPolicyfileforoptimizingjobschedulingwillbereviewed.Andfinally,wewilldiscusstraditionalSASmathrunningonaHadoopWorkernode,andhowitcantakeadvantageofSASHighPerformanceMathtoacceleratejobexecution.ByleveragingSASGridManagerforHadoop,sitesaremovingSASjobsinsideaHadoopCluster.Thiswillultimatelycutdownondatamovementandprovidemoreconsistentjobexecution.AlthoughthispaperiswrittenfortheSASandHadoopAdministrators,SASUserswillalsobenefitfromthissession.

WHATISSASGRIDCOMPUTING?

SASGridComputinghasbeenofferingSASshopsalowercost,shared,multi-tenant,highperformingcomputingenvironmenttomeettheiradvancedanalyticandmodelingneeds.ByimplementingaSASGrid,SASadministratorscancentralizeindividualandordepartmentalSAScomputingenvironmentsontoaSASComputeGridandbetterutilizeITresources,providehighavailabilityandacceleratedprocessing.ASASGridrunsontwoormoreSASGridComputeNodes.EachSASGridComputeNodeisacandidatetoexecuteSASjobssubmittedintoaGridqueuebySASusergroupsatasite.

ENTERHADOOPANDYARN

ThebenefitsofSASGridComputingarenotnewtotheSASUserCommunity.Ithasbeensuccessfullyimplementedandrunninginproductionatthousandsofcustomers’sitesaroundtheworldforwelloveradecadeandprovidessignificantbenefits.What’snewwithSASGridManagerforHadoopisthatitoffersYARNasanorchestrationoptioninadditiontotheexistingandwell-provenuseofthePlatformSuiteforSASwhichincludesLSF.SASGridManagerforHadoopwasdesignedtoenablecustomerstoco-locatetheirSASGridandassociatedSASworkloadsontheirneworexistingSASHadoopclusters.

WHYMOVESASJOBSINSIDEHADOOP?

Adecisiontodevelopthistypeofsolutiontypicallystartsbyidentifyingbusinessneedsandassociatedusagecasestosupporttheseneeds.AnycustomerinterestedloweringoverallSASstorageandservercosts,whileatthesametimeconsolidatingandcentralizingdepartmentalSASdatasetsisacandidate.AninitialstepcanbetomigrateselectedSASWorkloaddata,includingrawinboundandoutboundfiles,SASlibrariesandotherRDBMSSASdatasourcestoHadoopStorage.JustbytakingthefirststeptomoveSASstoragetoHadoop,organizationscansaveover50%inannualSASstoragecosts.Fornewmodeldeveloporexistingmodeloptimizationeffortsthatplanon

usingIoTHadoopdatasources(includingSensor,ClickStream,WebLog,MachineandIoTdevice),thenmovingexistingSASdatasetstoApacheHiveandHadoopStorage(HDFS)couldyieldsignificantcostsavings.ToaccessthesemigratedSASdatasets,userscanturntoSASÒAccesstoHadoop.ItoffersaSASlibnameengineforHDFS,aswellasHive,andafilenamestatementtoHDFS.Oncethedataismigrated,carefullyselectedSASWorkloadscanbemovedontoHadoopWorkernodesandSASuserscantransparentlyleverageSASGridManagerforHadooptoruntheirdailySASjobs.OncethedecisionhasbeenmadetomovenewandorexistingSASdataandworkloadstoHadoop,andleverageSASGridManager,itishighlyrecommendedthatyoursiteinvestupfrontinSASandHadoopadministrativetrainingandprofessionalservices.ItisalsocriticalthatyouhaveadetailedunderstandingofhowYARNconfigurationandschedulingworksonHadoop.Inaddition,involvingSASandHortonworksProfessionalservicesduringtheprojectstartupphaseinimportant.

HOWITWORKS

YARN101 IfyouarenewtoHadoopandYARN,letmeprovideabitofbackground.YARNistheorchestrationengineforHadoop2.x.Figure1belowisahigh-levelviewofanEnterpriseDataLake.WithYARNastheasthecentralorchestratorandoperatingsystemforHadoop,sitescanrunmultipleBatch,InteractiveandRealtimecomputeengineswithinthesameHadoopcluster(includingSASGridManagerasindicatedbythearrow).HadoophasbeenofferingaLowCost,MassiveScaleStorageandComputeArchitectureforoveradecade.WithSASGridManagerforHadoop,SASuserscanruntraditionalSASjobsonHadoopWorkernodes,insidethecluster.

Figure1:AHadoopClusterrunningBatch,InteractiveandRealTimeEngines,includingSASGridManager.WithSASGridManagerforHadoop,acommunityofSASuserstransparentlyleveragingSASClientsandsubmitinteractiveandbatchSASjobstotheSASGridComputinginfrastructureonHadoop.ThesejobsarescheduledbyYARNbasedonqueuesandsitepoliciestorunonanoptimalSASGridComputeNode(HadoopWorkerNode).BelowisaConceptualViewofthearchitecture:

Clickstream Web&Social

Geolocation Sensor& Machine

ServerLogs

Unstructured

SO

UR

CE

S

Existing Systems

ERP CRM SCM

AN

ALY

TIC

S

Data Marts

Business Analytics

Visualization& Dashboards

AN

ALY

TIC

S

Applications Business Analytics

Visualization& Dashboards

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

°

HDFS (Hadoop Distributed File System)

YARN: Data Operating System

Interactive Real-TimeBatch SAS GridManager

Batch BatchMPP

EDW

Figure2:SASGridManagerforHadoopConceptualArchitecture(Reference:SASGridManagerforHadoop)

BREAKINGDOWNTHISSASARCHITECTUREONHADOOPWithSASGridManagerforHadoop,SASjobsarescheduledbyYARN’sResourceManagerintoaclusterrunninginsideoftheHadoopfirewall.SASjobscanrequestadditionalHadoopresourcesaftertheyhavebeeninitiallylaunched.ThisnewarchitecturecanreducethecomplexityofconfigurationbysimplifyingportmappingbetweenSASjobsandtheservicestheywillneedtocomplete.Inthisdeploymentmodel,SASmathisrunningclosertotheHadoopdataandnolongerrequiresnegotiationwiththeHadoopfirewalltoaccess,ifrequested,additionalHadoopandSASHighPerformanceserviceports.

Figure3:SASGridManagerforHadoopwYARNArchitectureOverviewTheYARNResourceManager’sroleinFigure3aboveistodeterminethemostoptimalHadoopWorkerNodestorunatraditionalSASjoborWorkspaceServerintheHadoopcluster.Inthisillustration,youcanseetwoYARNjobsalreadyrunningonthecluster.EachoftheseHadoopjobshasasingleYARNApplicationMaster(AM)Containerassigned.Job1isnotaSASjob.ItconsistsofAM1alongwiththreeadditionalJobTaskContainers(C1.1,C1.2,andC1.3).ThereisalsoaSASjobrunningonthesamecluster.ThisSASjobhasitsownApplicationMaster(SASAM1)

YARNNodeManager YARNNodeManager YARNNodeManager YARNNodeManager

Job1Container1.1

YARNNodeManager YARNNodeManager YARNNodeManager YARNNodeManager

YARNNodeManager YARNNodeManager YARNNodeManager YARNNodeManager

Job1Container1.2

Job1Container 1.3

Job1AM 1 SASAM 2

SAS Grid Manager w YARN Architecture Overview

SASClient• SASGSUB• SASBatch• SASEG

YARNResourceManager

YARNCapacityScheduler

SASGridControlServer

SASObjectSpawner

SAS MetadataServer

SASContainer2.1

andoneSASJobTaskContainer(SASC2.1).Fromanarchitectureandadministrativeperspective,letstakealookattheprimarySASandHadoopservicesthatwillbeconfigured.KEYSASGRIDARCHITECTURECOMPONENTSSASMetadataServer–ASASservicesupporting,amongotherobjects,thelogicaltophysicalmappingofSASLogicalServerstoYARN.Inourexamples,wewillbeusingthelocalSASServer,SASGrid.SASGridControlServer–ASASservicerunningontheYARNResourceManagernode.ItiscalledbySASclientstocommunicatewithYARNResourceManagertonegotiateresourcesforSASjobs.SASObjectSpawner–ASASServicerunningontheYARNResourceManagernode.ItisusedtolaunchSAScontainswithYARN.

Figure4:ViewofSASManagementConsole,withexpandedSASGridLogicalServer.SASClients–includesSASbatchjobs,SASGSUB(abatchgridutility),andinteractiveClientslikeSASEnterpriseGuide.SASClientswill“Connect”and“Disconnect”fromaSASLogicalServer,likeSASGrid,definedinSASMetadataandshownaboveinFigure4.KEYHADOOPINTEGRATIONPOINTSFORSASGRIDYARNResourceManager–AHadoopYARNMasterServiceresponsibleforcontrollingglobalHadoopclusterresourceusage.ResourceManagerenablesmulti-tenancyandSLAs.ItisalsoresponsibleformonitoringNodeManagerState,submittingApplicationMasterrequests,verifyingcontainerlaunchandmonitoringApplicationMasterstate.

YARNNodeManager–ThisHadoopYARNWorkerNodeServicemanageslocalresourcesonbehalfoftherequestingservice.ItalsotracksnodehealthandcommunicatesstatustotheResourceManager.YARNCapacityScheduler–AHadoopYARNservice,whichcanbeconfiguredtoprovideJobSchedulingpoliciesforSLAs,Users,Groups,andResources.HadoopDataNodes–HadoopDistributedFileSystem(HDFS)storagenodes.KerberosService–TheHadoopclustermustbeKerborized.HADOOPMASTERNODEDECISIONSASASGridControlServerandSASObjectSpawnermustbedeployedonthesameHadoopMasterNodeastheYARNResourceManager.HADOOPWORKERNODEDECISIONSSASHOMEandSASCONFIGForeachHadoopWorkerNodewhichisacandidatetorunSASjobsmustbeconfiguredsothatSASHOMEandSASCONFIGareavailable.SASWORKandSASUTILItiscriticalthateachHadoopWorkernodewhichisacandidatetorunSASjobsisconfiguredcorrectly.AlargepartoftheI/OrequiredwhenrunningSASanalyticsistothescratchortemporarylocationsofSASWORKandSASUTIL.SASrequiredI/Othroughputforthesefilesystems,toprovidethenecessaryperformancetoaheavilyloadedsystem,is100MB/sec/core.AdequatesizingforSASWORKisalsonecessary.TraditionalStorageandComputeverseComputeOnlyWorkerNodesWithinHadoop,itisacommonpracticetohavedualpurposeworkernodeswhichrunmathorprogramsnearonthesamenodeswheretheHadoopdataresides.WithHadoop2.x,theconceptofdedicatedComputeOnlyHadoopWorkerNodesisanoption.ForSASGridManagerforHadoop,bothoptionsareanoption.ForComputeOnly,theseHadoopWorkerNodeswillnolongerhosttherequiredservicesanddataforHDFS,givingmorecomputingresourcesdedicatedtotheprogramsrunningonthesenodes.ThetradeoffforComputeOnlyHadoopNodesisthelossofHDFSdatalocality.YoursitesSASworkloadrequirementswilldeterminewhichtypeofWorkerNodestodeployforSASGridManagerforHadoop.

REALWORLDCONFIGURATIONEXAMPLEInthissection,wewillwalkthroughaSASGridManagerforHadoopconfigurationexercise.Forthisdeployment,ourHadoopClustercurrentlyhastwenty-eightHadoopWorkerNodes.Eachofthesenodeshas256GBofRAM.AftersubtractingtherequiredRAMneededtoruntheOSandadditionalkeyservices,thereis192GBofRAMleftforHadoopcontainers.ThismeansthatthetotalclusterwideRAMcapacityforHadoopContainers5.376TB.Wehavedecidedtoallocate50%ofthisRAMtoSASjobsor2.688TB.

TotalRAMPerCluster

Node

AvailableContainerRAM

PerNode

#WorkerNodesinCluster

TotalContainer

RAMAvailable

AmountofClusterRAMAllocatedto

SASQueue

TotalContainerRAMforSAS

Queue

256GB 192GB 28 5.376TB 50% 2.688TBTable1:TotalClusterYARNContainerRAMAvailableforSASUsersForourexercise,wehadpreviouslydeterminedthattheaveragenumberofSASbatchorinteractivesessionsforeachactive,concurrentuseratoursiteistwo.FromaHadoopperspective,foreachoftheseSASjobs,weknowthatYARNwillspawnsatleasttwocontainers(anApplicationMasterandaJobTaskContainer).AndonceSASjobsarerunninginHadoop,theycancalladditionalHadoop(ie.Pig,Hive,MapReduce)andSASSASHP(ie.prochpsummary,hpmeans,hpfreq)services.TheaveragenumberofadditionalHadoopContainersspawnedfromeachSASjobrunningonHadoopisfour.Thismeanswecananticipateeachactive,concurrentSASusertobeallocating,onaverage,eightYARNContainers.

Avg#ofBatchJobsorInteractiveSessionsper

SASUser

#ContainersPerJoborSession

Avg#ofAdditionalHadoopContainersSpawnedfrominitial

SASJobContainer

AvgTotal#ofContainersperSAS

user

2 2 4 8Table2:AverageNumberofContainersperSASUsersNowthatweknow,onaverage,eachactiveSASuserwillallocateeightcontainers,andwehave2.688TBoftotalclusterRAMavailable,weneedtodeterminethetotalnumberofconcurrentSASuserswhocanrunatanyonetimeonourHadoopcluster.OneadditionalfactortoconsideristhatnotallSASusersneedthesameamountofcomputingresources.Inourexercise,wehavethreeSASuserresourcecategories(Low,Medium,andHigh).EachuserfallingintotheLowcategory,bydefault,willallocate2GBYARNContainersfortheirjobs.TheSASMediumcategorywilldefaulttoaYARNContainerSizeof4GB.AndtheHighcategorywilldefaultto8GBContainersize.Withthesenumbersinmind,wewillbetohave134concurrentSASusersonoursystemasindicatedbelow.

SASAppType(UserType) ContainerSize

Anticipated%ofUsersTypeon

Server

AvailableCluster

MemoryforSASjobs

Max#ofContainers

Avg#ContainersPerSASUser

Total#ofSASUsers

Low(General/Analyst) 2GB 70% 1.881TB 940 8 117

Medium 4GB 20% 537GB 134 8 16

High 8GB 10% 268GB 33 8 4

Totals 2.688TB 1107 134Table3:BreakdownofSASApplicationTypestobeconfiguredforSASUsers

YARNCapacitySchedulerInthisexercise,weareallocating50%oftheHadoopCluster’sRAMtoSASUsers.WewillusetheYARNCapacitySchedulertosetupthispolicy.WeareabletosetupHierarchicalYARNQueues,includingsas94_queue,asafirstclusterwideconfigurationstep.

Figure5:YARNCapacitySchedulerLogicalView-SASQueue-50%HadoopClusterRAM

Below is a view of the YARN Capacity Scheduler UI, we will see that the sas94_queue has 50% cluster RAMallocated.SASUsersLinuxidsshouldalsobemappedtothisqueue.

Figure6:YARNCapacitySchedulerAdminView-SASQueue-50%HadoopClusterRAM

YARN Capacity SchedulerExample: 50% of Cluster RAM allocated to SAS Queue

ResourceManager

Scheduler

root

Adhoc30%

SAS50%

Mrkting20%

Dev10%

Reserved20%

Prod70%

Prod80%

Dev20%

P070%

P130%

Capacity Scheduler

HierarchicalQueues

OurHadoopclusterhasYARNMinimumContainerSizesetclusterwideto2GB.WearegoingtoneedtoimplementamechanismtoadjustContainersizebasedonSASUserresourceneeds.Wewillusethesasgrid-policy.xmlconfigurationfileandSASMetadataUserandGroupmappingstoaccomplishthistask.First,wewillreviewasasgrid-policy.xmlfilewhichwillensurethatSASusersobtaintherightamountofmemoryfortheirjobs.WithinthisSASGridconfigurationfile,youwillseethreeGridApplicationTypes:Low,Medium,andHigh.

SASGridPolicyFileExample<?xmlversion="1.0"encoding="UTF-8"standalone="yes"?><GridPolicydefaultAppType="low"> <GridApplicationTypename="normal"> <jobname>NormalJob</jobname>

<priority>20</priority> <nice>10</nice> <memory>2048</memory> <vcores>1</vcores> <runlimit>480</runlimit> <queue>default</queue> <hosts> <hostGroup>development</hostGroup> </hosts> </GridApplicationType> <GridApplicationTypename="low"> <jobname>SASLow</jobname> <priority>10</priority> <nice>0</nice> <memory>2048</memory> <vcores>1</vcores> <runlimit>480</runlimit> <queue>sas94_queue</queue> <hosts> <hostGroup>sas94_work</hostGroup> </hosts></GridApplicationType> <GridApplicationTypename="medium"> <jobname>SASMedium</jobname> <priority>10</priority> <nice>0</nice> <memory>4096</memory> <vcores>1</vcores> <runlimit>480</runlimit>

<queue>sas94_queue</queue> <hosts> <hostGroup>sas94_work</hostGroup> </hosts></GridApplicationType> <GridApplicationTypename="high"> <jobname>SASHigh</jobname> <priority>10</priority> <nice>0</nice> <memory>8192</memory> <vcores>1</vcores> <runlimit>480</runlimit> <queue>sas94_queue</queue> <hosts> <hostGroup>sas94_work</hostGroup> </hosts></GridApplicationType><HostGroupname="development"> <host>partnerd1worker-1.field.hortonworks.com</host> <host>partnerd1worker-2.field.hortonworks.com</host> <host>partnerd1worker-3.field.hortonworks.com</host> <host>partnerd1worker-4.field.hortonworks.com</host> <host>partnerd1worker-5.field.hortonworks.com</host></HostGroup><HostGroupname="sas94_work"> <host>partnerd1worker-4.field.hortonworks.com</host>

<host>partnerd1worker-5.field.hortonworks.com</host></HostGroup> </GridPolicy>Theabovesasgrid-policy.xmlfileisasample.ItillustrateshowwecansetupGridApplicationTypesfordifferentSASusergroups.NotethatwehavespecifiedaYARNContainermemoryandtheSAS94QueueforeachAppType.WethatwehaveconfiguredtheappropriateGridApplicationTypes,weneedawaytomapSASApplications(SASGridManagerClientUtility,SASEnterpriseGuide,others)totheseSASGridApplicationTypes.ThiscanbedoneusingtheGridOptionsSetMappingWizardwithintheSASManagementConsole.ThisrequireseachSASApplicationtoberegisteredtotheSASMetadataServerandconfiguredfor“IsGridEnabled”.Oncethisisdone,usethefollowingSASGridServerPropertiesOptionsTabforyourLogicalGridServer(SASGridinourcase)andconfigureSASusersandgroupstotheirappropriatedefaultGridAppTypefoundinsasgrid-policy.xmlabove.

Figure7:ConfiguringSASMetadataGroupstoSASGridApplicationTypes.SASClientIdleandRuntimeConfigurationsItisrecommendedthatyousetthefollowingthresholdsforSASusersasadefaultandadjustaccordingtoyoursidesworkloadneeds.ThiswillfreeupidleresourcesforotherHadoopusers.ItisrecommendedtobeveryconservativeinthesesettingssothatyoudonotinadvertentlydisruptSASuserproductivity.

o SASClientIdleTimeout–inourexample–wesetto120minutes§ YoucancontroltheamountoftimeaclientsuchasSASEnterpriseGuidecanstayconnected

butnotactiveusingtheInactiveClienttimeoutsetting.o SASRunLimitTimeout–inourexample–wesetto480minutes

§ Withinthesasgrid-policy.xmlfile,youcancontrolthisusingthe“runlimit=xx”GridApplicationTypeparameter.

Also,acoupleofotherHadoopparametersthatweshouldpaycloseattentiontoareMapReduceMinimumContainerSize,aswellastheengineyouarerunningforHive.WerecommenduseHive’sTezEngineinmostcasesasitprovidesin-memoryqueryexecutionthatisconsiderablyfasterthantheMapReduceengine.WearecurrentlyinvestigatingYARNNodeLabels(PolicyScheduling)andtheHiveLLAPengine(SQLQueryAcceleration)forfutureconfigurationconsiderations.

SASWORKLOADMIGRATIONCONSIDERATIONSSASGridManagerforHadoopshouldbeleveragedasacomplimenttoyourexistingSASinfrastructureandnotareplacement.WhilecertaintraditionalstoragearchitecturesprovideextremelyfastIOforSASworkloads,thereisasignificantcosttotheseinfrastructures.SelectingtherightSASworkloadsanddatatomovetoHadoopisa

criticallyimportantstepwhenmovingSASjobstoHadoop.Akeybusinessdriverisreducingoverallstoragecosts.AnotherkeydriveracceleratingtheperformanceofSASjobsthatrequireaccesstolargedatasetsinHadoop.ThemostexpensiveoperationforanybigdatajobismovinglargeamountsofdatafromstoragetoacomputetieroutsideofHadoop.WithHadoop,aprimarygoaldesigngoalforthelastdecadehasbeentomoveaprogramscomputeormathtowherethedataresides,inHadoop.Thisenablesavoidingcostlydatamovementoverthenetwork.Withourbusinessdriversinmind,SASshopsshouldinitiallyidentifyahandfulofSASjobsrunningoutsideofHadoopwhichhavesignificantlylargeephemeral(SASWORK)andpermanentstorage(SASdatasets,RDBMS,otherinboundandoutbounddatasets)needs.IftheSASjobdataiscurrentlystoredonexpensiveSANstorage,costsavingswillbesignificant.AnySASjobsusingBaseSASprocedureslikeProcSummary,MeansandFreqwhichsupportSASIn-DatabasePushdownshouldconsidered.AnySASjobsplanningtouseHadoopdatasetsshouldbeonthelist.WhenmovingexistingSASWorkloaddataintoHadoop,thereareseveralstorageoptions.ExistingSASdatacanbemovedintoHadoopDistributedFileSystem(HDFS)andthen,withinHive,aschemaonreadycanbeestablished.IftheexistingSASdatasetsanddatabasetablesarestoredinHive,asmentioned,SASInDatabasecapabilitiescanbeleveraged.ThiswillenabletheSASenginetohaveanopportunitytoimplicitlytranslateBaseSASStatprocedurestocomplexHiveQL,ultimately,movingmostofthemathtothedata.SASAccesstoHadoopisanaddonSASproductthatplaysacriticalroleinthismigration.SASuserswillneedtohavesomebasictrainingonhowtomosteffectivelyleveragetheHiveandHDFSlibnameenginesfromtheirSASjobs,aswellastheSASfilenamestatementtoHFDS.ThisbasicsyntaxchangeintheirprogramswillberequiredtotakeadvantagedirectlyofSASdatanowresidinginHadoop.

o LibnametoHive§ UsesaJDBCconnecttoHive§ ProvidesanopportunityfordistributedparallelcomputeonHive

• UsingSASInDatabasePushDowncapabilities

o LibnametoHDFS§ HDFSAPIsareleveragedfordirect,parallelreadandwritetoHadoop.

o FilenametoHDFS

§ SupportsinboundandoutboundreadandwriteofrawdatatoHadoop.

SASdoesincludesupportforHadoopClientsandRESTfulAPIs.TheSASAccesstoHadoopcanalsoallowasasprogrammertorunexplicitHDFS,Pig,Hive,andMapReducecodeinlinewithintheirSASjob.Inaddition,SAShasanSQLEnginewhichsupportsbothimplicitandexplicitPassThruHiveQLexecutiondirectlytotheHadoopcluster.

ADAYINTHELIFEOFASASUSERLEVERAGESASGRID WhenusingSASclients,itshouldbetransparenttotheSASusersfromafunctionalityperpectivethattheyareinteractingwithaSASGrid.HereisadetailedwalkthroughofaSASuserleveragingSASEnterpriseGuiderunningSASProcessFlowsandTasksinsideofHadoop.Below,aSASuserlogsintoSASEnterpriseGuide’sUI,andopensanexistingproject.Toexecuteworkfromthisproject,theusermustlaunchaSASWorkspaceServer.Inthiscase,iftheuserexpandsthe+SASGridonthelefthandsideofthescreen,SASGridManagerforHadoopwillrequestYARN’sResourceManagertolaunchaSASWorkspaceServer(WSS)insidetheHadoopCluster.

Figure8:Clickon+SASGridtolaunchSASWorkspaceServeronHadoopWorkerNodeThegreencheckmarknexttoSASGridbelowindicatesthattheSASEGusercanrunanyProcessFloworTaskinHadoop,becauseYARNhassuccessfullylaunchedtheSASWSSonaHadoopWorkerNode.

Figure9:Thegreenarrow(SASGrid)indicatesSASWorkspaceServerisestablishedinHadoop

IfwelookattheYARNUIbelow,typicallyreservedforHadoopAdministrators,wecanseeaSASEnterpriseGuide–WorkspaceServerinaRUNNINGstate.ThisindicatestotheHadoopAdministratorthataSASjobisrunningandhasreservedtwoYARNcontainersandiscurrentlyactiveontheHadoopcluster:

Figure10:YARNUI-SASEGWorkspaceServer(WSS)runninginsas94_queueonHadoop.Switchingtothe“Nodes”linkintheYARNUI(thisisasmallersampleclustertoillustrateourexampleonly),youcanseethetwoYARNcontainersassociatedwiththeSASWSS,runningin1GBYARNContainers.

Figure11:YARNUI–“Nodes”linkshowingtwo1GBcontainerssupportingtheSASWSS.

And,switchingtotheYARNUI–“Scheduler”link,wecanseeClusterMetricsbelow.

Figure12:YARNUI–“Scheduler”linkHadoopClusterMetrics. SwitchingbacktotheSASUserEGUI,theSASusercanseethataSASLibraryhasbeenassignedpointingtoHive.WecanalsoseeintheSASEGExplorerwindowonthebottomlefthandsideofthescreen,alltheHiveTablesavailabletotheSASuserwhichareassociatedwithSASlibrefHIVE_TPC.Atthispoint,theSASEGusercanrunSASanalytictasksdirectlyagainstHivetablesusingtheHivedatabaseschematpcds_bin_partitioned_orc_10.

Figure13:SASEnterpriseGuide–LibraryHIVE_TPChasbeenassignedtoHive.

Atthispoint,theSASEGusercanrunanytraditionalSASjoborcode.TheSASusercanalsocallanyHDPservice(i.e.HDFS,Pig,Hive,MapReduce)in-linewithinthisSASworkspaceserver.Ifinstalled,theSASusercanalsocallSASHighPerformanceAnalytics,whichcanrunonYARNintheclusteraswell.WhentheSASEGuserissuesa“Disconnect”(seebelow),thiswillinitiatearequesttotheSASWorkspaceServertoshutdown.YarnwillreleasethecontainerthatwasusedforrunningtheSASWorkspaceServerandreclaimtheresourcesassociatedwiththatcontainer.

Figure14:SASEnterpriseGuide–“Disconnect”fromHadoop. Oncethe“Disconnect”iscompleteinEG,theHadoopAdministratorwillseetheSASEnterpriseGuideWorkspaceServerisnow“FINISHED”.(Seebelow).AdisconnectwillalsooccurautomaticallyifaSASEGusershutsdowntheUI.

Figure14:YARNUI-SASEGWorkspaceServer(WSS)FINISHEDstatusonHadoop. ForsitesinterestedinmovingtraditionalSASworkloadsco-locatedontoHadoop,SASGridManagerforHadoopisanidealsolutiontomeetthisneed.Inthispaper,wehavediscussedthebenefitsofmovingtoSASGridManageronHadoop,requiredarchitectureandrecommendedconfigurations.WehavealsosharedaseamlessSASuserexperiencewithyou.IfyouwouldliketolearnmoreaboutSASGridManagerforHadoop,pleaseseethefollowinglinksbelow.

REFERENCES“SASAnalyticsonYourHadoopClusterManagedbyYARN.”July14th,2015.Availableathttp://support.sas.com/rnd/scalability/grid/hadoop/SASAnalyticsOnYARN.pdf.SASInstituteInc.“IntroductiontoGridComputing.”Availableathttp://support.sas.com/rnd/scalability/grid/.“HortonworksApacheHadoopYARNOverviewWebPage.”Availableathttps://hortonworks.com/apache/yarn/.“ApacheHadoopYarn.”TheApacheSoftwareFoundation.June29,2015.Availableathttps://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.“Introducing-SASÒGridManagerforHadoop.”SASGlobalForumPaperSAS6281-2016Availableathttp://support.sas.com/resources/papers/proceedings16/SAS6281-2016.pdf

CONTACTINFORMATIONYourcommentsandquestionsarevaluedandencouraged.Contacttheauthorat: MarkLochbihler PrincipalArchitect,HortonworksInc. mark.lochbihler@hortonworks.comSASandallotherSASInstituteInc.productorservicenamesareregisteredtrademarksortrademarksofSASInstituteInc.intheUSAandothercountries.ÒindicatesUSAregistration.