Hadoop Data Integration Benchmark
Product Profile and Evaluation:
RedPoint Data Management for Hadoop
By William McKnight and Jake Dolezal August 2016 Sponsored by RedPoint Global Inc.
MCG Global Services Hadoop Integration Benchmark
© MCG Global Services 2016 www.mcknightcg.com Page 2
Table of Contents
Executive Summary
Hadoop in the Enterprise
The Evolution of Hadoop Data Integration
RedPoint Product Profile
    Company Profile
Benchmark Overview
Benchmark Setup
    Virtual Server Environment
        RedPoint Instances
        Relational Database Instance
        Hadoop Cluster
    Source Data
        Relational Data Source
        Web-Click Log
        Coupon Log
        Name and Address CSV File
        Data Volume
    Data Management Jobs
        Web-Coupon Log on Hadoop Join with Orders Job Design
        Address Standardization Job Design
        Name Matching Job Design
Benchmark Results
    Use Case 1: Web-Coupon Log on Hadoop Join with Orders
        Execution Time and Actual-Versus-Expected Results
        Vendor Comparison
    Use Cases 2 and 3: Address Standardization and Name Matching
        Execution Time and Actual-Versus-Expected Results
    Perceived Usability Assessment
Conclusion
About MCG Global Services
About RedPoint Global
Executive Summary
This benchmark is part of research into the performance of loads on Hadoop clusters, an increasingly important platform for storing the data that powers corporate strategies. The intent of the benchmark’s design is to simulate a set of basic load scenarios to answer some fundamental business questions that organizations from nearly any industry might encounter and ask. For a growing industry, there is a surprising variety of approaches and vendor architectures for Hadoop-loading products (such as MapReduce, Spark, Spark through Hive, YARN, NiFi, Sqoop, Sqoop interfaces, Flume interfaces, and interfaces to command-line HDFS). Based on the differences in the results we’ve found, this architectural foundation greatly influences performance.

RedPoint Data Management for Hadoop is based on YARN, a resource negotiator (a.k.a. operating system), which is the foundation of Hadoop 2.0. In the case of our queries, RedPoint was able to complete workloads in a very short timeframe, well within enterprise requirements and faster than what we imagined possible. Compared to a previous benchmark, one workload ran 550% faster than a product using Spark and 1900% faster than a product using MapReduce. RedPoint’s platform, continually fine-tuned for over a decade, has achieved unparalleled high performance in utilizing YARN without the overhead of other Hadoop components. This paper further explores and investigates these results.
Hadoop in the Enterprise
Companies are clamoring to capture as much data as possible and harness that data as meaningful information to drive their businesses. Today, this information, or “big data,” would include all data generated by a company’s digital strategy. It would also include all data that past technologies were unable to record and analyze for business use. Big data is not only controllable today, but its implementation is also essential in conducting business.

Machines are primarily responsible for big data. Machine data contains critical insights; it allows us to conduct unprecedented triangulation of physical objects. Unlike traditional structured data (for example, data stored in a traditional relational database for batch reporting), machine data is non-standard, highly diverse, dynamic, and high-volume. We can build a comprehensive picture of activity when we correlate and visualize the related events across disparate sources. The challenge is in bringing the data together.

Companies that can capture and harness this data will benefit accordingly. In other words, the more companies store and process data, the more success they can tap into. Businesses across industries show clear, upward trends in spending on big data, and it is projected to be the top budget item in many sectors.

Hadoop is a technology that was formed in 2006 to meet the needs of the Silicon Valley data elite. Previously, these companies had data needs that far surpassed budgets for the database management systems (DBMS) out there. The scale they were using was another order of magnitude away from the target for the DBMS. And the timing of the scale was not certain, given the variability of the data.

Hadoop is quickly being adopted by businesses from start-up companies to the Fortune 1000 because it scales very well and relatively cheaply. This means you do not have to accurately predict the data size at the outset.

Hadoop is a great fit for many types of data in an organization. Sensor data, clickstream data, social data, server logs, smart grid data, electronic medical records, video and pictures, unstructured text, geolocation data, high-volume data, and “cold” enterprise data are all a great fit in the Hadoop open-source software framework for storing data on clusters of commodity hardware. Scale-out file systems that may be lacking in functionality, but can handle modern levels of complex data, are here to stay. Hadoop is the epitome of that idea and an ecosystem is building up around it.

While there used to be little overlap between reasonable selection of Hadoop and reasonable selection of a DBMS, that has changed. Hadoop has withstood the test of time and has grown to the point where quite a few applications architected on a DBMS will be moved to Hadoop. The cost savings, combined with the ability to execute the complete application, will be persuasive.

It is especially useful as a collection point for post-operational data across the enterprise, not all of which may be destined for a relational data warehouse. This “data lake” can be left at low refinement, which is just fine for the emerging class of data scientists and others in need of deep insight.

Traditionally, data preparation has consumed an estimated 80 percent of legacy data development efforts. Loading Hadoop clusters will continue this tradition as a top job at a range of companies. Luckily, it is possible to lessen the cost and risk of this work with a robust data integration tool.
The Evolution of Hadoop Data Integration
In the early days, low-performing, open source vendor architectures like Sqoop, Flume, command-line HDFS, and Hive were limiting. Since then, numerous approaches and tools have arisen to meet the Hadoop data integration challenge.

MapReduce was the original [and in Hadoop 1.0, the only] data-processing engine for Hadoop. However, it has proved unwieldy and unable to meet increasingly complex workloads, suffering from issues such as an inability to scale index-based lookups.

Spark emerged as a replacement for MapReduce. By utilizing a pool of persistent "executor services," it can nearly eliminate inter-stage startup costs, one of MapReduce's big weaknesses. In addition, Spark uses Resilient Distributed Datasets (RDDs) for inter-stage storage. RDDs are a form of HDFS-backed memory images that combine the fast access of memory with the fault tolerance of HDFS. Spark can be used to achieve very fast throughput for certain workloads.

Spark is also being leveraged to improve the performance of Hive processing, specifically HQL queries. So-called "Hive on Spark" has the ability to accelerate Hive itself, but doesn't serve as a general data-integration platform.

But even Spark has its limitations. The amount of memory required to process a dataset can be an order of magnitude larger than the input dataset size. If less memory is available due to various factors (such as cluster load, node downtime, or unexpected data scale), Spark's performance degradation curve can be severely non-linear, even becoming a “cliff” beyond which jobs simply fail. It is increasingly impossible to expect a “reserved” cluster for Hadoop activity, which means a cluster’s memory resources are increasingly limited and unpredictable. Still, Spark would be the number one choice for most workloads if these were your only options.
However, by applying engineering to the cluster to achieve higher-performing results with true commodity nodes, without the added memory, some have improved upon preceding models. For example, RedPoint uses a native engine on top of YARN, a resource negotiator and operating system, which is the foundation of Hadoop 2.0. It is the layer that integrates and manages resources, including storage resources, CPU, I/O, and memory. RedPoint is based around YARN, which runs in the cluster. By leveraging YARN, it can run in massive parallelism without the assumption that all the data must fit into memory. Workload performance is more predictable as well, given its lack of dependency on memory. Additionally, the degradation curve when faced with limited resources is more gentle.

RedPoint Data Management™ for Hadoop leverages RedPoint’s 10-year legacy with the high-performance RedPoint Data Management data integration tool. It uses a visual-design dataflow model, allowing non-programmers to create complex data transformations. Organizations with existing data staff should find this technology to have a faster and more affordable adoption curve than when hiring for Spark.
RedPoint Product Profile
Company Profile

Product Name              RedPoint Data Management for Hadoop
Initial Launch            2013
Current Release and Date  7.3.1, June 2016
Key Features              Based on YARN; company with a 10-year legacy with the high-performance RedPoint Data Management data integration and data quality tool; predictable high performance
Hadoop DI Competitors     Informatica, Pentaho, Syncsort, Talend
Company Founded           2006
Focus                     Empower data-driven organizations by unlocking the full value of their data to drive consumer engagement and profitable, sustained growth.
Employees                 120
Headquarters              Wellesley Hills, MA
Ownership                 Private

Benchmark Overview

The intent of the benchmark’s design is to simulate a set of basic scenarios to address some fundamental business problems that an organization from nearly any industry sector might encounter and ask. These common business questions, formulated for the benchmark from our experience working with a range of clients over the past decade, are:

• What impact do customers’ views of pages and products on our website have on sales? What is the average number of page views before customers make a purchase decision (online or in-store)?
• How do our coupon promotional campaigns impact our product sales or service utilization? Do our customers who view or receive our coupon promotions come to our website and buy more or additional products than they may have otherwise purchased?
• How can we identify and remove potential duplicates from a customer data source of questionable data quality?
• How can we standardize customer mailing addresses to improve the quality of our geographic data for same-household recognition and for the efficacy of our mail-marketing campaigns?

The benchmark was designed to demonstrate how a company might approach addressing these business problems by bringing different sources of information into play. We also have taken the opportunity to show how Hadoop can be leveraged, because some of the data of interest in these data management cases are likely of a large volume and non-relational or semi- to un-structured in nature. In these cases, using Hadoop would be the best course of action for clients seeking to answer these questions.

Since it is highly probable that the data required resides in different sources, the benchmark was also set up for data integration. Some of these sources are also probably not being consumed and aggregated into an enterprise data warehouse due to their high volume and the difficulty in integrating voluminous amounts of semi-structured data into a traditional data warehouse. Thus, the benchmark was designed to mimic common scenarios and the challenges faced by organizations seeking to integrate data to address these and similar business problems.
Benchmark Setup
The benchmark was executed using the following setup, environment, standards, and configurations.
Virtual Server Environment
Feature              Selection
Hadoop Distribution  Hortonworks Data Platform 2.4.2 (HDFS, MapReduce2, YARN, Tez, Hive, Pig, ZooKeeper, and Ambari installed)
EC2 Instance         Memory optimized m3.xlarge (4 vCPUs, 16 GB Memory)
OS                   CentOS 6.7
Source Data Types    Text-based log files, a relational database, and comma-separated value (CSV) files
Data Volume          20 GB (log files); 7,500,000 rows (RDBMS); and 10,000,000 lines (CSV)
TPC-H Scale Factor   1x
RDBMS                PostgreSQL 9.4
Java Version         1.8.0_91

Figure 1 and Table 1: Server Environment and Setup
The benchmark was set up using Amazon Web Services (AWS) EC2 instances deployed into an AWS Virtual Private Cloud (VPC) within the same Placement Group. According to Amazon, all instances launched within a Placement Group have low-latency, full-bisection, 10 Gigabit-per-second bandwidth between instances.
RedPoint Instances
The RedPoint Client EC2 instance was a general-purpose t2.large with 2 vCPUs and 8 GB of RAM. This Windows instance ran Microsoft Windows Server 2012. On this instance, we installed the RedPoint Data Management for Hadoop Client version 7.3.1.

The RedPoint Execution and Site Server EC2 instance was a general-purpose m4.xlarge machine with 4 vCPUs and 16 GB of RAM running CentOS 6.7. On this instance, we installed the RedPoint Data Management Execution and Site Servers version 7.3.1.
Relational Database Instance
The relational source for the benchmark was an m4.xlarge EC2 instance running CentOS 6.7. We installed PostgreSQL 9.4 on this server.
Hadoop Cluster
The Hadoop cluster for the benchmark consisted of 3 identical nodes, each an m4.xlarge EC2 instance running CentOS 6.7. We installed the Hortonworks Data Platform Hadoop distribution. Using Ambari, we installed the following Hadoop services: HDFS, MapReduce2, YARN, Tez, Hive, Pig, and ZooKeeper. This is a minimum viable product (MVP) setup.
Source Data

We created the data sources used in the benchmark to mimic real-life use cases:

• Relational data
• Web-click log
• Coupon log
• Customer names and addresses
Relational Data Source
The relational source for the benchmark (stored in PostgreSQL) was constructed using the Transaction Processing Performance Council TPC Benchmark H (TPC-H) Revision 2.17.1 Standard Specification. The TPC-H database was constructed to mimic a real-life point-of-sale system according to the entity-relationship diagram and the data type and scale specifications provided by the TPC-H.

Figure 2: TPC-H ER Diagram © 1993-2014 Transaction Processing Performance Council

We populated the database with scripts that were seeded with random numbers to create the mock data set. The TPC-H specifications have a scale factor by which the record count for each table is derived. For this benchmark, we selected a scale factor of 1. In this case, the TPC-H database contained 1.5 million records in the ORDERS table and 6 million records in the LINEITEM table.
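As a rough illustration of how the scale factor drives table sizes, the sketch below multiplies the scale-factor-1 row counts reported in this paper by a chosen factor. It assumes linear scaling, which is only approximate for LINEITEM in the TPC-H specification, and the function name `expected_rows` is our own.

```python
# Row counts at scale factor 1, as reported in this benchmark.
BASE_ROWS = {"ORDERS": 1_500_000, "LINEITEM": 6_000_000}


def expected_rows(scale_factor):
    """Approximate row counts per table, assuming linear scaling (an assumption)."""
    return {table: int(rows * scale_factor) for table, rows in BASE_ROWS.items()}


if __name__ == "__main__":
    counts = expected_rows(1)
    print(counts, "total:", sum(counts.values()))
```

At a scale factor of 1 the two tables together hold 7,500,000 rows, which matches the RDBMS volume listed in the server environment table.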
Web-Click Log
A web-click log was generated in the same fashion as a standard Apache web server log file. The log file was generated using scripts to simulate two types of entries: 1. completely random page views (seeded by random numbers) and 2. web clicks that correspond to actual page views of ordered products (seeded by random records in the TPC-H ORDERS and LINEITEM tables).

The “dummy” or “noise” web-log entries appeared in a variety of possibilities but followed the same format, consistent with an Apache web-click log entry. All data were randomly selected. For example:

249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] "GET /images/footer-basement.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/index.php" "Windows NT 6.0"

The “signal” web-log entries that corresponded to (and were seeded with) actual ORDERS and LINEITEM records had the same randomness as the “dummy” entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The segment below represents an entry that potentially corresponds to an actual order:

154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] "GET /images/side-ad.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/product-search.php?partkey=Q44271" "Android 4.1.2"

The web-click log file contained 64,000,000 lines and was 5.4 GB in size. There were randomly inserted web-click entries that corresponded to certain LINEITEM and ORDERS records. Approximately 1 in 1,000 of the web-click log entries corresponded to orders. The rest of the entries were random.
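To make the difference between “noise” and “signal” entries concrete, here is a minimal Python sketch that parses one web-click line and pulls the `partkey` value out of the referrer URL when one is present. The regular expression and the function name `parse_web_click` are our own assumptions for illustration; this is not how the benchmark jobs read the log.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed layout: ip, identity, user, [timestamp], "request", status, size,
# "referrer", "user agent" (taken from the sample lines in this section).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<datestr>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

NOISE_LINE = '249.225.125.203 - anonymous [01/Jan/2015:16:02:10 -0700] "GET /images/footer-basement.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/index.php" "Windows NT 6.0"'
SIGNAL_LINE = '154.3.64.53 - anonymous [02/Jan/2015:06:03:09 -0700] "GET /images/side-ad.png HTTP/1.0" 200 2326 "http://www.acmecompany.com/product-search.php?partkey=Q44271" "Android 4.1.2"'


def parse_web_click(line):
    """Return the parsed fields plus 'partkey' (None for noise entries)."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not follow the assumed layout
    fields = match.groupdict()
    query = parse_qs(urlparse(fields["referrer"]).query)
    fields["partkey"] = query.get("partkey", [None])[0]
    return fields


if __name__ == "__main__":
    for sample in (NOISE_LINE, SIGNAL_LINE):
        parsed = parse_web_click(sample)
        print(parsed["ip"], parsed["datestr"], parsed["partkey"])
```

Only entries that carry a `partkey` can later be matched against LINEITEM.L_PARTKEY, which is why roughly 999 in 1,000 lines drop out of the join.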
Coupon Log
A coupon log was generated in the same fashion as a customized Apache web server log file. The coupon log was designed to mimic a special-case log file generated whenever a potential customer viewed an item because of a click-through from a coupon ad campaign. Again, the log file was generated using scripts to simulate two types of entries: 1. completely random page views (seeded by random numbers) and 2. page views that correspond to actual page views of ordered products by actual customers via the coupon ad campaign (seeded by random records in the TPC-H ORDERS and LINEITEM tables).

The “dummy” or “noise” coupon log-entry data were randomly selected. The “signal” coupon log entries that corresponded to, and were seeded with, actual ORDERS and LINEITEM records had the same randomness as the “dummy” entries, except that actual LINEITEM.L_PARTKEY values and corresponding ORDERS.O_ORDERDATE values from the TPC-H database were selected to create records representing a page view of an actual ordered item on the same day as the order. The segment below represents an entry that potentially corresponds to an actual order:

49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] "GET /images/header-logo.png HTTP/1.0" 200 75422 "http://www.acmecompany.com/product-view.php?partkey=S22211" "https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" "Windows Phone OS 7.5"

The coupon log file contained 16,000,000 entries and was 14.3 GB in size. There were randomly inserted coupon entries that corresponded to certain LINEITEM and ORDERS records. Approximately 1 in 1,000 of the coupon log entries corresponded to orders. The rest of the entries were random.
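The coupon entry carries one more quoted field than the web-click entry: the coupon tracker URL. The sketch below shows one way to extract the coupon and customer identifiers from it. It assumes exactly four quoted fields per line, in the order seen in the sample above, and the function name `parse_coupon_entry` is hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs

QUOTED_FIELD = re.compile(r'"([^"]*)"')

COUPON_LINE = '49.243.50.31 - anonymous [01/Jan/2015:18:28:14 -0700] "GET /images/header-logo.png HTTP/1.0" 200 75422 "http://www.acmecompany.com/product-view.php?partkey=S22211" "https://www.coupontracker.com/campaignlog.php?couponid=LATEWINTER2015&customerid=C019713&trackingsnippet=LDGU-EOEF-LONX-WRTQ" "Windows Phone OS 7.5"'


def parse_coupon_entry(line):
    """Extract ip, partkey, couponid and customerid from a coupon log line."""
    quoted = QUOTED_FIELD.findall(line)
    if len(quoted) != 4:
        return None  # not the assumed request / product URL / coupon URL / agent layout
    _request, product_url, coupon_url, _agent = quoted
    product_query = parse_qs(urlparse(product_url).query)
    coupon_query = parse_qs(urlparse(coupon_url).query)
    return {
        "ip": line.split(" ", 1)[0],
        "partkey": product_query.get("partkey", [None])[0],
        "couponid": coupon_query.get("couponid", [None])[0],
        "customerid": coupon_query.get("customerid", [None])[0],
    }


if __name__ == "__main__":
    print(parse_coupon_entry(COUPON_LINE))
```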
Name and Address CSV File

The customer name and address data was in a comma-separated value file format and stored in the Hadoop Distributed File System on our cluster. The layout of the file is demonstrated by the first few lines of the 10 million rows:

"NAME","ADDRESS","CITY","STATE","ZIP","PHONE","ID"
CELESTE A ZIENUK,125 MINOT AVE,EAST WAREHAM,MA,02538,,100000022
SEBASTIAO C BARBOSA,15 HOOSAC ST,ADAMS,MA,01220,,100000064
GREG S STURGEON,1640 ALVIN LN,BROOKFIELD,WI,53045,,100000075
RENAE BATTISTELLA,15 COMMOMWEALTH AVE,QUINCY,MA,02169,,100000080

The names were randomly generated from a generic name database. The addresses are real addresses. However, just over 2 million of the addresses were “dirty,” i.e., not up to USPS standards. Since RedPoint uses a CASS (Coding Accuracy Support System) standardization module validated by the United States Postal Service (USPS), it was necessary to correct and match US street addresses for these 2 million entries.
Data Volume

Data Set             Type        Location    Rows        Size on Disk
Web Log              Apache Log  HDFS        64,000,000  5.5 GB
Coupon Log           Apache Log  HDFS        16,000,000  14.3 GB
Orders               RDBMS       PostgreSQL   1,500,000  N/A
Line Items           RDBMS       PostgreSQL   6,000,000  N/A
Names and Addresses  CSV         HDFS        10,000,000  0.6 GB

Table 2: Benchmark source data volumes
Each of the data sources (the TPC-H database, log files, and customer address CSV file) was also scaled to different scale factors, so that the integration routines (described in the next section) could be executed against data sources of various sizes.
Data Management Jobs

The use case of the benchmark was designed to demonstrate real-life data management scenarios where companies desire to integrate data from their transactional systems with unstructured and semi-structured data. The benchmark demonstrates this by executing routines that:

• Integrate the TPC-H relational source data with the individual log files
• Standardize customer addresses
• Identify duplicate customer records

The following data management and integration routines were created for the benchmark. In all cases, best practices were observed to optimize the performance of each job.
Web-Coupon Log on Hadoop Join with Orders Job Design
The purpose of the Web-Coupon Log on Hadoop Join with Orders job was to test the capability of the vendor software to efficiently combine a variety of data from multiple sources, both on and off Hadoop. Figure 3 represents the job design that was created in the RedPoint Data Management Client.

RedPoint offers a Parallel Section tool with inputs that define all the splittable data available to the Parallel Section transforms. Splittable data is then divided up among a set of tasks to be processed in parallel. Input tools within the Parallel Section tool's processing area read their entire input data in each task and are used to define and drive data parallelism.

Within the Hadoop Parallel Section, two CSV input sources were read: Web Log and Coupon Log.

The Number Records tool was used to generate a sequence of numeric identifiers for individual records in each CSV input row.

The Calculate tool was used to convert the string Apache log date to a date format with the RedPoint ScanDateTime function:

ScanDateTime(Trim(DATESTR, "[ "), "DD/Mmm/YYYY:HH:mm:ss")
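For readers who do not have the RedPoint tool at hand, the same conversion can be approximated in Python. This is a sketch of an equivalent, not the RedPoint function itself: it assumes DATESTR holds a value such as `[01/Jan/2015:16:02:10` (the timestamp field without the time zone offset) and that month abbreviations are in English.

```python
from datetime import datetime


def scan_apache_datetime(datestr):
    """Strip leading '[' and spaces, then parse DD/Mmm/YYYY:HH:mm:ss."""
    cleaned = datestr.strip("[ ")
    # %b matches abbreviated month names in the current locale; an English
    # (or default C) locale is assumed here.
    return datetime.strptime(cleaned, "%d/%b/%Y:%H:%M:%S")


if __name__ == "__main__":
    print(scan_apache_datetime("[01/Jan/2015:16:02:10"))
```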
Figure 3: The Web-Coupon Log on Hadoop Join with Orders Job Design

The Select tool is similar to the SQL SELECT clause. We used this tool to select only a few necessary fields from the log inputs. The selected set of fields was used for the join and the output table.
The Join tool accepts two inputs, Left and Right, and matches records from both inputs on a single key field or column. We used the Cartesian Join option to combine the matched Left (Web Log) and Right (Coupon Log) records into a single "wide" record containing all fields from both inputs. This function is similar to an SQL join.

Web Log    Coupon Log    Join    Output
IP         IP            Yes     Yes
PARTKEY    PARTKEY       Yes     Yes
DATE       DATE          Yes     Yes
           COUPONID      No      Yes
           CUSTOMERID    No      Yes

Table 3: Fields selected from the Web and Coupon logs used for the Join and output
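The join described above can be pictured as a hash join on the fields marked for the Join in Table 3. The sketch below is a generic in-memory illustration under that assumption, with lowercase field names of our own choosing; it is not RedPoint's implementation and ignores parallelism entirely.

```python
from collections import defaultdict

JOIN_FIELDS = ("ip", "partkey", "date")  # assumed from Table 3


def join_logs(web_rows, coupon_rows, key_fields=JOIN_FIELDS):
    """Combine every matching web/coupon pair into one wide record."""
    # Index the coupon rows by their join key.
    index = defaultdict(list)
    for coupon in coupon_rows:
        index[tuple(coupon[f] for f in key_fields)].append(coupon)

    joined = []
    for web in web_rows:
        key = tuple(web[f] for f in key_fields)
        # Each web row pairs with every coupon row sharing its key.
        for coupon in index.get(key, []):
            joined.append({
                **web,
                "couponid": coupon["couponid"],
                "customerid": coupon["customerid"],
            })
    return joined


if __name__ == "__main__":
    web = [
        {"ip": "1.1.1.1", "partkey": "Q1", "date": "2015-01-02"},
        {"ip": "2.2.2.2", "partkey": "Q2", "date": "2015-01-02"},
    ]
    coupon = [
        {"ip": "1.1.1.1", "partkey": "Q1", "date": "2015-01-02",
         "couponid": "LATEWINTER2015", "customerid": "C019713"},
    ]
    print(join_logs(web, coupon))
```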
The resulting output completed the preceding Parallel Section within Hadoop. However, while these parallel tasks were processing, the RedPoint Execution Server was also processing the RDBMS input task.

We used the RDBMS Input tool to read data from the PostgreSQL TPC-H database and tables by executing the following query:

SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
FROM LINEITEM
LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY;
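The behavior of this query is easy to try on a toy data set. The sketch below runs the same statement against an in-memory SQLite database purely for illustration; the benchmark itself used PostgreSQL 9.4, and the three LINEITEM rows and one ORDERS row here are invented.

```python
import sqlite3

QUERY = """
SELECT L_ORDERKEY, L_PARTKEY, O_CUSTKEY, O_ORDERDATE
FROM LINEITEM
LEFT OUTER JOIN ORDERS ON L_ORDERKEY = O_ORDERKEY;
"""


def run_demo():
    """Run the benchmark query on a tiny, made-up data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ORDERS (O_ORDERKEY INTEGER, O_CUSTKEY INTEGER, O_ORDERDATE TEXT);
        CREATE TABLE LINEITEM (L_ORDERKEY INTEGER, L_PARTKEY INTEGER);
        INSERT INTO ORDERS VALUES (1, 100, '2015-01-02');
        INSERT INTO LINEITEM VALUES (1, 44271);
        INSERT INTO LINEITEM VALUES (1, 22211);
        INSERT INTO LINEITEM VALUES (2, 555);
    """)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Because the join is a LEFT OUTER JOIN from LINEITEM, a line item with no matching order still appears in the result, with empty customer key and order date columns.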
We attached a Data Viewer to the output of the final Join between the joined Web-Coupon log Hadoop output and the RDBMS to inspect the resultant data set. The resulting execution times and expected output are discussed in the next section.
Address Standardization Job Design
The purpose of the Address Standardization job was to assess the ability of the RedPoint platform to quickly and accurately detect and correct malformed US postal addresses in a single source of data on Hadoop. Figure 4 represents the job design that was created in the RedPoint Data Management Client.
Again, the RedPoint Parallel Processing Container was used to take advantage of the multiple-thread capacity of our Hadoop cluster.

The 10-million-item customer name and address CSV file was used as the primary input. For this job, we set the workload to be split by partition and used the ZIP Code as the partition field. This made the standardization more efficient by organizing the records. We also set the Partition Mode to Segment, because a Segment partition is faster than one based on a sort, according to the vendor’s documentation.
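The idea of splitting a workload by a partition field can be sketched generically: records with the same ZIP Code are routed to the same task, with no sort required. The hash-based routing below is our own assumption for illustration and should not be read as a description of RedPoint's Segment mode.

```python
import zlib
from collections import defaultdict


def partition_by_zip(records, num_tasks):
    """Route each record to a task chosen by hashing its ZIP field."""
    buckets = defaultdict(list)
    for record in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        task = zlib.crc32(record["ZIP"].encode("utf-8")) % num_tasks
        buckets[task].append(record)
    return buckets


if __name__ == "__main__":
    sample = [
        {"NAME": "CELESTE A ZIENUK", "ZIP": "02538"},
        {"NAME": "SEBASTIAO C BARBOSA", "ZIP": "01220"},
        {"NAME": "GREG S STURGEON", "ZIP": "53045"},
        {"NAME": "RENAE BATTISTELLA", "ZIP": "02169"},
    ]
    for task, rows in sorted(partition_by_zip(sample, 3).items()):
        print(task, [r["ZIP"] for r in rows])
```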
We used the RedPoint AO Address Quality tool to provide the address correction, parsing, and standardization. You can enable geocode assignment with a single option. For this workload, we loaded the USPS CASS-certified compressed tar file (tgz) right onto HDFS, and the RedPoint Execution Server was able to bring it directly into the Parallel processing segment of the job. The tool went through the data set and standardized the CSV file.

Next, we used the Filter tool to select only those addresses that were standardized and changed.

Again, we attached a Data Viewer to the output of the parallel Hadoop process to inspect the resultant data set. The resulting execution times and actual-versus-expected output are discussed in the next section.
Figure 4: The Address Standardization Job Design

Name Matching Job Design

The purpose of the Name Matching job was to assess the ability of the platform to quickly and accurately detect potential duplicate customer records by name and address within a single source of data on Hadoop. Figure 5 represents the job design created in the RedPoint Data Management Client. Once again, the RedPoint Parallel Processing Container was used to take advantage of the multiple-thread capacity of our Hadoop cluster.

The 10-million-item customer name and address CSV file (the same one used in the Address Standardization job) was used as the primary input. For this job, we set the workload to be split by partition and used the ZIP Code as the partition field. Since the address is important to identifying matches, the ZIP was an efficient means of getting potential matches grouped closer together, instead of in random order. We also set the Partition Mode to Segment for performance purposes, just as we did in the Address Standardization job.
We used the AO Consumer Match macro to match individuals using name and address information; in this case, we set the segmentation to ZIP + address parts. The AO Consumer Match can also be used to match the individual (full name), the family (last name only), or by address (no name components). It even has additional parameters designed to match female individuals who may have changed their surnames. We used the default scores produced by the matching algorithm and did not fine-tune them in any way.
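To give a feel for what segment-then-match means, the sketch below groups records by ZIP and address and then compares names within each group using a simple string-similarity score. Everything in it is an assumption made for illustration: the scoring uses Python's `difflib`, the 0.85 threshold is arbitrary, and none of it reflects the AO Consumer Match algorithm or its default scores.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_duplicates(records, threshold=0.85):
    """Return (id_a, id_b, score) for similar names sharing ZIP and address."""
    # Segment: only records with the same ZIP and address are compared.
    segments = defaultdict(list)
    for record in records:
        key = (record["ZIP"], record["ADDRESS"].upper().strip())
        segments[key].append(record)

    pairs = []
    for segment in segments.values():
        for a, b in combinations(segment, 2):
            score = SequenceMatcher(None, a["NAME"].upper(), b["NAME"].upper()).ratio()
            if score >= threshold:
                pairs.append((a["ID"], b["ID"], round(score, 2)))
    return pairs


if __name__ == "__main__":
    sample = [
        {"NAME": "GREG S STURGEON", "ADDRESS": "1640 ALVIN LN", "ZIP": "53045", "ID": "1"},
        {"NAME": "GREG STURGEON", "ADDRESS": "1640 Alvin Ln", "ZIP": "53045", "ID": "2"},
        {"NAME": "RENAE BATTISTELLA", "ADDRESS": "15 COMMOMWEALTH AVE", "ZIP": "02169", "ID": "3"},
    ]
    print(candidate_duplicates(sample))
```

Segmenting first keeps the number of pairwise comparisons small, which is the same motivation given above for partitioning on ZIP.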
Next, we used the Filter tool to remove unmatched records from the data output.

Then, we used the Calculate tool to offset the group identifier produced by the AO Consumer Match tool by task number. This made the identifiers globally unique.
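Group identifiers produced independently inside each parallel task will collide unless they are offset. One common way to do this is sketched below; the stride value and the function name are hypothetical, since the paper does not give the actual expression used in the Calculate tool.

```python
# Hypothetical upper bound on the number of groups any single task can produce.
TASK_STRIDE = 1_000_000_000


def global_group_id(task_number, local_group_id, stride=TASK_STRIDE):
    """Combine a task number and a per-task group id into one unique id."""
    if not 0 <= local_group_id < stride:
        raise ValueError("local_group_id must be in [0, stride)")
    return task_number * stride + local_group_id


if __name__ == "__main__":
    # Group 7 from task 0 and group 7 from task 1 no longer collide.
    print(global_group_id(0, 7), global_group_id(1, 7))
```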
As the final task in the Parallel Section, we sorted the data set by the group identifier, so we could see matches adjacent to each other.

Finally, we attached a Data Viewer to the output of the parallel Hadoop process to inspect the resultant data set. The resulting execution times and actual-versus-expected output are discussed in the next section.

Figure 5: The Name Matching Job Design
Benchmark Results
Use Case 1: Web-Coupon Log on Hadoop Join with Orders
The goal of the first use case for the benchmark was to prepare a data set that correlates products ordered with the page views and coupon campaign click-throughs on an e-commerce website. The integration job was written to map the page views and coupons to products ordered. Figure 6 is a conceptual mapping of this integration.

Figure 6: Web-Coupon Log on Hadoop Join with Orders Mapping

Execution Time and Actual-Versus-Expected Results

Table 4 lists the median execution times of the Web-Coupon Log on Hadoop Join with Orders job.

Job                                        Trials  Median Run Time  Output Rows
Web-Coupon Log on Hadoop Join with Orders  5       3m47s            160,176

Table 4: Web-Coupon Log on Hadoop Join with Orders Benchmark Results
Vendor Comparison
As a comparison with the rest of the data management industry, the results of this benchmark were compared against a benchmark run by MCG Global Services in late 2015, comparing Talend and Informatica [1]. Hadoop MapReduce, Apache Spark, and YARN represent a critical architectural choice that many information management professionals must make. Thus, the results of the previous benchmark are valuable when evaluating RedPoint’s performance and capabilities. The Web-Coupon Log on Hadoop Join with Orders job created in RedPoint used the same data volume and variety, a nearly identical job design, and comparable EC2 instances to achieve the same benchmark workload output as the previous benchmark.

Vendor Platform                  Execution Time
Hadoop MapReduce                 1h11m52s
Apache Spark                     20m43s
RedPoint on Hadoop (YARN only)   3m47s

Table 5: RedPoint performance compared to a previous benchmark

RedPoint was able to complete the same workload 550% faster than Talend using Spark and 1900% faster than Informatica using Hadoop MapReduce. This result reflects a platform that RedPoint has designed and continually tuned for over a decade and that utilizes YARN.
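The percentages quoted above can be checked against the run times in Table 5. The short sketch below converts each time to seconds and divides by the RedPoint run time; it reads the MapReduce entry as 1 hour, 11 minutes, 52 seconds, which is our interpretation of the table.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an h/m/s run time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


RUN_TIMES = {
    "Hadoop MapReduce": to_seconds(hours=1, minutes=11, seconds=52),
    "Apache Spark": to_seconds(minutes=20, seconds=43),
    "RedPoint on Hadoop (YARN only)": to_seconds(minutes=3, seconds=47),
}


def speed_ratio(baseline_seconds, candidate_seconds):
    """How many times longer the baseline ran than the candidate."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    redpoint = RUN_TIMES["RedPoint on Hadoop (YARN only)"]
    for platform, seconds in RUN_TIMES.items():
        print(f"{platform}: {seconds} s, {speed_ratio(seconds, redpoint):.1f}x the RedPoint run time")
```

The ratios come out at roughly 5.5 for Spark and 19 for MapReduce, which is how the 550% and 1900% figures appear to have been derived.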
[1] “Hadoop Integration Benchmark,” Product Profile and Evaluation: Talend and Informatica, available at: https://info.talend.com/hadoopintegrationinformatica.html.
Use Cases 2 and 3: Address Standardization and Name Matching

The goal of the second and third use cases for the benchmark was to prepare data sets of sanitized customer addresses and matching customer duplicates. The data quality jobs were written to make use of and assess RedPoint’s toolset.
Execution Time and Actual-Versus-Expected Results
Table 6 lists the median execution times of the Address Standardization and Name Matching jobs.

Job                      Trials  Median Run Time  Output Rows
Address Standardization  5       0:02:30          2,005,055
Name Matching            5       0:02:52          6,367,507

Table 6: Address Standardization and Name Matching Benchmark Results
The benchmark produced very satisfactory data quality output within a range we expected based on the original source data generated. What was impressive was RedPoint’s performance. While we have no other previous benchmark with which to compare these results, the Address Standardization workload processed 10 million records at a rate of 66,667 records per second, and the Name Matching was achieved at 58,140 records per second. These results are a testament to the power of RedPoint’s ability to leverage the Hadoop cluster for parallel processing via YARN with minimal overhead.
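The throughput figures follow directly from the 10 million input records and the median run times in Table 6, as this small calculation shows (the function name is our own):

```python
def records_per_second(records, minutes, seconds):
    """Throughput for a job that processed `records` in minutes:seconds."""
    return records / (minutes * 60 + seconds)


if __name__ == "__main__":
    input_records = 10_000_000
    print("Address Standardization:", round(records_per_second(input_records, 2, 30)))
    print("Name Matching:", round(records_per_second(input_records, 2, 52)))
```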
Perceived Usability Assessment

Important, but often-overlooked, considerations when benchmarking and evaluating data management tools are product usability and maturity. In previous benchmarks and client engagements, we have seen tools that rank highly for how easy they are to install, configure, understand, and use. We have also seen some that are quite difficult to use. Additionally, we have evaluated RedPoint’s perceived ease of use. For this assessment, we used the rubric in Table 7 (which is based on an ISO/IEC 9126-4 approach to usability metrics) and evaluated the RedPoint Data Management tool accordingly.
Measure: Efficiency (ease of installation, setup, and configuration)
• Using the vendor’s documentation, how much effort (in person-hours) was required to install and set up the software once the target instance(s) were available?
• How much effort (in person-hours) was required to configure the necessary Hadoop components to get the jobs to execute?
Result: The installation and setup of RedPoint Data Management Site and Execution Servers and Client tool took less than 1.5 person-hours. The configuration of Hadoop tools took less than 0.5 person-hours.

Measure: Effectiveness (job execution completion rate)
• Once a data management/integration job is created and runs successfully on a test set of data, how many benchmark jobs failed to complete due to problems with the vendor software or Hadoop?
Result: No failures. RedPoint Data Management successfully completed every benchmark test after we confirmed the job was properly formed by running a test data set.

Measure: Satisfaction (user interface)
• On a scale from very difficult to very easy, how did we find our experience building the data integration/management jobs?
Result: Very easy. The user interface is intuitive. Data integration/management components are clearly identified and configuration options were easy to set. We only referred to the documentation and in-tool help content (which was very thorough) to confirm our usage and settings of components. In our experience, most other vendor tools rate from easy to moderately difficult.

Table 7: RedPoint’s perceived usability tests
Conclusion
There are multiple ways to integrate data into Hadoop. There are vast differences in the architectures of the vendors, wrapping open source tools like MapReduce and Spark. You cannot be satisfied with the functionality of a Hadoop load alone; you must also be concerned with performance. Ensure the wind is in your sails with your tool selection by leaving yourself room for experimentation, error, and growth. Performance will be there for the vast cycles of development, testing, quality assurance and, of course, production.

Ultimately, the proof is in the testing outcomes. Our benchmark results were beyond what we thought possible. Vendor architecture is important in integrating data with Hadoop, yet the differences are vast. RedPoint is based on a foundation of YARN, which has proven to be a good choice.
About MCG Global Services
William McKnight is President of McKnight Consulting Group (MCG) Global Services (http://www.mcknightcg.com). He is an internationally recognized authority in information management. His consulting work has included many of the Global 2000 and numerous midmarket companies. His teams have won several best practice competitions for their implementations, and many of his clients have gone public with their success stories. McKnight’s strategies form the information management plan for leading companies in various industries.

Jake Dolezal has over 17 years of experience in the Information Management field with expertise in business intelligence, analytics, data warehousing, statistics, data modeling and integration, data visualization, master data management, and data quality. Dolezal has experience across a broad array of industries, including healthcare, education, government, manufacturing, engineering, hospitality, and gaming.

With an A-list of clients representing complex and highly successful information management, MCG has a broad catalogue of experience. Our advice is a combination of the latest best practices with our personal experience and expertise. It is practical, not theoretical.

• We take a keen focus on business justification.
• We take a programmatic, not a project-based, approach.
• We believe in integrating with client staff and prioritize knowledge transfer.
• We engineer client workforces and processes to carry you forward.
• We’re vendor neutral so you can rest assured that our advice is completely client oriented.
• We know, define, judge, and promote best practices.
• We have encountered and overcome most conceivable information management challenges.
• We ensure business results are delivered early and often.

We anticipate our customers’ needs well into the future with our full lifecycle approach. Our focused, experienced teams generate efficient, economic, timely, and sustainable results for our clients.
About RedPoint Global
RedPoint Global offers a comprehensive set of world-class ETL, data quality, and data integration applications that operate in and across both traditional and Hadoop 2.0/YARN environments. The company also offers data-driven customer engagement solutions that help companies derive insights from customer behaviors and create consistent, relevant, and precise messaging across any and all channels. All RedPoint applications offer a unique visual user interface that eliminates the need for programming skills. This allows enterprises to utilize all data to achieve their strategic business goals. For more information, visit www.redpoint.net or email: [email protected].