#1 Modern Platform to Turn Data into a Strategic Asset
© 2016 RapidMiner, Inc. All rights reserved.
June 14, 2016
Zoltan Prekopcsak, VP Big Data
Best Practices for Big Data Analytics in Hadoop
90% of Deployed Data Lakes are “USELESS”
“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.”
• SKILLS GAP is a major adoption inhibitor, cited by 57%
• How to EXTRACT VALUE from Hadoop, cited by 49%
Different Approaches to Big Data Analytics
• Sampling
• Grid Computing
• Native Distributed Algorithms
Approach 1: Sampling
Sampling: Data Movement & Processing
• Data Movement
  • Pulls sample data from HDFS/Hive/Impala
• Data Processing
  • In the analytics tool
[Diagram: pieces of data are pulled out of Hadoop into the analytics tool, which performs the calculations.]
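The slide does not say how the sample is drawn. One common option, shown here as a minimal sketch, is reservoir sampling, which picks a fixed-size uniform sample in a single pass without knowing the table size in advance. The `fake_table` generator is a hypothetical stand-in for rows streamed out of HDFS/Hive/Impala; no real Hadoop client is used.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k rows chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)    # randint is inclusive on both ends
            if j < k:
                sample[j] = row      # keep the new row with probability k / (i + 1)
    return sample


def fake_table(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for rows streamed out of HDFS/Hive/Impala."""
    for i in range(n):
        yield {"id": i, "value": i % 7}


sample = reservoir_sample(fake_table(100_000), k=1_000)
print(len(sample))
```

Only the 1,000 sampled rows would be held by the analytics tool; everything else stays in Hadoop.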
Sampling: Pros & Cons
• Pros
  + Simple and easy to start with
  + Usually works well for data exploration and early prototyping
  + Some ML models would not benefit from more data anyway
• Cons
  – Many ML models would benefit from more data
  – Cannot be used when large-scale data preparation is needed
  – Hadoop is used as a data repository only
Sampling: Best Practices
• When to use it
  + Only data exploration / data understanding
  + Early prototyping on prepared and clean data
  + Machine Learning modeling with very few and basic patterns (e.g. only a handful of columns and a binary prediction target)
• When NOT to use it
  – Large number of columns in the data
  – Need to blend large datasets (e.g. large-scale joins)
  – Complex Machine Learning models
  – Looking for rare events
• Horror stories
  – Important decisions made based on biased samples
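Why "looking for rare events" is on the NOT list can be checked with a little arithmetic. The sketch below assumes rows are sampled independently and uses a hypothetical event rate of 1 in 100,000; neither number comes from the slides.

```python
import math


def prob_sample_has_no_events(event_rate: float, sample_size: int) -> float:
    """Probability that a simple random sample contains zero rare events.

    Assumes each sampled row is independently an event with probability
    event_rate, i.e. (1 - event_rate) ** sample_size.
    """
    return math.exp(sample_size * math.log1p(-event_rate))


def expected_events(event_rate: float, sample_size: int) -> float:
    """Expected number of rare events in the sample."""
    return event_rate * sample_size


# Hypothetical rate: 1 event per 100,000 rows, at several sample sizes.
rate = 1e-5
for n in (10_000, 100_000, 1_000_000):
    print(n, expected_events(rate, n), prob_sample_has_no_events(rate, n))
```

Under these assumptions a 10,000-row sample has about a 90% chance of containing no events at all, so any model trained on it never sees the pattern it is supposed to find.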
Different Approaches to Big Data Analytics
Sampling: Data Visualization, Programming
Approach 2: Grid Computing
• Data Movement
  • Only results are moved; data remains in Hadoop
• Data Processing
  • Custom single-node application running on multiple Hadoop nodes
• Pros & Cons
  + Hadoop is used for parallel processing in addition to serving as a data source
  – Only works if data subsets can be processed independently
  – Only as good as the single-node engine; no benefit from fast-evolving Hadoop innovations
[Diagram: the analytics tool sends the application to multiple Hadoop nodes; each node runs the app and performs calculations, and the results are returned to the analytics tool.]
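The pattern can be sketched on one machine: the same single-node function runs once per independent data subset, and only the small result records are collected. This is an illustration under assumptions, with a thread pool standing in for Hadoop nodes; `single_node_app` and `run_on_grid` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def single_node_app(partition: List[float]) -> Dict[str, float]:
    """Hypothetical single-node job: summarize one independent data subset."""
    return {"rows": len(partition), "total": sum(partition), "max": max(partition)}


def run_on_grid(partitions: List[List[float]], workers: int = 4) -> List[Dict[str, float]]:
    """Run the same app once per partition; only the small results come back."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input partitions.
        return list(pool.map(single_node_app, partitions))


# Toy partitions standing in for data blocks that stay on the Hadoop nodes.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
results = run_on_grid(partitions)
print(results)
```

Note that `single_node_app` never sees more than one partition, which is exactly the limitation listed under the cons.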
Grid computing best practices
• When to use it
  + Task can be performed on smaller, independent data subsets
  + Compute-intensive data processing
• When NOT to use it
  – Data-intensive data processing
  – Complex Machine Learning models
  – Lots of interdependencies between data subsets
• Horror stories
  – Grid computing job called in huge loops to manage dependencies and intermediate results
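A quick way to test whether a task fits the "independent data subsets" condition is to check that per-subset results combine into the exact global answer. The toy data below is made up for illustration: a total passes the check, a median does not.

```python
import statistics
from typing import List

partitions: List[List[int]] = [[1, 2, 3], [4, 5, 100], [6, 7, 8, 9, 10]]
all_rows = [x for part in partitions for x in part]

# A total is decomposable: per-subset results combine into the exact answer.
total_from_parts = sum(sum(part) for part in partitions)
assert total_from_parts == sum(all_rows)

# A median is not: combining per-subset medians misses the global median.
median_of_medians = statistics.median(statistics.median(part) for part in partitions)
global_median = statistics.median(all_rows)
print(median_of_medians, global_median)  # prints "5 6"
```

Tasks of the second kind are the ones that tend to end up in the loops described in the horror story, since each pass needs intermediate results from the others.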
Approach 2: Grid Computing
Legacy single-machine analytics engines
Approach 3: Native distributed algorithms
• Data Movement
  • Only results are moved; data remains in Hadoop
• Data Processing
  • Executed by native Hadoop tools: Hive, Spark, H2O, Pig, MapReduce, etc.
• Pros & Cons
  + Holistic view of all data and patterns
  + Highly scalable distributed processing optimized for Hadoop
  – Limited set of algorithms available; very hard to develop new algorithms
[Diagram: the analytics tool pushes instructions to Hadoop, where the calculations run; the results are returned to the analytics tool.]
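"Instructions pushed to Hadoop" can be as simple as generating a query that a native tool such as Hive executes next to the data. The sketch below only builds the query text; the table and column names are hypothetical, and submitting it would require a Hive client that is not shown here.

```python
def build_pushdown_query(table: str, group_col: str, value_col: str) -> str:
    """Build an aggregate query meant to run inside Hadoop (e.g. via Hive).

    Table and column names are hypothetical; real code should validate
    identifiers rather than interpolate untrusted input.
    """
    return (
        f"SELECT {group_col}, COUNT(*) AS n_rows, AVG({value_col}) AS avg_value "
        f"FROM {table} "
        f"GROUP BY {group_col}"
    )


query = build_pushdown_query("web_logs", "country", "session_seconds")
print(query)
```

The full table is scanned where it lives, and only one row per group travels back to the analytics tool.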
Native distributed algorithms best practices
• When to use it
  + Complex Machine Learning models needed
  + Lots of interdependencies inside the data (e.g. graph analytics)
  + Need to blend and cleanse large datasets (e.g. large-scale joins)
• When NOT to use it
  – Data is not that large
  – Sample would reveal all interesting patterns
• Horror stories
  – Complex ML model developed in 3 months defeated by a prototype model created in an afternoon
Approach 3: Native Distributed Algorithms
Hadoop ecosystem projects
Different Approaches to Big Data Analytics
Which one to use for a given use case?
Typical projects need all three to succeed
• Sampling
• Grid Computing
• Native Distributed Algorithms
RapidMiner Predictive Analytics Platform
Single Analytics Platform to support all three
• Sampling
• Grid Computing
• Native Distributed Algorithms
Predictive Analytics Reimagined
A Modern Data Science Platform to Turn Data Into a Strategic Asset
10 Milk Street, 11th Floor, Boston, MA 02108, USA
+1 617 401 7708
Web: rapidminer.com
Email: [email protected]
Twitter: @prekopcsak