Best Practices for Big Data Analytics in Hadoop

Zoltan Prekopcsak, VP Big Data

June 14, 2016

#1 Modern Platform to Turn Data into a Strategic Asset

©2016 RapidMiner, Inc. All rights reserved.




90% of Deployed Data Lakes are “USELESS”

“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.”

• SKILLS GAP is a major adoption inhibitor, cited by 57%

• How to EXTRACT VALUE from Hadoop, cited by 49%


Different Approaches to Big Data Analytics

• Sampling
• Grid Computing
• Native Distributed Algorithms


Approach 1: Sampling


Sampling: Data Movement & Processing

• Data Movement
  § Pulls sample data from HDFS/Hive/Impala
• Data Processing
  § In the analytics tool

[Diagram: pieces of data are pulled out of Hadoop into the Analytics Tool, which performs the calculations]
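The data flow on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RapidMiner's implementation: `scan_table` is a hypothetical stand-in for a cursor over an HDFS/Hive/Impala table, and reservoir sampling is just one common way to draw a fixed-size uniform sample in a single pass (the slides do not name a sampling method). Everything after the sample is drawn happens locally in the analytics tool.

```python
import random
from statistics import mean
from typing import Iterable, Iterator, List


def scan_table(n_rows: int) -> Iterator[dict]:
    """Hypothetical stand-in for a cursor over a large Hadoop-resident table."""
    for i in range(n_rows):
        yield {"id": i, "amount": float(i % 100)}


def reservoir_sample(rows: Iterable[dict], k: int, seed: int = 0) -> List[dict]:
    """Keep a uniform random sample of k rows while streaming over the data once."""
    rng = random.Random(seed)
    sample: List[dict] = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = row         # replace an existing entry
    return sample


# Only the sample leaves the cluster; the analytics tool computes on it locally.
sample = reservoir_sample(scan_table(100_000), k=1_000)
sample_mean = mean(row["amount"] for row in sample)
print(len(sample), round(sample_mean, 2))
```

The sample mean is only an estimate of the full-table mean; how close it gets depends on the sample size and on whether the sample is really unbiased.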


Sampling: Pros & Cons

• Pros
  + Simple and easy to start with
  + Usually works well for data exploration and early prototyping
  + Some ML models would not benefit from more data anyway
• Cons
  – Many ML models would benefit from more data
  – Cannot be used when large-scale data preparation is needed
  – Hadoop is used as a data repository only


Sampling: Best Practices

• When to use it
  + Only data exploration / data understanding
  + Early prototyping on prepared and clean data
  + Machine Learning modeling with very few and basic patterns (e.g. only a handful of columns and binary prediction target)
• When NOT to use it
  − Large number of columns in the data
  − Need to blend large datasets (e.g. large-scale joins)
  − Complex Machine Learning models
  − Looking for rare events
• Horror stories
  – Important decisions made based on biased samples

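The warning about rare events can be made concrete with a little arithmetic. The sketch below assumes rows are drawn independently and uniformly (a reasonable approximation when the sample is small relative to the table); the event rate and sample size are hypothetical numbers chosen for illustration, not figures from the talk.

```python
def prob_sample_misses_all(event_rate: float, sample_size: int) -> float:
    """Probability that a uniform random sample contains no rare event at all."""
    return (1.0 - event_rate) ** sample_size


def expected_events(event_rate: float, sample_size: int) -> float:
    """Expected number of rare events that land in the sample."""
    return event_rate * sample_size


# Hypothetical numbers: 1 event per 100,000 rows, 10,000 sampled rows.
rate, n = 1e-5, 10_000
print(f"expected events in sample: {expected_events(rate, n):.2f}")
print(f"chance of seeing none:     {prob_sample_misses_all(rate, n):.1%}")
```

With these numbers the sample is expected to contain about 0.1 events and misses the event entirely roughly 90% of the time, which is why a sample is a poor basis for rare-event analysis.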


Sampling: Data Visualization, Programming


Approach 2: Grid Computing

• Data Movement
  • Only results are moved, data remains in Hadoop
• Data Processing
  • Custom single-node application running on multiple Hadoop nodes
• Pros & Cons
  + Hadoop is used for parallel processing in addition to being used as a data source
  – Only works if data subsets can be processed independently
  – Only as good as the single-node engine, no benefit from fast-evolving Hadoop innovations

[Diagram: the Analytics Tool sends the Application to the Hadoop nodes, each node runs a copy of the App and performs the Calculations, and the Results are returned]
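The grid pattern can be imitated on one machine to show its shape. In this sketch, threads stand in for Hadoop nodes, `PARTITIONS` is a hypothetical data layout, and `single_node_app` is a placeholder for whatever single-node routine gets shipped to the cluster; none of these names come from the talk or from any Hadoop API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical data layout: each "node" holds one independent partition.
PARTITIONS: Dict[str, List[float]] = {
    "node-1": [3.0, 5.0, 8.0],
    "node-2": [1.0, 4.0],
    "node-3": [9.0, 2.0, 6.0, 7.0],
}


def single_node_app(node: str) -> Dict[str, float]:
    """The unchanged single-node routine; it sees only its local partition."""
    values = PARTITIONS[node]
    return {"count": float(len(values)), "total": sum(values)}


def run_on_grid(nodes: List[str]) -> List[Dict[str, float]]:
    """Run one copy of the app per node; only the small results come back."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(single_node_app, nodes))


results = run_on_grid(list(PARTITIONS))
overall_mean = sum(r["total"] for r in results) / sum(r["count"] for r in results)
print(results, overall_mean)
```

Each copy of the app works in isolation, so the final combination step has to be something the caller can do from the partial results alone.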


Grid computing best practices

• When to use it
  + Task can be performed on smaller, independent data subsets
  + Compute-intensive data processing
• When NOT to use it
  – Data-intensive data processing
  – Complex Machine Learning models
  – Lots of interdependencies between data subsets
• Horror stories
  – Grid computing job called in huge loops to manage dependencies and intermediate results

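Whether a task "can be performed on smaller, independent data subsets" depends on whether its partial results can be combined afterwards. The toy example below, with made-up numbers, contrasts a sum (which decomposes cleanly) with a median (which does not).

```python
from statistics import median
from typing import List

partitions: List[List[float]] = [[1, 2, 9], [3, 4, 5]]
all_values = [v for part in partitions for v in part]

# Sum decomposes: per-subset sums can simply be added together.
sum_of_partials = sum(sum(part) for part in partitions)

# Median does not decompose: combining per-subset medians gives a different answer.
median_of_partials = median(median(part) for part in partitions)
true_median = median(all_values)

print(sum_of_partials == sum(all_values))
print(median_of_partials, true_median)
```

Computations of the second kind are where grid jobs tend to end up wrapped in loops that shuttle intermediate results back and forth.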


Grid Computing: Legacy single-machine analytics engines


Approach 3: Native distributed algorithms

• Data Movement
  • Only results are moved, data remains in Hadoop
• Data Processing
  • Executed by native Hadoop tools: Hive, Spark, H2O, Pig, MapReduce, etc.
• Pros & Cons
  + Holistic view of all data and patterns
  + Highly scalable distributed processing optimized for Hadoop
  − Limited set of algorithms available, very hard to develop new algorithms

[Diagram: the Analytics Tool pushes Instructions to Hadoop, Hadoop performs the Calculations, and the Results are returned]
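What sets a native distributed algorithm apart from a grid job is that the framework regroups data between steps, so related records scattered across nodes still meet. The sketch below imitates the map, shuffle, and reduce phases in plain Python. It is a simulation for illustration only, not the API of Hive, Spark, or MapReduce, and the schema and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float]          # (customer, purchase amount), hypothetical schema
Pair = Tuple[str, Tuple[float, int]]

# Records for the same customer are scattered across data blocks.
BLOCKS: List[List[Record]] = [
    [("ann", 10.0), ("bob", 5.0)],
    [("bob", 7.0), ("ann", 2.0), ("cid", 1.0)],
]


def map_phase(block: Iterable[Record]) -> List[Pair]:
    """Emit (key, (amount, 1)) pairs; runs where the block is stored."""
    return [(customer, (amount, 1)) for customer, amount in block]


def shuffle(mapped: Iterable[List[Pair]]) -> Dict[str, List[Tuple[float, int]]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[Tuple[float, int]]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[Tuple[float, int]]]) -> Dict[str, float]:
    """Combine all values for a key into a per-customer average."""
    return {
        key: sum(a for a, _ in vals) / sum(c for _, c in vals)
        for key, vals in groups.items()
    }


averages = reduce_phase(shuffle(map_phase(b) for b in BLOCKS))
print(averages)
```

The shuffle step is the part a grid job lacks: it is what gives the computation a view across all blocks rather than one block at a time.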


Native distributed algorithms best practices

• When to use it
  + Complex Machine Learning models needed
  + Lots of interdependencies inside the data (e.g. graph analytics)
  + Need to blend and cleanse large datasets (e.g. large-scale joins)
• When NOT to use it
  − Data is not that large
  − Sample would reveal all interesting patterns
• Horror stories
  – Complex ML model developed in 3 months defeated by a prototype model created in an afternoon

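Large-scale joins are a typical case where the data must be reorganized across the cluster. One standard technique, shown here as a single-machine sketch, is a partitioned hash join: both inputs are routed by join key so that matching rows land in the same partition, and each partition is then joined locally. The function names, schema, and sample rows are hypothetical, and real engines choose their own join strategies.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (join key, payload); hypothetical schema


def partition_by_key(rows: List[Row], n_parts: int) -> List[List[Row]]:
    """Route each row to a partition chosen by its key, so equal keys co-locate."""
    parts: List[List[Row]] = [[] for _ in range(n_parts)]
    for key, payload in rows:
        parts[key % n_parts].append((key, payload))
    return parts


def local_hash_join(left: List[Row], right: List[Row]) -> List[Tuple[int, str, str]]:
    """Join one pair of co-located partitions with an in-memory hash table."""
    index: Dict[int, List[str]] = defaultdict(list)
    for key, payload in left:
        index[key].append(payload)
    return [(key, l, r) for key, r in right for l in index.get(key, [])]


def distributed_join(
    left: List[Row], right: List[Row], n_parts: int = 3
) -> List[Tuple[int, str, str]]:
    """Partition both inputs the same way, then join partition by partition."""
    joined: List[Tuple[int, str, str]] = []
    left_parts = partition_by_key(left, n_parts)
    right_parts = partition_by_key(right, n_parts)
    for l_part, r_part in zip(left_parts, right_parts):
        joined.extend(local_hash_join(l_part, r_part))
    return sorted(joined)


customers = [(1, "ann"), (2, "bob"), (4, "cid")]
orders = [(2, "book"), (1, "pen"), (2, "lamp"), (5, "desk")]
joined = distributed_join(customers, orders)
print(joined)
```

Neither a sample nor a set of isolated grid jobs can do this reliably, because a row's join partner may sit in any other part of the data.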


Native Distributed Algorithms: Hadoop ecosystem projects


Which one to use for a given use case?


Typical projects need all three to succeed: Sampling, Grid Computing, and Native Distributed Algorithms.
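The "when to use it" lists from the three approaches can be folded into a small rule of thumb. The sketch below is one possible reading of those lists, not an official rule set: the `Task` fields, the rule order, and the example project are all hypothetical, and sending rare-event work to native distributed algorithms is an assumption (the slides only say not to sample in that case).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    """Hypothetical description of one step in an analytics project."""
    name: str
    independent_subsets: bool = False  # work splits into independent data subsets
    compute_intensive: bool = False    # heavy computation per subset
    large_scale_prep: bool = False     # large joins, blending, cleansing
    complex_model: bool = False        # complex ML model or many interdependencies
    rare_events: bool = False          # looking for rare events


def suggest_approach(task: Task) -> str:
    """Map a task to an approach; a heuristic reading of the slides' guidance."""
    if task.large_scale_prep or task.complex_model or task.rare_events:
        return "Native Distributed Algorithms"  # rare_events here is an assumption
    if task.independent_subsets and task.compute_intensive:
        return "Grid Computing"
    return "Sampling"  # exploration and early prototyping on clean data


project: List[Task] = [
    Task("explore the data"),
    Task("simulate per store", independent_subsets=True, compute_intensive=True),
    Task("join clickstream to orders", large_scale_prep=True),
]
plan: Dict[str, str] = {task.name: suggest_approach(task) for task in project}
print(plan)
```

Even this three-step example project ends up using all three approaches, one per step.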


RapidMiner Predictive Analytics Platform


Single Analytics Platform to support all three


Predictive Analytics Reimagined
A Modern Data Science Platform to Turn Data Into a Strategic Asset

10 Milk Street, 11th Floor, Boston, MA 02108, USA
+1 617 401 7708

Web: rapidminer.com
Email: [email protected]
Twitter: @prekopcsak