ctd data science on hadoop - government...

Post on 20-May-2020


Page 1: CTD Data Science on Hadoop - Government Executivecdn.govexec.com/media/ctd_data_science_on_hadoop.pdf · 2017-05-17 · •Enable real-time use cases •Provide data governance •Provide

Data Science on Hadoop

Justin Erickson, Senior Director, Product Management

© Cloudera, Inc. All rights reserved.

Page 2

Age of Machine Learning

[Figure: cost of compute and data volume plotted against time, 1950s through 2010s, with regions labeled "No Machine Learning" and "Machine Learning".]

Page 3

The Enterprise Platform for Data Science and Machine Learning

The data is now here
• 30B connected devices
• 440x more data

Cloudera first to integrate Spark
• Modern platform for machine learning and advanced analytics

Leading adoption among enterprises
• 500 customers run Spark on

Page 4

Sample data science / machine learning workflow
From data to exploration to action

[Workflow diagram]
Stages: Data Engineering, Data Science (Exploratory), Production (Operational)
Steps: Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Data Pipelines, Batch Scoring, Online Scoring, Serving
Other labels: Data Governance, Governance, Processing, Acquisition, Reports, Dashboards
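The stages in this workflow can be sketched in a few lines of plain Python. This is only an illustration, not anything shown in the deck: the function names (`wrangle`, `train`, `evaluate`, `batch_score`), the toy records, and the choice of a one-variable least-squares model are all assumptions made for the example. On a cluster, each step would run on an engine such as Spark rather than in a single process.

```python
import random
from statistics import mean


def wrangle(raw_rows):
    """Data engineering: drop incomplete records and cast fields to floats."""
    clean = []
    for row in raw_rows:
        if row.get("x") is None or row.get("y") is None:
            continue  # skip records with missing fields
        clean.append((float(row["x"]), float(row["y"])))
    return clean


def train(rows):
    """Model training: fit y = a*x + b by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    a = sxy / sxx
    return a, y_bar - a * x_bar


def evaluate(model, rows):
    """Model testing: mean absolute error on held-out rows."""
    a, b = model
    return mean(abs(a * x + b - y) for x, y in rows)


def batch_score(model, xs):
    """Production: apply the trained model to a batch of new inputs."""
    a, b = model
    return [a * x + b for x in xs]


# Hypothetical toy data: y = 2x + 1, plus one incomplete record.
raw = [{"x": i, "y": 2 * i + 1} for i in range(20)] + [{"x": None, "y": 3}]

rows = wrangle(raw)
random.Random(0).shuffle(rows)            # fixed seed for a repeatable split
train_rows, test_rows = rows[:15], rows[15:]

model = train(train_rows)
test_error = evaluate(model, test_rows)
scores = batch_score(model, [100, 200])
```

Keeping each stage as a separate function mirrors the hand-offs in the diagram: the same `model` object produced during exploration is what the production scoring step consumes.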

Page 5

The good news

[Workflow diagram repeated from page 4.]

• Data has never been more plentiful
• Open source data science and machine learning libraries are rapidly evolving
• Commodity (and on-demand) compute makes scalable production machine learning affordable

Page 6

The bad news

[Workflow diagram repeated from page 4.]

• Most data science is done at small scale, individually, and is difficult to replicate
• Very few models reach production
• Teams have different, conflicting requests for languages & libraries
• Data needs to move across multiple different systems

Page 7

Additional challenges

Access
For sensitive data, secure clusters are difficult to access. And IT typically doesn’t want random packages installed on a secure cluster.

Popular open source tools don’t easily connect to these environments, and don’t always support Hadoop data formats.

Scale
Laptops rarely have capacity for medium data, let alone big data. This leads to a lot of sampling.

Popular frameworks don’t easily parallelize on a cluster. Typically code has to get rewritten for production.
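To make the sampling point concrete, here is a minimal sketch of how a laptop-sized sample is often drawn from data too large to hold in memory. The deck does not name a sampling method; reservoir sampling (Algorithm R) is used here as one common choice, and the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Only k items are held in memory at any time, so the input can be
    far larger than the machine's RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = item       # keep this item with probability k / (i + 1)
    return sample


# Draw 5 records from a stream of 100,000 without materializing it.
sample = reservoir_sample(range(100_000), k=5, seed=42)
```

The trade-off the slide points at still applies: a model explored on `sample` sees only a fraction of the data, which is one reason code written this way tends to be rewritten before it runs against the full dataset.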

Developer Experience
Notebooks, while awesome, don’t easily support virtual environment and dependency management, especially for teams. This makes sharing and reproducibility hard.

Notebooks are also challenging to “put into production.”

Page 8

This year, our goal is to enable data science and machine learning at scale.

Page 9

Open data science in the enterprise

IT: drive adoption while maintaining compliance

Data Scientist: explore, experiment, iterate

Page 10

Our goal: An open platform for data science at scale

Data Scientist, Data Engineer
Help more data scientists use the power of Hadoop
Use a powerful, familiar environment with direct access to Hadoop data and compute

Enterprise Architect, Hadoop Admin
Make it easy and secure to add new users, use cases
Offer secure self-service analytics and a faster path to production on common, affordable infrastructure

Page 11

Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise

Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing

Page 12

Demo

Page 13

With Cloudera Data Science Workbench…

Data scientists can:
• Use R, Python, or Scala from a web browser, with no desktop footprint
• Install any library or framework within isolated project environments
• Directly access data in secure clusters with Spark and Impala
• Share insights with their team for reproducible, collaborative research
• Automate and monitor data pipelines using built-in job scheduling

IT can:
• Give their data science team the freedom to work how they want, when they want
• Stay compliant with out-of-the-box support for full platform security, especially Kerberos
• Run on-premises or in the cloud, wherever data is managed

Page 14

Solving Data Science is a Full-Stack Problem

• Support unlimited data: ✓ Hadoop
• Provide sufficient tools for Analysts: ✓ Impala, Hive, Hue
• Provide sufficient tools for Data Scientists + Data Engineers: ✓ Spark, Data Science Workbench
• Enable real-time use cases: ✓ Kafka, Spark Streaming
• Provide data governance: ✓ Navigator + Partners
• Provide full-stack security: ✓ Kerberos, Sentry, RecordService, KMS/KTS
• Deploy in the cloud: ✓ Cloudera Director
• Integrate with partner tools: ✓ Rich Ecosystem
• Be easy for IT to deploy/maintain: ✓ Cloudera Manager + Director

Page 15

The importance of an open ecosystem

[Figure: an "Open Ecosystem" shown alongside a "Black Box".]

Page 16

Thank you

Justin Erickson
