ctd data science on hadoop - government...

Post on 20-May-2020

TRANSCRIPT

© Cloudera, Inc. All rights reserved.

Data Science on Hadoop

Justin Erickson, Senior Director, Product Management

Age of Machine Learning

[Chart: "Cost of compute" and "Data volume" plotted against time, 1950s through 2010s, with regions labeled "NO Machine Learning" and "Machine Learning".]

The Enterprise Platform for Data Science and Machine Learning

The data is now here

30B CONNECTED DEVICES

440x MORE DATA

Cloudera first to integrate Spark

Modern Platform for Machine Learning and Advanced Analytics

Leading adoption among enterprises

500 Customers

Run Spark on

Sample data science / machine learning workflow

From data to exploration to action

[Workflow diagram with three stages: Data Engineering; Data Science (Exploratory); Production (Operational). Box labels: Data Wrangling; Visualization and Analysis; Model Training & Testing; Production Data Pipelines; Batch Scoring; Online Scoring; Serving; Data Governance; Governance; Processing; Acquisition; Reports, Dashboards.]
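The three stages of the workflow can be sketched end to end in a few lines. This is only an illustrative toy, not anything shown in the talk: the records, the field names `usage` and `churned`, and the midpoint-threshold "model" are all assumptions chosen to keep the example dependency-free.

```python
from statistics import mean

# Hypothetical raw records; field names and values are invented for illustration.
RAW = [
    {"usage": "10", "churned": "0"},
    {"usage": "12", "churned": "0"},
    {"usage": "", "churned": "1"},   # incomplete row, dropped during wrangling
    {"usage": "45", "churned": "1"},
    {"usage": "50", "churned": "1"},
    {"usage": "8", "churned": "0"},
]

def wrangle(records):
    """Data engineering: drop incomplete rows and cast strings to numbers."""
    return [
        (float(r["usage"]), int(r["churned"]))
        for r in records
        if r["usage"] and r["churned"]
    ]

def train(rows):
    """Exploratory data science: fit a one-feature threshold classifier.

    The threshold is the midpoint between the two class means.
    """
    positives = [x for x, y in rows if y == 1]
    negatives = [x for x, y in rows if y == 0]
    return (mean(positives) + mean(negatives)) / 2

def score_batch(threshold, values):
    """Production: apply the trained model to a batch of new values."""
    return [1 if v > threshold else 0 for v in values]

def accuracy(threshold, rows):
    """Fraction of rows classified correctly.

    Evaluated on the training rows here only to keep the sketch short;
    a real test would use held-out data.
    """
    preds = score_batch(threshold, [x for x, _ in rows])
    return sum(p == y for p, (_, y) in zip(preds, rows)) / len(rows)

rows = wrangle(RAW)
threshold = train(rows)
print(accuracy(threshold, rows))
print(score_batch(threshold, [5.0, 60.0]))
```

Each function maps to one stage, which is the point of the diagram: the hand-offs between wrangling, training, and scoring are where work tends to get rewritten.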

The good news

Data has never been more plentiful

Open source data science and machine learning libraries are rapidly evolving

Commodity (and on-demand) compute makes scalable production machine learning affordable

The bad news

Most data science is done at small scale, individually, and is difficult to replicate

Very few models reach production

Teams have different, conflicting requests for languages & libraries

Data needs to move across multiple different systems

Additional challenges

Access

For sensitive data, secure clusters are difficult to access. And IT typically doesn’t want random packages installed on a secure cluster.

Popular open source tools don’t easily connect to these environments, or always support Hadoop data formats.

Scale

Laptops rarely have capacity for medium, let alone big data. This leads to a lot of sampling.

Popular frameworks don’t easily parallelize on a cluster. Typically code has to get rewritten for production.
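To make the scale point concrete, here is a minimal sketch contrasting a sampled estimate with an exact result computed partition by partition. It is not taken from the talk; the dataset, the partition size of 100, and the function names are assumptions for illustration. The exact version works because the mean is expressed as small per-partition summaries that can be merged, which is the kind of restructuring that single-machine code often needs before it can run on a cluster.

```python
import random

def partial_stats(chunk):
    """Map step: reduce one partition to a small (count, total) summary."""
    return (len(chunk), sum(chunk))

def combine(a, b):
    """Reduce step: merge two partial summaries; order does not matter."""
    return (a[0] + b[0], a[1] + b[1])

def partitioned_mean(partitions):
    """Exact mean computed without ever holding all partitions at once."""
    count, total = 0, 0
    for chunk in partitions:
        count, total = combine((count, total), partial_stats(chunk))
    return total / count

random.seed(0)
data = list(range(1, 1001))  # stand-in for a dataset too large for a laptop
partitions = [data[i:i + 100] for i in range(0, len(data), 100)]

exact = partitioned_mean(partitions)        # uses every record
sample = random.sample(data, 20)            # what a laptop workflow might do
estimate = sum(sample) / len(sample)        # approximate, varies with the sample
print(exact, estimate)
```

In a real cluster each call to `partial_stats` would run where its partition lives, and only the small summaries would move across the network.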

Developer Experience

Notebooks, while awesome, don’t easily support virtual environment and dependency management, especially for teams. This makes sharing and reproducibility hard.

Notebooks are also challenging to “put into production.”
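One low-tech way to reduce the reproducibility problem is to record the exact package versions a notebook ran against and share that list with the team. The sketch below uses Python's standard `importlib.metadata` module; the helper names `snapshot` and `to_requirements` are hypothetical, and this is an illustration of the idea rather than how any particular product handles dependencies.

```python
from importlib import metadata

def snapshot(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def to_requirements(versions):
    """Render pinned lines in the 'name==version' form that pip accepts."""
    return [
        f"{name}=={version}"
        for name, version in sorted(versions.items())
        if version is not None
    ]

# Example with made-up versions, so the result does not depend on this machine.
print(to_requirements({"pandas": "1.0.3", "numpy": "1.18.4", "missing": None}))
```

Writing these lines to a requirements file next to the notebook lets a teammate rebuild a matching virtual environment.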

This year, our goal is to enable data science and machine learning at scale.

Open data science in the enterprise

IT: drive adoption while maintaining compliance

Data Scientist: explore, experiment, iterate

Our goal: An open platform for data science at scale

Help more data scientists use the power of Hadoop

Use a powerful, familiar environment with direct access to Hadoop data and compute

Data Scientist, Data Engineer

Make it easy and secure to add new users, use cases

Offer secure self-service analytics and a faster path to production on common, affordable infrastructure

Enterprise Architect, Hadoop Admin

Introducing Cloudera Data Science Workbench

Self-service data science for the enterprise

Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing

Demo

With Cloudera Data Science Workbench…

Data scientists can:
• Use R, Python, or Scala from a web browser, with no desktop footprint
• Install any library or framework within isolated project environments
• Directly access data in secure clusters with Spark and Impala
• Share insights with their team for reproducible, collaborative research
• Automate and monitor data pipelines using built-in job scheduling

IT can:
• Give their data science team the freedom to work how they want, when they want
• Stay compliant with out-of-the-box support for full platform security, especially Kerberos
• Run on-premises or in the cloud, wherever data is managed

Solving Data Science is a Full-Stack Problem

• Support unlimited data: ✓ Hadoop
• Provide sufficient tools for Analysts: ✓ Impala, Hive, Hue
• Provide sufficient tools for Data Scientists + Data Engineers: ✓ Spark, Data Science Workbench
• Enable real-time use cases: ✓ Kafka, Spark Streaming
• Provide data governance: ✓ Navigator + Partners
• Provide full-stack security: ✓ Kerberos, Sentry, RecordService, KMS/KTS
• Deploy in the cloud: ✓ Cloudera Director
• Integrate with partner tools: ✓ Rich Ecosystem
• Be easy for IT to deploy/maintain: ✓ Cloudera Manager + Director

The importance of an open ecosystem

[Diagram with two panels labeled "Open Ecosystem" and "Black Box".]

Thank you

Justin Erickson
