ctd data science on hadoop - government...
Post on 20-May-2020
© Cloudera, Inc. All rights reserved.
Data Science on Hadoop
Justin Erickson, Senior Director, Product Management

Age of Machine Learning
[Figure: chart with curves labeled "Cost of compute" and "Data volume" against "Time" (1950s to 2010s), with regions labeled "NO Machine Learning" and "Machine Learning".]

The Enterprise Platform for Data Science and Machine Learning
The data is now here
30B connected devices
440x more data
Cloudera first to integrate Spark
Modern Platform for Machine Learning and Advanced Analytics
Leading adoption among enterprises
500 customers
Run Spark on

Sample data science / machine learning workflow
From data to exploration to action
[Diagram: three stages labeled Data Engineering, Data Science (Exploratory), and Production (Operational), with boxes labeled Data Wrangling; Visualization and Analysis; Model Training & Testing; Production Data Pipelines; Batch Scoring; Online Scoring; Serving; Data Governance; Governance; Processing; Acquisition; Reports, Dashboards.]

The good news
• Data has never been more plentiful
• Open source data science and machine learning libraries are rapidly evolving
• Commodity (and on-demand) compute makes scalable production machine learning affordable

The bad news
• Most data science is done at small scale, individually, and is difficult to replicate
• Very few models reach production
• Teams have different, conflicting requests for languages & libraries
• Data needs to move across multiple different systems

Additional challenges
Access
• For sensitive data, secure clusters are difficult to access. And IT typically doesn’t want random packages installed on a secure cluster.
• Popular open source tools don’t easily connect to these environments, or always support Hadoop data formats.
Scale
• Laptops rarely have capacity for medium, let alone big data. This leads to a lot of sampling.
• Popular frameworks don’t easily parallelize on a cluster. Typically code has to get rewritten for production.
Developer Experience
• Notebooks, while awesome, don’t easily support virtual environment and dependency management, especially for teams. This makes sharing and reproducibility hard.
• Notebooks are also challenging to “put into production.”
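
To make the "Scale" point about sampling concrete, here is a minimal sketch of one common way to draw a fixed-size uniform sample from data too large to hold in memory: reservoir sampling (Algorithm R). The slides do not say which sampling method teams actually use, so this is an illustration only. The names `reservoir_sample` and `fake_rows` are hypothetical, and the generator stands in for rows read from a real data source.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    Only k records are held in memory at any time, so the stream can be
    far larger than what a laptop could load at once.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


def fake_rows(n: int) -> Iterator[int]:
    """Hypothetical stand-in for rows streamed from a large table."""
    for row_id in range(n):
        yield row_id


sample = reservoir_sample(fake_rows(100_000), k=1000)
print(len(sample))  # 1000
```

A sample like this fits on a laptop, but a model built on it still has to be re-run, and often rewritten, against the full data on a cluster, which is the cost the slide describes.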

This year, our goal is to enable data science and machine learning at scale.

Open data science in the enterprise
IT: drive adoption while maintaining compliance
Data Scientist: explore, experiment, iterate

Our goal: An open platform for data science at scale
Help more data scientists use the power of Hadoop
• Use a powerful, familiar environment with direct access to Hadoop data and compute
• For: Data Scientist, Data Engineer
Make it easy and secure to add new users, use cases
• Offer secure self-service analytics and a faster path to production on common, affordable infrastructure
• For: Enterprise Architect, Hadoop Admin

Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing

Demo

With Cloudera Data Science Workbench…
Data scientists can:
• Use R, Python, or Scala from a web browser, with no desktop footprint
• Install any library or framework within isolated project environments
• Directly access data in secure clusters with Spark and Impala
• Share insights with their team for reproducible, collaborative research
• Automate and monitor data pipelines using built-in job scheduling
IT can:
• Give their data science team the freedom to work how they want, when they want
• Stay compliant with out-of-the-box support for full platform security, especially Kerberos
• Run on-premises or in the cloud, wherever data is managed

Solving Data Science is a Full-Stack Problem
• Support unlimited data: ✓ Hadoop
• Provide sufficient tools for Analysts: ✓ Impala, Hive, Hue
• Provide sufficient tools for Data Scientists + Data Engineers: ✓ Spark, Data Science Workbench
• Enable real-time use cases: ✓ Kafka, Spark Streaming
• Provide data governance: ✓ Navigator + Partners
• Provide full-stack security: ✓ Kerberos, Sentry, RecordService, KMS/KTS
• Deploy in the cloud: ✓ Cloudera Director
• Integrate with partner tools: ✓ Rich Ecosystem
• Be easy for IT to deploy/maintain: ✓ Cloudera Manager + Director

The importance of an open ecosystem
[Figure: two panels labeled "Open Ecosystem" and "Black Box".]

Thank you
Justin Erickson