data science 101 - presentation · 2019-09-15 · data marts data scientist data mining and...
TRANSCRIPT
Data Science 101
Arik Pelkey, Pentaho Senior Director, Product Marketing, Hitachi Vantara
Scott Cooley, Pentaho Data Scientist, Hitachi Vantara
Agenda
This session will provide an introduction to data science fundamentals.
• What is Data Science?
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
AI, Machine Learning, and Deep Learning
Image from https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.
• AI: Getting machines to do what humans are good at
• Machine Learning: Feeding an algorithm data to learn and predict something
• Deep Learning: A type of machine learning
Data Science: Solving Problems with Data
Diagram from Drew Conway: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
• Hacking Skills: Computer science, data engineering and wrangling, coding
• Math and Statistics Knowledge: Algorithms and numerical techniques to derive insights; understanding of the underlying assumptions
• Substantive Experience: Domain knowledge, business acumen, experience, value to the business
• Overlaps: Machine Learning (hacking skills + math and statistics), Traditional Research (math and statistics + substantive experience), Danger Zone! (hacking skills + substantive experience), Data Science (all three)
What's all the fuss? This stuff was created many, many years ago
• Regression: Legendre, Gauss and Galton, early 1800s
• Bayes' Theorem: Thomas Bayes, mid 1700s
• Neural Networks: McCulloch and Pitts, early 1940s
Think about All Our Data and Compute
https://www.computerworld.com.au/article/392735/ska_telescope_generate_more_data_than_entire_internet_2020/.
SKA - 2020 (Square Kilometer Array Telescope)
Will generate as much data in a day as the entire PLANET does in a year!
It is still GROWING!
Regression: Looking for a statistical relationship across variables that may give us an estimate of a particular outcome.
Classification: Similar to regression, but looking for separations in the data given predefined classes. (Supervised)
Clustering: There are no predefined classes; we try to find groups or sets based upon the data at hand. (Unsupervised)
Anomaly Detection: Identification of outliers based upon expected ranges of data.
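To make two of these definitions concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not anything shown in the talk: the helper names `fit_line` and `find_anomalies`, the sample numbers, and the choice of a simple "more than 2 standard deviations from the mean" rule for the expected range are all assumptions.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Regression: hypothetical data where y is exactly 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

# Anomaly detection: hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 25.0]
outliers = find_anomalies(readings)
```

On this data the fit recovers a slope of 2 and an intercept of 1, and only the reading of 25.0 falls outside the expected range. Real data is rarely this clean, so the threshold would need tuning.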
Types of Machine Learning
Labelled vs Unlabelled
Let's say we want to classify houses by size.
Unsupervised
SIZE is missing! We need to look for similarities in the data and group them into clusters.
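As a rough sketch of what grouping into clusters can look like, the Python below runs a very small k-means on the four example houses from this slide's feature table, with the Size label withheld. The function names, the choice of k-means itself, k = 2, and the naive seeding with the first k points are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Minimal k-means: returns one cluster index per point."""
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# (full baths, half baths, bedrooms, home age), no Size label available.
houses = [(1, 0, 2, 56), (1, 1, 3, 59), (2, 1, 3, 20), (2, 1, 3, 19)]
clusters = kmeans(houses, k=2)
```

The two older houses end up in one cluster and the two newer houses in the other. Home age dominates because the features are not scaled; in practice features are usually normalized before clustering.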
Given Features or Feature Set                      Label
Full Bath   Half Bath   Bedrooms   Home Age        Size
1           0           2          56              M
1           1           3          59              L
2           1           3          20              M
2           1           3          19              S
Supervised Learning
Use the labels to build a model. The model is then used to classify the size of a new house based ONLY on the known feature set.
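A minimal sketch of the supervised case, using the labelled houses from the slide's table. The slide does not name an algorithm, so the 1-nearest-neighbour rule here is an assumption chosen for brevity. The pairing of the labels M, L, M, S with the rows in order is also assumed, and `new_house` is a hypothetical record.

```python
# Labelled examples: (full baths, half baths, bedrooms, home age) -> size.
# Assumption: the labels M, L, M, S line up with the table rows in order.
TRAINING = [
    ((1, 0, 2, 56), "M"),
    ((1, 1, 3, 59), "L"),
    ((2, 1, 3, 20), "M"),
    ((2, 1, 3, 19), "S"),
]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(features, training=TRAINING):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    _, label = min(training, key=lambda pair: squared_distance(features, pair[0]))
    return label

new_house = (1, 1, 3, 60)  # hypothetical unlabelled house
predicted_size = classify(new_house)
```

The new house sits closest to the second labelled row, so it is classified as "L". Note that the prediction uses only the feature set; the label is never an input.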
More on Machine Learning
Machine Learning is a methodology to create a model based on sample data and use the model to make a prediction or strategy using a more algorithmic approach.
Historical records that contain square feet, number of bathrooms, zip code…
Records that contain the price the house sold for
Iterate the algorithm over the combined data to train the model
Use the trained model to predict outcome on new records
SUPERVISED LEARNING MODEL
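The four steps above can be sketched in a few lines of Python. This is an illustrative assumption rather than the presenters' method: the slide does not say which algorithm is iterated, so a linear model trained by batch gradient descent stands in for it. The records, the prices, and the names `predict` and `train` are hypothetical.

```python
def predict(row, weights, bias):
    """Linear model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, row)) + bias

def train(features, targets, learning_rate=0.05, epochs=20000):
    """Fit weights and a bias by batch gradient descent on squared error."""
    n, width = len(features), len(features[0])
    weights, bias = [0.0] * width, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * width, 0.0
        for row, target in zip(features, targets):
            error = predict(row, weights, bias) - target
            grad_b += error / n
            for j, value in enumerate(row):
                grad_w[j] += error * value / n
        bias -= learning_rate * grad_b
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    return weights, bias

# Hypothetical historical records: (square feet in thousands, bathrooms).
features = [(1.0, 1), (1.5, 2), (2.0, 2), (2.5, 3), (3.0, 2), (1.2, 1)]
# Hypothetical sale prices in thousands of dollars, generated from
# price = 100 * sqft + 20 * bathrooms + 50 so the fit is easy to check.
prices = [170, 240, 290, 360, 390, 190]

weights, bias = train(features, prices)
estimate = predict((1.8, 2), weights, bias)  # new, unseen record
```

Because the hypothetical prices follow an exact linear rule, the trained model lands very close to 270 for the new record. Real sale prices are noisy, so a real model would be judged on held-out data rather than on an exact match.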
The Data Science Process: Getting from Raw Data to Outcomes
Joe Blitzstein and Hanspeter Pfister, created for the Harvard Data Science course.
Formal Framework: CRISP-DM (Cross Industry Standard Process for Data Mining)
The Data Science Workflow
Traditional Data Science Team: Specialists
Data Scientist (DS): Prepares data and engineers features. Most valuable skill: training models.
Data Engineer (DE): Data acquisition focus. Builds data pipelines. It is not uncommon to have a 5:1 ratio of DE to DS.
Data Analyst (DA): Assists the DS with data prep.
Application Architect (AA): Designs the complete solution; deploys and maintains models in production.
Mythical Creatures
Trends
• Automation
• Tools for Citizen Data Scientists
• Pre-trained models in the cloud
Hiring Guidance
Defining Success
• Easy for the tangible
  - Search order optimization
  - Recommendation engine or CTR
• Hard for others
  - Lead scoring
  - Attrition
• Try to measure direct outcomes
• Rarely a silver bullet
• Think ROI
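For the tangible cases, the arithmetic behind "measure direct outcomes" and "think ROI" is short. The sketch below uses the standard definitions of click-through rate (clicks divided by impressions) and ROI (net gain divided by cost). Every number in it is hypothetical and the function names are made up for illustration.

```python
def click_through_rate(clicks, impressions):
    """CTR: fraction of impressions that resulted in a click."""
    return clicks / impressions

def roi(gain, cost):
    """Return on investment: net gain relative to what was spent."""
    return (gain - cost) / cost

# Hypothetical A/B comparison of a recommendation engine.
baseline_ctr = click_through_rate(clicks=200, impressions=10_000)
model_ctr = click_through_rate(clicks=260, impressions=10_000)
relative_lift = (model_ctr - baseline_ctr) / baseline_ctr

# Hypothetical project economics, in dollars.
project_roi = roi(gain=150_000, cost=100_000)
```

With these made-up figures the model lifts CTR by roughly 30 percent and the project returns 50 percent on its cost. Lead scoring and attrition are harder because the outcome is delayed and has many other causes.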
Typical Data Science Project
DS
Understand business objectives
AA
DE
DS
ID and procure training data
DA
DS
Prepare data and build new features
DS
Train model
Deploy models
AA
DS
Update models
AA
Preventive Maintenance: Caterpillar Marine Asset Intelligence
Business User (COO): Reporting on Operations and Efficiency
Dashboards and Reports on Machine Performance (Onboard and Onshore)
Data Marts
Data Scientist: Data Mining and Predictive Maintenance
Local Equipment Sensor and Server Data
Fleet Data via Satellite
Cross-Department Operations Data (Scheduling/ERP)
Data Integration
Data Integration
The Future
• Scaling up / enabling more data scientists
• Model management
• Improved productivity
• Support for containerized applications
Pentaho ML Orchestration
• Makes data science teams more productive
• Broad support for open source libraries in various languages
Summary
• What is Data Science
• Common Use Cases and Algorithms
• The Data Science Process
• Building a Data Science Team
• The Future
Next Steps
Want to learn more?
• Schedule a Meet the Expert
• Read Mark Hall's Machine Learning with Pentaho Blog