Data Informatics
Seon Ho Kim, [email protected]
What is Big Data?
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Trends Leading to Data Flood
• More data is generated:
  – Bank, telecom, other business transactions...
  – Scientific data: astronomy, biology, etc.
  – Web, text, and e-commerce
Who’s Generating Big Data
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
Unstructured Data
• Unstructured data is a generic label for describing any corporate information that is not in a database.
  – Textual or non-textual
  – Facebook, YouTube, Twitter, Weblog, etc.
• Storage and search problem
  – Just adding more hardware to house data while ignoring its content no longer suffices
Characteristics of Big Data: 1 - Scale (Volume)
• Data volume
  – 44x increase from 2009 to 2020
  – From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in collected/generated data
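The two figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the slide's numbers (0.8 ZB in 2009, 35 ZB in 2020); the constant compound growth rate is an assumption made for illustration, not a claim from the slides.

```python
# Figures taken from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
start_year, end_year = 2009, 2020

growth_factor = end_zb / start_zb      # overall multiple, about 44x
years = end_year - start_year

# Assumption: constant compound growth, i.e. end = start * annual_factor ** years.
annual_factor = growth_factor ** (1 / years)

print(f"overall growth: {growth_factor:.2f}x over {years} years")
print(f"implied annual growth: {(annual_factor - 1) * 100:.0f}% per year")
```

Under that assumption the volume grows by roughly 40% every year, which is what "increasing exponentially" means in practice.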
Characteristics of Big Data: 2 - Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
To extract knowledge → all these types of data need to be linked together
Characteristics of Big Data: 3 - Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions → missed opportunities
• Examples
  – E-promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
  – Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction
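The healthcare example implies processing each reading as it arrives rather than in a later batch. One possible sketch is an online z-score check built on Welford's running mean and variance. The names `RunningStats` and `detect_anomalies`, the threshold of 3 standard deviations, the warm-up length, and the sample heart-rate values are all hypothetical choices for illustration.

```python
import math


class RunningStats:
    """Welford's online mean/variance: one pass, constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running mean."""
    stats = RunningStats()
    for i, x in enumerate(stream):
        std = stats.std()
        if stats.n >= warmup and std > 0 and abs(x - stats.mean) > threshold * std:
            yield i, x  # flag immediately; the outlier is not added to the statistics
        else:
            stats.update(x)


# Hypothetical sensor readings (beats per minute).
heart_rate = [72, 75, 71, 74, 73, 72, 76, 140, 74, 73]
alerts = list(detect_anomalies(heart_rate))
print(alerts)  # [(7, 140)]
```

Each reading is examined once and then discarded, so memory use stays constant no matter how long the stream runs.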
Big Data: 3 V’s
Some Make it 4 V’s
The Model Has Changed…
• The model of generating/consuming data has changed
Old model: few companies are generating data, all others are consuming data
New model: all of us are generating data, and all of us are consuming data
More Formally Big Data
• Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
• Challenges include:
  – Management (capture, store, process, share, etc.). For example, the Hadoop ecosystem.
  – Analysis (predictive analysis or other methods to extract value from data). For example, machine learning.
  – Privacy: open question
• Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk.
Management
Exploring Big Data
Gathering & preparing data (95%)
  – The time for developing an analysis (initially working with big data)
  – ETL process: taking a raw feed of data, reading it, and producing a usable set of output
Analyzing data (5%)
Extract, Transform, Load
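The ETL steps named above can be sketched with the Python standard library alone. Everything specific here is an assumption for illustration: the `RAW_FEED` sample, the `payments` table, and the rule that malformed rows are dropped. A real pipeline would read from files or a message queue and load into a production store.

```python
import csv
import io
import sqlite3

# Hypothetical raw feed: stray whitespace, inconsistent case, one bad value.
RAW_FEED = """user_id,amount,currency
1, 19.99 ,usd
2,not_a_number,usd
3,5.00,USD
"""


def extract(text):
    """Extract: read the raw feed into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: coerce types, normalize values, drop malformed rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((int(row["user_id"]), amount, row["currency"].strip().upper()))
    return clean


def load(rows, conn):
    """Load: write the usable rows into a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_FEED)), conn)
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM payments"
).fetchone()
print(row_count, total_amount)
```

Even in this toy version most of the code deals with cleaning the input, which mirrors the slide's 95% / 5% split between preparation and analysis.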
Why Machine Learning?
• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
• There is no need to “learn” to calculate payroll
• Learning is used when:
  – Human expertise does not exist (navigating on Mars)
  – Humans are unable to explain their expertise (speech recognition)
  – Solution changes in time (routing on a computer network)
  – Solution needs to be adapted to particular cases (user biometrics)
What We Talk About When We Talk About “Learning”
• Learning models from data of particular examples
• Data is cheap and abundant; knowledge is expensive and scarce.
• Example in retail: customer transactions to consumer behavior: people who bought “X” also bought “Y”
• Build a model that is a good and useful approximation to the data.
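The retail example ("people who bought X also bought Y") can be illustrated by counting co-occurrences in baskets. The transactions below are made up, and `confidence` is a hypothetical helper that estimates the conditional probability P(Y | X) from those counts. It is a minimal sketch, not a full association-rule miner.

```python
from collections import Counter
from itertools import combinations

# Made-up market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    # Sort so each unordered pair is always stored under the same key.
    pair_counts.update(combinations(sorted(basket), 2))


def confidence(x, y):
    """Estimate P(y in basket | x in basket) from co-occurrence counts."""
    pair = tuple(sorted((x, y)))
    return pair_counts[pair] / item_counts[x]


print(confidence("bread", "milk"))    # 0.75: 3 of the 4 bread baskets contain milk
print(confidence("beer", "diapers"))  # 2 of the 3 beer baskets contain diapers
```

The raw transactions are the cheap, abundant data. The rule "bread buyers usually buy milk" is the scarcer knowledge extracted from them.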
What is Machine Learning?
• Optimize a performance criterion using example data or past experience.
• Role of statistics:
  – Build mathematical models
  – Inference from samples
• Role of computer science:
  – Efficient algorithms to
    • Solve the optimization problem
    • Represent and evaluate the model for inference
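As one concrete reading of "optimize a performance criterion using example data", the sketch below fits a line to a handful of points by gradient descent on mean squared error. The data, the learning rate, and the iteration count are illustrative assumptions. The model (a line) is the statistical part, and the loop that minimizes the error is the algorithmic part.

```python
# Example data, generated from y = 2x + 1 for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def mse(w, b):
    """Performance criterion: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0  # initial model parameters
lr = 0.05        # learning rate (assumed; must be small enough to converge)

for _ in range(2000):
    # Gradients of the MSE with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3), round(mse(w, b), 6))
```

On this noise-free data the parameters approach w = 2 and b = 1. With real data the error would not reach zero, and the fitted line would be only "a good and useful approximation".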
The Structure of Big Data
• Structured: most traditional data sources
• Semi-structured: many sources of big data
• Unstructured: video data, audio data
Applications
• Association
• Supervised learning: learning from known values
  – Classification (recognition)
  – Regression
• Unsupervised learning: learning from unknown values
  – Clustering (grouping)
• Reinforcement learning: learning a policy, a sequence of outputs
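To make the supervised / unsupervised distinction concrete, the sketch below runs both on the same one-dimensional toy data. The supervised part is a nearest-centroid classifier that uses the known labels. The unsupervised part is a basic k-means that never sees them. The data and the function names (`fit_centroids`, `classify`, `kmeans_1d`) are hypothetical.

```python
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]  # the "known values"


# Supervised: learn one centroid per known label, classify by nearest centroid.
def fit_centroids(xs, ys):
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}


def classify(x, centroids):
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Unsupervised: k-means finds groups without being given any labels.
def kmeans_1d(xs, k=2, iters=10):
    ordered = sorted(xs)
    centers = ordered[:: max(1, len(ordered) // k)][:k]  # simple spread-out init
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Recompute each center; keep the old one if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers


centroids = fit_centroids(points, labels)
print(classify(1.1, centroids), classify(4.0, centroids))  # low high
print(kmeans_1d(points))
```

On this data both approaches recover the same two groups. The difference is that k-means has no names for them, only centers.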
Techniques Creating Business Values
• Anomaly or outlier detection
• Association rule learning
• Clustering analysis
• Classification analysis
• Regression analysis
Big Data Visualization
Big Data Analysis Example
What’s Driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
Value of Big Data Analytics
• Big data is more real-time in nature than traditional data warehouse (DW) applications
• Traditional DW architectures are not well-suited for big data apps
• Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data applications
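The shared-nothing idea can be sketched in a single process: hash-partition records so each "node" owns a disjoint slice, aggregate locally with no shared state, then merge the partial results. The node count, the click data, and the function names are assumptions, and the nodes are simulated sequentially here rather than run on separate machines.

```python
import zlib
from collections import Counter


def partition(records, num_nodes):
    """Route each record to exactly one node by hashing its key."""
    shards = [[] for _ in range(num_nodes)]
    for key, value in records:
        # crc32 gives a stable hash, so a given key always lands on the same node.
        shards[zlib.crc32(key.encode()) % num_nodes].append((key, value))
    return shards


def local_aggregate(shard):
    """Runs on one node and touches only that node's data (nothing shared)."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def merge(partials):
    """Combine the per-node results into the final answer."""
    result = Counter()
    for partial in partials:
        result.update(partial)  # Counter.update adds counts
    return result


# Hypothetical click log: (page, count) pairs.
clicks = [("home", 1), ("cart", 1), ("home", 1), ("search", 1), ("cart", 1), ("home", 1)]
shards = partition(clicks, num_nodes=3)
totals = merge(local_aggregate(s) for s in shards)
print(dict(totals))
```

Because no node needs another node's data, adding nodes only changes `num_nodes`; that is the scale-out property the slide refers to.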
Challenges in Handling Big Data
• The bottleneck is in technology
  – New architecture, algorithms, techniques are needed
• Also in technical skills
  – Experts in using the new technology and dealing with big data
Big Data Summary
• Big data is being generated everywhere
  – Humans and machines
• Big data analysis is already everywhere
• Still risks:
  – Overwhelmed: right problem, right person?
  – Cost escalates fast: how much data, accuracy?
  – Privacy issue: what is tolerable?
• Big potential for new startup businesses too!