Big Data for Managers: From Hadoop to Streaming and Beyond
www.scispike.com Copyright © SciSpike 2016
Dr. Vladimir Bacvanski
§ Founder of SciSpike, a development, consulting, and training firm
§ Passionate about software and data
§ PhD in computer science, RWTH Aachen, Germany
§ Architect, consultant, mentor
§ Custom development: Scalable Web and IoT systems
§ Training and mentoring in Big Data, Scala, node.js, software architecture
@OnSoftware
https://www.linkedin.com/in/vladimirbacvanski
Problems with Relational Stores
§ Data that does not naturally fit into tables → impedance mismatch
§ Development time often too long
§ Dealing with unstructured data
§ Performance problems
§ Difficult to run on clusters
§ Cost
Structured and Unstructured Data Sources
Structured Data Sources
• Existing databases
• ERP/CRM/BI systems
• Inventory
• Supply chain
Unstructured Data Sources
• Server logs
• Search engine logs
• Browsing logs
• E-commerce records
• Social media
• Voice
• Video
• Sensor data
NoSQL Impact
[Chart: cost/performance versus data volume (1M, 1B, 1T, 1Q, ... HUGE), with volume, disks, and processors each growing x1000 per step]
§ Relational database: today, doable but expensive and slow; tomorrow, volume is out of reach
§ Big Data + NoSQL: stabilize cost and increase performance; enable unlimited volume growth
Scale Up vs. Scale Out
[Charts: capability plotted against cost, one chart for scale up and one for scale out]
A Common Pattern for Processing Large Data
Load a large set of records onto a set of machines
"Map": extract something interesting from each record
Shuffle and sort intermediate results (key/value pairs)
"Reduce": aggregate intermediate results
Store end result
Two Key Aspects of Hadoop
§ MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)
§ Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
MapReduce Example: Word Count
§ Word Count is the "Hello World" of Big Data
– You will see various technologies implementing it
– A good first step to compare the expressiveness of Big Data tools
Input (pets.txt): dog cat bird dog cat bird dog dog cat
Splits: dog cat bird | dog cat bird | dog dog cat
Map: dog, 1 cat, 1 bird, 1 | dog, 1 cat, 1 bird, 1 | dog, 1 dog, 1 cat, 1
Shuffle: dog, 1 dog, 1 dog, 1 dog, 1 | cat, 1 cat, 1 cat, 1 | bird, 1 bird, 1
Reduce: dog, 4 | cat, 3 | bird, 2
Output (pet_freq.txt): dog, 4 cat, 3 bird, 2
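The word count flow on this slide can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are made up for this sketch, and the input mirrors pets.txt.

```python
from collections import defaultdict

# Input splits, mirroring pets.txt from the slide (one split per line).
SPLITS = ["dog cat bird", "dog cat bird", "dog dog cat"]

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for split in SPLITS for pair in map_phase(split)]
pet_freq = reduce_phase(shuffle_phase(mapped))
print(pet_freq)
```

The resulting counts (dog 4, cat 3, bird 2) match pet_freq.txt. In a real cluster the programmer writes only the map and reduce logic, and the framework performs the shuffle.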
The MapReduce Programming Model
§ "Map" step:
– Input split into pieces
– Worker nodes process individual pieces in parallel (under global control of the JobTracker node)
– Each worker node stores its result in its local file system, where a reducer is able to access it
§ "Reduce" step:
– Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the JobTracker)
– Multiple reduce tasks can parallelize the aggregation
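A rough sketch of the two steps, assuming threads stand in for worker nodes and a fixed number of reduce tasks. `NUM_REDUCERS`, `partition`, `map_task`, `reduce_task`, and `run_job` are names invented for this illustration. A real job would be scheduled by the framework across machines.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_REDUCERS = 2  # assumed number of reduce tasks for this sketch

def partition(key):
    """Pick the reduce task for a key; a stable hash keeps each key on one reducer."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def map_task(split):
    """Process one input split and bucket its output per reduce task."""
    buckets = defaultdict(list)
    for word in split.split():
        buckets[partition(word)].append((word, 1))
    return buckets

def reduce_task(pairs):
    """Aggregate the pairs routed to one reduce task."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(splits):
    # Map tasks run in parallel, one per input split.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, splits))
    # Each reduce task collects its bucket from every map output.
    reducer_inputs = [
        [pair for output in map_outputs for pair in output.get(r, [])]
        for r in range(NUM_REDUCERS)
    ]
    # Reduce tasks are independent of each other, so they can run in parallel too.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(reduce_task, reducer_inputs))

results = run_job(["dog cat bird", "dog cat bird", "dog dog cat"])
```

Because every occurrence of a key is routed to the same reduce task, each partial result is final for its keys and the outputs can simply be concatenated.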
Separation of Work
Programmers
• Map
• Reduce
Framework
• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors
How To Create MapReduce Jobs
§ Java API
– Low level, very flexible
– Time consuming development
§ Streaming API
– A simple, productive model for Python and Ruby
§ Hive
– Open source language / Apache project
– Provides a SQL-like interface to Hadoop
§ Pig
– Data flow language / Apache project
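To give a feel for the Streaming API style: streaming scripts read lines from standard input and write tab-separated key/value lines to standard output. The sketch below keeps that line-oriented shape but takes iterables of lines instead, so it can run without a cluster. The `mapper` and `reducer` names and the local sort are assumptions of this example. In a real job each function would be its own script reading `sys.stdin`.

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts per word; input must arrive sorted by key, as it does after the shuffle."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the framework: map, sort by key, then reduce.
text = ["dog cat bird", "dog cat bird", "dog dog cat"]
output = list(reducer(sorted(mapper(text))))
```

Compared with the Java API, the whole job is two short functions, which is the productivity gain the slide refers to.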
The Big Picture: NoSQL + Hadoop in Applications
[Diagram: applications backed by several stores, each holding the data it suits best]
Columnar: price updates, logs
Document: product info
Graph: customer and agent relationships
RDB: XA data
Hadoop: oper. analytics, price analytics
Key/Value: session data
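Streaming: A New Paradigm
§ Conventional processing: static data
– Data is stored first; queries run against it to produce results
§ Real-time processing: streaming data
– Queries are in place first; data flows through them to produce results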
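A minimal sketch of the streaming idea, assuming a standing query that computes a moving average. The query is defined once and produces a result each time a new value arrives, instead of being run against a finished data set. `moving_average` and `sensor_feed` are hypothetical names, and the feed is cut short so the example terminates.

```python
from collections import deque

def moving_average(readings, window=3):
    """Standing query: emit the average of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)

def sensor_feed():
    """Hypothetical unbounded source, cut short here so the sketch terminates."""
    yield from [10, 20, 30, 40]

results = list(moving_average(sensor_feed()))
```

The query never needs the full data set in storage; it keeps only the small window of state required to answer.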
Common Streaming Applications
§ Personalization
§ Search
§ Revenue optimization
§ User events
§ Content feeds
§ Log processing
§ Monitoring
§ Recommendations
§ Ads
§ Notable users:
– Twitter
– Yahoo
– Spotify
– Cisco
– Flickr
– Weather Channel
Beyond Hadoop: Spark & Flink
[Diagram: processing engines MapReduce, Tez, Spark, and Flink]
Apache Spark
§ Important features
– In-memory data
– Resilient Distributed Datasets (RDDs)
• Datasets can rebuild themselves if failure occurs
– Rich set of operators
§ Efficient:
– Up to 10x (on disk) to 100x (in memory) faster than Hadoop MR
– 2 to 5 times less code (rich APIs in Scala/Java/Python)
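The idea that a dataset can rebuild itself can be illustrated with a toy class that records how it was derived. This is not the Spark API: `ToyDataset`, `lose_cache`, and the single-machine design are assumptions made for this sketch, which only shows the principle of recomputing lost data from its lineage.

```python
class ToyDataset:
    """Toy stand-in for an RDD: it remembers how it was derived, not just its data."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # original input, for a root dataset
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's items
        self._cache = None          # in-memory copy; may be lost on failure

    def map(self, fn):
        """Record a new derived dataset; nothing is computed yet."""
        return ToyDataset(parent=self, transform=fn)

    def collect(self):
        """Return the items, recomputing from lineage if the cache is gone."""
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = [self.transform(x) for x in self.parent.collect()]
        return self._cache

    def lose_cache(self):
        """Simulate a node failure that wipes the in-memory data."""
        self._cache = None

numbers = ToyDataset(source=[1, 2, 3])
squares = numbers.map(lambda x: x * x)
first = squares.collect()
squares.lose_cache()          # simulated failure
rebuilt = squares.collect()   # rebuilt from the recorded lineage
```

Keeping the recipe rather than extra copies of the data is what lets in-memory processing remain fault tolerant.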
Spark Architecture
§ A powerful set of tools
§ Beyond traditional Hadoop
Source: http://spark.apache.org
Data Sharing in Apache Spark
[Diagram: HDFS → Iteration 1 → Result 1 (held in cluster memory) → Iteration 2 → Result 2 (held in cluster memory); Query 1 and Query 2]
Apache Flink
§ Execution:
– Programs compiled into an execution plan
– Plan is optimized
– Executed
§ Design goals:
– High performance
– Hybrid batch and streaming runtime
– Simplicity for the developer
– Rich libraries
– Integration with many systems
Apache Flink Components
§ Integration with Hadoop YARN, MapReduce, HBase, Cassandra, Kafka, …
§ Execution engine for Apache Beam (Google Dataflow)
Flink Optimization and Execution
§ Optimizer selects an execution plan
§ Similar to what we have in relational databases
§ Optimal plan depends on the size of the input files
§ Run as standalone or on top of Hadoop
§ Integration with many Hadoop technologies
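To illustrate why the best plan depends on input size, here is a toy rule for choosing a join strategy. It is not Flink's optimizer: the function name, the strategy labels, and the `broadcast_limit` threshold are all invented for this sketch. It only shows the kind of size-based decision a cost-based optimizer makes.

```python
def choose_join_strategy(left_rows, right_rows, broadcast_limit=10_000):
    """Pick a join plan from input size estimates (the threshold is an assumed value)."""
    smaller = min(left_rows, right_rows)
    if smaller <= broadcast_limit:
        # A small input can be copied to every node; the large one stays where it is.
        side = "left" if left_rows <= right_rows else "right"
        return f"broadcast {side} input"
    # Two large inputs: redistribute both by join key so matching keys meet.
    return "repartition both inputs"

small_vs_large = choose_join_strategy(500, 5_000_000)
large_vs_large = choose_join_strategy(2_000_000, 3_000_000)
```

The same program can therefore run with different physical plans on different data, without the developer changing any code.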
Flink & Spark: The Advantages and Outlook
§ Less IO overhead than conventional Hadoop
§ Caching
§ Iterative algorithms
§ Unifying batch and stream computing
§ Scala as a natural, expressive language for Big Data
– Other languages: Python, Java, R
§ Beware of less mature components
Typical NoSQL Systems
§ Non-relational
§ Distributed
§ Horizontally scalable
§ No need for a fixed schema
§ Several established players
§ Systems are specialized
NoSQL Stores and Their Categories
§ Choose a store that is the best match for your application
§ It is fine to use several different stores
– "Polyglot persistence"
[Diagram: the four categories Key-Value, Column-Family, Document-Oriented, and Graph DB]
NoSQL Stores: Scale vs. Complexity of Data
[Chart: scalability versus data complexity for Key-Value, Column-Family, Document-Oriented, and Graph DB stores, with a region marked "needs of most applications"]
Key-Value Stores
§ Key → Value mapping
§ Large, persistent Map ("hash table")
– Values could be lists and hashes
§ Easy to use
§ Scale very well
§ Data model may be too simple for most applications
§ Systems:
– Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
§ Use when data model is very simple and scalability essential
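The data model is small enough to sketch with an in-memory dictionary. `ToyKeyValueStore` and the `session:42` key are hypothetical and not the client API of any of the listed systems. The point is that every operation goes through the key.

```python
class ToyKeyValueStore:
    """In-memory stand-in for a key-value store: the only access path is the key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = ToyKeyValueStore()
# Values can be structured (lists, hashes), but the store does not look inside them.
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
session = store.get("session:42")
store.delete("session:42")
```

There is no way to ask "which sessions contain a book" without reading every value, which is the sense in which the model may be too simple for some applications.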
Typical Use Cases
§ The data model is very simple!
– Actual data can be JSON
§ Session data
§ User preferences and profiles
§ Shopping cart
§ If another NoSQL store is good enough, you may want to skip this and let a Column or Document store handle it
Column-Family
§ "Column-family": similar to a table
– Table is sparse
§ Key → (Column : Value)*
§ Columns have names
§ Can be indexed
§ Can store complex data
– Denormalize!
§ Systems:
– Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
§ Use when scalability is essential
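The Key → (Column : Value)* shape can be pictured as a dictionary of dictionaries. This is a sketch only: the `events` table, the `put` and `get` helpers, and the server names are invented, and real column-family stores add persistence, distribution, and column grouping on top.

```python
from collections import defaultdict

# Row key -> {column name: value}; each row stores only the columns it has.
events = defaultdict(dict)

def put(row_key, column, value):
    events[row_key][column] = value

def get(row_key, column, default=None):
    return events.get(row_key, {}).get(column, default)

put("server-1", "cpu", 0.71)
put("server-1", "last_error", "disk full")
put("server-2", "cpu", 0.12)          # no "last_error" column for this row

columns_used = {key: sorted(cols) for key, cols in events.items()}
```

Rows with different column sets cost nothing extra, which is what "table is sparse" means in practice.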
Typical Use Cases
§ High insert volume: logging
§ Real-time updates
§ Content management
§ Expiring content
§ Cross-datacenter replication
§ MapReduce analytics over stored data
§ You don't need conventional (ACID) transactions
Document Stores
§ JSON, BSON, XML
§ No schema
§ Indexes improve performance
§ Easy transition from RDBMS
§ Systems
– MongoDB, CouchDB, Couchbase
§ Use when data is in semi-structured form
§ Often seen in new Web applications
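A small sketch of schema-free documents and of why indexes help, using plain Python dictionaries as stand-ins for JSON documents. The `products` collection, its fields, and `build_index` are made up for this example and do not represent any particular database's API.

```python
import json
from collections import defaultdict

# Documents in one collection need not share the same fields.
products = [
    {"_id": 1, "name": "Laptop", "category": "computers", "ram_gb": 16},
    {"_id": 2, "name": "Novel", "category": "books", "author": "A. Writer"},
    {"_id": 3, "name": "Tablet", "category": "computers"},
]

def build_index(docs, field):
    """Map each value of `field` to the ids of the documents that contain it."""
    index = defaultdict(list)
    for doc in docs:
        if field in doc:
            index[doc[field]].append(doc["_id"])
    return dict(index)

by_category = build_index(products, "category")
computers = by_category["computers"]          # lookup without scanning every document
as_json = json.dumps(products[1], sort_keys=True)
```

Unlike a key-value store, the store can see inside each document, so queries and indexes on individual fields are possible.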
Typical Use Cases
§ Logging
– Especially with variable content
§ Product information
§ Customer information
§ Content management
§ Data to be stored has a format that varies over time
– Flexible schema
§ Web analytics
Graph Databases
§ Nodes with properties
§ Nodes connected through relationships
§ Can model very complex graph data
– Social networks
§ Systems:
– Neo4j, InfiniteGraph, TitanDB, OrientDB
§ Use when data is a (complex) graph
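A tiny social graph shows the model: nodes with properties, typed relationships, and a query that follows relationships instead of joining tables. The people, the `FRIEND` relationship type, and both helper functions are hypothetical. A graph database would express the same traversal in its own query language.

```python
# Nodes carry properties; relationships are stored as typed edges.
nodes = {
    "alice": {"city": "Austin"},
    "bob": {"city": "Boston"},
    "carol": {"city": "Chicago"},
    "dave": {"city": "Denver"},
}
edges = [
    ("alice", "FRIEND", "bob"),
    ("bob", "FRIEND", "carol"),
    ("carol", "FRIEND", "dave"),
]

def neighbors(node, rel_type):
    """Nodes linked to `node` by `rel_type`, treating the relationship as undirected."""
    found = set()
    for source, rel, target in edges:
        if rel == rel_type:
            if source == node:
                found.add(target)
            elif target == node:
                found.add(source)
    return found

def friends_of_friends(node):
    """People two FRIEND hops away who are not already direct friends."""
    direct = neighbors(node, "FRIEND")
    second = set()
    for friend in direct:
        second |= neighbors(friend, "FRIEND")
    return second - direct - {node}

suggestions = friends_of_friends("alice")
```

Multi-hop questions like this one are where graph stores pay off, because each extra hop in a relational schema means another self-join.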
Typical Use Cases
§ Highly interconnected data
§ Social graphs
§ Party relationships in an enterprise
§ Location based services
§ Purchasing analytics and recommendations
§ Often combined with other systems to store the bulk of data
– Graph database can focus on relationships
Integrating Relational, Streams, and Hadoop
[Diagram: traditional/relational data sources and non-traditional/non-relational data sources (varied data formats: semi-structured, unstructured, ...) feed an event system and NoSQL stores]
§ Streams: in-motion analytics → ultra low latency results
§ Data + Big Data: data analytics → results
§ Traditional warehouse (database & warehouse): at-rest data analytics → results
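Lambda Architecture
[Diagram: incoming data feeds both the batch layer and the event (speed) layer]
§ Batch layer: master dataset → batch update → batch view (serving layer)
§ Event (speed) layer: real-time data → real-time update → rolling values
§ Queries: merge results from the batch view and the rolling values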
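A minimal sketch of the Lambda Architecture idea, assuming page-view counting as the workload. The batch layer recomputes a view from the full master dataset, the speed layer keeps rolling values for events that arrived since the last batch run, and a query merges the two. All function names and the sample events are invented for illustration.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the full master dataset."""
    return Counter(event["page"] for event in master_dataset)

def update_speed_view(speed_view, event):
    """Speed layer: fold one new event into the rolling counts."""
    speed_view[event["page"]] += 1

def query(page, batch_view, speed_view):
    """Serving side: merge the batch result with the recent rolling value."""
    return batch_view[page] + speed_view[page]

master_dataset = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch_view = build_batch_view(master_dataset)

speed_view = Counter()                      # events since the last batch run
for event in [{"page": "/home"}, {"page": "/docs"}]:
    update_speed_view(speed_view, event)

home_views = query("/home", batch_view, speed_view)
```

The batch view is complete but slightly stale, and the rolling values are fresh but cover only recent data. Merging them at query time gives answers that are both complete and current.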
Master Data Management and Governance
§ Big Data and NoSQL stores can easily become a bigger mess than relational stores
§ Introduce a practical plan
– Avoid lengthy and cumbersome governance
– Actual use should be the driving force
– Start slow
§ Be ready for change
– The technologies change rapidly
§ Focus on business outcomes
Succeeding with Big Data and NoSQL
1. Actively look for solutions where the right store can ease the pain
2. Make sure you deliver tangible value to clients
3. After you get your first apps to work: create a Big Data introduction and governance plan
4. Prioritize: do the most useful thing for the business first
5. Integrate with existing IT
6. Make sure you hire or grow your Big Data champions
7. Field is immature: look out for new tools and techniques
Conclusions
– Hadoop and NoSQL address the weak points of relational systems:
• Scale
• Performance
• Unstructured and semi-structured data
– Streaming addresses the processing of data in real time
– Integrate with conventional technologies!
– Spark and Flink: the next generation Big Data systems
Questions?