big data for managers: from hadoop to streaming and beyond

Post on 16-Apr-2017

Category:

Technology


Big Data for Managers: From Hadoop to Streaming and Beyond

Dr. Vladimir Bacvanski
vladimir.bacvanski@scispike.com

@OnSoftware

www.scispike.com Copyright © SciSpike 2016

Dr. Vladimir Bacvanski

§  Founder of SciSpike, a development, consulting, and training firm
§  Passionate about software and data
§  PhD in computer science, RWTH Aachen, Germany
§  Architect, consultant, mentor
§  Custom development: Scalable Web and IoT systems
§  Training and mentoring in Big Data, Scala, node.js, software architecture

@OnSoftware

https://www.linkedin.com/in/vladimirbacvanski


Problems with Relational Stores

§  Data that does not naturally fit into tables → Impedance mismatch
§  Development time often too long
§  Dealing with unstructured data
§  Performance problems
§  Difficult to run on clusters
§  Cost


Structured and Unstructured Data Sources

Structured Data Sources

• Existing databases
• ERP / CRM / BI systems
• Inventory
• Supply chain

Unstructured Data Sources

• Server logs
• Search engine logs
• Browsing logs
• E-Commerce records
• Social media
• Voice
• Video
• Sensor data


NoSQL Impact

[Figure: cost/performance against data volume (1M, 1B, 1T, 1Q, … HUGE!!!, each step x1000) for disks and processors.
Relational Database: today, doable but expensive and slow; tomorrow, volume is out of reach.
Big Data + NoSQL: stabilize cost and increase performance; enable unlimited volume growth.]


Scale Up vs. Scale Out

[Figure: two charts of cost against capability, one for Scale Up and one for Scale Out.]


A Common Pattern for Processing Large Data

§  Load a large set of records onto a set of machines
§  Extract something interesting from each record ("Map")
§  Shuffle and sort intermediate results
§  Aggregate intermediate results ("Reduce")
§  Store end result

Key/Value pairs
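The five steps of this pattern can be sketched on a single machine. This is only an illustration under stated assumptions: `run_map_reduce`, the log format, and the sample records are made up for this note and are not part of Hadoop or of the original slides.

```python
from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Hypothetical single-machine helper that walks through the pattern."""
    # "Map": extract (key, value) pairs from each loaded record.
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle and sort: bring all values for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # "Reduce": aggregate each group, then return (store) the end result.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

# Made-up records in the form "status_code bytes_sent".
log_lines = ["200 512", "404 0", "200 2048", "500 128", "200 100"]

def bytes_by_status(line):
    status, size = line.split()
    yield status, int(size)

def total(key, values):
    return sum(values)

result = run_map_reduce(log_lines, bytes_by_status, total)
print(result)
```

On a real cluster the framework would distribute each step across machines; only the mapper and reducer functions would stay the same in spirit.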


Two Key Aspects of Hadoop

§  MapReduce framework
    –  How Hadoop understands and assigns work to the nodes (machines)
§  Hadoop Distributed File System = HDFS
    –  Where Hadoop stores data
    –  A file system that spans all the nodes in a Hadoop cluster
    –  It links together the file systems on many local nodes to make them into one big file system


MapReduce Example: Word Count

§  Word Count is the "Hello World" of Big Data
    –  You will see various technologies implementing it
    –  A good first step to compare the expressiveness of Big Data tools

Input (pets.txt): dog cat bird dog cat bird dog dog cat

Input splits:
    dog cat bird
    dog cat bird
    dog dog cat

Map:
    dog, 1   cat, 1   bird, 1
    dog, 1   cat, 1   bird, 1
    dog, 1   dog, 1   cat, 1

Shuffle:
    dog, 1   dog, 1   dog, 1   dog, 1
    cat, 1   cat, 1   cat, 1
    bird, 1  bird, 1

Reduce:
    dog, 4
    cat, 3
    bird, 2

Output (pet_freq.txt): dog, 4   cat, 3   bird, 2
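The same word count can be traced in a few lines of plain Python. This is a minimal sketch that mirrors the Map, Shuffle, and Reduce stages of the slide on one machine; the variable names are chosen for this note and no Hadoop API is involved.

```python
from itertools import groupby

# The three input splits of pets.txt from the slide.
splits = ["dog cat bird", "dog cat bird", "dog dog cat"]

# Map: emit (word, 1) for every word in every split.
mapped = [(word, 1) for line in splits for word in line.split()]

# Shuffle: sort by key so that equal words end up next to each other.
shuffled = sorted(mapped, key=lambda pair: pair[0])

# Reduce: sum the counts of each group of equal words.
pet_freq = {
    word: sum(count for _, count in group)
    for word, group in groupby(shuffled, key=lambda pair: pair[0])
}
print(pet_freq)
```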


The MapReduce Programming Model

§  "Map" step:
    –  Input split into pieces
    –  Worker nodes process individual pieces in parallel (under global control of the JobTracker node)
    –  Each worker node stores its result in its local file system where a reducer is able to access it
§  "Reduce" step:
    –  Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the JobTracker)
    –  Multiple reduce tasks can parallelize the aggregation
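A rough sketch of how map tasks and several reduce tasks can run side by side is shown here. It assumes a thread pool as a stand-in for worker nodes (Python threads illustrate the structure, not real cluster parallelism), and the function names and the crc32-based partitioning are choices made for this note, not Hadoop's implementation.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_piece(piece):
    """Map task: count the words in one piece of the input."""
    return Counter(piece.split())

def partition(word, num_reducers):
    """Decide which reduce task owns a key (crc32 is stable across runs)."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers

def reduce_partition(partial_counts, index, num_reducers):
    """Reduce task: aggregate only the keys assigned to this partition."""
    total = Counter()
    for counts in partial_counts:
        for word, n in counts.items():
            if partition(word, num_reducers) == index:
                total[word] += n
    return total

def word_count(pieces, num_reducers=2):
    with ThreadPoolExecutor() as pool:
        # Map step: each piece is processed independently.
        partial = list(pool.map(map_piece, pieces))
        # Reduce step: each reduce task handles its own share of the keys.
        reduced = list(pool.map(
            lambda i: reduce_partition(partial, i, num_reducers),
            range(num_reducers),
        ))
    merged = {}
    for part in reduced:
        merged.update(part)  # partitions never share a key
    return merged

print(word_count(["dog cat bird", "dog cat bird", "dog dog cat"]))
```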


Separation of Work

Programmers

• Map
• Reduce

Framework

• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors


How To Create MapReduce Jobs

§  Java API
    –  Low level, very flexible
    –  Time consuming development
§  Streaming API
    –  A simple, productive model for Python and Ruby
§  Hive
    –  Open source language / Apache sub-project
    –  Provides a SQL-like interface to Hadoop
§  Pig
    –  Data flow language / Apache sub-project
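To give a feel for the Streaming API style, here is a sketch of a word-count mapper and reducer in Python. With Hadoop Streaming, each would normally be a separate script that reads lines from standard input and writes tab-separated key/value lines to standard output; to keep the example self-contained, the two are written as functions over iterables and the framework's sort step is simulated with `sorted`. The function names are assumptions made for this note.

```python
from itertools import groupby

def mapper(lines):
    """Emit one "word<TAB>1" line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the framework: map, sort by key, reduce.
lines = ["dog cat bird", "dog cat bird", "dog dog cat"]
output = list(reducer(sorted(mapper(lines))))
print(output)
```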


The Big Picture: NoSQL + Hadoop in Applications

[Figure: applications backed by several stores, each labeled with the data it holds:
Columnar: price updates, logs
Document: product info
Graph: Customer Agent relationships
RDB: XA data
Hadoop: Oper. analytics, price analytics
Key/Value: session data]


Streaming: A New Paradigm

§  Conventional processing: static data

Data Queries Results

§ Real-time processing: streaming data

Queries Data Results
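The difference between the two diagrams can be sketched in code: in the conventional case the data sits still and a query is run over it, while in the streaming case the query is in place first and data flows through it as it arrives. The function names and the sensor readings are invented for this note.

```python
def query_static(stored_readings, threshold):
    """Conventional: the data is at rest, and the query runs over it once."""
    return [r for r in stored_readings if r > threshold]

def standing_query(stream, threshold):
    """Streaming: the query is registered up front; each arriving item
    passes through it and matching results are emitted immediately."""
    for reading in stream:
        if reading > threshold:
            yield reading

def sensor_feed():
    """Hypothetical source; a real feed would be unbounded."""
    yield from [18.5, 22.1, 30.4, 19.9, 31.7]

print(query_static([18.5, 22.1, 30.4], threshold=20))
for alert in standing_query(sensor_feed(), threshold=30):
    print("alert:", alert)
```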


Common Streaming Applications

§  Personalization
§  Search
§  Revenue optimization
§  User events
§  Content feeds
§  Log processing
§  Monitoring
§  Recommendations
§  Ads
§  Notable users:
    –  Twitter
    –  Yahoo
    –  Spotify
    –  Cisco
    –  Flickr
    –  Weather Channel


Beyond Hadoop: Spark & Flink

[Figure: MapReduce, Tez, Spark, Flink]


Apache Spark

§  Important Features
    –  In-memory data
    –  Resilient Distributed Datasets (RDDs)
        •  Datasets can rebuild themselves if failure occurs
    –  Rich set of operators
§  Efficient:
    –  Up to 10x (on disk) to 100x (in memory) faster than Hadoop MR
    –  2 to 5 times less code (rich APIs in Scala/Java/Python)
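The idea that a dataset "can rebuild itself" can be illustrated with a toy class that remembers how it was derived and replays those steps when its in-memory copy is lost. This is a sketch of the concept only: `LineageDataset` and its methods are hypothetical and are not the Spark RDD API.

```python
class LineageDataset:
    """Toy dataset that records its lineage (source plus transformations).
    Hypothetical class for illustration; not part of Spark."""

    def __init__(self, source, transforms=()):
        self._source = source              # zero-argument callable yielding base records
        self._transforms = tuple(transforms)
        self._cache = None                 # stands in for data held in memory

    def map(self, fn):
        return LineageDataset(self._source, self._transforms + (("map", fn),))

    def filter(self, fn):
        return LineageDataset(self._source, self._transforms + (("filter", fn),))

    def collect(self):
        if self._cache is None:            # lost or never computed: replay the lineage
            data = list(self._source())
            for kind, fn in self._transforms:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return list(self._cache)

    def simulate_failure(self):
        """Drop the in-memory copy, as if the node holding it had failed."""
        self._cache = None

odd_squares = (
    LineageDataset(lambda: range(1, 6))
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 1)
)
first = odd_squares.collect()
odd_squares.simulate_failure()
rebuilt = odd_squares.collect()            # recomputed from the recorded steps
print(first, rebuilt)
```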


Spark Architecture

§  A powerful set of tools
§  Beyond traditional Hadoop

Source: http://spark.apache.org


Data Sharing in Apache Spark

[Figure: data read from HDFS; Iteration 1 produces Result 1, held in cluster memory; Iteration 2 produces Result 2, held in cluster memory; Query 1 and Query 2.]


Apache Flink

§  Execution:
    –  Programs compiled into an execution plan
    –  Plan is optimized
    –  Executed
§  Design goals:
    –  High performance
    –  Hybrid batch and streaming runtime
    –  Simplicity for the developer
    –  Rich libraries
    –  Integration with many systems


Apache Flink Components

§  Integration with Hadoop YARN, MapReduce, HBase, Cassandra, Kafka, …
§  Execution engine for Apache Beam (Google Dataflow)


Flink Optimization and Execution

§  Optimizer selects an execution plan
§  Similar to what we have in relational databases
§  Optimal plan depends on the size of the input files
§  Run as standalone or on top of Hadoop
§  Integration with many Hadoop technologies
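Why the best plan depends on input size can be shown with a toy cost model for joining two inputs. Everything here is an assumption made for illustration: the two strategies, the cost formula (bytes moved over the network), and the names `Plan`, `candidate_plans`, and `choose_plan` are not Flink's actual optimizer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    strategy: str
    estimated_cost: int  # bytes moved over the network in this toy model

def candidate_plans(left_bytes, right_bytes, workers):
    smaller = min(left_bytes, right_bytes)
    return [
        # Copy the smaller input to every worker; the larger input stays in place.
        Plan("broadcast-smaller-input", smaller * workers),
        # Redistribute both inputs across the cluster by join key.
        Plan("repartition-both-inputs", left_bytes + right_bytes),
    ]

def choose_plan(left_bytes, right_bytes, workers):
    """Pick the candidate with the lowest estimated cost."""
    return min(candidate_plans(left_bytes, right_bytes, workers),
               key=lambda plan: plan.estimated_cost)

# A large input joined with a small one: broadcasting the small side is cheaper.
print(choose_plan(10_000_000_000, 1_000_000, workers=8))
# Two large inputs: broadcasting would copy too much, so repartitioning wins.
print(choose_plan(10_000_000_000, 8_000_000_000, workers=8))
```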


Flink & Spark: The Advantages and Outlook

§  Less IO overhead than conventional Hadoop
§  Caching
§  Iterative algorithms
§  Unifying batch and stream computing
§  Scala as a natural, expressive language for Big Data
    –  Other languages: Python, Java, R
§  Beware of less mature components


Typical NoSQL Systems

§  Non-relational
§  Distributed
§  Horizontally scalable
§  No need for a fixed schema
§  Several established players
§  Systems are specialized


NoSQL Stores and Their Categories

§  Choose the store that best matches your application
§  It is fine to use several different stores
    –  "Polyglot persistence"

[Figure: the four store categories: Key-Value, Column-Family, Document-Oriented, Graph DB]


NoSQL Stores: Scale vs. Complexity of Data

[Figure: Key-Value, Column-Family, Document-Oriented, and Graph DB stores placed on axes of scalability and complexity, with a marker for the needs of most applications.]


Key-Value Stores

§  Key → Value mapping
§  Large, persistent Map ("hash table")
    –  Values could be lists and hashes
§  Easy to use
§  Scale very well
§  Data model may be too simple for most applications
§  Systems:
    –  Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
§  Use when data model is very simple and scalability essential


Typical Use Cases

§  The data model is very simple!
    –  Actual data can be JSON
§  Session data
§  User preferences and profiles
§  Shopping cart
§  If another NoSQL store is good enough, you may want to skip this and let a Column or Document store handle it
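A session-data use case might look like the following sketch. The store is an in-memory dictionary standing in for a real key-value system; the `KeyValueStore` class, the key format, and the session contents are assumptions made for this note, not the client API of any product named on the slides.

```python
import json

class KeyValueStore:
    """In-memory stand-in for a key-value store (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()

# The store only sees an opaque value; the application chooses JSON.
session = {"user": "alice", "cart": ["sku-1", "sku-7"]}
store.put("session:42", json.dumps(session))

loaded = json.loads(store.get("session:42"))
print(loaded["cart"])
```

Note that the only lookup is by key: finding "all sessions with sku-7 in the cart" would need a scan or a different kind of store.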


Column-Family

§  "Column-family": similar to a table
    –  Table is sparse
§  Key → (Column: Value)*
§  Columns have names
§  Can be indexed
§  Can store complex data
    –  Denormalize!
§  Systems:
    –  Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
§  Use when scalability is essential
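The "Key → (Column: Value)*" shape can be sketched as a dictionary of dictionaries, where each row holds only the columns it actually has. The `ColumnFamily` class and the price data are made up for this note and do not reflect the API of any listed system.

```python
from collections import defaultdict

class ColumnFamily:
    """Sparse table: row key -> {column name: value}. Toy model only."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def get_row(self, row_key):
        return dict(self._rows.get(row_key, {}))

prices = ColumnFamily()
# Denormalized: one row per product, one named column per day of price updates.
prices.put("sku-1", "2017-04-01", 9.99)
prices.put("sku-1", "2017-04-02", 9.49)
prices.put("sku-2", "2017-04-02", 20.00)

print(prices.get_row("sku-1"))
print(prices.get("sku-2", "2017-04-01"))  # None: the column is simply absent
```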


Typical Use Cases

§  High insert volume: logging
§  Real-time updates
§  Content management
§  Expiring content
§  Cross-datacenter replication
§  MapReduce analytics over stored data
§  You don't need conventional (ACID) transactions


Document Stores

§  JSON, BSON, XML
§  No schema
§  Indexes improve performance
§  Easy transition from RDBMS
§  Systems
    –  MongoDB, CouchDB, CouchBase
§  Use when data is in semi-structured form
§  Often seen in new Web applications
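Two points from this slide, "no schema" and "indexes improve performance", can be sketched together. The `DocumentCollection` class below is a hypothetical in-memory model, not the API of MongoDB or any other listed system; it assumes indexed field values are hashable and ignores updates and deletes.

```python
from collections import defaultdict

class DocumentCollection:
    """Toy document collection with optional single-field indexes."""

    def __init__(self):
        self._docs = {}     # document id -> document (a dict; fields may vary)
        self._indexes = {}  # field name -> {field value -> set of document ids}

    def insert(self, doc_id, doc):
        self._docs[doc_id] = doc
        for field, index in self._indexes.items():
            if field in doc:
                index[doc[field]].add(doc_id)

    def create_index(self, field):
        index = defaultdict(set)
        for doc_id, doc in self._docs.items():
            if field in doc:
                index[doc[field]].add(doc_id)
        self._indexes[field] = index

    def find(self, field, value):
        if field in self._indexes:
            # Indexed lookup: go straight to the matching ids.
            ids = self._indexes[field].get(value, set())
            return [self._docs[i] for i in sorted(ids)]
        # No index: scan every document.
        return [doc for _, doc in sorted(self._docs.items()) if doc.get(field) == value]

products = DocumentCollection()
# Documents in one collection need not share the same fields.
products.insert("p1", {"name": "Novel", "category": "book", "pages": 320})
products.insert("p2", {"name": "Lamp", "category": "home", "watts": 40})
products.insert("p3", {"name": "Atlas", "category": "book"})

scanned = products.find("category", "book")
products.create_index("category")
indexed = products.find("category", "book")
print([doc["name"] for doc in indexed])
```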


Typical Use Cases

§  Logging
    –  Especially with variable content
§  Product information
§  Customer information
§  Content management
§  Data to be stored has format that varies over time
    –  Flexible schema
§  Web analytics


Graph Databases

§  Nodes with properties
§  Nodes connected through relationships
§  Can model very complex graph data
    –  Social networks
§  Systems:
    –  Neo4J, InfiniteGraph, TitanDB, OrientDB
§  Use when data is a (complex) graph
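A small sketch of nodes with properties, typed relationships, and a traversal ("friends of friends") follows. The `PropertyGraph` class, the "FRIEND" relationship type, and the sample people are assumptions for this note; real graph databases offer query languages and storage engines that this toy model does not attempt to represent.

```python
from collections import defaultdict

class PropertyGraph:
    """Toy in-memory property graph (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self._edges = defaultdict(set)  # (node id, relationship type) -> neighbor ids

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, a, rel_type, b):
        # Relationships are treated as undirected in this sketch.
        self._edges[(a, rel_type)].add(b)
        self._edges[(b, rel_type)].add(a)

    def neighbors(self, node_id, rel_type):
        return set(self._edges.get((node_id, rel_type), set()))

    def friends_of_friends(self, node_id):
        direct = self.neighbors(node_id, "FRIEND")
        second = set()
        for friend in direct:
            second |= self.neighbors(friend, "FRIEND")
        # Exclude the person themselves and the friends they already have.
        return second - direct - {node_id}

g = PropertyGraph()
for name in ["alice", "bob", "carol", "dave", "erin"]:
    g.add_node(name, kind="person")
g.relate("alice", "FRIEND", "bob")
g.relate("bob", "FRIEND", "carol")
g.relate("carol", "FRIEND", "dave")
g.relate("alice", "FRIEND", "erin")

print(g.friends_of_friends("alice"))
```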


Typical Use Cases

§  Highly interconnected data
§  Social graphs
§  Party relationships in an enterprise
§  Location based services
§  Purchasing analytics and recommendations
§  Often combined with other systems to store the bulk of data
    –  Graph database can focus on relationships


Integrating Relational, Streams, and Hadoop

[Figure: two kinds of data sources, Traditional / Relational and Non-Traditional / Non-Relational (varied data formats; semi-structured, unstructured, ...).
Streams and an Event System: in-motion analytics, giving ultra low latency results.
Database & Warehouse (traditional warehouse): at-rest data analytics, giving results.
Data + Big Data and NoSQL: data analytics, giving results.]


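Lambda Architecture

[Figure: incoming data goes to both the Batch Layer and the Event (Speed) Layer.
Batch Layer: master dataset; batch updates produce the batch view in the Serving Layer.
Event (Speed) Layer: real-time data; real-time updates maintain rolling values.
Queries merge results from the batch view and the rolling values.]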
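The interplay of the layers can be sketched with a simple event counter. This is a deliberately simplified model: the `LambdaCounter` class and its method names are invented for this note, and a real system would run the batch recomputation concurrently with ingestion rather than as one blocking call.

```python
from collections import Counter

class LambdaCounter:
    """Toy Lambda Architecture for counting events (hypothetical sketch)."""

    def __init__(self):
        self.master_dataset = []        # append-only record of all incoming data
        self.batch_view = Counter()     # serving layer: result of the last batch run
        self.realtime_view = Counter()  # speed layer: rolling values since that run

    def ingest(self, event):
        # Incoming data goes to both the batch layer and the speed layer.
        self.master_dataset.append(event)
        self.realtime_view[event] += 1

    def run_batch(self):
        # Batch update: recompute the view from the full master dataset,
        # after which the rolling values it now covers can be discarded.
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, key):
        # Merge results from the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]

views = LambdaCounter()
for page in ["page_a", "page_b", "page_a"]:
    views.ingest(page)
before_batch = views.query("page_a")  # answered entirely by the speed layer

views.run_batch()
views.ingest("page_a")
after_batch = views.query("page_a")   # batch view plus one recent event
print(before_batch, after_batch)
```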


Master Data Management and Governance

§  Big Data and NoSQL stores can easily become a bigger mess than relational stores
§  Introduce a practical plan
    –  Avoid lengthy and cumbersome governance
    –  Actual use should be the driving force
    –  Start slow
§  Be ready for change
    –  The technologies change rapidly
§  Focus on business outcomes


Succeeding with Big Data and NoSQL

1.  Actively look for solutions where the right store can ease the pain
2.  Make sure you deliver tangible value to clients
3.  After you get your first apps to work: create a Big Data introduction and governance plan
4.  Prioritize: do the most useful thing for the business first
5.  Integrate with existing IT
6.  Make sure you hire or grow your Big Data champions
7.  Field is immature: look out for new tools and techniques


Conclusions

–  Hadoop and NoSQL address the weak points of relational systems:
    •  Scale
    •  Performance
    •  Unstructured and semi-structured data
–  Streaming addresses the processing of data in real time
–  Integrate with conventional technologies!
–  Spark and Flink: the next generation of Big Data systems


Questions?
