Toward large-scale distributed stream processing: models, systems and challenges

Valeria Cardellini and Francesco Lo Presti, University of Rome Tor Vergata, Italy

ICT COST Action IC1304 Autonomous Control for a Reliable Internet of Services (ACROSS)

2nd Int’l Summer School on Autonomous Control for Reliable Future Networks and Services, 30 May 2016, Opatija, Croatia



Who are we?

Valeria Cardellini, Associate professor @ Univ. of Rome Tor Vergata

Francesco Lo Presti, Associate professor @ Univ. of Rome Tor Vergata

•  Joint research work with Vincenzo Grassi and Matteo Nardelli

V. Cardellini - ACROSS 2nd Summer School

The data deluge

•  Some well-known numbers related to Big Data:
  –  Every day in 2014 we created 2.5 Exabytes
  –  40 Zettabytes of data will be created by 2020
•  Proliferation of new sources of data
  –  Sensors, mobile devices, cameras
  –  Social networks
  –  Scientific instruments
  –  Vehicles
•  How can we make sense of all these data?
  –  Process data to extract valuable insights

Why data stream processing?

•  Applications such as:
  –  Sentiment analysis on multiple tweet streams @ Twitter
  –  User profiling @ Yahoo!
  –  Tracking of query trend evolution @ Google
  –  Fraud detection
  –  Bus routing management @ city of Dublin [Art14]
•  Require:
  –  Continuous processing of unbounded data streams generated by multiple, distributed sources
  –  In (near) real-time fashion

Why data stream processing?

•  In the past years data stream processing (DSP) was considered a solution for very specific problems (e.g., financial tickers)
•  But now we have (and will have) more general settings
  –  E.g., Internet of Things

Why data stream processing?

•  Decrease the overall latency to obtain results
  –  No data persistence on stable storage (see “Latency numbers every programmer should know”)
  –  No periodic batch analysis
•  Simplify the data infrastructure
•  Make the time dimension of data explicit


Traditional DSP challenges

•  Stream data rates can be high and data arrive in large volumes
  –  High resource requirements for processing (clusters, data centers, distributed Clouds)
•  Processing stream data has real-time aspects
  –  Stream processing applications have QoS requirements, e.g., end-to-end latency
  –  Must be able to react to events as they occur

Why large-scale stream processing?

•  Goals: increase scalability and reduce latency
•  How? Rely on distributed and near-edge computation

Goals of the lectures

•  Give a flavor of large-scale distributed stream processing and related research challenges
•  Part I (V. Cardellini)
  –  Focus on system issues
  –  These slides
•  Part II (F. Lo Presti)
  –  Focus on models and algorithms
•  Request
  –  If you get either bored or lost, ask questions...
  –  If you like to ask questions, ask questions...


Data stream definitions

Data stream

•  “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety. Queries over streams run continuously over a period of time and incrementally return new results as new data arrive.” [Gol03]

Sliding windows

•  How many data items should we process each time?
  –  Process items in window-sized batches
•  Count-based window, e.g., last n items
•  Time-based window, e.g., from t - T to t

[Figure: count-based window of size n = 5 sliding over the items s1 to s6 along the time axis]

Sliding windows

•  How often should we evaluate the window?
  –  Eager approach: output new result items as soon as available (but can be difficult to implement efficiently)
  –  Lazy approach: slide window by s seconds (or m items)
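The two window types and the lazy evaluation policy can be illustrated with a minimal Python sketch. The function names, the use of the mean as the aggregate, and the (timestamp, value) event layout are assumptions made for this example; they do not come from the lecture.

```python
from collections import deque


def count_window_means(items, n, slide):
    """Count-based window: mean of the last n items.

    Lazy evaluation: a result is emitted only every `slide` arrivals
    once the window is full (slide=1 evaluates on every new item).
    """
    window = deque(maxlen=n)  # the oldest item is evicted automatically
    for i, x in enumerate(items, start=1):
        window.append(x)
        if len(window) == n and (i - n) % slide == 0:
            yield sum(window) / n


def time_window(events, t, T):
    """Time-based window: values of (timestamp, value) events in [t - T, t]."""
    return [value for ts, value in events if t - T <= ts <= t]


if __name__ == "__main__":
    print(list(count_window_means([1, 2, 3, 4, 5, 6, 7], n=5, slide=2)))
    print(time_window([(1, "a"), (4, "b"), (6, "c")], t=6, T=3))
```

With slide=1 the count-based window is re-evaluated on every arrival, which is the closest this sketch gets to the eager approach; a larger slide trades result freshness for less work.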

DSP application model

•  A DSP application is made of a network of operators (processing elements) connected by streams, at least one data source and at least one data sink
•  Represented by a directed graph
  –  Graph vertices: operators
  –  Graph edges: streams
•  Graph can be cyclic
  –  Some systems only support directed acyclic graphs (DAG)
•  Graph topology rarely changes

DSP operator

•  A self-contained processing element that:
  –  transforms one or more input streams into another stream
  –  can execute generic user-defined code
    •  Algebraic operation (filter, aggregate, join, ...)
    •  User-defined (more complex) operation (POS-tagging, ...)
  –  can execute in parallel with other operators
•  Can be stateless or stateful
  –  Stateless: knows nothing about the state (e.g., filter, map)
  –  Stateful: keeps some sort of state
    •  E.g., some aggregation or summary of processed elements, or a state machine for detecting patterns of fraudulent financial transactions
•  State might be shared between operators
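The difference between a stateless and a stateful operator can be shown in a few lines of Python. This is only a sketch: the operator names and the choice of a running mean as the "summary of processed elements" are illustrative assumptions.

```python
def keep_positive(stream):
    """Stateless operator (a filter): each item is handled on its own."""
    for item in stream:
        if item > 0:
            yield item


class RunningMean:
    """Stateful operator: keeps a summary (count and sum) of the items seen."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def process(self, item):
        self.count += 1
        self.total += item
        return self.total / self.count


if __name__ == "__main__":
    mean = RunningMean()
    for value in keep_positive([2, -7, 4, 6]):
        print(value, mean.process(value))
```

Losing or moving the `RunningMean` object changes its future outputs, which is why stateful operators need special care when they are scaled, relocated or recovered.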

“Hello World”: Word Count

[Figure: Words source → (word) → Words counter → (word, counter) → Intermediate sorter → (ranks) → Final sorter → (final rank)]
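The Word Count pipeline can be sketched with Python generators, one per operator. This is a single-process illustration under simplifying assumptions: the function names are made up, the intermediate and final sorters are collapsed into one ranking step, and the input is a finite list rather than an unbounded stream.

```python
from collections import Counter


def words_source(lines):
    """Data source: emits one (word) tuple per word."""
    for line in lines:
        for word in line.lower().split():
            yield word


def words_counter(words):
    """Stateful operator: emits (word, counter) after every update."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def final_sorter(updates, k):
    """Ranking step: keeps the latest count per word and returns the top k."""
    latest = {}
    for word, count in updates:
        latest[word] = count
    # Sort by decreasing count, then alphabetically to break ties.
    return sorted(latest.items(), key=lambda wc: (-wc[1], wc[0]))[:k]


if __name__ == "__main__":
    stream = words_counter(words_source(["a b a", "b a c"]))
    print(final_sorter(stream, k=2))
```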

Some DSP application: DEBS’14 GC

•  Real-time analytics over high volume sensor data: analysis of energy consumption measurements [DEBS14GC]
  –  Smart plugs deployed in households and equipped with sensors that measure values related to power consumption
•  Input data stream:
  2967740693, 1379879533, 82.042, 0, 1, 0, 12
•  Query 1: make load forecasts based on current load measurements and historical data
  –  Output data stream:
  ts, house_id, predicted_load
•  Query 2: find the outliers concerning energy consumption
  –  Output data stream:
  ts_start, ts_stop, household_id, percentage

Some DSP application: DEBS’15 GC

•  Real-time analytics over high volume spatio-temporal data streams: analysis of taxi trips based on data streams originating from New York City taxis [DEBS15GC]
•  Query 1: identify recent frequent routes
•  Query 2: identify regions with the highest profit
•  Both queries rely on a sliding window operator
  –  Continuously evaluate the query results
•  Use geo-spatial grids to define the events of interest

Some DSP application: DEBS’16 GC

•  Real-time analytics for a dynamic (evolving) social-network graph [DEBS16GC]
•  Query 1: identify the posts that currently trigger the most activity in the social network
•  Query 2: identify large communities that are currently involved in a topic
•  Require continuous analysis of a dynamic graph considering multiple streams that reflect graph updates

Data stream systems

Streaming system

•  Distributed system that executes stream graphs
  –  continuously calculates results for long-standing queries
  –  over potentially infinite data streams
  –  using operators
    •  that can be stateless or stateful
•  System nodes may be heterogeneous
•  Must be highly optimized and have minimal overhead, so as to deliver real-time response for high-volume DSP applications

Operator placement

•  Determine, within a set of available distributed computing nodes, the nodes that should host and execute each operator of a DSP application

[Figure: application graph with operators 1 to 6 and streams (1,2), (2,3), (2,4), (3,5), (4,5), (4,6), mapped onto distributed computing nodes]

Big data centers

•  Which frameworks for data stream processing?
•  Usually run in locally distributed clusters within large data centers
•  Assumptions:
  –  Scale out and not scale up
    •  Commodity servers
    •  Data-parallelism is king
  –  Software designed for failure
    •  See [Dea09]

[Figure source: Google]

Apache Storm

•  Apache Storm
  –  Open-source, real-time, scalable streaming system
  –  Provides an abstraction layer to execute DSP applications
•  Topology (streaming graph)
  –  Spouts (data sources) and bolts (operators and data sinks)

[Figure: example topology of spouts and bolts connected by streams]

Storm entities

•  Task: operator instance
•  Executor: smallest schedulable entity
  –  Executes one or more tasks related to the same operator
•  Worker process: Java process running a subset of executors
•  Worker node: computing resource, a container for worker processes

[Figure: a worker process (Java process) running two executors (threads) that execute five tasks]
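The nesting of tasks inside executors can be made concrete with a small Python sketch that spreads the tasks of one operator over its executors. The round-robin rule and the function name are assumptions for illustration only; the slides do not describe how Storm itself assigns tasks.

```python
def assign_tasks(num_tasks, num_executors):
    """Spread the task ids of one operator over its executors (round-robin).

    Returns one list of task ids per executor; every executor only holds
    tasks of the same operator, as stated in the entity definitions.
    """
    if num_executors < 1:
        raise ValueError("at least one executor is required")
    executors = [[] for _ in range(num_executors)]
    for task_id in range(num_tasks):
        executors[task_id % num_executors].append(task_id)
    return executors


if __name__ == "__main__":
    # Five tasks on two executors, as in the figure.
    print(assign_tasks(num_tasks=5, num_executors=2))
```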

Storm architecture

Other frameworks (partial list)

•  Cloud-based frameworks
  –  Amazon Kinesis
  –  Google Cloud Dataflow
  –  Microsoft
•  Apache Spark
  –  Improve MapReduce (batch processing)
  –  Spark Streaming: reduce the size of each stream and process streams of data (micro-batch processing)


A new breadth of frameworks

•  Lambda architecture
  –  Data-processing design pattern to handle massive quantities of data and integrate batch and real-time processing within a single framework

Source: https://voltdb.com/products/alternatives/lambda-architecture

Challenges in data stream processing

Challenge 1: Optimize the DSP application

•  Apply some transformation to the streaming graph
  –  At design time or run-time
•  Operator reordering [Hir14]
  –  To avoid unnecessary data transfers
•  Redundancy elimination [Hir14]

[Figures: example graphs over operators A, B, C and D, before and after reordering and redundancy elimination]

Challenge 1: Optimize the DSP application

•  Operator separation [Hir14]
•  Fusion [Hir14]

[Figures: separation splits operator A into A1 and A2; fusion merges operators A and B into AB]
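Fusion can be pictured as function composition: two per-item operators become one, so no stream has to be materialized between them. The following Python sketch assumes operators are plain one-item-in, one-item-out functions, which is a simplification introduced for this example.

```python
def fuse(op_a, op_b):
    """Return a single operator AB that applies op_a and then op_b."""
    def fused(item):
        return op_b(op_a(item))
    return fused


def run(operator, stream):
    """Apply a per-item operator to every item of a (finite) stream."""
    return [operator(item) for item in stream]


if __name__ == "__main__":
    add_one = lambda x: x + 1
    double = lambda x: x * 2
    # Unfused: two passes with an intermediate stream between A and B.
    print(run(double, run(add_one, [1, 2, 3])))
    # Fused: one pass, same results.
    print(run(fuse(add_one, double), [1, 2, 3]))
```

Separation is the inverse transformation: an operator that computes `op_b(op_a(x))` is split back into its two stages so that they can be placed or parallelized independently.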

Challenge 2: Place the operators

•  Operator placement decision: a complex problem
  –  Trade communication cost against resource utilization
•  When
  –  Initial (static) operator placement
    •  Can be more expensive and comprehensive
  –  Can also be at run-time
    •  Move only relocatable operators
    •  Require operator migration
•  See Part II
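To give a feeling for the trade-off, the sketch below searches exhaustively for the placement with the lowest communication cost under a per-node capacity limit. It is a toy formulation made up for this example (cost = stream rate times inter-node latency, capacity = maximum number of operators per node); the models actually used by the authors are the subject of Part II and are not reproduced here.

```python
from itertools import product


def communication_cost(edges, placement, latency):
    """Sum over streams of (data rate) x (latency between the hosting nodes)."""
    return sum(rate * latency[placement[u]][placement[v]] for u, v, rate in edges)


def best_placement(operators, nodes, capacity, edges, latency):
    """Brute-force search; only feasible for very small instances."""
    best, best_cost = None, float("inf")
    for choice in product(nodes, repeat=len(operators)):
        # Skip placements that put too many operators on one node.
        if any(choice.count(node) > capacity[node] for node in nodes):
            continue
        placement = dict(zip(operators, choice))
        cost = communication_cost(edges, placement, latency)
        if cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost


if __name__ == "__main__":
    operators = ["src", "op", "sink"]
    nodes = ["n1", "n2"]
    capacity = {"n1": 2, "n2": 2}
    edges = [("src", "op", 10), ("op", "sink", 1)]  # (from, to, rate)
    latency = {"n1": {"n1": 0, "n2": 5}, "n2": {"n1": 5, "n2": 0}}
    print(best_placement(operators, nodes, capacity, edges, latency))
```

The search space grows as the number of nodes raised to the number of operators, which hints at why the slides call placement a complex problem and why heuristics are needed at scale.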

Challenge 3: Manage load variations

•  Typical stream processing workloads are:
  –  with high volume and high rates
  –  bursty and with workload spikes not known in advance
    •  Twitter in 2013: rate of tweets per second = 5700 ...
    •  but significant peak of 144,000 tweets per second

Challenge 3: Manage load variations

•  Possible approaches:
  –  Admission control
  –  Static reservation
    •  Reserve specific resources in advance
    •  Cons: over-provisioning and cost increase
  –  Apply dynamic techniques such as load shedding
    •  Selectively drop tuples at strategic points (e.g., when CPU usage exceeds a specific limit)
    •  Cons: sacrifice accuracy and completeness

[Figure: a load shedder inserted into the stream of operator A]
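A load shedder can be sketched as an operator that drops tuples at random while a load indicator is above a limit. In this Python example the `cpu_usage` callback, the 0.8 limit and the random drop policy are all assumptions; real shedders may choose which tuples to drop far more selectively.

```python
import random


def shed(stream, cpu_usage, cpu_limit=0.8, drop_prob=0.5, rng=None):
    """Forward tuples, dropping each with probability drop_prob under overload.

    cpu_usage is a zero-argument callable returning the current CPU usage
    as a fraction in [0, 1] (a hypothetical monitoring hook).
    """
    rng = rng or random.Random()
    for item in stream:
        if cpu_usage() > cpu_limit and rng.random() < drop_prob:
            continue  # tuple dropped: accuracy and completeness are sacrificed
        yield item


if __name__ == "__main__":
    overloaded = lambda: 0.95
    kept = list(shed(range(1000), overloaded, rng=random.Random(42)))
    print(f"kept {len(kept)} of 1000 tuples under overload")
```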

Challenge 3: Manage load variations

•  Possible approaches (continued):
  –  Use adaptive rate allocation [Bou12]
  –  Redistribute load, e.g., determine new operator placement and relocate operators on computing nodes
    •  Cons: available resources could be insufficient

Exploit data parallelism

•  Alternative solution:
  –  Detect bottleneck
  –  Use data-parallelism (aka operator fission [Hir14])
    •  Apply SIMD paradigm: concurrent execution of multiple replicas of the same operator on different data portions
    •  By hand: possible, but cumbersome

[Figure: operator A in front of B is replaced by a Split, three replicas of A, and a Merge]
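The split, replicate and merge pattern can be sketched in Python as follows. The key-based routing with `zlib.crc32`, the function names, and the sequential execution of the replicas are assumptions made to keep the example short and deterministic; a real system would run the replicas concurrently.

```python
import zlib


def split(items, num_replicas, key):
    """Route each item to one replica according to a stable hash of its key."""
    partitions = [[] for _ in range(num_replicas)]
    for item in items:
        index = zlib.crc32(str(key(item)).encode()) % num_replicas
        partitions[index].append(item)
    return partitions


def run_replicas(partitions, operator):
    """Each replica applies the same operator to its own data portion."""
    return [[operator(item) for item in part] for part in partitions]


def merge(outputs):
    """Collect the replica outputs into a single stream (order not preserved)."""
    return [result for output in outputs for result in output]


if __name__ == "__main__":
    parts = split(range(10), num_replicas=3, key=lambda x: x % 4)
    print(parts)
    print(sorted(merge(run_replicas(parts, lambda x: x * x))))
```

Routing by key keeps all items with the same key on the same replica, which matters as soon as the replicated operator is stateful.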

Elastic stream processing

•  Exploit elasticity: acquire and release resources when needed
  –  At application layer (i.e., data parallelism)
    •  Scale out (or scale in) operators
    •  Activate (or deactivate) replicated operators [Bel14]
  –  At infrastructure layer
    •  Scale out (or scale in) computing nodes

Elastic stream processing

•  When and how to scale?
  –  See Part II
•  But elasticity overhead is not zero!
  –  In most streaming systems: run a new placement decision to take the new resources into account
  –  Dynamic scaling impacts stateful operators

Challenge 4: Self-adapt at run-time

•  To cope with a highly dynamic operating environment
  –  Unpredictable workload
  –  Computational characteristics of operators not known a priori
  –  Need to sustain load during long provisioning times
  –  Node availability, network congestion, ...
•  Exploit run-time adaptation capabilities of streaming systems
•  What adaptation actions?
  –  Scale the number of operator instances, relocate the operators, ...

Self-adaptation framework

•  MAPE: Monitor, Analyze, Plan and Execute
•  Software reference framework for self-adaptation
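One iteration of a MAPE loop can be sketched in Python with one function per phase. The CPU-utilization metric, the 0.3 and 0.8 thresholds, the one-replica step and all function names are assumptions chosen for illustration; they are not the policy of any specific system.

```python
def monitor(metrics):
    """Monitor: read the observed CPU utilization (a fraction in [0, 1])."""
    return metrics["cpu_utilization"]


def analyze(utilization, low=0.3, high=0.8):
    """Analyze: classify the operator as overloaded, underloaded or ok."""
    if utilization > high:
        return "overloaded"
    if utilization < low:
        return "underloaded"
    return "ok"


def plan(status, replicas, min_replicas=1, max_replicas=10):
    """Plan: decide the target number of operator replicas."""
    if status == "overloaded":
        return min(replicas + 1, max_replicas)
    if status == "underloaded":
        return max(replicas - 1, min_replicas)
    return replicas


def execute(current, target):
    """Execute: placeholder; a real system would start or stop instances here."""
    return target


def mape_step(metrics, replicas):
    """Run one Monitor-Analyze-Plan-Execute iteration."""
    return execute(replicas, plan(analyze(monitor(metrics)), replicas))


if __name__ == "__main__":
    replicas = 2
    for cpu in [0.9, 0.95, 0.5, 0.1]:
        replicas = mape_step({"cpu_utilization": cpu}, replicas)
        print(cpu, replicas)
```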

Distributed Storm

•  We developed an extension of Storm [Car15]
•  Goals: to provide
  –  distributed monitoring
  –  distributed placement (see Part II)
  –  and adaptation capabilities
•  Where: large-scale environment
•  Code available on GitHub: matnar.github.io/uniroma2-storm/

Distributed Storm architecture

Distributed Storm: monitoring

•  QoSMonitor (for each worker node)
  –  Estimate network latencies
    •  Use a network coordinate system
    •  Vivaldi’s algorithm [Dab04]: decentralized and gossip-based
  –  Monitor QoS attributes
    •  Node utilization and availability
•  WorkerMonitor (for each worker process)
  –  Monitor exchanged data rate among the operators
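The idea behind a network coordinate system is that each node adjusts its own coordinates so that the distance to a peer approaches the measured round-trip time. The Python sketch below shows a simplified Vivaldi-style update with a constant step `delta`; the adaptive, error-weighted step of the full algorithm in [Dab04] is deliberately left out, and the fixed direction used for coincident points is an assumption of this example.

```python
import math


def vivaldi_update(xi, xj, rtt, delta=0.25):
    """Move node i's coordinates xi using one RTT sample towards node j.

    If the measured rtt is larger than the current coordinate distance,
    node i moves away from node j; if it is smaller, it moves closer.
    """
    diff = [a - b for a, b in zip(xi, xj)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0:
        # Coincident points: pick a fixed direction (assumption of this sketch).
        direction = [1.0] + [0.0] * (len(xi) - 1)
    else:
        direction = [d / dist for d in diff]
    error = rtt - dist
    return [a + delta * error * u for a, u in zip(xi, direction)]


def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    xi, xj = [0.0, 0.0], [3.0, 4.0]
    for _ in range(5):
        xi = vivaldi_update(xi, xj, rtt=10.0)
        print(round(distance(xi, xj), 3))
```

Repeating the update shrinks the gap between coordinate distance and measured latency step by step, which is what lets nodes estimate latencies to peers they have never probed directly.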

Distributed Storm: performance

[Figure: performance plot, annotated with “Load spike on a subset of nodes” and “~50%”]

Self-adaptation challenges

•  Adaptation has a non-negligible cost!
  –  Run-time reconfigurations can increase latency and reduce application availability
    •  Perform adaptation only when needed
  –  Costs of operator migrations cannot be neglected
    •  Freeze times caused by operator migration
    •  How to migrate stateful operators?

Challenge 5: stateful operators

•  State complicates things...
  1.  Dynamic scaling
  2.  Operator re-placement
  3.  Recovery from failure

[Figure annotations: “impact state”, “Loss of state!”]

Approaches for stateful migration

•  Most streaming systems do not support stateful processing and migration (e.g., Storm)
  –  Developers manage state
  –  Typically combine with an external system to store state
  –  Design complexity
•  Requirements for stateful operator migration
  –  Safety (i.e., to preserve the consistency of the operations)
  –  Application transparency
  –  Minimal footprint

Stateful operator migration

•  Parallel track approach [Hei14]
•  Pause-and-resume approach

[Figure: Stop migrating task → Save state → Terminate migrating task and start it on new node → Restore state → Resume stream processing]
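The pause-and-resume steps can be traced in a toy Python sketch where a dictionary stands in for the external state store and a counting operator stands in for the migrating task. The class and function names are hypothetical, and the buffering of tuples that arrive while the task is frozen is not modelled.

```python
class CountingTask:
    """Toy stateful task: counts how many times each key has been seen."""

    def __init__(self, state=None):
        self.state = dict(state or {})
        self.running = True

    def process(self, key):
        if not self.running:
            raise RuntimeError("task is stopped")
        self.state[key] = self.state.get(key, 0) + 1
        return self.state[key]


def pause_and_resume(task, state_store, task_id):
    """Migrate a task by stopping it, saving its state and restoring it elsewhere."""
    task.running = False                        # 1. stop the migrating task
    state_store[task_id] = dict(task.state)     # 2. save state
    # 3. the old task is terminated; a new one is started on the new node
    # 4. restore state from the store
    new_task = CountingTask(state_store[task_id])
    return new_task                             # 5. resume stream processing


if __name__ == "__main__":
    store = {}
    old = CountingTask()
    old.process("a")
    old.process("a")
    new = pause_and_resume(old, store, "task-1")
    print(new.process("a"))  # the count continues from the saved state
```

The time between step 1 and step 5 is the freeze time mentioned earlier: the larger the state, the longer the stream is paused.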

Approaches for stateful migration

•  How to identify the portion of state to migrate?
  –  Expose an API to let the user manually manage the state [Fer13]
  –  Support only partitioned stateful operators [Ged14]
    •  Partitioned stateful operators store independent state for each sub-stream identified by a partitioning key
    •  Automatically determine, on the basis of a partitioning key, the optimal number of state partitions to be used and migrate

Elastic stateful migration in Storm

•  We developed mechanisms for elastic stateful migration in Storm [Car16]
•  Code on GitHub: matnar.github.io/elastic-storm/

[Figure: architecture. Nimbus (with scheduler, MigrationNotifier and ElasticityManager) and ZooKeeper on top; four Supervisors, each with worker slots hosting worker processes and a DDS next to it; all connected through the network]

Elastic stateful migration in Storm

•  Scaling decisions at the framework level
  –  Adapt the number of parallel instances for each application operator
  –  Simple threshold-based scaling policy (see Part II)
•  Relocate the operator internal state on a different node and enable Storm to change the application deployment at run-time

[Figure: migration timeline. Migrating task: MIGRATION NOTIFIED, MIGRATION MODE, SAVE STATE, first synchronization barrier (the migrating task can be terminated). New task: MIGRATION MODE, RESTORE STATE (if any), second synchronization barrier (streams are resumed), OPERATIONAL MODE. A DDS is shown at both ends]

Performance results

•  DSP application: frequent pattern detection
•  Elastic scaling and stateful migration improve the application latency

[Figure: two plots over time (s), from 500 to 4500. Top: application latency (ms), 0 to 1600, with and without E+SM. Bottom: number of executors, 0 to 30, with E+SM. Both mark scaling and scheduling events and the input data rate, which takes the values 120, 250, 350 and 900 tweets/s]

Challenge 6: guarantee fault tolerance

•  DSP applications run for long time intervals: failures are unavoidable
•  Possible solutions:
  –  Active replication [Bri09]
  –  Check-pointing [Seb11]
  –  Replay logs [Bal08]
  –  Hybrid solutions [Zha10]
•  Having different trade-offs between run-time cost in absence of failures and recovery cost
•  Large-scale complicates things...
  –  Network partitions and CAP theorem

Challenge 7: Manage multiple concurrent DSP applications

•  Consider multiple competing DSP applications
•  How should the streaming system allocate resources?
  –  Fairness
  –  Resource utilization
  –  Profitability, ...

Apache Mesos

•  Run concurrent frameworks on the same cluster and dynamically share the cluster resources
•  Mesos: a cluster “operating system” [Hin11]
  –  Efficient resource isolation and sharing across distributed frameworks

Apache Mesos

•  Two-level scheduling based on Dominant Resource Fairness (DRF) algorithm
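The intuition behind DRF is progressive filling: repeatedly give one more task to the user whose dominant share (its largest share of any resource) is currently the smallest. The Python sketch below follows that idea on a two-user example; the function name, the dictionary layout and the choice to keep serving other users once one user no longer fits are assumptions of this illustration, not Mesos code.

```python
def drf_allocate(capacity, demands):
    """Allocate whole tasks by progressive filling on the dominant share.

    capacity: resource -> total amount in the cluster
    demands:  user -> (resource -> amount needed by one task)
    Returns:  user -> number of tasks allocated
    """
    used = {resource: 0.0 for resource in capacity}
    tasks = {user: 0 for user in demands}
    active = set(demands)

    def dominant_share(user):
        return max(tasks[user] * demands[user][r] / capacity[r] for r in capacity)

    while active:
        # Serve the active user with the lowest dominant share (ties by name).
        user = min(sorted(active), key=dominant_share)
        demand = demands[user]
        if all(used[r] + demand[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += demand[r]
            tasks[user] += 1
        else:
            active.discard(user)  # this user's next task no longer fits
    return tasks


if __name__ == "__main__":
    capacity = {"cpu": 9, "mem": 18}
    demands = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
    print(drf_allocate(capacity, demands))
```

In this example user A is memory-dominant and user B is CPU-dominant, and both end with the same dominant share of 2/3 (A holds 12 of 18 memory units, B holds 6 of 9 CPUs).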

GMesos: distributed Mesos

•  We are currently developing GMesos for large-scale environments... stay tuned!

Some new challenges and research opportunities

•  Integrate data stream processing with SDN
  –  With SDN, network into the control loop
•  Study cross-layer optimization
•  Address security and privacy issues in data stream processing

References

[And14] H.C.M. Andrade, B. Gedik, D.S. Turaga, “Fundamentals of Stream Processing: Application Design, Systems, and Analytics”, Cambridge University Press, 2014.
[Art14] A. Artikis et al., “Heterogeneous stream processing and crowdsourcing for urban traffic management”, In Proc. of EDBT ’14, 2014.
[Bal08] M. Balazinska, H. Balakrishnan, S. Madden, M. Stonebraker, “Fault-tolerance in the Borealis distributed stream processing system”, ACM Trans. Database Syst. 33, 1, 2008.
[Bel14] P. Bellavista, A. Corradi, S. Kotoulas, A. Reale, “Adaptive Fault-Tolerance for Dynamic Resource Provisioning in Distributed Stream Processing Systems”, In Proc. of EDBT ’14, 2014.
[Bou12] I. Boutsis, V. Kalogeraki, “RADAR: Adaptive rate allocation in distributed stream processing systems under bursty workloads”, In Proc. of SRDS ’12, 2012.
[Bri09] A. Brito, C. Fetzer, P. Felber, “Multithreading-enabled active replication for event stream processing operators”, In Proc. of SRDS ’09, 2009.
[Car15] V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, “Distributed QoS-aware scheduling in Storm”, In Proc. of ACM DEBS ’15, 2015.
[Car16] V. Cardellini, M. Nardelli, D. Luzi, “Elastic stateful stream processing in Storm”, In Proc. of HPCS ’16, 2016.

[Dab04] F. Dabek, R. Cox, F. Kaashoek, R. Morris, “Vivaldi: A decentralized network coordinate system”, SIGCOMM Comput. Commun. Rev. 34, 4, 2004.
[Dea09] J. Dean, “Design, Lessons and Advice from Building Large Distributed Systems”, In LADIS ’09, 2009.
[DEBS14GC] Z. Jerzak, H. Ziekow, “The DEBS 2014 grand challenge”, In Proc. of ACM DEBS ’14, 2014.
[DEBS15GC] Z. Jerzak, H. Ziekow, “The DEBS 2015 grand challenge”, In Proc. of ACM DEBS ’15, 2015.
[DEBS16GC] V. Gulisano, Z. Jerzak, S. Voulgaris, H. Ziekow, “The DEBS 2016 grand challenge”, In Proc. of ACM DEBS ’16, 2016.
[Fer13] R. Fernandez, M. Migliavacca, E. Kalyvianaki, P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management”, In Proc. of ACM SIGMOD ’13, 2013.
[Ged14] B. Gedik, S. Schneider, M. Hirzel, K.-L. Wu, “Elastic scaling for data stream processing”, IEEE Trans. Parallel Distrib. Syst. 25, 6, 2014.
[Gol03] L. Golab, M. Özsu, “Issues in data stream management”, ACM SIGMOD Rec. 32, 2, 2003.

[Hei14] T. Heinze, L. Aniello, L. Querzoni, Z. Jerzak, “Cloud-based data stream processing”, In Proc. of ACM DEBS ’14, 2014.
[Hin11] B. Hindman et al., “Mesos: a platform for fine-grained resource sharing in the data center”, In Proc. of NSDI ’11, 2011.
[Hir14] M. Hirzel, R. Soulé, S. Schneider, B. Gedik, R. Grimm, “A catalog of stream processing optimizations”, ACM Comput. Surv. 46, 4, 2014.
[Seb11] Z. Sebepou, K. Magoutis, “CEC: Continuous eventual checkpointing for data stream processing operators”, In Proc. of DSN ’11, 2011.
[Zha10] Z. Zhang et al., “A hybrid approach to high availability in stream processing systems”, In Proc. of ICDCS ’10, 2010.

Thank you! Any questions?

[email protected]

www.ce.uniroma2.it/~valeria