TRANSCRIPT

Page 1: Spark Summit EU talk by Zoltan Zvara

Data-Aware Spark
Zoltán Zvara

[email protected]

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 688191.

Page 2: Spark Summit EU talk by Zoltan Zvara

Introduction

• Hungarian Academy of Sciences, Institute for Computer Science and Control (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Hadoop, Couchbase, etc.
• Multiple IoT and telecommunication use cases with challenging data volume, velocity and distribution

Page 3: Spark Summit EU talk by Zoltan Zvara

Agenda

• Data-skew story
• Problem definitions & goals
• Dynamic Repartitioning

  • Architecture
  • Component breakdown
  • Repartitioning mechanism
  • Benchmark results

• Tracing
• Visualization
• Conclusion

Page 4: Spark Summit EU talk by Zoltan Zvara

Motivation

• Applications aggregating IoT, social-network and telecommunication data that tested well on toy data
• When deployed against the real dataset, the application seemed healthy
• However, it could become surprisingly slow or even crash
• What went wrong?

Page 5: Spark Summit EU talk by Zoltan Zvara

Our data-skew story

• We have use cases where map-side combine is not an option: groupBy, join
• Power laws (Pareto, Zipfian)
• "80% of the traffic is generated by 20% of the communication towers"
• Most of our data follows the 80–20 rule

Page 6: Spark Summit EU talk by Zoltan Zvara

The problem

• Using the default hashing is not going to distribute the data uniformly
• Some unknown partition(s) will contain a lot of records on the reducer side
• Slow tasks will appear
• Data distribution is not known in advance
• Concept drifts are common

[Diagram: a stage boundary with a shuffle; even partitioning on the map side turns into skewed data on the reducer side, producing slow tasks]
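To make the skew concrete, here is a minimal sketch (my own illustration, not from the talk) of how Spark-style hash partitioning piles a Zipf-like key distribution onto a few reducer partitions:

```scala
import scala.util.Random

// Illustrative only: how default hash partitioning behaves on a skewed key set.
object SkewDemo {
  // Same modulo scheme as Spark's HashPartitioner: non-negative key.hashCode % numPartitions
  def partition(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    val rnd = new Random(42)
    // Zipf-like keys: "tower-0" is far more frequent than the rest
    val keys = Seq.fill(100000) {
      if (rnd.nextDouble() < 0.5) "tower-0" else s"tower-${rnd.nextInt(1000)}"
    }
    val sizes = keys.groupBy(partition(_, numPartitions)).map { case (p, ks) => p -> ks.size }
    (0 until numPartitions).foreach { p =>
      println(f"partition $p%d -> ${sizes.getOrElse(p, 0)}%6d records")
    }
    // The partition that receives "tower-0" dominates: that reducer becomes the slow task.
  }
}
```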

Page 7: Spark Summit EU talk by Zoltan Zvara

Ultimate goal

Spark to be a data-aware distributed data-processing framework

Page 8: Spark Summit EU talk by Zoltan Zvara

F1

Generic, engine-level load balancing & data-skew mitigation

Low level; works with any API built on top of it

Page 9: Spark Summit EU talk by Zoltan Zvara

F2

Completely online

Requires no offline statistics
Does not require simulation on a subset of your data

Page 10: Spark Summit EU talk by Zoltan Zvara

F3

Distribution agnostic

No need to identify the underlying distribution
Does not require handling special corner cases

Page 11: Spark Summit EU talk by Zoltan Zvara

F4

For the batch & the streaming guys under one hood

Same architecture and same core logic for every workload
Support for unified workloads

Page 12: Spark Summit EU talk by Zoltan Zvara

F5

System-aware

Dynamically understand system-specific trade-offs

Page 13: Spark Summit EU talk by Zoltan Zvara

F6

Does not need any user guidance or configuration

Out-of-the-box solution; the user does not need to play with the configuration
spark.data-aware = true

Page 14: Spark Summit EU talk by Zoltan Zvara

Architecture

[Architecture diagram: one master and several slaves. The slaves build approximate local key distributions & statistics, (1) these are collected to the master, (2) the master computes the global key distribution & statistics, (3) …, and (4) the new hash is redistributed to the slaves.]

Page 15: Spark Summit EU talk by Zoltan Zvara

Driver perspective

• RepartitioningTrackerMaster is part of SparkEnv
• Listens to job & stage submissions
• Holds a variety of repartitioning strategies for each job & stage
• Decides when & how to (re)partition
• Collects feedback

Page 16: Spark Summit EU talk by Zoltan Zvara

Executor perspective

• RepartitioningTrackerWorker is part of SparkEnv
• Duties:
  • Stores ScannerStrategies (Scanner included) received from the RTM
  • Instantiates and binds Scanners to TaskMetrics (where data characteristics are collected)
  • Defines an interface for Scanners to send DataCharacteristics back to the RTM
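The duties above suggest a small set of executor-side interfaces. The following sketch is purely illustrative: the names Scanner, ScannerStrategy, DataCharacteristics and RepartitioningTrackerWorker come from the slides, but their shapes and signatures here are assumptions, not the fork's actual code.

```scala
// Hypothetical sketch of the executor-side pieces named on the slide.
// Signatures are illustrative assumptions, not the modified Spark's actual API.

// Per-task, approximate key histogram built while the task runs.
final case class DataCharacteristics[K](histogram: Map[K, Double], sampleRate: Double)

// A Scanner observes the records of one task and maintains its DataCharacteristics.
trait Scanner[K] {
  def onRecord(key: K): Unit
  def snapshot(): DataCharacteristics[K]
}

// A ScannerStrategy tells the worker how to instantiate Scanners for a stage.
trait ScannerStrategy {
  def newScanner[K](): Scanner[K]
}

// The worker-side tracker: binds Scanners to running tasks and reports back to the RTM.
trait RepartitioningTrackerWorker {
  def register(strategy: ScannerStrategy): Unit
  def report(taskId: Long, characteristics: DataCharacteristics[_]): Unit
}
```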

Page 17: Spark Summit EU talk by Zoltan Zvara

Scalable sampling

• Key distributions are approximated with a strategy that is
  • not sensitive to early or late concept drifts,
  • lightweight and efficient,
  • scalable by using a back-off strategy (a sketch follows the figure below)

[Figure: key–frequency histograms with a bounded number of counters. Sampling starts at rate s_i; when the histogram grows past the bound T_B it is truncated and the sampling rate is backed off to s_i/b, then s_{i+1}/b, and so on.]
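A minimal sketch of such a back-off sampler (my own illustration of the idea; the parameter names, the initial rate, the back-off factor and the counter bound are assumptions):

```scala
import scala.collection.mutable
import scala.util.Random

// Illustrative back-off sampler for approximating a key distribution.
// The actual implementation in the talk's Spark fork may differ.
final class BackOffSampler[K](
    initialRate: Double = 1.0,     // start by sampling every record
    backOffFactor: Double = 2.0,   // divide the rate by this on every truncation
    maxCounters: Int = 1000        // bound on the number of tracked keys
) {
  private val counters = mutable.Map.empty[K, Double]
  private var rate = initialRate
  private val rnd = new Random()

  def observe(key: K): Unit = {
    if (rnd.nextDouble() < rate) {
      // weight each sampled hit by 1/rate so counters estimate true frequencies
      counters(key) = counters.getOrElse(key, 0.0) + 1.0 / rate
      if (counters.size > maxCounters) truncate()
    }
  }

  // Back off: keep only the heaviest keys and sample more sparsely from now on.
  private def truncate(): Unit = {
    val kept = counters.toSeq.sortBy(-_._2).take(maxCounters / 2)
    counters.clear()
    counters ++= kept
    rate /= backOffFactor
  }

  def histogram: Map[K, Double] = counters.toMap
}
```

Truncating the kept counters and lowering the sampling rate keeps memory and CPU cost bounded even for very large key spaces, which is what makes the approach lightweight and scalable.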

Page 18: Spark Summit EU talk by Zoltan Zvara

Complexity-sensitive sampling

• Reducer run time can correlate highly with the computational complexity of the values for a given key
• Calculating object size is costly in the JVM and in some cases shows little correlation to computational complexity (the function of the next stage)
• Solution:
  • If the user object is Weightable, use complexity-sensitive sampling
  • When increasing the counter for a specific value, consider its complexity
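A sketch of how such a Weightable hook could plug into the sampler (the Weightable name comes from the slide; the trait's shape and the weighting rule below are assumptions):

```scala
// The slide names a Weightable user object; this trait's shape is an assumption.
trait Weightable {
  def complexity: Double // a cheap, user-supplied proxy for the record's processing cost
}

object ComplexitySensitiveSampling {
  // Complexity-sensitive counting: instead of +1 per sampled record,
  // increase the histogram counter by the record's declared complexity.
  def weightOf(value: Any): Double = value match {
    case w: Weightable => w.complexity
    case _             => 1.0
  }
  // e.g. in the sampler sketched earlier: counters(key) += weightOf(value) / rate
}
```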

Page 19: Spark Summit EU talk by Zoltan Zvara

Scalable sampling in numbers

• Optimized with micro-benchmarking
• Main factors:
  • Sampling strategy aggressiveness (initial sampling ratio, back-off factor, etc.)
  • Complexity of the current stage (mapper)
• Current stage's runtime:
  • When used throughout the execution of the whole stage, it adds 5–15% to runtime
  • After repartitioning, we cut out the sampler's code path; in practice, it adds 0.2–1.5% to runtime

Page 20: Spark Summit EU talk by Zoltan Zvara

Decision to repartition

• RepartitioningTrackerMaster can use different decision strategies, based on:
  • the number of local histograms needed,
  • the global histogram's distance from the uniform distribution,
  • preferences in the construction of the new hash function.
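To illustrate the second strategy, here is a minimal sketch (assumed names and threshold) that merges the local histograms into a global one and triggers repartitioning when the partition loads drift too far from uniform:

```scala
// Illustrative decision strategy; names and the threshold are assumptions.
object RepartitioningDecision {

  // Merge normalized local key histograms into one global histogram.
  def mergeHistograms[K](locals: Seq[Map[K, Double]]): Map[K, Double] = {
    val summed = locals.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
    val total = summed.values.sum
    if (total == 0.0) summed else summed.map { case (k, v) => k -> v / total }
  }

  // Total-variation-style distance of the partition loads from the uniform distribution.
  def distanceFromUniform[K](global: Map[K, Double], numPartitions: Int,
                             partitionOf: K => Int): Double = {
    val loadByPartition = global.toSeq.groupBy { case (k, _) => partitionOf(k) }
      .map { case (p, kvs) => p -> kvs.map(_._2).sum }
    val uniform = 1.0 / numPartitions
    (0 until numPartitions).map(p => math.abs(loadByPartition.getOrElse(p, 0.0) - uniform)).sum / 2.0
  }

  def shouldRepartition[K](locals: Seq[Map[K, Double]], minHistograms: Int,
                           numPartitions: Int, partitionOf: K => Int,
                           threshold: Double = 0.2): Boolean = {
    locals.size >= minHistograms &&
      distanceFromUniform(mergeHistograms(locals), numPartitions, partitionOf) > threshold
  }
}
```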

Page 21: Spark Summit EU talk by Zoltan Zvara

Construction of the new hash function

[Diagram: layout of the new hash function's partitions with target level l:
• cut_1 – single-key cut: the top prominent keys are hashed to dedicated partitions
• cut_2 – uniform distribution
• cut_3 – probabilistic cut: a key is hashed here with probability proportional to the remaining space to reach l]
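A simplified sketch of a partitioner built along these lines (the KeyIsolatorPartitioner mentioned on the next slide presumably works similarly, but the code below is an illustration, not the talk's implementation): the most prominent keys are isolated into dedicated partitions and the remaining keys are hashed over the rest.

```scala
// Simplified illustration of a key-isolating partitioner:
// the heaviest keys get dedicated partitions, everything else is hashed
// over the remaining partitions. Not the actual KeyIsolatorPartitioner.
final class KeyIsolatingPartitioner(numPartitions: Int, prominentKeys: Seq[Any]) {
  require(prominentKeys.size < numPartitions)

  // Map each prominent key to its own partition id: 0, 1, 2, ...
  private val isolated: Map[Any, Int] = prominentKeys.zipWithIndex.toMap
  private val remaining = numPartitions - isolated.size

  def getPartition(key: Any): Int = isolated.getOrElse(key, {
    // non-prominent keys: plain non-negative hash over the remaining partitions
    val mod = key.hashCode % remaining
    isolated.size + (if (mod < 0) mod + remaining else mod)
  })
}
```

Isolating the prominent keys bounds the size of the biggest partition by the load of the heaviest single key, which is the quantity the benchmark slides report.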

Page 22: Spark Summit EU talk by Zoltan Zvara

New hash function in numbers

• More complex than a hashCode
• We need to evaluate it for every record
• Micro-benchmark (for example, String keys):
  • Number of partitions: 512
  • HashPartitioner: average time to hash a record is 90.033 ns
  • KeyIsolatorPartitioner: average time to hash a record is 121.933 ns
• In practice it adds negligible overhead, under 1%

Page 23: Spark Summit EU talk by Zoltan Zvara

Repartitioning

• Reorganize the previously (naively) hashed data or buffer the data before hashing
• Usually happens in memory
• In practice, adds an additional 1–8% overhead (usually at the lower end), based on:
  • the complexity of the mapper,
  • the length of the scanning process.
• For streaming: update the hash in the DStream graph

Page 24: Spark Summit EU talk by Zoltan Zvara

Batch groupBy

MusicTimeseries – groupBy on tags from a listenings stream

[Chart: size of the biggest partition (83M, 134M, 92M) vs. number of partitions (171, 200, 400) for naive hashing and Dynamic Repartitioning. Annotations: over-partitioning overhead & blind scheduling; 64% reduction; 38% reduction; observed 3% map-side overhead.]

Page 25: Spark Summit EU talk by Zoltan Zvara

Batch join

• Joining tracks with tags
• Tracks dataset is skewed
• Number of partitions set to 33
• Naive:
  • size of the biggest partition = 4.14M
  • reducer's stage runtime = 251 seconds
• Dynamic Repartitioning:
  • size of the biggest partition = 2.01M
  • reducer's stage runtime = 124 seconds
  • heavy map, only 0.9% overhead

Page 26: Spark Summit EU talk by Zoltan Zvara

Tracing

Goal: capture and record a data point's lifecycle throughout the whole execution

Spark's core has been rewritten to handle wrapped data points.

[Diagram: Wrapper data-point of type T, with payload: T and an apply method accepting the user functions f: T → U, f: T → Boolean, f: T → Iterable[U], …]

Page 27: Spark Summit EU talk by Zoltan Zvara

Traceables in Spark

Each record is a Traceable object (a Wrapper implementation).

Apply the function on the payload and report/build the trace.
spark.wrapper.class = com.ericsson.ark.spark.Traceable

[Diagram: Traceable data-point of type T, with payload: T, trace: Trace, and an apply method accepting f: T → U, f: T → Boolean, f: T → Iterable[U], …]
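A minimal sketch of what such a wrapper could look like (Wrapper, Traceable and spark.wrapper.class come from the slides; the fields and method shapes below are assumptions):

```scala
// Illustrative shapes for the wrapper types named on the slides.
// The actual classes in the modified Spark core may differ.

final case class Trace(events: List[String]) {
  def record(event: String): Trace = Trace(event :: events)
}

// A wrapped data point: user functions are applied to the payload,
// and the result is re-wrapped so the trace travels with the record.
final case class Traceable[T](payload: T, trace: Trace) {
  def apply[U](f: T => U, op: String): Traceable[U] =
    Traceable(f(payload), trace.record(op))

  def applyFilter(p: T => Boolean, op: String): (Boolean, Traceable[T]) =
    (p(payload), copy(trace = trace.record(op)))

  def applyFlatMap[U](f: T => Iterable[U], op: String): Iterable[Traceable[U]] = {
    val next = trace.record(op)
    f(payload).map(Traceable(_, next))
  }
}
```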

Page 28: Spark Summit EU talk by Zoltan Zvara

Tracing implementation

Capture trace information (deep profiling):
• by reporting to an external service;
• by piggybacking aggregated trace routes on data points.

[Diagram: a directed graph of operators T_1 … T_n with numeric metrics attached to its edges]

• T_i can be any Spark operator
• we attach useful metrics to the edges and vertices (current load, time spent at a task)

Page 29: Spark Summit EU talk by Zoltan Zvara

F7

Provide data characteristics to monitor & debug applications

Users can sneak a peek into the dataflow
Users can debug and monitor applications more easily

Page 30: Spark Summit EU talk by Zoltan Zvara

F7

• New metrics are available through the REST API
• Added new queries to the REST API, for example: "what happened in the last 3 seconds?"
• BlockFetches are collected into ShuffleReadMetrics

Page 31: Spark Summit EU talk by Zoltan Zvara

Execution visualization of Spark jobs

Page 32: Spark Summit EU talk by Zoltan Zvara

Repartitioning in the visualization

[Visualization: heavy keys create heavy partitions (slow tasks); after Dynamic Repartitioning the heavy key is alone and the size of the biggest partition is minimized]

Page 33: Spark Summit EU talk by Zoltan Zvara

Conclusion

• Our Dynamic Repartitioning can handle data skew dynamically, on the fly, on any workload and arbitrary key distributions
• With very little overhead, data skew can be handled in a natural & general way
• Tracing can help us to improve the co-location of related services
• Visualizations can aid developers to better understand issues and bottlenecks of certain workloads
• Making Spark data-aware pays off

Page 34: Spark Summit EU talk by Zoltan Zvara

Thank you for your attention

Zoltán Zvara
[email protected]