
Interactive Debugging for Big Data Analytics

Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi, Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim
University of California, Los Angeles

Debugging Big Data Analytics

• Today's platforms lack debugging support
– Programs (i.e., queries, jobs) are batch executed / black boxes
– Errors reflect low-level details (e.g., a task id?!) not relevant to the logical bug
– Long program execution time => long development cycles

• What do programmers do?
– Trial-and-error debugging on sample data
– Post-mortem analysis of error logs
– Analyze the physical view of the execution (a job id, a failed node, etc.)

“I would like to understand the flow of control through the Spark source code on the worker nodes when I submit my application… I am assuming I should set up Spark on Eclipse… to enable stepping through Spark source code on the worker nodes.”

Trying to debug a Spark application on a cluster…

After a year, still no good answers!

BigDebug Project Overview

BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark [ICSE 2016]

• Simulated Breakpoint
• On-Demand Watchpoint
• Crash Culprit Remediation
• Forward/Backward Tracing
(a plain-Spark sketch of the watchpoint idea follows)
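Conceptually, an on-demand watchpoint is a guarded capture over an RDD: only records matching a user-supplied predicate are brought back for inspection. The following is a minimal plain-Spark sketch of that idea, not BigDebug's actual implementation (which instruments Spark itself); the sample rows and the filter-plus-collect encoding are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object WatchpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("watchpoint-sketch").setMaster("local[*]"))

    // A guarded "watchpoint": capture only the records matching a predicate,
    // instead of shipping every intermediate record back to the driver.
    val lines = sc.parallelize(Seq("1,3:15,NYPD", "2,,NYFD", "3,0:00,BPL"))
    val hits = lines.map(_.split(",", -1)).filter(r => r(1).isEmpty)

    // Inspect only the captured records on the driver.
    hits.collect().foreach(r => println("watchpoint hit: " + r.mkString("|")))
    sc.stop()
  }
}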

Titian: Data Provenance for Fine-Grained Tracing [PVLDB 2016]

Vega: Incremental Computation for Interactive Debugging [Under Review]

Example Query Development Session

• Dataset: NYC Open Data Project
– Calls to non-emergency service centers
– Call records for 2010-2015
– Record contents: call time, agency, caller location, etc.

• Query: Identify the agencies that received the most calls during busy hours
– E.g., an hour is busy if its number of calls > 10,000

Spark Program

case class Calls(id: String, hour: Int, agency: String, ...)
format = new SimpleDateFormat("M/d/y h:m:s a")
input = sc.textFile("hdfs://...")
calls = input.map(_.split(","))
  .map(r => Calls(r(0), format.parse(r(1)).getHours, r(2), ...))
calls.registerTempTable("calls")
hist = sqlContext.sql("
  SELECT agency, count(*) FROM calls JOIN (
    SELECT hour FROM calls GROUP BY hour HAVING count(*) > 100000) counts
  ON calls.hour = counts.hour GROUP BY agency")
hist.show()

Extract the dataset from HDFS, transform it into a DataFrame (i.e., a table), and load it into Spark SQL.


Express the query in Spark SQL.


The subquery identifies the busy hours, i.e., # calls > 10,000.

Join busy hours with calls, then group by agency and count the number of calls received by each agency. (A runnable variant of the full program follows.)
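For reference, here is a self-contained variant of the slide's program that one could actually run. The sample rows, the reduced schema, and the lowered busy-hour threshold are assumptions made to fit toy data, and the newer SparkSession API stands in for the slide's sqlContext:

import java.text.SimpleDateFormat
import org.apache.spark.sql.SparkSession

object CallsQuerySketch {
  case class Calls(id: String, hour: Int, agency: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("calls-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample rows standing in for the HDFS dataset (an assumption).
    val input = spark.sparkContext.parallelize(Seq(
      "1,1/2/14 3:15:00 AM,NYPD",
      "2,1/2/14 3:20:00 AM,NYPD",
      "3,1/3/14 12:00:00 AM,BPL"))
    val calls = input.map { line =>
      val r = line.split(",")
      // Per-record formatter: SimpleDateFormat is not thread-safe.
      val format = new SimpleDateFormat("M/d/y h:m:s a")
      Calls(r(0), format.parse(r(1)).getHours, r(2))
    }.toDF()
    calls.createOrReplaceTempView("calls")

    // Busy-hour threshold lowered from 100000 to 1 for the toy sample.
    spark.sql("""
      SELECT agency, count(*) AS cnt FROM calls JOIN (
        SELECT hour FROM calls GROUP BY hour HAVING count(*) > 1) counts
      ON calls.hour = counts.hour GROUP BY agency""").show()

    spark.stop()
  }
}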

Debugging Query Results
• Analyst observes some unexpected results
– Agencies that should not appear, e.g., Brooklyn Public Library
– Expected agencies that do not appear, e.g., NYPD, NYFD

• Titian support for query triage (see the illustrative sketch below)
– The analyst can trace back from outlier results to the contributing data at some intermediate stage
– The analyst can execute queries against the intermediate data leading to the outlier results
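Titian implements this provenance capture inside Spark's internals; to make the idea concrete, here is a hand-rolled, illustrative version in plain Spark. Carrying an explicit set of input-record ids through the aggregation is an assumption for clarity, and it is far less efficient than Titian's actual mechanism:

import org.apache.spark.{SparkConf, SparkContext}

object ProvenanceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("provenance-sketch").setMaster("local[*]"))

    // Tag every input record with a synthetic id so lineage can be traced.
    val input = sc.parallelize(Seq("0,BPL", "0,BPL", "3,NYPD")).zipWithUniqueId()

    // Group calls by hour, carrying the set of contributing input ids
    // (the provenance) alongside each aggregate count.
    val byHour = input
      .map { case (line, id) =>
        val f = line.split(",")
        (f(0).toInt, (1L, Set(id)))
      }
      .reduceByKey { case ((c1, p1), (c2, p2)) => (c1 + c2, p1 ++ p2) }

    // Trace back: pick the outlier group (hour = 0) and recover the input
    // records that contributed to it.
    val culpritIds = byHour.filter(_._1 == 0).flatMap(_._2._2).collect().toSet
    input.filter { case (_, id) => culpritIds.contains(id) }
      .collect()
      .foreach { case (line, id) => println(s"contributing input #$id: $line") }

    sc.stop()
  }
}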

Query Triage with Titian
• Intermediate results for the subquery
– Trace back to the subquery and show the distribution of calls per hour
– On the intermediate data leading to the outlier results

SELECT hour, count(*) FROM calls GROUP BY hour

Significant skew at the midnight hour (hour = 0)!

Identify the Bug and Revise the Query
• The bug
– The system assigns the default value hour = 0 to calls that did not log a time
• Possible course of action
– Filter out calls assigned to hour = 0

SELECT agency, count(*) FROM calls JOIN (
  SELECT hour FROM calls WHERE hour != 0 GROUP BY hour HAVING count(*) > 100000) counts
ON calls.hour = counts.hour GROUP BY agency

Introduce a predicate that filters out the midnight hour.

Vega: Re-execute the Revised Query
• Vega materializes intermediate stage results
– i.e., the previous subquery result is saved
• The Vega query rewriter leverages this to rewrite the query into…

SELECT agency, count(*) FROM calls JOIN counts
ON calls.hour = counts.hour
WHERE counts.hour != 0
GROUP BY agency

counts is the materialized result from the previous execution; the rewritten filter removes hour-0 records from the joining records.

Vega: Modified Query Evaluation
• Execute an incremental join
– “Diff” records specify changes in the (join) result
– For this example, we incrementally remove all records for hour 0 from the join and final aggregation results, as shown in the sketch below
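A minimal sketch of diff-record evaluation, using plain Scala collections rather than Spark to keep it short; the sample join result and the multiplicity encoding (record, ±1) are assumptions:

object DiffSketch {
  def main(args: Array[String]): Unit = {
    // Previous join result: (agency, hour) pairs, aggregated to counts per agency.
    val prevJoin = Seq(("BPL", 0), ("BPL", 0), ("NYPD", 3), ("NYFD", 3))
    val prevCounts = prevJoin.groupBy(_._1).map { case (a, rs) => a -> rs.size.toLong }

    // Diff records: removing the hour-0 rows is expressed as multiplicity -1.
    val diffs = prevJoin.filter(_._2 == 0).map(r => (r, -1L))

    // Apply the diffs to the saved aggregate instead of recomputing the join.
    val newCounts = diffs.foldLeft(prevCounts) { case (acc, ((agency, _), delta)) =>
      val c = acc.getOrElse(agency, 0L) + delta
      if (c <= 0) acc - agency else acc.updated(agency, c)
    }
    println(newCounts) // Map(NYPD -> 1, NYFD -> 1): BPL's hour-0 records are gone
  }
}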

• Vega optimizer results
– Consequence: over an order-of-magnitude runtime improvement

Automated Isolation of Failure-Inducing Inputs for Big Data Analytics

• When a program fails, a user may want to investigate the subset of the original input that induces a crash, a failure, or a wrong outcome.

• Delta Debugging [Zeller 1999]
– A well-known debugging algorithm for minimizing failure-inducing inputs
– Requires multiple runs to isolate failure-inducing inputs (see the walkthrough and sketch below)

Background: Delta Debugging [Zeller, FSE 1999]
– First, we run the test to find the failure-inducing input dataset (Test Fails)
– Second, we split the failing input data (Split)
– We run the test on each half: one half passes (Test Passes), the other fails (Test Fails)
– We recurse on the failing half: split and test again, and so on, until the failure-inducing input is minimized (…)
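A minimal sketch of this bisection in Scala; it is simplified to halving a list, whereas full ddmin also increases split granularity and tests complements when neither half fails on its own:

object DeltaDebugSketch {
  // Shrink a failing input by repeated halving. `fails` is the test oracle:
  // it returns true when the program still fails on the given input subset.
  def minimize[T](input: List[T], fails: List[T] => Boolean): List[T] =
    if (input.size <= 1) input
    else {
      val (left, right) = input.splitAt(input.size / 2)
      if (fails(left)) minimize(left, fails)
      else if (fails(right)) minimize(right, fails)
      else input // failure needs records from both halves; full ddmin would
                 // raise the granularity here instead of stopping
    }

  def main(args: Array[String]): Unit = {
    // Toy oracle: the job fails whenever the bad record is present.
    val fails = (xs: List[String]) => xs.contains("0,BPL")
    val input = List("1,NYPD", "2,NYFD", "0,BPL", "4,DOT")
    println(minimize(input, fails)) // List(0,BPL)
  }
}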

Scalable Automated Isolation of Failure-Inducing Inputs

• Leverage data provenance to reduce the search space
– Avoid costly executions on data not relevant to the bug
• Leverage Vega to optimize subsequent runs

Delta Debugging + Titian

Conclusion
• BigDebug Project
– Debugging primitives for interactive big data processing in Apache Spark
– https://sites.google.com/site/sparkbigdebug/

• Titian: Interactive Data Provenance
– Supports trace-back queries from a set of results
– Execution replay from an intermediate point

• Vega: Optimizing modified query execution
– Novel query rewrite mechanism that pushes changes backwards to save work
– Incremental evaluation that operates on data changes induced by query modifications
