
Interactive Debugging for Big Data Analytics

Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi, Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim
University of California, Los Angeles

Debugging Big Data Analytics

• Today's platforms lack debugging support
– Programs (i.e., queries, jobs) are batch executed / black boxes
– Errors reflect low-level details (e.g., a task id?!) not relevant to the logical bug
– Long program execution time => long development cycles

• What do programmers do?
– Trial-and-error debugging on sample data
– Post-mortem analysis of error logs
– Analyze the physical view of the execution (a job id, a failed node, etc.)

“I would like to understand the flow of control through the Spark source code on the worker nodes when I submit my application… I am assuming I should set up Spark on Eclipse… to enable stepping through Spark source code on the worker nodes.”

Trying to debug a Spark application on a cluster…

After a year, still no good answers!

BigDebug Project Overview

BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark [ICSE 2016]

• Simulated Breakpoint
• On-Demand Watchpoint
• Crash Culprit Remediation
• Forward/Backward Tracing
(a plain-Spark sketch of the watchpoint idea follows)
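Conceptually, an on-demand watchpoint is a guarded capture over an RDD: only records matching a user-supplied predicate are brought back for inspection. The following is a minimal plain-Spark sketch of that idea, not BigDebug's actual implementation (which instruments Spark itself); the sample rows and the filter-plus-collect encoding are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object WatchpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("watchpoint-sketch").setMaster("local[*]"))

    // A guarded "watchpoint": capture only the records matching a predicate,
    // instead of shipping every intermediate record back to the driver.
    val lines = sc.parallelize(Seq("1,3:15,NYPD", "2,,NYFD", "3,0:00,BPL"))
    val hits = lines.map(_.split(",", -1)).filter(r => r(1).isEmpty)

    // Inspect only the captured records on the driver.
    hits.collect().foreach(r => println("watchpoint hit: " + r.mkString("|")))
    sc.stop()
  }
}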

Titian: Data Provenance for Fine-Grained Tracing [PVLDB 2016]

Vega: Incremental Computation for Interactive Debugging [Under Review]

Example Query Development Session

• Dataset: NYC Open Data Project
– Calls to non-emergency service centers
– Call records for 2010-2015
– Record contents: call time, agency, caller location, etc.

• Query: Identify the agencies that received the most calls during busy hours
– E.g., an hour is busy if its number of calls > 10,000

Spark Program

case class Calls(id: String, hour: Int, agency: String, ...)
format = new SimpleDateFormat("M/d/y h:m:s a")
input = sc.textFile("hdfs://...")
calls = input.map(_.split(","))
  .map(r => Calls(r(0), format.parse(r(1)).getHours, r(2), ...))
calls.registerTempTable("calls")
hist = sqlContext.sql("
  SELECT agency, count(*) FROM calls JOIN (
    SELECT hour FROM calls GROUP BY hour HAVING count(*) > 100000) counts
  ON calls.hour = counts.hour GROUP BY agency")
hist.show()

Extract the dataset from HDFS, transform it into a DataFrame (i.e., a table), and load it into Spark SQL.


Express the query in Spark SQL.


The subquery identifies the busy hours, i.e., # calls > 10,000.

Join busy hours with calls, then group by agency and count the number of calls received by each agency. (A runnable variant of the full program follows.)
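For reference, here is a self-contained variant of the slide's program that one could actually run. The sample rows, the reduced schema, and the lowered busy-hour threshold are assumptions made to fit toy data, and the newer SparkSession API stands in for the slide's sqlContext:

import java.text.SimpleDateFormat
import org.apache.spark.sql.SparkSession

object CallsQuerySketch {
  case class Calls(id: String, hour: Int, agency: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("calls-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Sample rows standing in for the HDFS dataset (an assumption).
    val input = spark.sparkContext.parallelize(Seq(
      "1,1/2/14 3:15:00 AM,NYPD",
      "2,1/2/14 3:20:00 AM,NYPD",
      "3,1/3/14 12:00:00 AM,BPL"))
    val calls = input.map { line =>
      val r = line.split(",")
      // Per-record formatter: SimpleDateFormat is not thread-safe.
      val format = new SimpleDateFormat("M/d/y h:m:s a")
      Calls(r(0), format.parse(r(1)).getHours, r(2))
    }.toDF()
    calls.createOrReplaceTempView("calls")

    // Busy-hour threshold lowered from 100000 to 1 for the toy sample.
    spark.sql("""
      SELECT agency, count(*) AS cnt FROM calls JOIN (
        SELECT hour FROM calls GROUP BY hour HAVING count(*) > 1) counts
      ON calls.hour = counts.hour GROUP BY agency""").show()

    spark.stop()
  }
}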

Debugging Query Results
• Analyst observes some unexpected results
– Agencies that should not appear, e.g., Brooklyn Public Library
– Expected agencies that do not appear, e.g., NYPD, NYFD

• Titian support for query triage (see the illustrative sketch below)
– The analyst can trace back from outlier results to the contributing data at some intermediate stage
– The analyst can execute queries against the intermediate data leading to the outlier results
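Titian implements this provenance capture inside Spark's internals; to make the idea concrete, here is a hand-rolled, illustrative version in plain Spark. Carrying an explicit set of input-record ids through the aggregation is an assumption for clarity, and it is far less efficient than Titian's actual mechanism:

import org.apache.spark.{SparkConf, SparkContext}

object ProvenanceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("provenance-sketch").setMaster("local[*]"))

    // Tag every input record with a synthetic id so lineage can be traced.
    val input = sc.parallelize(Seq("0,BPL", "0,BPL", "3,NYPD")).zipWithUniqueId()

    // Group calls by hour, carrying the set of contributing input ids
    // (the provenance) alongside each aggregate count.
    val byHour = input
      .map { case (line, id) =>
        val f = line.split(",")
        (f(0).toInt, (1L, Set(id)))
      }
      .reduceByKey { case ((c1, p1), (c2, p2)) => (c1 + c2, p1 ++ p2) }

    // Trace back: pick the outlier group (hour = 0) and recover the input
    // records that contributed to it.
    val culpritIds = byHour.filter(_._1 == 0).flatMap(_._2._2).collect().toSet
    input.filter { case (_, id) => culpritIds.contains(id) }
      .collect()
      .foreach { case (line, id) => println(s"contributing input #$id: $line") }

    sc.stop()
  }
}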

Query Triage with Titian
• Intermediate results for the subquery
– Trace back to the subquery and show the distribution of calls per hour
– On the intermediate data leading to the outlier results

SELECT hour, count(*) FROM calls GROUP BY hour

Significant skew at the midnight hour (hour = 0)!

Identify the Bug and Revise the Query
• The bug
– The system assigns the default value hour = 0 to calls that did not log a time
• Possible course of action
– Filter out calls assigned to hour = 0

SELECT agency, count(*) FROM calls JOIN (
  SELECT hour FROM calls WHERE hour != 0 GROUP BY hour HAVING count(*) > 100000) counts
ON calls.hour = counts.hour GROUP BY agency

Introduce a predicate that filters out the midnight hour.

Vega: Re-execute the Revised Query
• Vega materializes intermediate stage results
– i.e., the previous subquery result is saved
• The Vega query rewriter leverages this to rewrite the query into…

SELECT agency, count(*) FROM calls JOIN counts
ON calls.hour = counts.hour
WHERE counts.hour != 0
GROUP BY agency

counts is the materialized result from the previous execution; the rewritten filter removes hour-0 records from the joining records.

Vega: Modified Query Evaluation
• Execute an incremental join
– “Diff” records specify changes in the (join) result
– For this example, we incrementally remove all records for hour 0 from the join and final aggregation results, as shown in the sketch below
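A minimal sketch of diff-record evaluation, using plain Scala collections rather than Spark to keep it short; the sample join result and the multiplicity encoding (record, ±1) are assumptions:

object DiffSketch {
  def main(args: Array[String]): Unit = {
    // Previous join result: (agency, hour) pairs, aggregated to counts per agency.
    val prevJoin = Seq(("BPL", 0), ("BPL", 0), ("NYPD", 3), ("NYFD", 3))
    val prevCounts = prevJoin.groupBy(_._1).map { case (a, rs) => a -> rs.size.toLong }

    // Diff records: removing the hour-0 rows is expressed as multiplicity -1.
    val diffs = prevJoin.filter(_._2 == 0).map(r => (r, -1L))

    // Apply the diffs to the saved aggregate instead of recomputing the join.
    val newCounts = diffs.foldLeft(prevCounts) { case (acc, ((agency, _), delta)) =>
      val c = acc.getOrElse(agency, 0L) + delta
      if (c <= 0) acc - agency else acc.updated(agency, c)
    }
    println(newCounts) // Map(NYPD -> 1, NYFD -> 1): BPL's hour-0 records are gone
  }
}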

• Vega optimizer results
– Consequence: over an order-of-magnitude runtime improvement

Automated Isolation of Failure-Inducing Inputs for Big Data Analytics

• When a program fails, a user may want to investigate the subset of the original input that induces a crash, a failure, or a wrong outcome.

• Delta Debugging [Zeller 1999]
– A well-known debugging algorithm for minimizing failure-inducing inputs
– Requires multiple runs to isolate failure-inducing inputs (see the walkthrough and sketch below)

Background: Delta Debugging [Zeller, FSE 1999]
– First, we run the test to find the failure-inducing input dataset (Test Fails)
– Second, we split the failing input data (Split)
– We run the test on each half: one half passes (Test Passes), the other fails (Test Fails)
– We recurse on the failing half: split and test again, and so on, until the failure-inducing input is minimized (…)
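A minimal sketch of this bisection in Scala; it is simplified to halving a list, whereas full ddmin also increases split granularity and tests complements when neither half fails on its own:

object DeltaDebugSketch {
  // Shrink a failing input by repeated halving. `fails` is the test oracle:
  // it returns true when the program still fails on the given input subset.
  def minimize[T](input: List[T], fails: List[T] => Boolean): List[T] =
    if (input.size <= 1) input
    else {
      val (left, right) = input.splitAt(input.size / 2)
      if (fails(left)) minimize(left, fails)
      else if (fails(right)) minimize(right, fails)
      else input // failure needs records from both halves; full ddmin would
                 // raise the granularity here instead of stopping
    }

  def main(args: Array[String]): Unit = {
    // Toy oracle: the job fails whenever the bad record is present.
    val fails = (xs: List[String]) => xs.contains("0,BPL")
    val input = List("1,NYPD", "2,NYFD", "0,BPL", "4,DOT")
    println(minimize(input, fails)) // List(0,BPL)
  }
}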

Scalable Automated Isolation of Failure-Inducing Inputs

• Leverage data provenance to reduce the search space
– Avoid costly executions on data not relevant to the bug
• Leverage Vega to optimize subsequent runs

Delta Debugging + Titian

Conclusion
• BigDebug Project
– Debugging primitives for interactive big data processing in Apache Spark
– https://sites.google.com/site/sparkbigdebug/

• Titian: Interactive Data Provenance
– Supports trace-back queries from a set of results
– Execution replay from an intermediate point

• Vega: Optimizing modified query execution
– Novel query rewrite mechanism that pushes changes backwards to save work
– Incremental evaluation that operates on data changes induced by query modifications
