
Eddies: Continuously Adaptive Query Processing

Ron Avnur    Joseph M. Hellerstein
University of California, Berkeley


Abstract

In large federated and shared-nothing databases, resources can exhibit widely fluctuating characteristics. Assumptions made at the time a query is submitted will rarely hold throughout the duration of query processing. As a result, traditional static query optimization and execution techniques are ineffective in these environments.

In this paper we introduce a query processing mechanism called an eddy, which continuously reorders operators in a query plan as it runs. We characterize the moments of symmetry during which pipelined joins can be easily reordered, and the synchronization barriers that require inputs from different sources to be coordinated. By combining eddies with appropriate join algorithms, we merge the optimization and execution phases of query processing, allowing each tuple to have a flexible ordering of the query operators. This flexibility is controlled by a combination of fluid dynamics and a simple learning algorithm. Our initial implementation demonstrates promising results, with eddies performing nearly as well as a static optimizer/executor in static scenarios, and providing dramatic improvements in dynamic execution environments.

1 Introduction

There is increasing interest in query engines that run at unprecedented scale, both for widely-distributed information resources, and for massively parallel database systems. We are building a system called Telegraph, which is intended to run queries over all the data available on line. A key requirement of a large-scale system like Telegraph is that it function robustly in an unpredictable and constantly fluctuating environment. This unpredictability is endemic in large-scale systems, because of increased complexity in a number of dimensions:

Hardware and Workload Complexity: In wide-area environments, variabilities are commonly observable in the bursty performance of servers and networks [UFA98]. These systems often serve large communities of users whose aggregate behavior can be hard to predict, and the hardware mix in the wide area is quite heterogeneous. Large clusters of computers can exhibit similar performance variations, due to a mix of user requests and heterogeneous hardware evolution. Even in totally homogeneous environments, hardware performance can be unpredictable: for example, the outer tracks of a disk can exhibit almost twice the bandwidth of inner tracks [Met97].

Data Complexity: Selectivity estimation for static alphanumeric data sets is fairly well understood, and there has been initial work on estimating statistical properties of static sets of data with complex types [Aok99] and methods [BO99]. But federated data often comes without any statistical summaries, and complex non-alphanumeric data types are now widely in use both in object-relational databases and on the web. In these scenarios, and even in traditional static relational databases, selectivity estimates are often quite inaccurate.

Figure 1: An eddy in a pipeline. Data flows into the eddy from input relations R, S and T. The eddy routes tuples to operators; the operators run as independent threads, returning tuples to the eddy. The eddy sends a tuple to the output only when it has been handled by all the operators. The eddy adaptively chooses an order to route each tuple through the operators.

User Interface Complexity: In large-scale systems, many queries can run for a very long time. As a result, there is interest in Online Aggregation and other techniques that allow users to “Control” properties of queries while they execute, based on refining approximate results [HAC+99].

For all of these reasons, we expect query processing parameters to change significantly over time in Telegraph, typically many times during a single query. As a result, it is not appropriate to use the traditional architecture of optimizing a query and then executing a static query plan: this approach does not adapt to intra-query fluctuations. Instead, for these environments we want query execution plans to be reoptimized regularly during the course of query processing, allowing the system to adapt dynamically to fluctuations in computing resources, data characteristics, and user preferences.

In this paper we present a query processing operator called an eddy, which continuously reorders the application of pipelined operators in a query plan, on a tuple-by-tuple basis. An eddy is an n-ary tuple router interposed between n data sources and a set of query processing operators; the eddy encapsulates the ordering of the operators by routing tuples through them dynamically (Figure 1). Because the eddy observes tuples entering and exiting the pipelined operators, it can adaptively change its routing to effect different operator orderings. In this paper we present initial experimental results demonstrating the viability of eddies: they can indeed reorder effectively in the face of changing selectivities and costs, and provide benefits in the case of delayed data sources as well.

Reoptimizing a query execution pipeline on the fly requires significant care in maintaining query execution state. We highlight query processing stages called moments of symmetry, during which operators can be easily reordered. We also describe synchronization barriers in certain join algorithms that can restrict performance to the rate of the slower input. Join algorithms with frequent moments of symmetry and adaptive or non-existent barriers are thus especially attractive in the Telegraph environment. We observe that the Ripple Join family [HH99] provides efficiency, frequent moments of symmetry, and adaptive or non-existent barriers for equijoins and non-equijoins alike.

The eddy architecture is quite simple, obviating the need for traditional cost and selectivity estimation, and simplifying the logic of plan enumeration. Eddies represent our first step in a larger attempt to do away with traditional optimizers entirely, in the hope of providing both run-time adaptivity and a reduction in code complexity. In this paper we focus on continuous operator reordering in a single-site query processor; we leave other optimization issues to our discussion of future work.

1.1 Run-Time Fluctuations

Three properties can vary during query processing: the costs of operators, their selectivities, and the rates at which tuples arrive from the inputs. The first and third issues commonly occur in wide-area environments, as discussed in the literature [AFTU96, UFA98, IFF+99]. These issues may become more common in cluster (shared-nothing) systems as they “scale out” to thousands of nodes or more [Bar99].

Run-time variations in selectivity have not been widely discussed before, but occur quite naturally. They commonly arise due to correlations between predicates and the order of tuple delivery. For example, consider an employee table clustered by ascending age, and a selection salary > 100000; age and salary are often strongly correlated. Initially the selection will filter out most tuples delivered, but that selectivity rate will change as ever-older employees are scanned. Selectivity over time can also depend on performance fluctuations: e.g., in a parallel DBMS clustered relations are often horizontally partitioned across disks, and the rate of production from various partitions may change over time depending on performance characteristics and utilization of the different disks. Finally, Online Aggregation systems explicitly allow users to control the order in which tuples are delivered based on data preferences [RRH99], resulting in similar effects.

1.2 Architectural Assumptions

Telegraph is intended to efficiently and flexibly provide both distributed query processing across sites in the wide area, and parallel query processing in a large shared-nothing cluster. In this paper we narrow our focus somewhat to concentrate on the initial, already difficult problem of run-time operator reordering in a single-site query executor; that is, changing the effective order or “shape” of a pipelined query plan tree in the face of changes in performance.

In our discussion we will assume that some initial query plan tree will be constructed during parsing by a naive pre-optimizer. This optimizer need not exercise much judgement since we will be reordering the plan tree on the fly. However, by constructing a query plan it must choose a spanning tree of the query graph (i.e., a set of table-pairs to join) [KBZ86], and algorithms for each of the joins. We will return to the choice of join algorithms in Section 2, and defer to Section 6 the discussion of changing the spanning tree and join algorithms during processing.

We study a standard single-node object-relational query processing system, with the added capability of opening scans and indexes from external data sets. This is becoming a very common base architecture, available in many of the commercial object-relational systems (e.g., IBM DB2 UDB [RPK+99], Informix Dynamic Server UDO [SBH98]) and in federated database systems (e.g., Cohera [HSC99]). We will refer to these non-resident tables as external tables. We make no assumptions limiting the scale of external sources, which may be arbitrarily large. External tables present many of the dynamic challenges described above: they can reside over a wide-area network, face bursty utilization, and offer very minimal information on costs and statistical properties.

1.3 Overview

Before introducing eddies, in Section 2 we discuss the properties of query processing algorithms that allow (or disallow) them to be frequently reordered. We then present the eddy architecture, and describe how it allows for extreme flexibility in operator ordering (Section 3). Section 4 discusses policies for controlling tuple flow in an eddy. A variety of experiments in Section 4 illustrate the robustness of eddies in both static and dynamic environments, and raise some questions for future work. We survey related work in Section 5, and in Section 6 lay out a research program to carry this work forward.

2 Reorderability of Plans

A basic challenge of run-time reoptimization is to reorder pipelined query processing operators while they are in flight. To change a query plan on the fly, a great deal of state in the various operators has to be considered, and arbitrary changes can require significant processing and code complexity to guarantee correct results. For example, the state maintained by an operator like hybrid hash join [DKO+84] can grow as large as the size of an input relation, and require modification or recomputation if the plan is reordered while the state is being constructed.

By constraining the scenarios in which we reorder operators, we can keep this work to a minimum. Before describing eddies, we study the state management of various join algorithms; this discussion motivates the eddy design, and forms the basis of our approach for reoptimizing cheaply and continuously. As a philosophy, we favor adaptivity over best-case performance. In a highly variable environment, the best-case scenario rarely exists for a significant length of time. So we will sacrifice marginal improvements in idealized query processing algorithms when they prevent frequent, efficient reoptimization.

2.1 Synchronization Barriers

Binary operators like joins often capture significant state. A particular form of state used in such operators relates to the interleaving of requests for tuples from different inputs.

As an example, consider the case of a merge join on two sorted, duplicate-free inputs. During processing, the next tuple is always consumed from the relation whose last tuple had the lower value. This significantly constrains the order in which tuples can be consumed: as an extreme example, consider the case of a slowly-delivered external relation slowlow with many low values in its join column, and a high-bandwidth but large local relation fasthi with only high values in its join column. The processing of fasthi is postponed for a long time while consuming many tuples from slowlow. Using terminology from parallel programming, we describe this phenomenon as a synchronization barrier: one table-scan waits until the other table-scan produces a value larger than any seen before.
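The barrier described above can be made concrete with a small sketch. The code below is an illustration only, not code from the paper: it assumes both inputs are in-memory sorted lists of join keys, and the function name `merge_join_trace` and the sample data are hypothetical. It records which input is advanced at each step, so the trace shows the fast input sitting idle while the slow input works through its low values.

```python
def merge_join_trace(left, right):
    """Merge join over two sorted, duplicate-free lists of join keys.

    Returns (matches, trace), where trace lists which input was advanced
    at each step. Illustrative sketch; names are hypothetical.
    """
    matches, trace = [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
            trace += ["left", "right"]
        elif left[i] < right[j]:
            # The input whose current value is lower must be advanced;
            # the other input waits at the barrier.
            i += 1
            trace.append("left")
        else:
            j += 1
            trace.append("right")
    return matches, trace


# slowlow has many low values; fasthi has only high values.
slowlow = [1, 2, 3, 4, 5, 100]
fasthi = [100, 101, 102]
matches, trace = merge_join_trace(slowlow, fasthi)
```

In this run every step before the single match advances `slowlow`, however quickly `fasthi` could deliver tuples; that waiting is the synchronization barrier.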

In general, barriers limit concurrency, and hence performance, when two tasks take different amounts of time to complete (i.e., to “arrive” at the barrier). Recall that concurrency arises even in single-site query engines, which can simultaneously carry out network I/O, disk I/O, and computation. Thus it is desirable to minimize the overhead of synchronization barriers in a dynamic (or even static but heterogeneous) performance environment. Two issues affect the overhead of barriers in a plan: the frequency of barriers, and the gap between arrival times of the two inputs at the barrier. We will see in upcoming discussion that barriers can often be avoided or tuned by using appropriate join algorithms.

2.2 Moments of Symmetry

Note that the synchronization barrier in merge join is stated in an order-independent manner: it does not distinguish between the inputs based on any property other than the data they deliver. Thus merge join is often described as a symmetric operator, since its two inputs are treated uniformly (footnote 1). This is not the case for many other join algorithms. Consider the traditional nested-loops join, for example. The “outer” relation in a nested-loops join is synchronized with the “inner” relation, but not vice versa: after each tuple (or block of tuples) is consumed from the outer relation, a barrier is set until a full scan of the inner is completed. For asymmetric operators like nested-loops join, performance benefits can often be obtained by reordering the inputs.

Footnote 1: If there are duplicates in a merge join, the duplicates are handled by an asymmetric but usually small nested loop. For purposes of exposition, we can ignore this detail here.

Figure 2: Tuples generated by a nested-loops join, reordered at two moments of symmetry. Each axis represents the tuples of the corresponding relation, in the order they are delivered by an access method. The dots represent tuples generated by the join, some of which may be eliminated by the join predicate. The numbers correspond to the barriers reached, in order. c_R and c_S are the cursor positions maintained by the corresponding inputs at the time of the reorderings.

When a join algorithm reaches a barrier, it has declared the end of a scheduling dependency between its two input relations. In such cases, the order of the inputs to the join can often be changed without modifying any state in the join; when this is true, we refer to the barrier as a moment of symmetry. Let us return to the example of a nested-loops join, with outer relation R and inner relation S. At a barrier, the join has completed a full inner loop, having joined each tuple in a subset of R with every tuple in S. Reordering the inputs at this point can be done without affecting the join algorithm, as long as the iterator producing R notes its current cursor position c_R. In that case, the new “outer” loop on S begins rescanning by fetching the first tuple of S, and R is scanned from c_R to the end. This can be repeated indefinitely, joining S tuples with all tuples in R from position c_R to the end. Alternatively, at the end of some loop over R (i.e., at a moment of symmetry), the order of inputs can be swapped again by remembering the current position of S, and repeatedly joining the next tuple in R (starting at c_R) with tuples from S between c_S and the end. Figure 2 depicts this scenario, with two changes of ordering.

Some operators like the pipelined hash join of [WA91] have no barriers whatsoever. These operators are in constant symmetry, since the processing of the two inputs is totally decoupled.
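The cursor bookkeeping for a nested-loops join reordered at two moments of symmetry can be sketched as follows. This is a minimal illustration under simplifying assumptions: both relations are in-memory lists, cursor positions are list indices, and the function and parameter names (`reordered_nested_loops`, `swap_after_r`, `swap_after_s`) are hypothetical rather than taken from the paper.

```python
def reordered_nested_loops(R, S, swap_after_r, swap_after_s, pred):
    """Nested-loops join of R and S with the inputs swapped twice.

    Illustrative sketch: the swaps happen only at moments of symmetry,
    i.e. after a full inner loop has completed.
    """
    out = []

    # Phase 1: R is the outer relation, S the inner.
    for r in R[:swap_after_r]:
        out += [(r, s) for s in S if pred(r, s)]
    c_r = min(swap_after_r, len(R))  # cursor noted by the iterator on R

    # Phase 2 (first swap): S is now the outer relation, restarting from
    # its first tuple; R is the inner, scanned from c_r to the end.
    for s in S[:swap_after_s]:
        out += [(r, s) for r in R[c_r:] if pred(r, s)]
    c_s = min(swap_after_s, len(S))  # cursor noted by the iterator on S

    # Phase 3 (second swap): R is the outer again, starting at c_r, and
    # joins only with the S tuples between c_s and the end.
    for r in R[c_r:]:
        out += [(r, s) for s in S[c_s:] if pred(r, s)]
    return out


R = list(range(5))
S = list(range(4))
same_parity = lambda r, s: r % 2 == s % 2
result = reordered_nested_loops(R, S, swap_after_r=2, swap_after_s=1,
                                pred=same_parity)
```

The three phases cover R[:c_R] x S, R[c_R:] x S[:c_S] and R[c_R:] x S[c_S:], which together make up the full cross product with no pair visited twice. No join state has to be rewritten at either swap; only the two cursor positions are remembered.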

Moments of symmetry allow reordering of the inputs to a single binary operator. But we can generalize this, by noting that since joins commute, a tree of n - 1 binary joins can be viewed as a single n-ary join. One could easily implement a doubly-nested-loops join operator over relations R, S and T, and it would have moments of complete symmetry at the end of each loop of S. At that point, all three inputs could be reordered (say to T then R then S) with a straightforward extension to the discussion above: a cursor would be recorded for each input, and each loop would go from the recorded cursor position to the end of the input.

The same effect can be obtained in a binary implementation with two operators, by swapping the positions of binary operators: effectively the plan tree transformation would go in steps, from (R ⋈_1 S) ⋈_2 T to (R ⋈_2 T) ⋈_1 S and then to (T ⋈_2 R) ⋈_1 S. This approach treats an operator and its right-hand input as a unit (e.g., the unit [⋈_2 T]), and swaps units; the idea has been used previously in static query optimization schemes [IK84, KBZ86, Hel98]. Viewing the situation in this manner, we can naturally consider reordering multiple joins and their inputs, even if the join algorithms are different. In our query (R ⋈_1 S) ⋈_2 T, we need [⋈_1 S] and [⋈_2 T] to be mutually commutative, but do not require them to be the same join algorithm. We discuss the commutativity of join algorithms further in Section 2.2.2.

Note that the combination of commutativity and moments of symmetry allows for very aggressive reordering of a plan tree. A single n-ary operator representing a reorderable plan tree is therefore an attractive abstraction, since it encapsulates any ordering that may be subject to change. We will exploit this abstraction directly, by interposing an n-ary tuple router (an “eddy”) between the input tables and the join operators.

2.2.1 Joins and Indexes

Nested-loops joins can take advantage of indexes on the inner relation, resulting in a fairly efficient pipelining join algorithm. An index nested-loops join (henceforth an “index join”) is inherently asymmetric, since one input relation has been pre-indexed. Even when indexes exist on both inputs, changing the choice of inner and outer relation “on the fly” is problematic (footnote 2). Hence for the purposes of reordering, it is simpler to think of an index join as a kind of unary selection operator on the unindexed input (as in the index join shown in Figure 1). The only distinction between an index join and a selection is that, with respect to the unindexed relation, the selectivity of the join node may be greater than 1. Although one cannot swap the inputs to a single index join, one can reorder an index join and its indexed relation as a unit among other operators in a plan tree. Note that the logic for indexes can be applied to external tables that require bindings to be passed; such tables may be gateways to, e.g., web pages with forms, GIS index systems, LDAP servers and so on [HKWY97, FMLS99].

2.2.2 Physical Properties, Predicates, Commutativity

Clearly, a pre-optimizer’s choice of an index join algorithm constrains the possible join orderings. In the n-ary join view, an ordering constraint must be imposed so that the unindexed join input is ordered before (but not necessarily directly before) the indexed input. This constraint arises because of a physical property of an input relation: indexes can be probed but not scanned, and hence cannot appear before their corresponding probing tables. Similar but more complex constraints can arise in preserving the ordered inputs to a merge join (i.e., preserving “interesting orders”).

The applicability of certain join algorithms raises additional constraints. Many join algorithms work only for equijoins, and will not work on other joins like Cartesian products. Such algorithms constrain reorderings on the plan tree as well, since they always require all relations mentioned in their equijoin predicates to be handled before them. In this paper, we consider ordering constraints to be an inviolable aspect of a plan tree, and we ensure that they always hold. In Section 6 we sketch initial ideas on relaxing this requirement, by considering multiple join algorithms and query graph spanning trees.

2.2.3 Join Algorithms and Reordering

In order for an eddy to be most effective, we favor join algorithms with frequent moments of symmetry, adaptive or non-existent barriers, and minimal ordering constraints: these algorithms offer the most opportunities for reoptimization. In [AH99] we summarize the salient properties of a variety of join algorithms. Our desire to avoid blocking rules out the use of hybrid hash join, and our desire to minimize ordering constraints and barriers excludes merge joins. Nested-loops joins have infrequent moments of symmetry and imbalanced barriers, making them undesirable as well.

Footnote 2: In unclustered indexes, the index ordering is not the same as the scan ordering. Thus after a reordering of the inputs it is difficult to ensure that, using the terminology of Section 2.2, lookups on the index of the new “inner” relation R produce only tuples between c_R and the end of R.

The other algorithms we consider are based on frequently-symmetric versions of traditional iteration, hashing and indexing schemes, i.e., the Ripple Joins [HH99]. Note that the original pipelined hash join of [WA91] is a constrained version of the hash ripple join. The external hashing extensions of [UF99, IFF+99] are directly applicable to the hash ripple join, and [HH99] treats index joins as a special case as well. For non-equijoins, the block ripple join algorithm is effective, having frequent moments of symmetry, particularly at the beginning of processing [HH99]. Figure 3 illustrates block, index and hash ripple joins; the reader is referred to [HH99, IFF+99, UF99] for detailed discussions of these algorithms and their variants. These algorithms are adaptive without sacrificing much performance: [UF99] and [IFF+99] demonstrate scalable versions of hash ripple join that perform competitively with hybrid hash join in the static case; [HH99] shows that while block ripple join can be less efficient than nested-loops join, it arrives at moments of symmetry much more frequently than nested-loops joins, especially in early stages of processing. In [AH99] we discuss the memory overheads of these adaptive algorithms, which can be larger than standard join algorithms.

Ripple joins have moments of symmetry at each “corner” of a rectangular ripple in Figure 3, i.e., whenever a prefix of the input stream R has been joined with all tuples in a prefix of input stream S and vice versa. For hash ripple joins and index joins, this scenario occurs between each consecutive tuple consumed from a scanned input. Thus ripple joins offer very frequent moments of symmetry.
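To illustrate why a hash-based symmetric join is at a moment of symmetry between every pair of consecutive tuples, the following sketch implements the basic idea of a pipelined (symmetric) hash join in the style attributed to [WA91]. It is a simplified, in-memory illustration, not the implementation used in the paper; the function name, the `(side, key, payload)` arrival format and the sample data are assumptions made for this example.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals):
    """Pipelined equijoin over an interleaved stream of arrivals.

    `arrivals` yields (side, key, payload) with side in {"R", "S"}.
    Each tuple is inserted into its own side's hash table and then
    probes the other side's table, so the inputs may arrive in any
    interleaving. Illustrative sketch with hypothetical names.
    """
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, key, payload in arrivals:
        other = "S" if side == "R" else "R"
        tables[side][key].append(payload)
        for match in tables[other].get(key, []):
            # Always emit results in (R payload, S payload) order.
            yield (payload, match) if side == "R" else (match, payload)


arrivals = [
    ("R", 1, "r1"),
    ("S", 1, "s1"),
    ("S", 2, "s2"),
    ("R", 2, "r2"),
    ("R", 1, "r1b"),
]
joined = list(symmetric_hash_join(arrivals))
```

After each arrival has been inserted and probed, every R tuple seen so far has been joined with every S tuple seen so far, which is exactly the “corner” condition described above. Neither input ever waits for the other.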

Ripple joins are attractive with respect to barriers as well. Ripple joins were designed to allow changing rates for each input; this was originally used to proactively expend more processing on the input relation with more statistical influence on intermediate results. However, the same mechanism allows reactive adaptivity in the wide-area scenario: a barrier is reached at each corner, and the next corner can adaptively reflect the relative rates of the two inputs. For the block ripple join, the next corner is chosen upon reaching the previous corner; this can be done adaptively to reflect the relative rates of the two inputs over time.

The ripple join family offers attractive adaptivity features at a modest overhead in performance and memory footprint. Hence they fit well with our philosophy of sacrificing marginal speed for adaptability, and we focus on these algorithms in Telegraph.

3 Rivers and Eddies

The above discussion allows us to consider easily reordering query plans at moments of symmetry. In this section we proceed to describe the eddy mechanism for implementing reordering in a natural manner during query processing. The techniques we describe can be used with any operators, but algorithms with frequent moments of symmetry allow for more frequent reoptimization. Before discussing eddies, we first introduce our basic query processing environment.

3.1 River

We implemented eddies in the context of River [AAT+99], a shared-nothing parallel query processing framework that dynamically adapts to fluctuations in performance and workload. River has been used to robustly produce near-record performance on I/O-intensive benchmarks like parallel sorting and hash joins, despite heterogeneities and dynamic variability in hardware and workloads across machines in a cluster. For more details on River’s adaptivity and parallelism features, the interested reader is referred to the original paper on the topic [AAT+99]. In Telegraph, we intend to leverage the adaptability of River to allow for dynamic shifting of load (both query processing and data delivery) in a shared-nothing parallel environment. But in this paper we restrict ourselves to basic (single-site) features of eddies; discussions of eddies in parallel rivers are deferred to Section 6.

Figure 3: Tuples generated by block, index, and hash ripple join. In block ripple, all tuples are generated by the join, but some may be eliminated by the join predicate. The arrows for index and hash ripple join represent the logical portion of the cross-product space checked so far; these joins only expend work on tuples satisfying the join predicate (black dots). In the hash ripple diagram, one relation arrives 3× faster than the other.

Since we do not discuss parallelism here, a very simple overview of the River framework suffices. River is a dataflow query engine, analogous in many ways to Gamma [DGS+90], Volcano [Gra90] and commercial parallel database engines, in which “iterator”-style modules (query operators) communicate via a fixed dataflow graph (a query plan). Each module runs as an independent thread, and the edges in the graph correspond to finite message queues. When a producer and consumer run at differing rates, the faster thread may block on the queue waiting for the slower thread to catch up. As in [UFA98], River is multi-threaded and can exploit barrier-free algorithms by reading from various inputs at independent rates. The River implementation we used derives from the work on Now-Sort [AAC+97], and features efficient I/O mechanisms including pre-fetching scans, avoidance of operating system buffering, and high-performance user-level networking.

3.1.1 Pre-Optimization

Although we will use eddies to reorder tables among joins, a heuristic pre-optimizer must choose how to initially pair off relations into joins, with the constraint that each relation participates in only one join. This corresponds to choosing a spanning tree of a query graph, in which nodes represent relations and edges represent binary joins [KBZ86]. One reasonable heuristic for picking a spanning tree forms a chain of cartesian products across any tables known to be very small (to handle “star schemas” when base-table cardinality statistics are available); it then picks arbitrary equijoin edges (on the assumption that they are relatively low selectivity), followed by as many arbitrary non-equijoin edges as required to complete a spanning tree.

Given a spanning tree of the query graph, the pre-optimizer needs to choose join algorithms for each edge. Along each equijoin edge it can use either an index join if an index is available, or a hash ripple join. Along each non-equijoin edge it can use a block ripple join.

These are simple heuristics that we use to allow us to focus on our initial eddy design; in Section 6 we present initial ideas on making spanning tree and algorithm decisions adaptively.

3.2 An Eddy in the River

An eddy is implemented via a module in a river containing an arbitrary number of input relations, a number of participating unary and binary modules, and a single output relation (Figure 1) (footnote 3). An eddy encapsulates the scheduling of its participating operators; tuples entering the eddy can flow through its operators in a variety of orders.

In essence, an eddy explicitly merges multiple unary and binary operators into a single n-ary operator within a query plan, based on the intuition from Section 2.2 that symmetries can be easily captured in an n-ary operator. An eddy module maintains a fixed-sized buffer of tuples that are to be processed by one or more operators. Each operator participating in the eddy has one or two inputs that are fed tuples by the eddy, and an output stream that returns tuples to the eddy. Eddies are so named because of this circular dataflow within a river.

A tuple entering an eddy is associated with a tuple descriptor containing a vector of Ready bits and Done bits, which indicate respectively those operators that are eligible to process the tuple, and those that have already processed the tuple. The eddy module ships a tuple only to operators for which the corresponding Ready bit is turned on. After processing the tuple, the operator returns it to the eddy, and the corresponding Done bit is turned on. If all the Done bits are on, the tuple is sent to the eddy’s output; otherwise it is sent to another eligible operator for continued processing.

Footnote 3: Nothing prevents the use of n-ary operators with n > 2 in an eddy, but since implementations of these are atypical in database query processing we do not discuss them here.

When an eddy receives a tuple from one of its inputs, it zeroes the Done bits, and sets the Ready bits appropriately. In the simple case, the eddy sets all Ready bits on, signifying that any ordering of the operators is acceptable. When there are ordering constraints on the operators, the eddy turns on only the Ready bits corresponding to operators that can be executed initially. When an operator returns a tuple to the eddy, the eddy turns on the Ready bit of any operator eligible to process the tuple. Binary operators generate output tuples that correspond to combinations of input tuples; in these cases, the Done bits and Ready bits of the two input tuples are ORed. In this manner an eddy preserves the ordering constraints while maximizing opportunities for tuples to follow different possible orderings of the operators.
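The Ready/Done bookkeeping described above can be sketched for the simplest case of unary operators (selections). The code is an illustration under several assumptions that are not from the paper: it is single-threaded, it picks among eligible operators at random rather than by a routing policy, it represents ordering constraints as a per-operator bitmask `prereqs`, and it omits binary operators and the ORing of descriptors. All class, method and parameter names are hypothetical.

```python
import random


class Eddy:
    """Toy single-threaded eddy over unary filter operators (a sketch)."""

    def __init__(self, operators, prereqs=None, seed=0):
        # operators[i]: predicate, tuple -> bool
        # prereqs[i]: bitmask of operators that must be Done before
        #             operator i becomes Ready (0 means no constraint)
        self.ops = operators
        self.prereqs = prereqs or [0] * len(operators)
        self.all_done = (1 << len(operators)) - 1
        self.rng = random.Random(seed)

    def ready_bits(self, done):
        """Recompute Ready bits from the Done bits and the constraints."""
        ready = 0
        for i, need in enumerate(self.prereqs):
            if not done & (1 << i) and (done & need) == need:
                ready |= 1 << i
        return ready

    def route(self, tup):
        """Route one tuple; return (tuple or None, operator order used)."""
        done, order = 0, []  # Done bits are zeroed on entry
        while done != self.all_done:
            ready = self.ready_bits(done)
            eligible = [i for i in range(len(self.ops)) if ready & (1 << i)]
            i = self.rng.choice(eligible)  # placeholder routing policy
            order.append(i)
            if not self.ops[i](tup):
                return None, order  # filtered out; never reaches output
            done |= 1 << i  # operator returned the tuple: set Done bit
        return tup, order  # all Done bits on: send to output


ops = [lambda t: t > 0, lambda t: t % 2 == 0, lambda t: t < 100]
# Operator 1 may run only after operator 0 (an ordering constraint).
eddy = Eddy(ops, prereqs=[0, 0b001, 0])
kept, order_kept = eddy.route(4)
dropped, _ = eddy.route(3)
```

Different tuples may take different orders through the operators, but the constraint mask guarantees that operator 1 never sees a tuple before operator 0 has processed it.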

Two properties of eddies merit comment. First, note that eddies represent the full class of bushy trees corresponding to the set of join nodes; it is possible, for instance, that two pairs of tuples are combined independently by two different join modules, and then routed to a third join to perform the 4-way concatenation of the two binary records. Second, note that eddies do not constrain reordering to moments of symmetry across the eddy as a whole. A given operator must carefully refrain from fetching tuples from certain inputs until its next moment of symmetry; e.g., a nested-loops join would not fetch a new tuple from the current outer relation until it finished rescanning the inner. But there is no requirement that all operators in the eddy be at a moment of symmetry when this occurs; just the operator that is fetching a new tuple. Thus eddies are quite flexible both in the shapes of trees they can generate, and in the scenarios in which they can logically reorder operators.

4 Routing Tuples in Eddies

An eddy module directs the flow of tuples from the inputs through the various operators to the output, providing the flexibility to allow each tuple to be routed individually through the operators. The routing policy used in the eddy determines the efficiency of the system. In this section we study some promising initial policies; we believe that this is a rich area for future work. We outline some of the remaining questions in Section 6.

An eddy’s tuple buffer is implemented as a priority queue with a flexible prioritization scheme. An operator is always given the highest-priority tuple in the buffer that has the corresponding Ready bit set. For simplicity, we start by considering a very simple priority scheme: tuples enter the eddy with low priority, and when they are returned to the eddy from an operator they are given high priority. This simple priority scheme ensures that tuples flow completely through the eddy before new tuples are consumed from the inputs, ensuring that the eddy does not become “clogged” with new tuples.

4.1 Experimental Setup

In order to illustrate how eddies work, we present some initial experiments in this section; we pause briefly here to describe our experimental setup. All our experiments were run on a single-processor Sun Ultra-1 workstation running Solaris 2.6, with 160 MB of RAM. We used the Euphrates implementation of River [AAT+99]. We synthetically generated relations as in Table 1, with 100 byte tuples in each relation.
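The simple two-level priority scheme described at the start of Section 4 can be sketched with a binary heap. This is an illustrative data structure only, assuming an unbounded buffer and a linear scan past ineligible tuples; the class name `TupleBuffer` and its methods are hypothetical and do not describe the River implementation.

```python
import heapq
import itertools


class TupleBuffer:
    """Toy eddy buffer: returned tuples outrank newly arrived ones."""

    RETURNED, NEW = 0, 1  # heapq pops the smallest value first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a priority

    def add(self, tup, ready_bits, returned):
        prio = self.RETURNED if returned else self.NEW
        heapq.heappush(self._heap, (prio, next(self._seq), ready_bits, tup))

    def take_for(self, op_index):
        """Pop the highest-priority tuple whose Ready bit for op is set."""
        skipped, found = [], None
        while self._heap:
            entry = heapq.heappop(self._heap)
            if entry[2] & (1 << op_index):
                found = entry
                break
            skipped.append(entry)
        for entry in skipped:  # put back tuples this operator cannot take
            heapq.heappush(self._heap, entry)
        return None if found is None else found[3]


buf = TupleBuffer()
buf.add("a", ready_bits=0b11, returned=False)  # new tuple, any operator
buf.add("b", ready_bits=0b10, returned=True)   # returned, needs operator 1
buf.add("c", ready_bits=0b01, returned=True)   # returned, needs operator 0
first_for_op0 = buf.take_for(0)
first_for_op1 = buf.take_for(1)
second_for_op1 = buf.take_for(1)
```

Here operator 0 receives the returned tuple "c" ahead of the new tuple "a", which is the behavior the priority scheme is meant to guarantee: in-flight tuples drain before fresh input is admitted.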

To allow us to experiment with costs and selectivities of selections, our selection modules are (artificially) implemented as spin loops corresponding to their relative costs, followed by a randomized selection decision with the appropriate selectivity. We describe the relative costs of selections in terms of abstract “delay units”; for studying optimization, the absolute number of cycles through a spin loop is irrelevant. We implemented the simplest version of hash ripple join, identical to the original pipelining hash join [WA91]; our implementation here does not exert any statistically-motivated control over disk resource consumption (as in [HH99]). We simulated index joins by doing random I/Os within a file, returning on average the number of matches corresponding to a pre-programmed selectivity. The filesystem cache was allowed to absorb some of the index I/Os after warming up. In order to fairly compare eddies to static plans, we simulate static plans via eddies that enforce a static ordering on tuples (setting Ready bits in the correct order).

Table   Cardinality   Values in column a
R       10,000        500 - 5500
S       80,000        0 - 5000
T       10,000        N/A
U       50,000        N/A

Table 1: Cardinalities of tables; values are uniformly distributed.

Figure 4: Performance of two 50% selections; s2 has cost 5, s1 varies across runs. (Plot of completion time in seconds against the cost of s1, with curves for s1 before s2, s2 before s1, Naive, and Lottery.)

4.2 Naive Eddy: Fluid Dynamics and Operator Costs

To illustrate how an eddy works, we consider a very simple single-table query with two expensive selection predicates, under the traditional assumption that no performance or selectivity properties change during execution. Our SQL query is simply the following:

    SELECT *
    FROM U
    WHERE s1() AND s2();

In our first experiment, we wish to see how well a “naive” eddy can account for differences in costs among operators. We run the query multiple times, always setting the cost of s2 to 5 delay units, and the selectivities of both selections to 50%. In each run we use a different cost for s1, varying it between 1 and 9 delay units across runs. We compare a naive eddy of the two selections against both possible static orderings of

[Figure 5 plot: completion time (secs) versus selectivity of s1; series: s1 before s2, s2 before s1, Naive, Lottery.]

Figure 5: Performance of two selections of cost 5; s2 has 50% selectivity, s1 varies across runs.

the two selections (and against a “lottery”-based eddy, about which we will say more in Section 4.3). One might imagine that the flexible routing in the naive eddy would deliver tuples to the two selections equally: half the tuples would flow to s1 before s2, and half to s2 before s1, resulting in middling performance overall. Figure 4 shows that this is not the case: the naive eddy nearly matches the better of the two orderings in all cases, without any explicit information about the operators’ relative costs.
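As a point of reference for Figure 4, the expected per-tuple cost of each static ordering can be worked out with the usual model for ordering two independent selections: every tuple pays for the first selection, and only the fraction that passes pays for the second. The sketch below is illustrative only and is not part of the experimental code; the function names `expected_cost` and `compare_orderings` are hypothetical, and costs are in the abstract delay units of Section 4.1.

```python
def expected_cost(cost_first, sel_first, cost_second):
    """Expected delay units per input tuple for a fixed two-selection order."""
    # Every tuple pays for the first selection; only survivors reach the second.
    return cost_first + sel_first * cost_second


def compare_orderings(cost_s1, cost_s2=5.0, sel_s1=0.5, sel_s2=0.5):
    """Return (cost of s1-before-s2, cost of s2-before-s1)."""
    s1_first = expected_cost(cost_s1, sel_s1, cost_s2)
    s2_first = expected_cost(cost_s2, sel_s2, cost_s1)
    return s1_first, s2_first


if __name__ == "__main__":
    # Sweep the cost of s1 over the range used in the first experiment.
    for cost_s1 in range(1, 10):
        s1_first, s2_first = compare_orderings(float(cost_s1))
        if s1_first == s2_first:
            better = "tie"
        elif s1_first < s2_first:
            better = "s1 before s2"
        else:
            better = "s2 before s1"
        print(cost_s1, s1_first, s2_first, better)
```

Under this model, with equal selectivities the cheaper selection should run first, and the two orderings tie when both selections cost 5 units.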

The naive eddy’s effectiveness in this scenario is due to simple fluid dynamics, arising from the different rates of consumption by s1 and s2. Recall that edges in a River dataflow graph correspond to fixed-size queues. This limitation has the same effect as back-pressure in a fluid flow: production along the input to any edge is limited by the rate of consumption at the output. The lower-cost selection (e.g., s1 at the left of Figure 4) can consume tuples more quickly, since it spends less time per tuple; as a result the lower-cost operator exerts less back-pressure on the input table. At the same time, the high-cost operator produces tuples relatively slowly, so the low-cost operator will rarely be required to consume a high-priority, previously-seen tuple. Thus most tuples are routed to the low-cost operator first, even though the costs are not explicitly exposed or tracked in any way.

4.3 Fast Eddy: Learning Selectivities

The naive eddy works well for handling operators with different costs but equal selectivity. But we have not yet considered differences in selectivity. In our second experiment we keep the costs of the operators constant and equal (5 units), keep the selectivity of s2 fixed at 50%, and vary the selectivity of s1 across runs. The results in Figure 5 are less encouraging, showing the naive eddy performing as we originally expected, about half-way between the best and worst plans. Clearly our naive priority scheme and the resulting back-pressure are insufficient to capture differences in selectivity.

To resolve this dilemma, we would like our priority scheme to favor operators based on both their consumption and production rate. Note that the consumption (input) rate of an operator is determined by cost alone, while the production (output) rate is determined by a product of cost and selectivity. Since an operator’s back-pressure on its input depends largely on its consumption rate, it is not surprising that our naive scheme

[Figure 6 plot: cumulative % of tuples routed to s1 first versus selectivity of s1; series: Naive, Lottery.]

Figure 6: Tuple flow with lottery scheme for the variable-selectivity experiment (Figure 5).

does not capture differing selectivities.

To track both consumption and production over time, we enhance our priority scheme with a simple learning algorithm implemented via Lottery Scheduling [WW94]. Each time the eddy gives a tuple to an operator, it credits the operator one “ticket”. Each time the operator returns a tuple to the eddy, one ticket is debited from the eddy’s running count for that operator. When an eddy is ready to send a tuple to be processed, it “holds a lottery” among the operators eligible for receiving the tuple. (The interested reader is referred to [WW94] for a simple and efficient implementation of lottery scheduling.) An operator’s chance of “winning the lottery” and receiving the tuple corresponds to the count of tickets for that operator, which in turn tracks the relative efficiency of the operator at draining tuples from the system. By routing tuples using this lottery scheme, the eddy tracks (“learns”) an ordering of the operators that gives good overall efficiency.
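The ticket bookkeeping just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the class `LotteryRouter` and the helper `fraction_s1_first` are hypothetical names, the weighted draw uses Python's `random.choices` rather than the data structure of [WW94], and every operator is given one free ticket in each draw (an assumption made here so that an operator whose count is zero can still be chosen).

```python
import random


class LotteryRouter:
    """Ticket bookkeeping for routing tuples among operators (illustrative)."""

    def __init__(self, operators, seed=None):
        self.tickets = {op: 0 for op in operators}
        self.rng = random.Random(seed)

    def route(self, eligible):
        """Hold a lottery among eligible operators and credit the winner."""
        eligible = list(eligible)
        # Assumption: one free ticket per operator, so that an operator
        # with a zero count can still win a lottery.
        weights = [self.tickets[op] + 1 for op in eligible]
        winner = self.rng.choices(eligible, weights=weights, k=1)[0]
        self.tickets[winner] += 1  # credit: the operator was given a tuple
        return winner

    def returned(self, op):
        """Debit one ticket when an operator hands a tuple back to the eddy."""
        self.tickets[op] -= 1


def fraction_s1_first(sel_s1, sel_s2, n_tuples=5000, seed=0):
    """Simulate two selections; return the share of tuples sent to s1 first."""
    sel = {"s1": sel_s1, "s2": sel_s2}
    router = LotteryRouter(sel, seed=seed)
    data_rng = random.Random(seed + 1)
    s1_first = 0
    for _ in range(n_tuples):
        remaining = {"s1", "s2"}
        first = True
        while remaining:
            op = router.route(sorted(remaining))
            if first and op == "s1":
                s1_first += 1
            first = False
            remaining.discard(op)
            if data_rng.random() >= sel[op]:
                break  # tuple filtered out; nothing is returned to the eddy
            router.returned(op)  # tuple survived and comes back
    return s1_first / n_tuples
```

In this toy model an operator that discards most of its input keeps most of the tickets it is credited, so a highly selective s1 should come to win most lotteries; costs are ignored entirely, so the sketch says nothing about back-pressure.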

The “lottery” curves in Figures 4 and 5 show the more intelligent lottery-based routing scheme compared to the naive back-pressure scheme and the two static orderings. The lottery scheme handles both scenarios effectively, slightly improving the eddy in the changing-cost experiment, and performing much better than naive in the changing-selectivity experiment.

To explain this a bit further, in Figure 6 we display the percent of tuples that followed the order s1, s2 (as opposed to s2, s1) in the two eddy schemes; this roughly represents the average ratio of lottery tickets possessed by s1 and s2 over time. Note that the naive back-pressure policy is barely sensitive to changes in selectivity, and in fact drifts slightly in the wrong direction as the selectivity of s1 is increased. By contrast, the lottery-based scheme adapts quite nicely as the selectivity is varied.

In both graphs one can see that when the costs and selectivities are close to equal (s1 ≈ s2 ≈ 50%), the percentage of tuples following the cheaper order is close to 50%. This observation is intuitive, but quite significant. The lottery-based eddy approaches the cost of an optimal ordering, but does not concern itself about strictly observing the optimal ordering. Contrast this to earlier work on runtime reoptimization [KD98, UFA98, IFF+99], where a traditional query optimizer runs during processing to determine the optimal plan remnant. By focusing on overall cost rather than on finding the optimal plan, the lottery scheme probabilistically provides nearly optimal performance with much less effort, allowing re-optimization to be done with an extremely lightweight technique that can be executed multiple times for every tuple.

A related observation is that the lottery algorithm gets closer to perfect routing (y = 0%) on the right of Figure 6 than it does (y = 100%) on the left. Yet in the corresponding performance graph (Figure 5), the differences between the lottery-based eddy and the optimal static ordering do not change much in the two settings. This phenomenon is explained by examining the “jeopardy” of making ordering errors in either case. Consider the left side of the graph, where the selectivity of s1 is 10%, s2 is 50%, and the costs of each are c = 5 delay units. Let e be the rate at which tuples are routed erroneously (to s2 before s1 in this case). Then the expected cost of the query is $(1 - e) \cdot 1.1c + e \cdot 1.5c = 0.4ec + 1.1c$. By contrast, in the second case where the selectivity of s1 is changed to 90%, the expected cost is $(1 - e) \cdot 1.5c + e \cdot 1.9c = 0.4ec + 1.5c$. Since the jeopardy is higher at 90% selectivity than at 10%, the lottery more aggressively favors the optimal ordering at 90% selectivity than at 10%.

4.4 Joins

We have discussed selections up to this point for ease of exposition, but of course joins are the more common expensive operator in query processing. In this section we study how eddies interact with the pipelining ripple join algorithms. For the moment, we continue to study a static performance environment, validating the ability of eddies to do well even in scenarios where static techniques are most effective.

We begin with a simple 3-table query:

    SELECT *
    FROM R, S, T
    WHERE R.a = S.a
    AND S.b = T.b

In our experiment, we constructed a preoptimized plan with a hash ripple join between R and S, and an index join between S and T. Since our data is uniformly distributed, Table 1 indicates that the selectivity of the RS join is $1.8 \times 10^{-4}$; its selectivity with respect to S is 180%, i.e., each S tuple entering the join finds 1.8 matching R tuples on average [Hel98]. We artificially set the selectivity of the index join w.r.t. S to be 10% (overall selectivity $1 \times 10^{-5}$). Figure 7 shows the relative performance of our two eddy schemes and the two static join orderings. The results echo our results for selections, showing the lottery-based eddy performing nearly optimally, and the naive eddy performing in between the best and worst static plans.
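The selectivity figures quoted above can be checked against Table 1 with a short calculation. The sketch below rests on assumptions that are not spelled out in the text: column a is taken to be uniform over the ranges listed in Table 1, the width of a range is used as its number of distinct values, and the 180% figure is read as the expected number of R matches per S tuple. The helper name `matches_per_probe` is hypothetical.

```python
from fractions import Fraction

# Table 1 values for join column a: cardinality, low value, high value.
R_CARD, R_LO, R_HI = 10_000, 500, 5500
S_CARD, S_LO, S_HI = 80_000, 0, 5000
T_CARD = 10_000


def matches_per_probe(probe_lo, probe_hi, build_card, build_lo, build_hi):
    """Expected build-side matches for one uniformly drawn probe value."""
    overlap = max(0, min(probe_hi, build_hi) - max(probe_lo, build_lo))
    # Chance that the probe value falls inside the build side's range.
    p_in_range = Fraction(overlap, probe_hi - probe_lo)
    # Build tuples per distinct value, treating the range width as the
    # number of distinct values (an assumption).
    per_value = Fraction(build_card, build_hi - build_lo)
    return p_in_range * per_value


# Each S tuple probes R: 90% of S values overlap R, at 2 R tuples per value.
rs_wrt_s = matches_per_probe(S_LO, S_HI, R_CARD, R_LO, R_HI)
# Fraction of all R x S pairs that join.
rs_overall = rs_wrt_s / R_CARD
# Index join: 10% w.r.t. S, expressed as a fraction of S x T pairs.
st_overall = Fraction(10, 100) / T_CARD

if __name__ == "__main__":
    print(float(rs_wrt_s), float(rs_overall), float(st_overall))
```

Under these assumptions the calculation gives 1.8 matches per S tuple, and the two overall selectivities come out as 9/50000 and 1/100000.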

As noted in Section 2.2.1, index joins are very analogous to selections. Hash joins have more complicated and symmetric behavior, and hence merit additional study. Figure 8 presents performance of two hash-ripple-only versions of this query. Our in-memory pipelined hash joins all have the same cost. We change the data in R and T so that the selectivity of the ST join w.r.t. S is 20% in one version, and 180% in the other. In all runs, the selectivity of the RS join predicate w.r.t. S is fixed at 100%. As the figure shows, the lottery-based eddy continues to perform nearly optimally.

Figure 9 shows the percent of tuples in the eddy that follow one order or the other in all four join experiments. While the eddy is not strict about following the optimal ordering, it is

[Figure 7 plot: execution time of plan (secs); bars: Hash First, Lottery, Naive, Index First.]

Figure 7: Performance of two joins: a selective Index Join and a Hash Join.

[Figure 8 plot: execution time of plan (secs); bars: 20%, ST before SR; 20%, Eddy; 20%, SR before ST; 180%, ST before SR; 180%, Eddy; 180%, SR before ST.]

Figure 8: Performance of hash joins R ⋈ S and S ⋈ T. R ⋈ S has selectivity 100% w.r.t. S; the selectivity of S ⋈ T w.r.t. S varies between 20% and 180% in the two runs.

quite close in the case of the experiment where the hash join should precede the index join. In this case, the relative cost of index join is so high that the jeopardy of choosing it first drives the hash join to nearly always win the lottery.

4.5 Responding to Dynamic Fluctuations

Eddies should adaptively react over time to the changes in performance and data characteristics described in Section 1.1. The routing schemes described up to this point have not considered how to achieve this. In particular, our lottery scheme weighs all experiences equally: observations from the distant past affect the lottery as much as recent observations. As a result, an operator that earns many tickets early in a query may become so wealthy that it will take a great deal of time for it to lose ground to the top achievers in recent history.

[Figure 9 plot: cumulative % of tuples routed to the correct join first; bars: index beats hash, hash beats index, hash/hash 20%, hash/hash 180%.]

Figure 9: Percent of tuples routed in the optimal order in all of the join experiments.

[Figure 10 plot: execution time of plan (secs); bars: I_sf first, Eddy, I_fs first.]

Figure 10: Adapting to changing join costs: performance.

To avoid this, we need to modify our point scheme to forget history to some extent. One simple way to do this is to use a window scheme, in which time is partitioned into windows, and the eddy keeps track of two counts for each operator: a number of banked tickets, and a number of escrow tickets. Banked tickets are used when running a lottery. Escrow tickets are used to measure efficiency during the window. At the beginning of the window, the value of the escrow account replaces the value of the banked account (i.e., banked = escrow), and the escrow account is reset (escrow = 0). This scheme ensures that operators “re-prove themselves” each window.
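The banked and escrow accounts of the window scheme can be sketched as a small per-operator record. This is a hedged illustration rather than the authors' code: the class name `WindowedTickets` and its method names are hypothetical, and the choice to apply credits and debits to the escrow account is an assumption drawn from the statement that escrow tickets measure efficiency during the window.

```python
class WindowedTickets:
    """Banked/escrow ticket counts for one operator (illustrative sketch)."""

    def __init__(self):
        self.banked = 0  # consulted when running a lottery
        self.escrow = 0  # measures efficiency during the current window

    def on_tuple_given(self):
        # Assumption: credits accumulate in escrow, not in the bank.
        self.escrow += 1

    def on_tuple_returned(self):
        # Assumption: debits are likewise charged to escrow.
        self.escrow -= 1

    def lottery_weight(self):
        """Ticket count used by lotteries held during this window."""
        return self.banked

    def start_window(self):
        """At a window boundary: banked = escrow, then escrow = 0."""
        self.banked = self.escrow
        self.escrow = 0


if __name__ == "__main__":
    t = WindowedTickets()
    for _ in range(3):
        t.on_tuple_given()
    t.on_tuple_returned()
    t.start_window()
    print(t.lottery_weight())  # tickets earned in the previous window
    t.start_window()
    print(t.lottery_weight())  # an idle window leaves nothing in the bank
```

Because the bank is overwritten rather than added to, tickets earned two windows ago have no influence on the current lottery, which is the forgetting behavior the scheme is after.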

We consider a scenario of a 3-table equijoin query, where two of the tables are external and used as “inner” relations by index joins. Our third relation has 30,000 tuples. Since we assume that the index servers are remote, we implement the “cost” in our index module as a time delay (i.e., while (gettimeofday() < x) ;) rather than a spin loop; this better models the behavior of waiting on an external event like a network response. We have two phases in the experiment: initially, one index (call it I_fs) is fast (no time delay) and the other (I_sf) is slow (5 seconds per lookup). After 30 seconds we begin the second phase, in which the two indexes swap speeds: the I_fs index becomes slow, and I_sf becomes fast. Both indexes return a single matching tuple 1% of the time.
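A simulated index lookup with a wall-clock delay, in the spirit of the while (gettimeofday() < x) ; loop mentioned above, might look like the following. This is a sketch under assumptions: the function name `delayed_index_lookup` and its parameters are hypothetical, Python's `time.monotonic` stands in for gettimeofday(), and the match decision is a simple random draw against the configured match probability.

```python
import random
import time


def delayed_index_lookup(delay_seconds, match_probability, rng,
                         clock=time.monotonic):
    """Wait out a wall-clock delay, then report whether the probe matched."""
    deadline = clock() + delay_seconds
    while clock() < deadline:
        pass  # poll the clock until the deadline, as in the C-style loop
    # Randomized match decision with the configured probability.
    return rng.random() < match_probability


if __name__ == "__main__":
    rng = random.Random(42)
    # A "slow" index with a short delay for demonstration purposes.
    start = time.monotonic()
    delayed_index_lookup(0.05, 0.01, rng)
    print(time.monotonic() - start >= 0.05)
```

Passing the clock as a parameter is a design choice made here so that the delay can be driven by a fake clock in tests; swapping index speeds between phases would amount to changing `delay_seconds` per call.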

Figure 10 shows the performance of both possible static plans, compared with an eddy using a lottery with a window scheme. As we would hope, the eddy is much faster than either static plan. In the first static plan (I_sf before I_fs), the initial index join in the plan is slow in the first phase, processing only 6 tuples and discarding all of them. In the remainder of the run, the plan quickly discards 99% of the tuples, passing 300 to the (now) expensive second join. In the second static

[Figure 11 plot: cumulative % of tuples routed to Index #1 first versus % of tuples seen.]

Figure 11: Adapting to changing join costs: tuple movement.

plan (I_fs before I_sf), the initial join begins fast, processing about 29,000 tuples, and passing about 290 of those to the second (slower) join. After 30 seconds, the second join becomes fast and handles the remainder of the 290 tuples quickly, while the first join slowly processes the remaining 1,000 tuples at 5 seconds per tuple. The eddy outdoes both static plans: in the first phase it behaves identically to the second static plan, consuming 29,000 tuples and queueing 290 for the eddy to pass to I_sf. Just after phase 2 begins, the eddy adapts its ordering and passes tuples to I_sf – the new fast join – first. As a result, the eddy spends 30 seconds in phase one, and in phase two it has fewer than 290 tuples queued at I_sf (now fast), and only 1,000 tuples to process, only about 10 of which are passed to I_fs (now slow).
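The tuple counts quoted in this walkthrough imply rough completion times for the three strategies. The back-of-the-envelope sketch below is an approximation under explicit assumptions: fast lookups are treated as free, the quoted tuple counts are taken at face value, and the constant and function names are hypothetical.

```python
SLOW_LOOKUP_SECS = 5   # cost of one lookup on whichever index is slow
PHASE_ONE_SECS = 30    # time before the two indexes swap speeds


def approx_completion_secs(slow_lookups_after_swap):
    """Phase one plus time on slow lookups; fast lookups count as free."""
    return PHASE_ONE_SECS + slow_lookups_after_swap * SLOW_LOOKUP_SECS


# Tuples the slow-first plan can process during phase one.
phase_one_tuples_slow_first = PHASE_ONE_SECS // SLOW_LOOKUP_SECS

# Tuple counts quoted in the text (all approximate).
first_static = approx_completion_secs(300)    # I_sf first: ~300 reach slow I_fs
second_static = approx_completion_secs(1000)  # I_fs first: ~1,000 hit slow I_fs
eddy = approx_completion_secs(10)             # eddy: ~10 reach slow I_fs

if __name__ == "__main__":
    print(phase_one_tuples_slow_first, first_static, second_static, eddy)
```

These rough totals are only meant to show the order-of-magnitude gap between the eddy and either static plan; the measured times in Figure 10 include overheads that this model ignores.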

A similar, more controlled experiment illustrates the eddy’s adaptability more clearly. Again, we run a three-table join, with two external indexes that return a match 10% of the time. We read 4,000 tuples from the scanned table, and toggle costs between 1 and 100 cost units every 1000 tuples – i.e., three times during the experiment. Figure 11 shows that the eddy adapts correctly, switching orders when the operator costs switch. Since the cost differential is less dramatic here, the jeopardy is lower and the eddy takes a bit longer to adapt. Despite the learning time, the trends are clear – the eddy sends most of the first 1000 tuples to index #1 first, which starts off cheap. It sends most of the second 1000 tuples to index #2 first, causing the overall percentage of tuples to reach about 50%, as reflected by the near-linear drift toward 50% in the second quarter of the graph. This pattern repeats in the third and fourth quarters, with the eddy eventually displaying an even use of the two orderings over time – always favoring the best ordering.

For brevity, we omit here a similar experiment in which we fixed costs and modified selectivity over time. The results were similar, except that changing only the selectivity of two operators results in less dramatic benefits for an adaptive scheme. This can be seen analytically, for two operators of cost c whose selectivities are swapped from low to high in a manner analogous to our previous experiment. To lower-bound the performance of either static ordering, selectivities should be toggled to their extremes (100% and 0%) for equal amounts of time, so that half the n tuples go through both operators. Either static plan thus takes $nc + \frac{1}{2}nc$ time, whereas an optimal

[Figure 12 plot: execution time of plan (secs); bars: RS First, Eddy, ST First.]

Figure 12: Adapting to an initial delay on R: performance.

[Figure 13 plot: cumulative % of tuples routed to ST first versus % of S tuples seen.]

Figure 13: Adapting to an initial delay on R: tuple movement.

dynamic plan takes $nc$ time, a ratio of only 3/2. With more operators, adaptivity to changes in selectivity can become more significant, however.

4.5.1 Delayed Delivery

As a final experiment, we study the case where an input relation suffers from an initial delay, as in [AFTU96, UFA98]. We return to the 3-table query shown in the left of Figure 8, with the RS selectivity at 100%, and the ST selectivity at 20%. We delay the delivery of R by 10 seconds; the results are shown in Figure 12. Unfortunately, we see here that our eddy – even with a lottery and a window-based forgetting scheme – does not adapt to initial delays of R as well as it could. Figure 13 tells some of the story: in the early part of processing, the eddy incorrectly favors the RS join, even though no R tuples are streaming in, and even though the RS join should appear second in a normal execution (Figure 8). The eddy does this because it observes that the RS join does not produce any output tuples when given S tuples. So the eddy awards most S tuples to the RS join initially, which places them in an internal hash table to be subsequently joined with R tuples when they arrive. The ST join is left to fetch and hash T tuples. This wastes resources that could have been spent joining S tuples with T tuples during the delay, and “primes” the RS join to produce a large number of tuples once the Rs begin appearing.

Note that the eddy does far better than pessimally: when R begins producing tuples (at 43.5 on the x axis of Figure 13), the S values bottled up in the RS join burst forth, and the eddy quickly throttles the RS join, allowing the ST join to process most tuples first. This scenario indicates two problems with our implementation. First, our ticket scheme does not capture the growing selectivity inherent in a join with a delayed input. Second, storing tuples inside the hash tables of a single join unnecessarily prevents other joins from processing them; it might be conceivable to hash input tuples within multiple joins, if care were taken to prevent duplicate results from being generated. A solution to the second problem might obviate the need to solve the first; we intend to explore these issues further in future work.

For brevity, we omit here a variation of this experiment, in which we delayed the delivery of S by 10 seconds instead of R. In this case, the delay of S affects both joins identically, and simply slows down the completion time of all plans by about 10 seconds.

5 Related Work

To our knowledge, this paper represents the first general query processing scheme for reordering in-flight operators within a pipeline, though [NWMN99] considers the special case of unary operators. Our characterization of barriers and moments of symmetry also appears to be new, arising as it does from our interest in reoptimizing general pipelines.

Recent papers consider reoptimizing queries at the ends of pipelines [UFA98, KD98, IFF+99], reordering operators only after temporary results are materialized. [IFF+99] observantly notes that this approach dates back to the original INGRES query decomposition scheme [SWK76]. These inter-pipeline techniques are not adaptive in the sense used in traditional control theory (e.g., [Son98]) or machine learning (e.g., [Mit97]); they make decisions without any ongoing feedback from the operations they are to optimize, instead performing static optimizations at coarse-grained intervals in the query plan. One can view these efforts as complementary to our work: eddies can be used to do tuple scheduling within pipelines, and techniques like those of [UFA98, KD98, IFF+99] can be used to reoptimize across pipelines. Of course such a marriage sacrifices the simplicity of eddies, requiring both the traditional complexity of cost estimation and plan enumeration along with the ideas of this paper. There are also significant questions on how best to combine these techniques – e.g., how many materialization operators to put in a plan, which operators to put in which eddy pipelines, etc.

DEC Rdb (subsequently Oracle Rdb) used competition to choose among different access methods [AZ96]. Rdb briefly observed the performance of alternative access methods at runtime, and then fixed a “winner” for the remainder of query execution. This bears a resemblance to sampling for cost estimation (see [BDF+97] for a survey). More distantly related is the work on “parameterized” or “dynamic” query plans, which postpone some optimization decisions until the beginning of query execution [INSS97, GC94].

The initial work on Query Scrambling [AFTU96] studied network unpredictabilities in processing queries over wide-area sources. This work materialized remote data while processing was blocked waiting for other sources, an idea that can be used in concert with eddies. Note that local materialization ameliorates but does not remove barriers: work to be done locally after a barrier can still be quite significant. Later work focused on rescheduling runnable sub-plans during initial delays in delivery [UFA98], but did not attempt to reorder in-flight operators as we do here.

Two out-of-core versions of the pipelined hash join have been proposed recently [IFF+99, UF99]. The X-Join [UF99] enhances the pipelined hash join not only by handling the out-of-core case, but also by exploiting delay time to aggressively match previously-received (and spilled) tuples. We intend to experiment with X-Joins and eddies in future work.

The Control project [HAC+99] studies interactive analysis of massive data sets, using techniques like online aggregation, online reordering and ripple joins. There is a natural synergy between interactive and adaptive query processing; online techniques to pipeline best-effort answers are naturally adaptive to changing performance scenarios. The need for optimizing pipelines in the Control project initially motivated our work on eddies. The Control project [HAC+99] is not explicitly related to the field of control theory [Son98], though eddies appear to link the two in some regards.

The River project [AAT+99] was another main inspiration of this work. River allows modules to work as fast as they can, naturally balancing flow to whichever modules are faster. We carried the River philosophy into the initial back-pressure design of eddies, and intend to return to the parallel load-balancing aspects of the optimization problem in future work.

In addition to commercial projects like those in Section 1.2, there have been numerous research systems for heterogeneous data integration, e.g. [GMPQ+97, HKWY97, IFF+99], etc.

6 Conclusions and Future Work

Query optimization has traditionally been viewed as a coarse-grained, static problem. Eddies are a query processing mechanism that allows fine-grained, adaptive, online optimization. Eddies are particularly beneficial in the unpredictable query processing environments prevalent in massive-scale systems, and in interactive online query processing. They fit naturally with algorithms from the Ripple Join family, which have frequent moments of symmetry and adaptive or non-existent synchronization barriers. Eddies can be used as the sole optimization mechanism in a query processing system, obviating the need for much of the complex code required in a traditional query optimizer. Alternatively, eddies can be used in concert with traditional optimizers to improve adaptability within pipelines. Our initial results indicate that eddies perform well under a variety of circumstances, though some questions remain in improving reaction time and in adaptively choosing join orders with delayed sources. We are sufficiently encouraged by these early results that we are using eddies and rivers as the basis for query processing in the Telegraph system.

In order to focus our energies in this initial work, we have explicitly postponed a number of questions in understanding, tuning, and extending these results. One main challenge is to develop eddy “ticket” policies that can be formally proved to converge quickly to a near-optimal execution in static scenarios, and that adaptively converge when conditions change. This challenge is complicated by considering both selections and joins, including hash joins that “absorb” tuples into their hash tables as in Section 4.5.1. We intend to focus on multiple performance metrics, including time to completion, the rate of output from a plan, and the rate of refinement for online aggregation estimators. We have also begun studying schemes to allow eddies to effectively order dependent predicates, based on reinforcement learning [SB98]. In a related vein, we would like to automatically tune the aggressiveness with which we forget past observations, so that we avoid introducing a tuning knob to adjust window-length or some analogous constant (e.g., a hysteresis factor).

Another main goal is to attack the remaining static aspects of our scheme: the “pre-optimization” choices of spanning tree, join algorithms, and access methods. Following [AZ96], we believe that competition is key here: one can run multiple redundant joins, join algorithms, and access methods, and track their behavior in an eddy, adaptively choosing among them over time. The implementation challenge in that scenario relates to preventing duplicates from being generated, while the efficiency challenge comes in not wasting too many computing resources on unpromising alternatives.

A third major challenge is to harness the parallelism and adaptivity available to us in rivers. Massively parallel systems are reaching their limit of manageability, even as data sizes continue to grow very quickly. Adaptive techniques like eddies and rivers can significantly aid in the manageability of a new generation of massively parallel query processors. Rivers have been shown to adapt gracefully to performance changes in large clusters, spreading query processing load across nodes and spreading data delivery across data sources. Eddies face additional challenges to meet the promise of rivers: in particular, reoptimizing queries with intra-operator parallelism entails repartitioning data, which adds an expense to reordering that was not present in our single-site eddies. An additional complication arises when trying to adaptively adjust the degree of partitioning for each operator in a plan. On a similar note, we would like to explore enhancing eddies and rivers to tolerate failures of sources or of participants in parallel execution.

Finally, we are exploring the application of eddies and rivers to the generic space of dataflow programming, including applications such as multimedia analysis and transcoding, and the composition of scalable, reliable internet services [GWBC99]. Our intent is for rivers to serve as a generic parallel dataflow engine, and for eddies to be the main scheduling mechanism in that environment.

Acknowledgments

Vijayshankar Raman provided much assistance in the course of this work. Remzi Arpaci-Dusseau, Eric Anderson and Noah Treuhaft implemented Euphrates, and helped implement eddies. Mike Franklin asked hard questions and suggested directions for future work. Stuart Russell, Christos Papadimitriou, Alistair Sinclair, Kris Hildrum and Lakshminarayanan Subramanian all helped us focus on formal issues. Thanks to Navin Kabra and Mitch Cherniack for initial discussions on run-time reoptimization, and to the database group at Berkeley for feedback. Stuart Russell suggested the term “eddy”.

This work was done while both authors were at UC Berkeley, supported by a grant from IBM Corporation, NSF grant IIS-9802051, and a Sloan Foundation Fellowship. Computing and network resources for this research were provided through NSF RI grant CDA-9401156.

References

[AAC+97] A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-Performance Sorting on Networks of Workstations. In Proc. ACM-SIGMOD International Conference on Management of Data, Tucson, May 1997.

[AAT+99] R. H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D. E. Culler, J. M. Hellerstein, D. A. Patterson, and K. Yelick. Cluster I/O with River: Making the Fast Case Common. In Sixth Workshop on I/O in Parallel and Distributed Systems (IOPADS ’99), pages 10–22, Atlanta, May 1999.

[AFTU96] L. Amsaleg, M. J. Franklin, A. Tomasic, and T. Urhan. Scrambling Query Plans to Cope With Unexpected Delays. In 4th International Conference on Parallel and Distributed Information Systems (PDIS), Miami Beach, December 1996.

[AH99] R. Avnur and J. M. Hellerstein. Continuous query optimization. Technical Report CSD-99-1078, University of California, Berkeley, November 1999.

[Aok99] P. M. Aoki. How to Avoid Building DataBlades That Know the Value of Everything and the Cost of Nothing. In 11th International Conference on Scientific and Statistical Database Management, Cleveland, July 1999.

[AZ96] G. Antoshenkov and M. Ziauddin. Query Processing and Optimization in Oracle Rdb. VLDB Journal, 5(4):229–237, 1996.

[Bar99] R. Barnes. Scale Out. In High Performance Transaction Processing Workshop (HPTS ’99), Asilomar, September 1999.

[BDF+97] D. Barbara, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. E. Ioannidis, H. V. Jagadish, T. Johnson, R. T. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey Data Reduction Report. IEEE Data Engineering Bulletin, 20(4), December 1997.

[BO99] J. Boulos and K. Ono. Cost Estimation of User-Defined Methods in Object-Relational Database Systems. SIGMOD Record, 28(3):22–28, September 1999.

[DGS+90] D. J. DeWitt, S. Ghandeharizadeh, D. Schneider, A. Bricker, H.-I Hsiao, and R. Rasmussen. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, 2(1):44–62, Mar 1990.

[DKO+84] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. Wood. Implementation Techniques for Main Memory Database Systems. In Proc. ACM-SIGMOD International Conference on Management of Data, pages 1–8, Boston, June 1984.

[FMLS99] D. Florescu, I. Manolescu, A. Levy, and D. Suciu. Query Optimization in the Presence of Limited Access Patterns. In Proc. ACM-SIGMOD International Conference on Management of Data, Philadelphia, June 1999.

[GC94] G. Graefe and R. Cole. Optimization of Dynamic Query Evaluation Plans. In Proc. ACM-SIGMOD International Conference on Management of Data, Minneapolis, 1994.

[GMPQ+97] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. Journal of Intelligent Information Systems, 8(2):117–132, March 1997.

[Gra90] G. Graefe. Encapsulation of Parallelism in the Volcano Query Processing System. In Proc. ACM-SIGMOD International Conference on Management of Data, pages 102–111, Atlantic City, May 1990.

[GWBC99] S. D. Gribble, M. Welsh, E. A. Brewer, and D. Culler. The MultiSpace: an Evolutionary Platform for Infrastructural Services. In Proceedings of the 1999 Usenix Annual Technical Conference, Monterey, June 1999.

[HAC+99] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber, C. Olston, V. Raman, T. Roth, and P. J. Haas. Interactive Data Analysis: The Control Project. IEEE Computer, 32(8):51–59, August 1999.

[Hel98] J. M. Hellerstein. Optimization Techniques for Queries with Expensive Methods. ACM Transactions on Database Systems, 23(2):113–157, 1998.

[HH99] P. J. Haas and J. M. Hellerstein. Ripple Joins for Online Aggregation. In Proc. ACM-SIGMOD International Conference on Management of Data, pages 287–298, Philadelphia, 1999.

[HKWY97] L. Haas, D. Kossmann, E. Wimmers, and J. Yang. Optimizing Queries Across Diverse Data Sources. In Proc. 23rd International Conference on Very Large Data Bases (VLDB), Athens, 1997.

[HSC99] J. M. Hellerstein, M. Stonebraker, and R. Caccia. Open, Independent Enterprise Data Integration. IEEE Data Engineering Bulletin, 22(1), March 1999. http://www.cohera.com.

[IFF+99] Z. G. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld. An Adaptive Query Execution System for Data Integration. In Proc. ACM-SIGMOD International Conference on Management of Data, Philadelphia, 1999.

[IK84] T. Ibaraki and T. Kameda. Optimal Nesting for Computing N-relational Joins. ACM Transactions on Database Systems, 9(3):482–502, October 1984.

[INSS97] Y. E. Ioannidis, R. T. Ng, K. Shim, and T. K. Sellis. Parametric Query Optimization. VLDB Journal, 6(2):132–151, 1997.

[KBZ86] R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of Nonrecursive Queries. In Proc. 12th International Conference on Very Large Databases (VLDB), pages 128–137, August 1986.

[KD98] N. Kabra and D. J. DeWitt. Efficient Mid-Query Reoptimization of Sub-Optimal Query Execution Plans. In Proc. ACM-SIGMOD International Conference on Management of Data, pages 106–117, Seattle, 1998.

[Met97] R. Van Meter. Observing the Effects of Multi-Zone Disks. In Proceedings of the Usenix 1997 Technical Conference, Anaheim, January 1997.

[Mit97] T. Mitchell. Machine Learning. McGraw Hill, 1997.

[NWMN99] K. W. Ng, Z. Wang, R. R. Muntz, and S. Nittel. Dynamic Query Re-Optimization. In 11th International Conference on Scientific and Statistical Database Management, Cleveland, July 1999.

[RPK+99] B. Reinwald, H. Pirahesh, G. Krishnamoorthy, G. Lapis, B. Tran, and S. Vora. Heterogeneous Query Processing Through SQL Table Functions. In 15th International Conference on Data Engineering, pages 366–373, Sydney, March 1999.

[RRH99] V. Raman, B. Raman, and J. M. Hellerstein. Online Dynamic Reordering for Interactive Data Processing. In Proc. 25th International Conference on Very Large Data Bases (VLDB), pages 709–720, Edinburgh, 1999.

[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998.

[SBH98] M. Stonebraker, P. Brown, and M. Herbach. Interoperability, Distributed Applications, and Distributed Databases: The Virtual Table Interface. IEEE Data Engineering Bulletin, 21(3):25–34, September 1998.

[Son98] E. D. Sontag. Mathematical Control Theory: Deterministic Finite-Dimensional Systems, Second Edition. Number 6 in Texts in Applied Mathematics. Springer-Verlag, New York, 1998.

[SWK76] M. R. Stonebraker, E. Wong, and P. Kreps. The Design and Implementation of INGRES. ACM Transactions on Database Systems, 1(3):189–222, September 1976.

[UF99] T. Urhan and M. Franklin. XJoin: Getting Fast Answers From Slow and Bursty Networks. Technical Report CS-TR-3994, University of Maryland, February 1999.

[UFA98] T. Urhan, M. Franklin, and L. Amsaleg. Cost-Based Query Scrambling for Initial Delays. In Proc. ACM-SIGMOD International Conference on Management of Data, Seattle, June 1998.

[WA91] A. N. Wilschut and P. M. G. Apers. Dataflow Query Execution in a Parallel Main-Memory Environment. In Proc. First International Conference on Parallel and Distributed Info. Sys. (PDIS), pages 68–77, 1991.

[WW94] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share resource management. In Proc. of the First Symposium on Operating Systems Design and Implementation (OSDI ’94), pages 1–11, Monterey, CA, November 1994. USENIX Assoc.
