t-digest algorithm

Post on 17-Aug-2015


COMPUTING EXTREMELY ACCURATE QUANTILES USING t-DIGESTS

TED DUNNING AND OTMAR ERTL

Abstract. An on-line algorithm for computing approximations of rank-based statistics is presented that allows controllable accuracy. Moreover, this new algorithm can be used to compute hybrid statistics such as trimmed means in addition to computing arbitrary quantiles. An unusual property of the method is that it allows a quantile q to be computed with an accuracy relative to q(1 − q) rather than with an absolute accuracy as with most methods. This new algorithm is robust with respect to highly skewed distributions or highly ordered datasets and allows separately computed summaries to be combined with no loss in accuracy.

An open-source implementation of this algorithm is available from the author.

1. Introduction

Given a sequence of numbers, it is often desirable to compute rank-based statistics such as the median, 95-th percentile or trimmed means in an on-line fashion, keeping only a small data structure in memory. Traditionally, such statistics were often computed by sorting all of the data and then either finding the quantile of interest by interpolation or by processing all samples within particular quantiles. This sorting approach is infeasible for very large datasets, which has led to interest in on-line approximate algorithms.

One early algorithm for computing on-line quantiles is described in [CLP00]. In that work specific quantiles were computed by incrementing or decrementing an estimate by a value proportional to the simultaneously estimated probability density at the desired quantile. This method is plagued by a circularity in that estimating density is only possible by estimating yet more quantiles. Moreover, this work did not allow the computation of hybrid quantities such as trimmed means.

[GK01] Munro and Paterson provided an alternative algorithm to get an accurate estimate of any particular quantile. This is done by keeping s samples from the N samples seen so far where s [...]

    [...] K/δ:
        C ← cluster(permute(C))
    C ← cluster(permute(C))
    return C

This will happen because each new value of X will always form a new centroid because q(1 − q) ≈ 0.
To avoid this pathology, if the number of centroids becomes excessive, the set of centroids is collapsed by recursively applying the t-digest algorithm to the centroids themselves after randomizing the order of the centroids.

In all cases, after passing through all of the data points, the centroids are recursively clustered one additional time. This allows adjacent centroids to be merged if they do not violate the size bound. This final pass typically reduces the number of centroids by 20-40% with no apparent change in accuracy.

2.2. Accuracy Considerations. Initial versions of this algorithm tried to use the centroid index i as a surrogate for q, applying a correction to account for the fact that extreme centroids have less weight. Unfortunately, it was difficult to account for the fact that the distribution of centroid weights changes during the course of running the algorithm. Initially all weights are 1. Later, some weights become substantially larger. This means that the relationship between i and q changes from completely linear to highly non-linear in a stochastic way. This made it difficult to avoid a cutoff for centroids that was too large or too small, resulting in either too many centroids or poor accuracy.

The key property of this algorithm is that the final list of centroids C is very close to what would have been obtained by simply sorting and then grouping adjacent values in X into groups with the desired size bounds. Clearly, such groups could be used to compute all of the rank statistics of interest here and if there are bounds on the sizes of the groups, then we have comparable bounds on the accuracy of the rank statistics in question.

[Figure 1: three panels (Uniform Distribution, Γ(0.1, 0.1) Distribution, Sequential Distribution), each plotting centroid size (0 to 1000) against quantile (0 to 1).]

Figure 1. The t-digest algorithm respects the size limit for centroids. The solid grey line indicates the size limit. These diagrams also show actual centroid weights for 5 test runs on 100,000 samples from a uniform, Γ(0.1, 0.1) and sequential uniform distribution. In spite of the underlying distribution being skewed by roughly 30 orders of magnitude of difference in probability density for the Γ distribution, the centroid weight distribution is bounded and symmetric as intended. For the sequential uniform case, values are produced in sequential order with three passes through the [0, 1] interval with no repeated values. In spite of the lack of repeats, successive passes result in many centroids at the quantiles with the same sizes. In spite of this, sequential presentation of data results in only a small asymmetry in the resulting size distribution and no violation of the intended size limit.

That this algorithm does produce such an approximation is more difficult to prove rigorously, but an empirical examination is enlightening. Figure 2 shows the deviation of samples assigned to centroids for uniform and highly skewed distributions. These deviations are normalized by half the distance between the two adjacent centroids. This relatively uniform distribution for deviations among the samples assigned to a centroid is found for uniformly distributed samples as well as for extremely skewed data. For instance, the Γ(0.1, 0.1) distribution has a 0.01 quantile of 6.07 × 10^−20, a median of 0.006 and a mean of 1. This means that the distribution is very skewed. In spite of this, samples assigned to centroids near the first percentile are not noticeably skewed within the centroid. The impact of this uniform distribution is that linear interpolation allows accuracy considerably better than q(1 − q)/δ.

[Figure 2: three histograms of sample deviation on a horizontal axis from −1.0 to 1.0, with panels "Uniform, q = 0.3 ... 0.7", "Gamma(0.1, 0.1), q = 0.3 ... 0.7" and "Gamma, q = 0.01".]

Figure 2. The deviation of samples assigned to a single centroid. The horizontal axis is scaled to the distance to the adjacent centroid so a value of 0.5 is half-way between the two centroids. There are two significant observations to be had here. The first is that relatively few points are assigned to a centroid that are beyond the midpoint to the next cluster. This bears on the accuracy of this algorithm. The second observation is that samples are distributed relatively uniformly between the boundaries of the cell. This affects the interpolation method to be chosen when working with quantiles. The three graphs show, respectively, centroids from q ∈ [0.3, 0.7] from a uniform distribution, centroids from the same range of a highly skewed Γ(0.1, 0.1) and centroids from q ∈ [0.01, 0.015] in a Γ(0.1, 0.1) distribution. This last range is in a region where skew on the order of 10^22 is found.

2.3. Finding the cumulative distribution at a point. Algorithm 2 shows how to compute the cumulative distribution $P(x) = \int_{-\infty}^{x} p(\alpha)\,d\alpha$ for a given value of x by summing the contribution of uniform distributions centered at each of the centroids. Each of the centroids is assumed to extend symmetrically around the mean for half the distance to the adjacent centroid.

For all centroids except one, this contribution will be either 0 or c_i.count/N, and the one centroid which straddles the desired value of x will have a pro rata contribution somewhere between 0 and c_i.count/N. Moreover, since each centroid has count at most δN, the estimate of q should be accurate on a scale of δ. Typically, the accuracy will be even better due to the interpolation scheme used in the algorithm and because the largest centroids are only for values of q near 0.5.

The empirical justification for using a uniform distribution for each centroid can be seen by referring again to Figure 2.

Algorithm 2: Estimate quantile C.quantile(x)
    Input: Centroids derived from distribution p(x), C = [... [m_i, s_i, k_i] ...], value x
    Output: Estimated value of $q = \int_{-\infty}^{x} p(\alpha)\,d\alpha$
    t = 0, N = Σ_i k_i
    for i ∈ 1 ... m:
        if i < m:
            Δ ← (c_{i+1}.mean − c_i.mean)/2
        else:
            Δ ← (c_i.mean − c_{i−1}.mean)/2
        z = max(−1, (x − m_i)/Δ)
        if z < 1:
            return t/N + (k_i/N)(z + 1)/2
        t ← t + k_i
    return 1

2.4. Inverting the cumulative distribution. Computing an approximation of the q quantile of the data points seen so far can be done by ordering the centroids by ascending mean. The running sum of the centroid counts will range from 0 to $N = \sum_i c_i.\mathrm{count}$. One particular centroid will have a count that straddles the desired quantile q and interpolation can be used to estimate a value of x. This is shown in Algorithm 3. Note that at the extreme ends of the distribution as observed, each centroid will represent a single sample so maximum resolution in q will be retained.

Algorithm 3: Estimate value at given quantile C.icdf(q)
    Input: Centroids derived from distribution p(x), C = [c_1 ... c_m], value q
    Output: Estimated value x such that $q = \int_{-\infty}^{x} p(\alpha)\,d\alpha$
    t = 0, q ← q Σ_i c_i.count
    for i ∈ 1 ... m:
        k_i = c_i.count
        m_i = c_i.mean
        if q < t + k_i:
            if i = 1:
                Δ ← (c_{i+1}.mean − c_i.mean)
            elif i = m:
                Δ ← (c_i.mean − c_{i−1}.mean)
            else:
                Δ ← (c_{i+1}.mean − c_{i−1}.mean)/2
            return m_i + ((q − t)/k_i − 1/2) Δ
        t ← t + k_i
    return c_m.mean

2.5. Computing the trimmed mean. The trimmed mean of X for the quantile range Q = [q0, q1] can be computed by computing a weighted average of the means of centroids that have quantiles in Q. For centroids at the edge of Q, a pro rata weight is used that is based on an interpolated estimate of the fraction of the centroid's samples that are in Q. This method is shown as Algorithm 4.

Algorithm 4: Estimate trimmed mean. Note how centroids at the boundary are included on a pro rata basis.
    Input: Centroids derived from distribution p(x), C = [... [m_i, s_i, k_i] ...], limit values q0, q1
    Output: Estimate of mean of values x ∈ [q0, q1]
    s = 0, k = 0
    t = 0, q0 ← q0 Σ_i k_i, q1 ← q1 Σ_i k_i
    for i ∈ 1 ... m:
        k_i = c_i.count
        if q0 < t + k_i:
            if i > 1:
                Δ ← (c_{i+1}.mean − c_{i−1}.mean)/2
            elif i < m:
                Δ ← (c_{i+1}.mean − c_i.mean)
            else:
                Δ ← (c_i.mean − c_{i−1}.mean)
            η = ((q0 − t)/k_i − 1/2) Δ
            s ← s + η k_i c_i.mean
            k ← k + η k_i
        if q1 < t + k_i:
            if i > 1:
                Δ ← (c_{i+1}.mean − c_{i−1}.mean)/2
            elif i < m:
                Δ ← (c_{i+1}.mean − c_i.mean)
            else:
                Δ ← (c_i.mean − c_{i−1}.mean)
            η = (1/2 − (q1 − t)/k_i) Δ
            s ← s − η k_i c_i.mean
            k ← k − η k_i
        t ← t + k_i
    return s/k

3. Empirical Assessment

3.1. Accuracy of estimation for uniform and skewed distributions. Figure 3 shows the error levels achieved with t-digest in estimating quantiles of 100,000 samples from a uniform and from a skewed distribution. In these experiments δ = 0.01 was used since it provides a good compromise between accuracy and space. There is no visible difference in accuracy between the two underlying distributions in spite of the fact that the underlying densities differ by roughly 30 orders of magnitude. The accuracy shown here is computed by comparing the computed quantiles to the actual empirical quantiles for the sample used for testing and is shown in terms of q rather than the underlying sample value. At extreme values of q, the actual samples are preserved as centroids with weight 1 so the observed error for these extreme values is zero relative to the original data. For the data shown here, at q = 0.001, the maximum weight on a centroid is just above 4 and centroids in this range have all possible weights from 1 to 4. Errors are limited to, not surprisingly, just a few parts per million or less. For more extreme quantiles, the centroids will have fewer samples and the results will typically be exact.

[Figure 3: two panels (Uniform, Γ(0.1, 0.1)) plotting quantile error in ppm (−2000 to 2000) against quantile q at 0.001, 0.01, 0.1, 0.5, 0.9, 0.99 and 0.999.]

Figure 3. The absolute error of the estimate of the cumulative distribution function $q = \int_{-\infty}^{x} p(\alpha)\,d\alpha$ for the uniform and Γ distribution for 5 runs, each with 100,000 data points. As can be seen, the error is dramatically decreased for very high or very low quantiles (to a few parts per million). The precision setting used here, 1/δ = 100, would result in uniform error of 10,000 ppm without adaptive bin sizing and interpolation.

Obviously, with the relatively small numbers of samples such as are used in these experiments, the accuracy of t-digests for estimating quantiles of the underlying distribution cannot be better than the accuracy of these estimates computed using the sample data points themselves. For the experiments here, the errors due to sampling completely dominate the errors introduced by t-digests, especially at extreme values of q. For much larger sample sets of billions of samples or more, this would be less true and the errors shown here would represent the accuracy of approximating the underlying distribution.

It should be noted that a Q-digest implemented with long integers is only able to store data with no more than 20 significant decimal figures. The implementation in stream-lib only retains 48 bits of significand, allowing only about 16 significant figures. This means that such a Q-digest would be inherently unable to even estimate the quantiles of the Γ distribution tested here.

3.2. Persisting t-digests. For the accuracy setting and test data used in these experiments, the t-digest contained 820-860 centroids. The results of t-digest can thus be stored by storing this many centroid means and weights. If centroids are kept as double precision floating point numbers and counts kept as 4-byte integers, the t-digest resulting from the accuracy tests described here would require about 10 kilobytes of storage for any of the distributions tested.

This size can be substantially decreased, however. One simple option is to store differences between centroid means and to use a variable byte encoding such as zig-zag encoding to store the cluster size.
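The storage option just described can be sketched as follows. This is a minimal illustration, assuming a digest is simply a list of (mean, count) pairs sorted by mean. The names `encode_digest` and `decode_digest`, the byte layout, and the choice of a plain unsigned variable-byte integer for the counts (zig-zag mapping is only needed for signed values) are assumptions made for this example and are not the format of the authors' implementation.

```python
import struct


def write_varint(n, out):
    """Append a non-negative integer to out using 7 payload bits per byte."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)


def read_varint(buf, pos):
    """Decode one variable-byte integer at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return value, pos
        shift += 7


def encode_digest(centroids):
    """Serialize (mean, count) pairs as float32 deltas plus varint counts."""
    out = bytearray()
    write_varint(len(centroids), out)
    prev = 0.0
    for mean, count in centroids:
        packed = struct.pack("<f", mean - prev)
        out += packed
        # Advance by the rounded delta the decoder will see, so that
        # single-precision rounding errors do not accumulate.
        prev += struct.unpack("<f", packed)[0]
        write_varint(count, out)
    return bytes(out)


def decode_digest(blob):
    """Inverse of encode_digest."""
    n, pos = read_varint(blob, 0)
    centroids = []
    prev = 0.0
    for _ in range(n):
        (delta,) = struct.unpack_from("<f", blob, pos)
        pos += 4
        count, pos = read_varint(blob, pos)
        prev += delta
        centroids.append((prev, count))
    return centroids
```

Each centroid then costs 4 bytes for the mean plus one or two bytes for a typical count, compared with 12 bytes for a double and a 4-byte integer. Tracking the decoder's reconstructed value inside the encoder keeps the error in each mean at the rounding error of a single delta.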
The differences between successive means are at least three orders of magnitude smaller than the means themselves, so using single precision floating point to store these differences can allow the t-digest from the tests described here to be stored in about 4.6 kilobytes while still retaining nearly 10 significant figures of accuracy in the means. This is roughly equivalent to the precision possible with a Q-digest operating on 32 bit integers, but the dynamic range of t-digests will be considerably higher and the accuracy is considerably better.

3.3. Space/Accuracy Trade-off. Not surprisingly, there is a strong trade-off between the size of the t-digest as controlled by the compression parameter 1/δ and the accuracy with which quantiles are estimated. Quantiles at 0.999 and above or 0.001 or below were estimated to within a small fraction of 1% regardless of digest size. Accurate estimates of the median require substantially larger digests. Figure 4 shows this basic trade-off.

The size of the resulting digest depends strongly on the compression parameter 1/δ as shown in the left panel of Figure 5. Size of the digest also grows roughly with the log of the number of samples observed, at least in the range of 10,000 to 10,000,000 samples shown in the right panel of Figure 5.

3.4. Computing t-digests in parallel. With large scale computations, it is important to be able to compute aggregates like the t-digest on portions of the input and then combine those aggregates.

For example, in a map-reduce framework such as Hadoop, a combiner function can compute the t-digest for the output of each mapper and a single reducer can be used to compute the t-digest for the entire data set.

Another example can be found in certain databases such as Couchbase or Druid which maintain tree structured indices and allow the programmer to specify that particular aggregates of the data being stored can be kept at interior nodes of the index.
The benefit of this is that aggregation can be done almost instantaneously over any contiguous sub-range of the index. The cost is quite modest with only a O(log(N)) total increase in effort over keeping a running aggregate of all data. In many practical cases, the tree can contain only two or three levels and still provide fast enough response. For instance, if it is desired to be able to compute quantiles for any period up to years in 30 second increments, simply keeping higher level t-digests at the level of 30 seconds and days is likely to be satisfactory because at most about 10,000 digests are likely to need merging even for particularly odd intervals. If almost all queries over intervals longer than a few weeks are day aligned, the number of digests that must be merged drops to a few thousand.

[Figure 4: three panels (q = 0.5, q = 0.01, q = 0.001) plotting quantile error (−0.04 to 0.04) against t-digest size in kB (0.5 to 50, logarithmic).]

Figure 4. Accuracy is good for extreme quantiles regardless of digest size. From left to right, these panels show accuracy of estimates for q = 0.5, 0.01 and 0.001 as a function of the serialized size of the t-digest. Due to symmetry, these are equivalent to accuracy for q = 0.5, 0.99, and 0.999 as well. For mid quantiles such as the median (q = 0.5), moderate digest sizes of a few kilobytes suffice to get better than 1% accuracy, but a digest must be 20 kB or more to reliably achieve 0.1% accuracy. In contrast, accuracy for the 0.1%-ile (or 99.9%-ile) reaches a few parts per million for digests larger than about 5 kB. Note that errors for q = 0.5 and digests smaller than 1 kB are off the scale shown here at nearly 10%. All panels were computed using 100 runs with 100,000 samples. Compression parameter (1/δ) was varied from 2 to 1000 in order to vary the size of the resulting digest. Sizes shown were encoded using 4 byte floating point delta encoding for the centroid means and variable byte length integer encoding.

Merging t-digests can be done many ways.
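One such way is to pool the centroids of the input digests, shuffle them, and add them with their weights to a fresh digest. The sketch below illustrates that idea with a toy structure. `TinyDigest`, `merge` and `quantile` are hypothetical names, and the size limit `4 * n * delta * q * (1 - q)` is an assumption of this example, since the statement of the size bound does not survive in this transcript. `quantile` follows the interpolation of the C.icdf(q) pseudocode in Section 2.4.

```python
import bisect
import random


class TinyDigest:
    """Toy digest: parallel lists of centroid means (sorted) and counts."""

    def __init__(self, delta=0.01):
        self.delta = delta
        self.means = []
        self.counts = []
        self.n = 0

    def add(self, x, w=1):
        """Add value x with weight w to the nearest centroid if it has room."""
        self.n += w
        i = bisect.bisect_left(self.means, x)
        best = None
        for j in (i - 1, i):  # only the two neighbours can be nearest
            if 0 <= j < len(self.means):
                if best is None or abs(self.means[j] - x) < abs(self.means[best] - x):
                    best = j
        if best is not None:
            # Approximate quantile of the candidate centroid.
            q = (sum(self.counts[:best]) + self.counts[best] / 2) / self.n
            limit = 4 * self.n * self.delta * q * (1 - q)  # assumed size bound
            if self.counts[best] + w <= limit:
                total = self.counts[best] + w
                self.means[best] += (x - self.means[best]) * w / total
                self.counts[best] = total
                return
        self.means.insert(i, x)
        self.counts.insert(i, w)


def merge(digests, delta=0.01, seed=0):
    """Pool all centroids, shuffle them, and re-add them with their weights."""
    pooled = [(m, c) for d in digests for m, c in zip(d.means, d.counts)]
    random.Random(seed).shuffle(pooled)
    out = TinyDigest(delta)
    for mean, count in pooled:
        out.add(mean, count)
    return out


def quantile(d, q):
    """Estimate the value at quantile q by interpolating within one centroid."""
    target = q * d.n
    t = 0
    m = len(d.means)
    for i in range(m):
        k = d.counts[i]
        if target < t + k:
            if m == 1:
                return d.means[0]
            if i == 0:
                width = d.means[1] - d.means[0]
            elif i == m - 1:
                width = d.means[i] - d.means[i - 1]
            else:
                width = (d.means[i + 1] - d.means[i - 1]) / 2
            return d.means[i] + ((target - t) / k - 0.5) * width
        t += k
    return d.means[-1]
```

The shuffle matters because centroids taken from one digest arrive in sorted order, which is the kind of highly ordered input the paper identifies as a pathology. This toy version omits the recursive re-clustering step, so its centroid count is not tightly controlled.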
The algorithm whose results are shown here consisted of simply making a list of all centroids from the t-digests being merged, shuffling that list, and then adding these centroids, preserving their weights, to a new t-digest.

3.5. Comparison with Q-digest. The prototype implementation of the t-digest completely dominates the implementation of the Q-digest from the popular stream-lib package [Thi] when size of the resulting digest is controlled. This is shown in Figure 7. In the left panel, the effect of the compression parameter for Q-digest is compared to that of the similar parameter 1/δ for the t-digest. For the same value of compression parameter, the sizes of the two digests are always within a factor of 2 for practical uses. The middle and right panels show accuracies for uniform and Γ distributions.

[Figure 5: left panel plots digest size in kB (0.1 to 100, logarithmic) against 1/δ (2 to 500) for 10k and 10M samples; right panel, for 1/δ = 100, plots size in kB (0 to 20) against number of samples (×1000, from 10 to 10,000).]

Figure 5. Size of the digest scales sub-linearly with compression parameter (α ≈ 0.7 ... 0.9) for fixed number of points. Size scales approximately logarithmically with number of points for fixed compression parameter. The panel on the right is for 1/δ = 100. The dashed lines show best-fit log-linear models. In addition, the right panel shows the memory size required for the GK01 algorithm if 6 specific quantiles are desired.

As expected, the t-digest has very good accuracy for extreme quantiles while the Q-digest has constant error across the range. Interestingly, even at best the accuracy of the Q-digest is roughly an order of magnitude worse than that of the t-digest. At worst, with extreme values of q, accuracy is several orders of magnitude worse. This situation is even worse with a highly skewed distribution such as the Γ(0.1, 0.1) shown in the right panel. Here, the very high dynamic range introduces severe quantization errors into the results.
This quantization is inherent in the use of integers in the Q-digest.

For higher compression parameter values, the size of the Q-digest becomes up to two times smaller than the t-digest, but no improvement in the error rates is observed.

3.6. Speed. The current implementation has been primarily optimized for ease of development, not execution speed. As it is, running on a single core of a 2.3 GHz Intel Core i5, it typically takes 2-3 microseconds to process each point after JIT optimization has come into effect. It is to be expected that a substantial improvement in speed could be had by profiling and cleaning up the prototype code.

[Figure 6: three panels for 1/δ = 50 with the data split into 5, 20 and 100 parts, each plotting error in quantile in ppm (−10000 to 10000) against quantile q (0.001 to 0.5) for "Direct" and "Merged" digests.]

Figure 6. Accuracy of a t-digest accumulated directly is nearly the same as when the digest is computed by combining digests from 20 or 100 equal sized portions of the data. Repeated runs of this test occasionally show the situation seen in the left panel where digests formed from 5 partial digests show slightly worse accuracy than the non-subdivided case. This sub-aggregation property allows efficient use of the t-digest in map-reduce and database applications. Of particular interest is the fact that accuracy actually improves when the input data is broken into many parts as is shown in the right hand panel. All panels were computed by 40 repetitions of aggregating 100,000 values. Accuracy for directly accumulated digests is shown on the left of each pair with the white bar and the digest of digest accuracy is shown on the right of each pair by the dark gray bar.

4. Conclusion

The t-digest is a novel algorithm that dominates the previously state-of-the-art Q-digest in terms of accuracy and size. The t-digest provides accurate on-line estimates of a variety of rank-based statistics including quantiles and trimmed mean. The algorithm is simple and empirically demonstrated to exhibit accuracy as predicted on theoretical grounds.
It is also suitable for parallel applications. Moreover, the t-digest algorithm is inherently on-line while the Q-digest is an on-line adaptation of an off-line algorithm.

The t-digest algorithm is available in the form of an open source, well-tested implementation from the author. It has already been adopted by the Apache Mahout and stream-lib projects and is likely to be adopted by other projects as well.

[Figure 7: left panel plots Q-digest size against t-digest size in bytes (100 to 100K, logarithmic) for a range of 1/δ values; middle (Uniform, 1/δ = 50) and right (Γ(0.1, 0.1), 1/δ = 50) panels plot quantile error in ppm against quantile (0.001 to 0.99) for Q-digest and t-digest.]

Figure 7. The left panel shows the size of a serialized Q-digest versus the size of a serialized t-digest for various values of 1/δ from 2 to 100,000. The sizes for the two kinds of digest are within a factor of 2 for all compression levels. The middle and right panels show the accuracy for a particular setting of 1/δ for Q-digest and t-digest. For each quantile, the Q-digest accuracy is shown as the left bar in a pair and the t-digest accuracy is shown as the right bar in a pair. Note that the vertical scale in these diagrams is one or two orders of magnitude larger than in the previous accuracy graphs and that in all cases, the accuracy of the t-digest is dramatically better than that of the Q-digest even though the serialized size of each is within 10% of the other. Note that the errors in the right hand panel are systematic quantization errors introduced by the use of integers in the Q-digest algorithm. Any distribution with very large dynamic range will show the same problems.

References

[CLP00] Fei Chen, Diane Lambert, and José C. Pinheiro. Incremental quantile estimation for massive tracking. In Proceedings of KDD, pages 516-522, 2000.
[GK01] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58-66, 2001.
[Knu98] Donald E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms, page 232. Addison-Wesley, Boston, 3 edition, 1998.
[Lin] LinkedIn. Datafu: Hadoop library for large-scale data processing. https://github.com/linkedin/datafu/. [Online; accessed 20-December-2013].
[PDGQ05] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13(4):277-298, October 2005.
[SBAS04] Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri. Medians and beyond: New aggregation techniques for sensor networks. pages 239-249. ACM Press, 2004.
[Thi] Add This. Algorithms for calculating variance, online algorithm. https://github.com/addthis/stream-lib. [Online; accessed 28-November-2013].
[Wel62] B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, pages 419-420, 1962.
[Wik] Wikipedia. Algorithms for calculating variance, online algorithm. http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm. [Online; accessed 19-October-2013].
[ZW07] Qi Zhang and Wei Wang. An efficient algorithm for approximate biased quantile computation in data streams. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM '07, pages 1023-1026, New York, NY, USA, 2007. ACM.

Ted Dunning, MapR Technologies, Inc, San Jose, CA
E-mail address: ted.dunning@gmail.com
E-mail address: otmar.ertl@gmail.com
