t-digest algorithm

COMPUTING EXTREMELY ACCURATE QUANTILES USING t-DIGESTS

TED DUNNING AND OTMAR ERTL

Abstract. An on-line algorithm for computing approximations of rank-based statistics is presented that allows controllable accuracy. Moreover, this new algorithm can be used to compute hybrid statistics such as trimmed means in addition to computing arbitrary quantiles. An unusual property of the method is that it allows a quantile q to be computed with an accuracy relative to q(1 − q) rather than with an absolute accuracy as with most methods. This new algorithm is robust with respect to highly skewed distributions or highly ordered datasets and allows separately computed summaries to be combined with no loss in accuracy. An open-source implementation of this algorithm is available from the author.

1. Introduction

Given a sequence of numbers, it is often desirable to compute rank-based statistics such as the median, 95-th percentile or trimmed means in an on-line fashion, keeping only a small data structure in memory. Traditionally, such statistics were often computed by sorting all of the data and then either finding the quantile of interest by interpolation or by processing all samples within particular quantiles. This sorting approach is infeasible for very large datasets, which has led to interest in on-line approximate algorithms.

One early algorithm for computing on-line quantiles is described in [CLP00]. In that work, specific quantiles were computed by incrementing or decrementing an estimate by a value proportional to the simultaneously estimated probability density at the desired quantile. This method is plagued by a circularity in that estimating density is only possible by estimating yet more quantiles. Moreover, this work did not allow the computation of hybrid quantities such as trimmed means.

Munro and Paterson [GK01] provided an alternative algorithm to get an accurate estimate of any particular quantile.
This is done by keeping s samples from the N samples seen so far, where s ≪ N by the time the entire data set has been seen. If the data are presented in random order and if s = Θ(N^{1/2}), then Munro and Paterson's algorithm has a high probability of being able to retain a set of samples that contains the median. This algorithm can be adapted to find a number of pre-specified quantiles at the same time at proportional cost in memory. The memory consumption of the Munro-Paterson algorithm is excessive if precise results are desired. Approximate results can be had with less memory, however. A more subtle problem is that the implementation in Sawzall [PDGQ05] and the

Algorithm 1 (final steps):
  if |C| > K/δ:
    C ← cluster(permute(C))
  C ← cluster(permute(C))
  return C

This will happen because each new value of X will always form a new centroid because q(1 − q) → 0.
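The size limit that produces this behavior can be sketched numerically. A minimal sketch of the centroid weight bound k(q) = 4δNq(1 − q) follows; the constant 4 and the exact form are taken from the published t-digest algorithm rather than from the surviving text here, so treat them as assumptions:

```python
def size_limit(q: float, delta: float, n: int) -> float:
    """Maximum number of samples a centroid near quantile q may absorb.

    Assumed bound: 4 * delta * n * q * (1 - q). Near q = 0 or q = 1 the
    bound falls below 1, so extreme samples become singleton centroids.
    """
    return 4.0 * delta * n * q * (1.0 - q)

# With 1/delta = 100 and n = 100,000 points, a centroid at the median may
# hold on the order of a thousand samples, while a centroid at q = 1e-6
# may hold fewer than one, i.e. every extreme sample stands alone.
```

This is why quantiles near the extremes are estimated almost exactly: the centroids there each represent a single sample.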
To avoid this pathology, if the number of centroids becomes excessive, the set of centroids is collapsed by recursively applying the t-digest algorithm to the centroids themselves after randomizing the order of the centroids.

In all cases, after passing through all of the data points, the centroids are recursively clustered one additional time. This allows adjacent centroids to be merged if they do not violate the size bound. This final pass typically reduces the number of centroids by 20-40% with no apparent change in accuracy.

2.2. Accuracy Considerations. Initial versions of this algorithm tried to use the centroid index i as a surrogate for q, applying a correction to account for the fact that extreme centroids have less weight. Unfortunately, it was difficult to account for the fact that the distribution of centroid weights changes during the course of running the algorithm. Initially all weights are 1. Later, some weights become substantially larger. This means that the relationship between i and q changes from completely linear to highly non-linear in a stochastic way. This made it difficult to avoid too large or too small a cutoff for centroids, resulting in either too many centroids or poor accuracy.

The key property of this algorithm is that the final list of centroids C is very close to what would have been obtained by simply sorting and then grouping adjacent values in X

Figure 1. The t-digest algorithm respects the size limit for centroids. The solid grey line indicates the size limit. These diagrams also show actual centroid weights for 5 test runs on 100,000 samples from a uniform, a Γ(0.1, 0.1) and a sequential uniform distribution. In spite of the underlying distribution being skewed by roughly 30 orders of magnitude of difference in probability density, the centroid weight distribution is bounded and symmetric as intended.
For the sequential uniform case, values are produced in sequential order with three passes through the [0, 1] interval with no repeated values. In spite of the lack of repeats, successive passes result in many centroids at the same quantiles with the same sizes. In spite of this, sequential presentation of data results in only a small asymmetry in the resulting size distribution and no violation of the intended size limit.

into groups with the desired size bounds. Clearly, such groups could be used to compute all of the rank statistics of interest here, and if there are bounds on the sizes of the groups, then we have comparable bounds on the accuracy of the rank statistics in question.

That this algorithm does produce such an approximation is more difficult to prove rigorously, but an empirical examination is enlightening. Figure 2 shows the deviation of samples assigned to centroids for uniform and highly skewed distributions. These deviations are normalized by half the distance between the two adjacent centroids. This relatively uniform distribution of deviations among the samples assigned to a centroid is found for uniformly distributed samples as well as for extremely skewed data. For instance, the Γ(0.1, 0.1) distribution has a 0.01 quantile of 6.07 × 10^{−20}, a median of 0.006 and a mean of 1. This means that the distribution is very skewed. In spite of this, samples assigned to centroids near the first percentile are not noticeably skewed within the centroid. The impact of this uniform distribution is that linear interpolation allows accuracy considerably better than δ q(1 − q).

2.3. Finding the cumulative distribution at a point. Algorithm 2 shows how to compute the cumulative distribution P(x) = ∫_{−∞}^{x} p(α) dα for a given value of x by summing the contributions of uniform distributions centered at each of the centroids. Each of the centroids is assumed to extend symmetrically around the mean for half the distance to the adjacent centroid.

Figure 2. The deviation of samples assigned to a single centroid. The horizontal axis is scaled to the distance to the adjacent centroid, so a value of 0.5 is half-way between the two centroids. There are two significant observations to be had here. The first is that relatively few of the points assigned to a centroid are beyond the midpoint to the next cluster. This bears on the accuracy of this algorithm. The second observation is that samples are distributed relatively uniformly between the boundaries of the cell. This affects the interpolation method to be chosen when working with quantiles. The three graphs show, respectively, centroids from q ∈ [0.3, 0.7] from a uniform distribution, centroids from the same range of a highly skewed Γ(0.1, 0.1) distribution, and centroids from q ∈ [0.01, 0.015] in a Γ(0.1, 0.1) distribution. This last range is in a region where skew on the order of 10^{22} is found.

For all centroids except one, this contribution will be either 0 or c_i.count/N, and the one centroid which straddles the desired value of x will have a pro rata contribution somewhere between 0 and c_i.count/N. Moreover, since each centroid has count at most δN, the accuracy of q should be on a scale of δ. Typically, the accuracy will be even better due to the interpolation scheme used in the algorithm and because the largest centroids are only for values of q near 0.5.

The empirical justification for using a uniform distribution for each centroid can be seen by referring again to Figure 2.

2.4. Inverting the cumulative distribution. Computing an approximation of the q quantile of the data points seen so far can be done by ordering the centroids by ascending mean. The running sum of the centroid counts will range from 0 to N = Σ_i c_i.count.
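The computation described in Section 2.3 can be sketched in Python. This is a minimal illustration, not the author's implementation; it assumes the digest is just a list of (mean, count) pairs sorted by mean, with at least two centroids:

```python
def cdf(centroids, x):
    """Estimate P(x) from centroids: a mean-sorted list of (mean, count).

    Each centroid is treated as a uniform distribution extending half the
    distance to its neighbor on either side; assumes len(centroids) >= 2.
    """
    n = sum(count for _, count in centroids)
    t = 0.0  # running count of centroids entirely below x
    for i, (mean, count) in enumerate(centroids):
        if i < len(centroids) - 1:
            half_width = (centroids[i + 1][0] - mean) / 2.0
        else:
            half_width = (mean - centroids[i - 1][0]) / 2.0
        z = max(-1.0, (x - mean) / half_width)
        if z < 1.0:  # x falls inside this centroid's assumed extent
            return t / n + (count / n) * (z + 1.0) / 2.0
        t += count
    return 1.0
```

For unit-weight centroids at 1, 2, 3 and 4, evaluating the sketch at x = 2.5 interpolates to 0.5, matching the obvious empirical value.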

One particular centroid will have a count that straddles the desired quantile q, and interpolation can be used to estimate a value of x. This is shown in Algorithm 3. Note that at the extreme ends of the distribution as observed, each centroid will represent a single sample, so maximum resolution in q will be retained.

2.5. Computing the trimmed mean. The trimmed mean of X for the quantile range Q = [q_0, q_1] can be computed by taking a weighted average of the means of the centroids that have quantiles in Q. For centroids at the edge of Q, a pro rata weight is used that is based on an interpolated estimate of the fraction of the centroid's samples that are in Q. This method is shown as Algorithm 4.

Algorithm 2: Estimate quantile C.quantile(x)

Input: Centroids derived from distribution p(x), C = [. . . [m_i, s_i, k_i] . . .], value x
Output: Estimated value of q = ∫_{−∞}^{x} p(α) dα
  t ← 0, N ← Σ_i k_i
  for i ← 1 . . . m:
    if i < m:
      Δ ← (c_{i+1}.mean − c_i.mean)/2
    else:
      Δ ← (c_i.mean − c_{i−1}.mean)/2
    z ← max(−1, (x − m_i)/Δ)
    if z < 1:
      return t/N + (k_i/N)(z + 1)/2
    t ← t + k_i
  return 1

Algorithm 3: Estimate value at given quantile C.icdf(q)

Input: Centroids derived from distribution p(x), C = [c_1 . . . c_m], value q
Output: Estimated value x such that q = ∫_{−∞}^{x} p(α) dα
  t ← 0, q ← q · Σ_i c_i.count
  for i ← 1 . . . m:
    k_i ← c_i.count
    m_i ← c_i.mean
    if q < t + k_i:
      if i = 1:
        Δ ← (c_{i+1}.mean − c_i.mean)
      elif i = m:
        Δ ← (c_i.mean − c_{i−1}.mean)
      else:
        Δ ← (c_{i+1}.mean − c_{i−1}.mean)/2
      return m_i + ((q − t)/k_i − 1/2) Δ
    t ← t + k_i
  return c_m.mean

3. Empirical Assessment

3.1. Accuracy of estimation for uniform and skewed distributions. Figure 3 shows the error levels achieved with t-digest in estimating quantiles of 100,000 samples from a uniform and from a skewed distribution.
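A Python sketch of the inverse computation (Algorithm 3) under the same simplified representation, a mean-sorted list of (mean, count) pairs with at least two centroids; this is illustrative only, not the reference implementation:

```python
def icdf(centroids, q):
    """Estimate the value at quantile q from a mean-sorted list of
    (mean, count) pairs; assumes len(centroids) >= 2 and 0 <= q <= 1."""
    total = sum(count for _, count in centroids)
    target = q * total  # quantile expressed as a running count
    t = 0.0
    last = len(centroids) - 1
    for i, (mean, count) in enumerate(centroids):
        if target < t + count:  # this centroid straddles the quantile
            if i == 0:
                delta = centroids[1][0] - mean
            elif i == last:
                delta = mean - centroids[i - 1][0]
            else:
                delta = (centroids[i + 1][0] - centroids[i - 1][0]) / 2.0
            # interpolate within the centroid's assumed uniform extent
            return mean + ((target - t) / count - 0.5) * delta
        t += count
    return centroids[last][0]
```

For unit-weight centroids at 1, 2, 3 and 4, the sketch returns 2.5 at q = 0.5, the natural median of those four points.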

Algorithm 4: Estimate trimmed mean. Note how centroids at the boundary are included on a pro rata basis.

Input: Centroids derived from distribution p(x), C = [. . . [m_i, s_i, k_i] . . .], limit values q_0, q_1
Output: Estimate of mean of values x ∈ [q_0, q_1]
  s ← 0, k ← 0
  t ← 0, q_0 ← q_0 · Σ_i k_i, q_1 ← q_1 · Σ_i k_i
  for i ← 1 . . . m:
    k_i ← c_i.count
    if q_0 < t + k_i:
      if i > 1:
        Δ ← (c_{i+1}.mean − c_{i−1}.mean)/2
      elif i < m:
        Δ ← (c_{i+1}.mean − c_i.mean)
      else:
        Δ ← (c_i.mean − c_{i−1}.mean)
      η ← ((q_0 − t)/k_i − 1/2)
      s ← s + η k_i c_i.mean
      k ← k + η k_i
    if q_1 < t + k_i:
      if i > 1:
        Δ ← (c_{i+1}.mean − c_{i−1}.mean)/2
      elif i < m:
        Δ ← (c_{i+1}.mean − c_i.mean)
      else:
        Δ ← (c_i.mean − c_{i−1}.mean)
      η ← (1/2 − (q_1 − t)/k_i)
      s ← s − η k_i c_i.mean
      k ← k − η k_i
    t ← t + k_i
  return s/k

In these experiments δ = 0.01 was used since it provides a good compromise between accuracy and space. There is no visible difference in accuracy between the two underlying distributions in spite of the fact that the underlying densities differ by roughly 30 orders of magnitude. The accuracy shown here is computed by comparing the computed quantiles to the actual empirical quantiles for the sample used for testing and is shown in terms of q rather than the underlying sample value. At extreme values of q, the actual samples are preserved as centroids with weight 1, so the observed error for these extreme values is zero relative to the original data. For the data shown here, at q = 0.001, the maximum weight on a centroid is just above 4, and centroids in this range have all possible weights from 1 to 4. Errors are limited to, not surprisingly, just a few parts per million or less. For more extreme quantiles, the centroids will have fewer samples and the results will typically be exact.

Figure 3. The absolute error of the estimate of the cumulative distribution function q =

∫_{−∞}^{x} p(α) dα for the uniform and Γ(0.1, 0.1) distributions for 5 runs, each with 100,000 data points. As can be seen, the error is dramatically decreased for very high or very low quantiles (to a few parts per million). The precision setting used here, 1/δ = 100, would result in a uniform error of 10,000 ppm without adaptive bin sizing and interpolation.

Obviously, with the relatively small numbers of samples such as are used in these experiments, the accuracy of t-digests for estimating quantiles of the underlying distribution cannot be better than the accuracy of these estimates computed using the sample data points themselves. For the experiments here, the errors due to sampling completely dominate the errors introduced by t-digests, especially at extreme values of q. For much larger sample sets of billions of samples or more, this would be less true and the errors shown here would represent the accuracy of approximating the underlying distribution.

It should be noted that a Q-digest implemented with long integers is only able to store data with no more than 20 significant decimal figures. The implementation in stream-lib only retains 48 bits of significand, allowing only about 16 significant figures. This means that such a Q-digest would be inherently unable to even estimate the quantiles of the distribution tested here.

3.2. Persisting t-digests. For the accuracy setting and test data used in these experiments, the t-digest contained 820-860 centroids. The results of a t-digest can thus be stored by storing this many centroid means and weights. If centroids are kept as double precision floating point numbers and counts kept as 4-byte integers, the t-digest resulting from the accuracy tests described here would require about 10 kilobytes of storage for any of the distributions tested.

This size can be substantially decreased, however. One simple option is to store differences between centroid means and to use a variable byte encoding such as zig-zag encoding to store the cluster size.
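The delta-plus-variable-byte scheme just described can be sketched as follows. The exact wire format of the author's implementation is not given in the text, so the layout below (float32 deltas for the means, zig-zag plus LEB128-style varints for the counts) is illustrative only:

```python
import struct

def zigzag(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) ^ (n >> 63)

def varint(u: int) -> bytes:
    """LEB128-style variable-byte encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def serialize(centroids):
    """Serialize a mean-sorted list of (mean, count) centroid pairs."""
    buf = bytearray()
    prev = 0.0
    for mean, count in centroids:
        buf += struct.pack('<f', mean - prev)  # float32 delta of means
        buf += varint(zigzag(count))           # small counts -> 1 byte
        prev = mean
    return bytes(buf)
```

At around 5 bytes per centroid, this layout is roughly consistent with the ~4.6 kilobyte figure quoted in the text for 820-860 centroids.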
The differences between successive means are at least three orders of magnitude smaller than the means themselves, so using single precision floating point to store these differences can allow the t-digest from the tests described here to be stored in about 4.6 kilobytes while still retaining nearly 10 significant figures of accuracy in the means. This is roughly equivalent to the precision possible with a Q-digest operating on 32-bit integers, but the dynamic range of t-digests will be considerably higher and the accuracy considerably better.

3.3. Space/Accuracy Trade-off. Not surprisingly, there is a strong trade-off between the size of the t-digest as controlled by the compression parameter 1/δ and the accuracy with which quantiles are estimated. Quantiles at 0.999 and above or 0.001 and below were estimated to within a small fraction of 1% regardless of digest size. Accurate estimates of the median require substantially larger digests. Figure 4 shows this basic trade-off.

The size of the resulting digest depends strongly on the compression parameter 1/δ as shown in the left panel of Figure 5. The size of the digest also grows roughly with the log of the number of samples observed, at least in the range of 10,000 to 10,000,000 samples shown in the right panel of Figure 5.

3.4. Computing t-digests in parallel. With large scale computations, it is important to be able to compute aggregates like the t-digest on portions of the input and then combine those aggregates.

For example, in a map-reduce framework such as Hadoop, a combiner function can compute the t-digest for the output of each mapper, and a single reducer can be used to compute the t-digest for the entire data set.

Another example can be found in certain databases such as Couchbase or Druid which maintain tree structured indices and allow the programmer to specify that particular aggregates of the data being stored can be kept at interior nodes of the index.
The benefit of this is that aggregation can be done almost instantaneously over any contiguous sub-range of the index. The cost is quite modest, with only an O(log(N)) total increase in effort over keeping a running aggregate of all data. In many practical cases, the tree can contain only two or three levels and still provide fast enough response. For instance, if it is desired to be able to compute quantiles for any period up to years in 30 second increments, simply keeping higher level t-digests at the level of 30 seconds and days is likely to be satisfactory because at most about 10,000 digests are likely to need merging even for particularly odd intervals. If almost all queries over intervals longer than a few weeks are day-aligned, the number of digests that must be merged drops to a few thousand.

Figure 4. Accuracy is good for extreme quantiles regardless of digest size. From left to right, these panels show accuracy of estimates for q = 0.5, 0.01 and 0.001 as a function of the serialized size of the t-digest. Due to symmetry, these are equivalent to accuracy for q = 0.5, 0.99, and 0.999 as well. For mid quantiles such as the median (q = 0.5), moderate digest sizes of a few kilobytes suffice to get better than 1% accuracy, but a digest must be 20 kB or more to reliably achieve 0.1% accuracy. In contrast, accuracy for the 0.1%-ile (or 99.9%-ile) reaches a few parts per million for digests larger than about 5 kB. Note that errors for q = 0.5 and digests smaller than 1 kB are off the scale shown here at nearly 10%. All panels were computed using 100 runs with 100,000 samples. The compression parameter (1/δ) was varied from 2 to 1000 in order to vary the size of the resulting digest. Sizes shown were encoded using 4-byte floating point delta encoding for the centroid means and variable byte length integer encoding.

Merging t-digests can be done many ways.
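A sketch of one such merge, pooling all centroids and re-inserting them in random order, follows; it assumes a digest object exposing an add(mean, count) method, which is an assumed interface rather than the paper's API:

```python
import random

def merge_digests(digests, new_digest):
    """Merge t-digests.

    digests: a list of centroid lists, each centroid a (mean, count) pair.
    new_digest: a factory producing an empty digest with an
    add(mean, count) method (assumed interface).
    """
    pooled = [c for d in digests for c in d]
    random.shuffle(pooled)  # randomized order avoids pathological clustering
    out = new_digest()
    for mean, count in pooled:
        out.add(mean, count)  # weights are preserved, not re-split
    return out
```

Because centroids carry their weights into the new digest, the merged summary loses no more accuracy than a directly accumulated one, which is what the experiments below measure.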
The algorithm whose results are shown here consisted of simply making a list of all centroids from the t-digests being merged, shuffling that list, and then adding these centroids, preserving their weights, to a new t-digest.

3.5. Comparison with Q-digest. The prototype implementation of the t-digest completely dominates the implementation of the Q-digest from the popular stream-lib package [Thi] when the size of the resulting digest is controlled. This is shown in Figure 7. In the left panel, the effect of the compression parameter for Q-digest is compared to the effect of the similar parameter 1/δ for the t-digest. For the same value of the compression parameter, the sizes of the two digests are always within a factor of 2 for practical uses. The middle and right panels show accuracies for uniform and Γ distributions.

Figure 5. Size of the digest scales sub-linearly with the compression parameter (α ≈ 0.7 . . . 0.9) for a fixed number of points. Size scales approximately logarithmically with the number of points for a fixed compression parameter. The panel on the right is for 1/δ = 100. The dashed lines show best-fit log-linear models. In addition, the right panel shows the memory size required by the GK01 algorithm if 6 specific quantiles are desired.

As expected, the t-digest has very good accuracy for extreme quantiles while the Q-digest has constant error across the range. Interestingly, the accuracy of the Q-digest is at best roughly an order of magnitude worse than the accuracy of the t-digest. At worst, with extreme values of q, accuracy is several orders of magnitude worse. This situation is even worse with a highly skewed distribution such as the Γ(0.1, 0.1) shown in the right panel. Here, the very high dynamic range introduces severe quantization errors into the results.
This quantization is inherent in the use of integers in the Q-digest. For higher compression parameter values, the size of the Q-digest becomes up to two times smaller than that of the t-digest, but no improvement in the error rates is observed.

3.6. Speed. The current implementation has been primarily optimized for ease of development, not execution speed. As it is, running on a single core of a 2.3 GHz Intel Core i5, it typically takes 2-3 microseconds to process each point after JIT optimization has come into effect. It is to be expected that a substantial improvement in speed could be had by profiling and cleaning up the prototype code.

Figure 6. Accuracy of a t-digest accumulated directly is nearly the same as when the digest is computed by combining digests from 20 or 100 equal sized portions of the data. Repeated runs of this test occasionally show the situation seen in the left panel where digests formed from 5 partial digests show slightly worse accuracy than the non-subdivided case. This sub-aggregation property allows efficient use of the t-digest in map-reduce and database applications. Of particular interest is the fact that accuracy actually improves when the input data is broken into many parts, as is shown in the right hand panel. All panels were computed by 40 repetitions of aggregating 100,000 values. Accuracy for directly accumulated digests is shown on the left of each pair with the white bar, and the digest-of-digests accuracy is shown on the right of each pair by the dark gray bar.

4. Conclusion

The t-digest is a novel algorithm that dominates the previously state-of-the-art Q-digest in terms of accuracy and size. The t-digest provides accurate on-line estimates of a variety of rank-based statistics including quantiles and the trimmed mean. The algorithm is simple and empirically demonstrated to exhibit accuracy as predicted on theoretical grounds.
It is also suitable for parallel applications. Moreover, the t-digest algorithm is inherently on-line while the Q-digest is an on-line adaptation of an off-line algorithm.

The t-digest algorithm is available in the form of an open source, well-tested implementation from the author. It has already been adopted by the Apache Mahout and stream-lib projects and is likely to be adopted by other projects as well.

Figure 7. The left panel shows the size of a serialized Q-digest versus the size of a serialized t-digest for various values of 1/δ from 2 to 100,000. The sizes for the two kinds of digest are within a factor of 2 for all compression levels. The middle and right panels show the accuracy for a particular setting of 1/δ for Q-digest and t-digest. For each quantile, the Q-digest accuracy is shown as the left bar in a pair and the t-digest accuracy is shown as the right bar in a pair. Note that the vertical scale in these diagrams is one or two orders of magnitude larger than in the previous accuracy graphs and that in all cases the accuracy of the t-digest is dramatically better than that of the Q-digest even though the serialized size of each is within 10% of the other. Note that the errors in the right hand panel are systematic quantization errors introduced by the use of integers in the Q-digest algorithm. Any distribution with very large dynamic range will show the same problems.

References

[CLP00] Fei Chen, Diane Lambert, and José C. Pinheiro. Incremental quantile estimation for massive tracking. In Proceedings of KDD, pages 516-522, 2000.
[GK01] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58-66, 2001.
[Knu98] Donald E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms, page 232. Addison-Wesley, Boston, 3rd edition, 1998.
[Lin] LinkedIn.
Datafu: Hadoop library for large-scale data processing. https://github.com/linkedin/datafu/. [Online; accessed 20-December-2013].
[PDGQ05] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13(4):277-298, October 2005.
[SBAS04] Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri. Medians and beyond: New aggregation techniques for sensor networks. Pages 239-249, ACM Press, 2004.
[Thi] AddThis. Algorithms for calculating variance, online algorithm. https://github.com/addthis/stream-lib. [Online; accessed 28-November-2013].
[Wel62] B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, pages 419-420, 1962.
[Wik] Wikipedia. Algorithms for calculating variance, online algorithm. http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm. [Online; accessed 19-October-2013].
[ZW07] Qi Zhang and Wei Wang. An efficient algorithm for approximate biased quantile computation in data streams. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 1023-1026, New York, NY, USA, 2007. ACM.

Ted Dunning, MapR Technologies, Inc., San Jose, CA
E-mail address: [email protected]
E-mail address: [email protected]