streaming quantiles - edoliberty.github.io

Streaming Quantiles Edo Liberty Principal Scientist Amazon Web Services


Page 1: Streaming Quantiles - edoliberty.github.io

Streaming Quantiles

Edo Liberty, Principal Scientist, Amazon Web Services

Page 2: Streaming Quantiles - edoliberty.github.io
Page 3: Streaming Quantiles - edoliberty.github.io

Streaming Quantiles

• Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
• Munro, Paterson. Selection and sorting with limited storage.
• Greenwald, Khanna. Space-efficient online computation of quantile summaries.
• Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.
• Greenwald, Khanna. Quantiles and equidepth histograms over streams.
• Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.
• Felber, Ostrovsky. A randomized online quantile summary in $O((1/\varepsilon)\log(1/\varepsilon))$ words.
• Lang, Karnin, Liberty. Optimal Quantile Approximation in Streams.

Page 4: Streaming Quantiles - edoliberty.github.io

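Problem Definition

A stream of $n$ items. $R(x)$ is the rank of $x$ in the stream, between $0$ and $n$. Example from the figure: $R(x) = 0.6 \cdot n$.

Create a sketch for $R$ such that the approximate rank $R'$ satisfies $|R'(x) - R(x)| \le \varepsilon n$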
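As a concrete reading of the definition, here is a minimal Python sketch that computes exact ranks and checks an approximate rank function against the $\varepsilon n$ bound. It assumes $R(x)$ counts the stream items strictly smaller than $x$; the function names and the toy "sketch" are illustrative and do not come from the talk.

```python
from bisect import bisect_left

def exact_rank(sorted_items, x):
    """R(x): number of stream items strictly smaller than x (assumed convention)."""
    return bisect_left(sorted_items, x)

def is_eps_approximate(approx_rank, sorted_items, queries, eps):
    """Return True if |R'(x) - R(x)| <= eps * n holds for every query x."""
    n = len(sorted_items)
    return all(
        abs(approx_rank(x) - exact_rank(sorted_items, x)) <= eps * n
        for x in queries
    )

stream = [1, 0, 3, 5, 4, 7]
items = sorted(stream)

# A deliberately crude stand-in for a sketch: keep every other sorted item
# and count each survivor twice.
kept = items[::2]

def approx_rank(x):
    return 2 * bisect_left(kept, x)

ok_loose = is_eps_approximate(approx_rank, items, range(9), eps=0.2)
ok_tight = is_eps_approximate(approx_rank, items, range(9), eps=0.1)
```

On this six-item stream the crude summary is off by at most one rank, so it passes the check for `eps=0.2` and fails it for `eps=0.1`.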

Page 5: Streaming Quantiles - edoliberty.github.io

Solutions

• Uniform sampling: fast and simple, fully mergeable. Space: $O(1/\varepsilon^2)$

• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space: $\log(n)/\varepsilon$

• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space: $\log(1/\varepsilon)/\varepsilon$

$\log(1/\varepsilon)/\varepsilon$: previously conjectured space optimal for all algorithms. Lower bound for deterministic algorithms by Hung and Ting 2010.
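To make the first bullet concrete, the following sketch estimates ranks from a uniform sample of about $1/\varepsilon^2$ items. The talk does not say how the sample is drawn; reservoir sampling (Algorithm R) is used here as one standard choice, and all names are illustrative.

```python
import random

def reservoir_sample(stream, s, rng):
    """Algorithm R: keep a uniform random sample of s items from a stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < s:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)   # item i survives with probability s / (i + 1)
            if j < s:
                sample[j] = item
    return sample

def sampled_rank(sample, n, x):
    """Scale the rank of x within the sample up to the full stream length n."""
    return n * sum(1 for v in sample if v < x) / len(sample)

eps = 0.02
n = 10_000
s = round(1 / eps ** 2)                # sample size on the order of 1/eps^2
rng = random.Random(0)
stream = list(range(n))
rng.shuffle(stream)
sample = reservoir_sample(stream, s, rng)
estimate = sampled_rank(sample, n, 5000)   # exact rank of 5000 is 5000
```

The sample size depends only on $\varepsilon$, not on $n$, which is what the $O(1/\varepsilon^2)$ entry expresses.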

Page 6: Streaming Quantiles - edoliberty.github.io

Solutions cont'

• Uniform sampling: fast, simple, fully mergeable. Space: $O(1/\varepsilon^2)$

• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space: $\log(n)/\varepsilon$

• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space: $\log(1/\varepsilon)/\varepsilon$

Buffer based solutions:

• Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable. Space: $\log^2(n)/\varepsilon$

• Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable. Space: $\log^{3/2}(1/\varepsilon)/\varepsilon$

Page 7: Streaming Quantiles - edoliberty.github.io

The basic buffer idea

Buffer of size k

Stream items: 1 0 3 5 4 7

Page 8: Streaming Quantiles - edoliberty.github.io

The basic buffer idea

Stores k stream entries

Buffer contents: 1 0 3 5 4 7

Page 9: Streaming Quantiles - edoliberty.github.io

The basic buffer idea

The buffer sorts k stream entries

Buffer contents: 0 1 3 4 5 7

Page 10: Streaming Quantiles - edoliberty.github.io

The basic buffer idea

Deletes every other item

Buffer contents: 0 1 3 4 5 7 (1, 4 and 7 are deleted)

Page 11: Streaming Quantiles - edoliberty.github.io

The basic buffer idea

And outputs the rest with double the weight

Output: 0 3 5

Page 12: Streaming Quantiles - edoliberty.github.io

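The basic buffer idea

Sorted buffer: 0 1 3 4 5 7. The output is either 0 3 5 or 1 4 7, each item with weight 2.

For a query $x$ with $R(x) = 2$: $R'(x) = 2$ for either output.

For a query $x$ with $R(x) = 5$: $R'(x) = 6$ for the first output and $R'(x) = 4$ for the second.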

Page 13: Streaming Quantiles - edoliberty.github.io

The basic buffer idea

Repeat $n/k$ times until the end of the stream. The $n$ stream items become $n/2$ output items of weight 2.

$|R'(x) - R(x)| < n/k$
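The buffer steps above (store, sort, delete every other item, output the rest with double weight) can be sketched in a few lines of Python. This is a minimal illustration: the function names are made up here, and it assumes an even $k$ that divides the stream length.

```python
def compact(buffer, offset=0):
    """Sort a full buffer and keep every other item.

    offset=0 keeps positions 0, 2, 4, ... of the sorted buffer and
    offset=1 keeps positions 1, 3, 5, ...; each survivor stands for two items.
    """
    return sorted(buffer)[offset::2]

def weighted_rank(items, weight, x):
    """Approximate rank: each stored item smaller than x counts `weight` times."""
    return weight * sum(1 for v in items if v < x)

def stream_compact(stream, k):
    """Compact consecutive chunks of k items (assumes k is even and divides len(stream))."""
    out = []
    for start in range(0, len(stream), k):
        out.extend(compact(stream[start:start + k]))
    return out

# The slide's example buffer.
kept_even = compact([1, 0, 3, 5, 4, 7], offset=0)
kept_odd = compact([1, 0, 3, 5, 4, 7], offset=1)

# A longer stream: a fixed permutation of 0..999, so the exact rank of x is x.
n, k = 1000, 10
stream = [(i * 37) % n for i in range(n)]
out = stream_compact(stream, k)
max_err = max(abs(weighted_rank(out, 2, x) - x) for x in range(n))
```

Each compaction shifts any rank by at most one, so with $n/k$ compactions the total error stays within $n/k$.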

Page 14: Streaming Quantiles - edoliberty.github.io

Manku-Rajagopalan-Lindsay (MRL) sketch

$H = \log_2(n)$ levels of buffers of size $k$ over a stream of $n$ items.

$|R'(x) - R(x)| \le (n/k) \cdot \log_2(n)$

Page 15: Streaming Quantiles - edoliberty.github.io

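Manku-Rajagopalan-Lindsay (MRL) sketch

If we set $k = \log_2(n)/\varepsilon$ we get $|R'(x) - R(x)| \le \varepsilon n$ while maintaining at most $H \cdot k \le \log_2^2(n)/\varepsilon$ stream items.

• Manku-Rajagopalan-Lindsay (MRL) sketch: fast, simple, fully mergeable. Space: $\log^2(n)/\varepsilon$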
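A minimal Python sketch of the stacked-buffer construction described on these two slides: every level is a buffer of size $k$, and a full buffer is compacted into the level above. This is a simplified illustration rather than the authors' implementation; the class name and methods are hypothetical, levels are indexed from 0 so that level `i` holds items of weight `2**i`, and the compaction always keeps the even positions.

```python
class MRLSketch:
    def __init__(self, k):
        assert k % 2 == 0, "this sketch assumes an even buffer size"
        self.k = k
        self.levels = [[]]          # levels[i] holds items of weight 2**i

    def update(self, item):
        self.levels[0].append(item)
        i = 0
        # Cascade: a full buffer is sorted, halved, and pushed one level up.
        while len(self.levels[i]) >= self.k:
            if i + 1 == len(self.levels):
                self.levels.append([])
            full = sorted(self.levels[i])
            self.levels[i] = []
            self.levels[i + 1].extend(full[::2])
            i += 1

    def rank(self, x):
        """Approximate number of stream items strictly smaller than x."""
        return sum(
            (2 ** i) * sum(1 for v in buf if v < x)
            for i, buf in enumerate(self.levels)
        )

    def size(self):
        return sum(len(buf) for buf in self.levels)

n, k = 4096, 64
sketch = MRLSketch(k)
for i in range(n):
    sketch.update((i * 1237) % n)   # a fixed permutation of 0..n-1, so R(x) = x

max_err = max(abs(sketch.rank(x) - x) for x in range(0, n, 17))
```

Compaction preserves total weight, so `sketch.rank(n)` equals `n`, while the number of stored items stays below `k` times the number of levels.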

Page 16: Streaming Quantiles - edoliberty.github.io

Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)

$\log(1/\varepsilon)$ levels of buffers of size $k$; start sampling after $O(1/\varepsilon^2)$ items.

Reduces space usage to $\log^2(1/\varepsilon)/\varepsilon$ items from the stream.

Page 17: Streaming Quantiles - edoliberty.github.io

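Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)

$R'(x)$ is a random variable now and $E[R'(x)] = R(x)$.

Example: the buffer holds 5 and 7, and $x$ has $R(x) = 1$. If 5 survives the compaction, $R'(x) = 2$; if 7 survives, $R'(x) = 0$.

Reduces space usage to $\log^{3/2}(1/\varepsilon)/\varepsilon$ items from the stream.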
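The unbiasedness claim can be checked empirically on the slide's two-item buffer. The sketch below assumes the compaction keeps the even or the odd positions of the sorted buffer with equal probability; the function names are illustrative.

```python
import random

def random_compact(buffer, rng):
    """Keep the even- or odd-indexed items of the sorted buffer, each with probability 1/2."""
    offset = rng.randrange(2)
    return sorted(buffer)[offset::2]

def estimated_rank(kept, x, weight=2):
    """Each surviving item smaller than x counts `weight` times."""
    return weight * sum(1 for v in kept if v < x)

rng = random.Random(0)
buffer, x = [5, 7], 6               # R(x) = 1, as in the slide's example
samples = [estimated_rank(random_compact(buffer, rng), x) for _ in range(10_000)]
mean = sum(samples) / len(samples)
```

Each individual estimate is either 0 or 2, but the average over many independent compactions concentrates around the true rank 1.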

Page 18: Streaming Quantiles - edoliberty.github.io

Recap

• Uniform sampling: fast, simple, fully mergeable. Space: $O(1/\varepsilon^2)$

• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space: $\log(n)/\varepsilon$

• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space: $\log(1/\varepsilon)/\varepsilon$

• Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable. Space: $\log^2(n)/\varepsilon$

• Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable. Space: $\log^{3/2}(1/\varepsilon)/\varepsilon$

Page 19: Streaming Quantiles - edoliberty.github.io

Our goal

• Uniform sampling: fast, simple, fully mergeable. Space: $O(1/\varepsilon^2)$

• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space: $\log(n)/\varepsilon$

• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space: $\log(1/\varepsilon)/\varepsilon$

• Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable. Space: $\log^2(n)/\varepsilon$

• Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable. Space: $\log^{3/2}(1/\varepsilon)/\varepsilon$

• Karnin, Lang, Liberty: fast, simple, fully mergeable. Space: $\log(1/\varepsilon)/\varepsilon$

Page 20: Streaming Quantiles - edoliberty.github.io

Observation

The first buffers contribute very little to the error. They are “too good”.

Levels: $h = H, \ldots, h = 2, h = 1$

Weight of items in the level: $w_h = 2^{h-1}$

Number of compactions: $m_h = 2^{H-h-1}$
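One way to see why the low levels are "too good" is to tabulate a per-level error budget from the two formulas above. The sketch below assumes that each compaction at level $h$ perturbs a rank by at most $w_h$ with a random sign, so level $h$ contributes on the order of $m_h w_h^2$ to the variance; that modelling assumption and the function name are not from the slide.

```python
def level_contributions(H):
    """Return (h, w_h, m_h, m_h * w_h**2) for h = 1 .. H-1.

    Uses w_h = 2**(h-1) and m_h = 2**(H-h-1) from the slide; the top level
    h = H is left out because the formula for m_h is not a whole number there.
    """
    rows = []
    for h in range(1, H):
        w = 2 ** (h - 1)
        m = 2 ** (H - h - 1)
        rows.append((h, w, m, m * w * w))
    return rows

rows = level_contributions(10)
for h, w, m, contribution in rows:
    print(f"h={h:2d}  w_h={w:4d}  m_h={m:4d}  m_h*w_h^2={contribution}")
```

Under this model the contribution is $2^{H+h-3}$, which doubles with every level, so the bottom levels account for an exponentially small share of the total even though they use the same buffer size $k$.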

Page 21: Streaming Quantiles - edoliberty.github.io

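Idea

Let buffers shrink at-most-exponentially: $k_h \ge k c^{H-h}$ ($c$: TBD later)

Levels: $h = H, \ldots, h = 2, h = 1$

Weight of items in the level: $w_h = 2^{h-1}$

Number of levels: $H \le \log(n/ck) + 2$

Number of compactions: $m_h \le (2/c)^{H-h-1}$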

Page 22: Streaming Quantiles - edoliberty.github.io

Analysis

$R(x, h)$: the rank of $x$ among
1. The items yielded by the compactor at height $h$
2. All the items stored in the compactors of heights $h' \le h$

Claim: $\Pr\left[|R(x, H') - R(x)| \ge \varepsilon n\right] \le 2\exp\left(-C \varepsilon^2 k^2 2^{2(H-H')}\right)$, with $C = c^2(2c - 1)$

Proof: Use Hoeffding's inequality on $\sum_{h=1}^{H} \left[R(x, h) - R(x, h-1)\right]$

Page 23: Streaming Quantiles - edoliberty.github.io

Solution 1

Set $k_h = \lceil k c^{H-h} \rceil + 1$ and $c = 2/3$

Figure labels: $\log(n)$; exponentially decreasing capacity buffers; sampler replaces all buffers of size 2.

• Karnin, Lang, Liberty (1): fast, simple, fully mergeable. Space: $\sqrt{\log(1/\varepsilon)}/\varepsilon$

Better than previously conjectured optimal!
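The capacity schedule $k_h = \lceil k c^{H-h} \rceil + 1$ with $c = 2/3$ can be sketched as a stack of compactors whose capacities shrink toward the bottom. The Python below is a simplified illustration under several assumptions: it omits the sampler that replaces the buffers of size 2, it leaves one item behind when an odd-sized buffer is compacted, and the class and method names are hypothetical rather than taken from the authors' code.

```python
import math
import random

class KLLSketch:
    def __init__(self, k, c=2 / 3, seed=None):
        self.k, self.c = k, c
        self.rng = random.Random(seed)
        self.compactors = [[]]      # compactors[i] is height h = i + 1, weight 2**i

    def capacity(self, i):
        """k_h = ceil(k * c**(H - h)) + 1 for the compactor at height h = i + 1."""
        H = len(self.compactors)
        return math.ceil(self.k * self.c ** (H - (i + 1))) + 1

    def update(self, item):
        self.compactors[0].append(item)
        i = 0
        while i < len(self.compactors):
            if len(self.compactors[i]) >= self.capacity(i):
                self._compact(i)
            i += 1

    def _compact(self, i):
        if i + 1 == len(self.compactors):
            self.compactors.append([])
        buf = sorted(self.compactors[i])
        # An odd-sized buffer leaves its largest item behind so total weight is conserved.
        leftover = [buf.pop()] if len(buf) % 2 else []
        offset = self.rng.randrange(2)          # keep even or odd positions at random
        self.compactors[i + 1].extend(buf[offset::2])
        self.compactors[i] = leftover

    def rank(self, x):
        return sum(
            (2 ** i) * sum(1 for v in buf if v < x)
            for i, buf in enumerate(self.compactors)
        )

    def size(self):
        return sum(len(buf) for buf in self.compactors)

n = 20_000
sketch = KLLSketch(k=200, seed=0)
for i in range(n):
    sketch.update((i * 7919) % n)   # a fixed permutation of 0..n-1, so R(x) = x

max_err = max(abs(sketch.rank(x) - x) for x in range(0, n, 97))
```

Because the capacities form a geometric series, the stored items total roughly $k/(1-c) = 3k$ plus a small additive term for the tiny bottom buffers, no matter how many levels the stream creates.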

Page 24: Streaming Quantiles - edoliberty.github.io

Solution 2 (KLL + MRL)

Set $k_h = \lceil k c^{H-h} \rceil + 1$ and $c = 2/3$, except that the top $\log\log(1/\varepsilon)$ buffers all have capacity $k$.

Figure labels: $\log(n)$; $\log\log(1/\varepsilon)$ buffers of capacity $k$; exponentially decreasing capacity buffers; sampler replaces all buffers of size 2.

• Karnin, Lang, Liberty (2): fast, simple, fully mergeable. Space: $\log^2\log(1/\varepsilon)/\varepsilon$

Page 25: Streaming Quantiles - edoliberty.github.io

Solution 3 (KLL + GK)

Set $k_h = \lceil k c^{H-h} \rceil + 1$ and $c = 2/3$, and replace the top $\log\log(1/\varepsilon)$ levels with a GK sketch.

Figure labels: $\log(n)$; GK sketch replaces top $\log\log(1/\varepsilon)$ levels; exponentially decreasing capacity buffers; sampler replaces all buffers of size 2.

• Karnin, Lang, Liberty (3): fast, simple, fully mergeable. Space: $\log\log(1/\varepsilon)/\varepsilon$

Page 26: Streaming Quantiles - edoliberty.github.io

Count Distinct (Demo Only)

Page 27: Streaming Quantiles - edoliberty.github.io

Assume you need to estimate the distribution of numbers in a file

$ head data.csv
0
1
0
3
0
2
3
7
3
2

In this one, row i takes a value from [0, i] uniformly at random.

Page 28: Streaming Quantiles - edoliberty.github.io

Some stats: there are 10,000,000 such numbers in this ~76 MB file.

$ time wc -lc data.csv
10000000 76046666 data.csv

real 0m0.101s
user 0m0.072s
sys 0m0.021s

Reading the file takes ~1/10 of a second. We don't foresee IO being an issue.

Page 29: Streaming Quantiles - edoliberty.github.io

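In Python it looks like this:

$ cat quantiles.py
import sys

# Read one integer per line from stdin and sort everything in memory.
ints = sorted(int(x) for x in sys.stdin)
# Print every (n/100)-th sorted value; max(1, ...) avoids a zero step on short inputs.
step = max(1, len(ints) // 100)
for i in range(0, len(ints), step):
    print(ints[i])

$ time cat data.csv | python quantiles.py > /dev/null

real 0m13.406s
user 0m12.937s
sys 0m0.407s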

Page 30: Streaming Quantiles - edoliberty.github.io

This is the way to do this with the sketching library

$ time cat data.csv | sketch rank > /dev/null

real 0m1.495s
user 0m1.878s
sys 0m0.141s

$ time cat data.csv | sketch rank

Too fast to use the system monitor UI...

It uses ~4k of memory!

Page 31: Streaming Quantiles - edoliberty.github.io

[Figure: two plots, "exact and approximate quantiles" (series: approximate quantiles, exact quantiles) and "exact vs approximate quantiles". Values range from 0 to 10,000,000.]

Page 32: Streaming Quantiles - edoliberty.github.io

Some experimental results

[Figure: two plots, both titled "Lazy KLL versus (Sketch Library and Two Variants)". x-axis: Number of Items in Randomly Permuted Stream (100 to 1e+06). y-axes: Error (0 to 0.01) and Space Used For Storing Samples (0 to 4000). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]

Page 33: Streaming Quantiles - edoliberty.github.io

Thank you!