
Page 1: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

CS 410/510 Data Streams
Lecture 16: Data-Stream Sampling: Basic Techniques and Results

Kristin Tufte, David Maier

Page 2: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Data Stream Sampling

- Sampling provides a synopsis of a data stream
- A sample can serve as input for:
  - Answering queries
  - "statistical inference about the contents of the stream"
  - "variety of analytical procedures"
- Focus on: obtaining a sample from the window (sample size « window size)

Page 3: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Windows

- Stationary Window: endpoints of the window are fixed (think relation)
- Sliding Window: endpoints of the window move
  - What we've been talking about
  - More complex than a stationary window, because elements must be removed from the sample when they expire from the window

Page 4: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Simple Random Sampling (SRS)

- What is a "representative" sample?
- SRS for a sample of k elements from a window with n elements:
  - Every possible sample (of size k) is equally likely, that is, has probability $1 / \binom{n}{k}$
  - Every element is equally likely to be in the sample
- Stratified Sampling
  - Divide the window into disjoint segments (strata)
  - SRS over each stratum
  - Advantageous when stream elements close together in the stream have similar values

Page 5: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Bernoulli Sampling

- Includes each element in the sample with probability q
- The sample size is not fixed; it is binomially distributed
- Probability that the sample contains k elements: $\binom{n}{k} q^k (1-q)^{n-k}$
- Expected sample size is nq

Page 6: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Binomial Distribution - Example

- Expected sample size = 20 * 0.5 = 10

[Chart: Binomial distribution (n=20, q=0.5); x-axis: sample size 0-20, y-axis: probability 0-0.2, peaking at 10.]

Page 7: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Binomial Distribution - Example

- Expected sample size = 20 * (1/3) ≈ 6.67

[Chart: Binomial distribution (n=20, q=1/3); x-axis: sample size 0-20, y-axis: probability 0-0.2, peaking near 6-7.]

Page 8: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Bernoulli Sampling - Implementation

- Naïve: elements inserted with probability q (ignored with probability 1-q)
- Use a sequence of pseudorandom numbers (U1, U2, U3, ...), Ui ∈ [0,1]
- Element ei is included if Ui ≤ q
- Example (q = 0.2): U1=0.5, U2=0.1, U3=0.9, U4=0.8, U5=0.2, U6=0.3, U7=0.0
  - Sample: {e2, e5, e7}
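A minimal sketch of the naïve scheme, assuming nothing beyond the slide (the function name is ours): draw one uniform Ui per element and keep the element when Ui ≤ q.

```python
import random

def bernoulli_sample(stream, q):
    """Naive Bernoulli sampling: keep each element independently with probability q."""
    sample = []
    for e in stream:
        if random.random() <= q:   # U_i <= q
            sample.append(e)
    return sample

# The slide's setting: q = 0.2, stream e1..e7 (membership varies run to run)
print(bernoulli_sample(["e%d" % i for i in range(1, 8)], q=0.2))
```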

Page 9: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Bernoulli Sampling - Efficient Implementation

- Calculate the number of elements to be skipped after an insertion (Δi)
- Pr{Δi = j} = q(1-q)^j
- Skip zero elements: must get Ui ≤ q (probability q)
- Skip one element: Ui > q, Ui+1 ≤ q (probability (1-q)q)
- Skip two elements: Ui > q, Ui+1 > q, Ui+2 ≤ q (probability (1-q)^2 q)
- Δi has a geometric distribution

Page 10: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Geometric Distribution - Example

[Chart: Geometric distribution of the number of skips (Δi) for q = 0.2; x-axis: number of skips 0-20, y-axis: probability 0-0.25, decaying geometrically from 0.2 at zero skips.]

Page 11: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results


Bernoulli Sampling - Algorithm
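The algorithm on the original slide was a figure and did not survive the transcript; below is a sketch reconstructed from the previous slide's description (names are ours). It draws Δi directly by inversion, Δ = ⌊ln U / ln(1-q)⌋ for U uniform on (0,1], instead of one draw per stream element.

```python
import math
import random

def bernoulli_sample_skips(stream, q):
    """Bernoulli sampling via geometric skips (assumes 0 < q < 1):
    one random draw per inserted element instead of one per stream element."""
    def next_skip():
        u = 1.0 - random.random()   # uniform on (0, 1], avoids log(0)
        return math.floor(math.log(u) / math.log(1.0 - q))

    sample = []
    skip = next_skip()
    for e in stream:
        if skip == 0:
            sample.append(e)        # insert, then draw the next skip length
            skip = next_skip()
        else:
            skip -= 1               # still skipping past this element
    return sample
```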

Page 12: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Bernoulli Sampling

- Straightforward, SRS, easy to implement
- But... the sample size is not fixed!
- Look at algorithms with a deterministic sample size:
  - Reservoir Sampling
  - Stratified Sampling
  - Biased Sampling Schemes

Page 13: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Reservoir Sampling

- Produces a SRS of size k from a window of length n (k is specified)
- Initialize a "reservoir" using the first k elements
- For every following element, insert with probability pi (ignore with probability 1-pi)
  - pi = k/i for i > k (pi = 1 for i ≤ k)
  - pi changes as i increases
- Remove one element from the reservoir before each insertion

Page 14: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Reservoir Sampling - Example

- Sample size 3 (k=3). Recall: pi = 1 for i ≤ k, pi = k/i for i > k
- e1, e2, e3 fill the reservoir (p1 = p2 = p3 = 1)
- e4: p4 = 3/4, U4 = 0.5 → insert
- e5: p5 = 3/5, U5 = 0.1 → insert
- e6: p6 = 3/6, U6 = 0.9 → ignore
- e7: p7 = 3/7, U7 = 0.8 → ignore
- e8: p8 = 3/8, U8 = 0.2 → insert
- One possible final reservoir: {e4, e5, e8}
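A compact sketch of the algorithm (the function name is ours); the insertion test uses pi = k/i, and the evicted element is chosen uniformly from the reservoir.

```python
import random

def reservoir_sample(stream, k):
    """Reservoir sampling: returns a SRS of size k from the stream."""
    reservoir = []
    for i, e in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(e)                   # p_i = 1 for i <= k
        elif random.random() < k / i:             # p_i = k/i for i > k
            reservoir[random.randrange(k)] = e    # remove a uniform victim, insert e
    return reservoir
```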

Page 15: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Reservoir Sampling - SRS

- Why set pi = k/i?
- Want Sj to be a SRS from Uj = {e1, e2, ..., ej} (Sj is the sample from Uj)
- Recall SRS means every sample of size k is equally likely
- Intuition: the probability that ei is included in a SRS from Ui is k/i (k is sample size, i is "window" size)
- k/i = (# samples containing ei) / (# samples of size k) = $\binom{i-1}{k-1} / \binom{i}{k}$

Page 16: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Reservoir Sampling - Observations

- Insertion probability (pi = k/i for i > k) decreases as i increases
- Also, opportunities for an element in the sample to be removed from the sample decrease as i increases
- These trends offset each other
- Probability of being in the final sample is the same for all elements in the window

Page 17: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Other Sampling Schemes

- Stratified Sampling: divide the window into strata, SRS in each stratum
- Deterministic & Semi-Deterministic Schemes: e.g., sample every 10th element
- Biased Sampling Schemes: bias the sample towards recently-received elements
  - Biased Reservoir Sampling
  - Biased Sampling by Halving

Page 18: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results


Stratified Sampling

Page 19: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Stratified Sampling

- When elements close to each other in the window have similar values, algorithms such as reservoir sampling can have bad luck
- Alternative: divide the window into strata and do SRS in each stratum
- If you know there is a correlation between data values (e.g., timestamp) and position in the stream, you may wish to use stratified sampling
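A sketch of the idea, reusing the `reservoir_sample` helper sketched earlier and assuming fixed-width strata (as in the sliding-window example later in the lecture):

```python
def stratified_sample(window, stratum_size, k):
    """Stratified sampling: split the window into fixed-width strata and
    take a SRS of size k from each stratum."""
    return [reservoir_sample(window[s:s + stratum_size], k)
            for s in range(0, len(window), stratum_size)]
```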

Page 20: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Deterministic & Semi-deterministic Schemes

- Produce a sample of size k by inserting every (n/k)th element into the sample
- Simple, but not random
  - Can't make statistical conclusions about the window from the sample
- Bad if the data is periodic
- Can be good if the data exhibits a trend
  - Ensures sampled elements are spread throughout the window
- Example: n = 18, k = 6 → take every 3rd element of e1, ..., e18
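The scheme is one line of slicing (a sketch; assumes k divides n evenly):

```python
def systematic_sample(window, k):
    """Semi-deterministic sampling: every (n/k)-th element of the window."""
    step = len(window) // k
    return window[step - 1::step]   # n = 18, k = 6 -> e3, e6, e9, e12, e15, e18
```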

Page 21: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Biased Reservoir Sampling

- Recall: in reservoir sampling, the probability of inclusion decreased as we got further into the window (pi = k/i)
- What if pi were constant? (pi = p)
- Alternative: pi decreases more slowly than k/i
- Will favor recently-arrived elements
  - Recently-arrived elements are more likely to be in the sample than long-ago-arrived elements

Page 22: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Biased Reservoir Sampling

- For reservoir sampling with insertion probabilities pj, the probability that ei is included in sample S is:

$$\Pr\{e_i \in S\} = p_i \prod_{j=\max(i,k)+1}^{n} \frac{k - p_j}{k}$$

- If pi is fixed, that is, pi = p ∈ (0,1):

$$\Pr\{e_i \in S\} = p \left(\frac{k-p}{k}\right)^{n - \max(i,k)}$$

- The probability that ei is in the final sample increases geometrically as i increases

Page 23: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Biased Reservoir Sampling

[Chart: probability that ei is included in the final sample vs. element index i, for p = 0.2, k = 10, n = 40, i.e. 0.2 * ((10 - 0.2)/10)^(40 - max(i, 10)); y-axis: probability 0-0.25, rising toward 0.2 for the most recent elements.]
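A few lines reproduce the curve under the slide's parameters (the function name is ours):

```python
def inclusion_probability(i, p=0.2, k=10, n=40):
    """Pr{e_i in final sample} for biased reservoir sampling with fixed p."""
    return p * ((k - p) / k) ** (n - max(i, k))

for i in (1, 10, 20, 30, 40):
    print(i, round(inclusion_probability(i), 4))   # rises from ~0.109 to 0.2
```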

Page 24: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Biased Sampling by Halving

- Break the window into strata (Λ1, Λ2, Λ3, Λ4, ...); maintain a sample of size 2k
- Step 1: S = unbiased SRS samples of size k from Λ1 and Λ2 (e.g., via reservoir sampling)
- Step 2: sub-sample S down to size k, then insert a SRS of size k from Λ3 into S
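A sketch generalizing the two steps above to each subsequent stratum (the `reservoir_sample` helper is from the earlier sketch; repeating step 2 for Λ4, Λ5, ... is our assumption):

```python
import random

def halving_sample(strata, k):
    """Biased sampling by halving: a sample of size 2k, halved as each new stratum arrives."""
    s = reservoir_sample(strata[0], k) + reservoir_sample(strata[1], k)  # step 1
    for stratum in strata[2:]:
        s = random.sample(s, k)              # step 2a: sub-sample down to k
        s += reservoir_sample(stratum, k)    # step 2b: add a SRS of size k from the new stratum
    return s
```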

Page 25: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Sampling from Sliding Windows

- Harder than sampling from a stationary window
  - Must remove elements from the sample as they expire from the window
  - Difficult to maintain a sample of a fixed size
- Window types:
  - Sequence-based windows: contain the n most recent elements (row-based windows)
  - Timestamp-based windows: contain all elements that arrived within the past t time units (time-based windows)
- Goal: unbiased sampling from within a window

Page 26: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Sequence-based Windows

- Wj is a window of length n, j ≥ 1: Wj = {ej, ej+1, ..., ej+n-1}
- Want a SRS Sj of k elements from Wj
- Tradeoff between the amount of memory required and the degree of dependence between the Sj's

Page 27: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Complete Resampling

- Window size = 5, sample size = 2
- Maintain the full window (Wj)
- Each time the window changes, use reservoir sampling to create Sj from Wj
- Very expensive: memory and CPU are O(n) (n = window size)
- Example: W1 = {e1, ..., e5}, S1 = {e2, e4}; W2 = {e2, ..., e6}, S2 = {e3, e5}

Page 28: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Passive Algorithm

- Window size = 5, sample size = 2
- When an element in the sample expires, insert the newly-arrived element into the sample
- Sj is a SRS from Wj, but the Sj's are highly correlated
  - If S1 is a bad sample, S2 will be also...
- Memory is O(k), k = sample size
- Example: S1 = {e2, e4}; window slides to W2: S2 = {e2, e4}; e2 expires at W3: S3 = {e7, e4}
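A sketch of the passive algorithm (class name ours). The deque here only stands in for knowing which element expires; the essential sample state is O(k).

```python
import random
from collections import deque

class PassiveSampler:
    """Passive algorithm: SRS of size k over a sequence-based window of n elements."""
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.window = deque()   # used only to detect expirations
        self.sample = []
        self.count = 0

    def insert(self, e):
        self.count += 1
        self.window.append(e)
        if self.count <= self.n:                       # first window: reservoir sampling
            if len(self.sample) < self.k:
                self.sample.append(e)
            elif random.random() < self.k / self.count:
                self.sample[random.randrange(self.k)] = e
        else:                                          # afterwards: swap on expiry
            expired = self.window.popleft()
            if expired in self.sample:
                self.sample[self.sample.index(expired)] = e
```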

Page 29: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Chain Sampling (Babcock, et al.)

- Improved independence properties compared to the passive algorithm
- Expected memory usage: O(k)
- Basic algorithm maintains a sample of size 1
- Get a sample of size k by running k chain-samplers

Page 30: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Chain Sampling - Issue

- Behaves as a reservoir sampler for the first n elements
- Inserts additional elements into the sample with probability 1/n
- Example (window size n = 3): p1 = 1, p2 = 1/2, p3 = 1/3, then p4 = 1/3, ...; suppose e2 ends up as the sample
- When the window slides to W3, e2 expires while it is the sample. Now, what do we do?

Page 31: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Chain Sampling - Solution

- When ei is selected for inclusion in the sample, select K from {i+1, i+2, ..., i+n}; eK will replace ei if ei expires while part of sample S
- We know eK will be in the window when ei expires
- Example (n = 3): e2 is sampled; choose K ∈ {3, 4, 5}, say K = 5. When e2 expires, e5 replaces it; then choose K ∈ {6, 7, 8}, say K = 7, and e7 stands by to replace e5.
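A sketch of one size-1 chain sampler under this description (class name ours; run k copies for a sample of size k). The chain is the list of stored replacements; its expected length is constant.

```python
import random

class ChainSampler:
    """Chain sampling (Babcock et al.): sample of size 1 over a window of n elements."""
    def __init__(self, n):
        self.n = n
        self.chain = []       # [(index, element), ...]; head is the current sample
        self.next_k = None    # stream index whose element extends the chain

    def insert(self, i, e):   # i = 1-based position of e in the stream
        if random.random() < 1.0 / min(i, self.n):   # 1/i while filling, then 1/n
            self.chain = [(i, e)]                    # new sample; old chain discarded
            self.next_k = random.randint(i + 1, i + self.n)
        elif i == self.next_k:
            self.chain.append((i, e))                # stored replacement for the tail
            self.next_k = random.randint(i + 1, i + self.n)
        if self.chain and self.chain[0][0] <= i - self.n:
            self.chain.pop(0)                        # head expired; successor takes over

    def sample(self):
        return self.chain[0][1] if self.chain else None
```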

Page 32: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Chain Sampling - Summary

- Expected memory consumption: O(k)
- Chain sampling produces a SRS with replacement for each sliding window
- If we use k chain-samplers to get a sample of size k, we may get duplicates in that sample
- Can oversample (use sample size k + α), then sub-sample down to a sample of size k

Page 33: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Stratified Sampling

- Divide the window into strata and do SRS in each stratum

Page 34: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Stratified Sampling - Sliding Window

- Window size = 12 (n), stratum size = 4 (m), stratum sample size = 2 (k)
- Wj overlaps between 3 and 4 strata (l and l+1 strata), where l = window_size / stratum_size = n/m (= 3 here)
- The paper says the sample size is between k(l-1) and k·l; we think it should be between k(l-1) and k(l+1)
- Example strata samples: ss1 = {e1, e2}, ss2 = {e6, e7}, ss3 = {e9, e11}; as the window slides, a later stratum contributes {e14, e16}

Page 35: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Timestamp-Based Windows

- The number of elements in the window changes over time
- Multiple elements in the sample may expire at once
- Chain sampling relies on insertion probability = 1/n (n is window size), so it does not carry over
- Stratified sampling wouldn't be able to bound the sample size

Page 36: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Priority Sampling (Babcock, et al.)

- A priority sampler maintains a SRS of size 1; use k priority samplers to get a SRS of size k
- Assign a random, uniformly-distributed priority in (0,1) to each element
- The current sample is the element in the window with the highest priority
- Keep elements for which there is no other element with both higher priority and higher (later) timestamp

Page 37: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Priority Sampling - Example

- Keep elements for which there is no element with both higher priority and higher (later) timestamp

[Diagram: stream e1 ... e15 with priorities .1, .8, .3, .4, .7, .1, .3, .5, .2, .6, .4, .1, .5, .3 and windows W1, W2, W3; legend distinguishes the element in the sample, elements stored in memory, and elements in the window but not stored.]
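A sketch for a timestamp-based window (class name ours). The stored elements form a list with increasing timestamps and decreasing priorities; a new arrival evicts every stored element it dominates, and the oldest unexpired survivor is the sample.

```python
import random

class PrioritySampler:
    """Priority sampling, sample size 1, timestamp-based window."""
    def __init__(self):
        self.kept = []   # (timestamp, priority, element); time increasing, priority decreasing

    def insert(self, ts, e):
        pr = random.random()                     # uniform priority in (0,1)
        while self.kept and self.kept[-1][1] <= pr:
            self.kept.pop()                      # dominated: lower priority, earlier time
        self.kept.append((ts, pr, e))

    def sample(self, window_start):
        self.kept = [x for x in self.kept if x[0] >= window_start]  # drop expired
        return self.kept[0][2] if self.kept else None               # highest priority
```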

Page 38: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Inference From a Sample

- What do we do with these samples?
- SRS samples can be used to estimate "population sums"
- If each element ei is a sales transaction and v(ei) is the dollar value of the transaction:
  - Sum: $\sum_{e_i \in W} v(e_i)$ = total sales of the transactions in W
  - Count: let h(ei) = 1 if v(ei) > $1000, else 0; then $\sum_{e_i \in W} h(e_i)$ = number of transactions in the window over $1000
- Can also do averages

Page 39: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

SRS Sampling

- To estimate a population sum from a SRS of size k, use the expansion estimator:

$$\hat{\Theta} = \frac{n}{k} \sum_{e_i \in S} h(e_i)$$

- To estimate an average, use the sample average:

$$\hat{\alpha} = \hat{\Theta}/n = \frac{1}{k} \sum_{e_i \in S} h(e_i)$$

- Also works for stratified sampling
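The two estimators in code (a sketch; names ours):

```python
def expansion_estimate(sample, n, h):
    """Estimate the population sum: scale the sample sum by n/k."""
    return (n / len(sample)) * sum(h(e) for e in sample)

def average_estimate(sample, h):
    """Estimate the population average: the plain sample average."""
    return sum(h(e) for e in sample) / len(sample)

# e.g. estimated count of transactions over $1000 in a window of n = 10000,
# from a SRS of transaction values:
# expansion_estimate(srs, 10000, h=lambda v: 1 if v > 1000 else 0)
```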

Page 40: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Estimating Different Results

- SRS sampling is good for estimating population sums and statistics
- But use different algorithms for different results:
  - Heavy Hitters: find elements (values) that occur commonly in the stream
  - Min-Hash: compute set resemblance

Page 41: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Heavy Hitters

- Goal: find all stream elements that occur in at least a fraction s of all transactions
- For example, find sourceIPs that occur in at least 1% of network flows
  - sourceIPs from which we are getting a lot of traffic

Page 42: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Heavy Hitters

- Divide the window into buckets of width w
- Current bucket id bcurrent = ⌈N/w⌉, where N is the current stream length
- Data structure D of entries (e, f, Δ):
  - e: element
  - f: estimated frequency
  - Δ: maximum possible error in f
- If we are looking for common sourceIPs in a network stream, D holds (sourceIP, f, Δ) entries

Page 43: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Heavy Hitters

- For each new element e:
  - If e exists in D: f = f + 1
  - If not: add a new entry (e, 1, bcurrent - 1)
- At each bucket boundary (when bcurrent changes): delete all entries (e, f, Δ) with f + Δ ≤ bcurrent
  - An element that occurs only once per bucket keeps getting deleted
- For threshold s, output items with f ≥ (s - ε)N, where w = ⌈1/ε⌉ and N is the stream size
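A sketch of the bucket-based procedure above (this matches the Lossy Counting algorithm of Manku and Motwani; names are ours):

```python
import math

def heavy_hitters(stream, s, eps):
    """Report elements occurring in at least a fraction s of the stream,
    using buckets of width w = ceil(1/eps)."""
    w = math.ceil(1 / eps)
    D = {}                                   # e -> [f, delta]
    N = 0
    for e in stream:
        N += 1
        b_current = math.ceil(N / w)
        if e in D:
            D[e][0] += 1                     # f = f + 1
        else:
            D[e] = [1, b_current - 1]        # new entry (e, 1, b_current - 1)
        if N % w == 0:                       # bucket boundary: prune
            D = {x: fd for x, fd in D.items() if fd[0] + fd[1] > b_current}
    return [x for x, (f, d) in D.items() if f >= (s - eps) * N]
```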

Page 44: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Min-Hash

- Resemblance of two sets A, B: ρ(A,B) = |A ∩ B| / |A ∪ B|
- A min-hash signature is a representation of a set from which one can estimate the resemblance of two sets
- Let h1, h2, ..., hn be hash functions
- si(A) = min{hi(a) | a ∈ A} (the minimum hash value of hi over A)
- Signature of A: S(A) = (s1(A), s2(A), ..., sn(A))

Page 45: CS 410/510 Data Streams Lecture  16:  Data-Stream Sampling: Basic Techniques and Results

Min-Hash

- Recall: ρ(A,B) = |A ∩ B| / |A ∪ B|, si(A) = min{hi(a) | a ∈ A}, S(A) = (s1(A), ..., sn(A))
- Resemblance estimator:

$$\hat{\rho}(A,B) = \frac{1}{n} \sum_{i=1}^{n} I(s_i(A), s_i(B)), \qquad I(x,y) = 1 \text{ if } x = y, \text{ else } 0$$

- Counts the fraction of positions where the min hash values are equal
- Can substitute the N minimum values of one hash function for the minimum values of N hash functions
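A sketch (names ours) using salted instances of Python's built-in hash as the hi; the estimate approaches the true resemblance as the number of hash functions grows.

```python
def minhash_signature(A, hash_funcs):
    """S(A) = (s_1(A), ..., s_n(A)) with s_i(A) = min over a in A of h_i(a)."""
    return [min(h(a) for a in A) for h in hash_funcs]

def resemblance_estimate(sig_a, sig_b):
    """Fraction of signature positions where the min-hash values agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Salted hash functions standing in for h_1 ... h_n (illustration only).
hash_funcs = [lambda a, salt=s: hash((salt, a)) for s in range(200)]
A, B = set(range(0, 60)), set(range(30, 90))
sig_a, sig_b = minhash_signature(A, hash_funcs), minhash_signature(B, hash_funcs)
print(resemblance_estimate(sig_a, sig_b))   # true resemblance = 30/90 = 1/3
```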