
Efficient Computation of Frequent and Top-k Elements in Data Streams

Ahmed Metwally

Divyakant Agrawal

Amr El Abbadi

Department of Computer Science

University of California, Santa Barbara

Motivation

Motivated by Internet advertising commissioners.

Before rendering an advertisement for a user, query the clicks stream for advertisements to display.

If the user's profile is not a frequent “clicker”, then s/he will probably not click any displayed advertisement.
– Show Pay-Per-Impression advertisements.

If the user's profile is a frequent “clicker”, then s/he may click a displayed advertisement.
– Show Pay-Per-Click advertisements.
– Retrieve top advertisements to choose what to display.

Problem Definition

Given an alphabet A and a stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN.

Top-k elements are the k elements with the highest frequency.

Both problems:
– Very related, though no integrated solution has been proposed
– Exact solution is O(min(N, |A|)) space, hence approximate variations
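As a point of reference for the definitions above, here is a minimal Python sketch of the exact solution, which keeps one counter per distinct element and therefore needs O(min(N, |A|)) space. The function names `exact_frequent` and `exact_top_k` and the sample stream are illustrative assumptions, not part of the slides.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Return every element whose frequency F exceeds phi * N (exact baseline)."""
    counts = Counter(stream)      # one counter per distinct element
    n = sum(counts.values())      # N, the stream size
    return sorted(e for e, f in counts.items() if f > phi * n)


def exact_top_k(stream, k):
    """Return the k (element, frequency) pairs with the highest frequency."""
    return Counter(stream).most_common(k)


stream = "ABBACABBDDBEC"          # illustrative stream, N = 13
print(exact_frequent(stream, 0.2))
print(exact_top_k(stream, 2))
```

The approximate variations on the following slides trade this exactness for a bounded number of counters.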

Practical Frequent Elements

ε-Deficient Frequent Elements [Manku ’02]:
– All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.

[Figure: the thresholds φN and (φ - ε)N]

Practical Top-k

FindApproxTop(S, k, ε) [Charikar ’02]:
– Retrieve a list of k elements such that every element, E_i, in the list has F_i > (1 - ε) F_k, where E_k is the kth ranked element.

[Figure: the thresholds F_4 and (1 - ε) F_4]
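The FindApproxTop condition can be checked against exact frequencies on a small stream. The sketch below is only an illustration of the definition; the function name `satisfies_find_approx_top` and the sample stream are assumptions introduced here.

```python
from collections import Counter


def satisfies_find_approx_top(stream, output, k, eps):
    """True iff `output` has k distinct elements, each with F_i > (1 - eps) * F_k."""
    counts = Counter(stream)
    ranked = sorted(counts.values(), reverse=True)
    f_k = ranked[k - 1]           # frequency of the kth ranked element
    return (len(set(output)) == k
            and all(counts[e] > (1 - eps) * f_k for e in output))


stream = "ABBACABBDDBEC"          # frequencies: B=5, A=3, C=2, D=2, E=1
print(satisfies_find_approx_top(stream, ["B", "C"], k=2, eps=0.5))
print(satisfies_find_approx_top(stream, ["B", "C"], k=2, eps=0.1))
```

With a loose ε, C (frequency 2) is an acceptable answer for k = 2 because 2 > (1 - 0.5) * 3; with ε = 0.1 it is not.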

Related Work

Algorithms Classification
– Counter-Based techniques
  • Keep an individual counter for each monitored element
  • If the observed ID is monitored, its counter is updated
  • If the observed ID is not monitored, algorithm dependent action
– Sketch-Based techniques
  • Estimate frequency for all elements using bit-maps of counters
  • Each element is hashed into the counters’ space using a family of hash functions.
  • Hashed-to counters are queried for the frequencies
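To make the sketch-based idea concrete, here is a small Count-Min-style sketch in Python. It is a generic illustration of "hash each element into shared counters, then query the hashed-to counters", and is not CountSketch, GroupTest, or any other algorithm named in these slides. The class name, the SHA-256 based hashing, and the width/depth values are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Count-Min-style sketch: `depth` rows of `width` shared counters."""

    def __init__(self, width, depth):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _slot(self, row, item):
        # One hash function per row, derived by salting SHA-256 with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        # Each element is hashed into one counter per row.
        for r, counters in enumerate(self.rows):
            counters[self._slot(r, item)] += 1

    def estimate(self, item):
        # Collisions only add to a counter, so the minimum never under-estimates.
        return min(counters[self._slot(r, item)]
                   for r, counters in enumerate(self.rows))


sketch = CountMinSketch(width=16, depth=4)
for x in "ABBACABBDDBEC":
    sketch.add(x)
print(sketch.estimate("B"))
```

Unlike a counter-based summary, the sketch answers a frequency query for any element, monitored or not, at the cost of collision noise.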

Recent Work (Comparison)

Algorithm                            Nature   Space Bound                                       Handles
CountSketch [Charikar ’02]           Sketch   O(k/ε^2 log(N/δ)), δ is the failure probability   FindApproxTop(S, k, ε)
GroupTest [Cormode ’03]              Sketch   O(φ^-1 log(φ^-1) log(|A|))                        Hot Items
Frequent [Demaine ’02]               Counter  O(1/ε), proved by [Bose ’03]                      Frequent Elements (FE)
Probabilistic-Inplace [Demaine ’02]  Counter  O(m), m is the available memory                   FindCandidateTop(S, k, m/2)
Lossy Counting [Manku ’02]           Counter  (1/ε) log(εN)                                     ε-Deficient FE
Sticky Sampling [Manku ’02]          Counter  (2/ε) log(φ^-1 δ^-1)                              ε-Deficient FE

Outline

Problem Definition
Space-Saving: Summarizing the Data Stream
Answering Frequent Elements Queries
Answering Top-k Queries
Experimental Results
Conclusion

The Space-Saving Algorithm

Space-Saving is counter-based
Monitor only m elements
Only over-estimation errors
Frequency estimation is more accurate for significant elements
Keep track of max. possible errors

Space-Saving By Example

Space-Saving Algorithm
– For every element in the stream S
  – If a monitored element is observed
    • Increment its Count
  – If a non-monitored element is observed,
    • Replace the element with minimum hits, min
    • Increment the minimum Count to min + 1
    • Set its error, the maximum possible over-estimation, to min

Stream S = A B B A C A B B D D B E C, monitored with m = 3 counters:

After A B B A C:
Element               A  B  C
Count                 2  2  1
error (max possible)  0  0  0

After A:
Element               A  B  C
Count                 3  2  1
error (max possible)  0  0  0

After B B:
Element               B  A  C
Count                 4  3  1
error (max possible)  0  0  0

After D (D replaces C, min = 1):
Element               B  A  D
Count                 4  3  2
error (max possible)  0  0  1

After D B:
Element               B  A  D
Count                 5  3  3
error (max possible)  0  0  1

After E (E replaces D, min = 3):
Element               B  E  A
Count                 5  4  3
error (max possible)  0  3  0

After C (C replaces A, min = 3):
Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3
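The Space-Saving steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `space_saving` is an assumption, the minimum is found by a linear scan rather than with the Stream-Summary structure, and ties for the minimum are broken arbitrarily (by dictionary order), so intermediate states may differ from the slide even though the final summary for this stream is the same.

```python
def space_saving(stream, m):
    """Summarize `stream` with at most m counters.

    Returns (element, Count, error) triples sorted by Count, descending.
    """
    counts = {}   # monitored element -> Count (may over-estimate)
    errors = {}   # monitored element -> maximum possible over-estimation
    for x in stream:
        if x in counts:
            counts[x] += 1                # monitored: increment its Count
        elif len(counts) < m:
            counts[x] = 1                 # a free counter is still available
            errors[x] = 0
        else:
            # Non-monitored: replace the element with minimum hits.
            victim = min(counts, key=counts.get)   # O(m) scan, for brevity only
            min_count = counts.pop(victim)
            errors.pop(victim)
            counts[x] = min_count + 1     # Count becomes min + 1
            errors[x] = min_count         # error is set to min
    return sorted(((e, counts[e], errors[e]) for e in counts),
                  key=lambda t: -t[1])


print(space_saving("ABBACABBDDBEC", m=3))
# [('B', 5, 0), ('E', 4, 3), ('C', 4, 3)]
```

The printed summary matches the final table of the example: B with Count 5 and error 0, E and C with Count 4 and error 3.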

Space-Saving Observations

For S = ABBACABBDDBEC, N = 13, m = 3:

Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3

Observations:
– The summation of the Counts is N (5 + 4 + 4 = 13).
– Minimum number of hits, min ≤ N/m. In this example, min = 4 ≤ 13/3.
– The minimum number of hits, min, is an upper bound on the error of any element (here every error ≤ 4).
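The three observations can be checked mechanically on the example summary. The helper below is a hypothetical sketch: the name `check_observations` and the (element, Count, error) tuple layout are assumptions made for illustration.

```python
def check_observations(summary, n, m):
    """Check the three Space-Saving observations on (element, Count, error) triples."""
    counts = [count for _, count, _ in summary]
    errors = [error for _, _, error in summary]
    min_count = min(counts)
    return {
        "counts_sum_to_n": sum(counts) == n,
        "min_at_most_n_over_m": min_count <= n / m,
        "errors_bounded_by_min": all(e <= min_count for e in errors),
    }


# Final summary for S = ABBACABBDDBEC with m = 3 counters.
summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
print(check_observations(summary, n=13, m=3))
```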


Space-Saving Proved Properties

Element  B  E  C
Count    5  4  4

1. If Element E has frequency F > min, then E must be in Stream-Summary. F(B) = F_1 = 5, min = 4.

2. The Count at position i in Stream-Summary is no less than F_i, the frequency of the ith ranked element. F(A) = F_2 = 3, Count_2 = 4.
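Both properties can be verified on the running example by comparing the summary with exact frequencies. This is a sanity check on one stream, not a proof; the function name `check_properties` and the tuple layout are assumptions.

```python
from collections import Counter


def check_properties(stream, summary):
    """Return (property_1_holds, property_2_holds) for a summary sorted by Count."""
    true_freq = Counter(stream)
    monitored = {element for element, _, _ in summary}
    min_count = min(count for _, count, _ in summary)

    # Property 1: every element with F > min is monitored.
    prop1 = all(e in monitored for e, f in true_freq.items() if f > min_count)

    # Property 2: Count at position i is no less than F_i, the ith ranked frequency.
    ranked = sorted(true_freq.values(), reverse=True)
    prop2 = all(count >= f for (_, count, _), f in zip(summary, ranked))
    return prop1, prop2


summary = [("B", 5, 0), ("E", 4, 3), ("C", 4, 3)]
print(check_properties("ABBACABBDDBEC", summary))
```

Note that position 2 holds E, not A: Property 2 bounds the frequency of the 2nd ranked element (A, with F_2 = 3) by Count_2 = 4 regardless of which element sits there.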



Space-Saving Data Structure

We need a data structure that
– Increments counters in constant time
– Keeps elements sorted by their counters

We propose the Stream-Summary structure, similar to the data structure in [Demaine ’02]
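One way to meet both requirements is a doubly linked list of buckets, one bucket per distinct Count value, with a hash table from each element to its bucket. The Python sketch below follows that idea; it is an assumption about the general shape of such a structure, not the authors' code. The class and method names are hypothetical, and the tie-breaking rule (evict the newest member of the minimum bucket) was chosen only because it reproduces the trace of the S = ABBACABBDDBEC example.

```python
class Bucket:
    """All monitored elements that currently share the same Count."""

    def __init__(self, count):
        self.count = count
        self.members = {}   # element -> error, in insertion order
        self.prev = None    # neighbouring bucket with a smaller count
        self.next = None    # neighbouring bucket with a larger count


class StreamSummary:
    def __init__(self, m):
        self.m = m          # number of counters, assumed >= 1
        self.head = None    # bucket holding the minimum count
        self.where = {}     # element -> bucket that currently holds it

    def _link_after(self, bucket, new):
        """Link `new` right after `bucket`, or at the head if `bucket` is None."""
        if bucket is None:
            new.next = self.head
            if self.head is not None:
                self.head.prev = new
            self.head = new
        else:
            new.prev, new.next = bucket, bucket.next
            if bucket.next is not None:
                bucket.next.prev = new
            bucket.next = new

    def _unlink(self, bucket):
        if bucket.prev is None:
            self.head = bucket.next
        else:
            bucket.prev.next = bucket.next
        if bucket.next is not None:
            bucket.next.prev = bucket.prev

    def _increment(self, element):
        """Move `element` to the bucket for count + 1; only neighbours are touched."""
        old = self.where[element]
        error = old.members.pop(element)
        target = old.next
        if target is None or target.count != old.count + 1:
            target = Bucket(old.count + 1)
            self._link_after(old, target)
        target.members[element] = error
        self.where[element] = target
        if not old.members:
            self._unlink(old)

    def offer(self, element):
        if element in self.where:
            self._increment(element)
        elif len(self.where) < self.m:
            if self.head is None or self.head.count != 1:
                self._link_after(None, Bucket(1))
            self.head.members[element] = 0
            self.where[element] = self.head
        else:
            bucket = self.head                     # minimum-count bucket
            victim, _ = bucket.members.popitem()   # newest member (assumed tie-break)
            del self.where[victim]
            bucket.members[element] = bucket.count  # error = min
            self.where[element] = bucket
            self._increment(element)               # Count becomes min + 1

    def report(self):
        """(element, Count, error) triples, highest Count first."""
        buckets, b = [], self.head
        while b is not None:
            buckets.append(b)
            b = b.next
        return [(e, bk.count, err) for bk in reversed(buckets)
                for e, err in bk.members.items()]


ss = StreamSummary(3)
for x in "ABBACABBDDBEC":
    ss.offer(x)
print(ss.report())
```

Because a Count only ever grows by one, an element moves at most to the adjacent bucket, which is what keeps each update at (expected) constant time while the buckets stay sorted.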

Frequent Elements Queries

Traverse Stream-Summary, and report all elements that satisfy the user support.

Any element whose guaranteed hits = (Count – error) > φN is guaranteed to be a frequent element.

Frequent Elements Example

For N = 73, m = 8, φ = 0.15:
– Frequent Elements should have a support of at least 11 hits (φN = 10.95).
– Candidate Frequent Elements are B, D, and G.
– Guaranteed Frequent Elements are B and D, since their guaranteed hits exceed φN.

Element                          B   D   G   A  Q  F  C  E
Count                            20  14  12  9  7  5  3  3
error                            1   0   4   1  3  0  1  2
Guaranteed Hits = Count - error  19  14  8   8  4  5  2  1
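The frequent-elements query on this example can be reproduced with a short traversal. The function name `frequent_elements` and the (element, Count, error) tuple layout are assumptions for illustration; the summary is taken to be sorted by Count, as Stream-Summary keeps it.

```python
def frequent_elements(summary, n, phi):
    """Return (candidates, guaranteed) for support phi * n.

    `summary` holds (element, Count, error) triples sorted by Count, descending.
    """
    threshold = phi * n
    candidates, guaranteed = [], []
    for element, count, error in summary:
        if count <= threshold:
            break                          # all later Counts are smaller still
        candidates.append(element)
        if count - error > threshold:      # guaranteed hits exceed the support
            guaranteed.append(element)
    return candidates, guaranteed


summary = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
           ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
print(frequent_elements(summary, n=73, phi=0.15))
# (['B', 'D', 'G'], ['B', 'D'])
```

G is reported as a candidate because its Count of 12 exceeds 10.95, but it is not guaranteed because its guaranteed hits are only 8.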


Frequent Elements Space Bounds

Space Bounds     General Distribution           Zipf(α)
Space-Saving     O(1/ε)                         (1/ε)^(1/α)
GroupTest        O(φ^-1 log(φ^-1) log(|A|))
Frequent         O(1/ε), proved by [Bose ’03]
Lossy Counting   (1/ε) log(εN)
Sticky Sampling  (2/ε) log(φ^-1 δ^-1)

Top-k Elements Queries

Traverse the Stream-Summary, and report top-k elements.

From Property 2, we assert:
– Guaranteed top-k elements:
  • Any element whose guaranteed hits = (Count – error) ≥ Count_{k+1} is guaranteed to be in the top-k.
– Guaranteed top-k’ (where k’ ≈ k):
  • The top-k’ elements reported are guaranteed to be the correct top-k’ iff for every element in the top-k’, guaranteed hits = (Count – error) ≥ Count_{k’+1}.

Top-k Elements Example

For k = 3, m = 8:

Element  B   D   G   A  Q  F  C  E
Count    20  14  12  9  7  5  3  3
error    1   0   4   1  3  0  1  2

– B, D, and G are the top-3 candidates.
– B and D are guaranteed to be in the top-3 (guaranteed hits 19 and 14 ≥ Count_4 = 9; G has only 8).
– B, D, G, and A are guaranteed to be the top-4 (guaranteed hits 19, 14, 8, 8 ≥ Count_5 = 7). Here k’ = 4.
– B and D are guaranteed to be the top-2 (guaranteed hits 19 and 14 ≥ Count_3 = 12). Another k’ = 2.
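The guarantees in this example follow directly from comparing guaranteed hits with Count_{k+1}. The sketch below assumes a summary of (element, Count, error) triples sorted by Count with more than k entries; the function names are hypothetical, and the check is restricted to the first k positions, which is one reading of the rule on the query slide.

```python
def guaranteed_in_top_k(summary, k):
    """Elements among the first k positions that are certain to be in the true top-k."""
    threshold = summary[k][1]     # Count_{k+1}; index k is position k + 1
    return [element for element, count, error in summary[:k]
            if count - error >= threshold]


def is_guaranteed_top(summary, k):
    """True iff the first k reported elements are certain to be the correct top-k."""
    return len(guaranteed_in_top_k(summary, k)) == k


summary = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
           ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
print(guaranteed_in_top_k(summary, 3))   # ['B', 'D']
print(is_guaranteed_top(summary, 3))     # False
print(is_guaranteed_top(summary, 4))     # True
print(is_guaranteed_top(summary, 2))     # True
```

Trying nearby values of k in this way is how a guaranteed top-k’ with k’ close to k can be found when the top-k itself cannot be guaranteed.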


Top-k Elements Space Bounds

Space-Saving
– General Distribution: FindApproxTop(S, k, ε): O(k/ε * log(N))
– Zipf(α): Exact Top-k Problem:
  • α = 1: O(k^2 log(A))
  • α > 1: O((k/α)^(1/α) k)

CountSketch
– General Distribution: FindApproxTop(S, k, ε): O(k/ε^2 * log(N/δ))
– Zipf(α): FindApproxTop(S, k, ε): α ≥ 1: O(k * log(N/δ))


Experimental Results - Setup

Synthetic data:
– Zipf(α), α varied: 0.0, 0.5, 1.0, …, 2.5, 3.0
– N = 10^7 hits.

Real Data (ValueClick, Inc.): Similar results

Precision:
– number of correct elements found / entire output

Recall:
– number of correct elements found / number of actual correct elements

Run time:
– Processing Stream + Query Time

Space used:
– Including hash table
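The two quality metrics defined above can be computed as follows. The function name `precision_recall` is an assumption, and so is the convention of returning 1.0 when the output or the set of correct elements is empty; the slides do not say how those cases were handled.

```python
def precision_recall(reported, correct):
    """Precision = found / |reported|, recall = found / |correct|."""
    reported, correct = set(reported), set(correct)
    found = len(reported & correct)   # number of correct elements found
    precision = found / len(reported) if reported else 1.0   # assumed convention
    recall = found / len(correct) if correct else 1.0        # assumed convention
    return precision, recall


# Illustrative values: three elements reported, two of them actually correct.
print(precision_recall(["B", "D", "G"], ["B", "D"]))
```

Here recall is 1.0 because every correct element was found, while precision is 2/3 because one reported element was not correct.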

Frequent Elements Results

Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2

We compared with
– GroupTest and Frequent

All algorithms had a recall of 1.
– That is, every correct element appeared in their output.

Space-Saving was able to guarantee all its output to be correct

Frequent Elements Precision

[Figure: Precision for Frequent Elements (>100,000 Hits) on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Precision (0 to 1); series: Space-Saving, GroupTest, Frequent.]

Frequent Elements Run Time

[Figure: Run Time for Frequent Elements (>100,000 Hits) on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms).]


Frequent Elements Space Used

[Figure: Space Used for Frequent Elements (>100,000 Hits) on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes).]


Top-k Elements Results

Query: k = 100, ε = 10^-4, and δ = 10^-2

We compared with
– CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality.
– Probabilistic-InPlace: was allowed the same number of counters as Space-Saving

Space-Saving was able to guarantee all its output to be correct

Top-k Elements Precision

[Figure: Precision for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Precision (0 to 1); series: Space-Saving, CountSketch, Probabilistic InPlace.]

Top-k Elements Recall

[Figure: Recall for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Recall (0 to 1).]


Top-k Elements Run Time

[Figure: Run Time for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Run Time (ms).]


Top-k Elements Space Used

[Figure: Space Used for Top-100 on Synthetic Data. x-axis: Zipf Alpha (0 to 3); y-axis: Space Used (Bytes).]


Conclusion

Contributions:
– An integrated approach to solve an interesting family of problems
– Strict error bounds using little space
– Guarantees on results
– Special attention was given to Zipfian data
– Experimental validation

Future Work:
– Incremental frequent and top-k elements reporting
