
Ph.D. Defence University of Alberta 1

Approximation Algorithms for Frequency Related Query Processing on Streaming Data

Presented by Fan Deng

Supervisor: Dr. Davood Rafiei

May 30, 2007


Outline

• Introduction
• Continuous membership query
• Point query
• Similarity self-join size estimation
• Conclusions and future work


Data stream

• A sequence of data records
• Examples
– Document/URL streams from a Web crawler
– IP packet streams
– Web advertisement click streams
– Sensor reading streams
– ...


Processing in one pass

• One pass processing
– Online stream (one scan required)
– Massive offline stream (one scan preferred)
• Challenges
– Huge data volume
– Fast processing requirement
– Relatively small fast storage space


Approximation algorithms

• Exact query answers
– can be slow to obtain
– may need large storage space
– sometimes are not necessary
• Approximate query answers
– can take much less time
– may need less space
– with acceptable errors


Frequency related queries

• Frequency
– # of occurrences

• Continuous membership query

• Point query

• Similarity self-join size estimation


Outline

• Introduction
• Continuous membership query [SIGMOD’06]
– Motivating application
– Problem statement
– Our theoretical and experimental results
• Point query
• Similarity self-join size estimation
• Conclusions and future work


A Motivating Application

• Duplicate URL detection in Web crawling
• Search engines [Broder et al. WWW03]
– Fetch web pages continuously
– Extract URLs within each downloaded page
– Check each URL (duplicate detection)
  • If never seen before
  • Then fetch it
  • Else skip it


A Motivating Application (cont.)

• Problems
– Huge number of distinct URLs
– Memory is usually not large enough
– Disks are slow
• Errors are usually acceptable
– A false positive (missed URLs)
– A false negative (redundant crawls or disk search)


Problem statement

• A sequence of elements with order
– Example stream: … d g a f b e a d c b a
• Storage space M
– Not large enough to store all distinct elements
• Continuous membership query
– Appeared before? Yes or No
• Our goal
– Minimize the # of errors
– Fast


SBF (Stable Bloom Filter) theoretical results

• SBF will be stable
– The expected # of “0”s will become a constant after a number of updates
– Converge at an exponential rate
– Monotonic decreasing

• False positive rates become constant
• An upper bound of false positive rates
– (a function of 4 parameters: SBF size, # of hash functions, max cell values, and kick-out rates)
• Setting the optimal parameters (partially empirical)
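
The four parameters listed above can be made concrete with a small Python sketch. The slides do not spell out the update rule, so the one below (probe the hashed cells, decrement a few randomly chosen cells, then set the hashed cells to the maximum value) is an assumption about one common formulation rather than the exact algorithm from the thesis. The class name, parameter names, and the salted SHA-256 hashing are hypothetical choices made for illustration.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative Stable Bloom Filter; names and update rule are assumptions."""

    def __init__(self, num_cells, num_hashes, max_value, kick_out, seed=0):
        self.cells = [0] * num_cells   # SBF size
        self.num_hashes = num_hashes   # number of hash functions
        self.max_value = max_value     # maximum cell value
        self.kick_out = kick_out       # cells decremented per update
        self.rng = random.Random(seed)

    def _positions(self, item):
        # Derive one cell index per hash function from a salted SHA-256 digest.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.cells))
        return positions

    def seen_before(self, item):
        """Answer "appeared before?" for item, then record it in the filter."""
        positions = self._positions(item)
        # Report a duplicate only if every probed cell is non-zero.
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Kick out: decrement randomly chosen cells so that stale entries fade.
        for _ in range(self.kick_out):
            p = self.rng.randrange(len(self.cells))
            if self.cells[p] > 0:
                self.cells[p] -= 1
        # Record the current element by setting its cells to the maximum value.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate
```

Because cells are decremented on every update, an element that was seen long ago can be forgotten, which produces a false negative; hash collisions still produce false positives as in an ordinary Bloom filter.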


SBF experimental results (cont.)

• Comparison of SBF and the FPBuffering method (LRU)
– ~ 700M real URL fingerprints
• SBF generates 3-13% fewer false negatives, with the same # of false positives (<10%)
• MIN [Broder et al. WWW03], theoretically optimal
– assumes “the entire sequence of requests is known in advance”
– beats LRU caching by <5% in most cases
• The more false positives are allowed, the more SBF gains


Outline

• Introduction
• Continuous membership query
• Point query [to be submitted]
– Motivating application
– Problem statement
– Theoretical and experimental results
• Similarity self-join size estimation
• Conclusions and future work


Motivating application

• Internet traffic monitoring
– Query the # of IP packets sent by a particular IP address in the past hour
• Phone call record analysis
– Query the # of calls to a given phone # yesterday


Problem statement

• Point query
– Summarize a stream of elements
– Estimate the frequency of a given element
• Goal: minimize the space cost and answer the query fast


CMM theoretical results

• Unbiased estimate (deduct mean)
• Estimate variance is the same as that of Fast-AGMS, a well-known method (when the mean is deducted)
• For less skewed data sets
– the estimation accuracies of CMM and Fast-AGMS are exactly the same


CMM experimental results and analysis

• For skewed data sets
– Accuracy (given the same space): CMM-median = Fast-AGMS > CMM-mean
• Advantage of CMM
– 2 estimates from 1 sketch
– More flexible (with estimate upper bound)
– More powerful (Count-min can be more accurate for the very skewed data set)
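
As a rough illustration of "2 estimates from 1 sketch", the Python sketch below keeps a single Count-Min style counter array and answers a point query in two ways: the classic minimum over rows, and a mean-deducted estimate that subtracts the average value of the other counters in each row, takes the median across rows, and caps the result at the minimum. This is a hedged reading of the bullets above, not the thesis's exact estimator; the class name, method names, noise formula, and salted SHA-256 hashing are all assumptions.

```python
import hashlib
import statistics


class CountMinSketch:
    """Illustrative counter sketch; estimator details are assumptions."""

    def __init__(self, depth, width):
        self.depth = depth    # number of rows (hash functions)
        self.width = width    # counters per row, must be at least 2
        self.rows = [[0] * width for _ in range(depth)]
        self.total = 0        # total count of all elements seen so far

    def _col(self, row, item):
        # Pick the counter for item in the given row via a salted digest.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for r in range(self.depth):
            self.rows[r][self._col(r, item)] += count

    def estimate_min(self, item):
        # Classic Count-Min answer: never below the true frequency.
        return min(self.rows[r][self._col(r, item)] for r in range(self.depth))

    def estimate_cmm(self, item):
        # Deduct the mean of the other counters in each row as estimated noise,
        # take the median across rows, and cap it at the Count-Min upper bound.
        residues = []
        for r in range(self.depth):
            counter = self.rows[r][self._col(r, item)]
            noise = (self.total - counter) / (self.width - 1)
            residues.append(counter - noise)
        return min(statistics.median(residues), self.estimate_min(item))
```

Replacing `statistics.median` with `statistics.mean` would give the mean-based variant that the accuracy comparison above ranks lower on skewed data.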


Outline

• Introduction
• Continuous membership query
• Point query
• Similarity self-join size estimation [submitted to VLDB’07]
– Motivating application
– Problem statement
– Theoretical and experimental results
• Conclusions and future work


Motivating application

• Near-duplicate document detection for search engines [Broder 99, Henzinger 06]
– Very slow (30M pages, 10 days in 1997; 2006?)
– To predict the processing time, it is necessary to estimate the number of similar pairs
• Data cleaning in general (similarity self-join)
– To find a better query plan (query optimization)
– Estimates of similarity self-join size are needed


Problem statement

• Similarity self-join size
– Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar
• An s-similar pair
– A pair of records with s attributes in common
– E.g. <Davood, Rafiei, CS, UofA, Canada> & <Fan, Deng, CS, UofA, Canada> are 3-similar
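
To make the definition concrete, here is a brute-force Python sketch that computes the exact similarity self-join size by checking every pair. It is only a baseline for small inputs, not the estimation method presented in the talk, and it assumes attributes are compared position by position; the function names are hypothetical.

```python
from itertools import combinations


def similarity(a, b):
    """Number of attribute positions on which two records agree."""
    return sum(x == y for x, y in zip(a, b))


def similarity_self_join_size(records, s):
    """Exact number of unordered record pairs that are at least s-similar."""
    return sum(1 for a, b in combinations(records, 2) if similarity(a, b) >= s)


# The two records from the slide agree on CS, UofA and Canada.
r1 = ("Davood", "Rafiei", "CS", "UofA", "Canada")
r2 = ("Fan", "Deng", "CS", "UofA", "Canada")
print(similarity(r1, r2))  # 3
```

The pairwise check is quadratic in the number of records, which is why an estimate of the join size is attractive for large data sets.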


Theoretical results

• Unbiased estimate

• Standard deviation bound of the estimate

• Time and space cost

(For both offline and online SimPairCount)


Experimental results

• Online SimPairCount vs. Random sampling
– Given the same amount of space
– Error = (estimate - trueValue) / trueValue
– Dataset:
  • DBLP paper titles
  • Each converted into a record with 6 attributes
  • Using min-wise independent hashing
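
The Python sketch below shows one plausible way to turn a title into a 6-attribute record with min-hashing, together with the error formula above. The slides do not say how titles were tokenized or which hash family was used, so word-level tokens and salted SHA-256 (standing in for a min-wise independent family) are assumptions, and both function names are hypothetical.

```python
import hashlib


def min_hash_record(title, num_attributes=6):
    """Map a non-empty title to a record of num_attributes min-hash values."""
    words = set(title.lower().split())
    record = []
    for i in range(num_attributes):
        # Attribute i is the smallest hash value over the title's words
        # under the i-th salted hash function.
        record.append(min(
            int.from_bytes(hashlib.sha256(f"{i}:{w}".encode()).digest()[:8], "big")
            for w in words
        ))
    return tuple(record)


def relative_error(estimate, true_value):
    """Signed relative error: positive means overestimate, negative underestimate."""
    return (estimate - true_value) / true_value
```

Under this scheme, two titles with many words in common tend to agree on more of the 6 attributes, so near-duplicate titles become s-similar records for large s.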


Similarity self-join size estimation – Experimental results (cont.)

Dataset   K   Random sampling error   SimPairCount error
200K      6         56.00%                  0.69%
200K      5        -25.47%                 16.89%
200K      4        -66.25%                -16.86%
300K      6        -10.89%                  1.28%
300K      5         68.28%                 34.06%
300K      4         63.14%                -27.56%
400K      6        -12.56%                  1.59%
400K      5         83.87%                  2.91%
400K      4        -53.56%                 -2.94%


Conclusions and future work

• Streaming algorithms
– found real applications (important)
– can lead to theoretical results (fun)
– More work to be done
• Current direction: multi-dimensional streaming algorithms
– E.g. estimating the # of outliers in one pass


Questions/Comments?

Thanks!