Ph.D. Defence, University of Alberta

Approximation Algorithms for Frequency Related Query Processing on Streaming Data

Presented by Fan Deng

Supervisor: Dr. Davood Rafiei

May 30, 2007

Outline

• Introduction
• Continuous membership query
• Point query
• Similarity self-join size estimation
• Conclusions and future work

Data stream

• A sequence of data records
• Examples
  – Document/URL streams from a Web crawler
  – IP packet streams
  – Web advertisement click streams
  – Sensor reading streams
  – ...

Processing in one pass

• One pass processing
  – Online stream (one scan required)
  – Massive offline stream (one scan preferred)
• Challenges
  – Huge data volume
  – Fast processing requirement
  – Relatively small fast storage space

Approximation algorithms

• Exact query answers
  – can be slow to obtain
  – may need large storage space
  – sometimes are not necessary
• Approximate query answers
  – can take much less time
  – may need less space
  – with acceptable errors

Frequency related queries

• Frequency
  – # of occurrences

• Continuous membership query

• Point query

• Similarity self-join size estimation

Outline

• Introduction
• Continuous membership query [SIGMOD’06]
  – Motivating application
  – Problem statement
  – Our theoretical and experimental results
• Point query
• Similarity self-join size estimation
• Conclusions and future work

A Motivating Application

• Duplicate URL detection in Web crawling
• Search engines [Broder et al. WWW03]
  – Fetch web pages continuously
  – Extract URLs within each downloaded page
  – Check each URL (duplicate detection)
    • If never seen before
    • Then fetch it
    • Else skip it
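The check-then-fetch loop above can be sketched as follows. This is a minimal illustration that keeps every URL in an exact in-memory set, which is precisely what stops being feasible at crawler scale; the function name `crawl_order` is hypothetical and not from the talk.

```python
def crawl_order(extracted_urls):
    """Return the URLs that would be fetched, in arrival order, skipping exact repeats."""
    seen = set()      # exact membership; grows with the number of distinct URLs
    to_fetch = []
    for url in extracted_urls:
        if url not in seen:   # never seen before: fetch it
            seen.add(url)
            to_fetch.append(url)
        # otherwise the URL was seen before: skip it
    return to_fetch
```

For example, `crawl_order(["a", "b", "a", "c", "b"])` keeps only the first occurrence of each URL.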

A Motivating Application (cont.)

• Problems
  – Huge number of distinct URLs
  – Memory is usually not large enough
  – Disks are slow
• Errors are usually acceptable
  – A false positive (missed URLs)
  – A false negative (redundant crawls or disk search)

Problem statement

• A sequence of elements with order
• Storage space M
  – Not large enough to store all distinct elements
• Continuous membership query: for each arriving element, has it appeared before? Yes or No
  – Example stream: … d g a f b e a d c b a
• Our goal
  – Minimize the # of errors
  – Fast

SBF theoretical results

• SBF will be stable
  – The expected # of “0”s will become a constant after a number of updates
  – Converges at an exponential rate
  – Decreases monotonically
• False positive rates become constant
• An upper bound of false positive rates
  – (a function of 4 parameters: SBF size, # of hash functions, max cell values, and kick-out rates)
• Setting the optimal parameters (partially empirical)
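To make the four parameters concrete, here is a minimal sketch of a Stable Bloom Filter, taking SBF to be the structure from the SIGMOD’06 paper cited in the outline. The class and argument names are hypothetical, seeded SHA-256 stands in for the hash family, and decrementing independently chosen random cells is an assumption about how the kick-out step is realized.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative SBF: small counters that are randomly decayed on every update."""

    def __init__(self, num_cells, num_hashes, max_value, num_decrements, seed=0):
        self.cells = [0] * num_cells          # SBF size
        self.num_hashes = num_hashes          # number of hash functions
        self.max_value = max_value            # max cell value
        self.num_decrements = num_decrements  # cells decayed per update (kick-out rate)
        self.rng = random.Random(seed)

    def _positions(self, item):
        # One seeded hash per function; an assumption, not the talk's hash family.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.cells))
        return positions

    def seen_before(self, item):
        """Answer the membership query for item, then record its arrival."""
        positions = self._positions(item)
        duplicate = all(self.cells[p] > 0 for p in positions)
        # Kick-out step: decay randomly chosen cells so stale items fade away.
        for _ in range(self.num_decrements):
            p = self.rng.randrange(len(self.cells))
            if self.cells[p] > 0:
                self.cells[p] -= 1
        # Record the current item by setting its cells to the maximum value.
        for p in positions:
            self.cells[p] = self.max_value
        return duplicate
```

Because cells are decayed, an element seen long ago can be reported as new (a false negative), while hash collisions can make a new element look seen (a false positive). Trading one against the other is what the parameter setting controls.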

SBF experimental results (cont.)

• Comparison of SBF and the FPBuffering method (LRU)
  – ~700M real URL fingerprints
• SBF generates 3-13% fewer false negatives, with the same # of false positives (<10%)
• MIN [Broder et al. WWW03], theoretically optimal
  – assumes “the entire sequence of requests is known in advance”
  – beats LRU caching by <5% in most cases
• The more false positives allowed, the more SBF gains
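For comparison, a fingerprint buffer with LRU replacement might look like the sketch below. This is one plausible reading of the baseline, not the exact FPBuffering variant used in the experiments; the class name and the `capacity` parameter are hypothetical.

```python
from collections import OrderedDict


class LRUDuplicateDetector:
    """Illustrative baseline: remember only the most recently used fingerprints."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = OrderedDict()   # keys kept in least-to-most recently used order

    def seen_before(self, fingerprint):
        if fingerprint in self.buffer:
            self.buffer.move_to_end(fingerprint)   # refresh recency on a hit
            return True
        self.buffer[fingerprint] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)        # evict the least recently used entry
        return False
```

With `capacity=2`, the stream a, b, a, c, b reports the final b as new because it was evicted when c arrived, which is the kind of false negative the comparison counts.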

Outline

• Introduction
• Continuous membership query
• Point query [to be submitted]
  – Motivating application
  – Problem statement
  – Theoretical and experimental results
• Similarity self-join size estimation
• Conclusions and future work

Motivating application

• Internet traffic monitoring
  – Query the # of IP packets sent by a particular IP address in the past hour
• Phone call record analysis
  – Query the # of calls to a given phone # yesterday

Problem statement

• Point query
  – Summarize a stream of elements
  – Estimate the frequency of a given element
• Goal: minimize the space cost and answer the query fast

CMM theoretical results

• Unbiased estimate (deduct mean)

• Estimate variance is the same as that of Fast-AGMS, a well-known method (in the case of deducting the mean)
• For less skewed data sets
  – the estimation accuracies of CMM and Fast-AGMS are exactly the same

CMM experimental results and analysis

• For skewed data sets
  – Accuracy (given the same space): CMM-median = Fast-AGMS > CMM-mean
• Advantage of CMM
  – 2 estimates from 1 sketch
  – More flexible (with estimate upper bound)
  – More powerful (Count-min can be more accurate for the very skewed data set)
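The "2 estimates from 1 sketch" point can be illustrated with a counter table that answers both a Count-min query and a mean-deducted query. This is a minimal sketch under stated assumptions: CMM is taken to use the same counter table as Count-min, the noise in each row is estimated as the average of the other counters, and the result is clamped between 0 and the Count-min value. The class name, the `combine` argument, and the seeded SHA-256 hashes are hypothetical.

```python
import hashlib
import statistics


class CounterTable:
    """Illustrative depth x width counter table supporting two point-query estimates."""

    def __init__(self, depth, width):
        if width < 2:
            raise ValueError("width must be at least 2")
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        self.total = 0   # stream length seen so far

    def _column(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for r in range(self.depth):
            self.rows[r][self._column(r, item)] += count

    def _counters(self, item):
        return [self.rows[r][self._column(r, item)] for r in range(self.depth)]

    def count_min(self, item):
        """Count-min estimate: never below the true frequency."""
        return min(self._counters(item))

    def mean_deducted(self, item, combine=statistics.median):
        """Deduct the estimated per-counter noise in each row, then combine the rows."""
        residues = [c - (self.total - c) / (self.width - 1) for c in self._counters(item)]
        estimate = combine(residues)
        # Assumed clamping: Count-min serves as the upper bound, 0 as the lower bound.
        return min(max(estimate, 0), self.count_min(item))
```

Passing `combine=statistics.mean` gives the mean-based variant, and the default gives the median-based one, so the CMM-median versus CMM-mean comparison above amounts to a one-argument change in this sketch.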

Outline

• Introduction
• Continuous membership query
• Point query
• Similarity self-join size estimation [submitted to VLDB’07]
  – Motivating application
  – Problem statement
  – Theoretical and experimental results
• Conclusions and future work

Motivating application

• Near-duplicate document detection for search engines [Broder 99, Henzinger 06]
  – Very slow (30M pages, 10 days in 1997; 2006?)
  – To predict the processing time, it is necessary to estimate the number of similar pairs
• Data cleaning in general (similarity self-join)
  – To find a better query plan (query optimization)
  – An estimate of the similarity self-join size is needed

Problem statement

• Similarity self-join size
  – Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar
• An s-similar pair
  – A pair of records with s attributes in common
  – E.g. <Davood, Rafiei, CS, UofA, Canada> & <Fan, Deng, CS, UofA, Canada> are 3-similar
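The definition above can be checked with a brute-force count. This is a sketch of the exact quadratic computation that the estimation is meant to avoid, assuming attributes are compared position by position; the function names are hypothetical.

```python
from itertools import combinations


def similarity(a, b):
    """Number of attribute positions on which two records agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def similarity_self_join_size(records, s):
    """Exact # of record pairs that are at least s-similar, examining every pair."""
    return sum(1 for a, b in combinations(records, 2) if similarity(a, b) >= s)


r1 = ("Davood", "Rafiei", "CS", "UofA", "Canada")
r2 = ("Fan", "Deng", "CS", "UofA", "Canada")
print(similarity(r1, r2))  # the two example records agree on CS, UofA, Canada
```

Examining all pairs costs time quadratic in the number of records, which is why a one-pass estimate is attractive.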

Theoretical results

• Unbiased estimate

• Standard deviation bound of the estimate

• Time and space cost

(For both offline and online SimPairCount)

Experimental results

• Online SimPairCount vs. random sampling
  – Given the same amount of space
  – Error = (estimate - trueValue) / trueValue
  – Dataset:
    • DBLP paper titles
    • Each converted into a record with 6 attributes
    • Using min-wise independent hashing
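The conversion of a title into a 6-attribute record and the error measure can be sketched as follows. This is an illustration under assumptions: titles are tokenized into lowercase words, and seeded SHA-256 stands in for a min-wise independent hash family; the function names are hypothetical and the actual preprocessing used in the experiments is not given here.

```python
import hashlib


def minhash_record(title, num_attributes=6):
    """Map a non-empty title to a fixed-length record: one minimum hash value per hash function."""
    words = set(title.lower().split())
    if not words:
        raise ValueError("title must contain at least one word")
    record = []
    for i in range(num_attributes):
        record.append(min(
            int.from_bytes(hashlib.sha256(f"{i}:{w}".encode("utf-8")).digest()[:8], "big")
            for w in words
        ))
    return tuple(record)


def relative_error(estimate, true_value):
    """Signed relative error, as defined above: (estimate - trueValue) / trueValue."""
    return (estimate - true_value) / true_value
```

Titles that share most of their words tend to agree on more attributes, so s-similarity of the records reflects the overlap of the titles. A negative error means the estimate is below the true value.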

Similarity self-join size estimation – Experimental results (cont.)

Dataset   K   Random sampling error   SimPairCount error
          6         56.00%                  0.69%
          5        -25.47%                 16.89%
          4        -66.25%                -16.86%

          6        -10.89%                  1.28%
          5         68.28%                 34.06%
          4         63.14%                -27.56%

          6        -12.56%                  1.59%
          5         83.87%                  2.91%
          4        -53.56%                 -2.94%

Conclusions and future work

• Streaming algorithms
  – found real applications (important)
  – can lead to theoretical results (fun)
  – More work to be done
• Current direction: multi-dimensional streaming algorithms
• E.g., estimating the # of outliers in one pass

Questions/Comments?

Thanks!
