Ph.D. Defence University of Alberta 1
Approximation Algorithms for Frequency Related Query
Processing on Streaming Data
Presented by Fan Deng
Supervisor: Dr. Davood Rafiei
May 30, 2007
Outline
• Introduction
• Continuous membership query
• Point query
• Similarity self-join size estimation
• Conclusions and future work
Data stream
• A sequence of data records
• Examples
  – Document/URL streams from a Web crawler
  – IP packet streams
  – Web advertisement click streams
  – Sensor reading streams
  – ...
Processing in one pass
• One-pass processing
  – Online stream (one scan required)
  – Massive offline stream (one scan preferred)
• Challenges
  – Huge data volume
  – Fast processing requirement
  – Relatively small fast storage space
Approximation algorithms
• Exact query answers
  – can be slow to obtain
  – may need large storage space
  – are sometimes not necessary
• Approximate query answers
  – can take much less time
  – may need less space
  – come with acceptable errors
Frequency related queries
• Frequency
  – # of occurrences
• Continuous membership query
• Point query
• Similarity self-join size estimation
Outline
• Introduction
• Continuous membership query [SIGMOD’06]
  – Motivating application
  – Problem statement
  – Our theoretical and experimental results
• Point query
• Similarity self-join size estimation
• Conclusions and future work
A Motivating Application
• Duplicate URL detection in Web crawling
• Search engines [Broder et al. WWW03]
  – Fetch web pages continuously
  – Extract URLs within each downloaded page
  – Check each URL (duplicate detection)
    • If never seen before
    • Then fetch it
    • Else skip it
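The check-then-fetch loop above can be sketched with an exact set as a baseline. This is only an illustration: the function name `crawl_order` and the sample URLs are hypothetical, and the exact set is precisely the structure that stops fitting in memory once the number of distinct URLs grows.

```python
def crawl_order(discovered_urls):
    """Return the URLs that would be fetched, in order, skipping duplicates."""
    seen = set()       # exact membership; grows with the # of distinct URLs
    to_fetch = []
    for url in discovered_urls:
        if url not in seen:    # never seen before: fetch it
            seen.add(url)
            to_fetch.append(url)
        # else: already seen, skip it
    return to_fetch


# Hypothetical stream of URLs extracted from downloaded pages.
urls = ["a.example", "b.example", "a.example", "c.example", "b.example"]
print(crawl_order(urls))
```

An approximate structure replaces `seen` when memory is too small, at the cost of the false positives and false negatives discussed on the next slide.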
A Motivating Application (cont.)
• Problems
  – Huge number of distinct URLs
  – Memory is usually not large enough
  – Disks are slow
• Errors are usually acceptable
  – A false positive (a new URL reported as seen) causes a missed URL
  – A false negative (a seen URL reported as new) causes a redundant crawl or disk search
Problem statement
• A sequence of elements with order
• Storage space M
  – Not large enough to store all distinct elements
• Continuous membership query
  – For each arriving element: appeared before? Yes or No
  – Example stream: … d g a f b e a d c b a
• Our goal
  – Minimize the # of errors
  – Fast
SBF theoretical results
• SBF (Stable Bloom Filter) will be stable
  – The expected # of “0”s will become a constant after a number of updates
  – Converges at an exponential rate
  – Monotonically decreasing
• False positive rates become constant
• An upper bound on the false positive rate
  – (a function of 4 parameters: SBF size, # of hash functions, max cell value, and kick-out rate)
• Setting the optimal parameters (partially empirical)
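To make the four parameters concrete, here is a minimal sketch of a Bloom-filter variant with small counters and random eviction. The slides do not give the update rule, so the details below are assumptions: the class and method names, the order of operations (check, then decrement `kick_out` random cells, then set the hashed cells to `max_value`), the choice of independent random cells to decrement, and the use of salted `blake2b` as a stand-in hash family.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative sketch only; parameter names mirror the slide's 4 parameters."""

    def __init__(self, num_cells, num_hashes, max_value, kick_out, seed=0):
        self.cells = [0] * num_cells      # SBF size
        self.num_hashes = num_hashes      # # of hash functions
        self.max_value = max_value        # max cell value
        self.kick_out = kick_out          # cells decremented per update (assumed meaning)
        self.rng = random.Random(seed)    # seeded so runs are repeatable

    def _indexes(self, item):
        # One salted hash per hash function (stand-in for a real hash family).
        idx = []
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=str(i).encode()
            ).digest()
            idx.append(int.from_bytes(digest, "big") % len(self.cells))
        return idx

    def seen_before(self, item):
        """Answer the membership query for item, then record it."""
        idx = self._indexes(item)
        duplicate = all(self.cells[j] > 0 for j in idx)
        # Evict stale information: decrement randomly chosen cells.
        for _ in range(self.kick_out):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Record the current element.
        for j in idx:
            self.cells[j] = self.max_value
        return duplicate


sbf = StableBloomFilter(num_cells=1024, num_hashes=3, max_value=3, kick_out=10)
print([sbf.seen_before(x) for x in ["d", "g", "a", "a"]])
```

Because cells are continually decremented, old elements are eventually forgotten, which is where false negatives come from; cells set by other elements are where false positives come from.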
SBF experimental results (cont.)
• Comparison of SBF and the FPBuffering method (LRU)
  – ~700M real URL fingerprints
• SBF generates 3-13% fewer false negatives with the same # of false positives (<10%)
• MIN [Broder et al. WWW03], theoretically optimal
  – assumes “the entire sequence of requests is known in advance”
  – beats LRU caching by <5% in most cases
• The more false positives are allowed, the more SBF gains
Outline
• Introduction
• Continuous membership query
• Point query [to be submitted]
  – Motivating application
  – Problem statement
  – Theoretical and experimental results
• Similarity self-join size estimation
• Conclusions and future work
Motivating application
• Internet traffic monitoring
  – Query the # of IP packets sent by a particular IP address in the past hour
• Phone call record analysis
  – Query the # of calls to a given phone # yesterday
Problem statement
• Point query
  – Summarize a stream of elements
  – Estimate the frequency of a given element
• Goal: minimize the space cost and answer the query fast
CMM theoretical results
• Unbiased estimate (deduct mean)
• Estimate variance is the same as that of Fast-AGMS, a well-known method (in the case of deducting the mean)
• For less skewed data sets
  – the estimation accuracies of CMM and Fast-AGMS are exactly the same
CMM experimental results and analysis
• For skewed data sets
  – Accuracy (given the same space): CMM-median = Fast-AGMS > CMM-mean
• Advantages of CMM
  – 2 estimates from 1 sketch
  – More flexible (with estimate upper bound)
  – More powerful (Count-min can be more accurate for very skewed data sets)
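The "2 estimates from 1 sketch" idea can be illustrated with a Count-Min style counter array that answers a point query in two ways. The slides do not define CMM's estimator, so this is a sketch under assumptions: "deduct mean" is taken to mean subtracting the average value of the other counters in the same row, and the median over rows is returned. The class name `FrequencySketch` and the salted `blake2b` hashing are hypothetical choices.

```python
import hashlib
from statistics import median


class FrequencySketch:
    """depth x width counter array shared by both estimators (width >= 2)."""

    def __init__(self, depth, width):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]
        self.total = 0  # total # of occurrences added so far

    def _col(self, row, item):
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=str(row).encode()
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for r in range(self.depth):
            self.rows[r][self._col(r, item)] += count

    def count_min(self, item):
        """Count-Min estimate: never below the true frequency (an upper bound)."""
        return min(self.rows[r][self._col(r, item)] for r in range(self.depth))

    def mean_deducted(self, item):
        """Assumed 'deduct mean' estimate: counter minus average noise, median over rows."""
        estimates = []
        for r in range(self.depth):
            c = self.rows[r][self._col(r, item)]
            noise = (self.total - c) / (self.width - 1)  # mean of the other counters
            estimates.append(c - noise)
        return median(estimates)


sketch = FrequencySketch(depth=5, width=64)
for ip in ["10.0.0.1"] * 5 + ["10.0.0.2"] * 2:   # hypothetical packet sources
    sketch.add(ip)
print(sketch.count_min("10.0.0.1"), sketch.mean_deducted("10.0.0.1"))
```

Both answers come from the same counters, so the choice between them can be made at query time, for example using the Count-Min value to cap the mean-deducted one.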
Outline
• Introduction
• Continuous membership query
• Point query
• Similarity self-join size estimation [submitted to VLDB’07]
  – Motivating application
  – Problem statement
  – Theoretical and experimental results
• Conclusions and future work
Motivating application
• Near-duplicate document detection for search engines [Broder 99, Henzinger 06]
  – Very slow (30M pages, 10 days in 1997; 2006?)
  – To predict the processing time, it is necessary to estimate the number of similar pairs
• Data cleaning in general (similarity self-join)
  – To find a better query plan (query optimization)
  – Estimates of the similarity self-join size are needed
Problem statement
• Similarity self-join size
  – Given a set of records with d attributes, estimate the # of record pairs that are at least s-similar
• An s-similar pair
  – A pair of records with s attributes in common
  – E.g. <Davood, Rafiei, CS, UofA, Canada> & <Fan, Deng, CS, UofA, Canada> are 3-similar
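The definition above can be checked with a brute-force count, which is the exact quantity an estimator tries to approximate. This is a sketch assuming attributes are compared position by position; the function names are hypothetical, and the quadratic pair enumeration is exactly what makes the exact answer expensive on large inputs.

```python
from itertools import combinations


def common_attributes(r1, r2):
    """# of positions at which two records hold the same attribute value."""
    return sum(1 for a, b in zip(r1, r2) if a == b)


def similarity_self_join_size(records, s):
    """Exact # of record pairs that are at least s-similar (O(n^2) pairs)."""
    return sum(
        1 for r1, r2 in combinations(records, 2) if common_attributes(r1, r2) >= s
    )


# The two records from the slide's example.
records = [
    ("Davood", "Rafiei", "CS", "UofA", "Canada"),
    ("Fan", "Deng", "CS", "UofA", "Canada"),
]
print(common_attributes(*records))            # 3
print(similarity_self_join_size(records, 3))  # 1
```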
Theoretical results
• Unbiased estimate
• Standard deviation bound of the estimate
• Time and space cost
(For both offline and online SimPairCount)
Experimental results
• Online SimPairCount vs. random sampling
  – Given the same amount of space
  – Error = (estimate – trueValue) / trueValue
  – Dataset:
    • DBLP paper titles
    • Each converted into a record with 6 attributes
    • Using min-wise independent hashing
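The experimental setup can be sketched as follows: each title becomes a 6-attribute record by taking, for each of 6 hash functions, the minimum hash value over the title's tokens, and an estimate is scored with the signed relative error from the slide. The tokenization (lowercased whitespace-separated words), the function names, and the use of salted `blake2b` in place of a true min-wise independent family are all assumptions, not details from the slides.

```python
import hashlib


def _hash(token, i):
    """i-th stand-in hash function, implemented as salted blake2b."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=str(i).encode()
    ).digest()
    return int.from_bytes(digest, "big")


def title_to_record(title, num_attributes=6):
    """Min-hash signature of a non-empty title: one attribute per hash function."""
    tokens = set(title.lower().split())
    return tuple(min(_hash(t, i) for t in tokens) for i in range(num_attributes))


def relative_error(estimate, true_value):
    """Signed error as defined on the slide: (estimate - trueValue) / trueValue."""
    return (estimate - true_value) / true_value


rec = title_to_record("Approximation Algorithms for Streaming Data")
print(len(rec))                   # 6 attributes
print(relative_error(156, 100))   # 0.56, i.e. a 56% overestimate
```

Titles that share many tokens tend to agree on more of the 6 attributes, which is what lets s-similarity on records stand in for similarity of titles.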
Similarity self-join size estimation – Experimental results (cont.)
Dataset   K   Random sampling error   SimPairCount error
200K      6    56.00%                   0.69%
200K      5   -25.47%                  16.89%
200K      4   -66.25%                 -16.86%
300K      6   -10.89%                   1.28%
300K      5    68.28%                  34.06%
300K      4    63.14%                 -27.56%
400K      6   -12.56%                   1.59%
400K      5    83.87%                   2.91%
400K      4   -53.56%                  -2.94%
Conclusions and future work
• Streaming algorithms
  – found real applications (important)
  – can lead to theoretical results (fun)
  – More work to be done
• Current direction: multi-dimensional streaming algorithms
• E.g. estimating the # of outliers in one pass