Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku, Rajeev Motwani
Stanford University
VLDB 2002
Introduction
Data come as a continuous “stream”
Differs from a traditional stored DB
The sheer volume of a stream over its lifetime is huge
Queries require timely answers
Frequent itemset mining on offline databases vs data streams
Level-wise algorithms are often used to mine offline databases
- At least 2 database scans are needed
- Ex: Apriori algorithm
Level-wise algorithms cannot be applied to mine data streams
- The data stream cannot be scanned multiple times
Challenges of streaming
Single pass
Limited Memory
Enumeration of itemsets
Purpose
Present algorithms that compute frequency counts exceeding a user-specified threshold
- Simple
- Low memory footprint
- Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter
- Can be deployed for singleton items and can handle variable-sized sets of items
Main contributions of the paper:
- Proposed 2 algorithms to find frequent items that appear in a data stream of items
- Extended the algorithms to find frequent itemsets
Notations
Some notations:
Let N denote the current length of the stream
Let s ∈ (0,1) denote the support threshold
Let ε ∈ (0,1) denote the error tolerance, with ε << s
Approximation guarantees
All itemsets whose true frequency exceeds sN are reported
No itemset whose true frequency is less than (s − ε)N is output
Estimated frequencies are less than the true frequencies by at most εN
Example
s = 0.1%
ε should be one-tenth or one-twentieth of s; here ε = 0.01%
Property 1: all elements with frequency exceeding 0.1% are output
Property 2: NO element with frequency below 0.09% is output
Elements with frequency between 0.09% and 0.1% may or may not be output
Property 3: estimated frequencies are less than the true frequencies by at most 0.01% of N
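A minimal Python sketch of how Properties 1 and 2 partition elements by their true count. The function name `output_status` and the stream length of one million are illustrative assumptions, not taken from the paper.

```python
def output_status(true_count, n, s, eps):
    """Classify an element by its true count in a stream of length n."""
    if true_count > s * n:
        return "must be output"            # Property 1
    if true_count < (s - eps) * n:
        return "must not be output"        # Property 2
    return "may or may not be output"      # the zone between (s - eps)N and sN

N = 1_000_000            # assumed stream length, for illustration only
S, EPS = 0.001, 0.0001   # s = 0.1%, eps = 0.01%
for count in (1200, 950, 800):
    print(count, output_status(count, N, S, EPS))
```

With these values the cut-offs are 1000 and 900 occurrences, so the three sample counts fall into the three different zones.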
Problem definition
An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties
Goal: devise algorithms that maintain an ε-deficient synopsis using as little main memory as possible
The Algorithms for Frequent Items
Each transaction contains only 1 item
Two algorithms proposed:
- Sticky Sampling Algorithm
- Lossy Counting Algorithm
Features:
- Sampling is used
- Frequencies found are approximate, but the error is guaranteed not to exceed a user-specified tolerance level
- For Lossy Counting, all frequent items are reported
Sticky Sampling Algorithm
Create counters by sampling
[Figure: an example stream of elements, from which counters are created by sampling]
Sticky Sampling Algorithm
User input:
- Support threshold s
- Error tolerance ε
- Probability of failure δ
Counts are kept in a data structure S
Each entry in S is of the form (e, f), where:
- e: item
- f: frequency of e since the entry was inserted in S
Output the entries in S where f ≥ (s − ε)N
Sticky Sampling Algorithm
r : sampling rate
Sampling an element with rate r means selecting the element with probability 1/r
Sticky Sampling Algorithm
Initially: S is empty, r = 1
For each incoming element e:
if (e exists in S)
    increment the corresponding f
else {
    sample the element with rate r
    if (sampled)
        add entry (e, 1) to S
    else
        ignore
}
Sampling rate
Let t = (1/ε) log(s⁻¹ δ⁻¹), where δ is the probability of failure
The first 2t elements are sampled at rate r = 1
The next 2t at rate r = 2
The next 4t at rate r = 4, and so on…
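The schedule above can be sketched in Python as follows. The helper names `sticky_t` and `rate_at` are hypothetical, and reading "log" as the natural logarithm is an assumption; the slide does not state the base.

```python
import math

def sticky_t(s, eps, delta):
    """t = (1/eps) * log(1/(s*delta)); natural log is assumed here."""
    return math.log(1.0 / (s * delta)) / eps

def rate_at(i, t):
    """Sampling rate that applies to the i-th stream element (0-based)."""
    r = 1
    # Rate 1 covers positions [0, 2t); every later rate r covers [r*t, 2*r*t).
    while i >= 2 * r * t:
        r *= 2
    return r

t = sticky_t(s=0.1, eps=0.01, delta=0.01)
print(round(t, 1), [rate_at(i, t) for i in (0, 1000, 2000, 3000, 6000)])
```

Because every segment after the first is as long as everything before it, the rate doubles each time the stream length doubles.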
Sticky Sampling Algorithm
Whenever the sampling rate r changes:
for each entry (e, f) in S
    repeat {
        toss an unbiased coin
        if (toss is not successful) {
            diminish f by one
            if (f == 0) {
                delete entry from S
                break
            }
        }
    } until toss is successful
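Putting the pieces together, a minimal Python sketch of Sticky Sampling might look like this. The class and method names are illustrative assumptions, the natural logarithm is assumed for t, and the optional `seed` parameter is added only to make runs repeatable.

```python
import math
import random

class StickySampling:
    def __init__(self, s, eps, delta, seed=None):
        self.s, self.eps = s, eps
        self.t = math.log(1.0 / (s * delta)) / eps  # natural log assumed
        self.r = 1       # current sampling rate
        self.n = 0       # number of elements seen so far
        self.S = {}      # item -> f
        self.rng = random.Random(seed)

    def _on_rate_change(self):
        for e in list(self.S):
            # Each unsuccessful toss (probability 1/2) diminishes f by one;
            # the loop ends at the first successful toss or when f reaches 0.
            while self.rng.random() < 0.5:
                self.S[e] -= 1
                if self.S[e] == 0:
                    del self.S[e]
                    break

    def add(self, e):
        # Rate 1 covers positions [0, 2t); each later rate r covers [r*t, 2*r*t).
        while self.n >= 2 * self.r * self.t:
            self.r *= 2
            self._on_rate_change()
        self.n += 1
        if e in self.S:
            self.S[e] += 1
        elif self.rng.random() < 1.0 / self.r:  # select with probability 1/r
            self.S[e] = 1

    def frequent(self):
        """Entries with f >= (s - eps) * N."""
        threshold = (self.s - self.eps) * self.n
        return {e: f for e, f in self.S.items() if f >= threshold}
```

While the stream is shorter than 2t every element is sampled, so the counts are exact; after that, a stored f can only undercount the true frequency, never overcount it.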
Lossy Counting
The data stream is conceptually divided into buckets of width w = ⌈1/ε⌉ transactions
Buckets are labeled with bucket ids, starting from 1
The current bucket id is b_current, whose value is ⌈N/w⌉
f_e: true frequency of an element e in the stream seen so far
Each entry in data structure D is of the form (e, f, ∆)
- e: item
- f: estimated frequency of e
- ∆: the maximum possible error in f
Lossy Counting
∆ is the maximum number of times e could have occurred in the first b_current − 1 buckets (when the entry is created, this value is exactly b_current − 1)
Once an entry is inserted into D, its ∆ value is unchanged
Lossy Counting
Initially D is empty
On receiving element e:
if (e exists in D)
    increment its frequency f by 1
else
    create a new entry (e, 1, b_current − 1)
At a bucket boundary, prune D by the following rule: (e, f, ∆) is deleted if f + ∆ ≤ b_current
When the user requests a list of items with threshold s, output those entries in D where f ≥ (s − ε)N
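The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; the class name `LossyCounter` and the dictionary layout are assumptions.

```python
import math

class LossyCounter:
    def __init__(self, eps):
        self.eps = eps
        self.w = math.ceil(1 / eps)   # bucket width
        self.n = 0                    # stream length N so far
        self.D = {}                   # item -> [f, delta]

    def add(self, e):
        self.n += 1
        b_current = math.ceil(self.n / self.w)
        if e in self.D:
            self.D[e][0] += 1
        else:
            self.D[e] = [1, b_current - 1]
        if self.n % self.w == 0:      # bucket boundary: prune
            self.D = {k: v for k, v in self.D.items()
                      if v[0] + v[1] > b_current}

    def frequent(self, s):
        """Entries with f >= (s - eps) * N."""
        threshold = (s - self.eps) * self.n
        return {e: f for e, (f, _delta) in self.D.items() if f >= threshold}

lc = LossyCounter(eps=0.2)            # w = 5
for x in "aabacadaea":
    lc.add(x)
print(lc.D, lc.frequent(s=0.5))
```

In this toy stream the singletons b, c, d and e are each pruned at the end of the bucket in which they appear, so only a survives.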
Lossy Counting
function prune(D, b)
    for each entry (e, f, ∆) in D do
        if f + ∆ ≤ b then
            remove the entry from D
        endif
    endfor
Lossy Counting
[Figure: frequency counts. D is empty at first; the counts from the first window are added to it]
At the window boundary, remove entries for which f + ∆ ≤ b_current
Lossy Counting
[Figure: frequency counts. The counts from the next window are added to D]
At the window boundary, remove entries for which f + ∆ ≤ b_current
Lossy Counting
Lossy Counting guarantees that:
When deletion occurs, b_current ≤ εN
Whenever an entry (e, f, ∆) is deleted, f_e ≤ b_current, where f_e is the actual frequency count of e
Hence, if entry (e, f, ∆) is deleted, f_e ≤ εN
Finally, f ≤ f_e ≤ f + εN
Sticky Sampling vs Lossy Counting
Sticky Sampling is non-deterministic, while Lossy Counting is deterministic
Experimental results show that Lossy Counting requires fewer entries than Sticky Sampling
Sticky Sampling vs Lossy Counting
Lossy counting is superior by a large factor
Sticky sampling performs worse because of its tendency to remember every unique element that gets sampled
Lossy counting is good at pruning low frequency elements quickly
The more complex case: finding frequent itemsets
The Lossy Counting algorithm is extended to find frequent itemsets
Each transaction in the data stream contains a set of items
Finding frequent itemsets
[Figure: a stream of transactions]
Finding frequent itemsets
Input: a stream of transactions; each transaction is a set of items from I
N: length of the stream
The user specifies two parameters: support s and error ε
Challenges:
- handling variable-sized transactions
- avoiding explicit enumeration of all subsets of any transaction
Finding frequent itemsets
Data structure D: a set of entries of the form (set, f, ∆)
- set: subset of items
Transactions are divided into buckets
- w = ⌈1/ε⌉: # of transactions in each bucket
- b_current: current bucket id
Finding frequent itemsets
Transactions are not processed one by one
Main memory is filled with as many transactions as possible
Processing is done on a batch of transactions
β: # of buckets in main memory in the current batch being processed
Finding frequent itemsets
D's operations:
UPDATE_SET updates and deletes entries in D
- For each entry (set, f, ∆), count the occurrences of set in the batch and update the entry
- If the updated entry satisfies f + ∆ ≤ b_current, remove it from D
NEW_SET inserts new entries into D
- If a set has frequency f ≥ β in the batch and does not occur in D, create a new entry (set, f, b_current − β)
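A deliberately naive Python sketch of the two operations on one batch. It enumerates every subset of every transaction, which is exactly what the real algorithm avoids, so it only illustrates the bookkeeping. The function name `process_batch` is hypothetical, and treating "does not occur in D" as "did not occur before this batch" is an assumption.

```python
from collections import Counter
from itertools import combinations

def process_batch(D, batch, b_current, beta):
    """D maps frozenset -> [f, delta]; b_current is the bucket id after
    the batch; beta is the number of buckets in the batch."""
    # Count every subset in the batch (exponential; illustration only).
    batch_counts = Counter()
    for transaction in batch:
        items = sorted(set(transaction))
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                batch_counts[frozenset(subset)] += 1

    known = set(D)  # itemsets that had an entry before this batch

    # UPDATE_SET: refresh counts, then delete entries with f + delta <= b_current.
    for itemset in list(D):
        D[itemset][0] += batch_counts.get(itemset, 0)
        if D[itemset][0] + D[itemset][1] <= b_current:
            del D[itemset]

    # NEW_SET: insert itemsets seen at least beta times that had no entry.
    for itemset, f in batch_counts.items():
        if itemset not in known and f >= beta:
            D[itemset] = [f, b_current - beta]
    return D

D = {}
process_batch(D, [["a", "b"], ["a", "b"], ["a", "c"], ["b"]], b_current=2, beta=2)
print(D)
```

Here the bucket width is 2 transactions (ε = 0.5), so a batch of four transactions spans β = 2 buckets and only itemsets seen at least twice get an entry.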
Finding frequent itemsets
If f_set ≥ εN, then set has an entry in D
If (set, f, ∆) ∈ D, then the true frequency f_set satisfies f ≤ f_set ≤ f + ∆
When the user requests a list of itemsets with threshold s, output those entries in D where f ≥ (s − ε)N
β needs to be a large number. Any subset of I that occurs β + 1 times or more contributes an entry to D.
Buffer: repeatedly reads a batch of buckets of transactions into available main memory
Trie: maintains the data structure D
SetGen: generates subsets of item-ids along with their frequency counts in the current batch
- Not all possible subsets need to be generated
- If a subset S is not inserted into D after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
Three modules
BUFFER: repeatedly reads in a batch of transactions into available main memory
TRIE: maintains the data structure D
SUBSET-GEN: operates on the current batch of transactions
The modules implement UPDATE_SET and NEW_SET
Module 1 - Buffer
Read a batch of transactions
Transactions are laid out one after the other in a big array
A bitmap is used to remember transaction boundaries
After reading in a batch, BUFFER sorts each transaction by its item-ids
[Figure: Window 1 through Window 6, in main memory]
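One way to picture the BUFFER layout is the Python sketch below: a flat list of item-ids plus a parallel list of booleans standing in for the bitmap. The function names and the choice to mark the last item of each transaction are assumptions made for illustration.

```python
def build_buffer(transactions):
    """Lay transactions out back to back; ends[i] is True when items[i]
    is the last item of a transaction."""
    items, ends = [], []
    for transaction in transactions:
        ordered = sorted(transaction)          # sort by item-id
        if not ordered:
            continue                           # skip empty transactions
        items.extend(ordered)
        ends.extend([False] * (len(ordered) - 1) + [True])
    return items, ends

def iter_transactions(items, ends):
    """Recover the individual transactions from the flat layout."""
    start = 0
    for i, is_end in enumerate(ends):
        if is_end:
            yield items[start:i + 1]
            start = i + 1

items, ends = build_buffer([[3, 1, 2], [5, 4], [7]])
print(items)
print(list(iter_transactions(items, ends)))
```

A real bitmap would pack the booleans into machine words; a list keeps the sketch short.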
Module 2 - TRIE
[Figure: TRIE example. Sets with frequency counts, stored as the array 50 40 30 31 29 45 32 42]
Module 2 – TRIE cont…
Nodes are labeled {item-id, f, ∆, level}
Children of any node are ordered by their item-ids
Root nodes are also ordered by their item-ids
A node represents an itemset consisting of the item-ids in that node and all its ancestors
TRIE is maintained as an array of entries of the form {item-id, f, ∆, level}, in pre-order of the trees; this is equivalent to a lexicographic ordering of the subsets it encodes
No pointers: the levels compactly encode the underlying tree structure
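A short Python sketch of how the level field alone lets a reader rebuild each itemset from the pre-order array. The function name `decode_trie`, the convention that roots sit at level 0, and the sample entries are all assumptions made for illustration.

```python
def decode_trie(entries):
    """entries: (item_id, f, delta, level) tuples in pre-order.
    Returns (itemset, f, delta) for every node."""
    path = []       # item-ids from the current root down to the current node
    decoded = []
    for item_id, f, delta, level in entries:
        del path[level:]        # climb back up to this node's parent
        path.append(item_id)
        decoded.append((tuple(path), f, delta))
    return decoded

# Hypothetical forest: 1 -> (2 -> 3, 4) and 2 -> 3
entries = [(1, 50, 0, 0), (2, 40, 0, 1), (3, 30, 0, 2),
           (4, 31, 0, 1), (2, 29, 0, 0), (3, 45, 0, 1)]
for itemset, f, delta in decode_trie(entries):
    print(itemset, f, delta)
```

The decoded itemsets come out in lexicographic order, which is what allows TRIE to be merged with SetGen's output in a single sequential pass.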
Module 3 - SetGen
[Figure: SetGen reads BUFFER and produces frequency counts of subsets in lexicographic order]
SetGen uses the following pruning rule: if a subset S does not make its way into TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered
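The pruning rule can be illustrated with a depth-first generator that only extends a subset when a caller-supplied predicate accepts it. This is a simplified stand-in, not the paper's SetGen; the names `gen_subsets` and `keep` are hypothetical, and transactions are assumed to be sorted lists of distinct item-ids.

```python
def gen_subsets(batch, keep):
    """Return {itemset: batch count} in lexicographic order, skipping
    every superset of an itemset that keep(itemset, count) rejects."""
    result = {}

    def extend(prefix, rows):
        # rows: (transaction, start) pairs; the transaction contains prefix
        # and any extension item must come from position start onwards.
        by_item = {}
        for transaction, start in rows:
            for j in range(start, len(transaction)):
                by_item.setdefault(transaction[j], []).append((transaction, j + 1))
        for item in sorted(by_item):
            itemset = prefix + (item,)
            supporting = by_item[item]
            if keep(itemset, len(supporting)):
                result[itemset] = len(supporting)
                extend(itemset, supporting)   # supersets explored only if kept

    extend((), [(t, 0) for t in batch])
    return result

batch = [[1, 2, 3], [1, 2], [2, 3], [1, 3]]
print(gen_subsets(batch, keep=lambda itemset, count: count >= 2))
```

In the full algorithm the predicate would correspond to "the subset ends up in TRIE after UPDATE_SET and NEW_SET"; the count threshold used here is only a placeholder for that test.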
Overall Algorithm
[Figure: BUFFER feeds SUBSET-GEN; its output is combined with TRIE to produce the new TRIE]
Conclusion
Sticky Sampling and Lossy Counting are 2 approximate algorithms that can find frequent items
Both algorithms produce frequency counts within a user-specified error tolerance level, though Sticky Sampling is non-deterministic
Lossy Counting can be extended to find frequent itemsets
Thank you very much…