a sliding window method for finding recently frequent itemsets over online data streams

A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams

Joong Hyuk Chang and Won Suk Lee, Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'03)

Adviser: Jia-Ling Koh Speaker: Yu-ting Kung

Introduction
• Most mining algorithms and frequency approximation algorithms for a data stream are not able to adaptively extract the recent changes of information in the data stream.

Introduction (Cont.)
• This paper:
– Proposes a sliding window method for finding recently frequent itemsets over an online data stream

Sliding Window Method
• Idea:
– Define a significant itemset:
  • An itemset whose current support is greater than or equal to an error parameter ε is a significant itemset
– Monitor only significant itemsets

SW Method (Cont.)
• Two different phases
– Window initialization phase:
  • Active while the number of transactions generated so far in the data stream is less than or equal to a predefined window size
  • Each new transaction is inserted into the CTL (current transaction list)
  • No transaction is extracted
– Window sliding phase:
  • Active after the CTL becomes full
  • Each new transaction is inserted into the CTL
  • The oldest transaction is extracted from the CTL
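The two phases can be sketched with a bounded queue. This is a minimal illustration, not the authors' implementation: the class name `CurrentTransactionList` and its `append` method are hypothetical, and transactions are assumed to be stored as `(tid, items)` pairs.

```python
from collections import deque


class CurrentTransactionList:
    """Hypothetical CTL: holds at most `window_size` transactions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.transactions = deque()  # (tid, frozenset of items), oldest first

    def append(self, tid, items):
        """Insert a transaction; return the extracted oldest one, or None.

        While the window is still filling (initialization phase) nothing is
        extracted. Once the CTL is full (sliding phase), every insertion
        pushes the oldest transaction out.
        """
        self.transactions.append((tid, frozenset(items)))
        if len(self.transactions) > self.window_size:
            return self.transactions.popleft()
        return None


ctl = CurrentTransactionList(window_size=3)
extracted = [ctl.append(tid, "AB") for tid in range(1, 5)]
print(extracted)
```

With a window size of 3, the first three insertions extract nothing and the fourth one extracts transaction 1.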

SW Method (Cont.)
• Five steps:
1. Appending a transaction
2. Count updating and insertion of new itemsets
3. Extracting a transaction
4. Pruning of itemsets
5. Frequent itemset selection

Step 1: Appending a transaction
• Content
– The transaction Tk is appended to the current transaction list CTL

Step 2: Count updating and insertion of new itemsets
• Content
– Each itemset e that appears in Tk is handled through its entry (e, f, t) in the monitoring lattice:
  f: the count of the itemset
  t: the TID of the transaction that caused the itemset to be newly inserted into the monitoring lattice
– Case 1, its corresponding node is in the monitoring lattice: e.f = e.f + 1
– Case 2, its corresponding node isn't in the monitoring lattice: e is inserted into the monitoring lattice with the entry (e, 1, k)
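The two cases of Step 2 can be sketched as follows. Two assumptions are made here that the slides do not state: the monitoring lattice is modelled as a plain dictionary from itemset to `(f, t)`, and "an itemset that appears in Tk" is taken to mean every non-empty subset of the transaction. The function names are hypothetical.

```python
from itertools import combinations


def itemsets_of(transaction):
    """Yield every non-empty subset of a transaction as a frozenset."""
    items = sorted(transaction)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def update_counts(lattice, transaction, tid):
    """Step 2: increment known itemsets, insert unknown ones as (1, tid)."""
    for e in itemsets_of(transaction):
        if e in lattice:
            f, t = lattice[e]
            lattice[e] = (f + 1, t)   # Case 1: node already monitored
        else:
            lattice[e] = (1, tid)     # Case 2: newly inserted at this TID


lattice = {}
stream = ["AB", "D", "AB", "AB", "A"]  # T1..T5 as given in the example slides
for tid, transaction in enumerate(stream, start=1):
    update_counts(lattice, transaction, tid)

for e, (f, t) in sorted(lattice.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(e)), f, t)
```

Enumerating all subsets is exponential in the transaction length, so a real implementation would restrict the enumeration; the dictionary is used only to keep the sketch short.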

Step 3: Extracting a transaction
• When is this step done?
– Only in the window sliding phase
• Content
– Extract the oldest transaction from the CTL
– For each itemset e that appears in the extracted transaction, update the entry (e, f, t) of its node in the monitoring lattice:
  If t <= wfirst: e.f = e.f - 1
  Else: e.f = e.f (unchanged)
  wfirst: the TID of the first transaction of the current window
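The extraction rule can be sketched as below, again assuming a dictionary from itemset to `(f, t)` stands in for the monitoring lattice and that the affected itemsets are the non-empty subsets of the extracted transaction. The function name `extract_oldest` is hypothetical.

```python
from itertools import combinations


def extract_oldest(lattice, oldest_transaction, w_first):
    """Step 3: remove the contribution of the extracted transaction.

    `w_first` is the TID of the transaction being extracted, i.e. the first
    transaction of the current window. An itemset inserted after that TID
    (t > w_first) never counted this transaction, so it is left unchanged.
    """
    items = sorted(oldest_transaction)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            e = frozenset(combo)
            if e not in lattice:
                continue  # not monitored, nothing to update
            f, t = lattice[e]
            if t <= w_first:
                lattice[e] = (f - 1, t)


lattice = {
    frozenset("A"): (4, 1),
    frozenset("AB"): (3, 1),
    frozenset("B"): (1, 5),   # inserted at TID 5, after the extracted one
}
extract_oldest(lattice, "AB", w_first=1)
print(lattice)
```

Entries whose count reaches zero are left in place in this sketch; removing them is the job of the pruning step.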

Step 4: Pruning of itemsets
• Theorem:
– Given an error parameter ε, the maximum possible count of an itemset e with its entry (e, f, t) is found as follows:

\[
C_k^{\max}(e) =
\begin{cases}
f & \text{if } t \le w_{first} \\
f + \varepsilon \times (t - w_{first}) & \text{otherwise}
\end{cases}
\]

Step 4: Pruning of itemsets
• When is this step done?
– Periodically, or when it is needed
• Content
– For an itemset e with entry (e, f, t) in the monitoring lattice, if

\[
\frac{C_k^{\max}(e)}{|D|_w} < \varepsilon
\]

  where |D|_w is the number of transactions in the current window, then it can be regarded as an insignificant itemset and is pruned
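A pruning pass might look like the sketch below. It rests on assumptions that should be checked against the paper: the bound is taken as f when t <= wfirst and f + ε × (t - wfirst) otherwise, the comparison is a strict "less than ε × window length", and the lattice is a dictionary from itemset to `(f, t)`. All function names are hypothetical.

```python
def max_possible_count(f, t, w_first, epsilon):
    """Assumed form of the maximum possible count of an entry (e, f, t)."""
    if t <= w_first:
        return f  # monitored for the whole window, so the count is exact
    # Assumption: occurrences missed before insertion are bounded by
    # epsilon times the number of window transactions that preceded TID t.
    return f + epsilon * (t - w_first)


def prune(lattice, w_first, window_len, epsilon):
    """Step 4: drop every itemset whose bound falls below epsilon * window_len."""
    threshold = epsilon * window_len
    insignificant = [
        e for e, (f, t) in lattice.items()
        if max_possible_count(f, t, w_first, epsilon) < threshold
    ]
    for e in insignificant:
        del lattice[e]
    return insignificant


lattice = {
    frozenset("X"): (1, 3),   # bound 1 + 0.25 * 2 = 1.5
    frozenset("Y"): (5, 1),   # bound 5
    frozenset("Z"): (2, 8),   # bound 2 + 0.25 * 7 = 3.75
}
removed = prune(lattice, w_first=1, window_len=10, epsilon=0.25)
print(removed, sorted("".join(e) for e in lattice))
```

The threshold here is 0.25 × 10 = 2.5, so only X is removed. The itemsets X, Y and Z are made-up data, not taken from the slides.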

Step 5: Frequent itemset selection
• When is this step done?
– When the up-to-date set of recently frequent itemsets is requested
• Content
– For an itemset e with an entry (e, f, t) in the monitoring lattice, if

\[
\frac{C_k^{\max}(e)}{|D|_w} \ge S_{min}
\]

  then it is a frequent itemset
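The selection step can be sketched as a filter over the monitored itemsets. The sketch assumes the maximum possible count of each itemset has already been computed and is passed in as a dictionary, and that the test is "maximum possible count divided by the window length is at least Smin". The name `select_frequent` and the sample counts are hypothetical.

```python
def select_frequent(max_counts, window_len, s_min):
    """Step 5: return itemsets whose estimated support reaches s_min.

    `max_counts` maps each monitored itemset to its maximum possible count
    within the current window of `window_len` transactions.
    """
    return {
        e: count / window_len
        for e, count in max_counts.items()
        if count / window_len >= s_min
    }


max_counts = {
    frozenset("A"): 9,
    frozenset("B"): 4.5,
    frozenset("AB"): 5,
}
frequent = select_frequent(max_counts, window_len=10, s_min=0.5)
print({"".join(sorted(e)): s for e, s in frequent.items()})
```

Because the maximum possible count is used instead of the stored count f, an itemset can be reported even when its true support is slightly below Smin, which matches the error bound stated in the conclusion.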

For Example
• Data stream
[Figure: contents of the data stream, listed by Tid, at (a) D1, (b) D5, (c) D10, (d) D11 and (e) D15]

For Example (Cont.)
• Initial values
– Smin = 0.5
– ε = 0.25 (0.5 x Smin)
– Window size = 10
– Step 4 is performed every 5 transactions
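As a quick check of what these settings imply once the window is full, the thresholds can be computed directly. The variable names are hypothetical, and the range of 15 transactions is chosen only because the example runs up to T15.

```python
s_min = 0.5
epsilon = 0.5 * s_min          # error parameter, half of the minimum support
window_size = 10
prune_interval = 5             # Step 4 runs every 5 transactions

# Counts that correspond to the two support thresholds in a full window.
min_frequent_count = s_min * window_size
min_significant_count = epsilon * window_size

# Transactions after which pruning runs, and the first one that slides the window.
pruning_tids = [k for k in range(1, 16) if k % prune_interval == 0]
first_sliding_tid = window_size + 1

print(epsilon, min_frequent_count, min_significant_count)
print(pruning_tids, first_sliding_tid)
```

In a full window an itemset therefore needs a count of at least 5 to be frequent and at least 2.5 to be significant; pruning runs after T5, T10 and T15, and T11 is the first transaction that forces an extraction.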

[Figure: the monitoring lattice, with an entry (e, f, t) per node, at each point of the example]
(a) After T1 (AB)
(b) After T2 (D)
(b.1) After T3 (AB)
(b.2) After T4 (AB)
(b.3) After T5 (A), then after pruning: D is pruned from the monitoring lattice because it is regarded as an insignificant itemset
(c) After T10 (AE)
(d) After T11 (AD)
(e) After Step 4 for T15
Steps 1 and 2 are applied for every transaction, Step 3 is added from T11 onward, and Step 4 is applied at the pruning points.

Experiment Result
• Data source
– T5.I4.D1000K-I
– T5.I4.D1000K-II

Experiment (Cont.)
• Memory usage in the window sliding phase

Experiment (Cont.)
• Average support error
– Measures the relative accuracy of the proposed method
– When two sets of mining results

\[
R_1 = \{(e_i, S_1(e_i)) \mid S_1(e_i) \ge S_{min}\}
\quad\text{and}\quad
R_2 = \{(e_j, S_2(e_j)) \mid S_2(e_j) \ge S_{min}\}
\]

  are given for the same data set, the average support error ASE(R2|R1) is defined:

[Equation: definition of ASE(R2|R1) in terms of the supports S1(e) and S2(e)]

Experiment (Cont.)
• Average support error of the mining result of the proposed method with respect to that of the Apriori algorithm on the transactions within the current window

Experiment (Cont.)
• The average processing time (Step 1 to Step 4) of the sliding window method in each interval

Experiment (Cont.)
• The average processing time for

Experiment (Cont.)
• The memory usage of the window sliding phase by varying the size of the window

Experiment (Cont.)
• The average processing time of the sliding window method by varying the size of a window

Conclusion
• The result of the proposed method guarantees the following:
– All itemsets whose true supports are greater than or equal to a minimum support Smin are found
– No itemset whose true support is less than (Smin - ε) is found as a recently frequent itemset
– For each itemset, the difference between its estimated support and its true support is less than ε
