Frequent Item Mining

Upload: thelma

Post on 18-Jan-2016


Page 1: Frequent Item Mining

Frequent Item Mining

Page 2: Frequent Item Mining

What is data mining?

• = Pattern Mining?
• What patterns?
• Why are they useful?

Page 3: Frequent Item Mining

Definition: Frequent Itemset

• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke
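
The definitions above can be checked directly against the five-transaction table. The following is a minimal Python sketch; the function names `support_count`, `support` and `is_frequent` are illustrative choices, not part of the slides.

```python
from fractions import Fraction

# Market-basket transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def support_count(itemset, transactions):
    """Support count: number of transactions containing every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)

def support(itemset, transactions):
    """Support: fraction of transactions that contain the itemset."""
    return Fraction(support_count(itemset, transactions), len(transactions))

def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support is >= the minsup threshold."""
    return support(itemset, transactions) >= minsup

query = {"Milk", "Bread", "Diaper"}
print(support_count(query, transactions))  # 2 (TIDs 4 and 5)
print(support(query, transactions))        # 2/5
```

With a minsup of 2/5 the example itemset is frequent; with a minsup of 3/5 it is not.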

Page 4: Frequent Item Mining

Frequent Itemsets Mining

TID Transactions

100 { A, B, E }

200 { B, D }

300 { A, B, E }

400 { A, C }

500 { B, C }

600 { A, C }

700 { A, B }

800 { A, B, C, E }

900 { A, B, C }

1000 { A, C, E }

• Minimum support level 50%
– Frequent itemsets: {A}, {B}, {C}, {A,B}, {A,C}

• How to link this to Data Cube?

Page 5: Frequent Item Mining

Three Different Views of FIM

• Transactional Database
– How do we store a transactional database?
• Horizontal, Vertical, Transaction-Item Pair
• Binary Matrix
• Bipartite Graph
• How is FIM formulated in these different settings?

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke
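
As a rough illustration of these views, the sketch below converts the same five transactions between the horizontal layout, the vertical layout, transaction-item pairs and a binary matrix. The variable names are assumptions made for this example; the transaction-item pairs can also be read as the edge list of a bipartite graph between TIDs and items.

```python
# Horizontal layout: TID -> set of items (the table above).
horizontal = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

# Vertical layout: item -> list of TIDs containing it (TIDs visited in order).
vertical = {}
for tid, items in sorted(horizontal.items()):
    for item in items:
        vertical.setdefault(item, []).append(tid)

# Transaction-item pairs: one (TID, item) row per occurrence.
# Read as edges, these pairs form a bipartite graph between TIDs and items.
pairs = sorted((tid, item) for tid, items in horizontal.items() for item in items)

# Binary matrix: one row per transaction, one column per item.
columns = sorted(vertical)
matrix = [[1 if item in horizontal[tid] else 0 for item in columns]
          for tid in sorted(horizontal)]

print(columns)
for row in matrix:
    print(row)
```

In the matrix view the support count of an item is a column sum; in the vertical view it is the length of the item's TID list.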

Page 6: Frequent Item Mining

Frequent Itemset Generation

null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets

Page 7: Frequent Item Mining

Frequent Itemset Generation

• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since M = 2^d !!!

[Figure: the N transactions of the market-basket table, each of width at most w, are matched against a list of M candidates]
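
The brute-force approach can be sketched in a few lines of Python. This is only an illustration on the five market-basket transactions; the function name `brute_force_frequent` and the minimum support count of 3 are assumptions for the example.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def brute_force_frequent(transactions, minsup_count):
    """Enumerate every non-empty itemset and count it with a full database scan."""
    items = sorted(set().union(*transactions))
    frequent = {}
    n_candidates = 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            n_candidates += 1
            cset = set(candidate)
            # Match every transaction against this candidate.
            count = sum(1 for t in transactions if cset <= t)
            if count >= minsup_count:
                frequent[candidate] = count
    return frequent, n_candidates

frequent, m = brute_force_frequent(transactions, 3)
print(m)         # 63 non-empty candidates for d = 6 items (2**6 - 1)
print(frequent)
```

Every candidate is tested against every transaction, which is where the N × M factor in the complexity estimate comes from.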

Page 8: Frequent Item Mining

Reducing Number of Candidates

• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:

$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
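
The anti-monotone property can be verified exhaustively on a small dataset. The sketch below checks every pair X ⊆ Y over the six items of the market-basket example; it is a sanity check written for this transcript, not part of any mining algorithm.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
itemsets = [frozenset(c)
            for k in range(len(items) + 1)
            for c in combinations(items, k)]

# Collect every pair (X, Y) with X a subset of Y but s(X) < s(Y).
violations = [(x, y) for x in itemsets for y in itemsets
              if x <= y and support(x) < support(y)]
print(violations)  # [] : no subset has lower support than its superset
```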

Page 9: Frequent Item Mining

Illustrating Apriori Principle

null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

[Figure: the itemset lattice. Once an itemset is found to be infrequent, all of its supersets are pruned.]

Page 10: Frequent Item Mining

Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets)

Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Itemset Count
{Bread,Milk} 3
{Bread,Beer} 2
{Bread,Diaper} 3
{Milk,Beer} 2
{Milk,Diaper} 3
{Beer,Diaper} 3

Triplets (3-itemsets)

Itemset Count
{Bread,Milk,Diaper} 2

If every subset is considered, $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$ candidates

With support-based pruning, 6 + 6 + 1 = 13 candidates
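
The two candidate totals can be reproduced with a short level-wise sketch. It keeps a k-itemset as a candidate only if all of its (k-1)-subsets were frequent at the previous level, which is one straightforward reading of support-based pruning; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MINSUP_COUNT = 3

def count(itemset):
    """Support count of an itemset given as a tuple of items."""
    return sum(1 for t in transactions if set(itemset) <= t)

items = sorted(set().union(*transactions))

# Without pruning: every itemset of size 1, 2 or 3 is a candidate.
no_pruning = sum(comb(len(items), k) for k in (1, 2, 3))

# With pruning: a k-itemset is a candidate only if all (k-1)-subsets are frequent.
candidates_per_level = []
frequent_prev = set()
for k in (1, 2, 3):
    if k == 1:
        candidates = [(i,) for i in items]
    else:
        candidates = [c for c in combinations(items, k)
                      if all(s in frequent_prev for s in combinations(c, k - 1))]
    candidates_per_level.append(len(candidates))
    frequent_prev = {c for c in candidates if count(c) >= MINSUP_COUNT}

print(no_pruning)                                       # 41
print(candidates_per_level, sum(candidates_per_level))  # [6, 6, 1] 13
```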

Page 11: Frequent Item Mining

Apriori

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994

Page 12: Frequent Item Mining
Page 13: Frequent Item Mining

How to Generate Candidates?

• Suppose the items in L_{k-1} are listed in an order
• Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
• Step 2: pruning
    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k
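
The join and prune steps above translate almost line for line into Python. The following is a minimal sketch, assuming each itemset is stored as a sorted tuple; the function name `apriori_gen` and the small input `L3` are illustrative, not data from the slides.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets L_{k-1}.

    Each itemset is a tuple whose items are listed in sorted order.
    """
    frequent_prev = set(frequent_prev)
    if not frequent_prev:
        return set()
    prev = sorted(frequent_prev)
    k = len(prev[0]) + 1

    # Step 1: self-join. p and q agree on the first k-2 items
    # and p's last item precedes q's last item.
    joined = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))

    # Step 2: prune. Drop any candidate with an infrequent (k-1)-subset.
    return {c for c in joined
            if all(s in frequent_prev for s in combinations(c, k - 1))}

# Hypothetical L_3 used only to exercise the function.
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))
```

On this input the join produces abcd and acde, and pruning removes acde because its subset ade is not in L_3.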

Page 14: Frequent Item Mining

Challenges of Frequent Itemset Mining

• Challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink number of candidates
– Facilitate support counting of candidates

Page 15: Frequent Item Mining

Alternative Methods for Frequent Itemset Generation

• Representation of Database
– horizontal vs vertical data layout

Horizontal Data Layout

TID Items
1 A,B,E
2 B,C,D
3 C,E
4 A,C,D
5 A,B,C,D
6 A,E
7 A,B
8 A,B,C
9 A,C,D
10 B

Vertical Data Layout

Item TIDs
A 1,4,5,6,7,8,9
B 1,2,5,7,8,10
C 2,3,4,5,8,9
D 2,4,5,9
E 1,3,6

Page 16: Frequent Item Mining

ECLAT

• For each item, store a list of transaction ids (tids)

Vertical Data Layout (for the ten transactions of the previous slide)

Item TID-list
A 1,4,5,6,7,8,9
B 1,2,5,7,8,10
C 2,3,4,5,8,9
D 2,4,5,9
E 1,3,6

Page 17: Frequent Item Mining

ECLAT

• Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.
• 3 traversal approaches:
– top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory

A: 1,4,5,6,7,8,9
B: 1,2,5,7,8,10
AB = A ∩ B: 1,5,7,8
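
The tid-list intersection can be sketched as a small depth-first search in Python. This is a simplified illustration of the ECLAT idea on the ten transactions above; the function name `eclat` and the minimum support count of 3 are assumptions for the example.

```python
# Horizontal layout of the ten example transactions (items written as letters).
horizontal = {1: "ABE", 2: "BCD", 3: "CE", 4: "ACD", 5: "ABCD",
              6: "AE", 7: "AB", 8: "ABC", 9: "ACD", 10: "B"}

# Build the vertical layout: item -> set of TIDs.
tidlists = {}
for tid, items in horizontal.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

def eclat(prefix, candidates, minsup_count, out):
    """Depth-first search; candidates is a list of (item, tidset) pairs."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < minsup_count:
            continue
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support count = length of the tid-list
        # Extend with later items; each new tid-list is an intersection.
        extensions = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
        eclat(itemset, extensions, minsup_count, out)

frequent = {}
eclat((), sorted(tidlists.items()), 3, frequent)

print(sorted(tidlists["A"] & tidlists["B"]))  # [1, 5, 7, 8]
print(frequent)
```

No database scan is needed after the vertical layout is built; the cost moves into storing the intermediate tid-sets.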

Page 18: Frequent Item Mining
Page 19: Frequent Item Mining
Page 20: Frequent Item Mining


FP-growth Algorithm

• Use a compressed representation of the database using an FP-tree

• Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

Page 21: Frequent Item Mining

FP-tree construction

TID Items
1 {A,B}
2 {B,C,D}
3 {A,C,D,E}
4 {A,D,E}
5 {A,B,C}
6 {A,B,C,D}
7 {B,C}
8 {A,B,C}
9 {A,B,D}
10 {B,C,E}

After reading TID=1:

null
  A:1
    B:1

After reading TID=2:

null
  A:1
    B:1
  B:1
    C:1
      D:1

Page 22: Frequent Item Mining

FP-Tree Construction

Transaction Database: the ten transactions from the previous slide

FP-tree after reading all ten transactions:

null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Header table: one (Item, Pointer) entry for each of A, B, C, D, E

Pointers are used to assist frequent itemset generation
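
The construction can be sketched as follows. This minimal Python version inserts items in alphabetical order, as the slides do; FP-growth as usually described orders items by descending frequency and drops infrequent ones first. The class name `FPNode` and function name `build_fp_tree` are illustrative.

```python
class FPNode:
    """One node of the FP-tree: an item, a count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions):
    root = FPNode(None, None)
    header = {}  # item -> list of nodes carrying that item (the header pointers)
    for items in transactions:
        node = root
        for item in sorted(items):  # fixed item order; alphabetical here
            child = node.children.get(item)
            if child is None:
                # New branch: create the node and register it in the header table.
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1  # shared prefixes only increment counts
            node = child
    return root, header

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC",
                "ABCD", "BC", "ABC", "ABD", "BCE"]
root, header = build_fp_tree(transactions)

print(root.children["A"].count)                # 7
print(root.children["A"].children["B"].count)  # 5
print(root.children["B"].count)                # 3
```

Summing the counts of all nodes reachable from one header entry gives that item's support count.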

Page 23: Frequent Item Mining

FP-growth

Prefix paths of the FP-tree that end in D:

null → A → B → C → D:1
null → A → B → D:1
null → A → C → D:1
null → A → D:1
null → B → C → D:1

Conditional Pattern base for D: P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}

Recursively apply FP-growth on P

Frequent Itemsets found (with sup > 1): AD, BD, CD, ABD, ACD, BCD
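
The conditional pattern base for D can be recomputed directly from the transactions. The sketch below is a shortcut rather than FP-growth itself: instead of recursing on a conditional FP-tree, it enumerates the subsets of each prefix path, which is only practical for tiny examples like this one. Function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC",
                "ABCD", "BC", "ABC", "ABD", "BCE"]

def conditional_pattern_base(transactions, suffix):
    """Prefix paths (items ordered alphabetically) that lead to `suffix`."""
    base = Counter()
    for items in transactions:
        ordered = sorted(items)
        if suffix in ordered:
            prefix = tuple(ordered[:ordered.index(suffix)])
            if prefix:
                base[prefix] += 1
    return base

base = conditional_pattern_base(transactions, "D")
print(dict(base))

# Shortcut for illustration: count every subset of every prefix path,
# then append the suffix item D.
counts = Counter()
for prefix, n in base.items():
    for k in range(1, len(prefix) + 1):
        for subset in combinations(prefix, k):
            counts[subset + ("D",)] += n

frequent = sorted("".join(s) for s, n in counts.items() if n > 1)
print(frequent)
```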

Page 24: Frequent Item Mining
Page 25: Frequent Item Mining

Compact Representation of Frequent Itemsets

• Some itemsets are redundant because they have identical support as their supersets

TID A1..A10 B1..B10 C1..C10
1-5 1 0 0
6-10 0 1 0
11-15 0 0 1

(15 × 30 binary matrix, shown in block form: each entry applies to all ten columns of its block and all five rows of its TID range.)

• Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k} = 3 \times (2^{10} - 1) = 3069$

• Need a compact representation

Page 26: Frequent Item Mining

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent

[Figure: itemset lattice over {A, B, C, D, E}, showing the border between frequent and infrequent itemsets and the maximal itemsets next to that border]

Page 27: Frequent Item Mining

Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID Items
1 {A,B}
2 {B,C,D}
3 {A,B,C,D}
4 {A,B,D}
5 {A,B,C,D}

Itemset Support
{A} 4
{B} 5
{C} 3
{D} 4
{A,B} 4
{A,C} 2
{A,D} 3
{B,C} 3
{B,D} 4
{C,D} 3
{A,B,C} 2
{A,B,D} 3
{A,C,D} 2
{B,C,D} 3
{A,B,C,D} 2
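
One way to test closedness is through the closure of an itemset, meaning the intersection of all transactions that contain it. For itemsets with non-zero support this is equivalent to the definition above: an itemset is closed exactly when it equals its own closure. The sketch below applies this to the five transactions; the function name `closure` is illustrative.

```python
from itertools import combinations

transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]

def closure(itemset):
    """Intersection of all transactions containing the itemset (None if unsupported)."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return None
    return frozenset(set.intersection(*covering))

items = sorted(set().union(*transactions))
closed = []
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        x = frozenset(c)
        if closure(x) == x:  # closed: no superset shares the same support
            closed.append("".join(c))

print(closed)  # ['B', 'AB', 'BD', 'ABD', 'BCD', 'ABCD']
```

For example, {A} is not closed because every transaction containing A also contains B, so {A,B} has the same support of 4.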

Page 28: Frequent Item Mining

Maximal vs Closed Itemsets

TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE

Transaction Ids supporting each itemset in the lattice:

Itemset Transaction Ids
A 1,2,4
B 1,2,3
C 1,2,3,4
D 2,4,5
E 3,4,5
AB 1,2
AC 1,2,4
AD 2,4
AE 4
BC 1,2,3
BD 2
BE 3
CD 2,4
CE 3,4
DE 4,5
ABC 1,2
ABD 2
ACD 2,4
ACE 4
ADE 4
BCD 2
BCE 3
CDE 4
ABCD 2
ACDE 4

Not supported by any transactions: ABE, BDE, ABCE, ABDE, BCDE, ABCDE

Page 29: Frequent Item Mining

Maximal vs Closed Frequent Itemsets

[Figure: the itemset lattice of the previous slide, annotated with transaction ids, with the closed and maximal frequent itemsets highlighted]

Minimum support = 2

# Closed = 9

# Maximal = 4

Closed and maximal: CE, DE, ABC, ACD

Closed but not maximal: C, D, E, AC, BC
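
These counts can be reproduced by applying both definitions to the five transactions of the previous slide. The following is a minimal sketch using exhaustive enumeration, which is fine for five items but does not scale; the helper names are illustrative.

```python
from itertools import combinations

transactions = {1: "ABC", 2: "ABCD", 3: "BCE", 4: "ACDE", 5: "DE"}
MINSUP_COUNT = 2
items = "ABCDE"

def count(itemset):
    """Support count of an itemset given as a tuple of items."""
    return sum(1 for t in transactions.values() if set(itemset) <= set(t))

# All frequent itemsets, as sorted tuples mapped to their support counts.
frequent = {c: count(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(c) >= MINSUP_COUNT}

def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one item."""
    return [tuple(sorted(itemset + (i,))) for i in items if i not in itemset]

# Closed: no immediate superset has the same support.
closed = [x for x in frequent
          if all(count(y) != frequent[x] for y in immediate_supersets(x))]
# Maximal: no immediate superset is frequent.
maximal = [x for x in frequent
           if all(y not in frequent for y in immediate_supersets(x))]

print(len(frequent), len(closed), len(maximal))  # 14 9 4
print(sorted("".join(m) for m in maximal))       # ['ABC', 'ACD', 'CE', 'DE']
```

Every maximal frequent itemset is also closed, which matches the nesting of the three sets.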

Page 30: Frequent Item Mining

Maximal vs Closed Itemsets

Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

Page 31: Frequent Item Mining

Beyond Itemsets

• Sequence Mining
– Finding frequent subsequences from a collection of sequences
• Graph Mining
– Finding frequent (connected) subgraphs from a collection of graphs
• Tree Mining
– Finding frequent (embedded) subtrees from a set of trees/graphs
• Geometric Structure Mining
– Finding frequent substructures from 3-D or 2-D geometric graphs
• Among others…

Page 32: Frequent Item Mining

Frequent Pattern Mining

[Figure: several small graphs whose nodes are labeled A to F]

Page 33: Frequent Item Mining

Why Is Frequent Pattern Mining So Important?

• Application Domains
– Business, biology, chemistry, WWW, computer/networking security, …
• Summarizing the underlying datasets, providing key insights
• Basic tools for other data mining tasks
– Association rule mining
– Classification
– Clustering
– Change Detection
– etc…

Page 34: Frequent Item Mining

Network motifs: recurring patterns that occur significantly more than in randomized nets

• Do motifs have specific roles in the network?

• Many possible distinct subgraphs

Page 35: Frequent Item Mining

The 13 three-node connected subgraphs

Page 36: Frequent Item Mining

199 4-node directed connected subgraphs

And it grows fast for larger subgraphs: 9364 5-node subgraphs, 1,530,843 6-node…

Page 37: Frequent Item Mining

Finding network motifs – an overview

• Generation of a suitable random ensemble (reference networks)
• Network motifs detection process:
– Count how many times each subgraph appears
– Compute statistical significance for each subgraph: probability of appearing in random as much as in real network (P-val or Z-score)

Page 38: Frequent Item Mining

Real = 5

Rand = 0.5 ± 0.6 (ensemble of networks)

Z-score (# standard deviations) = 7.5
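
The Z-score on this slide follows from the usual definition: the real count minus the mean count over the random ensemble, divided by the ensemble's standard deviation. The short sketch below reproduces the number; the function name `motif_z_score` is an assumption made for this example.

```python
def motif_z_score(real_count, random_mean, random_std):
    """Number of standard deviations by which the real count exceeds the random mean."""
    return (real_count - random_mean) / random_std

# Values from the slide: Real = 5, Rand = 0.5 ± 0.6.
z = motif_z_score(real_count=5, random_mean=0.5, random_std=0.6)
print(round(z, 2))  # about 7.5
```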

Page 39: Frequent Item Mining


References

• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD, 207-216, 1993.

• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 487-499, 1994.

• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD, 85-93, 1998.

Page 40: Frequent Item Mining

References:

• Christian Borgelt, Efficient Implementations of Apriori and Eclat, FIMI’03

• Ferenc Bodon, A fast APRIORI implementation, FIMI’03

• Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economics, 2006

Page 41: Frequent Item Mining

Important websites:

• FIMI workshop
– Not only Apriori and FIM
• FP-tree, ECLAT, Closed, Maximal
– http://fimi.cs.helsinki.fi/
• Christian Borgelt’s website
– http://www.borgelt.net/software.html
• Ferenc Bodon’s website
– http://www.cs.bme.hu/~bodon/en/apriori/