mining frequent patterns without candidate generation jiawei han, jian pei and yiwen yin

Mining Frequent patterns without candidate generation

Jiawei Han,

Jian Pei

and

Yiwen Yin

Problem : Mining Frequent Pattern

• I={a1, a1, …, am} is a set of items.

• DB={T1, T1, …, Tn} is the database of transactions where each transaction is a non empty subset of I.

• A pattern is also a subset of I.• A pattern is frequent if it is contained in

(supported by) more than a fixed number (ξ) of transactions.

Previous work : Apriori

• It may need to generate a huge number of candidate itemsets. To discover a frequent pattern of size k it needs to generate more than 2k candidates in total.

• It may need to scan the database repeatedly and check for the frequencies of the candidates.

FP-growth

• FP-growth mines frequent patterns without generating the candidate sets. It grows the patterns from fragments.

• It builds an extended prefix tree (FP-tree) for the transaction database. This tree is a compressed representation of the database. It saves repeated scan of the database.

FP-tree

TID Items BoughtFrequent Items

100f,a,c,d,g,i,m,p

f,c,a,m,p

200 a,b,c,f,l,m,o f,c,a,b,m

300 b,f,h,j,o f,b

400 b,c,k,s,p c,b,p

500a,f,c,e,l,p,m,n

f,c,a,m,p

sorted in descending order of the freq.

root

c:1

m:1

b:1

b:1

p:2

m:2

c:3

a:3

f:4

b:1

p:1

Minimum support (ξ) = 3

Conditional FP-tree of p

Items Bought

Frequent Items

f,c,a,m c

f,c,a,m c

c,b c

Conditional pattern base for p

root

c:3

Conditional FP-treeof p

Minimum support (ξ) = 3

The set of frequent patterns containing p is { cp , p }{p }

Frequent patterns containing mItems Bought

Frequent Items

f,c,a f,c,a

f,c,a f,c,a,

f,c,a,b f,c,a

Conditional pattern base for m

root

c:3

a:3

f:3

Conditional FP-treeof m

ItemsFreq. Items

f,c f,c

f,c f,c

f,c f,c

Conditional pattern base for am

root

c:3

f:3

Conditional FP-treeof am

root

f:3

ItemsFreq. Items

f f

f f

f f

The set of frequent patterns containing m is { m }{ m, am }{ m, am, cam }{ m, am, cam, fcam }

Conditional FP-treeof cam

root

c:3

a:3

f:3

{ m, am, cam, fcam, fam}pattern

base for cam

root

c:3

f:3

{ m, am, cam, fcam, fam, cm, fcm }

root

c:3

a:3

f:3

{ m, am, cam, fcam, fam, cm, fcm, fm }

Complete Frequent Pattern set

f ac pm

fc

b

apfmfa ca amcm

camfca fcm fam

fcam

Generated by conditional FP tree of m which is a single

Path

• A single path generates each combination of its nodes as frequent pattern

• Supports for a pattern is equal to the minimum support of a node in it.

root

c:3

a:3

f:3

Pseudocode

• Procedure FP-growth(Tree,α)• if Tree contains a single path P

• for each combination (β) of the nodes in P• Generate pattern βUα with support = minimum support of

a node in β

• else• for each ai in the header of Tree do

• Generate pattern β= αUai with support = ai.support.

• Construct β’s conditional pattern base and conditional FP-tree Treeβ

• if Treeβ ≠ Ø

• Call FP-growth(Treeβ, β)

Implementation issues

• For different support thresholds (ξ) there are different FP-trees. We may chose ξ=20 if 98% of the queries have ξ≥20.

• Updating the FP-tree after each new transaction may be costly. We may count the occurrence frequency of every items and update the tree if relative frequency of an item gets a large change.

New Challenges

• FP-growth may output a large number of frequent patterns for small (ξ) and very small number of frequent patterns for large (ξ). We may not know the (ξ) for our purpose.

• Which frequent patterns are good instances for generating interesting association rules?

Top-K frequent closed patterns

• Closed pattern is a pattern whose support is larger than any of its super pattern.

TID Items BoughtFrequent Items

100f,a,c,d,g,i,m,p

f,c,a,m,p

200 a,b,c,f,l,m,o f,c,a,b,m

300 b,f,h,j,o f,b

400 b,c,k,s,p f,c,b,p

500a,f,c,e,l,p,m,n

f,c,a,m,pf:5 a:3c:4 p:3m:3

fc:4

b:3

ap:3fm:3fa:3 ca:3 am:3cm:3

cam:3fca:3 fcm:3 fam:3

fcam:3

fp:3 fb:3

• We can also specify the minimum length of the patterns.• Top-2 frequent closed patterns with length ≥ 2 is fc and fcam

Mining Top-K closed FP

• The algorithm starts with an FP-tree having 0 support threshold.

• While building the tree, it prunes the smaller patterns with length < min_length.

• After the tree is built, it prunes the relatively infrequent patterns by raising the support threshold.

• Mining is performed on the final pruned FP-tree.

Compressed Frequent Pattern

• FP-growth may end up with a large set of patterns.

• We can compress the set of frequent patterns by clustering it minimally and selecting a representative pattern from each cluster.

f ac pm

fc

b

apfmfa ca amcm

camfca fcm fam

fcam{fcam, cam, ap, b}

Clustering Criterion

• For each cluster there must be a representative pattern Pr .

• D(P,Pr ) ≤ δ for all patterns inside the cluster of Pr .

• D(P1,P2 ) = 1- |T(P1)∩T(P2)| |T(P1)UT(P2)|

• T(P) is the set of transactions that support P.• D is a metric for closed patterns.

Summary

• FP-tree is an extended prefix tree that summarizes the database in a compressed form.

• FP-growth is an algorithm for mining frequent patterns using FP-tree.

• FP-tree can also be used to mine Top-K frequent closed patterns and Compressed frequent patterns.

References

• Mining Frequent Patterns without Candidate Generation– Jiawei Han, Jian Pei and Yiwen Yin

• Mining Top-K Frequent Closed Patterns without Minimum Support– Jiawei Han, Jianyong Wang, Ying Lu and Petre

Tzetkov

• Mining Compressed Frequent-Pattern Sets– Dong Xin, Jiawei Han, Xipheng Yan and Hong Cheng

Thank You

mining frequent patterns without candidate generation jiawei han, jian pei and yiwen yin

Documents