Data Mining: Frequent-Pattern Tree Approach Towards ARM (Lecture 11-12)

TRANSCRIPT

Page 1: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

Data Mining

• Frequent-Pattern Tree Approach Towards ARM

Lecture 11-12

Page 2: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

2

Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:
  – Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
  – Use database scans and pattern matching to collect counts for the candidate itemsets

• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate more than 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern
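The candidate-count arithmetic above can be checked directly (a quick sketch; the exact binomial count is shown rather than the slide's round figure):

```python
from math import comb

# 10^4 frequent 1-itemsets give C(10^4, 2) candidate 2-itemsets
n = 10**4
candidates_2 = comb(n, 2)    # n * (n - 1) / 2
print(candidates_2)          # 49995000, on the order of 10^7

# A frequent pattern of size 100 implies ~2^100 candidate subsets
print(2**100 > 10**30)       # True: 2^100 is about 1.27 * 10^30
```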

Page 3: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

3

Mining Frequent Patterns Without Candidate Generation

• Steps

1. Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
   1. Highly condensed, but complete for frequent pattern mining
   2. Avoids costly repeated database scans

2. Develop an efficient, FP-tree-based frequent pattern mining method
   1. A divide-and-conquer methodology: decompose mining tasks into smaller ones
   2. Avoid candidate generation: sub-database test only!

Page 4: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

4

FP-tree Construction

Header table (minimum support = 3):

  Item  frequency  head
  f     4
  c     4
  a     3
  b     3
  m     3
  p     3

  TID  Items bought               (ordered) frequent items
  100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
  200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
  300  {b, f, h, j, o}            {f, b}
  400  {b, c, k, s, p}            {c, b, p}
  500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:

1. Scan DB once, find frequent 1-itemset (single item pattern)

2. Order frequent items in frequency descending order

3. Scan DB again, construct FP-tree
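The three steps can be sketched as a small Python routine (a minimal illustration under my own names such as `Node` and `build_fp_tree`, not code from the lecture; frequency ties may be broken in a different order than the slide's ⟨f, c, a, b, m, p⟩, which changes the tree's internal layout but not the mined patterns):

```python
from collections import Counter

class Node:
    """One FP-tree node: item label, count, parent, children, node-link."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}    # item -> child Node
        self.link = None      # next node in the tree carrying the same item

def build_fp_tree(transactions, min_support):
    # Step 1: scan DB once, find frequent 1-itemsets
    counts = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in counts.items() if c >= min_support}
    # Step 2: order frequent items in descending frequency
    # (sorted is stable, so equal counts keep first-appearance order)
    order = sorted(freq, key=lambda i: -freq[i])
    rank = {i: r for r, i in enumerate(order)}
    # Step 3: scan DB again, inserting each filtered, ordered transaction
    root, header = Node(None, None), {}   # header: item -> head of node-links
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=rank.__getitem__)
        node = root
        for i in items:
            if i in node.children:
                node.children[i].count += 1
            else:
                child = Node(i, node)
                node.children[i] = child
                child.link, header[i] = header.get(i), child
            node = node.children[i]
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, min_support=3)
print(root.children["f"].count)   # 4, matching the slide's f:4 branch
```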

Page 5: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

5

• Steps Contd. (Example)
  – Scan of the first transaction leads to the construction of the first branch of the tree:

  {}
  └─ f:1
     └─ c:1
        └─ a:1
           └─ m:1
              └─ p:1

FP-tree Construction (contd.)

(ordered) frequent items: {f, c, a, m, p}, {f, c, a, b, m}, {f, b}, {c, b, p}, {f, c, a, m, p}

Page 6: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

6

{}
└─ f:2
   └─ c:2
      └─ a:2
         ├─ m:1
         │  └─ p:1
         └─ b:1
            └─ m:1

FP-tree Construction (contd.)

• Steps Contd. (Example)
  – The second transaction shares a common prefix, ⟨f, c, a⟩, with the existing path; the count of each node along the prefix is incremented by 1
  – Two new nodes are created and linked as children of (a:2) and (b:1) respectively

Page 7: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

7

• Steps Contd. (Example)
  – Similarly for the third transaction, {f, b}: only the prefix f is shared, so its count is incremented and a new node (b:1) is linked as a child of (f:3)

{}
└─ f:3
   ├─ c:2
   │  └─ a:2
   │     ├─ m:1
   │     │  └─ p:1
   │     └─ b:1
   │        └─ m:1
   └─ b:1

FP-tree Construction (contd.)

Page 8: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

8

• Steps Contd. (Example)
  – The scan of the fourth transaction, {c, b, p}, leads to the construction of the second branch of the tree: (c:1), (b:1), (p:1)

{}
├─ f:3
│  ├─ c:2
│  │  └─ a:2
│  │     ├─ m:1
│  │     │  └─ p:1
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

FP-tree Construction (contd.)

Page 9: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

9

• Steps Contd. (Example)

– For the last transaction, since its frequent item list is identical to the first one, the path is shared.

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

FP-tree Construction (contd.)

Page 10: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

10

• Create a header table
  – Each entry in the frequent-item header table consists of two fields: (1) item-name, and (2) head of node-link (a pointer to the first node in the FP-tree carrying that item-name)

FP-tree Construction (contd.)

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header Table

  Item  frequency  head
  f     4
  c     4
  a     3
  b     3
  m     3
  p     3

Page 11: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

11

Mining frequent patterns using FP-tree

• Mining frequent patterns from the FP-tree is based upon the following node-link property:
  – For any frequent item ai, all possible patterns containing only frequent items together with ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table.

• Let's go through an example to understand the full implication of this property in the mining process.

Page 12: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

12

• For node p, its immediate frequent pattern is (p:3), and it has two paths in the FP-tree: ⟨f:4, c:3, a:3, m:2, p:2⟩ and ⟨c:1, b:1, p:1⟩

• These two prefix paths of p, {(fcam:2), (cb:1)}, form p's conditional pattern base

• Now, we build an FP-tree on p's conditional pattern base

• This leads to an FP-tree with only one branch, (c:3); hence the only frequent pattern generated for p is cp
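This derivation can be replayed in a few lines (a sketch with the two prefix paths hard-coded from the slide, and a minimum support of 3 assumed):

```python
from collections import Counter

# p's two prefix paths, each weighted by p's count on that path
p_cond_base = [(["f", "c", "a", "m"], 2),   # from <f:4, c:3, a:3, m:2, p:2>
               (["c", "b"], 1)]             # from <c:1, b:1, p:1>

# Support of each item inside the conditional pattern base
support = Counter()
for prefix, weight in p_cond_base:
    for item in prefix:
        support[item] += weight

print(support["c"])                               # 3
survivors = [i for i, c in support.items() if c >= 3]
print(survivors)   # ['c']: p's conditional FP-tree is the single
                   # node c:3, giving the pattern cp
```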

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header Table

  Item  head
  f
  c
  a
  b
  m
  p

Mining frequent patterns of p

Page 13: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

13

Mining frequent patterns of m

• Constructing an FP-tree on m's conditional pattern base, we derive m's conditional FP-tree, ⟨f:3, c:3, a:3⟩, a single frequent-pattern path.

• This conditional FP-tree is then mined recursively.

m-conditional pattern base:

fca:2, fcab:1

{}
└─ f:3
   └─ c:3
      └─ a:3

m-conditional FP-tree

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam


Page 14: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

14

Mining frequent patterns of m

{}
└─ f:3
   └─ c:3
      └─ a:3

m-conditional FP-tree

Cond. pattern base of "am": (fc:3)

{}
└─ f:3
   └─ c:3

am-conditional FP-tree

Cond. pattern base of "cm": (f:3)

{}
└─ f:3

cm-conditional FP-tree

Cond. pattern base of "cam": (f:3)

{}
└─ f:3

cam-conditional FP-tree

Page 15: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

15

Mining Frequent Patterns by Creating Conditional Pattern-Bases

  Item | Conditional pattern-base   | Conditional FP-tree
  f    | Empty                      | Empty
  c    | {(f:3)}                    | {(f:3)}|c
  a    | {(fc:3)}                   | {(f:3, c:3)}|a
  b    | {(fca:1), (f:1), (c:1)}    | Empty
  m    | {(fca:2), (fcab:1)}        | {(f:3, c:3, a:3)}|m
  p    | {(fcam:2), (cb:1)}         | {(c:3)}|p

Page 16: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

16

Single FP-tree Path Generation

• Suppose an FP-tree T has a single path P

• The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P
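Enumerating the sub-path combinations is a few lines with itertools (a sketch; `single_path_patterns` is my own helper name, with the path and suffix taken from the m example):

```python
from itertools import combinations

def single_path_patterns(path, suffix, suffix_count):
    """All frequent patterns for `suffix` in the single-path case: every
    subset of the path items combined with the suffix; a pattern's
    support is the minimum count among the nodes it uses."""
    patterns = {}
    for r in range(len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(i for i, _ in combo) | {suffix}
            patterns[items] = min([c for _, c in combo] + [suffix_count])
    return patterns

# m's conditional FP-tree is the single path <f:3, c:3, a:3>
result = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], "m", 3)
print(len(result))   # 8: m, fm, cm, am, fcm, fam, cam, fcam
print(result[frozenset({"f", "c", "a", "m"})])   # 3
```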

{}
└─ f:3
   └─ c:3
      └─ a:3

m-conditional FP-tree

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Page 17: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

17

Why Is Frequent Pattern Growth Fast?

• Our performance study shows

– FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

• Reasoning

– No candidate generation, no candidate test

– Use compact data structure

– Eliminate repeated database scan

– Basic operation is counting and FP-tree building

Page 18: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

18

FP-Growth vs. Apriori: Scalability With the Support Threshold

[Chart: run time (sec., axis 0–100) vs. support threshold (%, axis 0–3), comparing D1 FP-growth runtime against D1 Apriori runtime.]

Data set T25I20D10K:

  #Transactions | Items | Average Transaction Length
  250,000       | 1,000 | 12

Page 19: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

19

null
├─ A:7
│  ├─ B:5
│  │  ├─ C:3
│  │  │  └─ D:1
│  │  └─ D:1
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:3
   └─ C:3
      ├─ D:1
      └─ E:1

Transaction Database

  TID  Items
  1    {A,B}
  2    {B,C,D}
  3    {A,C,D,E}
  4    {A,D,E}
  5    {A,B,C}
  6    {A,B,C,D}
  7    {B,C}
  8    {A,B,C}
  9    {A,B,D}
  10   {B,C,E}

Header table

  Item  Pointer
  A
  B
  C
  D
  E

Pointers are used to assist frequent itemset generation

Frequent Itemset Using FP-Growth (Example)

Page 20: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

20

(The full FP-tree from the previous page.)

Build conditional pattern base for E: P = {(A:1,C:1,D:1), (A:1,D:1), (B:1,C:1)}

Recursively apply FP-growth on P

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)
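The recursion walked through on these slides can be condensed into a compact runnable version (my own implementation, which mines conditional pattern bases as weighted prefix lists instead of materializing each conditional tree; a minimum support of 2 is assumed, since the slides call {D,E} with count 2 frequent):

```python
from collections import Counter

def fp_growth(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    def mine(weighted, suffix, out):
        # Count item supports in this (conditional) database
        support = Counter()
        for items, w in weighted:
            for i in items:
                support[i] += w
        for item, count in support.items():
            if count < min_support:
                continue
            new_suffix = suffix | {item}
            out[frozenset(new_suffix)] = count
            # Conditional pattern base for `item`: its prefix in each
            # ordered transaction that contains it, with the same weight
            cond = [(items[:items.index(item)], w)
                    for items, w in weighted
                    if item in items and items.index(item) > 0]
            if cond:
                mine(cond, new_suffix, out)
    # Order every transaction consistently (by descending global count)
    counts = Counter(i for t in transactions for i in t)
    ordered = [(sorted((i for i in t if counts[i] >= min_support),
                       key=lambda i: (-counts[i], i)), 1)
               for t in transactions]
    out = {}
    mine(ordered, frozenset(), out)
    return out

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
      {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
      {"A","B","D"}, {"B","C","E"}]
patterns = fp_growth(db, min_support=2)
print(patterns[frozenset({"E"})])              # 3
print(patterns[frozenset({"D", "E"})])         # 2
print(patterns[frozenset({"A", "D", "E"})])    # 2
print(frozenset({"C", "D", "E"}) in patterns)  # False: its count is only 1
```

The supports printed at the end match the counts derived slide by slide for E, {D,E}, {A,D,E}, and the rejected {C,D,E}.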

Page 21: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

21

Conditional tree for E:

null
├─ A:2
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:1
   └─ C:1
      └─ E:1

Conditional pattern base for E: P = {(A:1,C:1,D:1,E:1), (A:1,D:1,E:1), (B:1,C:1,E:1)}

Count for E is 3: {E} is a frequent itemset

Recursively apply FP-growth on P (conditional tree for D within conditional tree for E)

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)

Page 22: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

22

Conditional pattern base for D within conditional base for E: P = {(A:1,C:1,D:1), (A:1,D:1)}

Count for D is 2: {D,E} is a frequent itemset

Recursively apply FP-growth on P (conditional tree for C within conditional tree for D within conditional tree for E)

Conditional tree for D within conditional tree for E:

null
└─ A:2
   ├─ C:1
   │  └─ D:1
   └─ D:1

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)

Page 23: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

23

Conditional pattern base for C within D within E: P = {(A:1,C:1)}

Count for C is 1: {C,D,E} is NOT a frequent itemset

Recursively apply FP-growth on P (conditional tree for A within conditional tree for D within conditional tree for E)

Conditional tree for C within D within E:

null
└─ A:1
   └─ C:1

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)

Page 24: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

24

Count for A is 2: {A,D,E} is a frequent itemset

Next step: construct the conditional tree for C within the conditional tree for E

Conditional tree for A within D within E:

null
└─ A:2

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)

Page 25: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

25

Conditional tree for E:

null
├─ A:2
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:1
   └─ C:1
      └─ E:1

Recursively apply FP-growth on P (conditional tree for C within conditional tree for E)

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)

Page 26: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

26

Conditional tree for C within conditional tree for E:

null
├─ A:1
│  └─ C:1
│     └─ E:1
└─ B:1
   └─ C:1
      └─ E:1

Conditional pattern base for C within conditional base for E: P = {(B:1,C:1), (A:1,C:1)}

Count for C is 2: {C,E} is a frequent itemset

Recursively apply FP-growth on P (conditional tree for B within conditional tree for C within conditional tree for E)

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)

Page 27: Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture 11-12

27

(The full FP-tree, header table, and transaction database from Page 19, shown again.)

FP Growth Algorithm: FP Tree Mining

Frequent Itemset Using FP-Growth (Example)