cse300
![Page 1: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/1.jpg)
The FP-Growth/Apriori Debate
Jeffrey R. Ellis
CSE 300 – 01
April 11, 2002
![Page 2: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/2.jpg)
Presentation Overview
FP-Growth Algorithm
  Refresher & Example
  Motivation
FP-Growth Complexity
  Vs. Apriori Complexity
  Saving calculation or hiding work?
Real World Application
  Datasets are not created equal
  Results of real-world implementation
![Page 3: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/3.jpg)
FP-Growth Algorithm
Association Rule Mining
  Generate Frequent Itemsets
    Apriori generates candidate sets
    FP-Growth uses specialized data structures (no candidate sets)
  Find Association Rules
    Outside the scope of both FP-Growth & Apriori
Therefore, FP-Growth is a competitor to Apriori
![Page 4: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/4.jpg)
FP-Growth Example
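FP-tree

    {}
      f:4
        c:3
          a:3
            m:2
              p:2
            b:1
              m:1
        b:1
      c:1
        b:1
          p:1

Header Table (each entry's head pointer links to that item's nodes in the tree)

| Item | Frequency |
|---|---|
| f | 4 |
| c | 4 |
| a | 3 |
| b | 3 |
| m | 3 |
| p | 3 |

Conditional pattern bases

| Item | Cond. pattern base |
|---|---|
| c | f:3 |
| a | fc:3 |
| b | fca:1, f:1, c:1 |
| m | fca:2, fcab:1 |
| p | fcam:2, cb:1 |

| TID | Items bought | (Ordered) frequent items |
|---|---|---|
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |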
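The conditional pattern bases on this slide can be checked with a short sketch. It assumes a simplification: instead of walking node links in the tree, it takes the prefix that precedes each item in every ordered transaction, which gives the same bases because shared prefixes are exactly what the tree merges. The function name `conditional_pattern_bases` is hypothetical.

```python
from collections import Counter, defaultdict

# Ordered frequent items per transaction, copied from the table above.
ordered_transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_bases(transactions):
    """Map each item to a Counter of the prefixes that precede it."""
    bases = defaultdict(Counter)
    for items in transactions:
        for position, item in enumerate(items):
            prefix = "".join(items[:position])
            if prefix:  # an item at the top of a path has an empty base
                bases[item][prefix] += 1
    return bases

bases = conditional_pattern_bases(ordered_transactions)
for item in "cabmp":
    print(item, dict(bases[item]))
```

Item f never appears below another item, so it has no entry, matching the empty base shown for f on the next slide.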
![Page 5: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/5.jpg)
FP-Growth Example
| Item | Conditional pattern-base | Conditional FP-tree |
|---|---|---|
| p | {(fcam:2), (cb:1)} | {(c:3)}\|p |
| m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}\|m |
| b | {(fca:1), (f:1), (c:1)} | Empty |
| a | {(fc:3)} | {(f:3, c:3)}\|a |
| c | {(f:3)} | {(f:3)}\|c |
| f | Empty | Empty |
![Page 6: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/6.jpg)
FP-Growth Algorithm
FP-CreateTree
Input: DB, min_support
Output: FP-Tree
1. Scan DB & count all frequent items.
2. Create null root & set as current node.
3. For each Transaction T
     Sort T’s items.
     For each sorted Item I
       Insert I into tree as a child of current node.
       Connect new tree node to header list.

Two passes through DB
Tree creation is based on number of items in DB.
Complexity of CreateTree is O(|DB|)
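A minimal sketch of the two-pass construction, assuming transactions are given as strings of single-letter items. The names `Node`, `create_tree`, and the `tie_order` parameter are hypothetical; `tie_order` only fixes the order of items with equal support so that the result matches the example tree, where f is placed above c.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def create_tree(db, min_support, tie_order=""):
    # Pass 1: count items and keep the frequent ones.
    counts = Counter(item for t in db for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    rank = {item: r for r, item in enumerate(tie_order)}
    root = Node(None)
    header = {item: [] for item in frequent}  # item -> its nodes in the tree
    # Pass 2: insert each transaction's frequent items, most frequent first.
    for t in db:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], rank.get(i, len(rank)), i))
        current = root
        for item in items:
            child = current.children.get(item)
            if child is None:
                child = Node(item, current)
                current.children[item] = child
                header[item].append(child)  # connect new node to header list
            child.count += 1
            current = child
    return root, header

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header = create_tree(db, min_support=3, tie_order="fcabmp")
print({item: node.count for item, node in root.children.items()})
```

Each transaction is touched once per pass, which is the sense in which the slide calls construction linear in the size of DB.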
![Page 7: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/7.jpg)
FP-Growth Example
FP-tree

    {}
      f:4
        c:3
          a:3
            m:2
              p:2
            b:1
              m:1
        b:1
      c:1
        b:1
          p:1

Header Table

| Item | Frequency |
|---|---|
| f | 4 |
| c | 4 |
| a | 3 |
| b | 3 |
| m | 3 |
| p | 3 |

| TID | Items bought | (Ordered) frequent items |
|---|---|---|
| 100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p} |
| 200 | {a, b, c, f, l, m, o} | {f, c, a, b, m} |
| 300 | {b, f, h, j, o} | {f, b} |
| 400 | {b, c, k, s, p} | {c, b, p} |
| 500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p} |
![Page 8: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/8.jpg)
FP-Growth Algorithm
FP-Growth
Input: FP-Tree, f_is (frequent itemset)
Output: All freq_patterns

    if FP-Tree contains single Path P
        then for each combination of nodes in P
            generate pattern f_is ∪ combination
        else for each item i in header
            generate pattern f_is ∪ i
            construct pattern’s conditional cFP-Tree
            if (cFP-Tree ≠ ∅)
                then call FP-Growth (cFP-Tree, pattern)

Recursive algorithm creates FP-Tree structures and calls FP-Growth
Claims no candidate generation
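The recursion can be sketched compactly if the tree compression is left out. The version below is an assumption-laden simplification, not the published implementation: it keeps each conditional pattern base as a plain list of (items, count) pairs instead of a cFP-Tree, and it skips the single-path shortcut, so only the general "else" branch is exercised. The function name `fp_growth` and its parameters are hypothetical.

```python
from collections import Counter

def fp_growth(base, min_support, f_is=frozenset()):
    """Yield (pattern, support) for every frequent pattern that extends f_is.

    base is a list of (items, count) pairs: the conditional pattern base of f_is.
    """
    counts = Counter()
    for items, n in base:
        for item in set(items):
            counts[item] += n
    # Rank frequent items by descending support (ties broken alphabetically).
    ranked = sorted((i for i in counts if counts[i] >= min_support),
                    key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(ranked)}
    ordered = [(sorted((i for i in set(items) if i in rank), key=rank.get), n)
               for items, n in base]
    for item in ranked:
        pattern = f_is | {item}
        yield pattern, counts[item]
        # Conditional pattern base: the items ranked above `item` in each path.
        conditional = [(path[:path.index(item)], n)
                       for path, n in ordered if item in path]
        conditional = [(prefix, n) for prefix, n in conditional if prefix]
        if conditional:  # plays the role of the "cFP-Tree is not empty" test
            yield from fp_growth(conditional, min_support, pattern)

db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
patterns = dict(fp_growth([(t, 1) for t in db], min_support=3))
print(len(patterns), patterns[frozenset("fcam")])
```

On the five-transaction example with a minimum support of 3, this yields 18 frequent patterns, including {f, c, a, m} and {c, p} with support 3 each.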
![Page 9: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/9.jpg)
Two-Part Algorithm
if FP-Tree contains single Path P
then for each combination of nodes in P
generate pattern f_is ∪ combination
e.g., { A B C D }
  p = |pattern| = 4
  AD, BD, CD, ABD, ACD, BCD, ABCD
  $\sum_{n=1}^{p-1} \binom{p-1}{n} = 2^{p-1} - 1 = 7$ patterns
Single path: A → B → C → D
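The single-path case can be illustrated with `itertools.combinations`. This sketch assumes, as the example suggests, that D is the itemset being extended and A, B, C are the nodes above it on the path; the function name `single_path_patterns` is hypothetical.

```python
from itertools import combinations

def single_path_patterns(path, f_is):
    """Combine f_is with every non-empty combination of the nodes above it."""
    patterns = []
    for n in range(1, len(path) + 1):
        for combo in combinations(path, n):
            patterns.append("".join(combo) + f_is)
    return patterns

patterns = single_path_patterns("ABC", "D")
print(patterns)  # ['AD', 'BD', 'CD', 'ABD', 'ACD', 'BCD', 'ABCD']
```

The count doubles with every extra node on the path, so a long single path still produces an exponential number of patterns.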
![Page 10: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/10.jpg)
Two-Part Algorithm

    else for each item i in header
        generate pattern f_is ∪ i
        construct pattern’s conditional cFP-Tree
        if (cFP-Tree ≠ null)
            then call FP-Growth (cFP-Tree, pattern)

e.g., { A B C D E } for f_is = D
  i = A, p_base = (ABCD), (ACD), (AD)
  i = B, p_base = (ABCD)
  i = C, p_base = (ABCD), (ACD)
  i = E, p_base = null
Pattern bases are generated by following the f_is links from the header to each node in the tree that holds it, then up to the tree root, for each header item.

    A
      B - C - D - E
      C - D - E
      D - E
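The pattern bases in this example can be reproduced with a small sketch. It assumes the tree consists of the three root-to-leaf paths ABCDE, ACDE, and ADE (one reading of the figure) and represents each path as a string instead of linked nodes; `pattern_bases` is a hypothetical name.

```python
def pattern_bases(paths, f_is, header):
    """For each header item i, list the root-to-f_is segments that contain i."""
    bases = {}
    for i in header:
        if i == f_is:
            continue
        bases[i] = [path[: path.index(f_is) + 1]
                    for path in paths
                    if f_is in path and i in path[: path.index(f_is)]]
    return bases

paths = ["ABCDE", "ACDE", "ADE"]  # assumed reading of the example tree
bases = pattern_bases(paths, "D", "ABCDE")
for item, base in bases.items():
    print(item, base or "null")
```

Every header item triggers its own scan of the paths through D, which is the repeated traversal the complexity discussion counts.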
![Page 11: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/11.jpg)
FP-Growth Complexity
Therefore, each path in the tree will be at least partially traversed a number of times equal to the number of items in that path (the depth of the path) times the number of items in the header.
Complexity of searching through all paths is then bounded by O(header_count² × depth of tree)
A new cFP-Tree is also created at each step.
![Page 12: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/12.jpg)
Sample Data FP-Tree
[Figure: sample FP-tree rooted at null, with nodes for items including A through E and J through M; the branch layout is not recoverable from this transcript]
![Page 13: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/13.jpg)
Algorithm Results (in English)
Candidate Generation sets exchanged for FP-Trees.
  You MUST take into account all paths that contain an item set with a test item.
  You CANNOT determine before a conditional FP-Tree is created if new frequent item sets will occur.
Trivial examples hide these assertions, leading to a belief that FP-Tree operates more efficiently.
![Page 14: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/14.jpg)
FP-Growth Mining Example
Transactions: A (49), B (44), F (39), G (34), H (30), I (20), ABGM, BGIM, FGIM, GHIM
Header: A, B, F, G, H, I, M

    Null
      A:50 - B:1 - G:1 - M:1
      B:45 - G:1 - I:1 - M:1
      F:40 - G:1 - I:1 - M:1
      G:35 - H:1 - I:1 - M:1
      H:30
      I:20
![Page 15: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/15.jpg)
FP-Growth Mining Example
[FP-tree and header as on page 14]
freq_itemset = { M }
i = A
pattern = { A M }
support = 1
pattern base
cFP-Tree = null
![Page 16: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/16.jpg)
FP-Growth Mining Example
[FP-tree and header as on page 14]
freq_itemset = { M }
i = B
pattern = { B M }
support = 2
pattern base
cFP-Tree …
![Page 17: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/17.jpg)
FP-Growth Mining Example
freq_itemset = { B M }
Header: G, B, M

    G:2 - B:2 - M:2

Patterns mined: BM, GM, BGM
All patterns: BM, GM, BGM
![Page 18: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/18.jpg)
FP-Growth Mining Example
[FP-tree and header as on page 14]
freq_itemset = { M }
i = F
pattern = { F M }
support = 1
pattern base
cFP-Tree = null
![Page 19: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/19.jpg)
FP-Growth Mining Example
[FP-tree and header as on page 14]
freq_itemset = { M }
i = G
pattern = { G M }
support = 4
pattern base
cFP-Tree …
![Page 20: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/20.jpg)
FP-Growth Mining Example
freq_itemset = { G M }
Header: I, B, G, M

    Null
      I:3
        B:1 - G:1 - M:1
        G:2 - M:2
      B:1 - G:1 - M:1

i = I
pattern = { I G M }
support = 3
pattern base
cFP-Tree …
![Page 21: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/21.jpg)
FP-Growth Mining Example
freq_itemset = { I G M }
Header: I, G, M

    I:3 - G:3 - M:3

Patterns mined: IM, GM, IGM
All patterns: BM, GM, BGM, IM, GM, IGM
![Page 22: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/22.jpg)
FP-Growth Mining Example
[Conditional FP-tree and header as on page 20]
freq_itemset = { G M }
i = B
pattern = { B G M }
support = 2
pattern base
cFP-Tree …
![Page 23: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/23.jpg)
FP-Growth Mining Example
freq_itemset = { B G M }
Header: B, G, M

    B:2 - G:2 - M:2

Patterns mined: BM, GM, BGM
All patterns: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM
![Page 24: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/24.jpg)
FP-Growth Mining Example
[FP-tree and header as on page 14]
freq_itemset = { M }
i = H
pattern = { H M }
support = 1
pattern base
cFP-Tree = null
![Page 25: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/25.jpg)
FP-Growth Mining Example
Complete pass for { M }
Move onto { I }, { H }, etc.
Final Frequent Sets:
  L2 : { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
  L3 : { BGM, GIM, BGM, GIM }
  L4 : None
![Page 26: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/26.jpg)
FP-Growth Redundancy v. Apriori Candidate Sets
FP-Growth (support=2) generates:
  L2 : 10 sets (5 distinct)
  L3 : 4 sets (2 distinct)
  Total : 14 sets
Apriori (support=2) generates:
  C2 : 21 sets
  C3 : 2 sets
  Total : 23 sets
What about support=1?
  Apriori : 23 sets
  FP-Growth : 28 sets in {M} with i = A, B, F alone!
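The Apriori candidate counts at support=2 can be reproduced with a short sketch of the join, prune, and count steps over the example transactions. This is a plain textbook-style Apriori written for illustration, not Borgelt's implementation; the function name `apriori_candidate_counts` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def apriori_candidate_counts(transactions, min_support):
    """Return ({k: size of C_k}, {k: frequent itemsets L_k})."""
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {frozenset([i]) for i, c in item_counts.items() if c >= min_support}
    candidates_per_level, frequent_per_level = {}, {1: frequent}
    k = 2
    while frequent:
        # Join: unions of two frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        if not candidates:
            break
        candidates_per_level[k] = len(candidates)
        # Count: one pass over the transactions per level.
        counts = Counter()
        for t in transactions:
            items = set(t)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        frequent = {c for c in candidates if counts[c] >= min_support}
        frequent_per_level[k] = frequent
        k += 1
    return candidates_per_level, frequent_per_level

transactions = (["A"] * 49 + ["B"] * 44 + ["F"] * 39 + ["G"] * 34 + ["H"] * 30
                + ["I"] * 20 + ["ABGM", "BGIM", "FGIM", "GHIM"])
candidates, frequent = apriori_candidate_counts(transactions, min_support=2)
print(candidates, sum(candidates.values()))
```

With all seven items frequent, C2 holds every pair of items (21 sets), and only BGM and GIM survive pruning into C3, which gives the total of 23 quoted above.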
![Page 27: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/27.jpg)
FP-Growth vs. Apriori
Apriori visits each transaction when generating new candidate sets; FP-Growth does not
  Can use data structures to reduce transaction list
FP-Growth traces the set of concurrent items; Apriori generates candidate sets
FP-Growth uses more complicated data structures & mining techniques
![Page 28: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/28.jpg)
Algorithm Analysis Results
FP-Growth IS NOT inherently faster than Apriori
  Intuitively, it appears to condense data
  Mining scheme requires some new work to replace candidate set generation
  Recursion obscures the additional effort
FP-Growth may run faster than Apriori in some circumstances
Complexity analysis gives no guarantee of which algorithm is more efficient
![Page 29: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/29.jpg)
Improvements to FP-Growth
None currently reported
MLFPT
  Multiple Local Frequent Pattern Tree
  New algorithm that is based on FP-Growth
  Distributes FP-Trees among processors
No reports of complexity analysis or accuracy of FP-Growth
![Page 30: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/30.jpg)
Real World Applications
Zheng, Kohavi, Mason – “Real World Performance of Association Rule Algorithms”
  Collected implementations of Apriori, FP-Growth, CLOSET, CHARM, MagnumOpus
  Tested implementations against 1 artificial and 3 real data sets
  Time-based comparisons generated
![Page 31: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/31.jpg)
Apriori & FP-Growth
Apriori
  Implementation by Christian Borgelt (GNU Public License)
  C implementation
  Entire dataset loaded into memory
FP-Growth
  Implementation from creators Han & Pei
  Version – February 5, 2001
![Page 32: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/32.jpg)
Other Algorithms
CHARM
  Based on concept of Closed Itemset
  e.g., { A, B, C } – ABC, AC, BC, ACB, AB, CB, etc.
CLOSET
  Han, Pei implementation of Closed Itemset
MagnumOpus
  Generates rules directly through search-and-prune technique
![Page 33: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/33.jpg)
Datasets
IBM-Artificial
  Generated at IBM Almaden (T10I4D100K)
  Often used in association rule mining studies
BMS-POS
  Years of point-of-sale data from retailer
BMS-WebView-1 & BMS-WebView-2
  Months of clickstream traffic from e-commerce web sites
![Page 34: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/34.jpg)
Dataset Characteristics
![Page 35: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/35.jpg)
Experimental Considerations
Hardware Specifications
  Dual 550MHz Pentium III Xeon processors
  1GB Memory
Support { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
Confidence = 0%
No other applications running (second processor handles system processes)
![Page 36: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/36.jpg)
IBM-Artificial BMS-POS
![Page 37: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/37.jpg)
BMS-WebView-1 BMS-WebView-2
![Page 38: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/38.jpg)
Study Results – Real Data
At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth
At support < 0.20%, Apriori completes whenever FP-Growth completes
  Exception – BMS-WebView-2 @ 0.01%
When 2 million rules are generated, Apriori finishes in 10 minutes or less
  Proposed – Bottleneck is NOT the rule algorithm, but rule analysis
![Page 39: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/39.jpg)
Real Data Results
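BMS-POS

| Support (%) | Apriori time | FP-Growth time | Rules |
|---|---|---|---|
| 0.01 | 186m | 120m | 214,300,568 |
| 0.04 | 16m 9s | 10m 41s | 5,061,105 |
| 0.06 | 8m 35s | 6m 7s | 1,837,824 |
| 0.10 | 3m 58s | 3m 12s | 530,353 |
| 0.20 | 1m 14s | 1m 35s | 103,449 |

BMS-WebView-1

| Support (%) | Apriori time | FP-Growth time | Rules |
|---|---|---|---|
| 0.01 | Failed | Failed | Failed |
| 0.04 | Failed | Failed | Failed |
| 0.06 | 1m 50s | 52s | 3,011,836 |
| 0.10 | 1.2s | 1.2s | 10,360 |
| 0.20 | 0.4s | 0.7s | 1,516 |

BMS-WebView-2

| Support (%) | Apriori time | FP-Growth time | Rules |
|---|---|---|---|
| 0.01 | Failed | 13m 12s | Failed |
| 0.04 | 58s | 29s | 1,096,720 |
| 0.06 | 28s | 16s | 510,233 |
| 0.10 | 9.1s | 5.9s | 119,335 |
| 0.20 | 2.4s | 2.3s | 12,665 |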
![Page 40: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/40.jpg)
Study Results – Artificial Data
At support < 0.40%, FP-Growth performs MUCH faster than Apriori
At support ≥ 0.40%, FP-Growth and Apriori are comparable

| Support (%) | Apriori time | FP-Growth time | Rules |
|---|---|---|---|
| 0.01 | 4m 4s | 20s | 1,376,684 |
| 0.04 | 1m 1s | 9.2s | 56,962 |
| 0.06 | 44s | 8.2s | 41,215 |
| 0.10 | 34s | 7.1s | 26,962 |
| 0.20 | 20s | 5.8s | 13,151 |
| 0.40 | 5.7s | 4.3s | 1,997 |
![Page 41: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/41.jpg)
Real-World Study Conclusions
FP-Growth (and the other non-Apriori algorithms) performs better on artificial data
On all data sets, Apriori performs sufficiently well in reasonable time periods for reasonable result sets
FP-Growth may be suitable when low support, a large result count, and fast generation are needed
Future research may best be directed toward analyzing association rules
![Page 42: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/42.jpg)
Research Conclusions
FP-Growth does not have a better complexity than Apriori
  Common sense indicates it will run faster
FP-Growth does not always have a better running time than Apriori
  Support and dataset appear more influential
FP-Trees are very complex structures (Apriori is simple)
Location of data (memory vs. disk) is a non-factor in comparison of algorithms
![Page 43: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/43.jpg)
To Use or Not To Use?
Question: Should FP-Growth be used in favor of Apriori?
  Difficulty to code
  High performance at extreme cases
  Personal preference
More relevant questions
  What kind of data is it?
  What kind of results do I want?
  How will I analyze the resulting rules?
![Page 44: CSE300](https://reader036.vdocument.in/reader036/viewer/2022062423/55cf9331550346f57b9c9a8b/html5/thumbnails/44.jpg)
References
Han, Jiawei, Pei, Jian, and Yin, Yiwen, “Mining Frequent Patterns without Candidate Generation”. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
Orlando, Salvatore, “High Performance Mining of Short and Long Patterns”. 2001.
Pei, Jian, Han, Jiawei, and Mao, Runying, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
Webb, Geoffrey I., “Efficient search for association rules”. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.
Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, “Fast Parallel Association Rule Mining Without Candidacy Generation”. In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM'2001), San Jose, CA, USA, November 29-December 2, 2001.
Zaki, Mohammed J., “Generating Non-Redundant Association Rules”. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
Zheng, Zijian, Kohavi, Ron, and Mason, Llew, “Real World Performance of Association Rule Algorithms”. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.