AR mining
Implementation and comparison of three AR mining algorithms
Xuehai Wang, Xiaobo Chen, Shen Chen
CSCI6405 class project
Outline
• Motivation
• Dataset
• Apriori based hash tree algorithm
• FP-tree algorithm
• Conclusion
• Reference
Motivation
• Make the time to generate rules as short as possible!
• Understand the three algorithms
– Apriori algorithm
– Apriori with hash tree algorithm
– FP-tree algorithm
• Learn how to improve an algorithm
Dataset
• IBM dataset generator
– Can set the number of items
– Can set the minimal support
– Can set the dataset size
Tid  Items
1    2 5 8 9
2    3 4 6 7 12
Apriori principle
• Apriori principle
– A candidate generation-and-test approach [4]
– Every subset of a frequent itemset must also be frequent
– If a set is infrequent, its supersets will not be generated or tested
• But some steps can still be improved
– Counting the support
– Number of I/O scans
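The generate-and-test loop above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `apriori` and the plain per-level scan are assumptions, and the sample data is the six-transaction table used in the hash tree slides.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while candidates:
        # Test step: one pass over the data per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Generate step: join frequent (k-1)-itemsets and keep a candidate
        # only if all of its (k-1)-subsets are frequent (Apriori principle).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


data = [[1, 2], [1, 3, 6], [1, 2, 3], [2, 4], [2, 3, 6], [5, 6]]
result = apriori(data, min_support=2)
```

The counting line inside the loop is the part the next slides try to speed up: it compares every candidate with every transaction.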
Apriori Hash Tree Alg
• The number of candidate k-itemsets is l
• There are n transactions
• The average transaction size is m
• Cost of calculating the support count:
– Original Apriori Alg: $O\left(n \cdot l \cdot \binom{m}{k}\right)$
– With hash tree: $O\left(n \cdot \log(l) \cdot \binom{m}{k}\right)$
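To get a feel for the gap between the two bounds, the short sketch below plugs numbers into them. It assumes the "(m k)" factor is the binomial coefficient (the number of k-item subsets of an average transaction); the function name and the sample values of n, l, m and k are hypothetical, and constant factors are ignored.

```python
from math import comb, log2


def support_count_cost(n, l, m, k):
    """Rough operation counts for one support-counting pass (constants ignored)."""
    subsets = comb(m, k)  # k-item subsets of an average transaction
    return {
        "original": n * l * subsets,       # compare with every candidate
        "hash_tree": n * log2(l) * subsets,  # one tree descent per subset
    }


# Hypothetical workload: 1000 transactions, 1024 candidates, m = 10, k = 2.
cost = support_count_cost(n=1000, l=1024, m=10, k=2)
```

For these values the estimate drops from 46,080,000 to 450,000 basic steps, a factor of l / log(l).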
Apriori Hash Tree Alg
• Candidates are stored in a hash tree structure
Tid Items
1 1 2
2 1 3 6
3 1 2 3
4 2 4
5 2 3 6
6 5 6
1-itemset candidate hash tree
[Figure: hash tree of 1-itemset candidates being built; each node is labeled item(count), e.g. 1(2), 2(1), 3(1)]
Apriori Hash Tree Alg
1-itemsets, min support = 2
Hash tree entries for the six transactions above, labeled item(count): 1(3), 2(4), 3(3), 4(1), 5(1), 6(3)
Apriori Hash Tree Alg
2-itemsets, min support = 2
Hash tree entries, labeled itemset(count): 1 2(2), 1 3(2), 1 6(1), 2 3(2), 2 6(1), 3 6(2)
3-itemsets, min support = 2
Hash tree entries, labeled itemset(count): 1 2 3(1)
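The 2-itemset counts above can be reproduced with a small hash tree. This is a sketch under assumptions rather than the project's code: the class name `HashTree`, the fanout of 3, the leaf size of 2 and the use of a dict inside each leaf are all illustrative choices. Each k-item subset of a transaction is routed down the tree by hashing one item per level, so only one leaf is inspected per subset.

```python
from itertools import combinations


class HashTree:
    """Small hash tree holding candidate k-itemsets and their support counts."""

    def __init__(self, k, fanout=3, leaf_size=2):
        self.k, self.fanout, self.leaf_size = k, fanout, leaf_size
        self.root = self._new_leaf()

    @staticmethod
    def _new_leaf():
        return {"leaf": True, "counts": {}, "children": {}}

    def _child(self, node, item):
        # Interior nodes route on hash(item) modulo the fanout.
        bucket = hash(item) % self.fanout
        return node["children"].setdefault(bucket, self._new_leaf())

    def insert(self, itemset):
        """Store a candidate (call before any counting starts)."""
        cand = tuple(sorted(itemset))
        node, depth = self.root, 0
        while not node["leaf"]:
            node = self._child(node, cand[depth])
            depth += 1
        node["counts"][cand] = 0
        # Split an overfull leaf while there are still items left to hash on.
        if len(node["counts"]) > self.leaf_size and depth < self.k:
            old, node["counts"], node["leaf"] = node["counts"], {}, False
            for c in old:
                self._child(node, c[depth])["counts"][c] = 0

    def count(self, transaction):
        """Add 1 to every stored candidate contained in the transaction."""
        for subset in combinations(sorted(set(transaction)), self.k):
            node, depth = self.root, 0
            while node is not None and not node["leaf"]:
                node = node["children"].get(hash(subset[depth]) % self.fanout)
                depth += 1
            if node is not None and subset in node["counts"]:
                node["counts"][subset] += 1

    def counts(self):
        """Collect {candidate: count} from all leaves."""
        result, stack = {}, [self.root]
        while stack:
            node = stack.pop()
            result.update(node["counts"])
            stack.extend(node["children"].values())
        return result


data = [[1, 2], [1, 3, 6], [1, 2, 3], [2, 4], [2, 3, 6], [5, 6]]
tree = HashTree(k=2)
for cand in [(1, 2), (1, 3), (1, 6), (2, 3), (2, 6), (3, 6)]:
    tree.insert(cand)
for t in data:
    tree.count(t)
```

After the loop, `tree.counts()` holds the same six counts as the 2-itemset slide; with min support = 2, the itemsets 1 6 and 2 6 are dropped.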
FP-tree
• Because the dataset being mined is usually very large, it is often impossible to read all transactions into memory at once.
• But I/O scans are very time consuming.
• The FP-tree algorithm tries to fit all the information it needs from the dataset into memory, so it only has to scan the dataset twice.
FP-tree
• FP-tree algorithm and implementation
– By Xiaobo Chen
FP-tree (Frequent Pattern Tree)
• Mines frequent patterns without candidate generation
• Divide and conquer methodology: decompose mining tasks into smaller ones
FP-tree (Merits of FP-tree algorithm)
• Makes the most of common shared prefixes
• Complete and compact
– All information of a transaction is stored in a path
– The size is bounded by the dataset; consequently, the longest path corresponds to the longest pattern
– Compression ratio: over 100
FP-tree (Construction of FP-tree)
TID   Frequent items bought (ordered)
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}
min_support = 3
Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3
FP-tree after inserting TID 100:
– root → f:1 → c:1 → a:1 → m:1 → p:1
FP-tree (construction (Cont’d))
FP-tree after inserting TID 100 and TID 200:
– root → f:2 → c:2 → a:2 → m:1 → p:1
– root → f:2 → c:2 → a:2 → b:1 → m:1
FP-tree construction (Cont’d)
Header table (item, frequency, head of node-links), in order: f 4, c 4, a 3, b 3, m 3, p 3
FP-tree after all five transactions (min_support = 3):
– root → f:4 → c:3 → a:3 → m:2 → p:2
– root → f:4 → c:3 → a:3 → b:1 → m:1
– root → f:4 → b:1
– root → c:1 → b:1 → p:1
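The two-scan construction walked through on these slides can be sketched as below. The names `FPNode`, `build_fp_tree` and `paths`, and the optional `order` argument, are assumptions made for illustration. Because several items tie at frequency 3, the header order f, c, a, b, m, p is passed in explicitly to match the slides; without it the sketch breaks ties alphabetically, which would give a different (equally valid) tree.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}


def build_fp_tree(transactions, min_support, order=None):
    # Scan 1: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    if order is None:
        # Assumption: descending frequency, ties broken by item name.
        order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order) if freq[item] >= min_support}

    root = FPNode(None, None)
    header = {item: [] for item in rank}  # item -> its nodes (node-links)
    # Scan 2: insert each transaction's frequent items in header order.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def paths(node, prefix=()):
    """Yield each root-to-leaf path as a tuple of 'item:count' labels."""
    for child in node.children.values():
        label = prefix + (f"{child.item}:{child.count}",)
        if child.children:
            yield from paths(child, label)
        else:
            yield label


transactions = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(transactions, min_support=3, order="fcabmp")
```

Calling `set(paths(root))` yields the four root-to-leaf paths of the final tree, and summing the counts in `header["m"]` gives the frequency of m, which is 3.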
FP-tree (Mining Frequent Patterns Using the FP-tree)
• General idea (divide-and-conquer)
– Recursively grow frequent pattern paths using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
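The recursion described above can be sketched compactly if each conditional pattern base is kept as a plain list of (prefix path, count) pairs. Note the swap: this simplified version recurses on pattern bases directly and does not build the compressed conditional FP-trees or use the single-path shortcut, so it illustrates the divide-and-conquer idea rather than the full algorithm. The function name `fp_growth` and the input layout are assumptions; the input transactions must already be restricted to frequent items and sorted in header order.

```python
from collections import Counter


def fp_growth(weighted_paths, min_support, suffix=()):
    """Return {itemset: support} mined from (ordered path, count) pairs."""
    # Count each item within this (conditional) pattern base.
    totals = Counter()
    for path, n in weighted_paths:
        for item in path:
            totals[item] += n

    patterns = {}
    for item, total in totals.items():
        if total < min_support:
            continue
        pattern = (item,) + suffix
        patterns[frozenset(pattern)] = total
        # Conditional pattern base of `item`: the prefix in front of it
        # on every path that contains it.
        base = [
            (path[: path.index(item)], n)
            for path, n in weighted_paths
            if item in path
        ]
        patterns.update(fp_growth(base, min_support, pattern))
    return patterns


ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
patterns = fp_growth([(tuple(t), 1) for t in ordered], min_support=3)
```

On the example transactions with min_support = 3, the only patterns containing p are p and cp, matching the result derived from p's conditional pattern base.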
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Start with the last item in the order (i.e., p).
• Follow the node pointers and traverse only the paths containing p.
• Accumulate all the transformed prefix paths of that item to form a conditional pattern base.
Paths containing p:
– root → f:4 → c:3 → a:3 → m:2 → p:2
– root → c:1 → b:1 → p:1
Conditional pattern base for p: fcam:2, cb:1
Constructing a new FP-tree based on this pattern base leads to only one branch, c:3. Thus we derive only one frequent pattern containing p: cp.
FP-tree (Mining Frequent Patterns Using the FP-tree)
• Move to the next least frequent item in the order, i.e., m.
• Follow the node pointers and traverse only the paths containing m.
• Accumulate all the transformed prefix paths of that item to form a conditional pattern base.
Paths containing m:
– root → f:4 → c:3 → a:3 → m:2
– root → f:4 → c:3 → a:3 → b:1 → m:1
Conditional pattern base for m: fca:2, fcab:1
Constructing a new FP-tree based on this pattern base leads to the path fca:3. From this we derive the frequent patterns fcam, fcm, fam, cam, fm, cm, am.
FP-tree (Conditional Pattern-Bases for the example)
Item  Conditional pattern-base      Conditional FP-tree
f     Empty                         Empty
c     {(f:3)}                       {(f:3)}|c
a     {(fc:3)}                      {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}       Empty
m     {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}            {(c:3)}|p
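The table can be checked with a short sketch. It relies on one observation: two transactions reach the same FP-tree node for an item exactly when they share the same ordered prefix up to that item, so grouping prefixes with a counter gives the same pattern bases as following node-links. The function names are hypothetical, and TID 400 is taken in header order as c, b, p.

```python
from collections import Counter, defaultdict


def conditional_pattern_bases(ordered_transactions):
    """Map each item to a Counter of {prefix path: count}."""
    bases = defaultdict(Counter)
    for items in ordered_transactions:
        for i, item in enumerate(items):
            if i > 0:  # an item at the front of a path has no prefix
                bases[item][tuple(items[:i])] += 1
    return bases


def frequent_in_base(base, min_support):
    """Items that survive in the conditional FP-tree built from `base`."""
    totals = Counter()
    for path, n in base.items():
        for item in path:
            totals[item] += n
    return {item: n for item, n in totals.items() if n >= min_support}


ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
bases = conditional_pattern_bases(ordered)
m_tree_items = frequent_in_base(bases["m"], min_support=3)
```

Here `bases["b"]` contains three prefixes with count 1 each, and none of their items reaches the support threshold, which is why the conditional FP-tree for b is empty.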
FP-tree (Why is Frequent pattern Growth fast?)
• Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning:
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operations are counting and FP-tree building
FP-tree: Expected result: FP-growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.), axis from 0 to 100, versus support threshold (%), axis from 0 to 3; series: D1 FP-growth runtime and D1 Apriori runtime]
Conclusion
• FP-tree is faster than the other two algorithms.
• Apriori and the hash tree algorithm are easier to implement.
– We can easily combine them with other methods or tools (e.g., distributed or parallel computing).
• The parameters of the dataset are very important too.
– Density, size, min support …
References
• [1] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001
• [2] Jiawei Han, Jian Pei, Yiwen Yin: "Mining Frequent Patterns without Candidate Generation", ACM SIGMOD, 2000
• [3] N. Mamoulis: "Advanced Database Technologies" (slides)
• [4] Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001