Fast Algorithms for Mining Frequent Itemsets
(探勘頻繁項目集合之快速演算法研究)
Doctoral dissertation draft
Advisor: Prof. 張真誠
Graduate student: 李育強
Dept. of Computer Science and Information Engineering, National Chung Cheng University
Date: May 31, 2007
Outline
  Introduction
  Background and Related Work
  NFP-Tree Structure
  Fast Share Measure (FSM) Algorithm
  Three Efficient Algorithms
  Direct Candidate Generation (DCG) Algorithm
  Isolated Items Discarding Strategy (IIDS)
  Maximum Item Conflict First (MICF) Sanitization Method
  Conclusions
Introduction
Data mining techniques have been developed to find a small set of precious nuggets in reams of data (Cabena et al., 1998; Kantardzic, 2002).
Mining association rules is one of the most important data mining problems.
It consists of two sub-problems (Agrawal & Srikant, 1994):
  Identifying all frequent itemsets
  Using these frequent itemsets to generate association rules
The first sub-problem plays an essential role in mining association rules.
Introduction (cont'd)
  Mining frequent itemsets
  Mining share-frequent itemsets
  Mining high utility itemsets
  Hiding sensitive patterns
Support-Confidence Framework (2/4)
FP-growth algorithm (Han et al., 2000; Han et al., 2004)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D

[Figure: step-by-step construction of the FP-tree with header table C, A, B, D. After transactions 001 to 003 the tree is the single path root, C(3), A(3), B(1), D(1).]

Support-Confidence Framework (3/4)
[Figure: FP-tree after inserting transactions 004, 005, and 006. The final tree has the paths C(5), A(3), B(1), D(1); C(5), B(2), D(2); and A(1), B(1), D(1).]

Support-Confidence Framework (4/4)
[Figure: the final FP-tree, the conditional FP-tree of "D" (header table C, B; nodes C(3), B(3), B(1)), and the conditional FP-tree of "BD" (header table C; node C(3)).]
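The FP-tree construction shown in the figures can be sketched in Python as follows. This is a minimal sketch rather than the author's implementation: the `Node` class and the `build_fp_tree` function are hypothetical names, and the transactions are assumed to be already reduced to frequent items and sorted as in the table.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions):
    """Insert each sorted transaction as a path that shares common prefixes."""
    root = Node(None)
    header = defaultdict(list)  # item -> list of tree nodes carrying that item
    for items in transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


# Transactions 001 to 006 from the slide.
transactions = [
    ["C", "A", "B", "D"],
    ["C", "A"],
    ["C", "A"],
    ["C", "B", "D"],
    ["A", "B", "D"],
    ["C", "B", "D"],
]

root, header = build_fp_tree(transactions)
# Support of each item, obtained by summing the counts along its header list.
support = {item: sum(n.count for n in nodes) for item, nodes in header.items()}
```

Summing the counts along a header list is also how the prefix paths of a conditional FP-tree are collected in FP-growth; that mining step is omitted here.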
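Share-Confidence Framework (1/4)
Measure value: $mv(i_p, T_q)$
  e.g. $mv(\{D\}, T_{01}) = 1$, $mv(\{C\}, T_{03}) = 3$
Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$
  e.g. $tmv(T_{02}) = 10$
Total measure value: $Tmv(DB) = \sum_{T_q \in DB} \sum_{i_p \in T_q} mv(i_p, T_q)$
  e.g. $Tmv(DB) = 47$
Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X} mv(i_p, T_q)$, where $X \subseteq T_q$
  e.g. $imv(\{A, E\}, T_{02}) = 5$
Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q) = \sum_{T_q \in db_X} \sum_{i_p \in X} mv(i_p, T_q)$, where $db_X$ is the set of transactions that contain X
  e.g. $lmv(\{BC\}) = 2 + 5 + 5 = 12$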
Share-Confidence Framework (2/4)
Itemset share: $SH(X) = lmv(X) / Tmv(DB)$
  e.g. $SH(\{BC\}) = 12/47 = 25.5\%$
SH-frequent: if $SH(X) \geq minShare$, X is a share-frequent (SH-frequent) itemset
  e.g. with minShare = 30%, {BC} is not SH-frequent because 25.5% < 30%
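The share measures can be checked on a toy example. The following Python sketch assumes a hypothetical three-transaction database (the example database of the slides is not part of this transcript), stored as a mapping from transaction ID to per-item measure values; all function names are illustrative.

```python
def tmv(transaction):
    """Transaction measure value: sum of the measure values of its items."""
    return sum(transaction.values())


def total_mv(db):
    """Total measure value Tmv(DB)."""
    return sum(tmv(t) for t in db.values())


def imv(itemset, transaction):
    """Itemset measure value of X in one transaction that contains X."""
    return sum(transaction[item] for item in itemset)


def lmv(itemset, db):
    """Local measure value: imv summed over the transactions containing X."""
    return sum(imv(itemset, t) for t in db.values() if set(itemset) <= set(t))


def share(itemset, db):
    """Itemset share SH(X) = lmv(X) / Tmv(DB)."""
    return lmv(itemset, db) / total_mv(db)


# Hypothetical database: item -> measure value (e.g. purchased quantity).
db = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 1, "C": 3},
    "T03": {"A": 2, "D": 1},
}

min_share = 0.30
sh_bc = share({"B", "C"}, db)          # lmv = (2 + 1) + (1 + 3) = 7, Tmv = 11
is_sh_frequent = sh_bc >= min_share
```

In this toy database {B, C} is SH-frequent, unlike the {BC} example on the slide, because the data differ.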
Share-Confidence Framework (3/4)
ZP (Zero Pruning) and ZSP (Zero Subset Pruning) (Barber & Hamilton, 2003)
  Variants of exhaustive search
  Prune the candidate itemsets whose local measure values are exactly zero
SIP (Share Infrequent Pruning) (Barber & Hamilton, 2003)
  Like Apriori, but with errors
The three algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets.
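Share-Confidence Framework (4/4)
[Figure: itemset lattice annotated with local measure values, marking the itemsets examined by the ZSP algorithm and by the SIP algorithm.]
  1-itemsets: A:12, B:9, C:10, D:6, E:4, H:1, ...
  2-itemsets: AB:6, AC:16, AD:7, AE:12, BC:12, BD:15, BE:0, CD:8, CE:10, DE:0, ..., GH:2
  3-itemsets: ABC:3, ABD:9, ACD:3, ACE:18, BCD:16, ..., DGH:3
  4-itemsets: ABCD:4, ..., CDGH:4
  5-itemsets: ABCDG:5, BCDGH:5, ...
  6-itemsets: ABCDGH:6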
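Utility Mining (1/2)
Internal utility: $iu(i_p, T_q)$
  e.g. $iu(\{D\}, T_{01}) = 1$, $iu(\{C\}, T_{03}) = 3$
External utility: $eu(i_p)$
  e.g. $eu(\{D\}) = 3$, $eu(\{C\}) = 1$
Utility value in a transaction: $util(X, T_q) = \sum_{i_p \in X} util(i_p, T_q)$, where $util(i_p, T_q) = iu(i_p, T_q) \times eu(i_p)$
  e.g. $util(\{C, E, F\}, T_{02}) = util(C, T_{02}) + util(E, T_{02}) + util(F, T_{02}) = 3 \times 1 + 1 \times 5 + 2 \times 2 = 12$
Local utility: $Lutil(X) = \sum_{T_q \in DB_X} util(X, T_q) = \sum_{T_q \in DB_X} \sum_{i_p \in X} util(i_p, T_q)$
  e.g. $Lutil(\{C, D\}) = util(\{C, D\}, T_{01}) + util(\{C, D\}, T_{04}) + util(\{C, D\}, T_{06}) = 4 + 7 + 5 = 16$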
Utility Mining (2/2)
Total utility: $Tutil(DB) = \sum_{T_q \in DB} util(T_q, T_q)$
  e.g. $Tutil(DB) = 122$
The utility value of X in DB: $UTIL(X) = Lutil(X) / Tutil(DB)$
  e.g. $UTIL(\{C, D\}) = 16/122 = 13.1\%$
High utility itemset: if $UTIL(X) \geq minUtil$, X is a high utility itemset
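The utility definitions can be sketched in Python as follows. The database and the external utility table are hypothetical (the slide's example database is not part of this transcript); only transaction T02 is chosen so that it reproduces the worked value util({C, E, F}, T02) = 12. Function names are illustrative.

```python
def util(itemset, transaction, eu):
    """Utility of X in one transaction: sum of internal * external utility."""
    return sum(transaction[item] * eu[item] for item in itemset)


def lutil(itemset, db, eu):
    """Local utility: util summed over the transactions that contain X."""
    x = set(itemset)
    return sum(util(x, t, eu) for t in db.values() if x <= set(t))


def tutil(db, eu):
    """Total utility: the utility of every whole transaction, summed."""
    return sum(util(t.keys(), t, eu) for t in db.values())


def utility_value(itemset, db, eu):
    """UTIL(X) = Lutil(X) / Tutil(DB)."""
    return lutil(itemset, db, eu) / tutil(db, eu)


# Hypothetical external utilities (e.g. unit profit) and internal utilities
# (e.g. purchased quantity per transaction).
eu = {"C": 1, "D": 3, "E": 5, "F": 2}
db = {
    "T01": {"C": 1, "D": 1},
    "T02": {"C": 3, "E": 1, "F": 2},
    "T03": {"C": 3, "D": 2},
}

u_cef_t02 = util({"C", "E", "F"}, db["T02"], eu)   # 3*1 + 1*5 + 2*2
u_cd = utility_value({"C", "D"}, db, eu)           # (4 + 9) / 25
is_high_utility = u_cd >= 0.30                     # hypothetical minUtil
```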
Privacy-Preserving in Mining Frequent Itemsets
NP-hard problem (Atallah et al., 1999)
DB: database; DB': released database
RI: the set of restrictive itemsets
~RI: the set of non-restrictive itemsets
Misses cost $= \dfrac{|{\sim}RI(DB)| - |{\sim}RI(DB')|}{|{\sim}RI(DB)|}$
Sanitization algorithms (Oliveira and Zaïane, 2002; Oliveira and Zaïane, 2003; Saygin et al., 2001)
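As a small illustration, the misses cost can be computed from the non-restrictive itemsets discovered before and after sanitization. The sketch below is in Python; the function name and the example itemsets are hypothetical.

```python
def misses_cost(non_restrictive_db, non_restrictive_released):
    """(|~RI(DB)| - |~RI(DB')|) / |~RI(DB)| for two collections of itemsets."""
    before = len(non_restrictive_db)
    after = len(non_restrictive_released)
    if before == 0:
        raise ValueError("~RI(DB) must not be empty")
    return (before - after) / before


# Hypothetical non-restrictive frequent itemsets before and after sanitization.
before = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("CD")}
after = {frozenset("AB"), frozenset("AC"), frozenset("CD")}

cost = misses_cost(before, after)  # one of four itemsets was lost
```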
NFP-Tree (2/4)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D

[Figure: step-by-step construction of the NFP-tree. Each node carries two counters, the header table lists only A, B, and D, and the most frequent item appears as C(5,5). After transactions 001 to 003 the nodes are C(5,5), A(3,3), B(1,1), D(1,1).]

NFP-Tree (3/4)
[Figure: NFP-tree after inserting transactions 004, 005, and 006, shown next to the corresponding FP-tree. Final node labels: C(5,5), A(3,4), B(1,2), D(1,2), B(2,2), D(2,2).]

NFP-Tree (4/4)
[Figure: the final NFP-tree and the conditional NFP-tree of "D(3,4)", whose header table contains B and whose only node is B(3,4).]
Experimental Results (1/3)
PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
All algorithms were coded in VC++ 6.0
Datasets:
  Real: BMS-WebView-1, BMS-WebView-2, Connect-4
  Artificial: generated by the IBM synthetic data generator

Parameter   Meaning
|D|         Number of transactions in DB
|T|         Mean size of the transactions
|I|         Mean size of the maximal potentially frequent itemsets
|L|         Number of maximal potentially frequent itemsets
N           Number of items
Experimental Results (2/3)
[Figure: running time (sec) versus minimum support (%) for FP and NFP on Connect-4 (minimum support 57% to 75%), BMS-WebView-1 (0.056% to 0.066%), and BMS-WebView-2 (0.008% to 0.032%).]

Experimental Results (3/3)
[Figure: running time (sec) versus minimum support (0.010% to 0.090%) for FP and NFP on T10.I6.D500k.L10k and T10.I6.D500k.L50.]
Fast Share Measure (FSM) Algorithm
FSM: Fast Share Measure algorithm
ML: maximum transaction length in DB
MV: maximum measure value in DB
$min\_lmv = minShare \times Tmv$
Level Closure Property: given a minShare and a k-itemset X
  Theorem 1. If $lmv(X) + (lmv(X)/k) \times MV < min\_lmv$, all supersets of X with length k + 1 are infrequent.
  Theorem 2. If $lmv(X) + (lmv(X)/k) \times MV \times k' < min\_lmv$, all supersets of X with length k + k' are infrequent.
  Corollary 1. If $lmv(X) + (lmv(X)/k) \times MV \times (ML - k) < min\_lmv$, all supersets of X are infrequent.
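[Figure: itemset lattice annotated with local measure values. 1-itemsets: A:12, B:9, C:10, D:6, E:4, H:1, ...; 2-itemsets: AB:6, AC:16, AD:7, AE:12, BC:12, BD:15, BE:0, CD:8, CE:10, DE:0; 3-itemsets: ABC:3, ABD:9, ACD:3, ACE:18, BCD:16.]
Example with minShare = 30%:
  Let $CF(X) = lmv(X) + (lmv(X)/k) \times MV \times (ML - k)$
  Prune X if $CF(X) < min\_lmv$
  $CF(\{ABC\}) = 3 + (3/3) \times 3 \times (6 - 3) = 12 < 14.1 = min\_lmv$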
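The pruning test of Corollary 1 is easy to reproduce. The Python sketch below recomputes the CF({ABC}) example using the values on the slide (lmv = 3, k = 3, MV = 3, ML = 6, Tmv = 47); the function name `critical_function` is an illustrative choice.

```python
def critical_function(lmv_x, k, mv_max, ml):
    """CF(X) = lmv(X) + (lmv(X) / k) * MV * (ML - k)."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)


min_share = 0.30
total_measure_value = 47                      # Tmv(DB) from the slides
min_lmv = min_share * total_measure_value     # about 14.1

cf_abc = critical_function(lmv_x=3, k=3, mv_max=3, ml=6)
prune_abc = cf_abc < min_lmv                  # True: no superset of ABC can be SH-frequent
```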
Experimental Results (1/2)
T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14

Pass (k)     Set   ZSP       FSM(1)   FSM(2)   FSM(3)   FSM(ML-1)
k=1          Ck    50        50       50       50       50
             RCk   50        49       49       49       50
             Fk    32        32       32       32       32
k=2          Ck    1225      1176     1176     1176     1225
             RCk   1219      570      754      845      1085
             Fk    119       119      119      119      119
k=3          Ck    19327     4256     7062     8865     14886
             RCk   17217     868      1685     2410     5951
             Fk    65        65       65       65       65
k=4          Ck    165077    1725     3233     5568     24243
             RCk   107397    232      644      1236     6117
             Fk    9         9        9        9        9
k=5          Ck    406374    81       258      717      6309
             RCk   266776    5        40       109      1199
             Fk    0         0        0        0        0
k=6          Ck    369341    0        1        4        287
             RCk   310096    0        0        0        37
             Fk    0         0        0        0        0
k>=7         Ck    365975    0        0        0        0
             RCk   359471    0        0        0        0
             Fk    0         0        0        0        0
Time (sec)         10349.9   2.30     2.98     3.31     11.24
Experimental Results (2/2)
[Figure: running time (sec, log scale) of ZSP, FSM(ML-1), FSM(3), FSM(2), and FSM(1) on T4.I2.Dz.N50.S10 versus number of transactions (0 to 1000k), and on T4.I2.D100k.N50.S10 versus minShare (0.2% to 2%).]
Three Efficient Algorithms
EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in $RC_{k-1}$, EFSM joins each itemset of $RC_{k-1}$ with a single item in $RC_1$ to generate $C_k$ efficiently
Reduces the time complexity from $O(n^{2k-2})$ to $O(n^k)$
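The join described for EFSM can be sketched as follows. This Python fragment only illustrates the idea of extending each itemset of RC(k-1) by one item of RC1; the function name and the example sets are hypothetical, and any further checks the actual algorithm performs are not shown in the slides and are omitted here.

```python
def efsm_candidates(rc_prev, rc_1):
    """Build C_k by joining every (k-1)-itemset in rc_prev with one item of rc_1."""
    candidates = set()
    for itemset in rc_prev:
        for item in rc_1:
            if item not in itemset:
                # frozenset | set yields a frozenset, so duplicates collapse.
                candidates.add(itemset | {item})
    return candidates


# Hypothetical remaining candidates after passes 1 and 2.
rc_1 = {"A", "B", "C", "D"}
rc_2 = {frozenset("AB"), frozenset("AC")}

c_3 = efsm_candidates(rc_2, rc_1)
```

Each itemset is paired with at most |RC1| items, so the join is linear in |RC(k-1)| times |RC1| instead of quadratic in |RC(k-1)|.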
$X^{k+1}$: an arbitrary superset of X with length k + 1 in DB
$S(X^{k+1})$: the set that contains all $X^{k+1}$ in DB
$dbS(X^{k+1})$: the set of transactions each of which contains at least one $X^{k+1}$
SuFSM and ShFSM are derived from EFSM and prune the candidates more efficiently than FSM
SuFSM (Support-counted FSM):
  Theorem 3. If $lmv(X) + Sup(S(X^{k+1})) \times MV \times (ML - k) < min\_lmv$, all supersets of X are infrequent.
SuFSM (Support-counted FSM)
$lmv(X)/k \geq Sup(X) \geq Sup(S(X^{k+1}))$
  e.g. $lmv(\{BCD\})/k = 15/3 = 5$, $Sup(\{BCD\}) = 3$, $Sup(S(\{BCD\}^{k+1})) = 2$
If no superset of X is an SH-frequent itemset, the following inequalities hold:
  $lmv(X) + (lmv(X)/k) \times MV \times (ML - k) < min\_lmv$
  $lmv(X) + Sup(X) \times MV \times (ML - k) < min\_lmv$
  $lmv(X) + Sup(S(X^{k+1})) \times MV \times (ML - k) < min\_lmv$
ShFSM (Share-counted FSM)
  Theorem 4. If $Tmv(dbS(X^{k+1})) < min\_lmv$, all supersets of X are infrequent.
Pruning conditions compared:
  FSM: $lmv(X) + (lmv(X)/k) \times MV \times (ML - k) < min\_lmv$
  SuFSM: $lmv(X) + Sup(S(X^{k+1})) \times MV \times (ML - k) < min\_lmv$
  ShFSM: $Tmv(dbS(X^{k+1})) < min\_lmv$
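To see how the SuFSM bound tightens the FSM bound, the two left-hand sides can be compared numerically. The Python sketch below uses the {BCD} values quoted on the SuFSM slide (lmv = 15, k = 3, Sup(S) = 2) and assumes, as a labeled assumption, that MV = 3 and ML = 6 carry over from the earlier CF example. Function names are illustrative.

```python
def fsm_bound(lmv_x, k, mv_max, ml):
    """Left-hand side of the FSM condition (Corollary 1)."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)


def sufsm_bound(lmv_x, sup_supersets, k, mv_max, ml):
    """Left-hand side of the SuFSM condition (Theorem 3)."""
    return lmv_x + sup_supersets * mv_max * (ml - k)


# Values for X = {BCD}; MV and ML are assumed to match the earlier example.
fsm_value = fsm_bound(lmv_x=15, k=3, mv_max=3, ml=6)
sufsm_value = sufsm_bound(lmv_x=15, sup_supersets=2, k=3, mv_max=3, ml=6)
tighter = sufsm_value <= fsm_value
```

A smaller left-hand side falls below min_lmv more often, so more candidates are pruned.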
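ShFSM (Share-counted FSM)
[Figure: itemset lattice annotated with local measure values. 1-itemsets: A:12, B:9, C:10, D:6, E:4, H:1, ...; 2-itemsets: AB:6, AC:16, AD:7, AE:12, BC:12, BD:15, BE:0, CD:8, CE:10, DE:0; 3-itemsets: ACE:18, BCD:16.]
Ex. X = {AB}
  $Tmv(dbS(X^{k+1})) = tmv(T_{01}) + tmv(T_{05}) = 6 + 6 = 12 < 14.1 = min\_lmv$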
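The ShFSM test can be sketched by summing the transaction measure values of every transaction that contains X plus at least one more item. In the Python sketch below the database contents are hypothetical; they are chosen only so that T01 and T05 each have tmv = 6, as in the slide's example. Function names are illustrative.

```python
def tmv(transaction):
    """Transaction measure value: sum of the measure values of its items."""
    return sum(transaction.values())


def tmv_of_superset_transactions(itemset, db):
    """Tmv(dbS(X^{k+1})): total over transactions holding a proper superset of X."""
    x = set(itemset)
    # `x < set(t)` is a proper-subset test: t contains X and something more.
    return sum(tmv(t) for t in db.values() if x < set(t))


# Hypothetical database (item -> measure value).
db = {
    "T01": {"A": 1, "B": 1, "C": 1, "D": 3},
    "T03": {"A": 2, "B": 2},                  # contains X only, so not counted
    "T04": {"B": 1, "C": 4},                  # does not contain X
    "T05": {"A": 2, "B": 2, "G": 1, "H": 1},
}

min_lmv = 14.1
bound = tmv_of_superset_transactions({"A", "B"}, db)   # 6 + 6
prune_supersets_of_ab = bound < min_lmv
```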
Experimental Results (1/3)
[Figure: running time (sec, log scale) of FSM, EFSM, SuFSM, and ShFSM. Panels: T6.I4.Dz.N200.S10 versus number of transactions (0 to 1000k); T6.I4.D100k.N200.S10 versus minShare (0% to 1.2%); T4.I2.D100k.N50.S10 versus minShare (0% to 1.2%), also including ZSP and EZSP; T10.I6.D100k.N500.S20 versus minShare (0% to 1.2%). Annotation: minShare = 0.3%.]

Experimental Results (2/3)
[Figure: running time (sec, log scale) of FSM, EFSM, SuFSM, and ShFSM on T6.I4.D100k.N200.Sm versus S (0 to 60). Annotation: minShare = 0.3%.]
Experimental Results (3/3)
T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20

Pass (k)     Set   FSM       EFSM      SuFSM    ShFSM    Fk
k=1          Ck    200       200       200      200      159
             RCk   200       200       199      197
k=2          Ck    19900     19900     19701    19306    1844
             RCk   16214     16214     13312    7199
k=3          Ck    829547    829547    564324   190607   101
             RCk   251877    251877    99765    9792
k=4          Ck    3290296   3290296   793042   20913    0
             RCk   332877    332877    41057    1420
k=5          Ck    393833    393833    25003    1050     5
             RCk   71420     71420     19720    959
k=6          Ck    26137     26137     11582    518      8
             RCk   25562     25562     11045    506
k=7          Ck    11141     11141     5940     204      7
             RCk   11099     11099     5827     196
k=8          Ck    4426      4426      2797     58       1
             RCk   4423      4423      2750     54
k>=9         Ck    2036      2036      1567     12       0
             RCk   2030      2030      1513     10
Time (sec)         13610.4   71.55     29.67    10.95
Direct Candidate Generation (DCG) Algorithm
[Figure: DCG example on the itemset lattice (1-itemsets A:12, B:9, C:10, D:6, E:4, F:4, G:1, H:1; 2-itemsets AC:16, AE:12, BC:12, BD:15, CD:8, CE:10, CF:8; 3-itemsets ACE:18, BCD:16). Each itemset is paired with a small table of values for its candidate extension items (for example B, C, D, E, F: 12, 26, 12, 20, 10); the assignment of the remaining tables to itemsets is not recoverable from the transcript.]
Experimental Results (1/3)
[Figure: running time (sec) versus minShare (%). T6.I4.D100k.N200.S10 (log scale, minShare 0% to 1.2%): FSM, EFSM, SuFSM, ShFSM, DCG. T10.I6.D100k.N1000.S10 (minShare 0% to 0.12%): SuFSM, ShFSM, DCG.]

Experimental Results (3/3)
[Figure: running time (sec) of SuFSM, ShFSM, and DCG on T6.I4.Dz.N200.S10 versus number of transactions (0 to 1000k) and on T6.I4.D100k.N200.Sm versus S (0 to 60); running time (sec) of ShFSM and DCG on BMS-WebView-2.S10 versus minShare (0% to 1.2%).]
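Isolated Items Discarding Strategy (IIDS) for Utility Mining
[Flowchart]
  1. Initially, $ISet_1$ = Empty, k = 1, $C_1$ = I
  2. Scan DB, skipping all $i_p$ of $ISet_k$
  3. Generate $HUI_k(DB)$ and $RC_k$
  4. Generate $ISet_{k+1}$
  5. Generate $C_{k+1}$
  6. If $|C_{k+1}| > 0$ (Yes), set k++ and return to step 2; otherwise (No), end or enter the second phase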
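The "skip all items of ISet" step can be illustrated with a short Python sketch. The slides do not define an isolated item, so the sketch assumes that an item is isolated when it appears in none of the remaining candidate itemsets; that definition, the function names, and the example data are all assumptions made for illustration.

```python
def isolated_items(all_items, candidates):
    """Items that occur in no remaining candidate itemset (assumed definition)."""
    used = set().union(*candidates)
    return set(all_items) - used


def scan_skipping(db, isolated):
    """View of the database with isolated items left out of every transaction."""
    return {
        tid: {item: qty for item, qty in t.items() if item not in isolated}
        for tid, t in db.items()
    }


# Hypothetical items, candidates, and database (item -> quantity).
items = {"A", "B", "C", "D"}
candidates = {frozenset("AB"), frozenset("AC")}
db = {
    "T01": {"A": 1, "B": 2, "D": 5},
    "T02": {"C": 3, "D": 1},
}

iset = isolated_items(items, candidates)
reduced_db = scan_skipping(db, iset)
```

Dropping the isolated items shrinks the transactions, which in turn lowers any upper bound computed from transaction totals in later passes.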
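IIDS (1/2)
[Figure: itemset lattice explored by ShFSM with minUtil = 30%. Each node shows an itemset with two values, written here as itemset: first value (second value).]
  1-itemsets: A: 92 (36), B: 68 (18), C: 105 (10), D: 68 (18), E: 54 (20), F: 43 (8), G: 21 (8), H: 21 (4)
  2-itemsets: AB: 38 (16), AC: 75 (34), AD: 38 (21), AE: 54 (44), AF: 24 (16), BC: 51 (20), BD: 68 (36), BE: 0 (0), BF: 19 (12), CD: 51 (16), CE: 54 (26), CF: 43 (12), DE: 0 (0), DF: 19 (10), EF: 24 (9)
  3-itemsets: ABC: 21 (6), ABD: 21 (25), ACD: 21 (7), ACE: 24 (50), BCD: 40 (32)

IIDS (2/2)
[Figure: itemset lattice explored by FUM with minUtil = 30%, in the same notation.]
  1-itemsets: A: 92 (36), B: 68 (18), C: 105 (10), D: 68 (18), E: 54 (20), F: 43 (8), G: 21 (8), H: 21 (4)
  2-itemsets: AB: 26 (16), AC: 63 (34), AD: 26 (21), AE: 54 (44), AF: 24 (16), BC: 39 (20), BD: 56 (36), BE: 0 (0), BF: 19 (12), CD: 39 (16), CE: 54 (26), CF: 43 (12), DE: 0 (0), DF: 19 (10), EF: 24 (9)
  3-itemsets: ACE: 24 (50), BCD: 28 (32)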
Experimental Results (1/5)
[Figure: distribution of external utility values (x-axis: external utility, 0 to 10; y-axis: number of items, 0 to 250) for the 1000-item and 2000-item datasets.]
[Figure: running time (sec) versus minUtil for TP, ShFSM, DCG, FUM, and DCG+ on T10.I6.D1000k.N1000, with minUtil from 0.02% to 0.08% and from 0.08% to 0.32%.]

Experimental Results (2/5)
[Figure: running time (sec) versus minUtil for TP, ShFSM, DCG, FUM, and DCG+ on T10.I6.D1000k.N2000 (minUtil 0.02% to 0.08% and 0.08% to 0.32%) and on T20.I6.D1000k.N1000 (minUtil 0.04% to 0.16% and 0.16% to 0.40%).]

Experimental Results (5/5)
[Figure: running time (sec) versus transaction number (0 to 6000k) for TP, ShFSM, DCG, FUM, and DCG+ on T10.I6.Dxk.N1000 and T20.I6.Dxk.N1000, both at minUtil = 0.12%; running time (sec) versus minUtil (0.04% to 0.36%) for TP, ShFSM, and FUM on the Chain-store dataset.]
Maximum Item Conflict First (MICF) Sanitization Method
Tdegree(Tq): the degree of conflict of a sensitive transaction Tq, that is, the number of restrictive itemsets that are included in Tq
If Tdegree(Tq) > 1, Tq is a conflicting transaction

Idegree({D}, {D, F}, T05) = 1, Idegree({F}, {D, F}, T05) = 0
MaxIdegree: stores the maximum conflict degree among the items in a transaction
MICF: selects an item with MaxIdegree to delete in each iteration

TID   Transaction        Tdegree(Tq)
T05   {B, D, F, H}       2
T06   {A, B, D, F, H}    3

Idegree({D}, {D, F}, T06) = 1, Idegree({F}, {D, F}, T06) = 0
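The Tdegree definition can be sketched in Python as follows. The restrictive itemsets below are hypothetical (the slides' actual set RI is not part of this transcript); they are chosen so that T05 and T06 obtain the Tdegree values 2 and 3 shown in the table. Idegree is not implemented because its definition does not appear in the transcript.

```python
def tdegree(transaction, restrictive_itemsets):
    """Number of restrictive itemsets that are fully included in the transaction."""
    items = set(transaction)
    return sum(1 for r in restrictive_itemsets if set(r) <= items)


def is_conflicting(transaction, restrictive_itemsets):
    """A transaction is conflicting when it contains more than one restrictive itemset."""
    return tdegree(transaction, restrictive_itemsets) > 1


# Hypothetical restrictive itemsets RI.
ri = [{"D", "F"}, {"B", "H"}, {"A", "D"}]

t05 = {"B", "D", "F", "H"}
t06 = {"A", "B", "D", "F", "H"}

degree_t05 = tdegree(t05, ri)
degree_t06 = tdegree(t06, ri)
```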
Experimental Results (2/5)
[Figure: misses cost (%) of Algo2b, MaxFIA, MinFIA, IGA, and MICF at privacy threshold = 0%. T10.I6.D100k.N500: versus minSup (0.01% to 0.10%) and versus number of restrictive itemsets (100 to 800). T20.I10.D100k.N500: versus minSup (0.06% to 0.14%) and versus number of restrictive itemsets (50 to 400). Annotations: |RI| = 200, minSup = 0.04%; |RI| = 50, minSup = 0.1%.]

Experimental Results (3/5)
[Figure: misses cost (%) of Algo2b, MaxFIA, MinFIA, IGA, and MICF at privacy threshold = 0%. BMS-WebView-1: versus minSup (0.056% to 0.070%) and versus number of restrictive itemsets (100 to 800). BMS-WebView-2: versus minSup (0.012% to 0.036%) and versus number of restrictive itemsets (100 to 800). Annotations: |RI| = 200, minSup = 0.064%; |RI| = 200, minSup = 0.024%.]

Experimental Results (4/5)
[Figure: sanitization rate (%) of Algo2b, MaxFIA, MinFIA, IGA, and MICF versus number of restrictive itemsets at privacy threshold = 0% on T10.I6.D100k.N500, T20.I10.D100k.N500, BMS-WebView-1, and BMS-WebView-2. Annotations: minSup = 0.004%, minSup = 0.1%, minSup = 0.064%, minSup = 0.024%.]

Experimental Results (5/5)
[Figure: running time (sec) of Algo2b, MaxFIA, MinFIA, IGA, and MICF versus number of restrictive itemsets at privacy threshold = 0% on T10.I6.D100k.N500, T20.I10.D100k.N500, BMS-WebView-1, and BMS-WebView-2.]
Conclusions
Support measure
  NFP-growth is presented for mining frequent itemsets
  It uses two counters per tree node to reduce the number of tree nodes
  It applies a smaller tree and header table to discover frequent itemsets efficiently
Share measure
  The proposed algorithms efficiently decrease the number of candidates to be counted
  ShFSM and DCG perform the best
Utility mining
  IIDS is proposed to ignore isolated items in the process of candidate generation
  FUM and DCG+ perform better than ShFSM and DCG, respectively
Hiding sensitive patterns
  The MICF algorithm is proposed to reduce the impact on the source database
  MICF decreases the support of the maximum number of restrictive itemsets
  MICF outperforms all other algorithms on misses cost in most cases across several datasets
  MICF has the lowest sanitization rate
Future Work
  Apply a constraint relaxation algorithm or develop superior data structures to discover frequent itemsets
  Develop superior algorithms to accelerate the identification of all or long SH-frequent itemsets
  Extend the application scope of IIDS to some classification models
  Develop superior algorithms to further reduce the misses cost without hiding failure, in order to protect sensitive data
  Apply data mining techniques to image processing, for instance, to improve interpolated color filter array images
References
D. Agrawal and C. Aggarwal, “On the design and quantification of privacy preserving data mining algorithms,” in Proc. 20th ACM Symposium on Principles of Database Systems, Santa Barbara, CA, pp. 247-255, May 2001.
R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, “A tree projection algorithm for generation of frequent itemsets,” Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 350-361, 2001.
R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases” in Proc. 1993 ACM SIGMOD Intl. Conf. on Management of Data, Washington, D.C., pp. 207-216, May 1993.
R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proc. 20th Intl. Conf. on Very Large Data Bases, Santiago, Chile, pp. 487-499, Sep. 1994.
M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, “Disclosure limitation of sensitive rules,” in Proc. 1999 Workshop on Knowledge and Data Engineering Exchange, Chicago, IL, pp. 45-52, Nov. 1999.
B. Barber and H. J. Hamilton, “Parametric algorithm for mining share frequent itemsets,” Journal of Intelligent Information Systems, vol. 16, no. 3, pp. 277-293, 2001.
F. Berzal, J. C. Cubero, N. Marín, and J. M. Serrano, “TBAR: An efficient method for association rule mining in relational databases,” Data & Knowledge Engineering, vol. 37, no. 1, pp. 47-64, 2001.
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, “Dynamic itemset counting and implication rules for market basket data,” in Proc. 1997 ACM SIGMOD Intl. Conf. on Management of Data, Tucson, AZ, pp. 255-264, May 1997.
P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, “Discovering Data Mining from Concept to Implementation,” Prentice Hall PTR, New Jersey, 1998.
C. L. Carter, H. J. Hamilton, and N. Cercone, “Share based measures for itemsets,” Lecture Notes in Computer Science 1263 --- 1st European Conf. on the Principles of Data Mining and Knowledge Discovery, H. J. Komorowski and J. M. Zytkow (eds.), Springer-Verlag, Berlin, pp. 14-24, 1997.
G. Grahne and J. Zhu, “Efficiently using prefix-trees in mining frequent itemsets,” in Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, FL, Nov. 2003.
J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in Proc. 2000 ACM-SIGMOD Intl. Conf. on Management of Data, Dallas, TX, pp. 1-12, May 2000.
J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent pattern tree approach,” Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.
T. Johnsten and V. V. Raghavan, “Impact of decision-region based classification mining algorithms on database security,” in Proc. IFIP WG 11.3 13th Intl. Conf. on Database Security, Seattle, WA, pp. 177-191, Jul. 1999.
M. Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms,” John Wiley & Sons, Inc., New York, 2002.
S. R. M. Oliveira and O. R. Zaïane, “Privacy preserving frequent itemset mining,” in Proc. IEEE ICDM Workshop on Privacy, Security and Data Mining, Maebashi City, Japan, pp. 43-54, Dec. 2002.
S. R. M. Oliveira and O. R. Zaïane, “Algorithms for balancing privacy and knowledge discovery in association rule mining,” in Proc. of 7th Intl. Database Engineering and Applications Symposium, Hong Kong, China, pp. 54-63, Jul. 2003.
Y. Saygin, V. S. Verykios, and C. Clifton, “Using unknowns to prevent discovery of association rules,” ACM SIGMOD Record, vol. 30, no. 4, pp. 45-54, 2001.
H. Yao and H. J. Hamilton, “Mining itemset utilities from transaction databases,” Data & Knowledge Engineering, vol. 59, no. 3, pp. 603-626, 2006.
H. Yao, H. J. Hamilton, and C. J. Butz, “A foundational approach to mining itemset utilities from databases,” in Proc. 4th SIAM Intl. Conf. on Data Mining, Lake Buena Vista, FL, pp. 482-486, Apr. 2004.
Background and Related Work
Support-Confidence Framework
  Each item is a binary variable denoting whether an item was purchased
  Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms (Agrawal et al., 1993; Berzal et al., 2001; Brin et al., 1997)
  Pattern-growth algorithms (Agarwal et al., 2001; Grahne & Zhu, 2003; Han et al., 2000; Han et al., 2004)
Share-Confidence Framework (Carter et al., 1997)
  The support-confidence framework does not analyze the exact number of products purchased
  The support count method does not measure the profit or cost of an itemset
  Exhaustive search algorithm
  Fast algorithms