fast algorithms for mining frequent itemsets



Page 1: Fast Algorithms for Mining Frequent Itemsets

Fast Algorithms for Mining Frequent Itemsets (探勘頻繁項目集合之快速演算法研究)

Doctoral dissertation draft

Advisor: Prof. Chin-Chen Chang (張真誠)
Student: Yu-Chiang Li (李育強)
Dept. of Computer Science and Information Engineering, National Chung Cheng University

Date: May 31, 2007

Page 2: Fast Algorithms for Mining Frequent Itemsets

Outline
  Introduction
  Background and Related Work
  NFP-Tree Structure
  Fast Share Measure (FSM) Algorithm
  Three Efficient Algorithms
  Direct Candidate Generation (DCG) Algorithm
  Isolated Items Discarding Strategy (IIDS)
  Maximum Item Conflict First (MICF) Sanitization Method
  Conclusions

Page 3: Fast Algorithms for Mining Frequent Itemsets

Introduction
  Data mining techniques have been developed to find a small set of precious nuggets in reams of data (Cabena et al., 1998; Kantardzic, 2002)
  Mining association rules is one of the most important data mining problems
  Two sub-problems (Agrawal & Srikant, 1994):
    Identifying all frequent itemsets
    Using these frequent itemsets to generate association rules
  The first sub-problem plays an essential role in mining association rules

Page 4: Fast Algorithms for Mining Frequent Itemsets

Introduction (cont'd)
  Mining frequent itemsets
  Mining share-frequent itemsets
  Mining high utility itemsets
  Hiding sensitive patterns

Page 5: Fast Algorithms for Mining Frequent Itemsets


Support-Confidence Framework (1/4)

Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
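The level-wise search behind Apriori can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the dissertation: the function name `apriori` and the data layout are assumptions, and the input is the six sorted transactions listed on the FP-growth slides, mined with minSup = 40%.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every frequent itemset."""
    min_count = min_sup * len(transactions)
    # Level 1 candidates: every single item that occurs in the database.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: unite frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

transactions = [set("CABD"), set("CA"), set("CA"),
                set("CBD"), set("ABD"), set("CBD")]
result = apriori(transactions, 0.4)
for itemset, count in sorted(result.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print("".join(sorted(itemset)), count)
```

With six transactions, minSup = 40% means an itemset needs a support count of at least 3, so AB and AD (count 2) are discarded and ABC, ACD are pruned before counting.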

Page 6: Fast Algorithms for Mining Frequent Itemsets

Support-Confidence Framework (2/4)

FP-growth algorithm (Han et al., 2000; Han et al., 2004)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D

[Figure: FP-tree and header table (C, A, B, D) after inserting transactions 001, 002, and 003. The single path root -> C -> A -> B -> D carries the counts C(1) A(1) B(1) D(1), then C(2) A(2) B(1) D(1), then C(3) A(3) B(1) D(1)]

Page 7: Fast Algorithms for Mining Frequent Itemsets

Support-Confidence Framework (3/4)

[Figure: FP-tree and header table (C, A, B, D) after inserting transactions 004, 005, and 006. Final tree: root -> C(5) -> A(3) -> B(1) -> D(1); root -> C(5) -> B(2) -> D(2); root -> A(1) -> B(1) -> D(1)]
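The tree construction walked through on these slides can be reproduced with a short Python sketch. The class and function names (`FPNode`, `build_fp_tree`) are illustrative assumptions; the input is the table of six sorted transactions from the slides, and only the tree-building step of FP-growth is shown, not the mining step.

```python
class FPNode:
    """One node of an FP-tree: an item, its count, and its child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(sorted_transactions):
    """Insert each (already frequency-sorted) transaction as a path from the root."""
    root = FPNode(None)
    header = {}  # item -> list of tree nodes that carry this item
    for transaction in sorted_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                # No shared prefix from here on: start a new branch.
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

transactions = ["CABD", "CA", "CA", "CBD", "ABD", "CBD"]
root, header = build_fp_tree(transactions)

c = root.children["C"]
print(c.count, c.children["A"].count, c.children["B"].count)
print(sum(node.count for node in header["D"]))
```

Transactions that share a prefix share nodes, which is why the final tree has C(5) with the two children A(3) and B(2), plus a separate branch starting at A(1) for transaction 005.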

Page 8: Fast Algorithms for Mining Frequent Itemsets

Support-Confidence Framework (4/4)

[Figure: the final FP-tree with its header table, the conditional FP-tree of "D", and the conditional FP-tree of "BD"]

Page 9: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (1/4)

Measure value: $mv(i_p, T_q)$, e.g. $mv(\{D\}, T_{01}) = 1$, $mv(\{C\}, T_{03}) = 3$

Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$, e.g. $tmv(T_{02}) = 10$

Total measure value: $Tmv(DB) = \sum_{T_q \in DB} \sum_{i_p \in T_q} mv(i_p, T_q)$, e.g. $Tmv(DB) = 47$

Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X,\ X \subseteq T_q} mv(i_p, T_q)$, e.g. $imv(\{A, E\}, T_{02}) = 5$

Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, e.g. $lmv(\{BC\}) = 2 + 5 + 5 = 12$

Page 10: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (2/4)

minShare = 30%

Itemset share: $SH(X) = lmv(X) / Tmv$, e.g. $SH(\{BC\}) = 12/47 = 25.5\%$

SH-frequent: if $SH(X) \geq minShare$, X is a share-frequent (SH-frequent) itemset
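The share measures can be computed directly from their definitions. The sketch below is an illustration only: the three-transaction database is hypothetical (the slide's own example database is not reproduced in this transcript), and the function names are assumptions. Each transaction maps an item to its measure value, for example the purchased quantity.

```python
def tmv(transaction):
    """Transaction measure value: sum of the measure values of all its items."""
    return sum(transaction.values())

def total_mv(db):
    """Total measure value Tmv(DB): tmv summed over every transaction."""
    return sum(tmv(t) for t in db.values())

def lmv(itemset, db):
    """Local measure value: imv(X, Tq) summed over the transactions containing X."""
    items = set(itemset)
    return sum(sum(t[i] for i in items)
               for t in db.values() if items.issubset(t))

def share(itemset, db):
    """Itemset share SH(X) = lmv(X) / Tmv(DB)."""
    return lmv(itemset, db) / total_mv(db)

# Hypothetical database, not the one used on the slides.
db = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 1, "C": 3, "D": 2},
    "T03": {"A": 2, "C": 1},
}
min_share = 0.30
for x in ("BC", "A"):
    print(x, lmv(x, db), round(share(x, db), 3), share(x, db) >= min_share)
```

In this toy database Tmv = 13, so {B, C} with lmv = 7 is SH-frequent at minShare = 30%, while {A} with lmv = 3 is not.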

Page 11: Fast Algorithms for Mining Frequent Itemsets

Share-Confidence Framework (3/4)

ZP (Zero Pruning) and ZSP (Zero Subset Pruning) (Barber & Hamilton, 2003)
  Variants of exhaustive search
  Prune only the candidate itemsets whose local measure values are exactly zero

SIP (Share Infrequent Pruning) (Barber & Hamilton, 2003)
  Apriori-like, but its pruning introduces errors

These three algorithms are either inefficient or fail to discover the complete set of share-frequent (SH-frequent) itemsets

Page 12: Fast Algorithms for Mining Frequent Itemsets

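Share-Confidence Framework (4/4)

[Figure: itemset lattice annotated with local measure values, contrasting the ZSP algorithm and the SIP algorithm]
1-itemsets: A:12, B:9, C:10, D:6, E:4, ..., H:1
2-itemsets: AB:6, AC:16, AD:7, AE:12, BC:12, BD:15, BE:0, CD:8, CE:10, DE:0, ..., GH:2
3-itemsets: ABC:3, ABD:9, ACD:3, ACE:18, BCD:16, ..., DGH:3
4-itemsets: ABCD:4, ..., CDGH:4
5-itemsets: ABCDG:5, ..., BCDGH:5
6-itemsets: ABCDGH:6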

Page 13: Fast Algorithms for Mining Frequent Itemsets

Utility Mining (1/2)

Internal utility: $iu(i_p, T_q)$, e.g. $iu(\{D\}, T_{01}) = 1$, $iu(\{C\}, T_{03}) = 3$

External utility: $eu(i_p)$, e.g. $eu(\{D\}) = 3$, $eu(\{C\}) = 1$

Utility value in a transaction: $util(X, T_q) = \sum_{i_p \in X} util(i_p, T_q)$, e.g. $util(\{C, E, F\}, T_{02}) = util(C, T_{02}) + util(E, T_{02}) + util(F, T_{02}) = 3 \times 1 + 1 \times 5 + 2 \times 2 = 12$

Local utility: $Lutil(X) = \sum_{T_q \in DB_X} util(X, T_q) = \sum_{T_q \in DB_X} \sum_{i_p \in X} util(i_p, T_q)$, e.g. $Lutil(\{C, D\}) = util(\{C, D\}, T_{01}) + util(\{C, D\}, T_{04}) + util(\{C, D\}, T_{06}) = 4 + 7 + 5 = 16$

Page 14: Fast Algorithms for Mining Frequent Itemsets

Utility Mining (2/2)

Total utility: $Tutil(DB) = \sum_{T_q \in DB} util(T_q, T_q)$, e.g. $Tutil(DB) = 122$

The utility value of X in DB: $UTIL(X) = Lutil(X) / Tutil(DB)$, e.g. $UTIL(\{C, D\}) = 16/122 = 13.1\%$

High utility itemset: if $UTIL(X) \geq minUtil$, X is a high utility itemset
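A small Python sketch may help tie the utility definitions together. It assumes, as the worked example 3×1 + 1×5 + 2×2 suggests, that the utility of an item in a transaction is its internal utility (quantity) times its external utility (unit value). The entries for T02 reproduce only the three quantities quoted in that example; transaction T09 and all function names are hypothetical.

```python
def util(itemset, tid, db, eu):
    """util(X, Tq): quantity times unit utility, summed over the items of X."""
    return sum(db[tid][item] * eu[item] for item in itemset)

def lutil(itemset, db, eu):
    """Lutil(X): util(X, Tq) summed over the transactions that contain X."""
    items = set(itemset)
    return sum(util(items, tid, db, eu)
               for tid, t in db.items() if items.issubset(t))

def tutil(db, eu):
    """Tutil(DB): utility of each whole transaction, summed over DB."""
    return sum(util(t, tid, db, eu) for tid, t in db.items())

def utility_share(itemset, db, eu):
    """UTIL(X) = Lutil(X) / Tutil(DB)."""
    return lutil(itemset, db, eu) / tutil(db, eu)

# External utilities: C and D as stated on the slide, E and F as implied
# by the terms 1x5 and 2x2 of the worked example.
eu = {"C": 1, "D": 3, "E": 5, "F": 2}
db = {
    "T02": {"C": 3, "E": 1, "F": 2},  # quantities from the slide's example
    "T09": {"C": 2, "D": 1},          # hypothetical transaction
}
print(util("CEF", "T02", db, eu))
print(lutil("C", db, eu), tutil(db, eu))
```

An itemset X is then a high utility itemset when `utility_share(X, db, eu)` reaches the chosen minUtil.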

Page 15: Fast Algorithms for Mining Frequent Itemsets

Privacy-Preserving in Mining Frequent Itemsets
  NP-hard problem (Atallah et al., 1999)
  DB: database; DB': released database
  RI: the set of restrictive itemsets
  ~RI: the set of non-restrictive itemsets
  Misses cost = $\dfrac{|{\sim}RI(DB)| - |{\sim}RI(DB')|}{|{\sim}RI(DB)|}$
  Sanitization algorithms (Oliveira and Zaïane, 2002; Oliveira and Zaïane, 2003; Saygin et al., 2001)
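The misses cost is the fraction of non-restrictive frequent itemsets that can no longer be mined from the released database. A minimal sketch, assuming the two sets of non-restrictive frequent itemsets have already been mined from DB and DB' (the function name and the sample itemsets are hypothetical):

```python
def misses_cost(nonrestrictive_before, nonrestrictive_after):
    """Fraction of non-restrictive frequent itemsets lost by sanitization.

    With deletion-based sanitization the itemsets found in DB' are a subset
    of those found in DB, so counting the surviving ones gives |~RI(DB')|.
    """
    before = set(nonrestrictive_before)
    if not before:
        return 0.0
    surviving = before & set(nonrestrictive_after)
    return (len(before) - len(surviving)) / len(before)

# Hypothetical mining results before and after sanitization.
before = {frozenset("A"), frozenset("B"), frozenset("C"),
          frozenset("AB"), frozenset("BC")}
after = {frozenset("A"), frozenset("B"), frozenset("C"), frozenset("AB")}
print(misses_cost(before, after))
```

Here one of five non-restrictive itemsets is lost, giving a misses cost of 20%.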

Page 16: Fast Algorithms for Mining Frequent Itemsets

NFP-Tree (1/4)
  NFP-growth algorithm
  NFP-tree construction

Page 17: Fast Algorithms for Mining Frequent Itemsets

NFP-Tree (2/4)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D

[Figure: NFP-tree and header table (A, B, D) after inserting transactions 001, 002, and 003. Each node carries two counters: C(5,5) throughout, with A(1,1), then A(2,2), then A(3,3), and B(1,1), D(1,1)]

Page 18: Fast Algorithms for Mining Frequent Itemsets

NFP-Tree (3/4)

[Figure: NFP-tree and header table (A, B, D) after inserting transactions 004, 005, and 006, shown next to the corresponding FP-tree. Node labels in the final NFP-tree: C(5,5), A(3,4), B(2,2), D(2,2), B(1,2), D(1,2)]

Page 19: Fast Algorithms for Mining Frequent Itemsets

NFP-Tree (4/4)

[Figure: the final NFP-tree with its header table (A, B, D) and the conditional NFP-tree of "D(3,4)", whose header table holds B and whose root branch is B(3,4)]

Page 20: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/3)
  PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
  All algorithms were coded in VC++ 6.0
  Datasets:
    Real: BMS-WebView-1, BMS-WebView-2, Connect-4
    Artificial: generated by the IBM synthetic data generator

Parameter   Meaning
|D|         Number of transactions in DB
|T|         Mean size of the transactions
|I|         Mean size of the maximal potentially frequent itemsets
|L|         Number of maximal potentially frequent itemsets
N           Number of items

Page 21: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/3)

[Figure: running time (sec) versus minimum support (%) for FP and NFP. Panels: Connect-4 (57% to 75%), BMS-WebView-1 (0.056% to 0.066%), BMS-WebView-2 (0.008% to 0.032%)]

Page 22: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (3/3)

[Figure: running time (sec) versus minimum support (%), 0.010% to 0.090%, for FP and NFP. Panels: T10.I6.D500k.L10k and T10.I6.D500k.L50]

Page 23: Fast Algorithms for Mining Frequent Itemsets

Fast Share Measure (FSM) Algorithm

FSM: Fast Share Measure algorithm
ML: maximum transaction length in DB
MV: maximum measure value in DB
min_lmv = minShare × Tmv

Level Closure Property: given a minShare and a k-itemset X,
  Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent.
  Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent.
  Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv, all supersets of X are infrequent.

Page 24: Fast Algorithms for Mining Frequent Itemsets

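[Figure: itemset lattice annotated with local measure values]
1-itemsets: A:12, B:9, C:10, D:6, E:4, ..., H:1
2-itemsets: AB:6, AC:16, AD:7, AE:12, BC:12, BD:15, BE:0, CD:8, CE:10, DE:0
3-itemsets: ABC:3, ABD:9, ACD:3, ACE:18, BCD:16

minShare = 30%
Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k)
Prune X if CF(X) < min_lmv
CF({ABC}) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14.1 = min_lmv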
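The pruning test can be written as a two-line function. The sketch below reproduces the CF({ABC}) example, taking MV = 3 and ML = 6 from that calculation and Tmv = 47 from the share-confidence example; the function names are illustrative assumptions.

```python
def critical_function(lmv_x, k, mv_max, ml):
    """CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k) for a k-itemset X."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

def can_prune(lmv_x, k, mv_max, ml, min_share, tmv_db):
    """True when CF(X) < min_lmv, i.e. no superset of X can be SH-frequent."""
    min_lmv = min_share * tmv_db
    return critical_function(lmv_x, k, mv_max, ml) < min_lmv

# {ABC}: lmv = 3, k = 3, MV = 3, ML = 6, minShare = 30%, Tmv = 47
print(critical_function(3, 3, 3, 6))
print(can_prune(3, 3, 3, 6, 0.30, 47))
# {AC}: lmv = 16 with k = 2 is far above the bound, so it is kept.
print(can_prune(16, 2, 3, 6, 0.30, 47))
```

The bound is loose for short itemsets because the factor (ML - k) is large, which is why FSM prunes more as k grows.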

Page 25: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/2)

T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14

Pass (k)   Set   ZSP       FSM(1)   FSM(2)   FSM(3)   FSM(ML-1)
k=1        Ck    50        50       50       50       50
           RCk   50        49       49       49       50
           Fk    32        32       32       32       32
k=2        Ck    1225      1176     1176     1176     1225
           RCk   1219      570      754      845      1085
           Fk    119       119      119      119      119
k=3        Ck    19327     4256     7062     8865     14886
           RCk   17217     868      1685     2410     5951
           Fk    65        65       65       65       65
k=4        Ck    165077    1725     3233     5568     24243
           RCk   107397    232      644      1236     6117
           Fk    9         9        9        9        9
k=5        Ck    406374    81       258      717      6309
           RCk   266776    5        40       109      1199
           Fk    0         0        0        0        0
k=6        Ck    369341    0        1        4        287
           RCk   310096    0        0        0        37
           Fk    0         0        0        0        0
k>=7       Ck    365975    0        0        0        0
           RCk   359471    0        0        0        0
           Fk    0         0        0        0        0
Time (sec)       10349.9   2.30     2.98     3.31     11.24

Page 26: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/2)

[Figure: running time (sec, logarithmic axis from 1 to 100000) of ZSP, FSM(ML-1), FSM(3), FSM(2), and FSM(1). Panels: T4.I2.Dz.N50.S10 versus number of transactions (k), 0 to 1000; T4.I2.D100k.N50.S10 versus minShare (%), 0.2 to 2]

Page 27: Fast Algorithms for Mining Frequent Itemsets

Three Efficient Algorithms

EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RCk-1, EFSM joins each itemset of RCk-1 with a single item in RC1 to generate Ck efficiently
Reduces the time complexity from O(n^(2k-2)) to O(n^k)
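The EFSM join can be pictured as extending each remaining (k-1)-itemset by one remaining single item. The following sketch is only an illustration of that idea under stated assumptions: `efsm_candidates` is a hypothetical name, duplicates are removed with a set, and any further filtering EFSM may apply to the candidates is not shown.

```python
def efsm_candidates(rc_prev, rc1_items):
    """Generate k-item candidates by extending each itemset of RC(k-1)
    with one item of RC1 that it does not already contain."""
    candidates = set()
    for itemset in rc_prev:
        for item in rc1_items:
            if item not in itemset:
                candidates.add(itemset | {item})
    return candidates

# Hypothetical remaining candidates after pass k-1 = 2.
rc2 = {frozenset("AC"), frozenset("BD")}
rc1 = ["A", "B", "C", "D"]
c3 = efsm_candidates(rc2, rc1)
print(sorted("".join(sorted(c)) for c in c3))
```

Each of the |RCk-1| itemsets is paired with at most |RC1| items, instead of pairing the itemsets of RCk-1 with each other.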

Page 28: Fast Algorithms for Mining Frequent Itemsets

X^{k+1}: an arbitrary superset of X with length k+1 in DB
S(X^{k+1}): the set that contains all X^{k+1} in DB
dbS(X^{k+1}): the set of transactions each of which contains at least one X^{k+1}

SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM

SuFSM (Support-counted FSM):
  Theorem 3. If lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv, all supersets of X are infrequent.

Page 29: Fast Algorithms for Mining Frequent Itemsets

SuFSM (Support-counted FSM)

lmv(X)/k ≥ Sup(X) ≥ Sup(S(X^{k+1}))
Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}^{k+1})) = 2

If no superset of X is an SH-frequent itemset, then the following inequalities hold:
  lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  lmv(X) + Sup(X) × MV × (ML - k) < min_lmv
  lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv

Page 30: Fast Algorithms for Mining Frequent Itemsets

ShFSM (Share-counted FSM)

Theorem 4. If Tmv(dbS(X^{k+1})) < min_lmv, all supersets of X are infrequent.

Pruning conditions compared:
  FSM:   lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
  ShFSM: Tmv(dbS(X^{k+1})) < min_lmv

Page 31: Fast Algorithms for Mining Frequent Itemsets

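ShFSM (Share-counted FSM)

[Figure: itemset lattice annotated with local measure values]
1-itemsets: A:12, B:9, C:10, D:6, E:4, ..., H:1
2-itemsets: AB:6, AC:16, AD:7, AE:12, BC:12, BD:15, BE:0, CD:8, CE:10, DE:0
3-itemsets: ACE:18, BCD:16

Ex. X = {AB}: Tmv(dbS(X^{k+1})) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv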
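The ShFSM test only needs the total measure value of the transactions that contain X together with at least one more item. The sketch below follows that reading of dbS(X^{k+1}); the database is hypothetical, with quantities chosen so that T01 and T05 each have tmv = 6 as in the {AB} example, and the function name is an assumption.

```python
def tmv_of_supersets(x, db):
    """Tmv(dbS(X^{k+1})): total measure value of the transactions that
    contain X plus at least one additional item."""
    items = set(x)
    return sum(sum(t.values()) for t in db.values()
               if items.issubset(t) and len(t) > len(items))

# Hypothetical database; only the tmv values of T01 and T05 follow the slide.
db = {
    "T01": {"A": 1, "B": 1, "C": 1, "D": 3},  # tmv = 6, contains a superset of AB
    "T02": {"C": 3, "E": 1},                  # does not contain AB
    "T03": {"A": 2, "B": 2},                  # contains AB but nothing else
    "T05": {"A": 2, "B": 1, "D": 3},          # tmv = 6, contains a superset of AB
}
min_lmv = 14
bound = tmv_of_supersets("AB", db)
print(bound, bound < min_lmv)
```

Because the bound 12 is below min_lmv = 14, no superset of {AB} can be SH-frequent and the whole branch is pruned.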

Page 32: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/3)

[Figure: running time (sec, logarithmic axis). Panels: T6.I4.Dz.N200.S10 versus number of transactions (k), 0 to 1000, for FSM, EFSM, SuFSM, ShFSM; T6.I4.D100k.N200.S10 versus minShare (%), 0 to 1.2, for FSM, EFSM, SuFSM, ShFSM; T4.I2.D100k.N50.S10 versus minShare (%) for ZSP, EZSP, FSM, EFSM, SuFSM, ShFSM; T10.I6.D100k.N500.S20 versus minShare (%) for FSM, EFSM, SuFSM, ShFSM. Slide annotation: minShare = 0.3%]

Page 33: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/3)

[Figure: T6.I4.D100k.N200.Sm, running time (sec, logarithmic axis) versus S, 0 to 60, for FSM, EFSM, SuFSM, ShFSM; minShare = 0.3%]

Page 34: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (3/3)

T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20

Pass (k)   Set   FSM       EFSM      SuFSM    ShFSM    Fk
k=1        Ck    200       200       200      200      159
           RCk   200       200       199      197
k=2        Ck    19900     19900     19701    19306    1844
           RCk   16214     16214     13312    7199
k=3        Ck    829547    829547    564324   190607   101
           RCk   251877    251877    99765    9792
k=4        Ck    3290296   3290296   793042   20913    0
           RCk   332877    332877    41057    1420
k=5        Ck    393833    393833    25003    1050     5
           RCk   71420     71420     19720    959
k=6        Ck    26137     26137     11582    518      8
           RCk   25562     25562     11045    506
k=7        Ck    11141     11141     5940     204      7
           RCk   11099     11099     5827     196
k=8        Ck    4426      4426      2797     58       1
           RCk   4423      4423      2750     54
k>=9       Ck    2036      2036      1567     12       0
           RCk   2030      2030      1513     10
Time (sec)       13610.4   71.55     29.67    10.95

Page 35: Fast Algorithms for Mining Frequent Itemsets

Direct Candidate Generation (DCG) Algorithm

[Figure: DCG candidate generation example; each lattice node carries an array of per-item values for the items that may extend it]
1-itemsets: A:12, B:9, C:10, D:6, E:4, F:4, G:1, H:1
2-itemsets: AC:16, AE:12, BC:12, BD:15, CD:8, CE:10, CF:8
3-itemsets: ACE:18, BCD:16

Page 36: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/3)

[Figure: running time (sec) versus minShare (%). Panels: T6.I4.D100k.N200.S10, minShare 0 to 1.2, logarithmic time axis, for FSM, EFSM, SuFSM, ShFSM, DCG; T10.I6.D100k.N1000.S10, minShare 0 to 0.12, for SuFSM, ShFSM, DCG]

Page 37: Fast Algorithms for Mining Frequent Itemsets


Experimental Results (2/3)

Page 38: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (3/3)

[Figure: running time (sec). Panels: T6.I4.Dz.N200.S10 versus number of transactions (k), 0 to 1000, for SuFSM, ShFSM, DCG; T6.I4.D100k.N200.Sm versus S, 0 to 60, for SuFSM, ShFSM, DCG; BMS-WebView-2.S10 versus minShare (%), 0 to 1.2, for ShFSM, DCG]

Page 39: Fast Algorithms for Mining Frequent Itemsets

Isolated Item Discarding Strategy (IIDS) for Utility Mining

[Flowchart]
  1. Initially, ISet1 = Empty, k = 1, C1 = I
  2. Scan DB, skipping all items ip of ISetk
  3. Generate HUIk(DB) and RCk
  4. Generate ISetk+1
  5. Generate Ck+1
  6. |Ck+1| > 0 ? Yes: k++ and return to the scan. No: end, or go to the second phase

Page 40: Fast Algorithms for Mining Frequent Itemsets

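IIDS (1/2): ShFSM, minUtil = 30%

[Figure: itemset lattice; each node shows the itemset, a value, and a second value in parentheses]
1-itemsets: A 92 (36), B 68 (18), C 105 (10), D 68 (18), E 54 (20), F 43 (8), G 21 (8), H 21 (4)
2-itemsets: AB 38 (16), AC 75 (34), AD 38 (21), AE 54 (44), AF 24 (16), BC 51 (20), BD 68 (36), BE 0 (0), BF 19 (12), CD 51 (16), CE 54 (26), CF 43 (12), DE 0 (0), DF 19 (10), EF 24 (9)
3-itemsets: ABC 21 (6), ABD 21 (25), ACD 21 (7), ACE 24 (50), BCD 40 (32)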

Page 41: Fast Algorithms for Mining Frequent Itemsets

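IIDS (2/2): FUM, minUtil = 30%

[Figure: itemset lattice; each node shows the itemset, a value, and a second value in parentheses]
1-itemsets: A 92 (36), B 68 (18), C 105 (10), D 68 (18), E 54 (20), F 43 (8), G 21 (8), H 21 (4)
2-itemsets: AB 26 (16), AC 63 (34), AD 26 (21), AE 54 (44), AF 24 (16), BC 39 (20), BD 56 (36), BE 0 (0), BF 19 (12), CD 39 (16), CE 54 (26), CF 43 (12), DE 0 (0), DF 19 (10), EF 24 (9)
3-itemsets: ACE 24 (50), BCD 28 (32)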

Page 42: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (1/5)

[Figure: number of items versus external utility (0 to 10). Panels: 1000 items; 2000 items]

[Figure: T10.I6.D1000k.N1000, running time (sec) versus minUtil for TP, ShFSM, DCG, FUM, DCG+. Panels: minUtil 0.02% to 0.08%; minUtil 0.08% to 0.32%]

Page 43: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/5)

[Figure: running time (sec) versus minUtil for TP, ShFSM, DCG, FUM, DCG+. Panels: T10.I6.D1000k.N2000, minUtil 0.02% to 0.08%; T10.I6.D1000k.N2000, minUtil 0.08% to 0.32%; T20.I6.D1000k.N1000, minUtil 0.04% to 0.16%; T20.I6.D1000k.N1000, minUtil 0.16% to 0.40%]

Page 44: Fast Algorithms for Mining Frequent Itemsets


Experimental Results (3/5)

Page 45: Fast Algorithms for Mining Frequent Itemsets


Page 46: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (5/5)

[Figure: running time (sec) for TP, ShFSM, DCG, FUM, DCG+ versus transaction number (k), 0 to 6000. Panels: T10.I6.Dxk.N1000, minUtil = 0.12%; T20.I6.Dxk.N1000, minUtil = 0.12%]

[Figure: Chain-store, running time (sec) versus minUtil, 0.04% to 0.36%, for TP, ShFSM, FUM]

Page 47: Fast Algorithms for Mining Frequent Itemsets

Maximum Item Conflict First (MICF) Sanitization Method

Tdegree(Tq): the degree of conflict of a sensitive transaction Tq, i.e. the number of restrictive itemsets that are included in Tq
If Tdegree(Tq) > 1, Tq is a conflicting transaction
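The conflict degree of a transaction is a simple count. In the sketch below the three restrictive itemsets are hypothetical, chosen only so that the transactions T05 = {B, D, F, H} and T06 = {A, B, D, F, H} from the slides obtain the conflict degrees 2 and 3 reported there; `tdegree` is an illustrative name.

```python
def tdegree(transaction, restrictive_itemsets):
    """Number of restrictive itemsets fully contained in the transaction."""
    items = set(transaction)
    return sum(1 for r in restrictive_itemsets if set(r) <= items)

def is_conflicting(transaction, restrictive_itemsets):
    """A sensitive transaction is conflicting when its Tdegree exceeds 1."""
    return tdegree(transaction, restrictive_itemsets) > 1

# Hypothetical restrictive itemsets; only {D, F} appears on the slides.
ri = [{"D", "F"}, {"B", "H"}, {"A", "B"}]
t05 = {"B", "D", "F", "H"}
t06 = {"A", "B", "D", "F", "H"}
print(tdegree(t05, ri), tdegree(t06, ri))
print(is_conflicting(t05, ri))
```

A transaction that contains a single restrictive itemset is sensitive but not conflicting, so deleting one item from it cannot affect any other restrictive itemset.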

Page 48: Fast Algorithms for Mining Frequent Itemsets

Idegree({D}, {D, F}, T05) = 1
Idegree({F}, {D, F}, T05) = 0
MaxIdegree: stores the maximum conflict degree among the items in a transaction
MICF: in each iteration, selects an item with MaxIdegree to delete

TID   Transaction        Tdegree(Tq)
T05   {B, D, F, H}       2
T06   {A, B, D, F, H}    3

Page 49: Fast Algorithms for Mining Frequent Itemsets

Idegree({D}, {D, F}, T06) = 1
Idegree({F}, {D, F}, T06) = 0

TID   Transaction        Tdegree(Tq)
T06   {A, B, D, F, H}    3

Page 50: Fast Algorithms for Mining Frequent Itemsets


Experimental Results (1/5)

Page 51: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (2/5)

[Figure: misses cost (%) for Algo2b, MaxFIA, MinFIA, IGA, MICF; privacy threshold = 0%. Panels: T10.I6.D100k.N500 versus minSup (%), 0.01 to 0.10; T10.I6.D100k.N500 versus number of restrictive itemsets, 100 to 800; T20.I10.D100k.N500 versus minSup (%), 0.06 to 0.14; T20.I10.D100k.N500 versus number of restrictive itemsets, 50 to 400. Slide annotations: |RI| = 200, minSup = 0.04%; |RI| = 50, minSup = 0.1%]

Page 52: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (3/5)

[Figure: misses cost (%), 80% to 100%, for Algo2b, MaxFIA, MinFIA, IGA, MICF; privacy threshold = 0%. Panels: BMS-WebView-1 versus minSup (%), 0.056 to 0.070; BMS-WebView-1 versus number of restrictive itemsets, 100 to 800; BMS-WebView-2 versus minSup (%), 0.012 to 0.036; BMS-WebView-2 versus number of restrictive itemsets, 100 to 800. Slide annotations: |RI| = 200, minSup = 0.064%; |RI| = 200, minSup = 0.024%]

Page 53: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (4/5)

[Figure: sanitization rate (%) versus number of restrictive itemsets for Algo2b, MaxFIA, MinFIA, IGA, MICF; privacy threshold = 0%. Panels: T10.I6.D100k.N500; T20.I10.D100k.N500; BMS-WebView-1; BMS-WebView-2. Slide annotations: minSup = 0.004%, minSup = 0.1%, minSup = 0.064%, minSup = 0.024%]

Page 54: Fast Algorithms for Mining Frequent Itemsets

Experimental Results (5/5)

[Figure: running time (sec) versus number of restrictive itemsets for Algo2b, MaxFIA, MinFIA, IGA, MICF; privacy threshold = 0%. Panels: T10.I6.D100k.N500; T20.I10.D100k.N500; BMS-WebView-1; BMS-WebView-2]

Page 55: Fast Algorithms for Mining Frequent Itemsets

Conclusions

Support measure
  NFP-growth is presented for mining frequent itemsets
  Uses two counters per tree node to reduce the number of tree nodes
  Applies a smaller tree and header table to discover frequent itemsets efficiently

Share measure
  The proposed algorithms efficiently decrease the number of candidates to be counted
  ShFSM and DCG perform the best

Page 56: Fast Algorithms for Mining Frequent Itemsets

Utility mining
  IIDS is proposed to ignore isolated items during candidate generation
  FUM and DCG+ perform better than ShFSM and DCG, respectively

Hiding sensitive patterns
  The MICF algorithm is proposed to reduce the impact on the source database
  MICF decreases the support of the maximum number of restrictive itemsets
  In most cases, MICF outperforms all other algorithms on misses cost across several datasets
  MICF has the lowest sanitization rate

Page 57: Fast Algorithms for Mining Frequent Itemsets

Future Work
  Apply a constraint relaxation algorithm or develop superior data structures to discover frequent itemsets
  Develop superior algorithms to accelerate the identification of all or long SH-frequent itemsets
  Extend the application scope of IIDS to some classification models
  Develop superior algorithms that further reduce the misses cost without hiding failure, to protect sensitive data
  Apply data mining techniques to image processing, for instance to improve interpolated color filter array images

Page 58: Fast Algorithms for Mining Frequent Itemsets

References

D. Agrawal and C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proc. 20th ACM Symposium on Principles of Database Systems, Santa Barbara, CA, pp. 247-255, May 2001.
R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, "A tree projection algorithm for generation of frequent itemsets," Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 350-361, 2001.
R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. 1993 ACM SIGMOD Intl. Conf. on Management of Data, Washington, D.C., pp. 207-216, May 1993.
R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Intl. Conf. on Very Large Data Bases, Santiago, Chile, pp. 487-499, Sep. 1994.
M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure limitation of sensitive rules," in Proc. 1999 Workshop on Knowledge and Data Engineering Exchange, Chicago, IL, pp. 45-52, Nov. 1999.
B. Barber and H. J. Hamilton, "Parametric algorithm for mining share frequent itemsets," Journal of Intelligent Information Systems, vol. 16, no. 3, pp. 277-293, 2001.
F. Berzal, J. C. Cubero, N. Marín, and J. M. Serrano, "TBAR: An efficient method for association rule mining in relational databases," Data & Knowledge Engineering, vol. 37, no. 1, pp. 47-64, 2001.

Page 59: Fast Algorithms for Mining Frequent Itemsets

S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proc. 1997 ACM SIGMOD Intl. Conf. on Management of Data, Tucson, AZ, pp. 255-264, May 1997.
P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, "Discovering Data Mining: From Concept to Implementation," Prentice Hall PTR, New Jersey, 1998.
C. L. Carter, H. J. Hamilton, and N. Cercone, "Share based measures for itemsets," Lecture Notes in Computer Science 1263, 1st European Conf. on the Principles of Data Mining and Knowledge Discovery, H. J. Komorowski and J. M. Zytkow (eds.), Springer-Verlag, Berlin, pp. 14-24, 1997.
G. Grahne and J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," in Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, FL, Nov. 2003.
J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. 2000 ACM-SIGMOD Intl. Conf. on Management of Data, Dallas, TX, pp. 1-12, May 2000.
J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent pattern tree approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.

Page 60: Fast Algorithms for Mining Frequent Itemsets

T. Johnsten and V. V. Raghavan, "Impact of decision-region based classification mining algorithms on database security," in Proc. IFIP WG 11.3 13th Intl. Conf. on Database Security, Seattle, WA, pp. 177-191, Jul. 1999.
M. Kantardzic, "Data Mining: Concepts, Models, Methods, and Algorithms," John Wiley & Sons, Inc., New York, 2002.
S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving frequent itemset mining," in Proc. IEEE ICDM Workshop on Privacy, Security and Data Mining, Maebashi City, Japan, pp. 43-54, Dec. 2002.
S. R. M. Oliveira and O. R. Zaïane, "Algorithms for balancing privacy and knowledge discovery in association rule mining," in Proc. of 7th Intl. Database Engineering and Applications Symposium, Hong Kong, China, pp. 54-63, Jul. 2003.
Y. Saygin, V. S. Verykios, and C. Clifton, "Using unknowns to prevent discovery of association rules," ACM SIGMOD Record, vol. 30, no. 4, pp. 45-54, 2001.
H. Yao and H. J. Hamilton, "Mining itemset utilities from transaction databases," Data & Knowledge Engineering, vol. 59, no. 3, pp. 603-626, 2006.
H. Yao, H. J. Hamilton, and C. J. Butz, "A foundational approach to mining itemset utilities from databases," in Proc. 4th SIAM Intl. Conf. on Data Mining, Lake Buena Vista, FL, pp. 482-486, Apr. 2004.

Page 61: Fast Algorithms for Mining Frequent Itemsets

Thank You!

Page 62: Fast Algorithms for Mining Frequent Itemsets

Background and Related Work

Support-Confidence Framework
  Each item is a binary variable denoting whether the item was purchased
  Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms (Agrawal et al., 1993; Berzal et al., 2001; Brin et al., 1997)
  Pattern-growth algorithms (Agarwal et al., 2001; Grahne & Zhu, 2003; Han et al., 2000; Han et al., 2004)

Share-Confidence Framework (Carter et al., 1997)
  The support-confidence framework does not analyze the exact number of products purchased
  The support count method does not measure the profit or cost of an itemset
  Exhaustive search algorithms
  Fast algorithms

Page 63: Fast Algorithms for Mining Frequent Itemsets

Utility mining (Yao et al., 2004; Yao & Hamilton, 2006)
  A generalized form of the share-confidence framework

Privacy-Preserving in Mining Frequent Itemsets
  Classification rules (Agrawal & Aggarwal, 2001; Johnsten & Raghavan, 1999)
  Association rules (Atallah et al., 1999; Oliveira & Zaïane, 2002)