Efficient Algorithms for Mining Share-Frequent Itemsets. Authors: Y. C. Li, J. S. Yeh and C. C. Chang. Speaker: Yu-Chiang Li. Date: July 28, 2005



Efficient Algorithms for Mining Share-Frequent Itemsets

Authors: Y. C. Li, J. S. Yeh and C. C. Chang
Speaker: Yu-Chiang Li
Date: July 28, 2005


Outline
- Introduction
- Related Work
- Enhanced Fast Share Measure (EFSM) Algorithm
- Support-Counted Fast Share Measure (SuFSM) Algorithm
- Share-Counted Fast Share Measure (ShFSM) Algorithm
- Experimental Results
- Conclusions


Introduction (1/2)
- Goal: discovering the buying patterns of customers
- Itemset: a group of items (products) bought together in a transaction
- Support: the ratio of the number of transactions containing the itemset to the total number of transactions (offers limited informative feedback)
- Share: the ratio of the total count of items in the itemset to the total count of items in the database


Introduction (2/2)
- Share-confidence framework: provides useful information about numerical values associated with transaction items (Carter et al., 1997)
- Share-frequent (SH-frequent) itemset: usually includes some infrequent subsets
- The Fast Share Measure (FSM) algorithm discovers share-frequent itemsets efficiently on small datasets
- This study proposes Enhanced FSM (EFSM), SuFSM and ShFSM to discover share-frequent itemsets more efficiently than FSM


Related Work
- Support-Confidence Framework (Agrawal et al., 1993)
  - Each item is a binary variable denoting whether an item was purchased
  - Apriori (Agrawal & Srikant, 1994) & Apriori-like algorithms
  - Pattern-growth algorithms (Han et al., 2000; Han et al., 2004)
- Share-Confidence Framework (Carter et al., 1997)
  - The support-confidence framework does not analyze the exact number of products purchased
  - The support count method does not measure the profit or cost of an itemset
  - Exhaustive search algorithm (Carter et al., 2000)
  - FSM algorithm (Li et al., 2005)


Related Work

Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%


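Share-Confidence Framework
- Measure value: $mv(i_p, T_q)$, e.g. mv({D}, T01) = 1, mv({C}, T03) = 3
- Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$, e.g. tmv(T02) = 9
- Total measure value: $Tmv(DB) = \sum_{T_q \in DB} \sum_{i_p \in T_q} mv(i_p, T_q)$, here Tmv(DB) = 44
- Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X,\ X \subseteq T_q} mv(i_p, T_q)$, e.g. imv({A, E}, T02) = 4
- Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, where $db_X$ is the set of transactions containing X, e.g. lmv({BC}) = 2 + 4 + 5 = 11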
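The measure-value definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the database `DB` below is a hypothetical toy example (it is not the example database used on the slides), and the function names simply mirror the slide notation.

```python
# Hypothetical toy database (NOT the one on the slides): each transaction
# maps an item to the quantity purchased, i.e. its measure value.
DB = {
    "T01": {"A": 1, "B": 1, "C": 2},
    "T02": {"A": 2, "C": 1},
    "T03": {"B": 3, "C": 1, "D": 1},
    "T04": {"A": 1, "B": 2},
    "T05": {"C": 2, "D": 1, "E": 1},
}

def mv(item, tid):
    """Measure value of one item in one transaction (0 if absent)."""
    return DB[tid].get(item, 0)

def tmv(tid):
    """Transaction measure value: sum of all measure values in the transaction."""
    return sum(DB[tid].values())

def total_mv():
    """Total measure value Tmv(DB): sum of tmv over all transactions."""
    return sum(tmv(tid) for tid in DB)

def imv(itemset, tid):
    """Itemset measure value; taken as 0 when X is not contained in the transaction."""
    if not set(itemset).issubset(DB[tid]):
        return 0
    return sum(mv(item, tid) for item in itemset)

def lmv(itemset):
    """Local measure value: imv summed over the transactions containing X."""
    return sum(imv(itemset, tid) for tid in DB)

print(tmv("T03"))        # 5
print(total_mv())        # 19
print(imv("BC", "T03"))  # 4
print(lmv("BC"))         # 7
```

Itemsets are passed as strings of single-letter items purely for brevity; a real implementation would more likely use sets of item identifiers.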


minShare = 30%
- Itemset share: $SH(X) = lmv(X) / Tmv$, e.g. SH({BC}) = 11/44 = 25%
- SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset
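As a quick check of the share example, the following sketch plugs in the numbers quoted on the slides (lmv({BC}) = 11, Tmv = 44, minShare = 30%). The function names are illustrative only.

```python
def share(lmv_x, total_mv):
    """Itemset share SH(X) = lmv(X) / Tmv."""
    return lmv_x / total_mv

def is_sh_frequent(lmv_x, total_mv, min_share):
    """X is SH-frequent when SH(X) >= minShare."""
    return share(lmv_x, total_mv) >= min_share

print(share(11, 44))                 # 0.25
print(is_sh_frequent(11, 44, 0.30))  # False
```

With these values {BC} reaches a share of 25%, which is below the 30% threshold, so it is not SH-frequent.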


Existing algorithms
- ZP (Zero Pruning), ZSP (Zero Subset Pruning)
  - Variants of exhaustive search
  - Prune the candidate itemsets whose local measure values are exactly zero
- FSM (Fast Share Measure) (Li et al., 2005)
  - Fast on a small dataset
  - Generates too many candidates
- Existing algorithms are inefficient on large datasets


ZP Algorithm (candidate itemsets with their local measure values)
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 BCE:0 BDE:0 CDE:0
ABCD:4 ABCE:0 ABDE:0 ACDE:0 BCDE:0
ABCDE:0


ZSP Algorithm (candidate itemsets with their local measure values)
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0
ABCD:4 ACDE:0


FSM: Fast Share Measure Algorithm
- ML: maximum transaction length in DB
- MV: maximum measure value in DB
- Let min_lmv = minShare × Tmv
- Let CF_FSM(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k), where k is the length of X
- If CF_FSM(X) < min_lmv, all supersets of X are infrequent


FSM: Fast Share Measure Algorithm

A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ABC:3 ABD:9 ABE:0 ACD:3 ACE:16 ADE:0 BCD:15 CDE:0
- minShare = 30%, ML = 6, MV = 3, Tmv = 44, min_lmv = 14
- Prune X if CF_FSM(X) < min_lmv
- Let X = {A B C}: CF_FSM(X) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14 = min_lmv
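The pruning test for X = {A B C} can be reproduced with a short sketch that uses only the values quoted on the slide (lmv(X) = 3, k = 3, MV = 3, ML = 6, min_lmv = 14). The function name `cf_fsm` is illustrative.

```python
def cf_fsm(lmv_x, k, mv_max, ml):
    """FSM critical function: lmv(X) + (lmv(X)/k) * MV * (ML - k)."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

MIN_LMV = 14  # value stated on the slide for minShare = 30% and Tmv = 44

cf = cf_fsm(lmv_x=3, k=3, mv_max=3, ml=6)
print(cf)            # 12.0
print(cf < MIN_LMV)  # True, so {A B C} and all of its supersets can be pruned
```

Note that 30% of 44 is 13.2; the slide's min_lmv of 14 is consistent with rounding up to the next integer measure value.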


Enhanced FSM (EFSM) Algorithm
- EFSM: instead of joining two arbitrary itemsets in RC_{k-1}, EFSM joins an arbitrary itemset of RC_{k-1} with a single item in RC_1 to generate C_k efficiently
- Reduces the time complexity from O(n^{2k-2}) to O(n^k)
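A minimal sketch of the join step described above, assuming RC_{k-1} is held as a set of frozensets and RC_1 as a set of single items. The function name `efsm_join` and the sample contents of `rc_1` and `rc_2` are hypothetical, and any further checks the actual EFSM algorithm performs on the generated candidates are not shown.

```python
def efsm_join(rc_prev, rc_1):
    """Extend every (k-1)-itemset in rc_prev with one single item from rc_1."""
    candidates = set()
    for itemset in rc_prev:
        for item in rc_1:
            if item not in itemset:
                # frozenset | set yields a frozenset, so duplicates collapse.
                candidates.add(itemset | {item})
    return candidates

rc_1 = {"A", "B", "C", "D"}                                 # hypothetical RC_1
rc_2 = {frozenset("AB"), frozenset("AC"), frozenset("BD")}  # hypothetical RC_2

c_3 = efsm_join(rc_2, rc_1)
print(sorted("".join(sorted(x)) for x in c_3))  # ['ABC', 'ABD', 'ACD', 'BCD']
```

Each (k-1)-itemset is paired with at most n single items, rather than with every other (k-1)-itemset, which is where the saving in the join step comes from.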


SuFSM (Support-counted FSM)
- X^{k+1}: arbitrary superset of X with length k+1 in DB
- S(X^{k+1}): the set which contains all X^{k+1} in DB
- dbS(X^{k+1}): the set of transactions each of which contains at least one X^{k+1}
- SuFSM and ShFSM are derived from EFSM and prune the candidates more efficiently than FSM
- SuFSM (Support-counted FSM): Theorem 1. If lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv, all supersets of X are infrequent


SuFSM (Support-counted FSM)
- lmv(X)/k >= Sup(X) >= Sup(S(X^{k+1}))
- Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}^{k+1})) = 2
- No superset of X is an SH-frequent itemset if any of the following three inequalities holds (each one implies the next):
  - lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(X) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv


ShFSM (Share-counted FSM)
- dbS(X^{k+1}): the set of transactions each of which contains at least one X^{k+1}
- ShFSM (Share-counted FSM): Theorem 2. If Tmv(dbS(X^{k+1})) < min_lmv, all supersets of X are infrequent
- FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
- SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
- ShFSM: Tmv(dbS(X^{k+1})) < min_lmv
- CF_FSM(X) >= CF_SuFSM(X) >= CF_ShFSM(X)
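The three critical functions can be compared side by side on a small example. The sketch below makes two assumptions that should be read as such: the database `DB` is a hypothetical toy example (not the one on the slides), and Sup(S(X^{k+1})) is read as the number of transactions in dbS(X^{k+1}), i.e. transactions that contain X plus at least one further item.

```python
# Hypothetical toy database (NOT the one on the slides): item -> quantity.
DB = {
    "T01": {"A": 1, "B": 1, "C": 2},
    "T02": {"A": 2, "C": 1},
    "T03": {"B": 3, "C": 1, "D": 1},
    "T04": {"A": 1, "B": 2},
    "T05": {"C": 2, "D": 1, "E": 1},
}

def critical_values(itemset, mv_max, ml):
    """Return (CF_FSM, CF_SuFSM, CF_ShFSM) for itemset X."""
    k = len(itemset)
    # db_X: transactions containing every item of X.
    db_x = [t for t in DB.values() if set(itemset).issubset(t)]
    lmv_x = sum(t[item] for t in db_x for item in itemset)
    # dbS(X^{k+1}): transactions containing X and at least one more item.
    db_s = [t for t in db_x if len(t) > k]
    cf_fsm = lmv_x + (lmv_x / k) * mv_max * (ml - k)
    cf_sufsm = lmv_x + len(db_s) * mv_max * (ml - k)
    cf_shfsm = sum(sum(t.values()) for t in db_s)
    return cf_fsm, cf_sufsm, cf_shfsm

ML = max(len(t) for t in DB.values())           # maximum transaction length: 3
MV = max(max(t.values()) for t in DB.values())  # maximum measure value: 3

print(critical_values("AB", MV, ML))  # (12.5, 8, 4)
```

On this toy data the ordering CF_FSM(X) >= CF_SuFSM(X) >= CF_ShFSM(X) holds for X = {A B}, so the ShFSM bound is the tightest and prunes the most.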


- FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
- SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML - k) < min_lmv
- ShFSM: Tmv(dbS(X^{k+1})) < min_lmv
- Ex. X = {BCD}
  - CF_FSM(X) = 9 + (9/3) × 3 × (6 - 3) = 36
  - CF_SuFSM(X) = 9 + 2 × 3 × (6 - 3) = 27
  - CF_ShFSM(X) = 6 + 8 = 14


ShFSM (Share-counted FSM)
A:10 B:8 C:10 D:6 E:4 H:1 ...
AB:6 AC:14 AD:7 AE:10 BC:11 BD:14 BE:0 CD:8 CE:10 DE:0
ACE:16 BCD:15 CDE:0
- Ex. X = {AB}: Tmv(dbS(X^{k+1})) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv


Experimental Results (1/3)

- PC: Pentium IV 1.5 GHz, 1.5 GB SDRAM, running Windows XP Professional
- All algorithms were coded in VC++ 6.0

Figure 1: T4.I2.D100k.N50.S10. Running time (sec, axis from 1 to 100000) versus minShare (%, 0 to 1.2) for ZSP, EZSP, FSM, EFSM, SuFSM and ShFSM.

Figure 2: T6.I4.D100k.N200.S10. Running time (sec, axis from 1 to 100000) versus minShare (%, 0 to 1.2) for FSM, EFSM, SuFSM and ShFSM.


Experimental Results (2/3)

Figure 3: T6.I4.Dz.N200.S10, minShare = 0.1%. Running time (sec, axis from 1 to 10000) versus number of transactions (k, 0 to 1000) for FSM, EFSM, SuFSM and ShFSM.

Figure 4: T10.I6.D100k.N500.S20. Running time (sec, axis from 1 to 100000) versus minShare (%, 0 to 1.2) for FSM, EFSM, SuFSM and ShFSM.


Experimental Results (3/3)

T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20, MV = 10, Tmv = 2,302,443

Pass (k)     FSM        EFSM       SuFSM     ShFSM     Fk
k=1   Ck     200        200        200       200       159
k=1   RCk    200        200        199       197
k=2   Ck     19900      19900      19701     19306     1844
k=2   RCk    16214      16214      13312     7199
k=3   Ck     829547     829547     564324    190607    101
k=3   RCk    251877     251877     99765     9792
k=4   Ck     3290296    3290296    793042    20913     0
k=4   RCk    332877     332877     41057     1420
k=5   Ck     393833     393833     25003     1050      5
k=5   RCk    71420      71420      19720     959
k=6   Ck     26137      26137      11582     518       8
k=6   RCk    25562      25562      11045     506
k=7   Ck     11141      11141      5940      204       7
k=7   RCk    11099      11099      5827      196
k=8   Ck     4426       4426       2797      58        1
k=8   RCk    4423       4423       2750      54
k>=9  Ck     2036       2036       1567      12        0
k>=9  RCk    2030       2030       1513      10
Time (sec)   13610.4    71.55      29.67     10.95
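To put the running times in the last row into perspective, the short sketch below computes each algorithm's speedup relative to FSM. It uses only the four times reported in the table; the variable names are illustrative.

```python
# Running times in seconds, copied from the last row of the table.
times = {"FSM": 13610.4, "EFSM": 71.55, "SuFSM": 29.67, "ShFSM": 10.95}

# Speedup of each algorithm relative to the FSM baseline.
speedup = {name: times["FSM"] / t for name, t in times.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
```

On this dataset EFSM, which generates exactly the same candidates as FSM, is already roughly two orders of magnitude faster, and ShFSM is faster by roughly three.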


Conclusions

This study proposes the Enhanced FSM (EFSM) algorithm to efficiently reduce the time complexity of the join step

We have also developed SuFSM and ShFSM from EFSM

SuFSM and ShFSM can efficiently prune the candidates, and significantly improve the performance

The experimental results have indicated that ShFSM has the best performance

In the future, we plan to develop even more advanced algorithms to accelerate the process of identifying all share-frequent itemsets


Thank You