Zeev Dvir – [email protected]
GenMax
From: “Efficiently Mining Maximal Frequent Itemsets”
By: Karam Gouda & Mohammed J. Zaki
The Problem
• Given a large database of item transactions, find all frequent itemsets
• A frequent itemset is a set of items that occurs in at least a user-specified percentage of the transactions in the database
• We call this percentage min_sup (for minimum support).
• A Maximal Frequent Itemset is a frequent itemset that has no frequent proper superset
• FI := frequent itemsets
MFI := maximal frequent itemsets
• Fact: |MFI| << |FI|
GenMax is an algorithm that finds the exact MFI
Example
Tid  A  B  C  D
1    x  x  x  -
2    -  -  x  x
3    x  x  x  -
4    x  x  x  x
5    x  -  -  -
6    -  x  -  x
7    -  -  x  -
ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
Min_sup = 3
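To make the FI and MFI definitions concrete, here is a minimal brute-force sketch over the example database. It is not GenMax; it simply enumerates every itemset. The names `DB`, `support`, `frequent_itemsets` and `maximal` are hypothetical, and min_sup is taken as an absolute count of 3, as on the slide.

```python
from itertools import combinations

# Example database from the slide: tid -> items in that transaction.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # absolute count, as on the slide


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db, min_sup):
    """Enumerate every non-empty itemset and keep the frequent ones."""
    items = sorted(set().union(*db.values()))
    return [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if support(frozenset(c), db) >= min_sup
    ]


def maximal(fi):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [s for s in fi if not any(s < t for t in fi)]


fi = frequent_itemsets(DB, MIN_SUP)
mfi = maximal(fi)
print(sorted("".join(sorted(s)) for s in fi))   # ['A', 'AB', 'ABC', 'AC', 'B', 'BC', 'C', 'D']
print(sorted("".join(sorted(s)) for s in mfi))  # ['ABC', 'D']
```

Eight frequent itemsets collapse to two maximal ones, which is the |MFI| << |FI| effect in miniature.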
Some Useful Definitions
• The Combine-Set of an itemset I is the set of items that can be added to I to create a frequent itemset.
• For example, in the previous example, the combine-set of the itemset {A} is {B, C}.
• The combine-set of the empty itemset is called F1; it is the set of frequent items (frequent itemsets of size 1).
// Invocation: MFI-backtrack(∅, F1, 0)
MFI-backtrack(I_k, C_k, k)
1. for each x ∈ C_k
2.   I_{k+1} = I_k ∪ {x}
3.   P_{k+1} = {y : y ∈ C_k and y > x}
4.   if I_{k+1} ∪ P_{k+1} has a superset in MFI
5.     return
6.   C_{k+1} = FI-combine(I_{k+1}, P_{k+1})
7.   if C_{k+1} is empty
8.     if I_{k+1} has no superset in MFI
9.       MFI = MFI ∪ I_{k+1}
10.  else
11.    MFI-backtrack(I_{k+1}, C_{k+1}, k+1)
FI-combine(I_{k+1}, P_{k+1})
1. C = ∅
2. for each y ∈ P_{k+1}
3.   if I_{k+1} ∪ {y} is frequent
4.     C = C ∪ {y}
5. return C
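The backtracking search and FI-combine can be sketched in Python roughly as follows. This is an illustrative sketch, not the authors' implementation: the names `mfi_backtrack`, `fi_combine` and `support` are hypothetical, support is counted by scanning the example database, the item order is alphabetical, and the level counter k is dropped because Python recursion tracks it implicitly.

```python
# Example database from the slides (tid -> items), min_sup as an absolute count.
DB = {
    1: {"A", "B", "C"}, 2: {"C", "D"}, 3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"}, 5: {"A"}, 6: {"B", "D"}, 7: {"C"},
}
MIN_SUP = 3


def support(itemset):
    """Count the transactions containing every item of `itemset`."""
    return sum(1 for items in DB.values() if itemset <= items)


def fi_combine(itemset, possible):
    """Items of `possible` that keep `itemset` frequent when added."""
    return [y for y in possible if support(itemset | {y}) >= MIN_SUP]


def mfi_backtrack(itemset, combine_set, mfi):
    """Depth-first search that appends maximal frequent itemsets to `mfi`."""
    for i, x in enumerate(combine_set):
        new_itemset = itemset | {x}
        possible = combine_set[i + 1:]  # items that come after x in the order
        # Prune: everything reachable from here is already covered.
        if any(new_itemset | set(possible) <= m for m in mfi):
            return
        new_combine = fi_combine(new_itemset, possible)
        if not new_combine:
            if not any(new_itemset <= m for m in mfi):
                mfi.append(new_itemset)
        else:
            mfi_backtrack(new_itemset, new_combine, mfi)


all_items = sorted(set().union(*DB.values()))
f1 = fi_combine(frozenset(), all_items)  # combine-set of the empty itemset
mfi = []
mfi_backtrack(frozenset(), f1, mfi)
print(["".join(sorted(m)) for m in mfi])  # ['ABC', 'D']
```

On the example, the branch for AC is cut by the superset check as soon as ABC has been found, so only ABC and D are ever added.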
Improvement
• At each level, sort the combine-set (C) in increasing order of support
• An itemset with low support has a smaller chance of producing a large combine-set in the next level
• The sooner we prune the tree, the more work we save
• This heuristic was first used in MaxMiner
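The reordering step might look like the sketch below. It assumes each item of the combine-set already carries a tidset, and it breaks ties alphabetically; both the helper name `sort_by_support` and the tie-break rule are assumptions made for illustration.

```python
# Tidsets of the single items in the example database.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}


def sort_by_support(combine_set, tidsets):
    """Order items by increasing support; ties are broken by item name."""
    return sorted(combine_set, key=lambda item: (len(tidsets[item]), item))


ordered = sort_by_support(["A", "B", "C", "D"], TIDSETS)
print(ordered)  # ['D', 'A', 'B', 'C']
```

D, the least frequent item, is explored first; its subtree is tiny, so it is finished (and can contribute to pruning) almost immediately.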
Bottlenecks
1. Superset checking:
The best algorithms for superset checking give an amortized bound of Θ(√s log(s)) per operation. That's bad if we have many itemsets in the MFI.
2. Frequency testing:
How can we make frequency testing faster?
Optimizing Superset Checking
• A technique called “Progressive Focusing” is used to narrow down the group of potential supersets, as the recursive calls are made
• LMFI := Local MFI
• Before each recursive call, we construct the LMFI for the next call, based on the current LMFI and the new item added.
LMFI Example
FG …
FGH FGI …
FGHI FGHJ …
I_k = FG, LMFI_k = {AFGI, ABFGH, AWFG}
I_{k+1} = FGH, LMFI_{k+1} = {ABFGH}
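The projection in this example amounts to a one-line filter: keep only the maximal itemsets that contain the item just added. The helper name `project_lmfi` is hypothetical.

```python
def project_lmfi(lmfi, x):
    """Keep only the maximal itemsets that contain the newly added item x."""
    return [m for m in lmfi if x in m]


# LMFI for FG, as in the example; adding H gives the itemset FGH.
lmfi_k = [frozenset("AFGI"), frozenset("ABFGH"), frozenset("AWFG")]
lmfi_k1 = project_lmfi(lmfi_k, "H")
print(["".join(sorted(m)) for m in lmfi_k1])  # ['ABFGH']
```

Every set in the LMFI for FG already contains F and G, so checking for H alone is enough to find the potential supersets of FGH.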
LMFI-backtrack(I_k, C_k, LMFI_k, k)
1. for each x ∈ C_k
2.   I_{k+1} = I_k ∪ {x}
3.   P_{k+1} = {y : y ∈ C_k and y > x}
4.   if I_{k+1} ∪ P_{k+1} has a superset in LMFI_k
5.     return
6.   LMFI_{k+1} = ∅
7.   C_{k+1} = FI-combine(I_{k+1}, P_{k+1})
8.   if C_{k+1} is empty
9.     if I_{k+1} has no superset in LMFI_k
10.      LMFI_k = LMFI_k ∪ I_{k+1}
11.  else
12.    LMFI_{k+1} = {M ∈ LMFI_k : x ∈ M}
13.    LMFI-backtrack(I_{k+1}, C_{k+1}, LMFI_{k+1}, k+1)
14.    LMFI_k = LMFI_k ∪ LMFI_{k+1}
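A Python sketch of the backtracking search with progressive focusing, run on the four-item example database from earlier in the talk. The function names are hypothetical, support is counted by a plain database scan, and each LMFI is held as a Python set of frozensets so that the final merge step is a set union.

```python
# Example database from the slides (tid -> items), min_sup as an absolute count.
DB = {
    1: {"A", "B", "C"}, 2: {"C", "D"}, 3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"}, 5: {"A"}, 6: {"B", "D"}, 7: {"C"},
}
MIN_SUP = 3


def support(itemset):
    """Count the transactions containing every item of `itemset`."""
    return sum(1 for items in DB.values() if itemset <= items)


def fi_combine(itemset, possible):
    """Items of `possible` that keep `itemset` frequent when added."""
    return [y for y in possible if support(itemset | {y}) >= MIN_SUP]


def lmfi_backtrack(itemset, combine_set, lmfi):
    """Like the plain search, but superset checks only look at the local MFI."""
    for i, x in enumerate(combine_set):
        new_itemset = itemset | {x}
        possible = combine_set[i + 1:]
        if any(new_itemset | set(possible) <= m for m in lmfi):
            return
        new_combine = fi_combine(new_itemset, possible)
        if not new_combine:
            if not any(new_itemset <= m for m in lmfi):
                lmfi.add(new_itemset)
        else:
            # Progressive focusing: pass down only the sets that contain x.
            next_lmfi = {m for m in lmfi if x in m}
            lmfi_backtrack(new_itemset, new_combine, next_lmfi)
            lmfi |= next_lmfi  # merge back what the recursive call found


f1 = fi_combine(frozenset(), sorted(set().union(*DB.values())))
result = set()
lmfi_backtrack(frozenset(), f1, result)
print(sorted("".join(sorted(m)) for m in result))  # ['ABC', 'D']
```

The result matches the unfocused search; the difference is that each recursive call scans a smaller list of candidate supersets.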
Frequency Testing Optimization
• GenMax uses a “vertical database format”:
• For each item, we have a set of all the transactions containing this item.
• This set is called a tidset (Transaction ID Set).
• This method makes support computations easier, because we don’t have to go over the entire database.
Vertical Database
Tid  A  B  C  D
1    x  x  x  -
2    -  -  x  x
3    x  x  x  -
4    x  x  x  x
5    x  -  -  -
6    -  x  -  x
7    -  -  x  -
A {1, 3, 4, 5}
B {1, 3, 4, 6}
C {1, 2, 3, 4, 7}
D {2, 4, 6}
t(A) = {1, 3, 4, 5}
t(AC) = {1, 3, 4}
supp(I) = |t(I)|
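Building the vertical format and computing t(I) by intersection can be sketched as follows. The function names `to_vertical` and `tidset` are hypothetical, and `tidset` assumes a non-empty itemset.

```python
# Horizontal form of the example database: tid -> items.
DB = {
    1: {"A", "B", "C"}, 2: {"C", "D"}, 3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"}, 5: {"A"}, 6: {"B", "D"}, 7: {"C"},
}


def to_vertical(db):
    """Turn tid -> items into item -> tidset."""
    tidsets = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def tidset(itemset, tidsets):
    """t(I): intersection of the tidsets of the items in a non-empty itemset."""
    return set.intersection(*(tidsets[item] for item in itemset))


vertical = to_vertical(DB)
print(sorted(vertical["A"]))             # [1, 3, 4, 5]
print(sorted(tidset("AC", vertical)))    # [1, 3, 4]
print(len(tidset("AC", vertical)))       # supp(AC) = 3
```

Once the vertical form is built, a support count touches only the tidsets involved instead of every transaction.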
AB …
ABC ABD ABE
I_k = AB, C_k = { C , E }, stored with the tidsets t(ABC) and t(ABE)
Each item y in the combine-set C_k actually represents the itemset I_k ∪ {y}, and stores the tidset associated with it.

FI-tidset-combine(I_{k+1}, P_{k+1})
1. C = ∅
2. for each y ∈ P_{k+1}
3.   y' = y
4.   t(y') = t(I_{k+1}) ∩ t(y)
5.   if |t(y')| ≥ min_sup
6.     C = C ∪ {y'}
7. return C
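A minimal Python sketch of the tidset-based combine step, using the four-item example database from the earlier slides. The combine-set is represented here as a list of (item, tidset) pairs; that representation and the function name `fi_tidset_combine` are assumptions made for illustration.

```python
MIN_SUP = 3

# Tidsets of the single items in the example database.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}


def fi_tidset_combine(itemset_tids, possible, min_sup):
    """Return (y, t(I ∪ {y})) for every y in `possible` that stays frequent.

    `itemset_tids` is t(I); `possible` is a list of (item, tidset) pairs.
    """
    combine = []
    for y, y_tids in possible:
        new_tids = itemset_tids & y_tids  # one intersection, no database scan
        if len(new_tids) >= min_sup:
            combine.append((y, new_tids))
    return combine


# Extend I = {A} with the items that follow it.
possible = [(y, TIDSETS[y]) for y in ("B", "C", "D")]
combine_a = fi_tidset_combine(TIDSETS["A"], possible, MIN_SUP)
print([(y, sorted(t)) for y, t in combine_a])  # [('B', [1, 3, 4]), ('C', [1, 3, 4])]
```

The tidsets returned for AB and AC are exactly what the next level needs, so each level reuses the intersections computed by the one above it.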
Additional Optimization
• Diffsets: don’t store the entire tidsets, only the differences between tidsets (described in “Fast Vertical Mining Using Diffsets”)
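As a rough illustration of the idea, the sketch below uses the usual diffset recurrences: d(PX) = t(P) − t(PX), d(PXY) = d(PY) − d(PX), and supp(PXY) = supp(PX) − |d(PXY)|. The slide does not spell these out, so the function names and the choice of the empty prefix are assumptions; the data is the four-item example database.

```python
# Tidsets from the Vertical Database slide.
TIDSETS = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}
ALL_TIDS = set(range(1, 8))  # tidset of the empty prefix: every transaction


def diffset(prefix_tids, tids):
    """d(PX) = t(P) - t(PX): the prefix's tids that are lost by adding X."""
    return prefix_tids - tids


def extend(sup_px, d_px, d_py):
    """Combine siblings PX and PY into PXY using diffsets only.

    d(PXY) = d(PY) - d(PX) and supp(PXY) = supp(PX) - |d(PXY)|.
    """
    d_pxy = d_py - d_px
    return sup_px - len(d_pxy), d_pxy


d_c = diffset(ALL_TIDS, TIDSETS["C"])  # {5, 6}
d_d = diffset(ALL_TIDS, TIDSETS["D"])  # {1, 3, 5, 7}
sup_cd, d_cd = extend(len(TIDSETS["C"]), d_c, d_d)
print(sup_cd, sorted(d_cd))  # 2 [1, 3, 7]
```

The support of CD comes out as 2, the same value a direct tidset intersection gives, without ever materializing t(CD).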
Experimental Results
• GenMax is compared with: MaxMiner, MAFIA, MAFIA-PP
• MaxMiner & MAFIA-PP give the exact MFI, while MAFIA gives a superset of the MFI
• The databases used in the experiments are grouped according to the MFI length distribution
Type I Datasets
Type II Datasets
Type III Datasets
Type IV Datasets