Frequent Pattern Mining Using FP-Trees
TRANSCRIPT
-
7/31/2019 Frequent Pattern Mining Using Fp Trees (1)
1/25
SEMINAR BY
NITHISH PAI B
UNDER THE GUIDANCE OF
MRS PUSHPALATA M P
-
OVERVIEW
Basic concepts and a roadmap
Scalable frequent itemset mining methods:
APRIORI ALGORITHM: A candidate generation-and-test approach
PATTERN GROWTH APPROACH: Mining frequent patterns without candidate generation
CHARM / ECLAT: Mining by exploring vertical data format
Summary and conclusions
-
WHAT IS FREQUENT PATTERN ANALYSIS?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, and DNA sequence analysis
-
WHY IS FREQUENT PATTERN MINING IMPORTANT?
Freq. pattern: An intrinsic and important property of datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
-
BASIC CONCEPTS: FREQUENT PATTERNS
Tid   Items bought
10    Tea, Nuts, Napkins
20    Tea, Coffee, Napkins
30    Tea, Napkins, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Napkins, Eggs, Milk

itemset: a set of one or more items
k-itemset: X = {x1, ..., xk}
(absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
-
BASIC CONCEPTS: ASSOCIATIONRULES
Tid   Items bought
10    Tea, Nuts, Napkins
20    Tea, Coffee, Napkins
30    Tea, Napkins, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Napkins, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Tea:3, Nuts:3, Napkins:4, Eggs:3, {Tea, Napkins}:3
Association rules: (many more!)
Tea → Napkins (60%, 100%)
Napkins → Tea (60%, 75%)
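A short sketch (function names are mine) that reproduces the support/confidence pairs of the two rules above:

```python
transactions = [
    {"Tea", "Nuts", "Napkins"},
    {"Tea", "Coffee", "Napkins"},
    {"Tea", "Napkins", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
]

def count(itemset, db):
    """Number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

def rule_stats(X, Y, db):
    """Support and confidence of the rule X -> Y."""
    s = count(X | Y, db) / len(db)       # P(X and Y)
    c = count(X | Y, db) / count(X, db)  # P(Y | X)
    return s, c

print(rule_stats({"Tea"}, {"Napkins"}, transactions))  # (0.6, 1.0)
print(rule_stats({"Napkins"}, {"Tea"}, transactions))  # (0.6, 0.75)
```

Both rules share the same support (both count the transactions containing {Tea, Napkins}) but differ in confidence, because Napkins occurs in more transactions than Tea.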
-
CLOSED PATTERNS AND MAX-PATTERNS
A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
A closed pattern is a lossless compression of the frequent patterns
Reduces the # of patterns and rules
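A brute-force illustration of the two definitions on a tiny made-up database (my own example; the slide's {a1, ..., a100} case is far too large to enumerate):

```python
from itertools import combinations

# Three transactions; minsup = 2.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}]
minsup = 2

def support(s):
    return sum(1 for t in db if s <= t)

# Enumerate every frequent itemset with its support.
items = sorted(set().union(*db))
frequent = {frozenset(c): support(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= minsup}

# Closed: frequent, and no super-pattern has the same support.
closed = {X for X in frequent
          if not any(X < Y and frequent[Y] == frequent[X] for Y in frequent)}
# Max: frequent, and no super-pattern is frequent at all.
maximal = {X for X in frequent if not any(X < Y for Y in frequent)}

print(sorted("".join(sorted(X)) for X in closed))   # ['ab', 'abc']
print(sorted("".join(sorted(X)) for X in maximal))  # ['abc']
```

The closed patterns {ab}:3 and {abc}:2 preserve every frequent itemset's support (lossless), while the single max-pattern {abc} records only which itemsets are frequent.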
-
COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET MINING
How many itemsets may potentially be generated in the worst case?
The number of frequent itemsets to be generated is sensitive to the minsup threshold
When minsup is low, there exist potentially an exponential number of frequent itemsets
The worst case: M^N, where M: # distinct items, and N: max length of transactions
The worst-case complexity vs. the expected probability
Ex. Suppose Walmart has 10^4 kinds of products
The chance to pick up one product: 10^-4
The chance to pick up a particular set of 10 products: ~10^-40
What is the chance for this particular set of 10 products to be frequent 10^3 times in 10^9 transactions?
-
THE DOWNWARD CLOSURE PROPERTY AND SCALABLE MINING METHODS
The downward closure property of frequent patterns states that:
Any subset of a frequent itemset must be frequent
If {tea, napkins, nuts} is frequent, so is {tea, napkins}
i.e., every transaction having {tea, napkins, nuts} also contains {tea, napkins}
Scalable mining methods: Three major approaches
Apriori algorithm
Freq. pattern growth
Vertical data format approach
-
APRIORI ALGORITHM: A CANDIDATE GENERATION & TEST APPROACH
Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
It uses a breadth-first search strategy to count the support of itemsets, and a candidate generation function which exploits the downward closure property of support.
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
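The four method steps can be sketched as follows. This is a minimal illustrative Apriori (names are mine, and no counting optimizations are attempted), run on the tea/napkins table from the earlier slide:

```python
from itertools import combinations

def apriori(db, minsup):
    """Minimal Apriori sketch: level-wise candidate generation and test.
    Returns every frequent itemset with its absolute support count."""
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # One DB scan for the frequent 1-itemsets.
    freq = {frozenset([i]) for i in items
            if sum(1 for t in db if i in t) >= minsup}
    all_freq, k = {}, 1
    while freq:
        for s in freq:
            all_freq[s] = sum(1 for t in db if s <= t)
        # Join: (k+1)-candidates from pairs of frequent k-itemsets.
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Test the surviving candidates against the DB.
        freq = {c for c in cands if sum(1 for t in db if c <= t) >= minsup}
        k += 1
    return all_freq

db = [
    ["Tea", "Nuts", "Napkins"],
    ["Tea", "Coffee", "Napkins"],
    ["Tea", "Napkins", "Eggs"],
    ["Nuts", "Eggs", "Milk"],
    ["Nuts", "Coffee", "Napkins", "Eggs", "Milk"],
]
result = apriori(db, 3)
print(result[frozenset({"Tea", "Napkins"})])  # 3
```

With minsup = 3, the loop stops at k = 2: the only frequent 2-itemset, {Tea, Napkins}, yields no length-3 candidates.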
-
APRIORI ALGORITHM: EXAMPLE
-
IMPROVEMENT OF APRIORI METHOD
DISADVANTAGES:
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
-
Pattern-Growth Approach:
Mining Frequent Patterns Without Candidate Generation
The FPGrowth Approach:
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local frequent items only
"abc" is a frequent pattern
Get all transactions having abc, i.e., project the DB on abc: DB|abc
"d" is a local frequent item in DB|abc → abcd is a frequent pattern
-
CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE
Consider the following set of transactions. Let the minimum support count be 3.
TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}
-
CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE
ALGORITHM FOR CONSTRUCTION OF A FP-TREE:
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Sort frequent items in frequency descending order, f-list
3. Scan DB again, construct FP-tree
FP-tree for the example transactions:

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header Table (each entry's head pointer links to that item's nodes in the tree):
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3

F-list = f-c-a-b-m-p
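A sketch of the three construction steps (class and function names are mine; note that ties between equally frequent items may be ordered differently than the slide's exact f-list, which only changes the tree's shape, not its information content):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(db, minsup):
    """Two scans: count items, then insert each transaction in f-list order."""
    counts = Counter(i for t in db for i in t)            # scan 1
    flist = [i for i, c in counts.most_common() if c >= minsup]
    rank = {i: r for r, i in enumerate(flist)}
    root, header = Node(None, None), defaultdict(list)
    for t in db:                                          # scan 2
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])  # node-link
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
root, header, flist = build_fptree(db, 3)
print(sorted(flist))  # ['a', 'b', 'c', 'f', 'm', 'p']
```

Summing the counts along an item's node-links recovers its support, e.g. the f-nodes total 4 and the p-nodes total 3, matching the header table.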
-
PARTITION PATTERNS AND DATABASES
Frequent patterns can be partitioned into subsets according to f-list
F-list = f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
...
Patterns having c but no a, b, m, or p
Pattern f
Completeness and non-redundancy
-
FINDING PATTERNS HAVING P FROM P-CONDITIONAL DATABASE
Start at the frequent-item header table in the FP-tree
Traverse the FP-tree by following the node-links of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Header Table:
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3

Conditional pattern bases:
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
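The conditional-pattern-base table can be double-checked without building the tree at all: for each transaction containing the item, keep the f-list items that precede it. A small sketch (the helper name is mine):

```python
from collections import Counter

flist = ["f", "c", "a", "b", "m", "p"]   # f-list from the slide
db = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]

def cond_pattern_base(item):
    """Transformed prefix paths of `item`, with their counts."""
    pos = flist.index(item)
    base = Counter()
    for t in db:
        if item in t:
            # Frequent items that come before `item` in f-list order.
            prefix = "".join(i for i in flist[:pos] if i in t)
            if prefix:
                base[prefix] += 1
    return dict(base)

print(cond_pattern_base("m"))  # {'fca': 2, 'fcab': 1}
print(cond_pattern_base("p"))  # {'fcam': 2, 'cb': 1}
```

Both results agree with the table above; the tree merely lets FP-growth gather these prefix paths without re-scanning the database.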
-
FROM CONDITIONAL PATTERN BASES TO CONDITIONAL FP-TREES
For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base
m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (built from the locally frequent items f:3, c:3, a:3):

{}
└─ f:3
   └─ c:3
      └─ a:3

All frequent patterns related to m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
→ associations
-
RECURSION: MINING EACH CONDITIONAL FP-TREE
-
ALGORITHM FOR MINING FREQUENT ITEMSETS USING FP-TREE BY FREQUENT PATTERN GROWTH
Input: D, a transaction database; minsup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:
1. The FP-tree is constructed in the following steps:
a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support-count descending order as L, the list of frequent items.
b) Create the root of an FP-tree, and label it as "null". For each transaction Trans in D, do the following:
Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent-item list in Trans be [p|P], where p is the first element and P is the remaining list.
Call insert_tree([p|P], T), which is performed as follows:
If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, set its count to 1, link its parent to T, and link it to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
-
ALGORITHM FOR MINING FREQUENT ITEMSETS USING FP-TREE BY FREQUENT PATTERN GROWTH
2. The FP-tree is mined by calling FP_growth(FP_tree, null), which is implemented as follows:

procedure FP_growth(Tree, α)
if Tree contains a single path P then {
    for each combination β of the nodes in the path P {
        generate pattern β ∪ α with support_count = minimum support count of the nodes in β;
    }
} else for each a_i in the header of Tree {
    generate pattern β = a_i ∪ α with support_count = a_i.support_count;
    construct β's conditional pattern base and then β's conditional FP-tree Tree_β;
    if Tree_β ≠ ∅ then
        call FP_growth(Tree_β, β);
}
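The procedure can be sketched in Python. This compressed version (names are mine) represents the database, and every conditional pattern base, as (items, count) pairs, and omits the single-prefix-path shortcut of step 2; that shortcut only affects efficiency, so the generic recursion below still produces the same patterns:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_tree(weighted_db, minsup):
    """FP-tree from (items, count) pairs; returns header links and item ranks."""
    counts = Counter()
    for items, cnt in weighted_db:
        for i in items:
            counts[i] += cnt
    rank = {i: r for r, (i, c) in enumerate(counts.most_common()) if c >= minsup}
    root, header = Node(None, None), defaultdict(list)
    for items, cnt in weighted_db:
        node = root
        for item in sorted((i for i in items if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += cnt
    return header, rank

def fpgrowth(weighted_db, minsup, suffix=frozenset(), out=None):
    """Recursively mine each item's conditional pattern base (FP-growth)."""
    if out is None:
        out = {}
    header, rank = build_tree(weighted_db, minsup)
    for item in sorted(header, key=rank.get, reverse=True):  # least frequent first
        pattern = suffix | {item}
        out[pattern] = sum(n.count for n in header[item])
        # Conditional pattern base: the prefix path of every node of `item`.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        if base:
            fpgrowth(base, minsup, pattern, out)
    return out

db = [(t, 1) for t in [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]]
patterns = fpgrowth(db, 3)
print(len([p for p in patterns if "m" in p]))  # 8 patterns involving m
```

On the slide's example with minsup = 3, the eight patterns involving m are exactly m, fm, cm, am, fcm, fam, cam, and fcam, each with support 3, as derived on the conditional-FP-tree slide.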
-
BENEFITS OF FP-TREE STRUCTURE
Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduces irrelevant info: infrequent items are gone
Items in frequency-descending order: the more frequently occurring, the more likely to be shared
Never larger than the original database (not counting node-links and the count fields)
-
ADVANTAGES OF PATTERN GROWTH APPROACH
Divide-and-conquer:
Decompose both the mining task and DB according to the frequentpatterns obtained so far
Lead to focused search of smaller databases
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local frequent items and building sub-FP-trees; no pattern search and matching
-
EXTENSION OF PATTERN GROWTH MINING METHODOLOGY
Mining closed frequent itemsets and max-patterns
Mining sequential patterns
Mining graph patterns
Constraint-based mining of frequent patterns
Computing iceberg data cubes with complex measures
Pattern-growth-based clustering
Pattern-Growth-Based Classification
-
CHARM / ECLAT: Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, ...}
tid-list: list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections:
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining:
Only keep track of differences of tids
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
Diffset(XY, X) = {T2}
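A tid-list sketch of the example above (t(Y) here is an assumed value, chosen so that the intersection comes out to {T1, T3} as on the slide):

```python
tX = {"T1", "T2", "T3"}   # t(X), from the slide
tY = {"T1", "T3", "T4"}   # t(Y), assumed for illustration
tXY = tX & tY             # t(XY): intersect the tid-lists
diffset = tX - tXY        # Diffset(XY, X): tids with X but not Y

print(sorted(tXY))      # ['T1', 'T3']  -> support(XY) = 2
print(sorted(diffset))  # ['T2']
# Storing the diffset instead of the full t(XY) still recovers the support:
assert len(tXY) == len(tX) - len(diffset)
```

Diffsets tend to be much smaller than tid-lists when XY occurs in most of the transactions that contain X, which is why tracking only the differences accelerates mining.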