UNIT 4 (TRANSCRIPT)
Data Mining: Concepts and Techniques
December 5, 2015 (slide footer date: April 21, 2023)
Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click-stream) analysis, and DNA sequence analysis
Why Is Freq. Pattern Mining Important?
- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent-pattern-based clustering
  - Data warehousing: iceberg cube and cube-gradient
  - Semantic data compression: fascicles
- Broad applications
Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
- Find all the rules X ⇒ Y with minimum support and confidence:
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
- Let sup_min = 50%, conf_min = 50%
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules: A ⇒ D (60%, 100%), D ⇒ A (60%, 75%)

[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F
Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
- Closed patterns are a lossless compression of frequent patterns, reducing the number of patterns and rules
Closed Patterns and Max-Patterns
- Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
- What is the set of closed itemsets?
  - <a1, …, a100>: 1
  - <a1, …, a50>: 2
- What is the set of max-patterns?
  - <a1, …, a100>: 1
- What is the set of all patterns? All 2^100 − 1 nonempty subsets of {a1, …, a100}: far too many to enumerate!
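The exercise cannot be run at full size (2^100 patterns), so the following sketch checks the same closed/max definitions on a scaled-down version of the same DB shape (the DB and all names are ours):

```python
from itertools import combinations

# Scaled-down analogue of the exercise: DB = {<a1..a4>, <a1, a2>}, min_sup = 1.
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
min_sup = 1

def sup(x):
    return sum(1 for t in db if x <= t)

items = sorted(set().union(*db))
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        if sup(frozenset(c)) >= min_sup:
            frequent[frozenset(c)] = sup(frozenset(c))

# Closed: frequent, and no proper frequent superset has the same support.
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
# Max: frequent, and no proper frequent superset at all.
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(sorted(sorted(x) for x in closed))
# [['a1', 'a2'], ['a1', 'a2', 'a3', 'a4']]
print(sorted(sorted(x) for x in maximal))
# [['a1', 'a2', 'a3', 'a4']]
```

As in the exercise, the closed itemsets are the full transaction (support 1) and the shared prefix (support 2), while only the full transaction is maximal.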
Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent-pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)
  - Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)
Association rule mining can be viewed as a two-step process:
1) Find all frequent itemsets
2) Generate strong association rules from the frequent itemsets
Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example
Supmin = 2. Database TDB:

Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 ({D} is pruned):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (from L1 joined with L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2:
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3: {B, C, E}
3rd scan → L3: {B, C, E}: 2
The Apriori Algorithm
Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
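The pseudo-code can be sketched compactly in Python. This is our own minimal rendering (not library code), run on the example TDB from the previous slide:

```python
from itertools import combinations

# Compact Apriori sketch: returns every frequent itemset with its support count.
def apriori(db, min_sup):
    db = [frozenset(t) for t in db]

    def frequent(cands):
        counts = {c: sum(1 for t in db if c <= t) for c in cands}
        return {c: s for c, s in counts.items() if s >= min_sup}

    Lk = frequent({frozenset([i]) for t in db for i in t})  # L1
    all_frequent = dict(Lk)
    k = 1
    while Lk:
        # generate C(k+1): join pairs of frequent k-itemsets sharing k-1 items
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune: every k-subset of a surviving candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = frequent(cands)
        all_frequent.update(Lk)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
L = apriori(tdb, min_sup=2)
print(sorted("".join(sorted(x)) for x in L))
# ['A', 'AC', 'B', 'BC', 'BCE', 'BE', 'C', 'CE', 'E']
```

The output matches the worked example: nine frequent itemsets, ending with L3 = {B, C, E} at support 2.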
Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count the supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
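The SQL-style join translates directly to code once items inside each itemset are kept in lexical order: join two itemsets when they agree on the first k-1 items and the first one's last item precedes the second's. A sketch (ours) on the L3 example from the previous slide:

```python
from itertools import combinations

# Candidate generation C4 from L3 = {abc, abd, acd, ace, bcd}.
L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
k = 3
freq_set = set(L3)

# Step 1: self-join L3 * L3 (equal k-1 prefix, p's last item < q's last item)
joined = [p + (q[k - 1],)
          for p in L3 for q in L3
          if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]]

# Step 2: prune candidates that have an infrequent k-subset
C4 = [c for c in joined
      if all(s in freq_set for s in combinations(c, k))]

print(["".join(c) for c in joined])  # ['abcd', 'acde']
print(["".join(c) for c in C4])      # ['abcd']
```

Exactly as on the slide, the join produces abcd and acde, and pruning removes acde because its subset ade is not in L3.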
Challenges of Frequent Pattern Mining
- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving the efficiency of Apriori: variations include
  - Hash-based technique
  - Transaction reduction
  - Partitioning
  - Sampling
  - Dynamic itemset counting
Method 1) Hash-based technique (hashing itemsets into corresponding buckets)
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.
Method 2) Transaction reduction
(reducing the number of transactions scanned in future iterations)
A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
Therefore, such a transaction can be marked or removed from further consideration because subsequent scans of the database for j-itemsets, where j > k, will not require it.
Method 3) Partitioning (partitioning the data to find candidate itemsets)
- Scans the database only twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition the database and find the local frequent patterns
  - minimum support count for each partition = min_support × number of transactions in that partition
- Scan 2: consolidate the global frequent patterns
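The two-scan scheme can be sketched on the earlier 5-transaction DB, split into two illustrative partitions (the split and names are ours):

```python
from collections import Counter
from math import ceil

# Partitioning method, shown on single items; min_support = 50%.
db = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
      {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
partitions = [db[:3], db[3:]]
min_support = 0.5

# Scan 1: local frequent items; each partition gets its own support count.
candidates = set()
for part in partitions:
    threshold = ceil(min_support * len(part))  # min_support * |partition|
    counts = Counter(i for t in part for i in t)
    candidates |= {i for i, c in counts.items() if c >= threshold}

# Scan 2: count the surviving candidates once over the whole DB.
global_threshold = ceil(min_support * len(db))
counts = Counter(i for t in db for i in t if i in candidates)
frequent = {i for i, c in counts.items() if c >= global_threshold}
print(sorted(frequent))  # ['A', 'B', 'D', 'E']
```

Every globally frequent item must be locally frequent in at least one partition, so Scan 2 only needs to verify the candidates collected in Scan 1.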
Method 4) Sampling (mining on a subset of the given data)
The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.
Because we are searching for frequent itemsets in S rather than in D, it is possible that we will miss some of the global frequent itemsets.
To lessen this possibility, we use a support threshold lower than the minimum support to find the itemsets frequent locally in S (denoted L_S). The rest of the database is then used to compute the actual frequencies of each itemset in L_S.
Method 5) dynamic itemset counting
(adding candidate itemsets at different points during a scan)
A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points.
In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.
Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 … i100:
    - # of scans: 100
    - # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation? Yes: mine frequent itemsets without generating candidates
  1) Vertical data format
  2) FP-growth
Mining frequent itemsets using vertical data format
Question) Find the frequent itemsets using the vertical data format (minimum support = 2).
[The slide's example table and worked answer did not survive transcription.]
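Since the slide's own table is missing from the transcript, this sketch (ours) answers the question on the Apriori example TDB from a few slides back, with minimum support 2. In the vertical format each itemset maps to its TID set, and support is computed by TID-set intersection instead of database scans:

```python
# Vertical data format: item -> set of TIDs; sup(X) = |intersection of TID sets|.
db = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
      30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
min_sup = 2

tidsets = {}
for tid, items in db.items():
    for i in items:
        tidsets.setdefault(frozenset([i]), set()).add(tid)

level = {x: t for x, t in tidsets.items() if len(t) >= min_sup}
frequent = dict(level)
while level:
    nxt = {}
    pairs = list(level.items())
    for i, (x, tx) in enumerate(pairs):
        for y, ty in pairs[i + 1:]:
            u = x | y
            if len(u) == len(x) + 1:      # join k-itemsets sharing k-1 items
                t = tx & ty               # support via TID-set intersection
                if len(t) >= min_sup:
                    nxt[u] = t
    level = nxt
    frequent.update(level)

print(sorted("".join(sorted(x)) for x in frequent))
# ['A', 'AC', 'B', 'BC', 'BCE', 'BE', 'C', 'CE', 'E']
```

The result agrees with the Apriori run on the same TDB: nine frequent itemsets, with {B, C, E} supported by TIDs {20, 30}.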
Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
Mining Multiple-Level Association Rules
- Items often form concept hierarchies
- Flexible support settings: items at lower levels are expected to have lower support
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
Various ways:
1) Using uniform support for all levels
2) Using reduced support at lower levels
3) Using item- or group-based minimum support (referred to as group-based support), because users or experts often have insight as to which groups are more important than others
Mining Multi-Dimensional Association
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: 2 or more dimensions or predicates
  - Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values (occupation, color)
- Quantitative attributes: numeric, implicit ordering among values (income, age, price); handled by discretization, clustering, and gradient approaches
Mining Quantitative Associations
- Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
  1. Static discretization based on predefined concept hierarchies (data cube methods); this discretization occurs before mining.
  2. Dynamic discretization, or clustering into bins based on the distribution of the data.
- K-predicate sets are used when searching for frequent itemsets: a k-predicate set is a set containing k conjunctive predicates, e.g., {age, occupation, buys} is a 3-predicate set.
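Static discretization simply replaces each numeric value with its predefined range before mining begins. A minimal sketch (the bin boundaries and record fields are ours, chosen to echo the slides' examples):

```python
# Static discretization: numeric age -> predefined concept-hierarchy range.
bins = [(0, 18, "0-18"), (19, 25, "19-25"), (26, 35, "26-35"), (36, 200, "36+")]

def discretize_age(age):
    for lo, hi, label in bins:
        if lo <= age <= hi:
            return label
    raise ValueError(age)

tuples = [{"age": 22, "occupation": "student", "buys": "coke"},
          {"age": 34, "occupation": "engineer", "buys": "TV"}]
for t in tuples:
    t["age"] = discretize_age(t["age"])   # age is now a categorical value

print([t["age"] for t in tuples])  # ['19-25', '26-35']
```

After this step, quantitative attributes behave like categorical ones, so standard frequent-predicate-set mining applies unchanged.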
Static Discretization of Quantitative Attributes
- Discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster.

[Figure: lattice of cuboids over the dimensions age, income, buys: the apex (), the 1-D cuboids (age), (income), (buys), the 2-D cuboids (age, income), (age, buys), (income, buys), and the base cuboid (age, income, buys)]
Mining Quantitative Association Rules
- Example: age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high-resolution TV")
- Numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criterion, such as maximizing the confidence or compactness of the rules.
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid.
- Finding frequent predicate sets: once the 2-D array containing the count distribution for each category is set up, it can be scanned to find the frequent predicate sets (those satisfying minimum support) that also satisfy minimum confidence. Strong association rules can then be generated from these predicate sets using a rule-generation algorithm.
Clustering the association rules:
“Can we find a simpler rule to replace the above four rules?”
Notice that these rules are quite “close” to one another, forming a rule cluster on the grid. Indeed, the four rules can be combined or “clustered” together to form the following simpler rule, which subsumes and replaces the above four rules:
Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items:
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
Construct FP-tree from a Transaction Database
min_support = 3

TID | Items bought           | (Ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o    | f, c, a, b, m
300 | b, f, h, j, o, w       | f, b
400 | b, c, k, s, p          | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p

Header table (item: frequency, with head-of-node-link): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan the DB again, construct the FP-tree

[Figure: FP-tree with root {}; left branch f:4 → c:3 → a:3, which splits into m:2 → p:2 and b:1 → m:1; a second child of f, b:1; and right branch c:1 → b:1 → p:1]
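The three construction steps can be coded directly. This is a minimal sketch of our own (class and variable names are ours; ties among equally frequent items are arbitrary in general, and are pinned here to the slide's f-list for comparability):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(db, min_sup):
    # Step 1: one scan for frequent single items
    counts = Counter(i for t in db for i in t)
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    # Step 2: f-list = frequent items in descending frequency
    # (tie-break pinned to the slide's order f-c-a-b-m-p)
    flist = sorted(freq, key=lambda i: (-freq[i], "fcabmp".index(i)))
    rank = {i: r for r, i in enumerate(flist)}
    # Step 3: second scan; insert each transaction's ordered frequent items
    root = Node(None, None)
    for t in db:
        node = root
        for i in sorted((i for i in t if i in rank), key=rank.get):
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = Node(i, node)
            node = node.children[i]
    return root, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, flist = build_fptree(db, min_sup=3)
print(flist)                     # ['f', 'c', 'a', 'b', 'm', 'p']
print(root.children["f"].count)  # 4, the f:4 node of the slide's tree
```

The resulting tree has the slide's shape: f:4 and c:1 under the root, with f:4 → c:3 → a:3 as the main shared prefix.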
Benefits of the FP-tree Structure
- Completeness
  - Preserves complete information for frequent-pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  - Never larger than the original database (not counting node-links and the count fields)
  - For the Connect-4 DB, the compression ratio can be over 100
Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list (F-list = f-c-a-b-m-p):
  - Patterns containing p
  - Patterns having m but no p
  - …
  - Patterns having c but no a, b, m, or p
  - Pattern f
- This partitioning is complete and non-redundant
Find Patterns Having P From P-conditional Database
- Starting at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-link of each frequent item p
- Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:

item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1

(The FP-tree and header table are the same as on the construction slide.)
From Conditional Pattern-bases to Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

m's conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped: its count, 1, is below min_support = 3)

All frequent patterns relating to m:
m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):
- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - Reduction of the single prefix path into one node
  - Concatenation of the mining results of the two parts

[Figure: a tree whose single prefix path a1:n1 → a2:n2 → a3:n3 leads into branches b1:m1, C1:k1, C2:k2, C3:k3 is decomposed into the prefix path (collapsed to one node r1) and the branching part; the two parts are mined separately and their results concatenated]
Mining Frequent Patterns With FP-trees
- Idea: frequent-pattern growth; recursively grow frequent patterns by pattern and database partition
- Method:
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
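The method above can be sketched in a simplified form that skips the physical FP-tree: each conditional database is a list of (items, count) pairs, mined recursively exactly as described. This is our own compact rendering, not Han, Pei & Yin's implementation, run on the earlier 5-transaction example:

```python
from collections import Counter

# Simplified pattern growth over conditional databases (min_support = 3).
def fpgrowth(db, min_sup, suffix=frozenset()):
    counts = Counter()
    for items, cnt in db:
        for i in set(items):
            counts[i] += cnt
    order = sorted((i for i in counts if counts[i] >= min_sup),
                   key=lambda i: -counts[i])
    rank = {i: r for r, i in enumerate(order)}
    patterns = {}
    for item in reversed(order):              # grow from the least frequent
        pat = suffix | {item}
        patterns[pat] = counts[item]
        # conditional pattern base: higher-ranked items co-occurring with `item`
        cond = [(tuple(i for i in items if i in rank and rank[i] < rank[item]),
                 cnt)
                for items, cnt in db if item in items]
        cond = [(p, c) for p, c in cond if p]
        patterns.update(fpgrowth(cond, min_sup, pat))
    return patterns

db = [(t, 1) for t in [("f", "c", "a", "m", "p"), ("f", "c", "a", "b", "m"),
                       ("f", "b"), ("c", "b", "p"), ("f", "c", "a", "m", "p")]]
pats = fpgrowth(db, min_sup=3)
print(len(pats))               # 18 frequent itemsets in total
print(pats[frozenset("fcam")]) # 3, matching the slide's fcam pattern
```

This reproduces the slide's m-patterns (m, fm, cm, am, fcm, fam, cam, fcam, all with support 3) plus the p-, b-, a-, c-, and f-partitions, 18 frequent itemsets in total.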
Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? Use DB projection:
  - First partition the database into a set of projected DBs
  - Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection techniques; parallel projection is space-costly
Partition-based Projection
- Parallel projection needs a lot of disk space; partition projection saves it

Transaction DB: fcamp, fcabm, fb, cbp, fcamp
- p-projected DB: fcam, cb, fcam
- m-projected DB: fcab, fca, fca
- b-projected DB: f, cb, …
- a-projected DB: fc, …
- c-projected DB: f, …
- f-projected DB: …
- am-projected DB: fc, fc, fc
- cm-projected DB: f, f, f
- …
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K; the D1 Apriori runtime climbs steeply as the support threshold drops toward 0%, while the D1 FP-growth runtime stays low]
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
[Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth and D2 TreeProjection]
Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scan of the entire database
  - Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
Implications of the Methodology
- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)
MaxMiner: Mining Max-patterns
Tid | Items
10  | A, B, C, D, E
20  | B, C, D, E
30  | A, C, D, F

- 1st scan: find the frequent items A, B, C, D, E
- 2nd scan: find the support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE; DE (the tail itemsets ABCDE, BCDE, CDE, DE are the potential max-patterns)
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98
Mining Frequent Closed Patterns: CLOSET
TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | c, e, f

min_sup = 2

- F-list: list of all frequent items in support-ascending order: F-list = d-a-f-e-c
- Divide the search space:
  - Patterns having d
  - Patterns having a but no d, etc.
- Find frequent closed patterns recursively: every transaction having d also has c, f, a, so cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". DMKD'00.
CLOSET+: Mining Closed Itemsets by Pattern-Growth
- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set-enumeration tree can be pruned
- Hybrid tree projection: bottom-up physical tree projection, top-down pseudo tree projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels
- Efficient subset checking
Chapter 5: Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
Interestingness Measure: Correlations (Lift)
- play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75% > 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift

lift(A, B) = P(A ∪ B) / (P(A) P(B))

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Are lift and χ² Good Measures of Correlation?
- "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good measures of correlation
- So many interestingness measures… (Tan, Kumar, Srivastava @ KDD'02)

lift(A, B) = P(A ∪ B) / (P(A) P(B))

           | Milk  | No Milk | Sum (row)
Coffee     | m, c  | ~m, c   | c
No Coffee  | m, ~c | ~m, ~c  | ~c
Sum (col.) | m     | ~m      |

DB | m, c | ~m, c | m, ~c | ~m, ~c  | lift | all-conf | coh  | χ²
A1 | 1000 | 100   | 100   | 10,000  | 9.26 | 0.91     | 0.83 | 9055
A2 | 100  | 1000  | 1000  | 100,000 | 8.44 | 0.09     | 0.05 | 670
A3 | 1000 | 100   | 10000 | 100,000 | 9.18 | 0.09     | 0.09 | 8172
A4 | 1000 | 1000  | 1000  | 1000    | 1    | 0.5      | 0.33 | 0

all_conf(X) = sup(X) / max_item_sup(X)
coh(X) = sup(X) / |universe(X)|
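Row A1 of the table can be verified from the formulas; a minimal sketch (names ours):

```python
# Row A1: mc = 1000, ~m c = 100, m ~c = 100, ~m ~c = 10,000.
mc, nmc, mnc, nmnc = 1000, 100, 100, 10000
N = mc + nmc + mnc + nmnc
sup_m, sup_c, sup_mc = mc + mnc, mc + nmc, mc

lift = (sup_mc / N) / ((sup_m / N) * (sup_c / N))
all_conf = sup_mc / max(sup_m, sup_c)       # sup(X) / max_item_sup(X)
coh = sup_mc / (sup_m + sup_c - sup_mc)     # sup(X) / |universe(X)|

print(round(lift, 2))      # 9.26
print(round(all_conf, 2))  # 0.91
print(round(coh, 2))       # 0.83
```

The same three lines reproduce any row of the table; note how all-conf and coherence stay stable across A1 and A3 in a way lift does not.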
Which Measures Should Be Used?
- lift and χ² are not good measures of correlation in large transactional DBs
- all-conf or coherence could be good measures (Omiecinski @ TKDE'03)
- Both all-conf and coherence have the downward closure property
- Efficient algorithms can be derived for mining (Lee et al. @ ICDM'03sub)