cis611 lecturenotes itemsetassociationrule...

34
April 5, 2013 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 5 — SS Chung

Upload: others

Post on 21-Jul-2020

20 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 1

Data Mining:Concepts and Techniques

— Chapter 5 —

SS Chung

Page 2: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 2

Chapter 5: Mining Frequent Patterns, Association and Correlations

� Basic concepts and a road map

� Efficient and scalable frequent itemset mining

methods

� Mining various kinds of association rules

� From association mining to correlation

analysis

� Constraint-based association mining

� Summary

Page 3: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 3

What Is Frequent Pattern Analysis?

� Frequent pattern: a pattern (a set of items, subsequences, substructures,

etc.) that occurs frequently in a data set

� First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context

of frequent itemsets and association rule mining

� Motivation: Finding inherent regularities in data

� What products were often purchased together?— Beer and diapers?!

� What are the subsequent purchases after buying a PC?

� What kinds of DNA are sensitive to this new drug?

� Can we automatically classify web documents?

� Applications

� Basket data analysis, cross-marketing, catalog design, sale campaign

analysis, Web log (click stream) analysis, and DNA sequence analysis.

Page 4: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 4

Why Is Freq. Pattern Mining Important?

� Discloses an intrinsic and important property of data sets

� Forms the foundation for many essential data mining tasks

� Association, correlation, and causality analysis

� Sequential, structural (e.g., sub-graph) patterns

� Pattern analysis in spatiotemporal, multimedia, time-

series, and stream data

� Classification: associative classification

� Cluster analysis: frequent pattern-based clustering

� Data warehousing: iceberg cube and cube-gradient

� Semantic data compression: fascicles

� Broad applications

Page 5: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 5

Association Rules

Based on the types of values, the

association rules can be classified into two

categories: Boolean Association Rules and

Quantitative Association Rules

� Boolean Association Rule:

Keyboard ⇒Mouse [support = 6%,

confidence = 70%]

� Quantitative Association Rule:

(Age = 26…30) ⇒ (Cars =1, 2) [support 3%,

confidence = 36%]

Page 6: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 6

Minimum Support Threshold

� The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true.

IF A ⇒ B

support (A⇒B) =

total _# of tuples containing both A and B

______________________________________

total _# of tuples in transaction

Page 7: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 7

Minimum Confidence Threshold

� Confidence is defined as the measure of certainty or trustworthiness associated with each discovered pattern.

IF A ⇒ B

confidence (A ⇒ B ) =

# of tuples containing both A and B __________________________________

# of tuples containing A

Page 8: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 8

Basic Concepts: Frequent Patterns and Association Rules

� Itemset X = {x1, …, xk}

� Find all the rules X � Y with minimum

support and confidence

� support, s, probability that a transaction contains X ∪ Y

� confidence, c, conditional probability that a transaction having X also contains Y

Let supmin = 50%, confmin = 50%

Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}

Association rules:

A � D (60%(3/5), 100%(3/3))

D � A (60%(3/5), 75%(3/4))

Customer

buys diaper

Customer

buys both

Customer

buys beer

B, E, F40

B, C, D, E, F50

A, D, E30

A, C, D20

A, B, D10

Items boughtTransaction-id

Page 9: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 9

Chapter 5: Mining Frequent Patterns, Association and Correlations

� Basic concepts and a road map

� Efficient and scalable frequent itemset mining

methods

� Mining various kinds of association rules

� From association mining to correlation

analysis

� Constraint-based association mining

� Summary

Page 10: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 10

Scalable Methods for Mining Frequent Patterns

� The downward closure property of frequent patterns

� Any subset of a frequent itemset must be frequent

� If {beer, diaper, nuts} is frequent, so is {beer, diaper}

� i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

� Scalable mining methods: Three major approaches

� Apriori (Agrawal & Srikant@VLDB’94)

� Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)

� Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

Page 11: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 11

Apriori: A Candidate Generation-and-Test Approach

� Apriori pruning principle: If there is any itemset which is

infrequent, its superset should not be generated/tested!

(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)

� Method:

� Initially, scan DB once to get frequent 1-itemset

� Generate length (k+1) candidate itemsets from length k

frequent itemsets

� Test the candidates against DB to prune

� Terminate when no frequent or candidate set can be

generated

Page 12: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 12

The Apriori Algorithm—An Example

Database TDB

1st scan

C1

L1

L2

C2 C2

2nd scan

C3 L33rd scan

B, E40

A, B, C, E30

B, C, E20

A, C, D10

ItemsTid

1{D}

3{E}

3{C}

3{B}

2{A}

supItemset

3{E}

3{C}

3{B}

2{A}

supItemset

{C, E}

{B, E}

{B, C}

{A, E}

{A, C}

{A, B}

Itemset1{A, B}

2{A, C}

1{A, E}

2{B, C}

3{B, E}

2{C, E}

supItemset

2{A, C}

2{B, C}

3{B, E}

2{C, E}

supItemset

{B, C, E}

Itemset

2{B, C, E}

supItemset

Supmin = 2

Page 13: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 13

The Apriori Algorithm

� Pseudo-code:Ck: Candidate itemset of size kLk : frequent itemset of size k

L1 = {frequent items};for (k = 1; Lk !=∅; k++) do begin

Ck+1 = candidates generated from Lk;for each transaction t in database do{

increment the count of all candidates in Ck+1

that are contained in t}Lk+1 = candidates in Ck+1 with min_support

endreturn ∪k Lk;

Page 14: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 14

Important Details of Apriori

� How to generate candidates Ck+1?

� Step 1: self-joining Lk

� Step 2: pruning

� How to count supports of candidates?

� Example of Candidate-generation

� L3={abc, abd, acd, ace, bcd}

� Self-joining: L3*L3� abcd from abc and abd

� acde from acd and ace

� Pruning:

� acde is removed because ade is not in L3

� C4={abcd}

Page 15: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 15

How to Generate Candidates?

� Suppose the items in Lk-1 are listed in an order

� Step 1: self-joining Lk-1insert into Ck

select p.item1, p.item2, …, p.itemk-1, q.itemk-1

from Lk-1 p, Lk-1 q

where p.item1=q.item1, …, p.itemk-2=q.itemk-2,

p.itemk-1 < q.itemk-1

� Step 2: pruning by property of frequent patterns

forall itemsets c in Ck do

forall (k-1)-subsets s of c do

if (s is not in Lk-1) then delete c from Ck

Page 16: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 16

How to Count Supports of Candidates?

� Why counting supports of candidates a problem?

� The total number of candidates can be very huge

� One transaction may contain many candidates

� Method:

� Candidate itemsets are stored in a hash-tree

� Leaf node of hash-tree contains a list of itemsets and

counts

� Interior node contains a hash table

� Subset function: finds all the candidates contained in

a transaction

Page 17: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 17

Efficient Implementation of Apriori in SQL

� Hard to get good performance out of pure SQL (SQL-

92) based approaches alone

� Make use of object-relational extensions like UDFs,

BLOBs, Table functions etc.

� Get orders of magnitude improvement

� S. Sarawagi, S. Thomas, and R. Agrawal. Integrating

association rule mining with relational database

systems: Alternatives and implications. In SIGMOD’98

Page 18: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 18

Challenges of Frequent Pattern Mining

� Challenges

� Multiple scans of transaction database

� Huge number of candidates

� Tedious workload of support counting for candidates

� Improving Apriori: general ideas

� Reduce passes of transaction database scans

� Shrink number of candidates

� Facilitate support counting of candidates

Page 19: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 19

Partition: Scan Database Only Twice

� Any itemset that is potentially frequent in DB must be

frequent in at least one of the partitions of DB

� Scan 1: partition database and find local frequent

patterns

� Scan 2: consolidate global frequent patterns

� A. Savasere, E. Omiecinski, and S. Navathe. An efficient

algorithm for mining association in large databases. In

VLDB’95

Page 20: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 20

Bottleneck of Frequent-pattern Mining

� Multiple database scans are costly

� Mining long patterns needs many passes of

scanning and generates lots of candidates

� To find frequent itemset i1i2…i100� # of scans: 100

� # of Candidates: (1001) + (100

2) + … + (110000) = 2100-

1 = 1.27*1030 !

� Bottleneck: candidate-generation-and-test

� Can we avoid candidate generation?

Page 21: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 21

Mining Frequent Patterns WithoutCandidate Generation

� Grow long patterns from short ones using local

frequent items

� “abc” is a frequent pattern

� Get all transactions having “abc”: DB|abc

� “d” is a local frequent item in DB|abc � abcd is

a frequent pattern

Page 22: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 22

Chapter 5: Mining Frequent Patterns, Association and Correlations

� Basic concepts and a road map

� Efficient and scalable frequent itemset mining

methods

� Mining various kinds of association rules

� From association mining to correlation

analysis

� Constraint-based association mining

� Summary

Page 23: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 23

Mining Various Kinds of Association Rules

� Mining multilevel association

� Miming multidimensional association

� Mining quantitative association

� Mining interesting correlation patterns

Page 24: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 24

Mining Multiple-Level Association Rules

� Items often form hierarchies

� Flexible support settings

� Items at the lower level are expected to have lower support

� Exploration of shared multi-level mining (Agrawal & Srikant@VLB’95, Han & Fu@VLDB’95)

uniform support

Milk

[support = 10%]

2% Milk

[support = 6%]

Skim Milk

[support = 4%]

Level 1

min_sup = 5%

Level 2

min_sup = 5%

Level 1

min_sup = 5%

Level 2

min_sup = 3%

reduced support

Page 25: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 25

Multi-level Association: Redundancy Filtering

� Some rules may be redundant due to “ancestor”

relationships between items.

� Example

� milk ⇒ wheat bread [support = 8%, confidence = 70%]

� 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

� We say the first rule is an ancestor of the second rule.

� A rule is redundant if its support is close to the “expected”

value, based on the rule’s ancestor.

Page 26: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 26

Mining Multi-Dimensional Association

� Single-dimensional rules:

buys(X, “milk”) ⇒ buys(X, “bread”)

� Multi-dimensional rules: ≥ 2 dimensions or predicates

� Inter-dimension assoc. rules (no repeated predicates)

age(X,”19-25”) ∧ occupation(X,“student”) ⇒ buys(X, “coke”)

� hybrid-dimension assoc. rules (repeated predicates)

age(X,”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)

� Categorical Attributes: finite number of possible values, no

ordering among values—data cube approach

� Quantitative Attributes: numeric, implicit ordering among

values—discretization, clustering, and gradient approaches

Page 27: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 27

Mining Quantitative Associations

� Techniques can be categorized by how numerical attributes, such as age or salary are treated

1. Static discretization based on predefined concept

hierarchies (data cube methods)

2. Dynamic discretization based on data distribution

(quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)

3. Clustering: Distance-based association (e.g., Yang &

Miller@SIGMOD97)

� one dimensional clustering then association

4. Deviation: (such as Aumann and Lindell@KDD99)

Sex = female => Wage: mean=$7/hr (overall mean = $9)

Page 28: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 28

Static Discretization of Quantitative Attributes

� Discretized prior to mining using concept hierarchy.

� Numeric values are replaced by ranges.

� In relational database, finding all frequent k-predicate sets

will require k or k+1 table scans.

� Data cube is well suited for mining.

� The cells of an n-dimensional

cuboid correspond to the

predicate sets.

� Mining from data cubes

can be much faster.

(income)(age)

()

(buys)

(age, income) (age,buys) (income,buys)

(age,income,buys)

Page 29: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 29

Quantitative Association Rules

age(X,”34-35”) ∧∧∧∧ income(X,”30-50K”)

⇒⇒⇒⇒ buys(X,”high resolution TV”)

� Proposed by Lent, Swami and Widom ICDE’97

� Numeric attributes are dynamically discretized

� Such that the confidence or compactness of the rules mined is maximized

� 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat

� Cluster adjacent association rules to form general rules using a 2-D grid

� Example

Page 30: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 30

Chapter 5: Mining Frequent Patterns, Association and Correlations

� Basic concepts and a road map

� Efficient and scalable frequent itemset mining

methods

� Mining various kinds of association rules

� From association mining to correlation analysis

� Constraint-based association mining

� Summary

Page 31: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 31

Interestingness Measure: Correlations (Lift)

� play basketball ⇒ eat cereal [40%, 66.7%] is misleading

� The overall % of students eating cereal is 75% > 66.7%.

� play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,

although with lower support and confidence

� Measure of dependent/correlated events: lift = P(B|A) / P(B) for A=>B

89.05000/3750*5000/3000

5000/2000),( ==CBlift

500020003000Sum(col.)

12502501000Not cereal

375017502000Cereal

Sum (row)Not basketballBasketball

)()(

)(

BPAP

BAPlift

∪=

33.15000/1250*5000/3000

5000/1000),( ==¬CBlift

Page 32: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 32

Lift

� Measure of dependent/correlated events:

� For A=>B,

� When A and B are independent:

P(AUB) = P(A)*P(B)

� When B is dependent on A:

P(B|A) = P(B U A) / P(A)

� lift = P(B|A) / P(B)

= P(B U A) / P(A)* P(B)

= 1 means A, B are Independent, no association

> 1 means positive correlation (association) : A increases B

< 1 means negative correlation (association) : A decreases B

Page 33: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 33

Are lift and χ2 Good Measures of Correlation?

� “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading

� if 85% of customers buy milk

� Support and confidence are not good to represent correlations

� So many interestingness measures? (Tan, Kumar, Sritastava @KDD’02)

Σ~mmSum(col.)

~c~m, ~cm, ~cNo Coffee

c~m, cm, cCoffee

Sum (row)No MilkMilk)()(

)(

BPAP

BAPlift

∪=

00.330.511000100010001000A4

81720.090.099.18100,000100001001000A3

6700.050.098.44100,00010001000100A2

90550.830.919.2610,0001001001000A1

χ2cohall-conflift~m~cm~c~m, cm, cDB

)sup(_max_

)sup(_

Xitem

Xconfall =

|)(|

)sup(

Xuniverse

Xcoh =

Page 34: CIS611 LectureNotes ItemSetAssociationRule ...eecs.csuohio.edu/~sschung/CIS660/CIS611_LectureNotes...April 5, 2013 Data Mining: Concepts and Techniques 20 Bottleneck of Frequent-pattern

April 5, 2013 Data Mining: Concepts and Techniques 34

Which Measures Should Be Used?

� lift and χχχχ2 are not good measures for correlations in large transactional DBs

� all-conf or coherence could be good measures (Omiecinski@TKDE’03)

� Both all-conf and coherence have the downward closure property

� Efficient algorithms can be derived for mining (Lee et al. @ICDM’03sub)