Data Mining: Concepts and Techniques

Mining Frequent Patterns, Associations, and Correlations (Chapter 5)


Page 1:

Mining Frequent Patterns, Associations, and Correlations

(Chapter 5)

Page 2:

Chapter 5: Mining Frequent Patterns, Associations, and Correlations

What is association rule mining

Mining single-dimensional Boolean association rules

Mining multilevel association rules

Mining multidimensional association rules

From association mining to correlation analysis

Summary

Page 3:

What Is Association Mining?

Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Rule form: “Body → Head [support, confidence]”. Examples:
buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
age(X, “20..29”) ^ income(X, “30..39K”) → buys(X, “PC”) [2%, 60%]

Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, classification, etc.

Page 4:

Denotation of Association Rules

Rule form: “Body → Head [support, confidence]” (association, correlation, and causality).

Rule examples:
sales(T, “computer”) → sales(T, “software”) [support = 1%, confidence = 75%]
buy(T, “Beer”) → buy(T, “Diaper”) [support = 2%, confidence = 70%]
age(X, “20..29”) ^ income(X, “30..39K”) → buys(X, “PC”) [support = 2%, confidence = 60%]

[Slide annotations label the parts of a rule: the data object (T, X), the attribute name (sales, buy, age, …), the data content (the constant values), and the data context.]

Page 5:

Association Rule Mining (Basic Concepts)

Given: a database of transactions, where each transaction is a list of items (e.g., the items purchased by a customer in one visit).
Find: all rules that correlate the presence of one set of items with that of another set of items (X → Y). E.g., 98% of people who purchase tires and auto accessories also get car maintenance done.

Application examples:
? → Car Maintenance Agreement (what should the store do to boost Car Maintenance Agreement sales?)
Home Electronics → ? (what other products should the store stock up on?)

Page 6:

Association Rule Mining (Support and Confidence)

Given a transaction database, find all rules X → Y with minimum support and confidence:
support S: the probability that a transaction contains {X & Y}
confidence C: the conditional probability that a transaction containing X also contains Y

I = {i1, i2, i3, ..., in}: the set of all items
Tj ⊆ I: a transaction (j = 1, …, n)

Transaction ID | Items Bought
0001 | A, B, C
0002 | A, C
0003 | A, D
0004 | B, E, F

A → C (support 50%, confidence 66.6%)
C → A (support 50%, confidence 100%)

[Slide Venn diagram: customers buying X, customers buying Y, and their overlap, customers buying both.]
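Support and confidence are simple to compute directly. A minimal Python sketch (illustrative, not from the slides) that reproduces the A → C and C → A numbers above:

# Support and confidence over the 4-transaction example database.
transactions = [
    {"A", "B", "C"},   # 0001
    {"A", "C"},        # 0002
    {"A", "D"},        # 0003
    {"B", "E", "F"},   # 0004
]

def support(itemset, db):
    # Fraction of transactions containing every item of `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # Conditional probability: support(lhs with rhs) / support(lhs).
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))        # 0.5  -> 50%
print(confidence({"A"}, {"C"}, transactions))   # 0.666... -> 66.6% (A -> C)
print(confidence({"C"}, {"A"}, transactions))   # 1.0  -> 100% (C -> A)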

Page 7:

A Road Map for Association Rule Mining (Types of Association Rules)

Boolean vs. quantitative associations (based on the types of data values):
buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]

Single-dimensional vs. multidimensional associations (see the examples above).

Single-level vs. multiple-level analysis: e.g., what brands of beers are associated with what brands of diapers?

Various extensions:
Correlation and causality analysis; association does not necessarily imply correlation or causality.
Enforced constraints, e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Page 8:

Chapter 5: Mining Association Rules in Large Databases

What is association rule mining

Mining single-dimensional Boolean association rules

Mining multilevel association rules

Mining multidimensional association rules

From association mining to correlation analysis

Summary

Page 9:

Mining Association Rules (How to Find Strong Rules: An Example)

1) Find the frequent itemsets. A frequent itemset is a set of items that has minimum support.
2) Generate association rules from the frequent itemsets, in the form “Body → Head [support, confidence]”, with minimum support and confidence.

Example: for rule A → C, find the supports of {A, C} and {A}:
support = support({A, C}) = 50% > min. support
confidence = support({A, C}) / support({A}) = 66.6% > min. confidence

Problem: how many itemsets and rules are there? A → C; …; C → A; …; A, B → C; …

Transaction ID | Items Bought
1001 | A, B, C
1002 | A, C
1003 | A, D
1004 | B, E, F

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

Task: find strong rules with support >= 50% and confidence >= 50%.
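Once the frequent itemsets and their supports are known, the strong rules follow mechanically. A small illustrative sketch (not from the slides) that derives both strong rules from the table above:

from itertools import combinations

# Supports taken from the frequent-itemset table above (min support 50%).
support = {frozenset("A"): 0.75, frozenset("B"): 0.50,
           frozenset("C"): 0.50, frozenset("AC"): 0.50}
min_conf = 0.5

for itemset in (s for s in support if len(s) > 1):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = support[itemset] / support[lhs]
            if conf >= min_conf:
                print(set(lhs), "->", set(itemset - lhs),
                      f"[sup={support[itemset]:.0%}, conf={conf:.1%}]")
# Prints A -> C [sup=50%, conf=66.7%] and C -> A [sup=50%, conf=100.0%].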

Page 10:

Mining Association Rules Efficiently (Apriori Algorithm)

Key steps:

1. Find the frequent itemsets: iteratively find the frequent itemsets with cardinality from 1 to k (from 1-itemsets to k-itemsets), using the Apriori principle.
Frequent itemset: a set of items that has minimum support.
Apriori principle: any subset of a frequent itemset must also be frequent. I.e., if {A, B} is a frequent 2-itemset, both {A} and {B} must be frequent 1-itemsets; if either {A} or {B} is not frequent, {A, B} cannot be frequent.

2. Generate association rules from the frequent itemsets.

Page 11:

The Apriori Algorithm — Example (support >= 50%, i.e., support count >= 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

L1 ({4} is pruned):
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

L2:
itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

C3: {2 3 5}; scan D → support 2; hence L3: {2 3 5} | 2

Page 12:

The Apriori Algorithm (Pseudocode for Finding Frequent Itemsets)

Iteratively: C1 → L1 → C2 → L2 → C3 → L3 → … → CN → LN

Ck (like a table): the set of candidate itemsets of size k
Lk (like a table): the set of frequent itemsets of size k

Pseudocode:

C1 = {candidate 1-itemsets};
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;            // join & prune
    for each transaction t in the database do
        increment the count of every itemset in Ck+1 contained in t;
    Lk+1 = itemsets in Ck+1 with support >= min_support;
end
return ∪k Lk;

Each iteration derives Ck+1 from Lk (join & prune), then scans the database once to count Ck+1 and obtain Lk+1.
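The loop above translates almost line-for-line into Python. A self-contained sketch (illustrative; it uses the pair-wise joining described on the next slides rather than the SQL prefix join), run on the Page 11 example database:

from itertools import combinations

def apriori(db, min_support):
    # db: list of transactions (sets of items); min_support: fraction in [0, 1].
    min_count = min_support * len(db)
    counts = {}
    for t in db:                      # C1 -> L1
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join: pair-wise unions of frequent k-itemsets sharing k-1 items;
        # Prune: keep a candidate only if every k-subset is frequent.
        keys = list(Lk)
        Ck1 = {a | b for i, a in enumerate(keys) for b in keys[i + 1:]
               if len(a | b) == k + 1
               and all(frozenset(s) in Lk for s in combinations(a | b, k))}
        counts = {c: 0 for c in Ck1}
        for t in db:                  # one scan of the database per level
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for s, c in sorted(apriori(db, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)   # ends with [2, 3, 5] 2, matching L3 above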

Page 13:

Generate Candidate Itemsets Ck+1 from Lk (Lk → Ck+1: Join & Prune)

Representation:
Lk: the set of frequent itemsets of size k
Ck+1: the set of candidate itemsets of size k+1

Steps for computing the candidate (k+1)-itemsets Ck+1:
1) Join step (according to the Apriori principle): Ck+1 is generated by joining Lk with itself. (Note the length constraint: k → k+1.)
2) Prune step (apply the Apriori principle to prune): any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset. (Use Lk to prune Ck+1.)

Page 14:

Generation of Candidate Itemsets Ck+1 (Effective Pseudocode)

Suppose the items in each itemset of Lk are listed (sorted) in order.

Step 1: self-joining Lk (SQL pseudocode; p and q are aliases of Lk, and the itemi are sorted):

insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, …, p.itemk-1 = q.itemk-1, p.itemk < q.itemk

Step 2: pruning. Any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset (Ck+1 is like a table):

for every (k+1)-itemset c in Ck+1 do
    for every k-subset s of c do
        if (s is not in Lk) then delete c from Ck+1
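The same two steps in Python (an illustrative sketch; itemsets are kept as sorted tuples so the SQL join condition maps directly):

from itertools import combinations

def gen_candidates(Lk):
    # Lk: collection of sorted k-tuples (the frequent k-itemsets).
    Lk = sorted(set(Lk))
    Lk_set = set(Lk)
    k = len(Lk[0])
    Ck1 = []
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            if p[:k - 1] == q[:k - 1]:        # join: first k-1 items equal;
                c = p + (q[k - 1],)           # p.itemk < q.itemk holds by sorting
                if all(s in Lk_set for s in combinations(c, k)):   # prune
                    Ck1.append(c)
    return Ck1

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(gen_candidates(L3))
# [('a', 'b', 'c', 'd')]; acde is generated by the join but pruned (ade, cde not in L3).
# Note: this strict prefix join never produces abce; the looser pair-wise joining
# shown on the next slide would, and pruning would then remove it (bce not in L3).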

Page 15:

Generating Candidate Itemsets C4 (Example: L3 → C4)

L3 = {abc, abd, acd, ace, bcd}. What is C4?

Self-joining L3 * L3 (pair-wise joining of p ∈ L3 with q ∈ L3):
abcd (from abc + abd) → C4
abce (from abc + ace) → C4
acde (from acd + ace) → C4

Pruning:
acde is removed from C4 because ade ∉ L3 (or cde ∉ L3); the 3-subsets of acde are acd, ace, ade, cde.
Likewise, abce is removed because bce ∉ L3.

Therefore, C4 = {abcd}.

Page 16:

Is Apriori Fast Enough? (Performance Bottlenecks)

The core of the Apriori algorithm:
Compute Ck (Lk-1 → Ck): use the frequent (k-1)-itemsets to generate candidate k-itemsets (candidate generation).
Compute Lk (Ck → Lk): use a database scan and pattern matching to collect counts for the candidate itemsets.

The bottleneck of Apriori is candidate generation:
Huge candidate sets when computing Ck: 10^4 frequent 1-itemsets generate ~10^7 candidate 2-itemsets. What would be the number of 3-itemsets, 4-itemsets, …?
For a transaction database over 100 items, e.g., {a1, a2, …, a100}, there are 2^100 ≈ 10^30 possible candidate itemsets.
Multiple scans of the database when computing Lk: N scans are needed, where N is the length of the longest pattern.

Page 17:

Methods to Improve Apriori’s Efficiency

Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent.

Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.

Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (enables parallel computing).

Sampling: mine on a subset of the given data, with a lowered support threshold plus a method to determine completeness.

Page 18:

Mining Frequent Patterns by FP-growth (Without Candidate Generation or Repeated DB Scans)

Avoid costly database scans

Compress a large database into a compact, Frequent-Pattern tree structure (FP-tree)

Highly condensed, but complete for frequent pattern mining

Avoid costly candidate generation

Develop an efficient, FP-tree-based frequent pattern mining method

A divide-and-conquer methodology: decompose mining tasks into smaller ones

Page 19:

FP-growth vs. Apriori (Scalability with the Support Threshold)

[Chart comparing D1 FP-growth runtime and D1 Apriori runtime: y-axis run time (sec.), 0 to 100; x-axis support threshold (%), 0 to 3.]

On data set T25I20D10K. For a data set TxIyDz:
• x indicates the average transaction length
• y indicates the average itemset length
• z indicates the total number of transaction instances

Page 20:

Major Steps of the FP-growth Algorithm

1) Construct the FP-tree.
2) Construct the conditional pattern base for each item in the header table of the FP-tree.
3) Construct the conditional FP-tree from each entry of the conditional pattern base.
4) Derive frequent patterns by mining the conditional FP-trees recursively. If a conditional FP-tree contains a single path, simply enumerate all the patterns.

Page 21:

Construct FP-tree from a Transaction DB

min_support = 50%, i.e., support count >= 2.5

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Header table (item : frequency): f 4, c 4, a 3, b 3, m 3, p 3

[FP-tree figure: root {} with two branches. Main branch: f:4, under it c:3 and b:1; under c:3, a:3; under a:3, m:2 (with p:2 below) and b:1 (with m:1 below). Second branch: c:1, then b:1, then p:1. Node-links connect the occurrences of each header-table item.]

Steps:
1. Scan the DB once and find the frequent 1-itemsets.
2. Order the frequent items in each transaction in frequency-descending order.
3. Scan the DB again and construct the FP-tree.
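A compact sketch of the two-scan construction just described (illustrative, not the book's code; ties among equally frequent items may be ordered differently than on the slide):

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}            # item -> Node

def build_fptree(db, min_count):
    # Scan 1: count items; keep frequent ones, ranked by descending frequency.
    freq = Counter(item for t in db for item in t)
    rank = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= min_count}
    root, header = Node(None, None), {}   # header table: item -> list of its nodes
    # Scan 2: insert each transaction's ordered frequent items as a path.
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fptree(db, 3)
print({i: sum(n.count for n in ns) for i, ns in header.items()})
# -> {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}, matching the header table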

Page 22:

Benefits of the FP-tree Structure

Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining

Compactness:
reduces irrelevant information (infrequent items are gone)
frequency-descending ordering: more frequent items are more likely to be shared
never larger than the original database (not counting node-links and counts)
Example: for the Connect-4 DB, the compression ratio can exceed 100.

Page 23:

Step 2: Construct the Conditional Pattern Base

Starting from the 2nd item of the header table of the FP-tree:
Traverse the FP-tree by following the node-links of each frequent item.
Accumulate all the transformed prefix paths of each frequent item to form an entry of the conditional pattern base.

Conditional pattern base:
item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

[Same FP-tree and header table as on Page 21.]
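Continuing the sketch from Page 21 (reusing its build_fptree and header; illustrative), the conditional pattern base of an item is read off by walking each of its nodes up to the root:

def conditional_pattern_base(header, item):
    # For every node labeled `item`, emit its prefix path with that node's count.
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append(("".join(reversed(path)), node.count))
    return base

print(conditional_pattern_base(header, "m"))
# -> [('fca', 2), ('fcab', 1)] (up to ordering among equally frequent items)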

Page 24:

Properties of FP-tree for Conditional Pattern Base Construction

Node-link property: for a frequent item ai, all the frequent itemsets that contain ai can be obtained by following ai's node-links, starting from ai's entry in the header table of the FP-tree.

Prefix-path property: to calculate the frequent itemsets containing ai in a path P, only the prefix sub-path of node ai in P needs to be accumulated, and its frequency count carries the same count as node ai.

Page 25:

Step 3: Construct the Conditional FP-tree

For each conditional pattern-base entry (illustrated here for m):
Accumulate the count for each item of the entry.
Construct the conditional FP-tree over the frequent items of the entry.

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3 (b is dropped, since its count 1 is below the minimum support count of 3)

[Same global FP-tree and header table as on Page 21.]

Page 26:

Conditional Pattern Bases & Conditional FP-trees

Assume minimal support = 3; “|” denotes concatenation.

Item | Conditional pattern base | Conditional FP-tree
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p

Page 27:

Step 4: Recursively Mine the Conditional FP-trees

Suppose a conditional FP-tree T has a single path P. Then all frequent itemsets of T can be generated by enumerating all the item combinations in path P.

The m-conditional FP-tree is a single path: {} → f:3 → c:3 → a:3.

Frequent pattern generation: all item combinations in P, each with m appended, give all frequent itemsets containing m:
m, fm, cm, am, fcm, fam, cam, fcam
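For the single-path case, the enumeration is literally all subsets of the path, each suffixed with the conditioned item. A tiny illustrative sketch:

from itertools import chain, combinations

def single_path_patterns(path, suffix):
    # All item combinations of `path` (including the empty one), plus `suffix`.
    subsets = chain.from_iterable(combinations(path, r) for r in range(len(path) + 1))
    return ["".join(s) + suffix for s in subsets]

print(single_path_patterns("fca", "m"))
# -> ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']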

Page 28:

Why Is the FP-growth Algorithm Fast?

Performance studies show that FP-growth is an order of magnitude faster than the Apriori algorithm.

Reasoning: FP-tree building
uses a compact data structure
performs no candidate generation and no candidate tests
eliminates repeated database scans
its basic operation is counting

Page 29:

Presentation of Association Rules (Table Form)

Page 30:

Visualization of Association Rules Using a Plane Graph

Page 31:

Visualization of Association Rules Using a Rule Graph

Page 32:

Chapter 5: Mining Association Rules in Large Databases

What is association rule mining

Mining single-dimensional Boolean association rules

Mining multilevel association rules

Mining multidimensional association rules

From association mining to correlation analysis

Summary

Page 33:

Multiple-Level Association Rules

Items often form a concept hierarchy.
Rules regarding itemsets at the appropriate levels could be quite useful.
Items at a lower level are expected to have lower support.
The transaction database can be encoded based on dimensions and levels, and we can explore multi-level mining.

[Concept hierarchy figure: Food at the top; below it milk and bread; below milk, skim and 2%; below bread, wheat and white; brands such as Fraser and Wonder at the bottom.]

TID | Items
T1 | {111, 121, 211, 221}
T2 | {111, 211, 222, 323}
T3 | {112, 122, 221, 411}
T4 | {111, 121}
T5 | {111, 122, 211, 221, 413}

Page 34:

Mining Multi-Level Associations

A top-down, progressive deepening approach:
First find high-level strong rules: milk → bread [20%, 60%].
Then find their lower-level “weaker” rules: 2% milk → wheat bread [6%, 50%].

Variations of mining multiple-level association rules:
Level-crossed association rules: 2% milk → Wonder wheat bread.
Association rules with multiple, alternative concept hierarchies: 2% milk → Wonder bread.

Page 35:

Uniform Support vs. Reduced Support (Multi-level Association Mining)

Uniform support: use the same minimum support for mining all levels.

Pros: simple, with a single minimum support threshold. If an itemset contains any item whose ancestors do not have minimum support, there is no need to examine it.

Cons: lower-level items do not occur as frequently. If the support threshold is set too high, we miss low-level associations; if set too low, we generate too many high-level associations.

Page 36:

Uniform Support Example

Multi-level mining with uniform support:

Level 1 (min_sup = 5%): Milk [support = 10%] passes.
Level 2 (min_sup = 5%): 2% Milk [support = 6%] passes; Skim Milk [support = 4%] fails (Χ).

Page 37:

Uniform Support vs. Reduced Support (Multi-level Association Mining)

Reduced support: use a reduced minimum support at lower levels. There are three main search strategies:

Level-by-level independent (without level (i-1) information): each tree node (item) is examined, regardless of whether its parent node is frequent. (Too loose → not efficient.)

Level-cross filtering by k-itemset (at Lk, using level (i-1) information about k-itemsets): a k-itemset at the i-th level is examined iff its parent k-itemset at the (i-1)-th level is frequent. (Efficient, but too restrictive → many missing rules.)

Level-cross filtering by single item (at Lk, using level (i-1) information about single items): an item at the i-th level is examined iff its parent node at the (i-1)-th level is significant, which need not mean frequent. (A compromise between the two strategies above.)

Page 38:

Reduced Support Example

Multi-level mining with reduced support (level-cross filtering by single item):

Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]

Case 1 (Level 1 min_sup = 8%, Level 2 min_sup = 4%): Milk passes level 1; both 2% Milk and Skim Milk pass level 2.

Case 2 (Level 1 min_sup = 12%, Level 2 min_sup = 6%): Milk fails level 1 (Χ) but is still significant, so its children are examined; 2% Milk passes level 2, Skim Milk fails (Χ).

In case 2, the level-1 parent node is significant but not frequent.

Page 39:

Redundancy Filtering in Multi-level Association Mining

Some rules may be redundant due to “ancestor” relationships between items.

Example of a redundant rule:
milk → wheat bread [support = 8%, confidence = 70%]
2% milk → wheat bread [support = 2%, confidence = 72%]

We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to (or less than) the “expected” value based on the rule's ancestor. E.g., if about a quarter of the milk sold is 2% milk, the expected support of the second rule is 8% / 4 = 2%, which is exactly what is observed, so the second rule adds little information.

Page 40:

Progressive Deepening in Multi-Level Association Mining

A top-down, progressive deepening approach:
First mine the high-level frequent items: milk (15%), bread (10%).
Then mine their lower-level “weaker” frequent itemsets: 2% milk (5%), wheat bread (4%).

Different min_support threshold strategies across the levels lead to different algorithms:
If adopting the same min_support across all levels, toss item t if any of t's ancestors is infrequent.
If adopting a reduced min_support at lower levels, examine only those descendants whose ancestor's support is frequent or non-negligible.

Page 41:

Progressive Refinement of Data Mining Quality

Why progressive refinement? A mining operator can be expensive or cheap, fine or rough. Trade speed for quality: step-by-step refinement.

Two- or multi-step mining:
First apply a rough/cheap operator (superset coverage).
Then apply an expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD'95).

Superset coverage property: preserve all the positive answers; allow false positives, but not false negatives.

Page 42:

Chapter 5: Mining Association Rules in Large Databases

What is association rule mining

Mining single-dimensional Boolean association rules

Mining multilevel association rules

Mining multidimensional association rules

From association mining to correlation analysis

Summary

Page 43:

Multi-Dimensional Association (Concepts)

Single-dimensional rules: buys(X, “milk”) → buys(X, “bread”)

Multi-dimensional rules:
Inter-dimension association rules (no repeated predicates): age(X, “19-25”) ^ occupation(X, “student”) → buys(X, “coke”)
Hybrid-dimension association rules (repeated predicates): age(X, “19-25”) ^ buys(X, “popcorn”) → buys(X, “coke”)

Categorical attributes: a finite number of possible values, with no ordering among the values.
Quantitative attributes: numeric, with an implicit ordering among the values, e.g., age.

Page 44:

Mining Quantitative Associations

Search for frequent k-predicate sets. Example: {age, occupation, buys} is a 3-predicate set.

Techniques can be categorized by how the quantitative attributes (e.g., age) are treated:

1. Static discretization: quantitative attributes are statically discretized using predefined concept hierarchies, with numeric values replaced by ranges. Example for salary: 0~20K, 21K~30K, 31K~40K, >= 41K.

2. Dynamic discretization: quantitative attributes are dynamically discretized into bins based on the distribution of the data, e.g., equi-width or equi-depth bins, as sketched below.

3. Distance-based (clustering) discretization: a dynamic discretization process that also considers the distance between data points (see the Page 49 example).
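A small illustrative sketch contrasting the first two binning styles on the price list used in the Page 49 example (the exact bin boundaries here are assumptions, not the slide's intervals):

prices = [7, 20, 22, 50, 51, 53]

def equi_width(values, width=10):
    # Fixed-width intervals: value v falls in [k*width, (k+1)*width - 1].
    return {v: (v // width * width, v // width * width + width - 1) for v in values}

def equi_depth(values, depth=2):
    # Each bin holds `depth` consecutive sorted values.
    s = sorted(values)
    return [s[i:i + depth] for i in range(0, len(s), depth)]

print(equi_width(prices))   # groups 20 with 22, and 50, 51, 53 together
print(equi_depth(prices))   # [[7, 20], [22, 50], [51, 53]]: bins ignore the gaps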

Page 45:

Static Discretization of Quantitative Attributes

Before mining, discretize the data using a concept hierarchy: numeric values are replaced by ranges. Range example for salary: 0~20K, 21K~30K, 31K~40K, >= 41K.

In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans. A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets and store the support counts of the corresponding n-predicate sets.

[Cuboid lattice figure: (); (age), (income), (buys); (age, income), (age, buys), (income, buys); (age, income, buys).]

Page 46:

Quantitative Association Rule Mining

Numeric attributes are dynamically discretized such that the confidence or compactness of the mined rules is maximized.

2-D quantitative association rules: Aquan1 ^ Bquan2 → Ccat

Cluster “adjacent” association rules to form more general rules, using a 2-D grid. Example:
age(X, “34”) ^ income(X, “30K - 40K”) → buys(X, “TV”)
age(X, “34”) ^ income(X, “40K - 50K”) → buys(X, “TV”)
age(X, “35”) ^ income(X, “30K - 40K”) → buys(X, “TV”)
age(X, “35”) ^ income(X, “40K - 50K”) → buys(X, “TV”)
These cluster into the more compact rule:
age(X, “34-35”) ^ income(X, “30K - 50K”) → buys(X, “TV”)

Page 47:

ARCS (Association Rule Clustering System)

How does ARCS work?

1. Binning

2. Find frequent predicate set

3. Clustering

4. Optimize

Page 48:

Limitations of ARCS

Only quantitative attributes on the LHS of rules.
Only 2 attributes on the LHS (a 2-D limitation).

An alternative to ARCS: non-grid-based, equi-depth binning, with clustering based on a measure of partial completeness; see “Mining Quantitative Association Rules in Large Relational Tables” by R. Srikant and R. Agrawal.

Page 49:

Mining Distance-based Association Rules

Binning methods do not capture the semantics of interval data. Distance-based partitioning gives a more meaningful discretization by considering:
the density / “number of points” in an interval
the “closeness” of points in an interval

Price ($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based
7 | [0,10]: 1 | [7,20]: 2 | [7,7]: 1
20 | [11,20]: 1 | [22,50]: 2 | [20,22]: 2
22 | [21,30]: 1 | [51,53]: 2 | [50,53]: 3
50 | [31,40]: 0 | |
51 | [41,50]: 1 | |
53 | [51,60]: 2 | |

(Each bin is shown with its point count; the distance-based intervals group genuinely close prices.)

Page 50:

Chapter 5: Mining Association Rules in Large Databases

What is association rule mining

Mining single-dimensional Boolean association rules

Mining multilevel association rules

Mining multidimensional association rules

From association mining to correlation analysis

Summary

Page 51:

Interestingness Measurements

Objective measures: two popular measures are support and confidence.

Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).

Page 52:

Criticism of Support and Confidence (Example 1)

Example 1 (Aggarwal & Yu, PODS'98): 5000 students in total, analyzed in a 2-D space (playing basketball, eating cereal):
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal

The following association rules are found:
play basketball → eat cereal [40%, 66.7%]
play basketball → not eat cereal [20%, 33.3%]

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000

(The counts 3000, 3750, 2000, and 5000 are given; the remaining entries are derived.)

Page 53:

Criticism of Support and Confidence (cont.)

Rule 1: play basketball → eat cereal [40%, 66.7%]. Rule 1 suggests that a student who plays basketball likes eating cereal. This rule is misleading, because the overall percentage of students eating cereal is 75% (= 3750/5000), which is higher than 66.7%.

Rule 2: play basketball → not eat cereal [20%, 33.3%]. Rule 2 is far more accurate: it suggests that a student who plays basketball tends not to eat cereal.

Page 54:

Interestingness Measure: Correlation (Lift)

play basketball → eat cereal [40%, 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.

Measure of dependent/correlated events: lift

lift(B, C) = P(B ^ C) / (P(B) · P(C))

With B = basketball, C = cereal:
lift(B, C) = (2000/5000) / ((3000/5000) · (3750/5000)) ≈ 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) · (1250/5000)) ≈ 1.33

           | basketball | not basketball | sum (row)
cereal     | 2000       | 1750           | 3750
not cereal | 1000       | 250            | 1250
sum (col.) | 3000       | 2000           | 5000

Since lift(B, C) < 1 < lift(B, ¬C), a student who plays basketball tends not to eat cereal.
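A quick illustrative check of these numbers:

N = 5000
P_B, P_C, P_BC = 3000 / N, 3750 / N, 2000 / N         # basketball, cereal, both
P_notC, P_B_notC = 1250 / N, 1000 / N

print(round(P_BC / (P_B * P_C), 2))           # 0.89 < 1: negatively correlated
print(round(P_B_notC / (P_B * P_notC), 2))    # 1.33 > 1: positively correlated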

Page 55:

Criticism of Support and Confidence (Example 2)

X and Y are positively correlated; X and Z are negatively related (1: occurred, 0: not occurred):

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1

Rule | Support | Confidence
X => Y | 25% | 50%
X => Z | 37.50% | 75%

1) By support and confidence, Z appears more positively correlated with X than Y is.
2) This contradicts the actual correlations, so we need a measure of dependent or correlated events (cont.):

corr(A, B) = P(A ^ B) / (P(A) · P(B)) = P(B | A) / P(B) = P(A | B) / P(A)

lift(A, B) = conf(A → B) / sup(B) = conf(B → A) / sup(A)

Page 56:

Criticism of Support and Confidence (Example 2, cont.)

Correlation (lift) takes both P(A) and P(B) into consideration:

corr(A, B) = lift(A, B) = P(A ^ B) / (P(A) · P(B)), where support(A → B) = P(A ^ B)

P(A ^ B) = P(A) · P(B) if A and B are independent events. A and B are negatively correlated if corr(A, B) is less than 1, and positively correlated otherwise.

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1

P(X) = 1/2, P(Y) = 1/4, P(Z) = 7/8

Itemset | Support | corr (lift)
X, Y | 1/4 | 2
X, Z | 3/8 | 6/7
Y, Z | 1/8 | 4/7

So Y is positively correlated with X (corr = 2 > 1) while Z is negatively correlated with X (corr = 6/7 < 1), the opposite of what support and confidence suggested.
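The same values can be recomputed from the 0/1 occurrence table directly (illustrative sketch):

X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def p(*rows):
    # Probability that all the given attributes occur in the same transaction.
    return sum(all(r[i] for r in rows) for i in range(len(rows[0]))) / len(rows[0])

def corr(a, b):
    return p(a, b) / (p(a) * p(b))

print(corr(X, Y))   # 2.0: X and Y positively correlated
print(corr(X, Z))   # 0.857... = 6/7: X and Z negatively correlated
print(corr(Y, Z))   # 0.571... = 4/7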

Page 57:

Constraint-Based Mining

Interactive, exploratory mining of gigabytes of data: could it be real? Yes, by making good use of constraints!

What kinds of constraints can be used in mining?
Knowledge type constraints: specify the type of knowledge to be mined.
Data constraints: specify the relevant data, e.g., find product pairs sold together in Vancouver in Dec. '98.
Dimension (attribute) constraints: specify the desired dimensions, e.g., region, price, brand, customer category.
Rule constraints: specify the form of the rules, e.g., small sales (price < $10) trigger big sales (sum > $200).
Interestingness constraints: specify the thresholds, e.g., strong rules (min_support >= 3%, min_confidence >= 60%).

Page 58:

Rule Constraints in Association Mining

Two kinds of rule constraints:
Rule form constraints: meta-rule guided mining, e.g., P(x, y) ^ Q(x, w) → takes(x, “database systems”).
Rule content constraints: constraint-based query optimization (Ng, et al., SIGMOD'98), e.g., sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3.

1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
1-var: a constraint confining only one side (LHS or RHS) of the rule, e.g., as shown above.
2-var: a constraint confining both sides (LHS and RHS), e.g., sum(LHS) < min(RHS) and max(RHS) < 5 * sum(LHS).

Page 59:

Chapter 5: Mining Association Rules in Large Databases

What is association rule mining

Mining single-dimensional Boolean association rules

Mining multilevel association rules

Mining multidimensional association rules

From association mining to correlation analysis

Summary

Page 60:

Why Is the Big Pie Still There?

More on constraint-based mining of associations.
Boolean vs. quantitative associations: association on discrete vs. continuous data.
From association to correlation and causal structure analysis: association does not necessarily imply correlation or causal relationships.
From intra-transaction associations to inter-transaction associations: e.g., break the barriers of transactions (Lu, et al., TOIS'99).
From association analysis to classification and clustering analysis: e.g., clustering association rules.

Page 61:

Summary

Association rule mining is probably the most significant contribution from the database community to KDD.
A large number of papers have been published, and many interesting issues have been explored.
An interesting research direction: association analysis in other types of data, e.g., spatial data, multimedia data, time-series data.

Page 62:

Steps in the KDD Process (Technically)

Databases → Relevant Data → Data Preprocessing → Data Mining → Patterns → Evaluation/Presentation → Knowledge

Data mining is the core step of the KDD process.