
Frequent Pattern Mining

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 2

Transaction Data Analysis

•  Transactions: customers’ purchases of commodities –  {bread, milk, cheese} if they are bought together

•  Frequent patterns: product combinations that are frequently purchased together by customers

•  Frequent patterns: patterns (set of items, sequence, etc.) that occur frequently in a database [AIS93]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 3

Why Frequent Patterns?

•  What products were often purchased together?

•  What are the frequent subsequent purchases after buying an iPod?

•  What kinds of genes are sensitive to this new drug?

•  What key-word combinations are frequently associated with web pages about game-evaluation?

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 4

Why Frequent Pattern Mining?

•  Foundation for many data mining tasks – Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …

•  Broad applications – Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, …

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 5

Frequent Itemsets

•  Itemset: a set of items –  E.g., acm = {a, c, m}

•  Support of itemsets –  Sup(acm) = 3

•  Given min_sup = 3, acm is a frequent pattern

•  Frequent pattern mining: finding all frequent patterns in a database

TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n

Transaction database TDB

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 6

A Naïve Attempt

•  Generate all possible itemsets, test their supports against the database

•  How to hold a large number of itemsets in main memory? – 100 items → $2^{100} - 1$ possible itemsets

•  How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions? – A transaction of length 20 needs to update the support of $2^{20} - 1$ = 1,048,575 itemsets


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 7

Transactions in Real Applications

•  A large department store often carries more than 100 thousand different kinds of items – Amazon.com carries more than 17,000 books relevant to data mining

•  Walmart has more than 20 million transactions per day; AT&T produces more than 275 million calls per day

•  Mining large transaction databases of many items is a real demand

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 8

How to Get an Efficient Method?

•  Reducing the number of itemsets that need to be checked

•  Checking the supports of selected itemsets efficiently

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 9

Candidate Generation & Test

•  Any subset of a frequent itemset must also be frequent – an anti-monotonic property –  A transaction containing {beer, diaper, nuts} also contains {beer, diaper} –  {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent

•  In other words, any superset of an infrequent itemset must also be infrequent –  No superset of any infrequent itemset should be generated or tested –  Many item combinations can be pruned!

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 10

Apriori-Based Mining

•  Generate length (k+1) candidate itemsets from length k frequent itemsets, and

•  Test the candidates against DB

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 11

The Apriori Algorithm [AgSr94]

Database D (Min_sup = 2):
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D → 1-candidates:
Itemset  Sup
a        2
b        3
c        3
d        1
e        3

Freq 1-itemsets:
Itemset  Sup
a        2
b        3
c        3
e        3

2-candidates: ab, ac, ae, bc, be, ce

Scan D → counting:
Itemset  Sup
ab       1
ac       2
ae       1
bc       2
be       3
ce       2

Freq 2-itemsets:
Itemset  Sup
ac       2
bc       2
be       3
ce       2

3-candidates: bce

Scan D → Freq 3-itemsets:
Itemset  Sup
bce      2

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 12

The Apriori Algorithm

Level-wise, candidate generation and test
•  Ck: candidate itemsets of size k
•  Lk: frequent itemsets of size k

•  L1 = {frequent items};
•  for (k = 1; Lk != ∅; k++) do
   –  Ck+1 = candidates generated from Lk;   // candidate generation
   –  for each transaction t in the database do: increment the count of all candidates in Ck+1 that are contained in t   // test
   –  Lk+1 = candidates in Ck+1 with min_support
•  return ∪k Lk;
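As a concrete illustration of the level-wise loop above, here is a minimal Python sketch of Apriori. The example database is the slide's running example; the function and variable names are mine, and this is an illustrative sketch rather than an optimized implementation:

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise candidate generation and test (minimal sketch)."""
        transactions = [frozenset(t) for t in transactions]
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s for s, c in counts.items() if c >= min_sup}
        all_frequent = set(Lk)
        k = 1
        while Lk:
            # self-join Lk to form (k+1)-candidates, prune by the Apriori property
            Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            Ck1 = {c for c in Ck1
                   if all(frozenset(s) in Lk for s in combinations(c, k))}
            # test candidates against the database
            counts = {c: 0 for c in Ck1}
            for t in transactions:
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {c for c, n in counts.items() if n >= min_sup}
            all_frequent |= Lk
            k += 1
        return all_frequent

    # Example from the slide (min_sup = 2): finds {b, c, e} among others
    db = [["a","c","d"], ["b","c","e"], ["a","b","c","e"], ["b","e"]]
    print(sorted(tuple(sorted(s)) for s in apriori(db, 2)))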


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 13

Important Steps in Apriori

•  How to find frequent 1- and 2-itemsets?
•  How to generate candidates? – Step 1: self-joining Lk – Step 2: pruning
•  How to count supports of candidates?

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 14

Finding Frequent 1- & 2-itemsets

•  Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array –  Initialize c[item] = 0 for each item – For each transaction T, for each item in T, c[item]++ –  If c[item] >= min_sup, item is frequent

•  Finding frequent 2-itemsets using a 2-dimensional triangle matrix – For items i, j (i < j), c[i, j] is the count of itemset ij

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 15

Counting Array

•  A 2-dimensional triangle matrix can be implemented using a 1-dimensional array

Triangle matrix over n = 5 items (cell entries are positions in the 1-dimensional array):

     1  2  3  4  5
1       1  2  3  4
2          5  6  7
3             8  9
4                10
5

There are n items. For items i, j (i < j), c[i, j] = c[(i-1)(2n-i)/2 + j - i]
Example: c[3, 5] = c[(3-1)*(2*5-3)/2 + 5-3] = c[9]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 16
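A minimal sketch of this indexing scheme in Python (the function name is mine, not from the slides):

    def tri_index(i, j, n):
        """Map pair (i, j), 1 <= i < j <= n, to its 1-based position
        in the flattened upper-triangle counting array."""
        assert 1 <= i < j <= n
        return (i - 1) * (2 * n - i) // 2 + (j - i)

    n = 5
    print(tri_index(3, 5, n))  # 9, matching the slide's example c[3,5] = c[9]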

Example of Candidate-generation

•  L3 = {abc, abd, acd, ace, bcd}
•  Self-joining: L3*L3 – abcd ← abc * abd – acde ← acd * ace
•  Pruning: – acde is removed because ade is not in L3
•  C4 = {abcd}

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 17

How to Generate Candidates?

•  Suppose the items in Lk-1 are listed in an order
•  Step 1: self-join Lk-1

INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

•  Step 2: pruning –  For each itemset c in Ck do •  For each (k-1)-subset s of c: if (s is not in Lk-1) then delete c from Ck
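A minimal Python sketch of this join-and-prune step, with itemsets represented as sorted tuples (the L3 example is from the previous slide; the function name is mine):

    from itertools import combinations

    def gen_candidates(Lk_1, k):
        """Self-join L(k-1) on the first k-2 items, then Apriori-prune."""
        Ck = set()
        for p in Lk_1:
            for q in Lk_1:
                # join condition: equal (k-2)-prefixes, p's last item < q's last item
                if p[:k - 2] == q[:k - 2] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # prune: every (k-1)-subset of c must be frequent
                    if all(s in Lk_1 for s in combinations(c, k - 1)):
                        Ck.add(c)
        return Ck

    L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
    print(gen_candidates(L3, 4))  # {('a','b','c','d')}; acde is pruned since ade is not in L3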

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 18

How to Count Supports?

•  Why is counting supports of candidates a problem? –  The total number of candidates can be very huge –  One transaction may contain many candidates

•  Method –  Candidate itemsets are stored in a hash-tree –  A leaf node of the hash-tree contains a list of itemsets and counts –  An interior node contains a hash table –  Subset function: finds all the candidates contained in a transaction


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 19

Example: Counting Supports

[Figure: a hash-tree of candidate 3-itemsets, with the subset function hashing items into buckets 1,4,7 / 2,5,8 / 3,6,9; the transaction 1 2 3 5 6 is matched against the tree via the recursive splits 1 + 2 3 5 6, 1 2 + 3 5 6, and 1 3 + 5 6]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 20

Apriori in SQL

•  Impossible to get good performance out of pure SQL (SQL-92) based approaches alone – Support counting is costly

•  Make use of object-relational extensions like UDFs, BLOBs, table functions, etc. – Get orders of magnitude improvement

•  S. Sarawagi, S. Thomas, and R. Agrawal, 1998

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 21

Challenges of Freq Pat Mining

•  Multiple scans of the transaction database
•  Huge number of candidates
•  Tedious workload of support counting for candidates

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 22

Improving Apriori: Ideas

•  Reducing the number of transaction database scans

•  Shrinking the number of candidates •  Facilitating support counting of candidates

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 23

DIC: Reducing Number of Scans

[Figure: itemset lattice over {A, B, C, D}, and a timeline contrasting Apriori (transactions scanned for 1-itemsets, then 2-itemsets, …) with DIC, where the counting of 1-, 2-, and 3-itemsets overlaps]

•  Once both A and D are determined frequent, the counting of AD can begin
•  Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin

S. Brin, R. Motwani, J. Ullman, and S. Tsur, SIGMOD'97. DIC: Dynamic Itemset Counting

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 24

DHP: Reducing # of Candidates

•  A hashing bucket count < min_sup → every candidate in the bucket is infrequent – Candidates: a, b, c, d, e – Hash entries: {ab, ad, ae} {bd, be, de} … – Large 1-itemsets: a, b, d, e – The sum of counts of {ab, ad, ae} < min_sup → ab should not be a candidate 2-itemset

•  J. Park, M. Chen, and P. Yu, SIGMOD'95 – DHP: Direct Hashing and Pruning


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 25

A 2-Scan Method by Partitioning

•  Partition the database into n partitions, such that each partition can be held in main memory
•  Itemset X is frequent → X must be frequent in at least one partition –  Scan 1: partition the database and find local frequent patterns –  Scan 2: consolidate global frequent patterns
•  All local frequent itemsets can be held in main memory? A sometimes too strong assumption
•  A. Savasere, E. Omiecinski, and S. Navathe, VLDB'95

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 26

Sampling for Frequent Patterns

•  Select a sample of the original database, mine frequent patterns in the sample using Apriori

•  Scan database once more to verify frequent itemsets found in the sample, only borders of closure of frequent patterns are checked – Example: check abcd instead of ab, ac, …, etc.

•  Scan database again to find missed frequent patterns

•  H. Toivonen, VLDB’96

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 27

Eclat/MaxEclat and VIPER

•  Tid-list: the list of transaction-ids containing an itemset –  Vertical Data Format

•  Major operation: intersections of tid-lists
•  Compression of tid-lists –  Itemset A: t1, t2, t3; sup(A) = 3 –  Itemset B: t2, t3, t4; sup(B) = 3 –  Itemset AB: t2, t3; sup(AB) = 2

•  M. Zaki et al., 1997 •  P. Shenoy et al., 2000
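A minimal Python sketch of the tid-list intersection, using the tid-lists from the slide's example (the code itself is illustrative):

    def support(tidlist):
        return len(tidlist)

    # vertical data format: itemset -> set of transaction ids
    tid = {"A": {1, 2, 3}, "B": {2, 3, 4}}

    # the tid-list of AB is the intersection of the tid-lists of A and B
    tid_AB = tid["A"] & tid["B"]
    print(sorted(tid_AB), support(tid_AB))  # [2, 3] 2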

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 28

Bottleneck of Freq Pattern Mining

•  Multiple database scans are costly
•  Mining long patterns needs many scans and generates many candidates – To find frequent itemset i1i2…i100 •  # of scans: 100 •  # of candidates: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$
– Bottleneck: candidate generation and test
•  Can we avoid candidate generation?

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 29

Search Space of Freq. Pat. Mining

•  Itemsets form a lattice

[Figure: itemset lattice over {A, B, C, D}: {} — A, B, C, D — AB, AC, AD, BC, BD, CD — ABC, ABD, ACD, BCD — ABCD]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 30

Set Enumeration Tree

•  Use an order on items, enumerate itemsets in lexicographic order –  a, ab, abc, abcd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
•  Reduce a lattice to a tree

[Figure: set enumeration tree rooted at ∅, with children a, b, c, d; a's children ab, ac, ad; ab's children abc, abd; abc's child abcd; etc.]

Set enumeration tree


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 31

Borders of Frequent Itemsets

•  Frequent itemsets are connected – ∅ is trivially frequent – X on the border → every subset of X is frequent

[Figure: itemset lattice over {a, b, c, d} with a border separating the frequent itemsets from the infrequent ones]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 32

Projected Databases

•  To test whether Xy is frequent, we can use the X-projected database – The sub-database of transactions containing X – Check whether item y is frequent in the X-projected database

[Figure: set enumeration tree from ∅ to abcd]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 33

Compress Database by FP-tree

•  The 1st scan: find frequent items –  Only record frequent items in the FP-tree –  F-list: f-c-a-b-m-p
•  The 2nd scan: construct the tree –  Order the frequent items in each transaction w.r.t. the f-list –  Explore sharing among transactions

[Figure: FP-tree with header table (f, c, a, b, m, p) and node-links; paths root–f:4–c:3–a:3–m:2–p:2, root–f:4–c:3–a:3–b:1–m:1, root–f:4–b:1, and root–c:1–b:1–p:1]

TID  Items bought            (ordered) freq items
100  f, a, c, d, g, i, m, p  f, c, a, m, p
200  a, b, c, f, l, m, o     f, c, a, b, m
300  b, f, h, j, o           f, b
400  b, c, k, s, p           c, b, p
500  a, f, c, e, l, p, m, n  f, c, a, m, p
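A minimal Python sketch of this two-scan construction (the transaction database is the slide's example; class and variable names are mine, and ties in the f-list may order differently than on the slide):

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}          # item -> FPNode

    def build_fptree(transactions, min_sup):
        # 1st scan: find frequent items and the f-list (frequency-descending order)
        freq = {}
        for t in transactions:
            for item in t:
                freq[item] = freq.get(item, 0) + 1
        flist = [i for i, c in sorted(freq.items(), key=lambda x: -x[1]) if c >= min_sup]
        rank = {item: r for r, item in enumerate(flist)}
        # 2nd scan: insert each transaction's frequent items, ordered by the f-list
        root = FPNode(None, None)
        for t in transactions:
            path = sorted((i for i in t if i in rank), key=rank.get)
            node = root
            for item in path:
                node = node.children.setdefault(item, FPNode(item, node))
                node.count += 1
        return root, flist

    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
    root, flist = build_fptree(db, 3)
    print(flist)  # f and c (support 4) precede a, b, m, p (support 3); tie order may vary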

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 34

Benefits of FP-tree

•  Completeness –  Never breaks a long pattern in any transaction –  Preserves complete information for frequent pattern mining •  No need to scan the database anymore

•  Compactness –  Reduces irrelevant info — infrequent items are removed –  Items in frequency-descending order (f-list): the more frequently occurring, the more likely to be shared –  Never larger than the original database (not counting node-links and the count fields)

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 35

Partitioning Frequent Patterns

•  Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p – Patterns containing p – Patterns having m but no p – … – Patterns having c but none of a, b, m, or p – Pattern f

•  Depth-first search of a set enumeration tree – The partitioning is complete and does not have any overlap

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 36

Find Patterns Having Item "p"

•  Only transactions containing p are needed
•  Form the p-projected database – Start at entry p of the header table – Follow the side-link of frequent item p – Accumulate all transformed prefix paths of p

[Figure: the FP-tree from the previous slide, with the side-links of p traversed from the header table]

p-projected database TDB|p: fcam: 2, cb: 1
Local frequent item: c:3
Frequent patterns containing p: p: 3, pc: 3


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 37

Find Pat Having Item m But No p

•  Form the m-projected database TDB|m –  Item p is excluded (why?) – It contains fca:2, fcab:1 – Local frequent items: f, c, a
•  Build an FP-tree for TDB|m

[Figure: the full FP-tree with header table (f, c, a, b, m, p) on the left, and the m-projected FP-tree on the right: header table f, c, a; single path root–f:3–c:3–a:3]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 38

Recursive Mining

•  Patterns having m but no p can be mined recursively

•  Optimization: enumerate patterns from a single-branch FP-tree – Enumerate all combinations – Support = that of the last item •  m, fm, cm, am •  fcm, fam, cam •  fcam

[Figure: the m-projected FP-tree — a single branch root–f:3–c:3–a:3 with header table f, c, a]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 39

Enumerate Patterns From Single Prefix of FP-tree

•  A (projected) FP-tree has a single prefix – Reduce the single prefix into one node – Join the mining results of the two parts

[Figure: an FP-tree r whose single prefix path a1:n1–a2:n2–a3:n3 leads to a branching part (b1:m1, c1:k1, c2:k2, c3:k3); r is decomposed into the prefix path plus the branching subtree r1, and their mining results are joined]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 40

FP-growth

•  Pattern-growth: recursively grow frequent patterns by pattern and database partitioning

•  Algorithm –  For each frequent item, construct its projected database, and then its projected FP-tree –  Repeat the process on each newly created projected FP-tree –  Until the resulting FP-tree is empty, or contains only one path – a single path generates all the combinations, each of which is a frequent pattern
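A compact recursive sketch of this pattern-growth loop in Python. For brevity it projects transaction lists directly rather than building physical FP-trees, so it illustrates the recursion, not the slides' exact implementation; all names are mine:

    def fpgrowth(transactions, min_sup, suffix=()):
        """Recursively grow frequent patterns by database projection."""
        # count items in the current (projected) database
        freq = {}
        for t in transactions:
            for item in set(t):
                freq[item] = freq.get(item, 0) + 1
        for item, cnt in freq.items():
            if cnt < min_sup:
                continue
            pattern = suffix + (item,)
            yield pattern, cnt
            # project: keep only the items "before" item (w.r.t. a fixed order)
            # in the transactions that contain item
            proj = [[i for i in t if freq.get(i, 0) >= min_sup and i < item]
                    for t in transactions if item in t]
            yield from fpgrowth([t for t in proj if t], min_sup, pattern)

    db = [list("acd"), list("bce"), list("abce"), list("be")]
    print(sorted(fpgrowth(db, 2)))  # includes (('e','c','b'), 2), i.e. bce with support 2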

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 41

Scaling up by DB Projection

•  What if an FP-tree cannot fit into memory?
•  Database projection – Partition a database into a set of projected databases – Construct and mine the FP-tree once the projected database can fit into main memory
•  Heuristic: projected databases shrink quickly in many applications

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 42

Parallel vs. Partition Projection

•  Parallel projection: form all projected databases at a time
•  Partition projection: propagate projections

[Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} projected into the p-proj DB (fcam, cb, fcam), m-proj DB (fcab, fca, fca), b-proj DB (f, cb, …), a-proj DB (fc, …), c-proj DB (f, …), and f-proj DB (…); the m-projected database is further projected into the am-proj DB (fc, fc, fc) and cm-proj DB (f, f, f)]


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 43

Why Is FP-growth Efficient?

•  Divide-and-conquer strategy – Decompose both the mining task and the DB – Leads to focused search of smaller databases

•  Other factors – No candidate generation or candidate test – Database compression using the FP-tree – No repeated scan of the entire database – Basic operations – counting local frequent items and building FP-trees; no pattern search or pattern matching

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 44

Major Costs in FP-growth

•  Poor locality of FP-trees – Low cache hit rate

•  Building FP-trees – A stack of FP-trees

•  Redundant information – Transaction abcd appears in the a-, ab-, abc-, ac-, …, c-projected databases and FP-trees

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 45

Improving Locality

•  Store FP-trees in a pre-order depth-first traversal list

Ghoting et al., VLDB'05

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 46

H-Mine

•  Goal: efficient in various occasions – Dense vs. sparse, huge vs. memory-based data sets
•  Moderate in space requirement
•  Highlights

– Effective and efficient memory-based structure and mining algorithm

– Scalable algorithm for mining large databases by proper partitioning

–  Integration of H-mine and FP-growth

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 47

H-Structure

•  Store frequent-item projections in main memory –  Items in a transaction are sorted according to the f-list –  Each frequent item in a transaction is stored with two fields: item-id and hyper-link –  Header table H •  Links transactions with the same first item •  Scan the database once

Tid  Items             Freq-item projection
100  c, d, e, f, g, i  c, d, e, g
200  a, c, d, e, m     a, c, d, e
300  a, b, d, e, g, k  a, d, e, g
400  a, c, d, h        a, c, d

F-list = a-c-d-e-g

[Figure: the H-struct — header table H with items a:3, c:3, d:4, e:3, g:2 and hyper-links threading the four frequent-item projections]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 48

Find Patterns Containing Item “a”

•  Only search the a-projected database: transactions containing "a"
•  The a-queue links all transactions in the a-projected database – Can be traversed efficiently

[Figure: the H-struct with the a-queue highlighted — the hyper-links from header entry a link transactions 200, 300, and 400]


Mining a-Projected Database

•  Build the a-header table Ha
•  Traverse the a-queue once, find all local frequent items within the a-projected database –  Local freq items: c, d, and e –  Patterns: ac, ad, and ae
•  Link transactions having the same next frequent item

[Figure: the H-struct with the a-header table Ha (c:2, d:3, e:2, g:1) built over the a-projected transactions]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 49

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 50

Why Is H-Mine(Mem) Efficient?

•  No candidate generation –  It is a pattern-growth method
•  Search confined in a dedicated space – Does not physically construct memory structures for projected databases – The H-struct is used for all the mining –  Information about projected databases is collected in header tables
•  No frequent patterns stored in main memory

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 51

Mining in Large Databases

•  What if the H-struct is too big for memory?
•  Find global frequent items
•  Partition the database into n parts – The H-struct for each part can be held in memory – Mine local patterns in each part using H-mine(Mem) •  Use a relative minimum support threshold
•  Consolidate global patterns in the third scan

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 52

How to Partition in H-mine?

•  Partitioning in H-mine is straightforward – The overhead of header tables in H-mine(Mem) is small and predictable – Partitioning with Apriori is not easy •  Hard to predict the space requirement of Apriori

•  Global frequent items prune many local patterns in skewed partitions

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 53

Mining Dense Projected DB’s

•  Challenges in dense datasets –  Long patterns –  Some patterns appearing in many transactions
•  After projection, projected databases are denser
•  Advantages of FP-tree –  Compresses dense databases many times –  No candidate generation –  Sub-patterns can be enumerated from long patterns
•  Build FP-trees for dense projected databases –  Empirical switching point: 1%

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 54

Advantages of H-Mine

•  Have very small space overhead
•  Absorb nice features of FP-growth
•  Create no physical projected database
•  Watch the density of projected databases, turn to FP-growth when necessary
•  Propose space-preserving mining – Scalable in very large databases – Feasible even with very small memory – Go beyond frequent pattern mining


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 55

Further Developments

•  OP – opportunistic projection (LPWH02) – Opportunistically choose between array-based and tree-based representations of projected databases

•  Diffsets for vertical mining (ZaGo03) – Only record the differences in the tids of a candidate pattern from its generating frequent patterns

Effectiveness of Freq Pat Mining

•  Too many patterns! – A pattern a1a2…an contains $2^n - 1$ subpatterns – Understanding many patterns is difficult or even impossible for human users
•  Non-focused mining

– A manager may be only interested in patterns involving some items (s)he manages

– A user is often interested in patterns satisfying some constraints

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 56

Itemset Lattice

[Figure: itemset lattice over {A, B, C, D}]

Tid  Transaction
10   ABD
20   ABC
30   AD
40   ABCD
50   CD

Min_sup = 2

Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 57

Max-Patterns

[Figure: itemset lattice over {A, B, C, D} with the max-patterns highlighted]

Tid  Transaction
10   ABD
20   ABC
30   AD
40   ABCD
50   CD

Min_sup = 2

Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 58

Borders and Max-patterns

•  Max-patterns: borders of frequent patterns – Any subset of a max-pattern is frequent – Any superset of a max-pattern is infrequent – Cannot generate rules

[Figure: itemset lattice over {A, B, C, D}]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 59

MaxMiner: Mining Max-patterns

•  1st scan: find frequent items – A, B, C, D, E

•  2nd scan: find supports for the potential max-patterns – AB, AC, AD, AE, ABCDE – BC, BD, BE, BCDE – CD, CE, CDE, DE
•  Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in a later scan
•  Bayardo, SIGMOD'98

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

Min_sup = 2

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 60


Patterns and Support Counts

[Figure: itemset lattice annotated with support counts — A:4, B:4, C:3, D:4; AB:3, AC:2, AD:3, BC:2, BD:2, CD:2; ABC:2, ABD:2]

Tid  Transaction
10   ABD
20   ABC
30   AD
40   ABCD
50   CD

Min_sup = 2

Len  Frequent itemsets
1    A:4, B:4, C:3, D:4
2    AB:3, AC:2, AD:3, BC:2, BD:2, CD:2
3    ABC:2, ABD:2

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 61

Frequent Closed Patterns

•  For frequent itemset X, if there exists no item y not in X s.t. every transaction containing X also contains y, then X is a frequent closed pattern – “acdf” is a frequent closed pattern

•  Concise representation of frequent patterns – Can generate non-redundant rules
•  Reduces the # of patterns and rules
•  N. Pasquier et al., ICDT'99

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

Min_sup = 2

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 62

CLOSET for Frequent Closed Patterns

•  F-list: list of all frequent items in support-ascending order –  F-list: d-a-f-e-c
•  Divide the search space –  Patterns having d –  Patterns having a but no d, etc.
•  Find frequent closed patterns recursively –  Every transaction having d also has cfa → cfad is a frequent closed pattern
•  PHM'00

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

Min_sup = 2

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 63

The CHARM Method

•  Use the vertical data format: t(AB) = {T1, T12, …}
•  Derive closed patterns based on vertical intersections –  t(X) = t(Y): X and Y always happen together –  t(X) ⊂ t(Y): a transaction having X always has Y
•  Use diffsets to accelerate mining – Only keep track of differences of tids –  t(X) = {T1, T2, T3}, t(Xy) = {T1, T3} – Diffset(Xy, X) = {T2}

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 64

Closed and Max-patterns

•  Closed pattern mining algorithms can be adapted to mine max-patterns – A max-pattern must be closed

•  Depth-first search methods have advantages over breadth-first search ones – Why?

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 65

Condensed Freq Pattern Base

•  Practical observation: in many applications, a good approximation of the support count could be good enough –  Support = 10000 → support in range 10000 ± 1%

•  Making frequent pattern mining more realistic –  A small deviation has a minor effect on analysis –  A condensed FP-base leads to more effective mining –  Computing a condensed FP-base may lead to more efficient mining

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 66


Condensed FP-base Mining

•  Compute a condensed FP-base with a guaranteed maximal error bound.

•  Given: a transaction database, a user-specified support threshold, and a user-specified error bound

•  Find a subset of frequent patterns and a function to –  Determine whether a pattern is frequent –  Determine the support range

•  Pei et al., ICDM'02

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 67

An Example

Support threshold: min_sup = 1; error bound: k = 2

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 68

Another Base

Support threshold: min_sup = 1; error bound: k = 2

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 69

Approximation Functions

•  NOT unique – Different condensed FP-bases have different approximation functions

•  Optimization on space requirement – The less space required, the better the compression effect – Compression ratio

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 70

Constraint-based Data Mining

•  Find all the patterns in a database autonomously? –  The patterns could be too many but not focused!

•  Data mining should be interactive –  The user directs what is to be mined

•  Constraint-based mining –  User flexibility: provides constraints on what is to be mined –  System optimization: push constraints for efficient mining

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 71

Constraints in Data Mining

•  Knowledge type constraint –  classification, association, etc.

•  Data constraint — using SQL-like queries –  find product pairs sold together in stores in New York

•  Dimension/level constraint –  in relevance to region, price, brand, customer category

•  Rule (or pattern) constraint –  small sales (price < $10) triggers big sales (sum >$200)

•  Interestingness constraint –  strong rules: support and confidence

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 72


Constrained Mining vs. Search

•  Constrained mining vs. constraint-based search –  Both aim at reducing the search space –  Finding all patterns vs. some (or one) answers satisfying the constraints –  Constraint-pushing vs. heuristic search –  An interesting research problem on integrating both

•  Constrained mining vs. DBMS query processing –  Database query processing requires finding all answers –  Constrained pattern mining shares a similar philosophy as pushing selections deeply into query processing

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 73

Optimization

•  Mining frequent patterns with constraint C –  Sound: only find patterns satisfying the constraint C –  Complete: find all patterns satisfying the constraint C

•  A naïve solution –  Constraint test as a post-processing step

•  More efficient approaches –  Analyze the properties of constraints –  Push constraints as deeply as possible into frequent pattern mining

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 74

Anti-Monotonicity

•  Anti-monotonicity – If an itemset S violates the constraint, so does any of its supersets – sum(S.Price) ≤ v is anti-monotone – sum(S.Price) ≥ v is not anti-monotone
•  Example – C: range(S.profit) ≤ 15 –  Itemset ab violates C – So does every superset of ab

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 75

Anti-monotonic Constraints

Constraint                      Antimonotone
v ∈ S                           no
S ⊇ V                           no
S ⊆ V                           yes
min(S) ≤ v                      no
min(S) ≥ v                      yes
max(S) ≤ v                      yes
max(S) ≥ v                      no
count(S) ≤ v                    yes
count(S) ≥ v                    no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no
range(S) ≤ v                    yes
range(S) ≥ v                    no
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
support(S) ≥ ξ                  yes
support(S) ≤ ξ                  no

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 76

Monotonicity

•  Monotonicity – If an itemset S satisfies the constraint, so does any of its supersets – sum(S.Price) ≥ v is monotone – min(S.Price) ≤ v is monotone
•  Example – C: range(S.profit) ≥ 15 –  Itemset ab satisfies C – So does every superset of ab

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 77

Monotonic Constraints

Constraint                      Monotone
v ∈ S                           yes
S ⊇ V                           yes
S ⊆ V                           no
min(S) ≤ v                      yes
min(S) ≥ v                      no
max(S) ≤ v                      no
max(S) ≥ v                      yes
count(S) ≤ v                    no
count(S) ≥ v                    yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      yes
range(S) ≤ v                    no
range(S) ≥ v                    yes
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
support(S) ≥ ξ                  no
support(S) ≤ ξ                  yes

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 78


Converting “Tough” Constraints

•  Convert tough constraints into anti-monotone or monotone by properly ordering items

•  Examine C: avg(S.profit) ≥ 25 –  Order items in value-descending order •  <a, f, g, d, b, h, c, e> –  If an itemset afb violates C •  So do afbh, afb* •  It becomes anti-monotone!

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 79

Convertible Constraints

•  Let R be an order of items
•  Convertible anti-monotone –  If an itemset S violates a constraint C, so does every itemset having S as a prefix w.r.t. R –  Ex. avg(S) ≥ v w.r.t. item-value-descending order
•  Convertible monotone –  If an itemset S satisfies constraint C, so does every itemset having S as a prefix w.r.t. R –  Ex. avg(S) ≥ v w.r.t. item-value-ascending order

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 80

Strongly Convertible Constraints

•  avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e> –  If an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd
•  avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a> –  If an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix
•  Thus, avg(X) ≥ 25 is strongly convertible

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 81

Convertible Constraints

Constraint                                        Convertible anti-monotone  Convertible monotone  Strongly convertible
avg(S) ≤ v, ≥ v                                   Yes                        Yes                   Yes
median(S) ≤ v, ≥ v                                Yes                        Yes                   Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)   Yes                        No                    No
sum(S) ≤ v (items could be of any value, v ≤ 0)   No                         Yes                   No
sum(S) ≥ v (items could be of any value, v ≥ 0)   No                         Yes                   No
sum(S) ≥ v (items could be of any value, v ≤ 0)   Yes                        No                    No

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 82

Can Apriori Handle Convertible Constraints?

•  A convertible constraint that is neither monotone, nor anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm –  Within the level-wise framework, no direct pruning based on the constraint can be made –  Itemset df violates constraint C: avg(X) ≥ 25 –  Since adf satisfies C, Apriori needs df to assemble adf; df cannot be pruned
•  But it can be pushed into the frequent-pattern growth framework!

Item  Value
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 83

Mining With Convertible Constraints

•  C: avg(S.profit) ≥ 25
•  List the items in every transaction in value-descending order R: –  <a, f, g, d, b, h, c, e> –  C is convertible anti-monotone w.r.t. R
•  Scan the transaction DB once –  Remove infrequent items •  Item h in transaction 40 is dropped –  Itemsets a and f are good

TDB (min_sup = 2)
TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e

Item  Profit
a     40
f     30
g     20
d     10
b     0
h     -10
c     -20
e     -30

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 84


Not Every Pattern Is Interesting!

•  Trivial patterns – Pregnant → Female [100% confidence]

•  Misleading patterns – Play basketball → eat cereal [40%, 66.7%]

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 85

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000

Evaluation Criteria

•  Objective interestingness measures – Examples: support, patterns formed by mutually independent items – Domain-independent

•  Subjective measures – Examples: domain knowledge, templates/constraints

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 86

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 87

Correlation and Lift

•  P(B|A)/P(B) is called the lift of rule A → B
•  Play basketball → eat cereal (lift: 0.89)
•  Play basketball → not eat cereal (lift: 1.33)

$corr_{A,B} = \frac{P(A \cup B)}{P(A)P(B)} = \frac{P(AB)}{P(A)P(B)}$

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000

Contingency table (textbook excerpt, Chapter 6, Association Analysis):

Table 6.7. A 2-way contingency table for variables A and B.

      B     ~B
A     f11   f10   f1+
~A    f01   f00   f0+
      f+1   f+0   N

counts tabulated in a contingency table. Table 6.7 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation ~A (~B) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Limitations of the Support-Confidence Framework. The existing association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support was previously described in Section 6.8, in which many potentially interesting patterns involving low-support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 6.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown in Table 6.8.

Table 6.8. Beverage preferences among a group of 1000 people.

         Coffee  ~Coffee
Tea      150     50       200
~Tea     650     150      800
         800     200      1000
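A quick Python check of the lift numbers quoted on the slide (the helper name is mine; the counts are from the basketball/cereal contingency table above):

    def lift(n_xy, n_x, n_y, n):
        """lift(X -> Y) = P(Y|X) / P(Y)."""
        return (n_xy / n_x) / (n_y / n)

    n = 5000
    print(round(lift(2000, 3000, 3750, n), 2))  # basketball -> cereal: 0.89
    print(round(lift(1000, 3000, 1250, n), 2))  # basketball -> not cereal: 1.33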

Property of Lift

•  If A and B are independent, lift = 1
•  If A and B are positively correlated, lift > 1
•  If A and B are negatively correlated, lift < 1
•  Limitation: lift is sensitive to P(A) and P(B)

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 88

Table 6.9. Contingency tables for the word pairs {p, q} and {r, s}.

      p    ~p           |       r    ~r
q     880  50    930    | s     20   50    70
~q    50   20    70     | ~s    50   880   930
      930  70    1000   |       70   930   1000

This equation follows from the standard approach of using simple fractions as estimates for probabilities. The fraction f11/N is an estimate for the joint probability P(A, B), while f1+/N and f+1/N are the estimates for P(A) and P(B), respectively. If A and B are statistically independent, then P(A, B) = P(A) × P(B), thus leading to the formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as follows:

$I(A, B) \begin{cases} = 1, & \text{if } A \text{ and } B \text{ are independent;} \\ > 1, & \text{if } A \text{ and } B \text{ are positively correlated;} \\ < 1, & \text{if } A \text{ and } B \text{ are negatively correlated.} \end{cases}$   (6.7)

For the tea-coffee example shown in Table 6.8, I = 0.15/(0.2 × 0.8) = 0.9375, thus suggesting a slight negative correlation between tea drinkers and coffee drinkers.

Limitations of Interest Factor. We illustrate the limitation of interest factor with an example from the text mining domain. In the text domain, it is reasonable to assume that the association between a pair of words depends on the number of documents that contain both words. For example, because of their stronger association, we expect the words data and mining to appear together more frequently than the words compiler and mining in a collection of computer science articles.

Table 6.9 shows the frequency of occurrences between two pairs of words, {p, q} and {r, s}. Using the formula given in Equation 6.5, the interest factor for {p, q} is 1.02 and for {r, s} is 4.08. These results are somewhat troubling for the following reasons. Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is the value when p and q are statistically independent. On the other hand, the interest factor for {r, s} is higher than that for {p, q} even though r and s seldom appear together in the same document. Confidence is perhaps the better choice in this situation because it considers the association between p and q (94.6%) to be much stronger than that between r and s (28.6%).

lift(p, q) < lift(r, s)!

Leverage

•  The difference between the observed and expected joint probability of XY assuming X and Y are independent

•  An “absolute” measure of the surprisingness of a rule – Should be used together with lift

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 89

$leverage(X \to Y) = P(XY) - P(X)P(Y)$

Conviction

•  The expected error of a rule

•  Consider not only the joint distribution of X and Y

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 90

$conv(X \to Y) = \frac{P(X)P(\bar{Y})}{P(X\bar{Y})} = \frac{1}{lift(X \to \bar{Y})}$


Odds Ratio

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 91

$odds(Y \mid X) = \frac{P(XY)/P(X)}{P(X\bar{Y})/P(X)} = \frac{P(XY)}{P(X\bar{Y})}$

$odds(Y \mid \bar{X}) = \frac{P(\bar{X}Y)/P(\bar{X})}{P(\bar{X}\bar{Y})/P(\bar{X})} = \frac{P(\bar{X}Y)}{P(\bar{X}\bar{Y})}$

$oddsratio(X \to Y) = \frac{odds(Y \mid X)}{odds(Y \mid \bar{X})} = \frac{P(XY) \cdot P(\bar{X}\bar{Y})}{P(X\bar{Y}) \cdot P(\bar{X}Y)}$

χ2

•  Suppose attribute A has c distinct values a1, …, ac and attribute B has r distinct values b1, …, br

•  The χ² value (the Pearson χ² statistic) is

$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$

– oij and eij are the observed frequency and the expected frequency, respectively, of the joint event aibj

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 92

Example

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 93

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000

$\chi^2 = \frac{(2000-2250)^2}{2250} + \frac{(1750-1500)^2}{1500} + \frac{(1000-750)^2}{750} + \frac{(250-500)^2}{500} = 277.8$

•  The χ² value is greater than 1
•  count(basketball, cereal) = 2000 < expectation (2250) → playing basketball and eating cereal are negatively correlated
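The same computation as a short Python check (a sketch; the observed/expected values are those from the table above):

    observed = [2000, 1750, 1000, 250]
    expected = [2250, 1500,  750, 500]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(round(chi2, 1))  # 277.8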

Φ-coefficient

•  –1: if A and B are perfectly negatively correlated
•  1: if A and B are perfectly positively correlated
•  0: if A and B are statistically independent
•  Drawback: the Φ-coefficient puts the same weight on co-occurrence and co-absence

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 94

$\phi = \frac{P(AB)P(\bar{A}\bar{B}) - P(A\bar{B})P(\bar{A}B)}{\sqrt{P(A)P(B)P(\bar{A})P(\bar{B})}}$

IS Measure

•  Biased toward frequent co-occurrence
•  Equivalent to cosine similarity for binary variables (bit vectors)
•  The geometric mean of the confidences of the two rules between a pair of binary random variables
•  Drawback: the value depends on P(A) and P(B) – A similar drawback to lift

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 95

$IS(A,B) = \sqrt{lift(A,B) \cdot P(A,B)} = \frac{P(A,B)}{\sqrt{P(A)P(B)}}$

$IS(A,B) = \sqrt{\frac{P(A,B)}{P(A)} \cdot \frac{P(A,B)}{P(B)}} = \sqrt{conf(A \to B) \cdot conf(B \to A)}$

More Measures

•  All-confidence: min{ P(A|B), P(B|A) }
•  Max-confidence: max{ P(A|B), P(B|A) }
•  The Kulczynski measure: ½ (P(A|B) + P(B|A))

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 96


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 97

Comparing Measures

Contingency table:
            Milk   No Milk  Sum (row)
Coffee      m, c   ~m, c    c
No Coffee   m, ~c  ~m, ~c   ~c
Sum (col.)  m      ~m       Σ

Transaction databases and their contingency tables (textbook excerpt):

Table 6.9: Comparison of six pattern evaluation measures using contingency tables for a variety of data sets.

Data Set  mc      ~mc    m~c      ~m~c     χ2     lift   all conf.  max conf.  Kulc.  cosine
D1        10,000  1,000  1,000    100,000  90557  9.26   0.91       0.91       0.91   0.91
D2        10,000  1,000  1,000    100      0      1      0.91       0.91       0.91   0.91
D3        100     1,000  1,000    100,000  670    8.44   0.09       0.09       0.09   0.09
D4        1,000   1,000  1,000    100,000  24740  25.75  0.5        0.5        0.5    0.5
D5        1,000   100    10,000   100,000  8173   9.18   0.09       0.91       0.5    0.29
D6        1,000   10     100,000  100,000  965    1.97   0.01       0.99       0.5    0.10

In D1 and D2, the four null-invariant measures indicate that m and c are positively associated in both data sets by producing a measure value of 0.91. However, lift and χ2 generate dramatically different measure values for D1 and D2 due to their sensitivity to ~m~c. In fact, in many real-world scenarios ~m~c is usually huge and unstable. For example, in a market basket database, the total number of transactions could fluctuate on a daily basis and overwhelmingly exceed the number of transactions containing any particular itemset. Therefore, a good interestingness measure should not be affected by transactions that do not contain the itemsets of interest; otherwise, it would generate unstable results as illustrated in D1 and D2.

Similarly, in D3, the four new measures correctly show that m and c are strongly negatively associated because the ratio of mc to c equals the ratio of mc to m, that is, 100/1100 = 9.1%. However, lift and χ2 both contradict this in an incorrect way: their values for D3 are between those for D1 and D2.

For data set D4, both lift and χ2 indicate a highly positive association between m and c, whereas the others indicate a "neutral" association because the ratio of mc to ~mc equals the ratio of mc to m~c, which is 1. This means that if a customer buys coffee (or milk), the probability that she will also purchase milk (or coffee) is exactly 50%.

"Why are lift and χ2 so poor at distinguishing pattern association relationships in the above transactional data sets?" To answer this, we have to consider the null-transactions. A null-transaction is a transaction that does not contain any of the itemsets being examined. In our example, ~m~c represents the number of null-transactions. Lift and χ2 have difficulty distinguishing interesting pattern association relationships because they are both strongly influenced by ~m~c. Typically, the number of null-transactions can outweigh the number of individual purchases, because many people may buy neither milk nor coffee. On the other hand, the other four measures are good indicators of interesting pattern associations because their definitions remove the influence of ~m~c (that is, they are not influenced by the number of null-transactions).

The above discussion shows that it is highly desirable to have a measure whose value is independent of the number of null-transactions. A measure is null-invariant if its value is free from the influence of null-transactions.

χ2 and lift do not perform well on those data sets, since they are sensitive to ~m~c

Imbalance Ratio

•  Assess the imbalance of two itemsets A and B in rule implications

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 98

$IR(A,B) = \frac{|P(A) - P(B)|}{P(A) + P(B) - P(A \cup B)}$

Properties of Measures

•  Symmetry: is M(A→B) = M(B→A)?
•  Null-transaction dependence (null addition invariance): is ~A~B used in the measure?
•  Inversion invariance: the value does not change if f11 and f10 are exchanged with f00 and f01
•  Scaling invariance: does the measure remain unchanged if the contingency table [f11, f10, f01, f00] is changed to [k1k3 f11, k2k3 f10, k1k4 f01, k2k4 f00]?

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 99

Measuring 3 Random Variables

•  3 dimensional contingency table

•  For a k-itemset {i1, i2, …, ik}, the condition for statistical independence is

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 100

(Textbook excerpt:)

Table 6.18. Example of a three-dimensional contingency table.

(c)          b     ~b            (~c)         b     ~b
a            f111  f101  f1+1    a            f110  f100  f1+0
~a           f011  f001  f0+1    ~a           f010  f000  f0+0
             f+11  f+01  f++1                 f+10  f+00  f++0

An entry such as f1+1 is the number of transactions that contain a and c, irrespective of whether b is present in the transaction.

Given a k-itemset {i1, i2, …, ik}, the condition for statistical independence can be stated as follows:

$f_{i_1 i_2 \ldots i_k} = \frac{f_{i_1+\ldots+} \times f_{+i_2 \ldots +} \times \cdots \times f_{++\ldots i_k}}{N^{k-1}}$   (6.12)

With this definition, we can extend objective measures such as interest factor and PS, which are based on deviations from statistical independence, to more than two variables:

$I = \frac{N^{k-1} \times f_{i_1 i_2 \ldots i_k}}{f_{i_1+\ldots+} \times f_{+i_2 \ldots +} \times \cdots \times f_{++\ldots i_k}}$

$PS = \frac{f_{i_1 i_2 \ldots i_k}}{N} - \frac{f_{i_1+\ldots+} \times f_{+i_2 \ldots +} \times \cdots \times f_{++\ldots i_k}}{N^k}$

Another approach is to define the objective measure as the maximum, minimum, or average value of the associations between pairs of items in a pattern. For example, given a k-itemset X = {i1, i2, …, ik}, we may define the φ-coefficient for X as the average φ-coefficient between every pair of items (ip, iq) in X. However, because the measure considers only pairwise associations, it may not capture all the underlying relationships within a pattern.

Analysis of multidimensional contingency tables is more complicated because of the presence of partial associations in the data. For example, some associations may appear or disappear when conditioned upon the values of certain variables. This problem is known as Simpson's paradox and is described in the next section. More sophisticated statistical techniques are available to analyze such relationships, e.g., loglinear models, but these techniques are beyond the scope of this book.

Measuring More Random Variables

•  Some measures based on statistical independence, such as lift (interest factor) and PS, can be extended:

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 101

$I = \frac{N^{k-1} f_{i_1 i_2 \cdots i_k}}{f_{i_1+\cdots+} f_{+i_2 \cdots +} \cdots f_{+\cdots+ i_k}}$

$PS = \frac{f_{i_1 i_2 \cdots i_k}}{N} - \frac{f_{i_1+\cdots+} f_{+i_2 \cdots +} \cdots f_{+\cdots+ i_k}}{N^k}$

Simpson’s Paradox

•  A trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data – Also known as the Yule-Simpson effect – Often encountered in social-science and medical-science statistics – Particularly confounding when frequency data are unduly given causal interpretations

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 102


Kidney Stone Treatment Example

•  Which treatment, A or B, is better?

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 103

              Treatment A        Treatment B
Small stones  G1: 81/87 = 93%    G2: 234/270 = 87%
Large stones  G3: 192/263 = 73%  G4: 55/80 = 69%
Overall       273/350 = 78%      289/350 = 83%

Berkeley Gender Bias Case

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 104

        Applicants  Admitted
Men     8442        44%
Women   4321        35%

            Men                    Women
Department  Applicants  Admitted   Applicants  Admitted
A           825         62%        108         82%
B           560         63%        25          68%
C           325         37%        593         34%
D           417         33%        375         35%
E           191         28%        393         24%
F           272         6%         341         7%

Fisher Exact Test

•  Directly test whether a rule X → Y is productive by comparing its confidence with those of its generalizations W → Y, where W is a subset of X – Let X = W ∪ Z

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 105

W      Y      Not Y
Z      a      b      a + b
Not Z  c      d      c + d
       a + c  b + d  Sup(W)

$a = sup(WZY) = sup(XY), \quad b = sup(WZ\bar{Y}) = sup(X\bar{Y})$
$c = sup(W\bar{Z}Y), \quad d = sup(W\bar{Z}\bar{Y})$

Marginals

•  Row marginals: $a + b = sup(WZ) = sup(X), \quad c + d = sup(W\bar{Z})$
•  Column marginals: $a + c = sup(WY), \quad b + d = sup(W\bar{Y})$

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 106

W      Y      Not Y
Z      a      b      a + b
Not Z  c      d      c + d
       a + c  b + d  Sup(W)

$oddsratio = \frac{\frac{a}{a+b} \big/ \frac{b}{a+b}}{\frac{c}{c+d} \big/ \frac{d}{c+d}} = \frac{ad}{bc}$

Hypothesis

•  H0: Z and Y are independent given W – X → Y is not productive given W → Y
•  If Z and Y are independent, then

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 107

$a = \frac{(a+b)(a+c)}{n}, \quad b = \frac{(a+b)(b+d)}{n}$
$c = \frac{(c+d)(a+c)}{n}, \quad d = \frac{(c+d)(b+d)}{n}$

W      Y      Not Y
Z      a      b      a + b
Not Z  c      d      c + d
       a + c  b + d  Sup(W)

$oddsratio = \frac{ad}{bc} = 1$

Relation between a and b, c, d

•  Assumption: the row and column marginals are fixed

•  The value of a uniquely determines b, c, and d

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 108

W      Y      Not Y
Z      a      b      a + b
Not Z  c      d      c + d
       a + c  b + d  Sup(W)


Probability Mass Function of a

•  The probability mass function of observing the value a in the contingency table is given by the hypergeometric distribution – The probability of choosing s successes in t trials using sampling without replacement from a finite population of size T that has S successes in total

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 109

$P(s \mid t, S, T) = \frac{\binom{S}{s} \cdot \binom{T-S}{t-s}}{\binom{T}{t}}$

Probability Mass Function of a

•  An occurrence of Z – a success
•  T = sup(W) = n
•  W always occurs, so the total number of successes = sup(Z|W) → S = a + b, t = a + c

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 110

$P(a \mid a+c, a+b, n) = \frac{\binom{a+b}{a} \cdot \binom{n-(a+b)}{(a+c)-a}}{\binom{n}{a+c}} = \frac{\binom{a+b}{a} \cdot \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}$

Calculating the p-value

•  Assuming that the null hypothesis is true, the p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed

•  If p-value is very small (e.g., 0.01), the null hypothesis can be rejected

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 111

$p\text{-}value(a) = \sum_{i=0}^{\min(b,c)} P(a+i \mid a+c, a+b, n) = \sum_{i=0}^{\min(b,c)} \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,(a+i)!\,(b-i)!\,(c-i)!\,(d+i)!}$
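A direct transcription of this sum in Python using exact integer arithmetic (a sketch; the 2×2 cell counts in the example call are hypothetical, not from the slides):

    from math import comb

    def fisher_p_value(a, b, c, d):
        """One-sided Fisher exact test p-value for a 2x2 table: sum the
        hypergeometric probabilities of tables at least as extreme as (a, b, c, d)."""
        n = a + b + c + d
        total = comb(n, a + c)
        return sum(comb(a + b, a + i) * comb(c + d, c - i)
                   for i in range(min(b, c) + 1)) / total

    print(fisher_p_value(8, 2, 1, 5))  # hypothetical counts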

Permutation (Randomization) Test

•  Determine the distribution of a given test statistic by randomly modifying the observed data several times to obtain a random sample of data sets – The modified data sets are used for significance testing
•  Compute the empirical probability mass function (EPMF)
•  Generate the empirical cumulative distribution function

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 112

Compute p-value on Statistics

•  The empirical cumulative distribution function

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 113

$$F(x) = P(\Theta \le x) = \frac{1}{k} \sum_{i=1}^{k} I(\theta_i \le x)$$

$$p\text{-value}(\theta) = 1 - F(\theta)$$
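A minimal sketch of this empirical p-value, assuming the k statistics from the randomized data sets have already been collected:

```python
import numpy as np

def empirical_p_value(theta_obs, theta_perm):
    """p-value of the observed statistic against k statistics computed
    on randomized data sets: 1 - F(theta_obs) under the ECDF."""
    theta_perm = np.asarray(theta_perm)
    F = np.mean(theta_perm <= theta_obs)   # ECDF evaluated at theta_obs
    return 1.0 - F
```

In practice the estimate is often smoothed to (1 + #{θi ≥ θobs}) / (k + 1) so that it is never exactly zero.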

Swap Randomization

•  In a permutation test, which characteristics of the data should the permutation preserve?
•  Swap randomization keeps the row and column marginals invariant
   – The support of each item does not change
   – The length of each transaction does not change
•  A swap exchanges two items between two transactions
•  Conduct a certain number of swaps to produce a new data set

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 114
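A minimal sketch of swap randomization on a transaction database; rejecting pairs that admit no marginal-preserving swap is an implementation choice, not part of the slide:

```python
import random

def swap_randomize(transactions, num_swaps, seed=None):
    """Produce a randomized database with the same row and column
    marginals: every transaction keeps its length and every item
    keeps its support."""
    rng = random.Random(seed)
    db = [set(t) for t in transactions]
    done = 0
    while done < num_swaps:
        t1, t2 = rng.sample(range(len(db)), 2)
        only1 = list(db[t1] - db[t2])      # items swappable out of t1
        only2 = list(db[t2] - db[t1])      # items swappable out of t2
        if not only1 or not only2:
            continue                       # no marginal-preserving swap for this pair
        i1, i2 = rng.choice(only1), rng.choice(only2)
        db[t1].remove(i1); db[t1].add(i2)  # |t1| unchanged
        db[t2].remove(i2); db[t2].add(i1)  # |t2| unchanged; sup(i1), sup(i2) unchanged
        done += 1
    return db
```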


Bootstrap Sampling

•  A transaction database is just a sample from a larger population
   – What is the frequency (or the range of possible frequencies) of X in the underlying population?
•  Given a test statistic θ, how can we infer a confidence interval for its possible values at a desired confidence level α?
•  Bootstrap sampling: sampling with replacement

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 115

Calculating Statistic Range

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 116

$$F(x) = P(\Theta \le x) = \frac{1}{k} \sum_{i=1}^{k} I(\theta_i \le x)$$

Let
$$v_{\frac{1-\alpha}{2}} = F^{-1}\!\left(\frac{1-\alpha}{2}\right), \qquad v_{\frac{1+\alpha}{2}} = F^{-1}\!\left(\frac{1+\alpha}{2}\right).$$

Then
$$P\!\left(\Theta \in \left[v_{\frac{1-\alpha}{2}},\, v_{\frac{1+\alpha}{2}}\right]\right) = F\!\left(v_{\frac{1+\alpha}{2}}\right) - F\!\left(v_{\frac{1-\alpha}{2}}\right) = \frac{1+\alpha}{2} - \frac{1-\alpha}{2} = \alpha.$$

Thus, the α confidence interval for the test statistic Θ is $\left[v_{\frac{1-\alpha}{2}},\, v_{\frac{1+\alpha}{2}}\right]$.
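A minimal sketch of this percentile-bootstrap recipe, using the support of an itemset as the test statistic (the itemset, k, and α are illustrative choices):

```python
import numpy as np

def bootstrap_ci(transactions, itemset, alpha=0.95, k=1000, seed=0):
    """Resample the database with replacement k times, recompute the
    support of `itemset` on each replicate, and return the percentile
    confidence interval [v_(1-alpha)/2, v_(1+alpha)/2]."""
    rng = np.random.default_rng(seed)
    db = [set(t) for t in transactions]
    target = set(itemset)
    stats = []
    for _ in range(k):
        idx = rng.integers(0, len(db), size=len(db))      # sample with replacement
        stats.append(sum(target <= db[i] for i in idx))   # support on the replicate
    lo = np.quantile(stats, (1 - alpha) / 2)              # F^{-1}((1 - alpha)/2)
    hi = np.quantile(stats, (1 + alpha) / 2)              # F^{-1}((1 + alpha)/2)
    return lo, hi
```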


From Itemsets to Sequences

•  Itemsets: combinations of items, no temporal order
•  Temporal order is important in many situations
   –  Time-series databases and sequence databases
   –  Frequent patterns → (frequent) sequential patterns
•  Applications of sequential pattern mining
   –  Customer shopping sequences: first buy a computer, then an iPod, and then a digital camera, within 3 months
   –  Medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 118

What Is Sequential Pattern Mining?

•  Given a set of sequences, find the complete set of frequent subsequences
•  A sequence, e.g., <(ef)(ab)(df)cb>, is an ordered list of elements; an element may contain a set of items, and items within an element are unordered (we list them alphabetically)
•  <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
•  Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

A sequence database:

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 119
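The containment test behind these definitions is easy to state in code. Below is a minimal sketch of subsequence checking and support counting over the database above; greedy left-to-right matching suffices because any embedding can be shifted to the earliest matching elements:

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` (a list of itemsets) is a subsequence of
    `sequence`: each pattern element is contained in some element of
    the sequence, preserving the order of elements."""
    i = 0
    for element in sequence:
        if i < len(pattern) and set(pattern[i]) <= set(element):
            i += 1
    return i == len(pattern)

def support(pattern, sdb):
    """Number of sequences in the database containing the pattern."""
    return sum(is_subsequence(pattern, s) for s in sdb)

# The slide's database; items inside an element are unordered
sdb = [
    [{'a'}, {'a','b','c'}, {'a','c'}, {'d'}, {'c','f'}],   # 10
    [{'a','d'}, {'c'}, {'b','c'}, {'a','e'}],              # 20
    [{'e','f'}, {'a','b'}, {'d','f'}, {'c'}, {'b'}],       # 30
    [{'e'}, {'g'}, {'a','f'}, {'c'}, {'b'}, {'c'}],        # 40
]
print(support([{'a','b'}, {'c'}], sdb))  # <(ab)c> has support 2
```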

Challenges in Seq Pat Mining

•  A huge number of possible sequential patterns are hidden in databases
•  A mining algorithm should
   – Find the complete set of patterns satisfying the minimum support (frequency) threshold
   – Be highly efficient and scalable, involving only a small number of database scans
   – Be able to incorporate various kinds of user-specific constraints

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 120

Apriori Property of Seq Patterns

•  Apriori property for sequential patterns
   –  If a sequence S is infrequent, then none of the super-sequences of S is frequent
   –  E.g., if <hb> is infrequent, then so are <hab> and <(ah)b>

Given support threshold min_sup = 2:

Seq-id   Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 121

GSP

•  GSP (Generalized Sequential Pattern) mining
•  Outline of the method
   –  Initially, every item in the DB is a length-1 candidate
   –  For each level (i.e., sequences of length k):
      •  Scan the database to collect the support count of each candidate sequence
      •  Generate length-(k+1) candidate sequences from the length-k frequent sequences using the Apriori property
   –  Repeat until no frequent sequence or no candidate can be found
•  Major strength: candidate pruning by the Apriori property

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 122

Finding Len-1 Seq Patterns

•  Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
•  Scan the database once to count support for each candidate (min_sup = 2; database as above):

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 123

Generating Length-2 Candidates

Sequence-type candidates <xy> (6 × 6 = 36):

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Element-type candidates <(xy)> (C(6,2) = 15):

       <b>      <c>      <d>      <e>      <f>
<a>    <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>             <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                      <(cd)>   <(ce)>   <(cf)>
<d>                               <(de)>   <(df)>
<e>                                        <(ef)>

51 length-2 candidates in total. Without the Apriori property there would be 8×8 + 8×7/2 = 92 candidates; Apriori prunes 44.57% of them.

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 124
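Both candidate families are mechanical to enumerate. A small sketch, assuming the six length-1 patterns above:

```python
from itertools import combinations, product

def gen_length2_candidates(items):
    """From frequent length-1 items, build all length-2 candidates:
    <xy> for every ordered pair (including <xx>), plus <(xy)> for
    every unordered pair of distinct items."""
    seq_type = [[{x}, {y}] for x, y in product(items, repeat=2)]   # 6*6 = 36
    set_type = [[{x, y}] for x, y in combinations(items, 2)]       # C(6,2) = 15
    return seq_type + set_type

cands = gen_length2_candidates(['a', 'b', 'c', 'd', 'e', 'f'])
print(len(cands))  # 51, matching the slide
```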

Finding Len-2 Seq Patterns

•  Scan the database one more time, collecting the support count of each length-2 candidate
•  19 length-2 candidates pass the minimum support threshold; they are the length-2 sequential patterns

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 125

Generating Length-3 Candidates and Finding Length-3 Patterns

•  Generate length-3 candidates
   – Self-join the length-2 sequential patterns
      •  <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
      •  <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate
   – 46 candidates are generated
•  Find length-3 sequential patterns
   – Scan the database once more, collecting support counts for the candidates
   – 19 out of the 46 candidates pass the support threshold

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 126

The GSP Mining Process

With min_sup = 2 on the database above, the scans proceed level by level (e.g., from <a>, <b>, …, <h> through <aa>, <ab>, …, <(ab)>, … up to the single length-5 pattern <(bd)cba>):

•  1st scan: 8 candidates → 6 length-1 sequential patterns
•  2nd scan: 51 candidates → 19 length-2 sequential patterns; 10 candidates do not appear in the DB at all
•  3rd scan: 46 candidates → 19 length-3 sequential patterns; 20 candidates do not appear in the DB at all
•  4th scan: 8 candidates → 6 length-4 sequential patterns
•  5th scan: 1 candidate → 1 length-5 sequential pattern

A candidate is eliminated either because it cannot pass the support threshold or because it does not appear in the DB at all.


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 127

The GSP Algorithm

•  Take sequences of the form <x> as length-1 candidates
•  Scan the database once to find F1, the set of length-1 sequential patterns
•  Let k = 1; while Fk is not empty do
   –  Form Ck+1, the set of length-(k+1) candidates, from Fk
   –  If Ck+1 is not empty, scan the database once to find Fk+1, the set of length-(k+1) sequential patterns
   –  Let k = k + 1

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 128
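A compact sketch of this levelwise loop, reusing support() from the earlier sketch. The candidate generator here is a simplification (extend each pattern by one frequent item, either as a new element or merged into the last element) rather than GSP's join of Fk with itself plus Apriori pruning, so it generates more candidates than real GSP would, but it finds the same patterns:

```python
def gsp_sketch(sdb, min_sup):
    """Levelwise mining in the spirit of GSP: scan, keep frequent
    patterns, extend them to form the next level's candidates."""
    items = sorted({i for s in sdb for e in s for i in e})
    F = [[{i}] for i in items if support([{i}], sdb) >= min_sup]
    patterns = list(F)
    while F:
        candidates = []
        for p in F:
            for i in items:
                candidates.append(p + [{i}])                   # sequence extension <p i>
                if all(i > x for x in p[-1]):                  # avoid duplicate itemsets
                    candidates.append(p[:-1] + [p[-1] | {i}])  # itemset extension <..(last i)>
        F = [c for c in candidates if support(c, sdb) >= min_sup]
        patterns.extend(F)
    return patterns

# Usage with the database above (each sequence a list of set elements):
# sdb = [[{'b','d'}, {'c'}, {'b'}, {'a','c'}], ...]
# print(gsp_sketch(sdb, 2))
```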

Bottlenecks of GSP

•  A huge set of candidates
   – 1,000 frequent length-1 sequences generate
     $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates!
•  Multiple scans of the database during mining
•  Real challenge: mining long sequential patterns
   – An exponential number of short candidates
   – A length-100 sequential pattern needs
     $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$
     candidate sequences!

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 129

FreeSpan: Frequent Pattern-projected Sequential Pattern Mining

•  The itemset of a sequential pattern must itself be frequent
   – Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
   – Mine each projected database to find its patterns
•  f_list: b:5, c:4, a:3, d:3, e:3, f:2
•  All sequential patterns can be divided into 6 subsets:
   – Those containing item f
   – Those containing e but no f
   – Those containing d but no e or f
   – Those containing a but no d, e, or f
   – Those containing c but no a, d, e, or f
   – Those containing only item b

Sequence database SDB:
<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 130

From FreeSpan to PrefixSpan

•  FreeSpan
   – Projection-based: no candidate sequence needs to be generated
   – But projection can be performed at any point in a sequence, so the projected sequences may not shrink much
•  PrefixSpan
   – Also projection-based, but uses only prefix-based projection: fewer projections and quickly shrinking sequences

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 131

Prefix and Suffix (Projection)

•  <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of the sequence <a(abc)(ac)d(cf)>
•  Given the sequence <a(abc)(ac)d(cf)>:

Prefix   Suffix (prefix-based projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 132
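A minimal sketch of projecting one sequence onto a single grown item, reproducing the (_x) partial elements from the table above. Keeping only the items after `item` in the element's alphabetical order matches the slide's convention; full PrefixSpan additionally tracks that partial elements may only extend the last element of the prefix:

```python
def project(sequence, item):
    """Suffix of `sequence` after the first element containing `item`;
    the rest of that element survives as a partial first element,
    written (_x) on the slides. Returns None if the item never occurs."""
    for pos, element in enumerate(sequence):
        if item in element:
            rest = {x for x in element if x > item}  # the (_...) remainder
            return ([rest] if rest else []) + sequence[pos + 1:]
    return None

s = [{'a'}, {'a','b','c'}, {'a','c'}, {'d'}, {'c','f'}]
s_a = project(s, 'a')     # <(abc)(ac)d(cf)>, the <a>-projection
print(project(s_a, 'a'))  # <(_bc)(ac)d(cf)>, the <aa>-projection
```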

Mining Sequential Patterns by Prefix Projections

•  Step 1: find the length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
•  Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
   – Those having prefix <a>
   – Those having prefix <b>
   – …
   – Those having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>


Jian Pei: Big Data Analytics -- Frequent Pattern Mining 133

Finding Seq. Pat. with Prefix <a>

•  Only need to consider projections w.r.t. <a>
   – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
•  Find all length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
   – Further partition into 6 subsets:
      •  Those having prefix <aa>
      •  …
      •  Those having prefix <af>

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 134
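A minimal recursive sketch of this divide-and-conquer, covering sequence extensions only (the (_x) partial elements needed for patterns like <(ab)> are omitted for brevity); sdb is the database from the earlier sketch:

```python
from collections import Counter

def project_seq(sequence, item):
    """Suffix strictly after the first element containing `item`
    (itemset extensions are dropped in this simplification)."""
    for pos, element in enumerate(sequence):
        if item in element:
            return sequence[pos + 1:]
    return None

def prefixspan_sketch(sdb, min_sup, prefix=(), out=None):
    """Count items in the (projected) database, keep the frequent
    ones, and recurse on each frequent item's projected database."""
    if out is None:
        out = []
    counts = Counter()
    for s in sdb:
        for item in {i for e in s for i in e}:
            counts[item] += 1
    for item in sorted(counts):
        if counts[item] < min_sup:
            continue
        out.append((prefix + (item,), counts[item]))
        projected = [p for p in (project_seq(s, item) for s in sdb) if p]
        prefixspan_sketch(projected, min_sup, prefix + (item,), out)
    return out

# print(prefixspan_sketch(sdb, 2))  # sdb from the earlier sketch
```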

Completeness of PrefixSpan

The search tree over SDB (the four sequences above):
•  Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
•  Having prefix <a>: the <a>-projected database {<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>} yields the length-2 sequential patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, each mined recursively via its own projected database (the <aa>-projected database, …, the <af>-projected database)
•  Having prefix <b>: the <b>-projected database is mined in the same way, and likewise for prefixes <c>, …, <f>

Every sequential pattern has a unique prefix, so these subsets together cover the complete answer set.

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 135

Efficiency of PrefixSpan

•  No candidate sequence needs to be generated
•  Projected databases keep shrinking
•  Major cost of PrefixSpan: constructing the projected databases
   – Can be improved by bi-level projections

Effectiveness

•  Redundancy due to anti-monotonicity
   – The single pattern <abcd> entails 15 sequential patterns (its non-empty subsequences) of the same support
   – Remedies: closed sequential patterns and sequential generators
•  Constraints on sequential patterns
   – Gap
   – Length
   – More sophisticated, application-oriented constraints

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 136

Sequences and Partial Orders

Sequential patterns:
CHK → MMK → MORT → RESP
CHK → MMK → MORT → BROK
CHK → RRSP → MORT → RESP
CHK → RRSP → MORT → BROK

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 137

Why Frequent Orders?

•  Frequent orders capture more thorough information than sequential patterns
•  Many important applications
   –  Bioinformatics: order-preserving clustering of microarray data
   –  Web mining and market basket analysis: modeling customer purchase behaviors
   –  Network management and intrusion detection: frequent routing paths, signatures for intrusions
   –  Preference-based services: partial orders from ranking data

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 138


Why Is Mining Orders Difficult?

•  Use sequential patterns to assemble frequent partial orders?
   – One frequent closed partial order may summarize several sequential patterns (e.g., the four CHK patterns above)
   – Assembling can be costly

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 139

Model

•  A sequence s induces a full (total) order R1; if R1 ⊇ R2, where R2 is a partial order, then R1 (and hence s) is said to support R2
•  The support of a partial order R in a sequence database is the number of sequences in the database supporting R
•  An order R is closed if there exists no order R' ⊃ R with sup(R') = sup(R)
•  Given a minimum support threshold, an order R is a frequent closed partial order if it is closed and passes the support threshold

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 140
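Support counting for a partial order is then a containment test against each induced total order. A minimal sketch, representing an order as a set of precedence pairs and assuming each item occurs at most once per sequence (the example data are hypothetical):

```python
def induces(sequence, partial_order):
    """True if the total order induced by `sequence` contains every
    precedence pair (x, y) of `partial_order` (x must come before y)."""
    pos = {item: i for i, item in enumerate(sequence)}
    return all(x in pos and y in pos and pos[x] < pos[y]
               for x, y in partial_order)

def order_support(partial_order, sdb):
    """Number of sequences in the database supporting the order."""
    return sum(induces(s, partial_order) for s in sdb)

# Hypothetical ranking data
sdb = [['CHK', 'MMK', 'MORT', 'RESP'],
       ['CHK', 'RRSP', 'MORT', 'BROK']]
print(order_support({('CHK', 'MORT')}, sdb))                    # 2
print(order_support({('CHK', 'MORT'), ('MORT', 'RESP')}, sdb))  # 1
```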

Ideas

•  Depth-first search to generate frequent closed partial orders in transitive reduction
   – The transitive reduction is a succinct representation of a partial order
•  Pruning infrequent items, edges, and partial orders
•  Pruning forbidden edges
•  Extracting transitive reductions of frequent partial orders directly

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 141
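For reference, a small sketch of transitive reduction for a DAG: an edge is redundant exactly when its endpoints remain connected by a longer path:

```python
def transitive_reduction(edges):
    """Transitive reduction of a DAG given as a set of (u, v) edges:
    drop every edge whose endpoints are still connected by some other
    path. Assumes the input is acyclic."""
    nodes = {u for e in edges for u in e}
    adj = {n: {v for u, v in edges if u == n} for n in nodes}

    def reachable(src, dst, skip_edge):
        # DFS from src to dst without using skip_edge
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            for m in adj[n]:
                if (n, m) == skip_edge or m in seen:
                    continue
                if m == dst:
                    return True
                seen.add(m)
                stack.append(m)
        return False

    return {(u, v) for u, v in edges if not reachable(u, v, (u, v))}

# a->b->c plus the redundant shortcut a->c
print(transitive_reduction({('a','b'), ('b','c'), ('a','c')}))
# {('a', 'b'), ('b', 'c')}
```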

Interesting Orders

Jian Pei: Big Data Analytics -- Frequent Pattern Mining 142