ICS 278: Data Mining

Lecture 11: Pattern Discovery Algorithms

Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine




Project Progress Report

• Written Progress Report:
  – Due Tuesday May 18th in class
  – Expect at least 3 pages (should be typed, not handwritten)
  – Hand in the written document in class on Tuesday May 18th

• 1 PowerPoint slide:
  – 1 slide that describes your project
  – Use the lecture slides as a template "master" (i.e., white background, etc.)
  – Should contain:
    • Your name (top right corner)
    • Clear description of the main task
    • Some visual graphic of data relevant to your task
    • 1 or 2 bullets on what methods you plan to use
    • Preliminary results or results of exploratory data analysis
    • Make it graphical (use text sparingly)
  – Submit by 12 noon Monday May 17th


List of Sections for your Progress Report

• Clear description of task (reuse original proposal if needed)
  – Basic task + extended "bonus" tasks (if time allows)
• Discussion of relevant literature
  – Discuss prior published/related work (if it exists)
• Preliminary data evaluation
  – Exploratory data analysis relevant to your task
  – Include as many plots/graphs as you think are useful/relevant
• Preliminary algorithm work
  – Summary of your progress on algorithm implementation so far
    • If you are not at this point yet, say so
  – Relevant information about other code/algorithms you have downloaded, any preliminary testing done on them, etc.
  – Difficulties encountered so far
• Plans for the remainder of the quarter
  – Algorithm implementation
  – Experimental methods
  – Evaluation, validation

• Approximately ½ page to 1 page of text per section (graphs/plots don't count – include as many of these as you like).


Pattern-Based Algorithms

• "Global" predictive and descriptive modeling
  – "global" models in the sense that they "cover" all of the data space

• "Patterns"
  – More local structure; only describe certain aspects of the data
  – Examples:
    • A single small, very dense cluster in input space
      – e.g., a new type of galaxy in astronomy data
    • An unusual set of outliers
      – e.g., indications of an anomalous event in time-series climate data
    • Associations or "rules"
      – If bread is purchased and wine is purchased, then cheese is purchased with probability p


General Ideas for Patterns

• Many patterns can be described in the general form:
  – if condition 1 then condition 2 (with some certainty)
    • Probabilistic rules: if Age > 40 and education > college, then income > $50k with probability p
    • "Bumps": if Age > 40 and education > college, then mean income = $73k
  – if antecedent then consequent: if α then β
    • where α is generally some "box" in the input space
    • where β is a statement about a variable of interest, e.g., p(y | α) or E[y | α]

• Pattern support
  – "Support" = p(α) or p(α, β)
  – Fraction of points in input space where the condition applies
  – Often interested in patterns with larger support


How Interesting is a Pattern?

• Note: "interestingness" is inherently subjective
  – Depends on what the data analyst already knows
    • Difficult to quantify prior knowledge
  – How interesting a pattern is can be a function of:
    • How surprising it is relative to prior knowledge
    • How useful (actionable) it is
  – This is a somewhat open research problem
  – In general, pattern "interestingness" is difficult to quantify
    • => Use simple "surrogate" measures in practice


How Interesting is a Pattern?

• Interestingness of a pattern
  – Measures how "interesting" the pattern α -> β is

• Typical measures of interest
  – Conditional probability: p(β | α)
  – Change in probability: | p(β | α) - p(β) |
  – "Lift" = p(β | α) / p(β) (also the log of this)
  – Change in mean target response, e.g., E[y | α] / E[y]
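As a concrete illustration (not from the original slides), the sketch below estimates these interest measures from boolean indicator vectors for α and β; the toy data and variable names are invented for the example.

    import numpy as np

    def pattern_measures(alpha, beta):
        """Estimate support, confidence, change in probability, and lift
        for a pattern 'if alpha then beta' from boolean indicator arrays."""
        alpha = np.asarray(alpha, dtype=bool)
        beta = np.asarray(beta, dtype=bool)
        p_alpha = alpha.mean()                   # support of the antecedent, p(alpha)
        p_beta = beta.mean()                     # prior, p(beta)
        p_beta_given_alpha = beta[alpha].mean()  # confidence, p(beta | alpha)
        return {
            "support p(alpha)": p_alpha,
            "confidence p(beta|alpha)": p_beta_given_alpha,
            "change |p(beta|alpha)-p(beta)|": abs(p_beta_given_alpha - p_beta),
            "lift p(beta|alpha)/p(beta)": p_beta_given_alpha / p_beta,
        }

    # Toy example: alpha = "Age > 40 and education > 16 yrs", beta = "income > $50k"
    rng = np.random.default_rng(0)
    age = rng.integers(20, 70, size=1000)
    edu = rng.integers(10, 20, size=1000)
    income = 20 + 7.5 * (age > 40) * (edu - 10) + rng.normal(0, 10, size=1000)
    print(pattern_measures((age > 40) & (edu > 16), income > 50))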


Pattern-Finding Algorithms

• Typically… search a data set for the set of patterns that maximize some score function
  – Usually a function of both support and "interestingness"
  – E.g.:
    • Association rules
    • Bump-hunting

• Issues:
  – Huge combinatorial search space
  – How many patterns to return to the user
  – How to avoid problems with redundant patterns


Generic Pattern Finding

• Task: find patterns
• Representation: pattern language
• Score Function: f(support, interestingness)
• Search/Optimization: greedy, branch-and-bound
• Data Management: varies
• Models, Parameters: list of the K highest-scoring patterns


Two Pattern Finding Algorithms

1. Bump-hunting: the PRIM algorithm
   (J. H. Friedman & N. I. Fisher, "Bump hunting in high-dimensional data," Statistics and Computing, 2000)

2. Market basket data: association rule algorithms


“Bump-Hunting” (PRIM) algorithm

• Patient Rule Induction Method (PRIM)
  – Friedman and Fisher, 2000

• Addresses the "bump-hunting" problem:
  – Assume we have a target variable Y
    • Y could be real-valued or a binary class variable
  – And we have p "input" variables
  – We want to find "boxes" in input space where E[Y | box] >> E[Y]
    • or where E[Y | box] << E[Y], i.e., "data holes"
  – A box is a "conjunctive sentence", e.g., if Age < 22 and occupation = student

Example of a "box pattern": if Age > 30 and education > bachelor, then E[income | box] = $120k


“Bump Hunting”: Extrema Regions for Target f(x)

• let Sj be the set of all possible values for input variable xj
  – the entire input domain is S = S1 × S2 × … × Sd

• goal: find a subregion R ⊆ S for which
  – mR = avg_{x∈R} f(x) = [ ∫_{x∈R} f(x) p(x) dx ] / [ ∫_{x∈R} p(x) dx ] >> m
  – where m = ∫ f(x) p(x) dx (target mean, over all inputs)

• subregion size as a fraction of the full space ("support"): βR = ∫_{x∈R} p(x) dx

• tradeoff between mR and βR (increase βR => reduce mR) ...

• sample-based estimates used in practice:
  – β̂R = (1/n) Σ_i 1(xi ∈ R),  ȳR = (1 / (n β̂R)) Σ_{xi∈R} yi
  – note: mR is the true quantity of interest, not ȳR
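A minimal sketch of these sample-based estimates for an axis-aligned box, assuming a box is specified as per-variable intervals (the box format and toy data are made up for illustration):

    import numpy as np

    def box_estimates(X, y, box):
        """Sample-based estimates for an axis-aligned box.

        box: dict mapping column index -> (low, high) interval.
        Returns (support_hat, ybar_box): the fraction of points inside the box
        (estimate of beta_R) and the mean response of those points (ybar_R)."""
        inside = np.ones(len(X), dtype=bool)
        for j, (lo, hi) in box.items():
            inside &= (X[:, j] >= lo) & (X[:, j] <= hi)
        support_hat = inside.mean()
        ybar_box = y[inside].mean() if inside.any() else np.nan
        return support_hat, ybar_box

    # toy data: the response is elevated where x0 > 0.7 and x1 < 0.3
    rng = np.random.default_rng(1)
    X = rng.random((5000, 2))
    y = 1.0 + 3.0 * ((X[:, 0] > 0.7) & (X[:, 1] < 0.3)) + rng.normal(0, 0.5, 5000)
    print(box_estimates(X, y, {0: (0.7, 1.0), 1: (0.0, 0.3)}))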


Greedy Covering

• a generic greedy covering algorithm
  – first box B1 induced (somehow) from the entire data set
  – second box B2 induced from the data not covered by B1
  – … BK induced from the remaining data { (yi, xi) | xi ∉ ∪_{j=1…K-1} Bj }

• do until either:
  – the estimated target mean in BK becomes too small:
    • ȳK = avg[ yi | xi ∈ BK & xi ∉ ∪_{j=1…K-1} Bj ] < ȳ, where ȳ = (1/n) Σ_{i=1…n} yi
  – or the support of BK becomes too small:
    • βK = (1/n) Σ_{i=1…n} 1( xi ∈ BK & xi ∉ ∪_{j=1…K-1} Bj )

• then select a set of boxes R = ∪_j Bj
  – for which each ȳj > some threshold ȳ_threshold, or
  – which yields the largest ȳR for which βR = Σ_j βj ≥ some threshold


PRIM algorithm

• PRIM uses a "patient" greedy search on individual variables

• Start with all training data and the maximal box
• Repeat until a minimal box is reached (e.g., minimal support, or n < 10):
  – Shrink the box by compressing one face of the box
  – For each variable in input space:
    • "Peel" off a proportion α of the observations so as to maximize (or minimize) E[y | new box]
    • typical α = 0.05 or α = 0.1
  – Then "expand" the box if E[y | box] can be increased ("pasting")

• Yields a sequence of boxes
  – Use cross-validation (on E[y | box]) to select the best box

• Remove the box from the training data, then repeat the process
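Below is a simplified, illustrative sketch of the peeling phase only (no pasting, no cross-validation), with an arbitrary peeling fraction alpha; it is not Friedman and Fisher's reference implementation.

    import numpy as np

    def prim_peel(X, y, alpha=0.1, min_support=0.05):
        """Greedily peel an axis-aligned box to increase the mean of y inside it.
        Returns the sequence of (box, support, mean_y) produced by peeling;
        a box is a list of (low, high) bounds, one per input variable."""
        n, p = X.shape
        box = [(X[:, j].min(), X[:, j].max()) for j in range(p)]
        inside = np.ones(n, dtype=bool)
        trajectory = [([b for b in box], 1.0, y.mean())]

        while inside.mean() > min_support:
            best = None
            for j in range(p):
                xj = X[inside, j]
                lo_cut = np.quantile(xj, alpha)      # peel the bottom alpha fraction
                hi_cut = np.quantile(xj, 1 - alpha)  # peel the top alpha fraction
                for new_bounds in [(lo_cut, box[j][1]), (box[j][0], hi_cut)]:
                    keep = inside & (X[:, j] >= new_bounds[0]) & (X[:, j] <= new_bounds[1])
                    # skip peels that remove everything or remove nothing
                    if keep.sum() == 0 or keep.sum() == inside.sum():
                        continue
                    mean_y = y[keep].mean()
                    if best is None or mean_y > best[0]:
                        best = (mean_y, j, new_bounds, keep)
            if best is None:
                break
            mean_y, j, new_bounds, keep = best
            box[j], inside = new_bounds, keep
            trajectory.append(([b for b in box], inside.mean(), mean_y))
        return trajectory

    # toy usage: the bump sits where x0 > 0.7 and x1 < 0.3
    rng = np.random.default_rng(2)
    X = rng.random((2000, 3))
    y = 1.0 + 3.0 * ((X[:, 0] > 0.7) & (X[:, 1] < 0.3)) + rng.normal(0, 0.5, 2000)
    print(prim_peel(X, y)[-1])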


Comments on PRIM

• Works one variable at a time
  – So time-complexity is similar to tree algorithms, i.e.,
    • Linear in p, and n log n for sorting

• Nominal variables
  – Can peel/paste on single values, subsets, negations, etc.

• Similar in some sense to CART… but
  – More "patient" in its search (removes only a small fraction of the data at each step)

• Useful for finding "pockets" in the input space with a high response
  – e.g., marketing data: small groups of consumers who spend much more on a given product than the average consumer
  – Medical data: patients with specific demographics whose response to a drug is much better than the average patient's


Marketing Data Example (n=9409, p=502)

• frequent air travel: y = number of flights/yr, global mean(y) = 1.7

• B1: mean(y1) = 4.2, β1 = 0.08 (8% market segment)
  – education >= 16 yrs; income > $50K (or missing)
  – occupation in {professional/manager, sales, homemaker}
  – number of children (<18) in home <= 1

• B2: mean(y2) = 3.2, β2 = 0.07 (~2x the global mean)
  – education > 12 yrs (or missing)
  – income > $30K (or missing); 18 < age < 54
  – married / dual income in {single, married-one-income}

• these boxes are intuitive: nothing really surprising ...


Pattern Finding Algorithms

1. Bump-hunting: the PRIM algorithm

2. Market basket data: association rule algorithms


Transaction Data and Market Baskets

• Supermarket example (Srikant and Agrawal, 1997):
  – #items = 50,000, #transactions = 1.5 million

• Data sets are typically very sparse

[Figure: transactions × items matrix, with an "x" marking each item present in a transaction – almost all entries are empty]


Market Basket Analysis

• historically, association rules were phrased in terms of market baskets
• given: a (huge) "transactions" database
  – each transaction represents the basket for one customer visit
  – each transaction contains a set of items (an "itemset")
    • finite set of (boolean) items (e.g., wine, cheese, diapers, beer, …)

• Association rules
  – classically used on supermarket transaction databases
  – associations: Trader Joe's customers frequently buy wine & cheese
    • rule: "people who buy wine also buy cheese 60% of the time"
  – infamous "beer & diapers" example:
    • "in evening hours, beer and diapers are often purchased together"
  – generalizes to many other problems, e.g.:
    • baskets = documents, items = words
    • baskets = WWW pages, items = links


Market Basket Analysis: Complexity

• usually the transaction DB is too huge to fit in RAM
  – common sizes:
    • number of transactions: 10^5 to 10^8 (hundreds of millions)
    • number of items: 10^2 to 10^6 (hundreds to millions)

• the entire DB needs to be examined
  – usually very sparse
    • e.g., ~0.1% chance of buying a random item
  – subsampling is often a useful trick in DM, but
    • here, subsampling could easily miss the (rare) interesting patterns

• thus, runtime is dominated by disk read times
  – motivates a focus on minimizing the number of disk scans


Association Rules: Problem Definition

• given: a set I of items and a set T of transactions, where each t ∈ T satisfies t ⊆ I
  – itemset Z = a set of items (any subset of I)

• support count σ(Z) = number of transactions containing Z
  – for any itemset Z ⊆ I, σ(Z) = | { t | t ∈ T, Z ⊆ t } |

• association rule:
  – R = "X => Y [s, c]", where X, Y ⊆ I and X ∩ Y = ∅

• support:
  – s(R) = s(X ∪ Y) = σ(X ∪ Y) / |T| = p(X ∧ Y)

• confidence:
  – c(R) = s(X ∪ Y) / s(X) = σ(X ∪ Y) / σ(X) = p(Y | X)

• goal: find all R such that
  – s(R) ≥ a given minsup
  – c(R) ≥ a given minconf
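A small sketch of these definitions over a list of transactions (illustrative only; it reuses the 4-transaction example from the slide after next):

    def support_count(transactions, itemset):
        """sigma(Z): number of transactions that contain all items in Z."""
        z = frozenset(itemset)
        return sum(1 for t in transactions if z <= t)

    def rule_stats(transactions, X, Y):
        """Return (support, confidence) of the rule X => Y."""
        n = len(transactions)
        s_xy = support_count(transactions, set(X) | set(Y)) / n
        s_x = support_count(transactions, X) / n
        return s_xy, (s_xy / s_x if s_x > 0 else 0.0)

    T = [frozenset(t) for t in [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]]
    print(rule_stats(T, {"A"}, {"C"}))   # (0.5, 0.666...)
    print(rule_stats(T, {"C"}, {"A"}))   # (0.5, 1.0)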


Comments on Association Rules

• association rule: R = "X => Y [s, c]"
  – Strictly speaking, these are not "rules"
    • i.e., we could have "wine => cheese" and "cheese => wine"
    • correlation is not causation

• The space of all possible rules is enormous
  – O(2^p), where p = the number of different items
  – Will need some form of combinatorial search algorithm

• How are the thresholds minsup and minconf selected?
  – Not that easy to know ahead of time how to select these


Example

• simple example transaction database (|T| = 4):
  – Transaction 1 = {A, B, C}
  – Transaction 2 = {A, C}
  – Transaction 3 = {A, D}
  – Transaction 4 = {B, E, F}

• with minsup = 50%, minconf = 50%:
  – R1: A => C [s = 50%, c = 66.6%]
    • s(R1) = s({A,C}), c(R1) = s({A,C}) / s({A}) = 2/3
  – R2: C => A [s = 50%, c = 100%]
    • s(R2) = s({A,C}), c(R2) = s({A,C}) / s({C}) = 2/2

  s({A}) = 3/4 = 75%
  s({B}) = 2/4 = 50%
  s({C}) = 2/4 = 50%
  s({A,C}) = 2/4 = 50%


Finding Association Rules

• two steps:
  – step 1: find all "frequent" itemsets (F)
    • F = { Z | s(Z) ≥ minsup }  (e.g., Z = {a,b,c,d,e})
  – step 2: find all rules R: X => Y such that:
    • X ∪ Y ∈ F and X ∩ Y = ∅  (e.g., R: {a,b,c} => {d,e})
    • s(R) ≥ minsup and c(R) ≥ minconf

• step 1's time-complexity is typically >> step 2's
• step 2 need not scan the data (s(X), s(Y) are all cached in step 1)
• the search space is exponential in |I|; step 1 filters the choices for step 2
• so, most work focuses on fast frequent itemset generation
• step 1 never filters out viable candidates for step 2


Finding Frequent Itemsets

• frequent itemsets: { Z | s(Z) ≥ minsup }
• "Apriori (monotonicity) Principle": s(X) ≥ s(X ∪ Y)
  – any subset of a frequent itemset must be frequent

• finding frequent itemsets:
  – bottom-up approach:
    • work level-wise, for k = 1 … |I|
      – k = 1: find frequent singletons
      – k = 2: find frequent pairs (often the most costly level)
      – step k.1: generate size-k itemset candidates from the frequent size-(k-1) itemsets of the previous level
      – step k.2: prune candidates Z for which s(Z) < minsup
  • each level requires a single scan over all the transaction data
    – computes support counts σ(Z) = | { t | t ∈ T, Z ⊆ t } | for all size-k candidates Z

  s({A}) = 3/4 = 75%
  s({B}) = 2/4 = 50%
  s({C}) = 2/4 = 50%
  s({A,C}) = 2/4 = 50%


Apriori Example (minsup=2)

transactions T: {1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}

C1 (count, scan T):  {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
F1 (filter):         {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (gen):            {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
C2 (count, scan T):  {1,2}: 1, {1,3}: 2, {1,5}: 1, {2,3}: 2, {2,5}: 3, {3,5}: 2
F2 (filter):         {1,3}: 2, {2,3}: 2, {2,5}: 3, {3,5}: 2

C3 (gen):            {2,3,5}
C3 (count, scan T):  {2,3,5}: 2
F3 (filter):         {2,3,5}: 2

bottleneck: the counting steps, each requiring a scan over T

notice how |C3| << |C2|: candidate generation can avoid generating {1,2,3} and {1,3,5} a priori, without counting, because {1,2} and {1,5} are not frequent
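For illustration, here is a compact, non-optimized, in-memory Python version of this level-wise procedure (a sketch, not the lecture's code); run on the four transactions above it reproduces F1, F2, and F3. It ignores the disk-scan issues the earlier slides emphasize.

    from itertools import combinations

    def apriori(transactions, minsup):
        """Level-wise frequent itemset mining; minsup is an absolute count.
        Returns {frozenset(itemset): support_count} over all frequent itemsets."""
        transactions = [frozenset(t) for t in transactions]
        # k = 1: frequent singletons
        counts = {}
        for t in transactions:
            for item in t:
                counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
        frequent = {z: c for z, c in counts.items() if c >= minsup}
        all_frequent = dict(frequent)

        k = 2
        while frequent:
            # generate size-k candidates by joining frequent (k-1)-itemsets,
            # then prune any candidate with an infrequent (k-1)-subset
            prev = list(frequent)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k and all(
                        frozenset(s) in frequent for s in combinations(union, k - 1)
                    ):
                        candidates.add(union)
            # one scan over the transactions to count candidate supports
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {z: c for z, c in counts.items() if c >= minsup}
            all_frequent.update(frequent)
            k += 1
        return all_frequent

    T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    for z, c in sorted(apriori(T, minsup=2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(z), c)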


Problems with Association Rules

• Consider 4 highly correlated items A, B, C, D
  – Say p(subset i | subset j) > minconf for all possible pairs of disjoint subsets
  – And p(subset i ∪ subset j) > minsup

  – How many possible rules?
    • E.g., A => B, [A,B] => C, [A,C] => B, [B,C] => A
    • All possible combinations: 4 x 2^3
    • In general, for K such items, K x 2^(K-1) rules
    • For highly correlated items there is a combinatorial explosion of redundant rules
    • In practice this makes interpretation of association rule results difficult (see the enumeration sketch below)
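As an illustration of the explosion, the sketch below enumerates every rule with disjoint, non-empty antecedent and consequent drawn from K mutually associated items; this exact count (3^K - 2^(K+1) + 1) differs slightly from the K x 2^(K-1) approximation above but grows just as fast.

    from itertools import combinations

    def all_rules(items):
        """Enumerate every rule X => Y with X, Y disjoint, non-empty subsets of items."""
        items = list(items)
        rules = []
        for k in range(1, len(items)):                 # size of the antecedent
            for X in combinations(items, k):
                rest = [i for i in items if i not in X]
                for m in range(1, len(rest) + 1):      # size of the consequent
                    for Y in combinations(rest, m):
                        rules.append((X, Y))
        return rules

    for K in range(2, 8):
        print(K, len(all_rules(range(K))))   # 2, 12, 50, 180, 602, 1932 rules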


References on Association Rules

• Chapter 13 in the text (Sections 13.1 to 13.5)

• Early papers:
  – R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proceedings of VLDB 1994, pp. 487-499, 1994.
  – R. Agrawal et al., Fast discovery of association rules, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.

• More recent:
  – Good review in Chapter 6 of Data Mining: Concepts and Techniques, J. Han and M. Kamber, Morgan Kaufmann, 2001.
  – J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation, Proceedings of SIGMOD 2000, pp. 1-12.
  – Z. Zheng, R. Kohavi, and L. Mason, Real world performance of association rule algorithms, Proceedings of KDD 2001.


Study on Association Rule Algorithms

• Z. Zheng, R. Kohavi, and L. Mason, Real world performance of association rule algorithms, Proceedings of KDD 2001

• Evaluated a variety of association rule algorithms
  – Used both real and simulated transaction data sets

• Typical real data set from Web commerce:
  – Number of transactions = 500k
  – Number of items = 3k
  – Maximum transaction size = 200
  – Average transaction size = 5.0


Study on Association Rule Algorithms

• Conclusions:
  – Only a very narrow range of minsup yields interesting rules
    • minsup too small => too many rules
    • minsup too large => misses potentially interesting patterns
  – Superexponential growth in the number of rules on real-world data
  – Real-world data is different from the simulated transaction data used in research papers, e.g.,
    • Simulated transaction sizes have a mode away from 1
    • Real transaction sizes have a mode at 1 and are highly skewed
  – Speed-up improvements demonstrated on artificial data did not generalize to real transaction data


Beyond Binary Market Baskets

• counts (vs yes/no)
  – e.g., "3 wines" vs "wine"

• quantitative (non-binary) item variables
  – popular approach: discretize a real variable into k binary variables
  – e.g., {age=[30:39], incomeK=[42:48]} => buys_PC

• item hierarchies
  – Common in practice, e.g., clothing -> shirts -> men's shirts, etc.
  – Can learn rules that generalize across the hierarchy

• mining sequential associations/patterns and rules
  – e.g., {1@0, 2@5} => 4@15


Association Rule Finding

• Task: find association rules
• Representation: [A and B] => C
• Score Function: P(A,B,C) > minsup, P(C | A,B) > minconf
• Search/Optimization: breadth-first candidate generation
• Data Management: linear scans
• Models, Parameters: list of all rules satisfying the thresholds


Bump Hunting (PRIM)

• Task: find high-scoring bumps
• Representation: [A,B] => E[y | A,B] > E[y]
• Score Function: E[y | A,B] and p(A,B)
• Search/Optimization: greedy search
• Data Management: none
• Models, Parameters: set of "boxes"


Summary

• Pattern finding
  – An interesting and challenging problem
  – How to search for interesting/unusual "regions" of a high-dimensional space

  – Two main problems
    • Combinatorial search
    • How to define "interesting" (this is the harder problem)

  – Two examples of algorithms
    • PRIM for bump-hunting
    • Apriori for association rule mining

  – Many open problems in this research area (room for new ideas!)


Backup slides from Dennis DeCoste's 2003 lectures in ICS 278


Apriori Algorithm: Finds All Frequent Itemsets

function F = Apriori(T, I, minsup)
  F1 = { {Ii} | s({Ii}) >= minsup }          %% frequent singletons
  for k = 2 to |I| do
    if (Fk-1 = ∅) break; end
    Ck = AprioriGen(Fk-1)                    %% generate candidates
    forall t ∈ T do                          %% scan T
      forall c ∈ subset(Ck, t) do c.count++; end
    end
    Fk = { c ∈ Ck | c.count >= minsup }
  end
  F = ∪k Fk

classic generate-and-test approach


Itemset Candidate Generation

function Ck = AprioriGen(Fk-1)
  %% ===== SQL-ish self-join (items assumed ordered):
  insert into Ck                                 %% candidates {p1, p2, ..., pk-1, qk-1}
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Fk-1 p, Fk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
          p.itemk-1 < q.itemk-1                  %% grow by 1 item {qk-1}
  %% ===== prune if the Apriori monotonicity principle is violated:
  forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
      if (s ∉ Fk-1) then delete c from Ck
  end

Example:
  F3 = { {1,2,3}, {1,2,4}, {1,2,5}, {1,4,5}, {2,4,5} }
  join step gives C4 = { {1,2,3,4}, {1,2,3,5}, {1,2,4,5} }
  {1,2,3,4} pruned: {2,3,4} ∉ F3
  {1,2,3,5} pruned: {2,3,5} ∉ F3
  leaving C4 = { {1,2,4,5} }
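The same join + prune steps in illustrative Python (a sketch, not the lecture's code), reproducing the F3 -> C4 example above:

    from itertools import combinations

    def apriori_gen(F_prev, k):
        """Generate size-k candidates from the frequent (k-1)-itemsets F_prev
        (join step followed by the Apriori-monotonicity prune step)."""
        F_prev = {frozenset(f) for f in F_prev}
        sorted_sets = sorted(tuple(sorted(f)) for f in F_prev)
        candidates = set()
        for i in range(len(sorted_sets)):
            for j in range(i + 1, len(sorted_sets)):
                p, q = sorted_sets[i], sorted_sets[j]
                if p[:-1] == q[:-1]:                      # join: share the first k-2 items
                    c = frozenset(p) | {q[-1]}
                    # prune: every (k-1)-subset must itself be frequent
                    if all(frozenset(s) in F_prev for s in combinations(c, k - 1)):
                        candidates.add(c)
        return candidates

    F3 = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 4, 5}, {2, 4, 5}]
    print([sorted(c) for c in apriori_gen(F3, 4)])   # [[1, 2, 4, 5]]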


Counting Support of Candidates Ck

• key computational challenge for Apriori
  – for each c ∈ Ck, need to count its support s(c) over all of T
  – |Ck| can be huge – so O(|Ck| |T|) is often prohibitive
  – a single transaction t ∈ T may contain many candidates in Ck
  – Apriori runtime tends to be dominated by support counting

• methods (with t = {i1, i2, …, im} and c = {c1, c2, …, ck}):
  – simple string match: check each c ∈ Ck against each t ∈ T
  – mark vectors: mark all items of t once, then check each c ∈ Ck in O(k) time
    • O(k |Ck| |T| + |T|), where k = |c| – vs O(m k |Ck| |T|) for naïve set intersection
  – hash tree
    • implements subset(Ck, t)
    • quickly finds all candidates contained in t
    • near-optimal: O(| {c ∈ Ck | c ⊆ t} |) time

forall t ∈ T do
  for c ∈ subset(Ck, t) do c.count++; end
end


Example Hash Tree (all 3-item candidates)

• Build the tree for all c ∈ Ck, hashing on c(i) to choose a child at level i.
• To compute subset(Ck, t), traverse the hash tree multiple times, from root to leaves, hashing on every item in t.
• Only the subset of leaves reachable from the root via hashing will be compared against t.
• If an internal node is reached after hashing on item i, hash on every item after i in t, and recurse over all of that node's children.


Subset Operation at Root of Hash Tree

[Figure: at the root, items are hashed to the three children – items 1,4 | 2,5 | 3,6]


Subset Operation at Left-Most Subtree of Root

• Why this works: if c ⊆ t, then the first item in c is in t. By hashing at the root on every item in t, the only children ignored are those leading to itemsets starting with an item not in t. The same argument holds at internal nodes.
• Assumes items are ordered, so internal nodes only need to hash on the items after item i.


Example Rule Generation

• assume the frequent itemset is {A,B,C,D,E} and the only 1-consequent rules with minconf are:
  – ACDE => B and ABCE => D

• potential candidate 2-consequent rules might be:
  – ACD => BE
    • cannot hold: E ⊂ BE, and ABCD => E did not have minconf
  – ADE => BC
    • cannot hold: C ⊂ BC, and ABDE => C did not have minconf
  – CDE => AB
    • cannot hold: A ⊂ AB, and BCDE => A did not have minconf
  – ACE => BD
    • the only rule that needs to be generated and tested for minconf by ap_genrules


Rule Generation

• 1st find all frequent itemsets: F = Apriori(T, I, minsup)

• consider each f ∈ F and candidate rule R = "{f - c} => c", |c| ≥ 1
  – all s(R) for rules R based on the same f are the same
  – all c(R) are computable from the known s(g), g ∈ F
    • e.g., R = {1,2} => {3}, so c(R) = s({1,2,3}) / s({1,2})
    • thus, no (costly!) DB scans are required here if s(g) was saved for every g ∈ F

• if some c({f - a} => a) < minconf, then
  – any ã ⊇ a also has c({f - ã} => ã) < minconf
    • e.g., f = {1,2,3}: c({1,2} => {3}) < m implies c({1} => {2,3}) < m
    • why: c(X => Y) = s(X ∪ Y) / s(X); the numerator s(f) is the same, and the denominator s({f - ã}) ≥ s({f - a})
  – equivalently: c({f - a} => a) ≥ m requires c({f - ã} => ã) ≥ m for every ã ⊆ a
    • a monotonicity principle for confidence; idea: try {f - ã} => ã before trying {f - a} => a


Rule Generation Algorithm

forall fk ∈ F, k >= 2 do                           % try each frequent itemset in turn
  H1 = { h | h ⊂ fk & |h| = 1 & c({fk - h} => h) >= minconf }
  output rules for H1
  call ap_genrules(fk, H1, minconf, minsup)
end

function ap_genrules(fk, Hm, minconf, minsup)
  if (k <= m + 1) return;                          % stop before considering {} => fk
  Hm+1 = AprioriGen(Hm, minsup)                    % expand/filter consequents
  forall hm+1 ∈ Hm+1 do                            % recall: s(R) = s(X ∪ Y)
    c = s(fk) / s({fk - hm+1})
    if (c >= minconf) then output rule "{fk - hm+1} => hm+1"
    else delete hm+1 from Hm+1                     % use confidence monotonicity
  end
  call ap_genrules(fk, Hm+1, minconf, minsup)
end
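An illustrative Python version of rule generation from stored supports (a sketch, not the lecture's reference code); it enumerates all consequents directly rather than recursing, so it omits the confidence-monotonicity pruning for simplicity:

    from itertools import combinations

    def generate_rules(supports, minconf):
        """Generate rules {f - c} => c from a dict {frozenset: support_count}
        produced by a frequent-itemset miner. No extra data scans are needed:
        confidence is computed from the stored supports alone."""
        rules = []
        for f, s_f in supports.items():
            if len(f) < 2:
                continue
            for m in range(1, len(f)):                 # size of the consequent
                for cons in combinations(f, m):
                    antecedent = f - frozenset(cons)
                    conf = s_f / supports[antecedent]  # subsets of f are frequent, so present
                    if conf >= minconf:
                        rules.append((sorted(antecedent), sorted(cons), conf))
        return rules

    # usage with the earlier apriori() sketch:
    # supports = apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsup=2)
    # for lhs, rhs, conf in generate_rules(supports, minconf=0.7):
    #     print(lhs, "=>", rhs, round(conf, 2))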


Tricks to Improve Apriori’s Efficiency

• speed up later scans: remove t from consideration once no f ∈ Fk is contained in t

• key: faster 2-itemset generation via hashing
  – C2 is a major bottleneck, since there are many frequent 1-items
    • could generate nearly |I|^2 C2 candidates, each slowing the scan of T
  – yields most of the speedup given by the more complex "tid"-based Apriori variants
  – main idea (see the sketch below):
    • while counting the support of the 1-items of each t ∈ T during the full scan of T:
      – forall 2-item subsets x of t do HT2[hash2(x)]++; end
    • then C2 = AprioriGen(F1) can just keep the pairs x with HT2[hash2(x)] >= minsup
  – good for the common case where most transactions are small
    • i.e., |t| << |I|, so the work scales as O(|T| mean(|t|)^2) rather than O(|T| |I|^2)
  – see "An Effective Hash-Based Algorithm for Mining Association Rules", Park et al., 1995. Many others ...
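A rough sketch of this hashing trick in the spirit of Park et al.'s DHP idea; the bucket count and hash function here are arbitrary choices for illustration:

    from itertools import combinations

    def first_pass(transactions, minsup, n_buckets=1009):
        """Single scan: count 1-item supports and hash-count all 2-item subsets.
        Returns (frequent_singletons, bucket_counts)."""
        item_counts = {}
        buckets = [0] * n_buckets
        for t in transactions:
            for i in t:
                item_counts[i] = item_counts.get(i, 0) + 1
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1
        F1 = {i for i, c in item_counts.items() if c >= minsup}
        return F1, buckets

    def candidate_pairs(F1, buckets, minsup, n_buckets=1009):
        """Keep only pairs of frequent items whose hash bucket reached minsup;
        any pair whose bucket count is below minsup is provably infrequent."""
        return [
            pair
            for pair in combinations(sorted(F1), 2)
            if buckets[hash(pair) % n_buckets] >= minsup
        ]

    T = [frozenset(t) for t in [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]]
    F1, buckets = first_pass(T, minsup=2)
    print(candidate_pairs(F1, buckets, minsup=2))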


Alternatives to Apriori

• Frequent-Pattern (FP) trees [Han et al.]
  – compresses the large T – only requires a couple of DB scans
  – no candidate generation
    • accumulates prefix paths through the FP-tree
  – uses divide-and-conquer, vs Apriori's generate-and-test
  – some random-data studies show it is 10x faster than Apriori
    • especially on the often- (overly-) studied IBM random benchmark data

• many others: CLOSET [Pei and Han], etc.
• recent studies on real data show no clear winner
  – "Real World Performance of Association Rule Algorithms" [Zheng et al., KDD 2001]

• Apriori is seminal and the key starting point for the others ...


Measures of Rule Interestingness

• apply as preprocessing
  – remove "too common" items (e.g., stopwords, S&H, ...)
    • e.g., use a maxsup threshold as well, over the singleton itemsets

• apply during mining
  – minsup (and minconf) are the common way to do this
  – but not all rules that pass these thresholds are interesting

• apply as post-processing
  – filter or rank the high-support, high-confidence rules found
  – this is where most interestingness measures apply


Interestingness Measures

• subjective – a pattern is interesting if it is either:
  – unexpected
    • surprises the user (beliefs), differs in confidence vs "user-defined similar" rules, ...
  – actionable (can do something useful with it)

• objective
  – 3 principles ("3p") for an objective measure M of a rule X => Y:
    • M = 0 if X and Y are statistically independent (e.g., 0 if conf = prior(Y))
    • M increases monotonically with σ(X ∪ Y), all else (e.g., |T|) held steady
    • M decreases monotonically with σ(X), all else held steady
  – recall c(X => Y) = σ(X ∪ Y) / σ(X)
  – often, different measures obeying these give the same rankings


Problems with Support and Confidence

• example (among 5000 students):
  – 3000 play basketball (60%), 3750 eat cereal (75%)
  – 2000 (40%) both play basketball and eat cereal
  – "play basketball" => "eat cereal" [s = 40%, c = 66.7%]
    • misleading: the (prior) % of students eating cereal is 75% (> 66.7%)
  – "play basketball" => "not eat cereal" [s = 20%, c = 33.3%]
    • lower s & c, yet a more accurate description (33.3% > the 25% prior)
    • challenge: many uninteresting negative associations (T is sparse); want to find c >> prior

Contingency table:
                 basketball   not basketball   total
  cereal             2000          1750         3750
  not cereal         1000           250         1250
  total              3000          2000         5000
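To make the point concrete (not part of the original slide), computing confidence and lift for both rules from the contingency table shows the "misleading" rule actually has lift below 1:

    def conf_and_lift(n_xy, n_x, n_y, n_total):
        """Confidence and lift of rule X => Y from contingency-table counts."""
        conf = n_xy / n_x
        lift = conf / (n_y / n_total)
        return conf, lift

    # counts from the 5000-student contingency table above
    print(conf_and_lift(2000, 3000, 3750, 5000))  # cereal:     conf ~ 0.667, lift ~ 0.89 (< 1)
    print(conf_and_lift(1000, 3000, 1250, 5000))  # not cereal: conf ~ 0.333, lift ~ 1.33 (> 1)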


Statistical Dependence vs Confidence

• example:
  – X and Y are positively correlated
  – X and Z are negatively correlated
  – yet the support and confidence of X => Z dominate

• need a measure of dependence
  – corr(A,B) = P(A ∧ B) / [ P(A) P(B) ]
  – corr(A,B) ≈ 1 if statistically independent
  – corr(A,B) > 1 means positive dependence

• P(B|A) / P(B) = "lift" of A => B = "interest" = I(A,B)
  – where P(B|A) = P(A ∧ B) / P(A) = c(A => B); {I - 1 obeys the 3 principles}

• lift(X => Y) = 0.5 / (2/8) = 2  >  lift(X => Z) = 0.75 / (7/8) ≈ 0.86 ≈ 1

  X: 1 1 1 1 0 0 0 0
  Y: 1 1 0 0 0 0 0 0
  Z: 0 1 1 1 1 1 1 1

            s        c
  X => Y:   25.0%    50%
  X => Z:   37.5%    75%
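A quick check of these numbers (illustrative only), computing support, confidence, and lift directly from the indicator vectors above:

    import numpy as np

    def rule_metrics(x, y):
        """Support, confidence, and lift of x => y for boolean indicator vectors."""
        x, y = np.asarray(x, bool), np.asarray(y, bool)
        s = (x & y).mean()          # support
        c = s / x.mean()            # confidence
        lift = c / y.mean()         # lift = confidence / prior
        return s, c, lift

    X = [1, 1, 1, 1, 0, 0, 0, 0]
    Y = [1, 1, 0, 0, 0, 0, 0, 0]
    Z = [0, 1, 1, 1, 1, 1, 1, 1]
    print(rule_metrics(X, Y))   # (0.25, 0.5, 2.0)
    print(rule_metrics(X, Z))   # (0.375, 0.75, ~0.857)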