AlgoDEEP 160410 1

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Fabio Vandin

DEI - Università di Padova

CS Dept - Brown University

Joint work with A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal


Data Mining

Discovery of hidden patterns (e.g., correlations, association rules, clusters, anomalies, etc.) from large data sets

When is a pattern significant?

Open problem: development of rigorous (mathematical/statistical) approaches to assess significance and to discover significant patterns efficiently


Frequent Itemsets (1)

Dataset D of transactions over a set of items I (D ⊆ 2^I)

Support of an itemset X ∈ 2^I in D = number of transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

support({Beer, Diaper}) = 3. Significant?
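The support computation on the five-transaction example can be checked with a few lines of Python. This is only an illustrative sketch: the list `transactions` and the helper `support` are names introduced here, not part of the talk.

```python
# Toy dataset from the example: one set of items per transaction.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    target = set(itemset)
    return sum(1 for transaction in dataset if target <= transaction)


print(support({"Beer", "Diaper"}, transactions))  # prints 3
```

Beer and Diaper occur together in transactions 2, 3 and 4, which gives the support of 3 stated above.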


Original formulation of the problem [Agrawal et al. '93]:
  input: dataset D over I, support threshold s
  output: all itemsets of support ≥ s in D (frequent itemsets)
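To make the input/output contract of this formulation concrete, here is a brute-force sketch in Python. It is not an efficient miner and not the algorithm of the cited paper; the function name `frequent_itemsets` and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(dataset, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Brute-force enumeration, exponential in the number of items. It stops at
    the first size with no frequent itemset, because support cannot grow when
    an itemset is extended.
    """
    items = sorted(set().union(*dataset))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in dataset if set(candidate) <= t)
            if count >= min_support:
                result[frozenset(candidate)] = count
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none of larger size
    return result


toy = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = frequent_itemsets(toy, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

With threshold s = 3 on the toy dataset, four single items and four pairs are frequent, and no triple reaches the threshold.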

Rationale: significance = high support (≥ s)

Drawbacks:
  Threshold s hard to fix:
    too low: possible output explosion and spurious discoveries (false positives)
    too high: loss of interesting itemsets (false negatives)
  No guarantee of significance of output itemsets

Alternative formulations proposed to mitigate the above drawbacks: closed itemsets, maximal itemsets, top-K itemsets



Significance

Focus on statistical significance: significance w.r.t. a random model

We address the following questions:
  What support level makes an itemset significantly frequent?
  How to narrow the search down to significant itemsets?

Goal: minimize false discoveries and improve the quality of subsequent analysis


Related Work

Many works consider significance of itemsets in isolation
  E.g. [Silverstein, Brin, Motwani '98]: rigorous statistical framework (with flaws); χ² test to assess the degree of dependence of the items in an itemset

Global characteristics of the dataset taken into account in [Gionis, Mannila et al. '06]: deviation from a random dataset w.r.t. the number of frequent itemsets; no rigorous statistical grounding


Statistical Tests

Standard statistical test:
  null hypothesis H0 (≈ not significant)
  alternative hypothesis H1

H0 is tested against H1 by observing a certain statistic s

p-value = Prob( obs ≥ s | H0 is true )

Significance level α = probability of rejecting H0 when it is true (false positive). Also called probability of Type I error


Random Model

I = set of n items

D = input dataset of t transactions over I

For each item i ∊ I:
  n(i) = support of i in D
  f_i = n(i)/t = frequency of i in D

D̂ = random dataset of t transactions over I: item i is included in transaction j with probability f_i, independently of all other events
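A small Python sketch of this random model follows. The function names `item_frequencies` and `random_dataset`, and the choice of passing an explicit `random.Random` generator, are assumptions made here for illustration.

```python
import random


def item_frequencies(dataset):
    """f_i = n(i) / t for every item i occurring in the dataset."""
    t = len(dataset)
    counts = {}
    for transaction in dataset:
        for item in transaction:
            counts[item] = counts.get(item, 0) + 1
    return {item: n_i / t for item, n_i in counts.items()}


def random_dataset(frequencies, t, rng):
    """Draw t transactions; each item i joins each one independently with prob. f_i."""
    return [
        {item for item, f_i in frequencies.items() if rng.random() < f_i}
        for _ in range(t)
    ]


toy = [{"Bread", "Milk"}, {"Bread", "Beer"}, {"Milk"}, {"Bread"}]
freqs = item_frequencies(toy)  # Bread: 0.75, Milk: 0.5, Beer: 0.25
sample = random_dataset(freqs, t=len(toy), rng=random.Random(0))
print(freqs)
print(sample)
```

The random dataset preserves the number of transactions and, in expectation, the individual item frequencies, while removing any dependence between items.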


Naïve Approach (1)

For each itemset X = {i_1, i_2, …, i_k} ⊆ I:

f_X = f_{i_1} f_{i_2} … f_{i_k} = expected frequency of X in the random dataset D̂

null hypothesis H0(X): the support of X in D conforms with D̂ (i.e., it is as drawn from Binomial(t, f_X))

alternative hypothesis H1(X): the support of X in D does not conform with D̂
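The expected frequency f_X is simply the product of the item frequencies, and t · f_X is the mean of the Binomial(t, f_X) null distribution. A minimal sketch, with helper names and numbers chosen here for illustration:

```python
from math import prod


def expected_frequency(itemset, frequencies):
    """f_X = product of the individual item frequencies f_i, for i in X."""
    return prod(frequencies[item] for item in itemset)


def expected_support(itemset, frequencies, t):
    """Mean of Binomial(t, f_X): expected support of X in the random dataset."""
    return t * expected_frequency(itemset, frequencies)


# Item frequencies of Beer and Diaper in the five-transaction example.
freqs = {"Beer": 3 / 5, "Diaper": 4 / 5}
print(expected_frequency({"Beer", "Diaper"}, freqs))     # about 0.48
print(expected_support({"Beer", "Diaper"}, freqs, t=5))  # about 2.4
```

In the toy example the observed support of {Beer, Diaper} is 3, against an expected support of about 2.4 under independence.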



Statistic of interest: s_X = support of X in D

Reject H0(X) if

p-value = Prob(B(t, f_X) ≥ s_X) ≤ α

Significant itemsets = {X ⊆ I : H0(X) is rejected}
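The per-itemset test can be sketched with an exact binomial tail. The helpers `binomial_tail` and `is_significant` are hypothetical names; the tail is computed as one minus a short sum, which is adequate for small s but loses relative precision for extremely small p-values.

```python
from math import comb


def binomial_tail(t, p, s):
    """Prob(Binomial(t, p) >= s), computed as 1 - Prob(Binomial(t, p) <= s - 1)."""
    if s <= 0:
        return 1.0
    below = sum(
        comb(t, j) * p**j * (1 - p) ** (t - j) for j in range(min(s, t + 1))
    )
    return max(0.0, 1.0 - below)


def is_significant(s_x, f_x, t, alpha):
    """Naive test: reject H0(X) when the binomial p-value is at most alpha."""
    return binomial_tail(t, f_x, s_x) <= alpha


# Toy numbers: support 3 out of t = 5 transactions, expected frequency 0.48.
p_value = binomial_tail(5, 0.48, 3)
print(round(p_value, 4))                 # about 0.4625
print(is_significant(3, 0.48, 5, 0.05))  # False
```

On the toy numbers the p-value is far above 0.05, so the naive test does not flag the itemset.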


What's wrong?

D with t = 1,000,000 transactions over n = 1000 items, each item with frequency 1/1000

Pair {i, j} that occurs 7 times: is it statistically significant?

In D̂ (random dataset):
  E[support({i, j})] = 1
  p-value = Prob({i, j} has support ≥ 7) ≃ 0.0001

{i, j} must be significant?



Expected number of pairs with support ≥ 7 in the random dataset is ≃ 50

Existence of {i, j} with support ≥ 7 is not such a rare event

Returning {i, j} as a significant itemset could be a false discovery

However, 300 (disjoint) pairs with support ≥ 7 in D̂ is an extremely rare event (prob ≤ 2^-300)
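The numbers in this example can be reproduced with exact binomial arithmetic. This is a sketch using only the parameters stated in the example (t = 1,000,000, n = 1000, item frequency 1/1000, support 7); the variable names are chosen here.

```python
from math import comb

t = 1_000_000          # transactions
n = 1_000              # items
f_pair = (1 / n) ** 2  # expected frequency of a pair: f_i * f_j
s = 7                  # observed support of the pair

# Prob(Binomial(t, f_pair) >= s) = 1 - Prob(Binomial(t, f_pair) <= s - 1)
p_value = 1.0 - sum(
    comb(t, j) * f_pair**j * (1 - f_pair) ** (t - j) for j in range(s)
)

num_pairs = comb(n, 2)                # 499,500 candidate pairs
expected_pairs = num_pairs * p_value  # linearity of expectation

print(f"p-value of one pair            ~ {p_value:.2e}")
print(f"expected pairs with support>=7 ~ {expected_pairs:.1f}")
```

The exact tail is about 8.3e-5, consistent with the rounded ≃ 0.0001. Multiplying by the 499,500 candidate pairs gives about 42 with the exact tail and about 50 with the rounded p-value; either way, dozens of such pairs are expected in a purely random dataset.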



Multi-Hypothesis test (1)

Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously for

m = C(n, k) null hypotheses {H0(X) : |X| = k}

How to combine m tests while minimizing false positives?



V = number of false positives
R = total number of rejected null hypotheses = number of itemsets flagged as significant
False Discovery Rate (FDR) = E[V/R] (FDR = 0 when R = 0)

GOAL: maximize R while ensuring FDR ≤ β

[Benjamini-Yekutieli '01]: reject the hypothesis with the i-th smallest p-value if it is ≤ iβ/m

m = C(n, k): does not yield a support threshold for mining
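A sketch of the step-up rule in Python is given below. It is written from the general description of the procedure, not taken from the talk: the function name, the `harmonic_correction` switch and the example p-values are assumptions. The published Benjamini-Yekutieli procedure for arbitrarily dependent tests divides the threshold by c(m) = 1 + 1/2 + … + 1/m; setting the switch to False gives the simpler iβ/m rule.

```python
def benjamini_yekutieli(p_values, beta, harmonic_correction=True):
    """Step-up FDR procedure; returns the indices of the rejected hypotheses.

    Rejects the hypotheses with the ell smallest p-values, where ell is the
    largest rank i with p_(i) <= i * beta / (m * c). Here c = 1 + 1/2 + ... + 1/m
    when harmonic_correction is True, and c = 1 otherwise.
    """
    m = len(p_values)
    if m == 0:
        return []
    c = sum(1.0 / j for j in range(1, m + 1)) if harmonic_correction else 1.0
    order = sorted(range(m), key=lambda idx: p_values[idx])
    ell = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * beta / (m * c):
            ell = rank
    return sorted(order[:ell])


p_vals = [0.001, 0.8, 0.02, 0.03]  # hypothetical p-values
print(benjamini_yekutieli(p_vals, beta=0.05, harmonic_correction=False))  # [0, 2, 3]
print(benjamini_yekutieli(p_vals, beta=0.05))                             # [0]
```

Either variant needs the p-value of every one of the m hypotheses, which is why it does not translate directly into a support threshold for a mining algorithm.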


Our Approach

Q(k, s) = observed number of k-itemsets of support ≥ s

null hypothesis H0(s): the number of k-itemsets of support ≥ s in D conforms with the random dataset D̂

alternative hypothesis H1(s): the number of k-itemsets of support ≥ s in D does not conform with the random dataset D̂

Problem: how to compute the p-value of Q(k, s)?
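The observed statistic Q(k, s) can be computed by counting the support of every k-subset that occurs in some transaction. A minimal sketch (the function name `count_k_itemsets` is an assumption; it requires s ≥ 1, since itemsets that never occur are not enumerated):

```python
from collections import Counter
from itertools import combinations


def count_k_itemsets(dataset, k, s):
    """Q(k, s): number of k-itemsets whose support in `dataset` is >= s (s >= 1)."""
    supports = Counter()
    for transaction in dataset:
        # Every k-subset of a transaction gains one unit of support.
        for subset in combinations(sorted(transaction), k):
            supports[subset] += 1
    return sum(1 for count in supports.values() if count >= s)


toy = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(count_k_itemsets(toy, k=2, s=3))  # prints 4
print(count_k_itemsets(toy, k=2, s=2))  # prints 8
```

This direct enumeration is practical only when transactions are short or k is small; a frequent itemset miner run at threshold s gives the same count.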


Main Results (PODS 2009)

Result 1 (Poisson approximation): Q(k, s) = number of k-itemsets of support ≥ s in the random dataset D̂
  Theorem: there exists s_min such that, for s ≥ s_min, Q(k, s) is well approximated by a Poisson distribution

Result 2: methodology to establish a support threshold for discovering significant itemsets with small FDR


Approximation Result (1)

Based on Chen-Stein method (1975)

Q(k, s) = number of k-itemsets of support ≥ s in the random dataset D̂

U ~ Poisson(λ), λ = E[Q(k, s)]

Theorem: for k = O(1) and t = poly(n), for a large range of item distributions and supports s,

distance(Q(k, s), U) = O(1/n)
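Under the random model, λ = E[Q(k, s)] is a sum of binomial tail probabilities, one per k-itemset, by linearity of expectation. The sketch below computes it by explicit enumeration, which is feasible only for small n and k; the function names and the example frequencies are assumptions made for illustration.

```python
from itertools import combinations
from math import comb, prod


def binomial_tail(t, p, s):
    """Prob(Binomial(t, p) >= s), as 1 - Prob(Binomial(t, p) <= s - 1)."""
    if s <= 0:
        return 1.0
    below = sum(
        comb(t, j) * p**j * (1 - p) ** (t - j) for j in range(min(s, t + 1))
    )
    return max(0.0, 1.0 - below)


def expected_count(frequencies, t, k, s):
    """lambda = E[Q(k, s)]: sum over all k-itemsets X of Prob(Binomial(t, f_X) >= s)."""
    return sum(
        binomial_tail(t, prod(item_freqs), s)
        for item_freqs in combinations(frequencies, k)
    )


# Hypothetical example: 3 items of frequency 0.5, t = 2 transactions, pairs, s = 1.
lam = expected_count([0.5, 0.5, 0.5], t=2, k=2, s=1)
print(lam)  # 3 pairs, each with tail 1 - 0.75**2 = 0.4375
```

The theorem says that, in the stated regime, the whole distribution of Q(k, s), and not only its mean λ, is close to Poisson(λ).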



Corollary: there exists s_min such that Q(k, s) is well approximated by a Poisson distribution for s ≥ s_min

In practice: Monte Carlo method to determine s_min such that, with probability at least 1-δ,

distance(Q(k, s), U) ≤ ε for all s ≥ s_min
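The talk does not spell out the Monte Carlo procedure, so the following is only an illustration of the idea and not the authors' method: sample random datasets, record Q(k, s) in each, and compare the empirical distribution with a Poisson of the same mean using total variation distance. All names and parameter values are assumptions, and the estimate is noisy when the number of trials is small.

```python
import random
from collections import Counter
from itertools import combinations
from math import exp, factorial


def sample_count(frequencies, t, k, s, rng):
    """Q(k, s) for one random dataset drawn from the independence model."""
    supports = Counter()
    for _ in range(t):
        transaction = [i for i, f in enumerate(frequencies) if rng.random() < f]
        for subset in combinations(transaction, k):
            supports[subset] += 1
    return sum(1 for c in supports.values() if c >= s)


def empirical_distance_to_poisson(frequencies, t, k, s, trials, rng):
    """Empirical mean of Q(k, s) and its total variation distance to Poisson(mean)."""
    samples = [sample_count(frequencies, t, k, s, rng) for _ in range(trials)]
    lam = sum(samples) / trials
    observed = Counter(samples)
    abs_diff = 0.0
    poisson_mass_seen = 0.0
    for q in range(max(samples) + 1):
        poisson_q = exp(-lam) * lam**q / factorial(q)
        poisson_mass_seen += poisson_q
        abs_diff += abs(observed[q] / trials - poisson_q)
    # Poisson mass beyond the largest sampled value has no empirical counterpart.
    abs_diff += max(0.0, 1.0 - poisson_mass_seen)
    return lam, abs_diff / 2


lam, dist = empirical_distance_to_poisson(
    [0.1] * 10, t=50, k=2, s=3, trials=200, rng=random.Random(1)
)
print(f"estimated lambda = {lam:.2f}, empirical distance = {dist:.3f}")
```

Repeating the estimate for increasing values of s gives a rough picture of where the Poisson approximation starts to hold.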


Support threshold for mining significant itemsets (1)

Determine s_min and let h be such that s_min + 2^h is the maximum support for an itemset

Fix α_1, α_2, …, α_h such that Σ α_i ≤ α
Fix β_1, β_2, …, β_h such that Σ β_i ≤ β

For i = 1 to h:
  s_i = s_min + 2^i
  Q(k, s_i) = observed number of k-itemsets of support ≥ s_i
  H0(k, s_i): Q(k, s_i) conforms with Poisson(λ_i = E[Q(k, s_i)])
  reject H0(k, s_i) if p-value of Q(k, s_i) < α_i and Q(k, s_i) ≥ λ_i / β_i
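The loop over candidate thresholds can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the thresholds s_i, observed counts and expected counts are passed in explicitly, the second rejection condition is read as Q(k, s_i) ≥ λ_i / β_i, and the example numbers are hypothetical.

```python
from math import exp


def poisson_tail(lam, q):
    """Prob(Poisson(lam) >= q), computed as 1 - Prob(Poisson(lam) <= q - 1)."""
    if q <= 0:
        return 1.0
    term = exp(-lam)  # Prob(Poisson(lam) = 0)
    below = 0.0
    for j in range(q):
        below += term
        term *= lam / (j + 1)  # Prob(= j + 1) from Prob(= j)
    return max(0.0, 1.0 - below)


def find_support_threshold(thresholds, observed, expected, alphas, betas):
    """Smallest s_i whose null hypothesis is rejected, or None if none is.

    thresholds[i] = s_i (ascending), observed[i] = Q(k, s_i) in the real dataset,
    expected[i] = lambda_i = E[Q(k, s_i)] in the random dataset.
    """
    for s_i, q_i, lam_i, alpha_i, beta_i in zip(
        thresholds, observed, expected, alphas, betas
    ):
        p_value = poisson_tail(lam_i, q_i)
        if p_value < alpha_i and q_i >= lam_i / beta_i:
            return s_i
    return None


# Hypothetical inputs with alpha = beta = 0.05 split evenly over h = 3 tests.
h = 3
s_star = find_support_threshold(
    thresholds=[10, 12, 14],
    observed=[30, 20, 15],
    expected=[25.0, 2.0, 0.1],
    alphas=[0.05 / h] * h,
    betas=[0.05 / h] * h,
)
print(s_star)  # prints 14
```

In the hypothetical run, the first threshold fails the p-value condition, the second passes it but the observed count (20) is below λ_i / β_i = 120, and the third satisfies both conditions.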



Theorem: let s* be the minimum s such that H0(k, s) was rejected. We have:

1. With significance level α, the number of k-itemsets of support ≥ s* is significant

2. The k-itemsets with support ≥ s* are significant with FDR ≤ β


Experiments: benchmark datasets

FIMI repository: http://fimi.cs.helsinki.fi/data/

[Table of benchmark datasets; recoverable column labels: avg. transaction length, item frequencies range]


Experiments: results (1)

Test: α = 0.05, β = 0.05
Q_{k,s} = number of k-itemsets of support ≥ s in D
λ(s) = expected number of k-itemsets with support ≥ s

Itemset of size 154 with support ≥ 7



Comparison with standard application of Benjamini-Yekutieli (FDR ≤ 0.05)
R = output (standard approach)
Q_{k,s} = output (our approach)
r = |Q_{k,s}| / |R|


Conclusions

Poisson approximation for the number of k-itemsets of support s ≥ s_min in a random dataset

A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR


Future Work

Deal with false negatives

Software package

Application of the method to other frequent pattern problems


Questions

Thank you
