
Page 1: Statistical Learning Theory Meets Big Data

Statistical Learning Theory Meets Big Data

Randomized algorithms for frequent itemsets

Eli Upfal

Brown University

Page 2: Statistical Learning Theory Meets Big Data

Data, data, data

● “In God we trust, all others (must) bring data”
– Prof. W. E. Deming, statistician (attributed)
● “Every two days now we create as much information as we did from the dawn of civilization up until 2003”
– Eric Schmidt, Google CEO, 2010
● “Data explosion is bigger than Moore's Law”
– Marissa Mayer, Google VP, 2009

Page 4: Statistical Learning Theory Meets Big Data

Big Data

● Big data requires big storage and long execution time
● It's great to have big data – but do we need to process all of it?
● Trade off accuracy of results for execution speed
● Can approximation algorithms for data analysis be fast and still guarantee high-quality results?

Page 5: Statistical Learning Theory Meets Big Data

Frequent Itemsets

● Market basket analysis
● You own a grocery store
● You have a copy of the receipts from sales
– For each customer, you have the list of products she bought
● Interested in which groups of products are bought together (correlated items)
– Business decisions
– Optimization

Page 6: Statistical Learning Theory Meets Big Data

Settings

● Transactional dataset D

Figure from Tan et al. - Introduction to Data Mining

Page 7: Statistical Learning Theory Meets Big Data

Transactions

● Transactional dataset D: each row is a transaction

Figure from Tan et al. - Introduction to Data Mining

Page 8: Statistical Learning Theory Meets Big Data

Items

● Transactions are built on items from a domain

Figure from Tan et al. - Introduction to Data Mining

Page 9: Statistical Learning Theory Meets Big Data

Itemsets

● Sets of items are called itemsets

Figure from Tan et al. - Introduction to Data Mining

Page 10: Statistical Learning Theory Meets Big Data

Frequency

● Frequency of an itemset X in dataset D:
– f_D(X): the fraction of transactions of D containing X
– Example: f_D({Milk}) = 4/5, f_D({Bread, Milk}) = 3/5

Figure from Tan et al. - Introduction to Data Mining
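To make the definition concrete, here is a minimal Python sketch (function and variable names are illustrative, and the toy dataset mirrors the example figure from Tan et al.):

    def frequency(D, X):
        # f_D(X): fraction of transactions of D that contain every item of X
        X = set(X)
        return sum(X <= set(t) for t in D) / len(D)

    # Toy dataset mirroring the example figure
    D = [{"Bread", "Milk"},
         {"Bread", "Diapers", "Beer", "Eggs"},
         {"Milk", "Diapers", "Beer", "Coke"},
         {"Bread", "Milk", "Diapers", "Beer"},
         {"Bread", "Milk", "Diapers", "Coke"}]
    print(frequency(D, {"Milk"}))           # 0.8 = 4/5
    print(frequency(D, {"Bread", "Milk"}))  # 0.6 = 3/5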

Page 12: Statistical Learning Theory Meets Big Data

Frequent Itemsets

● Frequent itemsets mining with respect to a frequency threshold θ:
– Find all itemsets X with frequency f_D(X) ≥ θ
● Top-K frequent itemsets:
– Find all itemsets at least as frequent as the kth most frequent
● Association rules:
– X in a transaction is correlated with Y in the transaction

Page 14: Statistical Learning Theory Meets Big Data

Frequent Itemsets mining

● Well-studied classical problem in data analysis
● There are algorithms to extract the exact collection of FI's:
– APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]
● Running time depends on the number of transactions in the dataset and on the number of frequent itemsets
– 10^8 transactions is considered a “normal” size
– 10^4 items → 2^(10^4) itemsets in total
– A transaction with d items contains 2^d itemsets

Page 21: Statistical Learning Theory Meets Big Data

How to speed up mining?

● Decrease the dependency on the number of transactions:
1. Create a random sample of the dataset
2. Extract the frequent itemsets from the sample with a lowered frequency threshold θ' < θ
● Sampling forces us to accept approximate results
● Need to quantify what is acceptable
● Given ε, δ, extract a collection W of FI's such that, with probability at least 1−δ:

[Figure: frequency axis from 0 to 1 — itemsets with frequency ≥ θ must be in W; itemsets with frequency < θ−ε must not be in W; itemsets in between may be in W]

Page 22: Statistical Learning Theory Meets Big Data

Observation

● Data mining is exploratory in nature
● Exact results are usually not needed
– We want a good, realistic idea of the process
● Fast analysis is more important
– ... if it still guarantees good-enough results!

Riondato et al., PARMA, CIKM'12

Page 23: Statistical Learning Theory Meets Big Data

How to speed up mining?

● As before: create a random sample of the dataset, and extract the frequent itemsets from the sample with a lowered frequency threshold θ'
● Given ε, δ, extract a collection W of itemsets from the sample, with the same guarantee pictured above
● Problem: choose the right sample size and the right θ'

Page 26: Statistical Learning Theory Meets Big Data

Choosing θ'

● If we have, for all itemsets X simultaneously,
|f_S(X) − f_D(X)| ≤ ε/2
then θ' = θ − ε/2 does the trick
● Need a sample size |S| such that, with probability at least 1−δ, for all itemsets X we have |f_S(X) − f_D(X)| ≤ ε/2

[Figure: the same frequency axis from 0 to 1 — must be in W above θ, must not be in W below θ−ε, may be in W in between]
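Spelled out (my own rendering, consistent with the figure), mining the sample at threshold θ' = θ − ε/2 gives exactly the pictured guarantee:

\[
|f_S(X)-f_D(X)| \le \tfrac{\varepsilon}{2} \;\text{ for all } X
\;\Longrightarrow\;
\begin{cases}
f_D(X) \ge \theta &\Rightarrow\; f_S(X) \ge \theta - \tfrac{\varepsilon}{2} \;\Rightarrow\; X \in W,\\
f_D(X) < \theta - \varepsilon &\Rightarrow\; f_S(X) < \theta - \tfrac{\varepsilon}{2} \;\Rightarrow\; X \notin W,
\end{cases}
\]

where W is the set of itemsets whose sample frequency is at least θ − ε/2.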

Page 28: Statistical Learning Theory Meets Big Data

Choosing the sample size

● Naïve approach
● Given an itemset X, the frequency of X in a sample S is distributed like a (rescaled) binomial: |S|·f_S(X) ~ Binomial(|S|, f_D(X))
● Use the Chernoff bound to bound Pr(|f_S(X) − f_D(X)| > ε/2)
● Apply the union bound over all itemsets to find |S| such that Pr(∃X : |f_S(X) − f_D(X)| > ε/2) ≤ δ
● Problem: there is an exponential number of itemsets
– The sample size would be too large (linear in the number of items)
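For concreteness, here is the naive calculation written out (my rendering; n denotes the number of items, so there are at most 2^n itemsets): a Chernoff-Hoeffding bound for a single itemset, followed by a union bound,

\[
\Pr\!\left(|f_S(X)-f_D(X)| > \tfrac{\varepsilon}{2}\right) \le 2e^{-|S|\varepsilon^2/2},
\qquad
2^{n}\cdot 2e^{-|S|\varepsilon^2/2} \le \delta
\;\Longleftrightarrow\;
|S| \ge \frac{2}{\varepsilon^2}\left(n\ln 2 + \ln\frac{2}{\delta}\right),
\]

so the naive sample size grows linearly with the number of items n.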

Page 29: Statistical Learning Theory Meets Big Data

Vapnik-Chervonenkis Dimension

● Combinatorial property of a collection of subsets from a domain

● Measures the “richness” or “expressivity” of the collection of subsets

● If we know the VC-dim of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample

Page 30: Statistical Learning Theory Meets Big Data

Range spaces

● VC-Dimension is defined on range spaces
● (B,R): range space
– B: a domain
– R: a collection of subsets of B (the ranges)
● No restrictions:
– B can be infinite
– R can be infinite
– R can contain infinitely-large subsets of B

Page 31: Statistical Learning Theory Meets Big Data

Vapnik-Chervonenkis Dimension

● Range space (B,R)
● For any C ⊆ B, define the projection P_R(C) = { R' ∩ C : R' ∈ R }
● C is shattered if P_R(C) = 2^C, i.e., every subset of C can be obtained by intersecting C with a range
● The VC-Dimension of (B,R) is the size of the largest shattered subset of B

Page 37: Statistical Learning Theory Meets Big Data

Example of VC-Dimension

● B = R² (the plane)
● R = all axis-aligned rectangles in R²
● Shattering 4 points: easy
– Take any 4 points such that none of them lies in the bounding box of the other three (e.g., a “diamond”: one leftmost, one rightmost, one topmost, one bottommost point)
– Need 16 rectangles, one per subset

Page 40: Statistical Learning Theory Meets Big Data

Example of VC-Dimension

● B = R²
● R = all axis-aligned rectangles in R²
● Shattering 5 points: impossible
– Take any 5 points
– One of them is contained in every rectangle containing the other four (it lies in the bounding box of the leftmost, rightmost, topmost, and bottommost points)
– Impossible to find a rectangle containing only the other four
● VC(B,R) = 4
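The bounding-box argument can be checked mechanically. A minimal Python sketch (names are illustrative) that tests whether axis-aligned rectangles shatter a given point set, using the fact that a subset can be isolated by a rectangle iff its bounding box contains no other point of the set:

    from itertools import combinations

    def shatterable(points):
        # Axis-aligned rectangles shatter `points` iff, for every non-empty
        # subset, the subset's bounding box contains no other point.
        for r in range(1, len(points) + 1):
            for sub in combinations(points, r):
                xs = [x for x, _ in sub]
                ys = [y for _, y in sub]
                rest = [p for p in points if p not in sub]
                if any(min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)
                       for x, y in rest):
                    return False  # this subset cannot be isolated
        return True

    diamond = [(0, 1), (1, 0), (2, 1), (1, 2)]
    print(shatterable(diamond))             # True: all 16 subsets realizable
    print(shatterable(diamond + [(1, 1)]))  # False: the center blocks the other four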

Page 41: Statistical Learning Theory Meets Big Data

Approximating sizes of ranges

● Sample Theorem:
– Let (B,R) have VC(B,R) ≤ d. Given ε, δ ∈ (0,1), let S be a collection of points from B sampled uniformly at random. If
|S| ≥ (c/ε²)(d + log(1/δ))
for a universal constant c, then, with probability at least 1−δ, for every range R' ∈ R:
| |S ∩ R'|/|S| − |R'|/|B| | ≤ ε
● Can approximate the sizes of all ranges simultaneously
– No need for a union bound

Page 47: Statistical Learning Theory Meets Big Data

VC-Dimension for Frequent Itemsets

● B = dataset D (set of transactions)
● Itemset X: T(X) = transactions of D containing X
● R = { T(X) : X is an itemset }

To approximate the frequencies using the sample theorem, we need an upper bound on VC(B,R)

Page 48: Statistical Learning Theory Meets Big Data

Upper Bound to VC-Dim for FI's

● B = dataset D (set of transactions)
● Itemset X: T(X) = transactions of D containing X
● R = { T(X) : X is an itemset }
● Theorem [the d-index of a dataset]:
– Let d be the maximum integer such that D contains at least d transactions of length at least d. Then

VC(B,R) ≤ d

Page 49: Statistical Learning Theory Meets Big Data

The d-index of the dataset

● d-index d: the maximum integer such that D contains at least d transactions of length at least d
● Example: this dataset has d-index d = 3
{bread, beer, milk, coffee}
{chips, coke, pasta}
{bread, coke, chips}
{milk, coffee}
{pasta, milk}
● Independent of |D|
● Can be computed with a single scan of the dataset
● d is an upper bound to the VC-dimension
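The d-index is easy to compute. A minimal Python sketch (names are illustrative):

    def d_index(transactions):
        # Largest d such that at least d transactions have length >= d:
        # sort lengths in decreasing order and find the last position i
        # where the i-th largest length is still >= i.
        lengths = sorted((len(t) for t in transactions), reverse=True)
        d = 0
        for i, l in enumerate(lengths, start=1):
            if l < i:
                break
            d = i
        return d

    D = [{"bread", "beer", "milk", "coffee"}, {"chips", "coke", "pasta"},
         {"bread", "coke", "chips"}, {"milk", "coffee"}, {"pasta", "milk"}]
    print(d_index(D))  # 3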

Page 50: Statistical Learning Theory Meets Big Data

Choosing the right sample size

● Theorem:
– Let ε, δ ∈ (0,1)
– Let D be a dataset and d be the max integer such that D contains at least d transactions of length at least d
– Let S be a collection of transactions of D sampled independently and uniformly at random, with
|S| ≥ (4c/ε²)(d + log(1/δ))
for a universal constant c
– Then, with probability at least 1−δ, for all itemsets X: |f_S(X) − f_D(X)| ≤ ε/2

Page 52: Statistical Learning Theory Meets Big Data

Algorithm

● To extract an approximate collection of Frequent Itemsets:
1. Compute d for the dataset D
2. Compute the sample size given d
3. Create a random sample S from D
4. Extract the Frequent Itemsets from S using θ' = θ − ε/2
● Theorem: With probability at least 1−δ, the returned set W of itemsets satisfies the desired property

[Figure: frequency axis from 0 to 1 — must be in W above θ, must not be in W below θ−ε, may be in W in between]
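Putting the pieces together, a minimal Python sketch of the whole pipeline (mine() stands for any exact FI miner such as APriori or FPGrowth, c is the universal constant from the theorem, and all names are illustrative):

    import math
    import random

    def d_index(D):
        # Largest d such that at least d transactions have length >= d.
        lengths = sorted((len(t) for t in D), reverse=True)
        return max((i for i, l in enumerate(lengths, 1) if l >= i), default=0)

    def sample_size(d, eps, delta, c=0.5):
        # |S| >= (4c/eps^2) * (d + ln(1/delta)); c is a universal constant.
        return math.ceil((4 * c / eps ** 2) * (d + math.log(1 / delta)))

    def approximate_fis(D, theta, eps, delta, mine):
        d = d_index(D)                                       # single scan of D
        S = random.choices(D, k=sample_size(d, eps, delta))  # uniform, with replacement
        return mine(S, theta - eps / 2)                      # mine at lowered threshold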

Page 56: Statistical Learning Theory Meets Big Data

A closer look...

● Expression of the sample size:
|S| ≥ (4c/ε²)(d + log(1/δ))
● d: the maximum integer such that D contains at least d transactions of length at least d
– does not depend on |D|
– does not depend on θ
– does not depend on the number of itemsets
● The time to mine S does not depend on |D| !!!

Page 57: Statistical Learning Theory Meets Big Data

Proof (Intuition)

● For a set of k transactions to be shattered, each transaction must appear in 2^(k−1) different T(X)'s, where X is an itemset
● A transaction appears only in the T(X)'s of the itemsets X it contains
● A transaction of length w contains 2^w − 1 (non-empty) itemsets
● Need w ≥ k for the transaction to belong to a shattered set of size k
● To shatter k transactions they must all have length ≥ k
● The max k for which this can happen is an upper bound to the VC-dimension
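The counting step, written out (my rendering): a transaction τ in a shattered set of size k must appear in the 2^(k−1) ranges realizing the subsets that contain τ, and it appears only in T(X) for X ⊆ τ, so

\[
2^{|\tau|} - 1 \;\ge\; 2^{k-1} \quad\Longrightarrow\quad |\tau| \ge k .
\]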

Page 58: Statistical Learning Theory Meets Big Data

Experiments

● Sample always fits into main memory
● Output always satisfies the required approximation guarantees
– Frequency accuracy even better than guaranteed
● Mining time significantly improved

Page 59: Statistical Learning Theory Meets Big Data

Conclusions

We showed how VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through sampling.

Riondato & Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.

Page 60: Statistical Learning Theory Meets Big Data

PARMA: A Randomized Algorithm for Approximate Association Rules Mining in MapReduce

Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal

Page 61: Statistical Learning Theory Meets Big Data

Goals

● Extract a high-quality approximation of FI(D,θ) by mining several small samples of D in parallel
● Adapt to the available computational resources
● Minimize data replication and communication
● Achieve high parallelism

Riondato et al., PARMA, CIKM'12

Page 62: Statistical Learning Theory Meets Big Data

Why parallelize?

● For a very high-quality approximation, the sample would still be large
● Sample creation time depends on the size of the dataset
● Parallelization allows us to achieve:
– higher accuracy (lower ε)
– higher confidence (lower δ)
– faster results
● Makes sense when the dataset is stored on a distributed filesystem

Riondato et al., PARMA, CIKM'12

Page 63: Statistical Learning Theory Meets Big Data

MapReduce

● A framework for easy parallelization on large clusters of commodity machines
● Allows very fast analysis of large datasets
● Algorithms are expressed through two functions:
– Map: (k,v) → (k1,v1),(k2,v2),...
– Reduce: (r,(w1,w2,...)) → (k1,v1),(k2,v2),...
● It is expensive to move data around (shuffling)
– Design goal: avoid data replication

Riondato et al., PARMA, CIKM'12

Page 64: Statistical Learning Theory Meets Big Data

The optimization framework

● Each sample should fit into local memory
● No more samples than the number of processors
● Exploit the available computational resources to maximize the quality of the approximation
● Constraints and objective function expressed as a Mixed Integer Non-Linear Program (MINLP)
– Convex feasibility region and convex objective function
– Easy and fast to solve with a global MINLP solver

Riondato et al., PARMA, CIKM'12

Page 65: Statistical Learning Theory Meets Big Data

The Algorithm: PARMA

● Very simple design: 2 MapReduce rounds
● 1st round
– Map: create samples
– Reduce: mine each sample
● 2nd round
– Map: identity function
– Reduce: aggregation/filtering of results

Riondato et al., PARMA, CIKM'12

Page 66: Statistical Learning Theory Meets Big Data

1st Round: Sampling and Mining

● Parallelization of the sequential algorithm
● Map: sample creation
– Random sampling with replacement... in parallel!
– Input: (transaction id, transaction)
– Output: (sample id, transaction)
● Reduce: mining of each sample in parallel, using the lowered frequency threshold θ − ε/2
– Can use any sequential exact mining algorithm (APriori, FPGrowth, ...)
– Input: (sample id, (transaction1, transaction2, ...))
– Output: (itemset, (frequency, confidence interval))

Riondato et al., PARMA, CIKM'12

Page 67: Statistical Learning Theory Meets Big Data

2nd Phase: Aggregation

● Map: identity function
● Reduce: filtering and aggregation
– Input: (itemset, ((freq1, conf_int1), (freq2, conf_int2), ...))
– Output: (itemset, (frequency, confidence interval))
● An itemset is sent to the output only if it was frequent in a sufficient number of samples (determined by the optimization framework)
● Confidence interval: the smallest interval containing a sufficient number of the estimates freq1, freq2, ...
● Frequency: the median of the confidence interval

Riondato et al., PARMA, CIKM'12
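A minimal single-machine sketch of the two rounds in MapReduce style (Python; the constants, the function names, and the helper mine() are illustrative assumptions — PARMA itself is a Java/Hadoop library, and its sample sizes and filtering threshold come from the optimization framework):

    import random
    from statistics import median

    N_SAMPLES = 8         # number of samples (in PARMA: from the optimization framework)
    SAMPLE_SIZE = 10_000  # per-sample size (likewise)
    REQUIRED = 5          # samples an itemset must be frequent in to survive filtering

    def map1(tid, transaction, n_transactions):
        # Round 1 map: sampling with replacement in one pass. Each transaction
        # lands in each sample a Binomial(SAMPLE_SIZE, 1/n_transactions) number
        # of times (so each sample's exact size varies slightly around SAMPLE_SIZE).
        for s in range(N_SAMPLES):
            copies = sum(random.random() < 1 / n_transactions
                         for _ in range(SAMPLE_SIZE))
            for _ in range(copies):
                yield s, transaction

    def reduce1(sample_id, transactions, theta, eps, mine):
        # Round 1 reduce: mine one sample at the lowered threshold theta - eps/2;
        # mine() is any exact sequential FI algorithm (APriori, FPGrowth, ...).
        for itemset, freq in mine(transactions, theta - eps / 2):
            yield itemset, freq

    # Round 2 map is the identity function.

    def reduce2(itemset, freqs):
        # Round 2 reduce: keep the itemset only if it was frequent in at least
        # REQUIRED samples; report the smallest interval containing REQUIRED of
        # the estimates, and the median of those estimates as the frequency.
        if len(freqs) >= REQUIRED:
            freqs = sorted(freqs)
            i = min(range(len(freqs) - REQUIRED + 1),
                    key=lambda j: freqs[j + REQUIRED - 1] - freqs[j])
            window = freqs[i:i + REQUIRED]
            yield itemset, (median(window), (window[0], window[-1]))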

Page 68: Statistical Learning Theory Meets Big Data

Analysis

● Theorem: With probability at least 1−δ, the output is an ε-approximation to FI(D,θ)
● Proof intuition: a “boosting-like” approach
– The outputs of the reducers in stage 1 are ε-approximations for the local samples, with some probability 1−φ
– If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset
● Consequences: better approximation, fewer false positives

Riondato et al., PARMA, CIKM'12

Page 69: Statistical Learning Theory Meets Big Data

Implementation

● PARMA implemented as a Java library for Hadoop
– Easy to run, possible integration with Apache Mahout
● FP-Growth used for the sample-mining phase
– PARMA is agnostic to the choice of mining algorithm
– We used an existing Java implementation
● 1400 lines of code
● Available on GitHub

Riondato et al., PARMA, CIKM'12

Page 70: Statistical Learning Theory Meets Big Data

Experiments

● Run on Amazon Elastic MapReduce
– Up to 16 machines, with 17GB of RAM
● Artificial datasets built using standard benchmark generators
– Various parameters (transaction size, items, ...)
– Sizes between 5 and 50 million transactions
● Compared the performance of PARMA with PFP and naive distributed counting
– PARMA performs significantly better (1.5x to 5x speed improvement) in all tests

Riondato et al., PARMA, CIKM'12

Page 71: Statistical Learning Theory Meets Big Data

Runtime Analysis

● Runtime does not grow with the size of the data
– Because the individual sample size does not grow!
● Much faster than PFP, and faster as the data grows
● PARMA gets faster with more machines, even with more data

Riondato et al., PARMA, CIKM'12

Page 72: Statistical Learning Theory Meets Big Data

Speedup

● Quasi-linear speedup
● Stage 1 very close to optimal
● Stage 2: no speedup
– Running time depends on the number of itemsets
– No way to control the number of itemsets
– Common in frequent itemset mining algorithms

[Figure: speedup vs. number of instances (2, 4, 8), with curves for ideal, total, Stage 1, and Stage 2]

Riondato et al., PARMA, CIKM'12

Page 73: Statistical Learning Theory Meets Big Data

Accuracy

● Output is always an ε-approximation to FI(D,θ)
● Output not much larger than the real collection
● Better frequency accuracy than guaranteed
● Smaller confidence intervals than guaranteed

Riondato et al., PARMA, CIKM'12

Page 74: Statistical Learning Theory Meets Big Data

Conclusions

We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules from a large dataset, with minimal data replication and quasi-linear speedup.

Riondato et al., PARMA, CIKM'12

Page 75: Statistical Learning Theory Meets Big Data

Publications

● Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.

● Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012.