
Page 1: Statistical Learning Theory Meets Big Data

Statistical Learning Theory Meets Big Data

Randomized algorithms for frequent itemsets

Eli Upfal

Brown University

Page 2: Statistical Learning Theory Meets Big Data

Data, data, data

● “In God we trust, all others (must) bring data”
– Prof. W. E. Deming, statistician (attributed)
● “Every two days now we create as much information as we did from the dawn of civilization up until 2003”
– Eric Schmidt, Google CEO, 2010
● “Data explosion is bigger than Moore's Law”
– Marissa Mayer, Google VP, 2009

Page 4: Statistical Learning Theory Meets Big Data

Big Data

● Big data requires big storage and long execution time
● It's great to have big data – but do we need to process all of it?
● Trade off accuracy of results for execution speed
● Can approximation algorithms for data analysis be fast and still guarantee high-quality results?

Page 5: Statistical Learning Theory Meets Big Data

Frequent Itemsets

● Market basket analysis
● You own a grocery store
● You have a copy of the receipts from sales
– For each customer, you have the list of products she bought
● Interested in which groups of products are bought together (correlated items)
– Business decisions
– Optimization

Page 6: Statistical Learning Theory Meets Big Data

Settings

● Transactional dataset D

Figure from Tan et al. - Introduction to Data Mining

Page 7: Statistical Learning Theory Meets Big Data

Transactions

● Transactional dataset D: each row is a transaction

Figure from Tan et al. - Introduction to Data Mining

Page 8: Statistical Learning Theory Meets Big Data

Items

● Transactions are built on items from a domain

Figure from Tan et al. - Introduction to Data Mining

Page 9: Statistical Learning Theory Meets Big Data

Itemsets

● Sets of items are called itemsets

Figure from Tan et al. - Introduction to Data Mining

Page 10: Statistical Learning Theory Meets Big Data

Frequency

● Frequency of an itemset X in dataset D:
– f_D(X): the fraction of transactions of D containing X
– Example: f_D({Milk}) = 4/5, f_D({Bread, Milk}) = 3/5

Figure from Tan et al. - Introduction to Data Mining
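To make the definition concrete, here is a minimal Python sketch (function and variable names are illustrative, and the toy dataset mirrors the example figure from Tan et al.):

    def frequency(D, X):
        # f_D(X): fraction of transactions of D that contain every item of X
        X = set(X)
        return sum(X <= set(t) for t in D) / len(D)

    # Toy dataset mirroring the example figure
    D = [{"Bread", "Milk"},
         {"Bread", "Diapers", "Beer", "Eggs"},
         {"Milk", "Diapers", "Beer", "Coke"},
         {"Bread", "Milk", "Diapers", "Beer"},
         {"Bread", "Milk", "Diapers", "Coke"}]
    print(frequency(D, {"Milk"}))           # 0.8 = 4/5
    print(frequency(D, {"Bread", "Milk"}))  # 0.6 = 3/5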

Page 12: Statistical Learning Theory Meets Big Data

Frequent Itemsets

● Frequent itemsets mining with respect to a frequency threshold θ:
– Find all itemsets X with frequency f_D(X) ≥ θ
● Top-K frequent itemsets:
– Find all itemsets at least as frequent as the kth most frequent
● Association rules:
– X in a transaction is correlated with Y in the transaction

Page 14: Statistical Learning Theory Meets Big Data

Frequent Itemsets mining

● Well-studied classical problem in data analysis
● There are algorithms to extract the exact collection of FI's:
– APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]
● Running time depends on the number of transactions in the dataset and on the number of frequent itemsets
– 10^8 transactions is considered a “normal” size
– 10^4 items → 2^(10^4) itemsets in total
– A transaction with d items contains 2^d itemsets

Page 21: Statistical Learning Theory Meets Big Data

How to speed up mining?

● Decrease the dependency on the number of transactions:
1. Create a random sample of the dataset
2. Extract the frequent itemsets from the sample with a lowered frequency threshold θ' < θ
● Sampling forces us to accept approximate results
● Need to quantify what is acceptable
● Given ε, δ, extract a collection W of FI's such that, with probability at least 1−δ:

[Figure: frequency axis from 0 to 1 — itemsets with frequency ≥ θ must be in W; itemsets with frequency < θ−ε must not be in W; itemsets in between may be in W]

Page 22: Statistical Learning Theory Meets Big Data

Observation

● Data mining is exploratory in nature
● Exact results are usually not needed
– We want a good, realistic idea of the process
● Fast analysis is more important
– ... if it still guarantees good-enough results!

Riondato et al., PARMA, CIKM'12

Page 23: Statistical Learning Theory Meets Big Data

How to speed up mining?

● As before: create a random sample of the dataset, and extract the frequent itemsets from the sample with a lowered frequency threshold θ'
● Given ε, δ, extract a collection W of itemsets from the sample, with the same guarantee pictured above
● Problem: choose the right sample size and the right θ'

Page 26: Statistical Learning Theory Meets Big Data

Choosing θ'

● If we have, for all itemsets X simultaneously,
|f_S(X) − f_D(X)| ≤ ε/2
then θ' = θ − ε/2 does the trick
● Need a sample size |S| such that, with probability at least 1−δ, for all itemsets X we have |f_S(X) − f_D(X)| ≤ ε/2

[Figure: the same frequency axis from 0 to 1 — must be in W above θ, must not be in W below θ−ε, may be in W in between]
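Spelled out (my own rendering, consistent with the figure), mining the sample at threshold θ' = θ − ε/2 gives exactly the pictured guarantee:

\[
|f_S(X)-f_D(X)| \le \tfrac{\varepsilon}{2} \;\text{ for all } X
\;\Longrightarrow\;
\begin{cases}
f_D(X) \ge \theta &\Rightarrow\; f_S(X) \ge \theta - \tfrac{\varepsilon}{2} \;\Rightarrow\; X \in W,\\
f_D(X) < \theta - \varepsilon &\Rightarrow\; f_S(X) < \theta - \tfrac{\varepsilon}{2} \;\Rightarrow\; X \notin W,
\end{cases}
\]

where W is the set of itemsets whose sample frequency is at least θ − ε/2.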

Page 28: Statistical Learning Theory Meets Big Data

Choosing the sample size

● Naïve approach
● Given an itemset X, the frequency of X in a sample S is distributed like a (rescaled) binomial: |S|·f_S(X) ~ Binomial(|S|, f_D(X))
● Use the Chernoff bound to bound Pr(|f_S(X) − f_D(X)| > ε/2)
● Apply the union bound over all itemsets to find |S| such that Pr(∃X : |f_S(X) − f_D(X)| > ε/2) ≤ δ
● Problem: there is an exponential number of itemsets
– The sample size would be too large (linear in the number of items)
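For concreteness, here is the naive calculation written out (my rendering; n denotes the number of items, so there are at most 2^n itemsets): a Chernoff-Hoeffding bound for a single itemset, followed by a union bound,

\[
\Pr\!\left(|f_S(X)-f_D(X)| > \tfrac{\varepsilon}{2}\right) \le 2e^{-|S|\varepsilon^2/2},
\qquad
2^{n}\cdot 2e^{-|S|\varepsilon^2/2} \le \delta
\;\Longleftrightarrow\;
|S| \ge \frac{2}{\varepsilon^2}\left(n\ln 2 + \ln\frac{2}{\delta}\right),
\]

so the naive sample size grows linearly with the number of items n.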

Page 29: Statistical Learning Theory Meets Big Data

Vapnik-Chervonenkis Dimension

● Combinatorial property of a collection of subsets from a domain

● Measures the “richness” or “expressivity” of the collection of subsets

● If we know the VC-dim of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample

Page 30: Statistical Learning Theory Meets Big Data

Range spaces

● VC-Dimension is defined on range spaces
● (B,R): range space
– B: a domain
– R: a collection of subsets of B (the ranges)
● No restrictions:
– B can be infinite
– R can be infinite
– R can contain infinitely-large subsets of B

Page 31: Statistical Learning Theory Meets Big Data

Vapnik-Chervonenkis Dimension

● Range space (B,R)
● For any C ⊆ B, define the projection P_R(C) = { R' ∩ C : R' ∈ R }
● C is shattered if P_R(C) = 2^C, i.e., every subset of C can be obtained by intersecting C with a range
● The VC-Dimension of (B,R) is the size of the largest shattered subset of B

Page 37: Statistical Learning Theory Meets Big Data

Example of VC-Dimension

● B = R² (the plane)
● R = all axis-aligned rectangles in R²
● Shattering 4 points: easy
– Take any 4 points such that none of them lies in the bounding box of the other three (e.g., a “diamond”: one leftmost, one rightmost, one topmost, one bottommost point)
– Need 16 rectangles, one per subset

Page 40: Statistical Learning Theory Meets Big Data

Example of VC-Dimension

● B = R²
● R = all axis-aligned rectangles in R²
● Shattering 5 points: impossible
– Take any 5 points
– One of them is contained in every rectangle containing the other four (it lies in the bounding box of the leftmost, rightmost, topmost, and bottommost points)
– Impossible to find a rectangle containing only the other four
● VC(B,R) = 4
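The bounding-box argument can be checked mechanically. A minimal Python sketch (names are illustrative) that tests whether axis-aligned rectangles shatter a given point set, using the fact that a subset can be isolated by a rectangle iff its bounding box contains no other point of the set:

    from itertools import combinations

    def shatterable(points):
        # Axis-aligned rectangles shatter `points` iff, for every non-empty
        # subset, the subset's bounding box contains no other point.
        for r in range(1, len(points) + 1):
            for sub in combinations(points, r):
                xs = [x for x, _ in sub]
                ys = [y for _, y in sub]
                rest = [p for p in points if p not in sub]
                if any(min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)
                       for x, y in rest):
                    return False  # this subset cannot be isolated
        return True

    diamond = [(0, 1), (1, 0), (2, 1), (1, 2)]
    print(shatterable(diamond))             # True: all 16 subsets realizable
    print(shatterable(diamond + [(1, 1)]))  # False: the center blocks the other four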

Page 41: Statistical Learning Theory Meets Big Data

Approximating sizes of ranges

● Sample Theorem:
– Let (B,R) have VC(B,R) ≤ d. Given ε, δ ∈ (0,1), let S be a collection of points from B sampled uniformly at random. If
|S| ≥ (c/ε²)(d + log(1/δ))
for a universal constant c, then, with probability at least 1−δ, for every range R' ∈ R:
| |S ∩ R'|/|S| − |R'|/|B| | ≤ ε
● Can approximate the sizes of all ranges simultaneously
– No need for a union bound

Page 47: Statistical Learning Theory Meets Big Data

VC-Dimension for Frequent Itemsets

● B = dataset D (set of transactions)
● Itemset X: T(X) = transactions of D containing X
● R = { T(X) : X is an itemset }

To approximate the frequencies using the sample theorem, we need an upper bound on VC(B,R)

Page 48: Statistical Learning Theory Meets Big Data

Upper Bound to VC-Dim for FI's

● B = dataset D (set of transactions)
● Itemset X: T(X) = transactions of D containing X
● R = { T(X) : X is an itemset }
● Theorem [the d-index of a dataset]:
– Let d be the maximum integer such that D contains at least d transactions of length at least d. Then

VC(B,R) ≤ d

Page 49: Statistical Learning Theory Meets Big Data

The d-index of the dataset

● d-index d: the maximum integer such that D contains at least d transactions of length at least d
● Example: this dataset has d-index d = 3
{bread, beer, milk, coffee}
{chips, coke, pasta}
{bread, coke, chips}
{milk, coffee}
{pasta, milk}
● Independent of |D|
● Can be computed with a single scan of the dataset
● d is an upper bound to the VC-dimension
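The d-index is easy to compute. A minimal Python sketch (names are illustrative):

    def d_index(transactions):
        # Largest d such that at least d transactions have length >= d:
        # sort lengths in decreasing order and find the last position i
        # where the i-th largest length is still >= i.
        lengths = sorted((len(t) for t in transactions), reverse=True)
        d = 0
        for i, l in enumerate(lengths, start=1):
            if l < i:
                break
            d = i
        return d

    D = [{"bread", "beer", "milk", "coffee"}, {"chips", "coke", "pasta"},
         {"bread", "coke", "chips"}, {"milk", "coffee"}, {"pasta", "milk"}]
    print(d_index(D))  # 3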

Page 50: Statistical Learning Theory Meets Big Data

Choosing the right sample size

● Theorem:
– Let ε, δ ∈ (0,1)
– Let D be a dataset and d be the max integer such that D contains at least d transactions of length at least d
– Let S be a collection of transactions of D sampled independently and uniformly at random, with
|S| ≥ (4c/ε²)(d + log(1/δ))
for a universal constant c
– Then, with probability at least 1−δ, for all itemsets X: |f_S(X) − f_D(X)| ≤ ε/2

Page 52: Statistical Learning Theory Meets Big Data

Algorithm

● To extract an approximate collection of Frequent Itemsets:
1. Compute d for the dataset D
2. Compute the sample size given d
3. Create a random sample S from D
4. Extract the Frequent Itemsets from S using θ' = θ − ε/2
● Theorem: With probability at least 1−δ, the returned set W of itemsets satisfies the desired property

[Figure: frequency axis from 0 to 1 — must be in W above θ, must not be in W below θ−ε, may be in W in between]
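Putting the pieces together, a minimal Python sketch of the whole pipeline (mine() stands for any exact FI miner such as APriori or FPGrowth, c is the universal constant from the theorem, and all names are illustrative):

    import math
    import random

    def d_index(D):
        # Largest d such that at least d transactions have length >= d.
        lengths = sorted((len(t) for t in D), reverse=True)
        return max((i for i, l in enumerate(lengths, 1) if l >= i), default=0)

    def sample_size(d, eps, delta, c=0.5):
        # |S| >= (4c/eps^2) * (d + ln(1/delta)); c is a universal constant.
        return math.ceil((4 * c / eps ** 2) * (d + math.log(1 / delta)))

    def approximate_fis(D, theta, eps, delta, mine):
        d = d_index(D)                                       # single scan of D
        S = random.choices(D, k=sample_size(d, eps, delta))  # uniform, with replacement
        return mine(S, theta - eps / 2)                      # mine at lowered threshold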

Page 56: Statistical Learning Theory Meets Big Data

A closer look...

● Expression of the sample size:
|S| ≥ (4c/ε²)(d + log(1/δ))
● d: the maximum integer such that D contains at least d transactions of length at least d
– does not depend on |D|
– does not depend on θ
– does not depend on the number of itemsets
● The time to mine S does not depend on |D| !!!

Page 57: Statistical Learning Theory Meets Big Data

Proof (Intuition)

● For a set of k transactions to be shattered, each transaction must appear in 2^(k−1) different T(X)'s, where X is an itemset
● A transaction appears only in the T(X)'s of the itemsets X it contains
● A transaction of length w contains 2^w − 1 (non-empty) itemsets
● Need w ≥ k for the transaction to belong to a shattered set of size k
● To shatter k transactions they must all have length ≥ k
● The max k for which this can happen is an upper bound to the VC-dimension
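The counting step, written out (my rendering): a transaction τ in a shattered set of size k must appear in the 2^(k−1) ranges realizing the subsets that contain τ, and it appears only in T(X) for X ⊆ τ, so

\[
2^{|\tau|} - 1 \;\ge\; 2^{k-1} \quad\Longrightarrow\quad |\tau| \ge k .
\]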

Page 58: Statistical Learning Theory Meets Big Data

Experiments

● Sample always fits into main memory
● Output always satisfies the required approximation guarantees
– Frequency accuracy even better than guaranteed
● Mining time significantly improved

Page 59: Statistical Learning Theory Meets Big Data

Conclusions

We showed how VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through sampling.

Riondato & Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.

Page 60: Statistical Learning Theory Meets Big Data

PARMA: A Randomized Algorithm for Approximate Association Rules Mining in MapReduce

Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal

Page 61: Statistical Learning Theory Meets Big Data

Goals

● Extract a high-quality approximation of FI(D,θ) by mining several small samples of D in parallel
● Adapt to the available computational resources
● Minimize data replication and communication
● Achieve high parallelism

Riondato et al., PARMA, CIKM'12

Page 62: Statistical Learning Theory Meets Big Data

Why parallelize?

● For a very high-quality approximation, the sample would still be large
● Sample creation time depends on the size of the dataset
● Parallelization allows us to achieve:
– higher accuracy (lower ε)
– higher confidence (lower δ)
– faster results
● Makes sense when the dataset is stored on a distributed filesystem

Riondato et al., PARMA, CIKM'12

Page 63: Statistical Learning Theory Meets Big Data

MapReduce

● A framework for easy parallelization on large clusters of commodity machines
● Allows very fast analysis of large datasets
● Algorithms are expressed through two functions:
– Map: (k,v) → (k1,v1),(k2,v2),...
– Reduce: (r,(w1,w2,...)) → (k1,v1),(k2,v2),...
● It is expensive to move data around (shuffling)
– Design goal: avoid data replication

Riondato et al., PARMA, CIKM'12

Page 64: Statistical Learning Theory Meets Big Data

The optimization framework

● Each sample should fit into local memory
● No more samples than the number of processors
● Exploit the available computational resources to maximize the quality of the approximation
● Constraints and objective function expressed as a Mixed Integer Non-Linear Program (MINLP)
– Convex feasibility region and convex objective function
– Easy and fast to solve with a global MINLP solver

Riondato et al., PARMA, CIKM'12

Page 65: Statistical Learning Theory Meets Big Data

The Algorithm: PARMA

● Very simple design: 2 MapReduce rounds
● 1st round
– Map: create samples
– Reduce: mine each sample
● 2nd round
– Map: identity function
– Reduce: aggregation/filtering of results

Riondato et al., PARMA, CIKM'12

Page 66: Statistical Learning Theory Meets Big Data

1st Round: Sampling and Mining

● Parallelization of the sequential algorithm
● Map: sample creation
– Random sampling with replacement... in parallel!
– Input: (transaction id, transaction)
– Output: (sample id, transaction)
● Reduce: mining of each sample in parallel, using the lowered frequency threshold θ − ε/2
– Can use any sequential exact mining algorithm (APriori, FPGrowth, ...)
– Input: (sample id, (transaction1, transaction2, ...))
– Output: (itemset, (frequency, confidence interval))

Riondato et al., PARMA, CIKM'12

Page 67: Statistical Learning Theory Meets Big Data

2nd Phase: Aggregation

● Map: identity function
● Reduce: filtering and aggregation
– Input: (itemset, ((freq1, conf_int1), (freq2, conf_int2), ...))
– Output: (itemset, (frequency, confidence interval))
● An itemset is sent to the output only if it was frequent in a sufficient number of samples (determined by the optimization framework)
● Confidence interval: the smallest interval containing a sufficient number of the estimates freq1, freq2, ...
● Frequency: the median of the confidence interval

Riondato et al., PARMA, CIKM'12
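A minimal single-machine sketch of the two rounds in MapReduce style (Python; the constants, the function names, and the helper mine() are illustrative assumptions — PARMA itself is a Java/Hadoop library, and its sample sizes and filtering threshold come from the optimization framework):

    import random
    from statistics import median

    N_SAMPLES = 8         # number of samples (in PARMA: from the optimization framework)
    SAMPLE_SIZE = 10_000  # per-sample size (likewise)
    REQUIRED = 5          # samples an itemset must be frequent in to survive filtering

    def map1(tid, transaction, n_transactions):
        # Round 1 map: sampling with replacement in one pass. Each transaction
        # lands in each sample a Binomial(SAMPLE_SIZE, 1/n_transactions) number
        # of times (so each sample's exact size varies slightly around SAMPLE_SIZE).
        for s in range(N_SAMPLES):
            copies = sum(random.random() < 1 / n_transactions
                         for _ in range(SAMPLE_SIZE))
            for _ in range(copies):
                yield s, transaction

    def reduce1(sample_id, transactions, theta, eps, mine):
        # Round 1 reduce: mine one sample at the lowered threshold theta - eps/2;
        # mine() is any exact sequential FI algorithm (APriori, FPGrowth, ...).
        for itemset, freq in mine(transactions, theta - eps / 2):
            yield itemset, freq

    # Round 2 map is the identity function.

    def reduce2(itemset, freqs):
        # Round 2 reduce: keep the itemset only if it was frequent in at least
        # REQUIRED samples; report the smallest interval containing REQUIRED of
        # the estimates, and the median of those estimates as the frequency.
        if len(freqs) >= REQUIRED:
            freqs = sorted(freqs)
            i = min(range(len(freqs) - REQUIRED + 1),
                    key=lambda j: freqs[j + REQUIRED - 1] - freqs[j])
            window = freqs[i:i + REQUIRED]
            yield itemset, (median(window), (window[0], window[-1]))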

Page 68: Statistical Learning Theory Meets Big Data

Analysis

● Theorem: With probability at least 1−δ, the output is an ε-approximation to FI(D,θ)
● Proof intuition: a “boosting-like” approach
– The outputs of the reducers in stage 1 are ε-approximations for the local samples, with some probability 1−φ
– If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset
● Consequences: better approximation, fewer false positives

Riondato et al., PARMA, CIKM'12

Page 69: Statistical Learning Theory Meets Big Data

Implementation

● PARMA implemented as a Java library for Hadoop
– Easy to run, possible integration with Apache Mahout
● FP-Growth used for the sample-mining phase
– PARMA is agnostic to the choice of mining algorithm
– We used an existing Java implementation
● 1400 lines of code
● Available on GitHub

Riondato et al., PARMA, CIKM'12

Page 70: Statistical Learning Theory Meets Big Data

Experiments

● Run on Amazon Elastic MapReduce
– Up to 16 machines, with 17GB of RAM
● Artificial datasets built using standard benchmark generators
– Various parameters (transaction size, items, ...)
– Sizes between 5 and 50 million transactions
● Compared the performance of PARMA with PFP and naive distributed counting
– PARMA performs significantly better (1.5x to 5x speed improvement) in all tests

Riondato et al., PARMA, CIKM'12

Page 71: Statistical Learning Theory Meets Big Data

Runtime Analysis

● Runtime does not grow with the size of the data
– Because the individual sample size does not grow!
● Much faster than PFP, and faster as the data grows
● PARMA gets faster with more machines, even with more data

Riondato et al., PARMA, CIKM'12

Page 72: Statistical Learning Theory Meets Big Data

Speedup

● Quasi-linear speedup
● Stage 1 very close to optimal
● Stage 2: no speedup
– Running time depends on the number of itemsets
– No way to control the number of itemsets
– Common in frequent itemset mining algorithms

[Figure: speedup vs. number of instances (2, 4, 8), with curves for ideal, total, Stage 1, and Stage 2]

Riondato et al., PARMA, CIKM'12

Page 73: Statistical Learning Theory Meets Big Data

Accuracy

● Output is always an ε-approximation to FI(D,θ)
● Output not much larger than the real collection
● Better frequency accuracy than guaranteed
● Smaller confidence intervals than guaranteed

Riondato et al., PARMA, CIKM'12

Page 74: Statistical Learning Theory Meets Big Data

Conclusions

We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules from a large dataset, with minimal data replication and quasi-linear speedup.

Riondato et al., PARMA, CIKM'12

Page 75: Statistical Learning Theory Meets Big Data

Publications

● Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.

● Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012.