PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce

Date: 2013/10/16
Source: CIKM’12
Authors: Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, and Eli Upfal
Advisor: Dr. Jia-ling Koh
Speaker: Wei Chang

TRANSCRIPT

Page 1: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
Date: 2013/10/16
Source: CIKM’12
Authors: Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, and Eli Upfal
Advisor: Dr. Jia-ling Koh
Speaker: Wei Chang

Page 2: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Outline
• Introduction
• Approach
• Experiment
• Conclusion

Page 3: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Association Rules

• Market-Basket transactions

Page 4: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Itemset

• I = {Bread, Milk, Diaper, Beer, Eggs, Coke}
• Itemsets
  • 1-itemsets: {Beer}, {Milk}, {Bread}, …
  • 2-itemsets: {Bread, Milk}, {Bread, Beer}, …
  • 3-itemsets: {Milk, Eggs, Coke}, {Bread, Milk, Diaper}, …
• t1 contains {Bread, Milk}, but doesn’t contain {Bread, Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Page 5: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Frequent Itemset

• Support count: σ(X)
  • Frequency of occurrence of an itemset X
  • σ(X) = |{ti | X ⊆ ti, ti ∈ T}|
  • E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  • Fraction of transactions that contain an itemset X
  • s(X) = σ(X) / |T|
  • E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  • An itemset X whose support satisfies s(X) ≥ minsup

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
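
The support count and support definitions on this slide can be checked against the five-transaction table with a short Python sketch. The function names (`support_count`, `support`, `is_frequent`) are illustrative choices, not names used in the slides or the paper.

```python
# Transactions from the slide's table, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}


def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)


def support(itemset, transactions):
    """s(X) = sigma(X) / |T|."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent when its support is at least minsup."""
    return support(itemset, transactions) >= minsup


print(support_count({"Milk", "Bread", "Diaper"}, TRANSACTIONS))  # 2
print(support({"Milk", "Bread", "Diaper"}, TRANSACTIONS))        # 0.4
```

Only transactions 4 and 5 contain all of Milk, Bread and Diaper, which gives the count of 2 and the support of 2/5 stated on the slide.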

Page 6: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Association Rule

Association Rule
– X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
  s(X → Y) = σ(X ∪ Y) / |T|
– Confidence (c): how often items in Y appear in the transactions that contain X
  c(X → Y) = σ(X ∪ Y) / σ(X)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
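
As a quick check of the two rule metrics, the following Python sketch recomputes the support and confidence of {Milk, Diaper} → {Beer} from the table. The helper names (`sigma`, `rule_support`, `rule_confidence`) are hypothetical and chosen only for this illustration.

```python
# The five market-basket transactions from the slide.
T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def sigma(items, transactions):
    """Support count: transactions containing all of the given items."""
    return sum(1 for t in transactions if items <= t)


def rule_support(x, y, transactions):
    """s(X -> Y) = sigma(X u Y) / |T|."""
    return sigma(x | y, transactions) / len(transactions)


def rule_confidence(x, y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    denom = sigma(x, transactions)
    if denom == 0:
        raise ValueError("antecedent X never occurs, confidence is undefined")
    return sigma(x | y, transactions) / denom


x, y = {"Milk", "Diaper"}, {"Beer"}
print(rule_support(x, y, T))               # 0.4
print(round(rule_confidence(x, y, T), 2))  # 0.67
```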

Page 7: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  • support ≥ minsup
  • confidence ≥ minconf
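
To make the task statement concrete, here is a brute-force Python sketch that enumerates every candidate rule over the example transactions and keeps those meeting both thresholds. It is only an illustration of the problem definition: it is exponential in the number of items and is not Apriori, FP-Tree, or PARMA. The function name `mine_rules` and the threshold values in the usage lines are assumptions.

```python
from itertools import combinations


def mine_rules(transactions, minsup, minconf):
    """Return (X, Y, support, confidence) for every rule X -> Y that
    satisfies support >= minsup and confidence >= minconf."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def sigma(s):
        return sum(1 for t in transactions if s <= t)

    rules = []
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):
            z = frozenset(combo)
            count = sigma(z)
            if count == 0 or count / n < minsup:
                continue
            # Split the itemset into every non-empty antecedent/consequent pair.
            for r in range(1, k):
                for lhs in combinations(combo, r):
                    x = frozenset(lhs)
                    conf = count / sigma(x)
                    if conf >= minconf:
                        rules.append((x, z - x, count / n, conf))
    return rules


T = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

for x, y, s, c in mine_rules(T, minsup=0.4, minconf=0.6):
    print(sorted(x), "->", sorted(y), f"s={s:.2f} c={c:.2f}")
```

The cost of this exhaustive enumeration is what motivates both the classical algorithms and the sampling approach discussed in the rest of the talk.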

Page 8: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Goal

• Because:
  • Number of transactions
  • Cost of existing algorithms, e.g. Apriori, FP-Tree
• What can we do with big data?
  • Sampling
  • Parallelization
• Goal:
  • A MapReduce algorithm for discovering approximate collections of frequent itemsets or association rules

Page 9: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Outline
• Introduction
• Approach
• Experiment
• Conclusion

Page 10: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Sampling

Original Data → Sampling → Find FI in Sample

Question: Is the sample always good?
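
The question can be explored with a small Python sketch that draws random samples and compares the frequency of an itemset in each sample with its frequency in the full data. The synthetic dataset (the five example transactions repeated 200 times), the sample size, and sampling with replacement are all assumptions made for this illustration.

```python
import random


def frequency(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def sample_frequency(itemset, transactions, sample_size, rng):
    """Frequency of the itemset in a random sample drawn with replacement."""
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return frequency(itemset, sample)


base = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
data = base * 200  # synthetic "original data" of 1000 transactions
x = {"Milk", "Diaper", "Beer"}

print("exact frequency:", frequency(x, data))
for seed in range(5):
    est = sample_frequency(x, data, sample_size=50, rng=random.Random(seed))
    print(f"sample {seed}: estimated frequency = {est:.2f}")
```

Different seeds give different estimates, so a single small sample can land on either side of a support threshold. That variability is why the sample size and the error guarantees need to be quantified.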

Page 11: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Definition

[Figure: symbols θ, ε1, f_D(A), ε2, f_A]

Page 12: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

How many samples do we need?

Page 13: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Introduction of MapReduce

Page 14: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Concept

[Figure: Total Data → Sampling locally (×6) → Mining locally (×6) → Global Frequent Itemset]
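
The flow in this diagram (sample locally, mine locally, combine into a global result) can be sketched in plain Python without a MapReduce framework. Everything below is an assumption made for illustration: the brute-force local miner, sampling with replacement, and in particular the aggregation rule (keep an itemset if at least `r` of the `n_samples` local results report it, and use the median of the reported frequencies). The transcript does not preserve the exact rule PARMA uses.

```python
import random
from itertools import combinations
from statistics import median


def mine_local(sample, minsup, max_size=3):
    """Brute-force frequent itemsets of one sample: {itemset: frequency}."""
    items = sorted(set().union(*sample))
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            f = sum(1 for t in sample if s <= t) / len(sample)
            if f >= minsup:
                result[s] = f
    return result


def approximate_frequent_itemsets(data, n_samples, sample_size, minsup, r, seed=0):
    """Sample locally, mine locally, then aggregate (assumed rule: an itemset
    must be reported by at least r samples; its estimate is the median)."""
    rng = random.Random(seed)
    estimates = {}
    for _ in range(n_samples):
        sample = [rng.choice(data) for _ in range(sample_size)]  # sampling locally
        for itemset, f in mine_local(sample, minsup).items():    # mining locally
            estimates.setdefault(itemset, []).append(f)
    return {s: median(fs) for s, fs in estimates.items() if len(fs) >= r}


base = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = approximate_frequent_itemsets(
    base * 100, n_samples=6, sample_size=200, minsup=0.3, r=4, seed=1
)
for itemset, f in sorted(result.items(), key=lambda kv: -kv[1])[:5]:
    print(sorted(itemset), round(f, 3))
```

In a real MapReduce job the loop over samples would run in parallel across machines; the sequential loop here only mirrors the data flow.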

Page 15: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

PARMA

Page 16: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Trade-offs

[Figure: number of samples vs. probability of getting a wrong approximation]

Page 17: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

In Reduce 2

• For each itemset, we have
• Then we use

Page 18: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Result

• The itemset A is declared globally frequent and will be present in the output if and only if
• Let be the shortest interval such that there are at least N-R+1 elements from that belong to this interval.
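
The "shortest interval containing at least N-R+1 elements" step can be sketched as a sliding window over sorted values. The transcript has lost the symbols for the interval and for the set the elements come from, so the sketch assumes the elements are per-sample frequency estimates of one itemset, N is the number of samples, and R is the required number of samples. The function name `shortest_interval` is hypothetical.

```python
def shortest_interval(values, k):
    """Return (lo, hi) of the shortest closed interval that contains
    at least k of the given values."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    xs = sorted(values)
    # Any optimal interval starts and ends at data points, so it is
    # enough to slide a window of k consecutive sorted values.
    best = (xs[0], xs[k - 1])
    for i in range(1, len(xs) - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best


# Assumed example: N = 6 per-sample estimates and R = 4, so k = N - R + 1 = 3.
estimates = [0.38, 0.44, 0.40, 0.55, 0.41, 0.47]
N, R = len(estimates), 4
print(shortest_interval(estimates, N - R + 1))  # (0.38, 0.41)
```

Sorting dominates the cost, so the step takes O(N log N) time per itemset.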

Page 19: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Association Rules

Page 20: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Outline
• Introduction
• Approach
• Experiment
• Conclusion

Page 21: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Implementation

• Amazon Web Services: m1.xlarge
• Hadoop with 8 nodes
• Parameters:

Page 22: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Comparison with Other Algorithms

Page 23: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Runtime in Each Step

Page 24: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Acceptable False Positives

Page 25: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Error in frequency estimations

Page 26: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Outline
• Introduction
• Approach
• Experiment
• Conclusion

Page 27: PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in  MapReduce

Conclusion

• A parallel algorithm for mining quasi-optimal collections of frequent itemsets and association rules in MapReduce.
• 30-55% runtime improvement over PFP.
• The experiments verify the accuracy of the theoretical bounds and show that in practice the results are orders of magnitude more accurate than is analytically guaranteed.